1. Introduction to Machine Learning

Machine Learning (ML) is a branch of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. The field emerged from pattern recognition and computational learning theory, with roots tracing back to Alan Turing's seminal question: "Can machines think?"

Formally, a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Mitchell, 1997).

Prerequisites for This Tutorial
  • Linear algebra: vectors, matrices, eigenvalues
  • Calculus: derivatives, gradients, optimization
  • Probability and statistics: distributions, Bayes' theorem
  • Basic programming concepts

The Machine Learning Pipeline

A typical ML workflow consists of:

  1. Data Collection: Gathering relevant data from various sources
  2. Data Preprocessing: Cleaning, normalization, feature engineering
  3. Model Selection: Choosing appropriate algorithm(s)
  4. Training: Fitting the model to training data
  5. Validation: Tuning hyperparameters using validation set
  6. Testing: Evaluating final performance on held-out test data
  7. Deployment: Putting the model into production
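
To make the steps above concrete, here is a minimal sketch of steps 2-6 using scikit-learn. The synthetic dataset, the choice of logistic regression, and the hyperparameter grid are illustrative assumptions, not part of the pipeline definition.

  # Minimal sketch of the ML pipeline (steps 2-6) with scikit-learn.
  # The synthetic data and the choice of logistic regression are illustrative only.
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split, GridSearchCV
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.linear_model import LogisticRegression

  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

  # Hold out a test set that is only touched once, at the very end.
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

  # Preprocessing and the model combined in one pipeline.
  pipe = Pipeline([("scale", StandardScaler()),
                   ("clf", LogisticRegression(max_iter=1000))])

  # Validation: tune the regularization strength C via cross-validation.
  search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
  search.fit(X_train, y_train)

  # Testing: evaluate the tuned model once on the held-out test data.
  print("test accuracy:", search.score(X_test, y_test))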

2. Machine Learning Taxonomy

Supervised Learning

The algorithm learns from labeled training data, mapping inputs to known outputs. Applications include:

  • Classification (discrete outputs): spam detection, image recognition, medical diagnosis
  • Regression (continuous outputs): house price prediction, demand forecasting

Unsupervised Learning

The algorithm finds patterns in unlabeled data without predefined outputs:

  • Clustering: grouping similar observations (e.g., customer segmentation)
  • Dimensionality reduction: compressing features while preserving structure
  • Anomaly detection: identifying observations that deviate from the norm

Reinforcement Learning

The algorithm learns through interaction with an environment, receiving rewards or penalties for actions. Used in robotics, game playing, and autonomous systems.

3. The Bias-Variance Tradeoff

Understanding the bias-variance tradeoff is fundamental to building effective ML models. For a given point x, the expected prediction error can be decomposed as:

Bias-Variance Decomposition
E[(y - f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²
Where:
Bias[f̂(x)] = E[f̂(x)] - f(x) (systematic error)
Var[f̂(x)] = E[(f̂(x) - E[f̂(x)])²] (variance in predictions)
σ² = irreducible error (noise in the data)

The goal is to find the optimal model complexity that minimizes total error.
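
The decomposition can be checked numerically. The sketch below repeatedly fits polynomials to fresh noisy samples of a known function and estimates bias² and variance of the prediction at a single point; the true function, noise level, and polynomial degrees are illustrative assumptions.

  # Monte Carlo sketch of the bias-variance decomposition at a single point x0.
  import numpy as np

  rng = np.random.default_rng(0)
  f = lambda x: np.sin(2 * np.pi * x)   # true function f(x)
  sigma = 0.3                           # noise standard deviation
  x0 = 0.25                             # point at which the error is decomposed

  def fit_predict(degree):
      """Fit a polynomial to one noisy training set and predict at x0."""
      x = rng.uniform(0, 1, 30)
      y = f(x) + rng.normal(0, sigma, 30)
      coeffs = np.polyfit(x, y, degree)
      return np.polyval(coeffs, x0)

  for degree in (1, 3, 9):
      preds = np.array([fit_predict(degree) for _ in range(2000)])
      bias2 = (preds.mean() - f(x0)) ** 2
      var = preds.var()
      print(f"degree {degree}: bias^2={bias2:.4f}  variance={var:.4f}  "
            f"total={bias2 + var + sigma**2:.4f}")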

4. Linear Regression

Linear regression models the relationship between a dependent variable y and one or more independent variables X by fitting a linear equation to observed data.

Simple Linear Regression

Simple Linear Regression Model
y = β₀ + β₁x + ε
Where β₀ is the intercept, β₁ is the slope, and ε ~ N(0, σ²) is the error term.

Multiple Linear Regression

Matrix Form
y = Xβ + ε

β̂ = (XᵀX)⁻¹Xᵀy
The Ordinary Least Squares (OLS) estimator minimizes the sum of squared residuals, min_β ||y - Xβ||²; the expression above is its closed-form solution.
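
A minimal numpy sketch of the closed-form estimate follows; the synthetic data and coefficient values are illustrative assumptions, and a linear solve is used rather than forming the inverse explicitly.

  # Sketch of the OLS closed-form estimate on synthetic data (numpy only).
  import numpy as np

  rng = np.random.default_rng(0)
  n, p = 200, 3
  X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # add intercept column
  beta_true = np.array([1.0, 2.0, -3.0, 0.5])
  y = X @ beta_true + rng.normal(scale=0.5, size=n)

  # beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve for numerical stability
  beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
  print(beta_hat)   # close to beta_true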

Assumptions of Linear Regression

  1. Linearity: The relationship between X and y is linear
  2. Independence: Observations are independent
  3. Homoscedasticity: Constant variance of residuals
  4. Normality: Residuals are normally distributed
  5. No multicollinearity: Predictors are not perfectly correlated

Regularization

To prevent overfitting, regularization adds a penalty term to the loss function:

Ridge Regression (L2)
L(β) = ||y - Xβ||² + λ||β||₂²
Lasso Regression (L1)
L(β) = ||y - Xβ||² + λ||β||₁
Lasso can perform feature selection by shrinking some coefficients to exactly zero.
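
The sketch below contrasts the two penalties: ridge via its closed form (XᵀX + λI)⁻¹Xᵀy and lasso via scikit-learn. The data, λ, and the lasso alpha are illustrative assumptions.

  # Sketch comparing ridge (closed form) and lasso (scikit-learn) coefficients.
  import numpy as np
  from sklearn.linear_model import Lasso

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 10))
  beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only 3 informative features
  y = X @ beta_true + rng.normal(scale=0.5, size=100)

  lam = 1.0
  # Ridge has a closed form: beta = (X^T X + lambda I)^{-1} X^T y
  beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

  # Lasso has no closed form; its optimizer shrinks some coefficients to exactly zero.
  beta_lasso = Lasso(alpha=0.1).fit(X, y).coef_

  print("ridge nonzero coefficients:", np.sum(np.abs(beta_ridge) > 1e-6))  # typically all 10
  print("lasso nonzero coefficients:", np.sum(np.abs(beta_lasso) > 1e-6))  # typically ~3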

5. Logistic Regression

Despite its name, logistic regression is a classification algorithm that models the probability of a binary outcome.

Logistic Function (Sigmoid)
P(y=1|x) = σ(z) = 1 / (1 + e^(-z))

where z = β₀ + β₁x₁ + ... + βₚxₚ

Maximum Likelihood Estimation

Log-Likelihood Function
ℓ(β) = ∑ᵢ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)], where pᵢ = P(yᵢ = 1 | xᵢ)
Parameters are estimated by maximizing the log-likelihood, typically with gradient-based methods or the Newton-Raphson method.
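
As a minimal sketch of this estimation, the code below fits logistic regression by gradient ascent on the log-likelihood; the learning rate, iteration count, and simulated data are illustrative assumptions.

  # Sketch of logistic regression fitted by gradient ascent on the log-likelihood.
  import numpy as np

  rng = np.random.default_rng(0)
  n = 500
  X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 features
  beta_true = np.array([-0.5, 2.0, -1.0])
  p_true = 1 / (1 + np.exp(-(X @ beta_true)))
  y = rng.binomial(1, p_true)

  beta = np.zeros(3)
  lr = 0.5
  for _ in range(5000):
      p = 1 / (1 + np.exp(-(X @ beta)))   # sigma(X beta)
      gradient = X.T @ (y - p)            # gradient of the log-likelihood
      beta += lr * gradient / n           # ascend, since we are maximizing
  print(beta)                             # roughly recovers beta_true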

Interpretation: Odds Ratios

The coefficients β have a natural interpretation in terms of odds ratios:

odds = P(y=1) / P(y=0) = e^(β₀ + β₁x₁ + ... + βₚxₚ)
A one-unit increase in xⱼ, holding the other predictors fixed, multiplies the odds by e^(βⱼ).

6. Decision Trees

Decision trees are non-parametric models that recursively partition the feature space into regions, making predictions based on the majority class (classification) or mean value (regression) within each region.

Splitting Criteria

Gini Impurity
Gini(t) = 1 - ∑ₖ pₖ²
Where pₖ is the proportion of class k samples in node t. Gini = 0 indicates perfect purity.
Information Gain (Entropy)
H(t) = -∑ₖ pₖ log₂(pₖ)

IG = H(parent) - ∑_{c ∈ children} (n_c / n) H(c), where n_c is the number of samples in child c and n the number in the parent.
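
A short numpy sketch of both criteria and the information gain of a single candidate split; the example labels are illustrative.

  # Gini impurity, entropy, and the information gain of one split.
  import numpy as np

  def gini(labels):
      _, counts = np.unique(labels, return_counts=True)
      p = counts / counts.sum()
      return 1.0 - np.sum(p ** 2)

  def entropy(labels):
      _, counts = np.unique(labels, return_counts=True)
      p = counts / counts.sum()
      return -np.sum(p * np.log2(p))

  def information_gain(parent, left, right):
      n = len(parent)
      return entropy(parent) - (len(left) / n) * entropy(left) \
                             - (len(right) / n) * entropy(right)

  parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
  left, right = parent[:3], parent[3:]          # a candidate split
  print(gini(parent), entropy(parent), information_gain(parent, left, right))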

Tree Pruning

To prevent overfitting, trees can be pruned using:

  • Pre-pruning (early stopping): limiting maximum depth, minimum samples per leaf, or minimum impurity decrease while the tree is grown
  • Post-pruning: growing the full tree and then removing branches, e.g., cost-complexity pruning, which trades training error against tree size

7. Support Vector Machines

SVMs find the optimal hyperplane that maximizes the margin between classes. They are particularly effective in high-dimensional spaces.

SVM Optimization Problem
min_{w,b} (1/2)||w||²
subject to: yᵢ(w · xᵢ + b) ≥ 1 for all i
The margin is 2/||w||, so minimizing ||w|| maximizes the margin.

Soft Margin SVM

For non-linearly separable data, slack variables ξᵢ allow some misclassification:

min (1/2)||w||² + C ∑ᵢ ξᵢ
subject to: yᵢ(w · xᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
C is the regularization parameter controlling the tradeoff between margin width and misclassification.
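
The effect of C can be seen in the following sketch: smaller C tolerates more margin violations (more support vectors), larger C penalizes them heavily. The dataset and the C values are illustrative assumptions.

  # Sketch of a soft-margin SVM: C trades margin width against misclassification.
  from sklearn.datasets import make_blobs
  from sklearn.svm import SVC

  X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

  for C in (0.01, 1.0, 100.0):
      clf = SVC(kernel="linear", C=C).fit(X, y)
      print(f"C={C}: support vectors={clf.support_vectors_.shape[0]}, "
            f"train accuracy={clf.score(X, y):.3f}")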

The Kernel Trick

Kernels enable SVMs to learn non-linear decision boundaries by implicitly mapping data to higher-dimensional feature spaces. Common choices include:

  • Linear: K(x, x') = x · x'
  • Polynomial: K(x, x') = (x · x' + c)^d
  • RBF (Gaussian): K(x, x') = exp(-γ||x - x'||²)
  • Sigmoid: K(x, x') = tanh(κ x · x' + c)
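
The sketch below computes an RBF kernel matrix directly in input space, illustrating that the high-dimensional feature map is never constructed explicitly; γ and the sample points are illustrative assumptions.

  # RBF (Gaussian) kernel: K[i, j] = exp(-gamma * ||x1_i - x2_j||^2)
  import numpy as np

  def rbf_kernel(X1, X2, gamma=1.0):
      sq_dists = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] \
                 - 2 * X1 @ X2.T
      return np.exp(-gamma * sq_dists)

  X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
  print(rbf_kernel(X, X))   # nearby points score near 1, distant points near 0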

8. Ensemble Methods

Ensemble methods combine multiple models to produce better predictions than any single model.

Bagging (Bootstrap Aggregating)

Train multiple models on bootstrap samples (drawn with replacement from the training data) and average their predictions (regression) or take a majority vote (classification).

Random Forests

Random forests extend bagging for decision trees by adding feature randomization: at each split, only a random subset of features is considered. This decorrelates the individual trees, which reduces the variance of the averaged ensemble.

Random Forest Feature Importance

Two common measures:

  • Mean Decrease in Impurity (MDI): Average reduction in Gini impurity across all trees
  • Mean Decrease in Accuracy (MDA): Reduction in OOB accuracy when feature values are permuted
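
The sketch below contrasts the two measures on a fitted random forest; the synthetic dataset is an illustrative assumption, and scikit-learn's permutation_importance (computed here on a held-out set rather than strictly on OOB samples) stands in for MDA.

  # Impurity-based (MDI) vs permutation-based (MDA-style) feature importance.
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.inspection import permutation_importance
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
  rf.fit(X_train, y_train)
  print("OOB accuracy:", rf.oob_score_)

  print("MDI importances:", rf.feature_importances_)            # impurity-based
  perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
  print("permutation importances:", perm.importances_mean)      # accuracy-based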

Boosting

Boosting builds models sequentially, with each new model focusing on examples the previous models misclassified.

AdaBoost

AdaBoost Algorithm
H(x) = sign(∑ₜ αₜ hₜ(x))
Where hₜ are weak learners and αₜ = (1/2) ln((1 - εₜ)/εₜ), with εₜ being the weighted error rate of hₜ.

Gradient Boosting

Gradient boosting generalizes boosting by optimizing a differentiable loss function:

Fₘ(x) = Fₘ₋₁(x) + γₘ hₘ(x)
Each weak learner hm fits the negative gradient (pseudo-residuals) of the loss function.
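
For squared-error loss the negative gradient is simply the residual y - F(x), which the following sketch exploits; the tree depth, learning rate, number of rounds, and data are illustrative assumptions.

  # Sketch of gradient boosting with squared-error loss: each tree fits the
  # residuals (negative gradient) of the current ensemble.
  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  rng = np.random.default_rng(0)
  X = rng.uniform(0, 6, size=(300, 1))
  y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

  F = np.full_like(y, y.mean())           # F_0: constant initial prediction
  learning_rate, trees = 0.1, []
  for _ in range(100):
      residuals = y - F                   # negative gradient of 1/2 (y - F)^2
      tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
      trees.append(tree)
      F += learning_rate * tree.predict(X)   # F_m = F_{m-1} + gamma * h_m

  print("training MSE:", np.mean((y - F) ** 2))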

9. Clustering Algorithms

K-Means Clustering

K-Means partitions n observations into k clusters by minimizing within-cluster variance.

K-Means Objective
J = ∑ₖ₌₁ᴷ ∑_{x ∈ Cₖ} ||x - μₖ||²
Where μₖ is the centroid of cluster Cₖ.

Algorithm Steps:

  1. Initialize k centroids randomly
  2. Assign each point to nearest centroid
  3. Recompute centroids as mean of assigned points
  4. Repeat steps 2-3 until convergence
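
A minimal numpy sketch of the steps above follows; the number of clusters, the convergence check, and the synthetic blob data are illustrative assumptions (a production implementation would also handle empty clusters and multiple restarts).

  # Minimal K-Means following steps 1-4 above.
  import numpy as np

  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                 for c in ([0, 0], [5, 5], [0, 5])])
  k = 3

  centroids = X[rng.choice(len(X), k, replace=False)]        # step 1: random init
  for _ in range(100):
      # step 2: assign each point to its nearest centroid
      dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
      labels = dists.argmin(axis=1)
      # step 3: recompute centroids as the mean of assigned points
      new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
      if np.allclose(new_centroids, centroids):              # step 4: convergence
          break
      centroids = new_centroids
  print(centroids)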

Hierarchical Clustering

Builds a hierarchy of clusters using either:

  • Agglomerative (bottom-up): start with each point as its own cluster and repeatedly merge the closest pair
  • Divisive (top-down): start with a single cluster containing all points and recursively split it

The definition of "closest" depends on the linkage criterion (single, complete, average, or Ward linkage).

DBSCAN

Density-Based Spatial Clustering of Applications with Noise identifies clusters as dense regions separated by sparse regions. Advantages include:

  • No need to specify the number of clusters in advance
  • Ability to find arbitrarily shaped clusters
  • Robustness to outliers, which are labeled as noise rather than forced into a cluster

10. Dimensionality Reduction

Principal Component Analysis (PCA)

PCA finds orthogonal directions (principal components) that maximize variance.

PCA via Eigendecomposition
Σv = λv
Where Σ is the covariance matrix, v are eigenvectors (principal components), and λ are eigenvalues (variance explained). Sort by λ and keep top k components.
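
A numpy sketch of exactly this recipe (center, form Σ, eigendecompose, sort, project); the data and the choice of k = 2 components are illustrative assumptions.

  # PCA via eigendecomposition of the covariance matrix.
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features

  Xc = X - X.mean(axis=0)                   # center the data
  cov = np.cov(Xc, rowvar=False)            # covariance matrix Sigma
  eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices, ascending order

  order = np.argsort(eigvals)[::-1]         # sort by variance explained, descending
  eigvals, eigvecs = eigvals[order], eigvecs[:, order]

  k = 2
  Z = Xc @ eigvecs[:, :k]                   # project onto the top-k components
  print("explained variance ratio:", eigvals[:k] / eigvals.sum())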

t-SNE

t-Distributed Stochastic Neighbor Embedding is a nonlinear technique for visualization, preserving local structure by matching probability distributions in high and low dimensions.

11. Neural Networks

Perceptron and Multi-Layer Networks

Single Neuron
output = σ(∑ᵢ wᵢxᵢ + b)
Where σ is an activation function (sigmoid, ReLU, tanh, etc.).

Activation Functions

Common choices include:

  • Sigmoid: σ(z) = 1 / (1 + e^(-z)), outputs in (0, 1)
  • Tanh: tanh(z), outputs in (-1, 1) and zero-centered
  • ReLU: max(0, z), computationally cheap and the default in most modern networks

Backpropagation

Backpropagation computes gradients efficiently using the chain rule:

∂L/∂wᵢⱼ = ∂L/∂aⱼ · ∂aⱼ/∂zⱼ · ∂zⱼ/∂wᵢⱼ
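
The sketch below applies this chain rule to a single sigmoid neuron with squared-error loss; the inputs, weights, and target are illustrative assumptions.

  # One forward and backward pass for a single sigmoid neuron.
  import numpy as np

  x = np.array([0.5, -1.0, 2.0])   # inputs x_i
  w = np.array([0.1, 0.2, -0.3])   # weights w_i
  b = 0.05
  y = 1.0                          # target

  # forward pass
  z = w @ x + b                    # pre-activation z
  a = 1 / (1 + np.exp(-z))         # activation a = sigma(z)
  L = 0.5 * (y - a) ** 2           # squared-error loss

  # backward pass: dL/dw_i = dL/da * da/dz * dz/dw_i
  dL_da = -(y - a)
  da_dz = a * (1 - a)              # derivative of the sigmoid
  dz_dw = x
  grad_w = dL_da * da_dz * dz_dw
  grad_b = dL_da * da_dz
  print(grad_w, grad_b)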

12. Model Evaluation

Classification Metrics

  Metric      Formula                                           Use Case
  Accuracy    (TP + TN) / (TP + TN + FP + FN)                   Balanced classes
  Precision   TP / (TP + FP)                                    Minimize false positives
  Recall      TP / (TP + FN)                                    Minimize false negatives
  F1 Score    2 · (Precision · Recall) / (Precision + Recall)   Balance precision and recall
  AUC-ROC     Area under the ROC curve                          Overall ranking ability
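
The first four metrics can be computed directly from the confusion-matrix counts, as in the sketch below; the example predictions are illustrative.

  # Accuracy, precision, recall, and F1 from confusion-matrix counts.
  import numpy as np

  y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
  y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

  tp = np.sum((y_pred == 1) & (y_true == 1))
  tn = np.sum((y_pred == 0) & (y_true == 0))
  fp = np.sum((y_pred == 1) & (y_true == 0))
  fn = np.sum((y_pred == 0) & (y_true == 1))

  accuracy  = (tp + tn) / (tp + tn + fp + fn)
  precision = tp / (tp + fp)
  recall    = tp / (tp + fn)
  f1        = 2 * precision * recall / (precision + recall)
  print(accuracy, precision, recall, f1)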

Regression Metrics

Common regression metrics include:

  • MSE: mean squared error, (1/n)∑ᵢ (yᵢ - ŷᵢ)²
  • RMSE: root mean squared error, expressed in the units of y
  • MAE: mean absolute error, (1/n)∑ᵢ |yᵢ - ŷᵢ|, less sensitive to outliers
  • R²: proportion of the variance in y explained by the model

Cross-Validation

K-fold cross-validation provides robust performance estimates:

  1. Split data into k equal folds
  2. For each fold: train on k-1 folds, evaluate on held-out fold
  3. Average performance across all folds
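
These three steps are what scikit-learn's cross_val_score performs; the sketch below shows 5-fold cross-validation, with the synthetic dataset and the choice of logistic regression as illustrative assumptions.

  # 5-fold cross-validation sketch with scikit-learn.
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = make_classification(n_samples=500, n_features=10, random_state=0)
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
  print("fold accuracies:", scores)
  print("mean accuracy  :", scores.mean())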

References and Further Reading

  1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd Edition. Springer.
  2. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
  3. Murphy, K.P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
  4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  5. James, G., et al. (2021). An Introduction to Statistical Learning, 2nd Edition. Springer.
  6. Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32.
  7. Mitchell, T.M. (1997). Machine Learning. McGraw-Hill.