1. Introduction to Machine Learning

Machine Learning (ML) is a branch of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. The field emerged from pattern recognition and computational learning theory, with roots tracing back to Alan Turing's seminal question: "Can machines think?"

Formally, a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Mitchell, 1997).

Prerequisites for This Tutorial
  • Linear algebra: vectors, matrices, eigenvalues
  • Calculus: derivatives, gradients, optimization
  • Probability and statistics: distributions, Bayes' theorem
  • Basic programming concepts

The Machine Learning Pipeline

A typical ML workflow consists of:

  1. Data Collection: Gathering relevant data from various sources
  2. Data Preprocessing: Cleaning, normalization, feature engineering
  3. Model Selection: Choosing appropriate algorithm(s)
  4. Training: Fitting the model to training data
  5. Validation: Tuning hyperparameters using validation set
  6. Testing: Evaluating final performance on held-out test data
  7. Deployment: Putting the model into production
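
To make the steps above concrete, here is a minimal sketch of steps 2-6 using scikit-learn. The synthetic dataset, the choice of logistic regression, and the hyperparameter grid are illustrative assumptions, not part of the pipeline definition.

  # Minimal sketch of the ML pipeline (steps 2-6) with scikit-learn.
  # The synthetic data and the choice of logistic regression are illustrative only.
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split, GridSearchCV
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.linear_model import LogisticRegression

  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

  # Hold out a test set that is only touched once, at the very end.
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

  # Preprocessing and the model combined in one pipeline.
  pipe = Pipeline([("scale", StandardScaler()),
                   ("clf", LogisticRegression(max_iter=1000))])

  # Validation: tune the regularization strength C via cross-validation.
  search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
  search.fit(X_train, y_train)

  # Testing: evaluate the tuned model once on the held-out test data.
  print("test accuracy:", search.score(X_test, y_test))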

2. Machine Learning Taxonomy

Supervised Learning

The algorithm learns from labeled training data, mapping inputs to known outputs. Applications include:

  • Classification (discrete outputs): spam detection, image recognition, medical diagnosis
  • Regression (continuous outputs): house price prediction, demand forecasting

Unsupervised Learning

The algorithm finds patterns in unlabeled data without predefined outputs:

  • Clustering: grouping similar observations (e.g., customer segmentation)
  • Dimensionality reduction: compressing features while preserving structure
  • Anomaly detection: identifying observations that deviate from the norm

Reinforcement Learning

The algorithm learns through interaction with an environment, receiving rewards or penalties for actions. Used in robotics, game playing, and autonomous systems.

3. The Bias-Variance Tradeoff

Understanding the bias-variance tradeoff is fundamental to building effective ML models. For a given point x, the expected prediction error can be decomposed as:

Bias-Variance Decomposition
E[(y - f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²
Where:
Bias[f̂(x)] = E[f̂(x)] - f(x) (systematic error)
Var[f̂(x)] = E[(f̂(x) - E[f̂(x)])²] (variance in predictions)
σ² = irreducible error (noise in the data)

The goal is to find the optimal model complexity that minimizes total error.
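
The decomposition can be checked numerically. The sketch below repeatedly fits polynomials to fresh noisy samples of a known function and estimates bias² and variance of the prediction at a single point; the true function, noise level, and polynomial degrees are illustrative assumptions.

  # Monte Carlo sketch of the bias-variance decomposition at a single point x0.
  import numpy as np

  rng = np.random.default_rng(0)
  f = lambda x: np.sin(2 * np.pi * x)   # true function f(x)
  sigma = 0.3                           # noise standard deviation
  x0 = 0.25                             # point at which the error is decomposed

  def fit_predict(degree):
      """Fit a polynomial to one noisy training set and predict at x0."""
      x = rng.uniform(0, 1, 30)
      y = f(x) + rng.normal(0, sigma, 30)
      coeffs = np.polyfit(x, y, degree)
      return np.polyval(coeffs, x0)

  for degree in (1, 3, 9):
      preds = np.array([fit_predict(degree) for _ in range(2000)])
      bias2 = (preds.mean() - f(x0)) ** 2
      var = preds.var()
      print(f"degree {degree}: bias^2={bias2:.4f}  variance={var:.4f}  "
            f"total={bias2 + var + sigma**2:.4f}")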

4. Linear Regression

Linear regression models the relationship between a dependent variable y and one or more independent variables X by fitting a linear equation to observed data.

Simple Linear Regression

Simple Linear Regression Model
y = β₀ + β₁x + ε
Where β₀ is the intercept, β₁ is the slope, and ε ~ N(0, σ²) is the error term.

Multiple Linear Regression

Matrix Form
y = Xβ + ε

β̂ = (XᵀX)⁻¹Xᵀy
The Ordinary Least Squares (OLS) estimator minimizes the sum of squared residuals, min_β ||y - Xβ||²; the expression above is its closed-form solution.
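
A minimal numpy sketch of the closed-form estimate follows; the synthetic data and coefficient values are illustrative assumptions, and a linear solve is used rather than forming the inverse explicitly.

  # Sketch of the OLS closed-form estimate on synthetic data (numpy only).
  import numpy as np

  rng = np.random.default_rng(0)
  n, p = 200, 3
  X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # add intercept column
  beta_true = np.array([1.0, 2.0, -3.0, 0.5])
  y = X @ beta_true + rng.normal(scale=0.5, size=n)

  # beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve for numerical stability
  beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
  print(beta_hat)   # close to beta_true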

Assumptions of Linear Regression

  1. Linearity: The relationship between X and y is linear
  2. Independence: Observations are independent
  3. Homoscedasticity: Constant variance of residuals
  4. Normality: Residuals are normally distributed
  5. No multicollinearity: Predictors are not perfectly correlated

Regularization

To prevent overfitting, regularization adds a penalty term to the loss function:

Ridge Regression (L2)
L(β) = ||y - Xβ||² + λ||β||₂²
Lasso Regression (L1)
L(β) = ||y - Xβ||² + λ||β||₁
Lasso can perform feature selection by shrinking some coefficients to exactly zero.
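
The sketch below contrasts the two penalties: ridge via its closed form (XᵀX + λI)⁻¹Xᵀy and lasso via scikit-learn. The data, λ, and the lasso alpha are illustrative assumptions.

  # Sketch comparing ridge (closed form) and lasso (scikit-learn) coefficients.
  import numpy as np
  from sklearn.linear_model import Lasso

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 10))
  beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only 3 informative features
  y = X @ beta_true + rng.normal(scale=0.5, size=100)

  lam = 1.0
  # Ridge has a closed form: beta = (X^T X + lambda I)^{-1} X^T y
  beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

  # Lasso has no closed form; its optimizer shrinks some coefficients to exactly zero.
  beta_lasso = Lasso(alpha=0.1).fit(X, y).coef_

  print("ridge nonzero coefficients:", np.sum(np.abs(beta_ridge) > 1e-6))  # typically all 10
  print("lasso nonzero coefficients:", np.sum(np.abs(beta_lasso) > 1e-6))  # typically ~3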

5. Logistic Regression

Despite its name, logistic regression is a classification algorithm that models the probability of a binary outcome.

Logistic Function (Sigmoid)
P(y=1|x) = σ(z) = 1 / (1 + e^(-z))

where z = β₀ + β₁x₁ + ... + βₚxₚ

Maximum Likelihood Estimation

Log-Likelihood Function
ℓ(β) = ∑ᵢ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)], where pᵢ = P(yᵢ = 1 | xᵢ)
Parameters are estimated by maximizing the log-likelihood, typically with gradient-based methods or the Newton-Raphson method.
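
As a minimal sketch of this estimation, the code below fits logistic regression by gradient ascent on the log-likelihood; the learning rate, iteration count, and simulated data are illustrative assumptions.

  # Sketch of logistic regression fitted by gradient ascent on the log-likelihood.
  import numpy as np

  rng = np.random.default_rng(0)
  n = 500
  X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 features
  beta_true = np.array([-0.5, 2.0, -1.0])
  p_true = 1 / (1 + np.exp(-(X @ beta_true)))
  y = rng.binomial(1, p_true)

  beta = np.zeros(3)
  lr = 0.5
  for _ in range(5000):
      p = 1 / (1 + np.exp(-(X @ beta)))   # sigma(X beta)
      gradient = X.T @ (y - p)            # gradient of the log-likelihood
      beta += lr * gradient / n           # ascend, since we are maximizing
  print(beta)                             # roughly recovers beta_true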

Interpretation: Odds Ratios

The coefficients β have a natural interpretation in terms of odds ratios:

odds = P(y=1) / P(y=0) = e^(β₀ + β₁x₁ + ... + βₚxₚ)
A one-unit increase in xⱼ, holding the other predictors fixed, multiplies the odds by e^(βⱼ).

6. Decision Trees

Decision trees are non-parametric models that recursively partition the feature space into regions, making predictions based on the majority class (classification) or mean value (regression) within each region.

Splitting Criteria

Gini Impurity
Gini(t) = 1 - ∑ₖ pₖ²
Where pₖ is the proportion of class k samples in node t. Gini = 0 indicates perfect purity.
Information Gain (Entropy)
H(t) = -∑ₖ pₖ log₂(pₖ)

IG = H(parent) - ∑_{c ∈ children} (n_c / n) H(c), where n_c is the number of samples in child c and n the number in the parent.
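
A short numpy sketch of both criteria and the information gain of a single candidate split; the example labels are illustrative.

  # Gini impurity, entropy, and the information gain of one split.
  import numpy as np

  def gini(labels):
      _, counts = np.unique(labels, return_counts=True)
      p = counts / counts.sum()
      return 1.0 - np.sum(p ** 2)

  def entropy(labels):
      _, counts = np.unique(labels, return_counts=True)
      p = counts / counts.sum()
      return -np.sum(p * np.log2(p))

  def information_gain(parent, left, right):
      n = len(parent)
      return entropy(parent) - (len(left) / n) * entropy(left) \
                             - (len(right) / n) * entropy(right)

  parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
  left, right = parent[:3], parent[3:]          # a candidate split
  print(gini(parent), entropy(parent), information_gain(parent, left, right))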

Tree Pruning

To prevent overfitting, trees can be pruned using:

  • Pre-pruning (early stopping): limiting maximum depth, minimum samples per leaf, or minimum impurity decrease while the tree is grown
  • Post-pruning: growing the full tree and then removing branches, e.g., cost-complexity pruning, which trades training error against tree size

7. Support Vector Machines

SVMs find the optimal hyperplane that maximizes the margin between classes. They are particularly effective in high-dimensional spaces.

SVM Optimization Problem
min_{w,b} (1/2)||w||²
subject to: yᵢ(w · xᵢ + b) ≥ 1 for all i
The margin is 2/||w||, so minimizing ||w|| maximizes the margin.

Soft Margin SVM

For non-linearly separable data, slack variables ξᵢ allow some misclassification:

min (1/2)||w||² + C ∑ᵢ ξᵢ
subject to: yᵢ(w · xᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
C is the regularization parameter controlling the tradeoff between margin width and misclassification.
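
The effect of C can be seen in the following sketch: smaller C tolerates more margin violations (more support vectors), larger C penalizes them heavily. The dataset and the C values are illustrative assumptions.

  # Sketch of a soft-margin SVM: C trades margin width against misclassification.
  from sklearn.datasets import make_blobs
  from sklearn.svm import SVC

  X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

  for C in (0.01, 1.0, 100.0):
      clf = SVC(kernel="linear", C=C).fit(X, y)
      print(f"C={C}: support vectors={clf.support_vectors_.shape[0]}, "
            f"train accuracy={clf.score(X, y):.3f}")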

The Kernel Trick

Kernels enable SVMs to learn non-linear decision boundaries by implicitly mapping data to higher-dimensional feature spaces. Common choices include:

  • Linear: K(x, x') = x · x'
  • Polynomial: K(x, x') = (x · x' + c)^d
  • RBF (Gaussian): K(x, x') = exp(-γ||x - x'||²)
  • Sigmoid: K(x, x') = tanh(κ x · x' + c)
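
The sketch below computes an RBF kernel matrix directly in input space, illustrating that the high-dimensional feature map is never constructed explicitly; γ and the sample points are illustrative assumptions.

  # RBF (Gaussian) kernel: K[i, j] = exp(-gamma * ||x1_i - x2_j||^2)
  import numpy as np

  def rbf_kernel(X1, X2, gamma=1.0):
      sq_dists = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] \
                 - 2 * X1 @ X2.T
      return np.exp(-gamma * sq_dists)

  X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
  print(rbf_kernel(X, X))   # nearby points score near 1, distant points near 0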

8. Ensemble Methods

Ensemble methods combine multiple models to produce better predictions than any single model.

Bagging (Bootstrap Aggregating)

Train multiple models on bootstrap samples (drawn with replacement from the training data) and average their predictions (regression) or take a majority vote (classification).

Random Forests

Random forests extend bagging for decision trees by adding feature randomization: at each split, only a random subset of features is considered. This decorrelates the individual trees, which reduces the variance of the averaged ensemble.

Random Forest Feature Importance

Two common measures:

  • Mean Decrease in Impurity (MDI): Average reduction in Gini impurity across all trees
  • Mean Decrease in Accuracy (MDA): Reduction in OOB accuracy when feature values are permuted
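
The sketch below contrasts the two measures on a fitted random forest; the synthetic dataset is an illustrative assumption, and scikit-learn's permutation_importance (computed here on a held-out set rather than strictly on OOB samples) stands in for MDA.

  # Impurity-based (MDI) vs permutation-based (MDA-style) feature importance.
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.inspection import permutation_importance
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
  rf.fit(X_train, y_train)
  print("OOB accuracy:", rf.oob_score_)

  print("MDI importances:", rf.feature_importances_)            # impurity-based
  perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
  print("permutation importances:", perm.importances_mean)      # accuracy-based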

Boosting

Boosting builds models sequentially, with each new model focusing on examples the previous models misclassified.

AdaBoost

AdaBoost Algorithm
H(x) = sign(∑ₜ αₜ hₜ(x))
Where hₜ are weak learners and αₜ = (1/2) ln((1 - εₜ)/εₜ), with εₜ being the weighted error rate of hₜ.

Gradient Boosting

Gradient boosting generalizes boosting by optimizing a differentiable loss function:

Fₘ(x) = Fₘ₋₁(x) + γₘ hₘ(x)
Each weak learner hm fits the negative gradient (pseudo-residuals) of the loss function.
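
For squared-error loss the negative gradient is simply the residual y - F(x), which the following sketch exploits; the tree depth, learning rate, number of rounds, and data are illustrative assumptions.

  # Sketch of gradient boosting with squared-error loss: each tree fits the
  # residuals (negative gradient) of the current ensemble.
  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  rng = np.random.default_rng(0)
  X = rng.uniform(0, 6, size=(300, 1))
  y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

  F = np.full_like(y, y.mean())           # F_0: constant initial prediction
  learning_rate, trees = 0.1, []
  for _ in range(100):
      residuals = y - F                   # negative gradient of 1/2 (y - F)^2
      tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
      trees.append(tree)
      F += learning_rate * tree.predict(X)   # F_m = F_{m-1} + gamma * h_m

  print("training MSE:", np.mean((y - F) ** 2))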

9. Clustering Algorithms

K-Means Clustering

K-Means partitions n observations into k clusters by minimizing within-cluster variance.

K-Means Objective
J = ∑ₖ₌₁ᴷ ∑_{x ∈ Cₖ} ||x - μₖ||²
Where μₖ is the centroid of cluster Cₖ.

Algorithm Steps:

  1. Initialize k centroids randomly
  2. Assign each point to nearest centroid
  3. Recompute centroids as mean of assigned points
  4. Repeat steps 2-3 until convergence
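
A minimal numpy sketch of the steps above follows; the number of clusters, the convergence check, and the synthetic blob data are illustrative assumptions (a production implementation would also handle empty clusters and multiple restarts).

  # Minimal K-Means following steps 1-4 above.
  import numpy as np

  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                 for c in ([0, 0], [5, 5], [0, 5])])
  k = 3

  centroids = X[rng.choice(len(X), k, replace=False)]        # step 1: random init
  for _ in range(100):
      # step 2: assign each point to its nearest centroid
      dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
      labels = dists.argmin(axis=1)
      # step 3: recompute centroids as the mean of assigned points
      new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
      if np.allclose(new_centroids, centroids):              # step 4: convergence
          break
      centroids = new_centroids
  print(centroids)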

Hierarchical Clustering

Builds a hierarchy of clusters using either:

  • Agglomerative (bottom-up): start with each point as its own cluster and repeatedly merge the closest pair
  • Divisive (top-down): start with a single cluster containing all points and recursively split it

The definition of "closest" depends on the linkage criterion (single, complete, average, or Ward linkage).

DBSCAN

Density-Based Spatial Clustering of Applications with Noise identifies clusters as dense regions separated by sparse regions. Advantages include:

  • No need to specify the number of clusters in advance
  • Ability to find arbitrarily shaped clusters
  • Robustness to outliers, which are labeled as noise rather than forced into a cluster

10. Dimensionality Reduction

Principal Component Analysis (PCA)

PCA finds orthogonal directions (principal components) that maximize variance.

PCA via Eigendecomposition
Σv = λv
Where Σ is the covariance matrix, v are eigenvectors (principal components), and λ are eigenvalues (variance explained). Sort by λ and keep top k components.
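
A numpy sketch of exactly this recipe (center, form Σ, eigendecompose, sort, project); the data and the choice of k = 2 components are illustrative assumptions.

  # PCA via eigendecomposition of the covariance matrix.
  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features

  Xc = X - X.mean(axis=0)                   # center the data
  cov = np.cov(Xc, rowvar=False)            # covariance matrix Sigma
  eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices, ascending order

  order = np.argsort(eigvals)[::-1]         # sort by variance explained, descending
  eigvals, eigvecs = eigvals[order], eigvecs[:, order]

  k = 2
  Z = Xc @ eigvecs[:, :k]                   # project onto the top-k components
  print("explained variance ratio:", eigvals[:k] / eigvals.sum())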

t-SNE

t-Distributed Stochastic Neighbor Embedding is a nonlinear technique for visualization, preserving local structure by matching probability distributions in high and low dimensions.

11. Neural Networks

Perceptron and Multi-Layer Networks

Single Neuron
output = σ(∑ᵢ wᵢxᵢ + b)
Where σ is an activation function (sigmoid, ReLU, tanh, etc.).

Activation Functions

Common choices include:

  • Sigmoid: σ(z) = 1 / (1 + e^(-z)), outputs in (0, 1)
  • Tanh: tanh(z), outputs in (-1, 1) and zero-centered
  • ReLU: max(0, z), computationally cheap and the default in most modern networks

Backpropagation

Backpropagation computes gradients efficiently using the chain rule:

∂L/∂wᵢⱼ = ∂L/∂aⱼ · ∂aⱼ/∂zⱼ · ∂zⱼ/∂wᵢⱼ
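
The sketch below applies this chain rule to a single sigmoid neuron with squared-error loss; the inputs, weights, and target are illustrative assumptions.

  # One forward and backward pass for a single sigmoid neuron.
  import numpy as np

  x = np.array([0.5, -1.0, 2.0])   # inputs x_i
  w = np.array([0.1, 0.2, -0.3])   # weights w_i
  b = 0.05
  y = 1.0                          # target

  # forward pass
  z = w @ x + b                    # pre-activation z
  a = 1 / (1 + np.exp(-z))         # activation a = sigma(z)
  L = 0.5 * (y - a) ** 2           # squared-error loss

  # backward pass: dL/dw_i = dL/da * da/dz * dz/dw_i
  dL_da = -(y - a)
  da_dz = a * (1 - a)              # derivative of the sigmoid
  dz_dw = x
  grad_w = dL_da * da_dz * dz_dw
  grad_b = dL_da * da_dz
  print(grad_w, grad_b)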

12. Model Evaluation

Classification Metrics

  Metric      Formula                                           Use Case
  Accuracy    (TP + TN) / (TP + TN + FP + FN)                   Balanced classes
  Precision   TP / (TP + FP)                                    Minimize false positives
  Recall      TP / (TP + FN)                                    Minimize false negatives
  F1 Score    2 · (Precision · Recall) / (Precision + Recall)   Balance precision and recall
  AUC-ROC     Area under the ROC curve                          Overall ranking ability
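
The first four metrics can be computed directly from the confusion-matrix counts, as in the sketch below; the example predictions are illustrative.

  # Accuracy, precision, recall, and F1 from confusion-matrix counts.
  import numpy as np

  y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
  y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

  tp = np.sum((y_pred == 1) & (y_true == 1))
  tn = np.sum((y_pred == 0) & (y_true == 0))
  fp = np.sum((y_pred == 1) & (y_true == 0))
  fn = np.sum((y_pred == 0) & (y_true == 1))

  accuracy  = (tp + tn) / (tp + tn + fp + fn)
  precision = tp / (tp + fp)
  recall    = tp / (tp + fn)
  f1        = 2 * precision * recall / (precision + recall)
  print(accuracy, precision, recall, f1)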

Regression Metrics

Common regression metrics include:

  • MSE: mean squared error, (1/n)∑ᵢ (yᵢ - ŷᵢ)²
  • RMSE: root mean squared error, expressed in the units of y
  • MAE: mean absolute error, (1/n)∑ᵢ |yᵢ - ŷᵢ|, less sensitive to outliers
  • R²: proportion of the variance in y explained by the model

Cross-Validation

K-fold cross-validation provides robust performance estimates:

  1. Split data into k equal folds
  2. For each fold: train on k-1 folds, evaluate on held-out fold
  3. Average performance across all folds
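
These three steps are what scikit-learn's cross_val_score performs; the sketch below shows 5-fold cross-validation, with the synthetic dataset and the choice of logistic regression as illustrative assumptions.

  # 5-fold cross-validation sketch with scikit-learn.
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = make_classification(n_samples=500, n_features=10, random_state=0)
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
  print("fold accuracies:", scores)
  print("mean accuracy  :", scores.mean())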

References and Further Reading

  1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd Edition. Springer.
  2. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
  3. Murphy, K.P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
  4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  5. James, G., et al. (2021). An Introduction to Statistical Learning, 2nd Edition. Springer.
  6. Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32.
  7. Mitchell, T.M. (1997). Machine Learning. McGraw-Hill.