1. Introduction to Machine Learning
Machine Learning (ML) is a branch of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. The field emerged from pattern recognition and computational learning theory, with roots tracing back to Alan Turing's seminal question: "Can machines think?"
Formally, a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Mitchell, 1997).
Prerequisites for This Tutorial
- Linear algebra: vectors, matrices, eigenvalues
- Calculus: derivatives, gradients, optimization
- Probability and statistics: distributions, Bayes' theorem
- Basic programming concepts
The Machine Learning Pipeline
A typical ML workflow consists of the following stages (a minimal code sketch follows the list):
- Data Collection: Gathering relevant data from various sources
- Data Preprocessing: Cleaning, normalization, feature engineering
- Model Selection: Choosing appropriate algorithm(s)
- Training: Fitting the model to training data
- Validation: Tuning hyperparameters using validation set
- Testing: Evaluating final performance on held-out test data
- Deployment: Putting the model into production
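A minimal sketch of this workflow in Python (scikit-learn and its bundled breast-cancer dataset are assumed here purely for illustration; any tabular dataset and estimator would do):

```python
# Data collection -> preprocessing -> model selection -> training ->
# validation -> testing, in one small scikit-learn pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)             # data collection
X_train, X_test, y_train, y_test = train_test_split(   # hold out a test set
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),                        # preprocessing
    ("model", LogisticRegression(max_iter=1000)),       # model selection
])

# Validation: tune the regularization strength C via cross-validation.
search = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)                            # training

# Testing: final evaluation on held-out data.
print("test accuracy:", accuracy_score(y_test, search.predict(X_test)))
```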
2. Machine Learning Taxonomy
Supervised Learning
The algorithm learns from labeled training data, mapping inputs to known outputs. Applications include:
- Regression: Predicting continuous values (e.g., price prediction, demand forecasting)
- Classification: Predicting categorical labels (e.g., spam detection, medical diagnosis)
Unsupervised Learning
The algorithm finds patterns in unlabeled data without predefined outputs:
- Clustering: Grouping similar data points (e.g., customer segmentation)
- Dimensionality Reduction: Reducing feature space while preserving information
- Anomaly Detection: Identifying unusual patterns
Reinforcement Learning
The algorithm learns through interaction with an environment, receiving rewards or penalties for actions. Used in robotics, game playing, and autonomous systems.
3. The Bias-Variance Tradeoff
Understanding the bias-variance tradeoff is fundamental to building effective ML models. For a given point x, the expected prediction error can be decomposed as
E[(y - f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²
where:
Bias[f̂(x)] = E[f̂(x)] - f(x) (systematic error)
Var[f̂(x)] = E[(f̂(x) - E[f̂(x)])²] (variance in predictions)
σ² = irreducible error (noise in the data)
- High Bias (Underfitting): Model too simple, fails to capture underlying patterns
- High Variance (Overfitting): Model too complex, fits noise in training data
The goal is to find the optimal model complexity that minimizes total error.
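One way to see the tradeoff is to fit models of increasing complexity and compare training and test error. The sketch below (NumPy and scikit-learn assumed; the sine-plus-noise data and the degrees 1, 4, and 15 are arbitrary choices) typically shows the low-degree fit underfitting and the high-degree fit overfitting:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # true signal + noise
X_train, y_train, X_test, y_test = X[:40], y[:40], X[40:], y[40:]

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),   # training error
          mean_squared_error(y_test, model.predict(X_test)))     # test error
```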
4. Linear Regression
Linear regression models the relationship between a dependent variable y and one or more independent variables X by fitting a linear equation to observed data.
Simple Linear Regression
Simple linear regression uses a single predictor: y = β₀ + β₁x + ε, where β₀ is the intercept, β₁ the slope, and ε a zero-mean error term.
Multiple Linear Regression
The ordinary least-squares coefficients are given in closed form by the normal equation:
β̂ = (XᵀX)⁻¹Xᵀy
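A quick NumPy check of the normal equation on synthetic data (the coefficients and noise level are made up for illustration); in practice a library solver or np.linalg.lstsq is preferred for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

X1 = np.hstack([np.ones((100, 1)), X])           # add an intercept column
beta_hat = np.linalg.inv(X1.T @ X1) @ X1.T @ y   # β̂ = (XᵀX)⁻¹Xᵀy
print(beta_hat)                                  # ≈ [0, 2, -1, 0.5]
```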
Assumptions of Linear Regression
- Linearity: The relationship between X and y is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of residuals
- Normality: Residuals are normally distributed
- No multicollinearity: Predictors are not perfectly correlated
Regularization
To prevent overfitting, regularization adds a penalty term to the least-squares loss:
Ridge (L2): minimize ||y - Xβ||² + λ ||β||₂²
Lasso (L1): minimize ||y - Xβ||² + λ ||β||₁
where λ ≥ 0 controls the strength of the penalty; larger λ shrinks the coefficients more.
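A sketch of both penalties with scikit-learn, where the alpha parameter plays the role of λ (the dataset and alpha values are arbitrary):

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # can set some coefficients exactly to zero

print("ridge nonzero coefficients:", (ridge.coef_ != 0).sum())
print("lasso nonzero coefficients:", (lasso.coef_ != 0).sum())
```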
5. Logistic Regression
Despite its name, logistic regression is a classification algorithm that models the probability of a binary outcome with the logistic (sigmoid) function:
P(y = 1 | x) = 1 / (1 + e⁻ᶻ), where z = β₀ + β₁x₁ + ... + βₚxₚ
Maximum Likelihood Estimation
The coefficients are estimated by maximizing the log-likelihood
ℓ(β) = ∑ᵢ [yᵢ log pᵢ + (1 - yᵢ) log(1 - pᵢ)], where pᵢ = P(yᵢ = 1 | xᵢ).
There is no closed-form solution, so iterative methods such as gradient descent or Newton's method are used.
Interpretation: Odds Ratios
The coefficients β have a natural interpretation in terms of odds ratios: a one-unit increase in xⱼ multiplies the odds p/(1 - p) by exp(βⱼ), holding the other predictors fixed.
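A minimal sketch (scikit-learn assumed; features are standardized here only so the coefficients are on comparable scales) of fitting a logistic regression and reading its coefficients as odds ratios:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)             # put features on a common scale
clf = LogisticRegression(max_iter=1000).fit(X, y)

odds_ratios = np.exp(clf.coef_.ravel())           # exp(β) per feature
print(odds_ratios[:5])                            # >1 raises the odds, <1 lowers them
```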
6. Decision Trees
Decision trees are non-parametric models that recursively partition the feature space into regions, making predictions based on the majority class (classification) or mean value (regression) within each region.
Splitting Criteria
Common impurity measures are entropy, H = -∑ₖ pₖ log₂ pₖ, and the Gini index, G = 1 - ∑ₖ pₖ². A split is chosen to maximize the information gain
IG = H(parent) - ∑_children (n_c / n) H(c)
where n_c is the number of training samples sent to child c.
Tree Pruning
To prevent overfitting, trees can be pruned using:
- Pre-pruning: Stop growing early (max depth, min samples per leaf), as illustrated in the sketch below
- Post-pruning: Grow full tree, then remove branches that don't improve validation performance
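A pre-pruning sketch with scikit-learn; max_depth=3 and min_samples_leaf=5 are illustrative values, not recommendations (post-pruning is available separately via the tree's cost-complexity pruning parameter ccp_alpha):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0)                       # grown to purity
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,         # pre-pruned
                                random_state=0)

print("full tree  :", cross_val_score(full_tree, X, y, cv=5).mean())
print("pruned tree:", cross_val_score(pruned, X, y, cv=5).mean())
```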
7. Support Vector Machines
SVMs find the optimal hyperplane that maximizes the margin between classes. They are particularly effective in high-dimensional spaces. The hard-margin formulation is:
minimize ½ ||w||²
subject to: yᵢ(w · xᵢ + b) ≥ 1 for all i
Soft Margin SVM
For non-linearly separable data, slack variables ξᵢ allow some misclassification:
minimize ½ ||w||² + C ∑ᵢ ξᵢ
subject to: yᵢ(w · xᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
where C controls the tradeoff between margin width and training error.
The Kernel Trick
Kernels enable SVMs to learn non-linear decision boundaries by implicitly mapping data to higher-dimensional feature spaces (a usage sketch follows this list):
- Linear: K(x, x') = x · x'
- Polynomial: K(x, x') = (γ x · x' + r)ᵈ
- RBF (Gaussian): K(x, x') = exp(-γ ||x - x'||²)
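A soft-margin SVM sketch comparing the kernels above (scikit-learn assumed; the two-moons data, C=1.0, and gamma="scale" are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")   # C is the slack penalty
    print(kernel, cross_val_score(clf, X, y, cv=5).mean())
```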
8. Ensemble Methods
Ensemble methods combine multiple models to produce better predictions than any single model.
Bagging (Bootstrap Aggregating)
Train multiple models on bootstrap samples and average predictions (regression) or vote (classification).
Random Forests
Random forests extend bagging for decision trees by adding feature randomization:
- Build many trees on bootstrap samples
- At each split, consider only a random subset of features (typically √p for classification, p/3 for regression)
- Aggregate predictions by majority vote or averaging
Random Forest Feature Importance
Two common measures (see the sketch below):
- Mean Decrease in Impurity (MDI): Average reduction in Gini impurity across all trees
- Mean Decrease in Accuracy (MDA): Reduction in OOB accuracy when feature values are permuted
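A sketch with scikit-learn (dataset and hyperparameters are illustrative): feature_importances_ is the impurity-based MDI measure, while permutation_importance is an MDA-style alternative, computed here on a held-out split rather than on OOB samples:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",   # √p features per split
                            oob_score=True, random_state=0).fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)
print("largest MDI importance:", rf.feature_importances_.max())

perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("largest permutation importance:", perm.importances_mean.max())
```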
Boosting
Boosting builds models sequentially, with each new model focusing on examples the previous models misclassified.
AdaBoost
AdaBoost fits weak learners (typically shallow trees) in sequence, increasing the weights of the examples each learner misclassifies. The final prediction is a weighted vote in which learner t contributes αₜ = ½ ln((1 - εₜ)/εₜ), where εₜ is its weighted training error.
Gradient Boosting
Gradient boosting generalizes boosting by optimizing a differentiable loss function: at each stage m, a new base learner hₘ is fit to the negative gradient of the loss with respect to the current predictions (the pseudo-residuals), and the model is updated as Fₘ(x) = Fₘ₋₁(x) + ν hₘ(x), where ν is the learning rate (shrinkage).
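A gradient-boosting sketch with scikit-learn, where learning_rate corresponds to the shrinkage ν and n_estimators to the number of sequential trees (values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)
print("CV accuracy:", cross_val_score(gb, X, y, cv=5).mean())
```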
9. Clustering Algorithms
K-Means Clustering
K-Means partitions n observations into k clusters by minimizing within-cluster variance.
Algorithm Steps:
1. Initialize k centroids randomly
2. Assign each point to its nearest centroid
3. Recompute each centroid as the mean of its assigned points
4. Repeat steps 2-3 until convergence (a NumPy sketch follows these steps)
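A NumPy sketch of these steps under simplifying assumptions (random initialization, fixed iteration cap, empty clusters not handled); library implementations add refinements such as k-means++ initialization:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                   # step 4: converged
            break
        centroids = new_centroids
    return centroids, labels

# Three well-separated synthetic blobs for illustration.
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
centroids, labels = kmeans(X, k=3)
print(centroids)
```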
Hierarchical Clustering
Builds a hierarchy of clusters using either:
- Agglomerative (bottom-up): Start with each point as a cluster, merge closest pairs (see the example after this list)
- Divisive (top-down): Start with one cluster, recursively split
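An agglomerative sketch with scikit-learn (the blob data and Ward linkage are illustrative choices); the linkage criterion defines the distance between clusters when merging:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```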
DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies clusters as dense regions separated by sparse regions. Advantages include (a usage sketch follows the list):
- No need to specify number of clusters
- Can find arbitrarily-shaped clusters
- Robust to outliers
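A DBSCAN usage sketch with scikit-learn; eps and min_samples are illustrative values and in practice must be tuned to the density of the data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # -1 marks noise points

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points labeled as noise:", int((labels == -1).sum()))
```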
10. Dimensionality Reduction
Principal Component Analysis (PCA)
PCA finds orthogonal directions (principal components) that maximize variance; they are the eigenvectors of the data's covariance matrix, ordered by decreasing eigenvalue.
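A NumPy sketch of PCA as an eigendecomposition of the covariance matrix, checked against scikit-learn (both libraries assumed; the Iris data is used only as a convenient example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris(return_X_y=True)[0]
Xc = X - X.mean(axis=0)                      # center the data
cov = np.cov(Xc, rowvar=False)               # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
explained = eigvals[order] / eigvals.sum()

print("explained variance ratio (manual) :", explained[:2])
print("explained variance ratio (sklearn):",
      PCA(n_components=2).fit(X).explained_variance_ratio_)
```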
t-SNE
t-Distributed Stochastic Neighbor Embedding is a nonlinear technique for visualization, preserving local structure by matching probability distributions in high and low dimensions.
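A minimal t-SNE usage sketch (scikit-learn assumed; perplexity=30 is an arbitrary setting and strongly affects the embedding):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)   # (n_samples, 2), suitable for a scatter plot colored by y
```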
11. Neural Networks
Perceptron and Multi-Layer Networks
A perceptron computes a weighted sum of its inputs plus a bias and passes the result through an activation function. Multi-layer networks (MLPs) stack layers of such units, which allows them to represent non-linear functions.
Activation Functions
- Sigmoid: σ(z) = 1/(1 + e⁻ᶻ) → output in (0, 1)
- ReLU: f(z) = max(0, z) → addresses vanishing gradient
- Softmax: For multi-class classification, outputs probability distribution
Backpropagation
Backpropagation computes gradients efficiently using the chain rule: errors are propagated backward layer by layer, so the gradient of the loss with respect to any weight w factors as ∂L/∂w = (∂L/∂a)(∂a/∂z)(∂z/∂w), where z is the unit's pre-activation input and a its activation.
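A minimal NumPy sketch of backpropagation under simplifying assumptions: one hidden layer, sigmoid activations everywhere, squared-error loss, full-batch updates on XOR, and arbitrary hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output
lr = 1.0                                             # arbitrary learning rate

for _ in range(5000):
    # Forward pass.
    a1 = sigmoid(X @ W1 + b1)          # hidden activations
    a2 = sigmoid(a1 @ W2 + b2)         # network output
    # Backward pass (chain rule) for squared-error loss.
    d2 = (a2 - y) * a2 * (1 - a2)      # error term at the output layer
    d1 = (d2 @ W2.T) * a1 * (1 - a1)   # error propagated back to the hidden layer
    W2 -= lr * a1.T @ d2; b2 -= lr * d2.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d1;  b1 -= lr * d1.sum(axis=0, keepdims=True)

print(a2.round(2).ravel())             # typically approaches [0, 1, 1, 0]
```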
12. Model Evaluation
Classification Metrics
| Metric | Formula | Use Case |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes |
| Precision | TP/(TP+FP) | Minimize false positives |
| Recall | TP/(TP+FN) | Minimize false negatives |
| F1 Score | 2 · (Precision · Recall)/(Precision + Recall) | Balance precision/recall |
| AUC-ROC | Area under ROC curve | Overall ranking ability |
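Computing the metrics in the table with scikit-learn (the dataset, split, and classifier are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]     # probability of the positive class

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_score))
```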
Regression Metrics
- MSE: (1/n) ∑(yᵢ - ŷᵢ)²
- RMSE: √MSE
- MAE: (1/n) ∑|yᵢ - ŷᵢ|
- R²: 1 - SSres/SStot
Cross-Validation
K-fold cross-validation provides robust performance estimates (a sketch follows these steps):
- Split data into k equal folds
- For each fold: train on k-1 folds, evaluate on held-out fold
- Average performance across all folds
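A sketch of the procedure with scikit-learn's cross_val_score, which performs the split/train/evaluate loop above (dataset and estimator are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("per-fold accuracy:", scores)
print("mean / std:", scores.mean(), scores.std())
```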
References and Further Reading
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd Edition. Springer.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
- Murphy, K.P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- James, G., et al. (2021). An Introduction to Statistical Learning, 2nd Edition. Springer.
- Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32.