Prerequisites
- Module 1 (Math essentials) and Module 2 (Supervised learning) fundamentals
- Comfort with Python, NumPy, Pandas, Matplotlib / Seaborn
- Experience using scikit-learn (train/test split, basic model API)
1. Ensemble Methods
Ensemble methods combine multiple base models to improve stability and predictive performance. We'll cover Bagging, Boosting, and Random Forests.
1.1 Bagging (Bootstrap Aggregating)
Concept: train multiple models on different bootstrap samples (sampling with replacement) and aggregate predictions.
Why it helps: reduces variance (averaging many high-variance learners makes predictions more stable).
Workflow
- Create many bootstrap samples from the training data.
- Train a base learner (often decision trees) on each sample.
- Aggregate predictions: average for regression, majority vote for classification.
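To make "sampling with replacement" concrete, here is a tiny NumPy sketch of drawing one bootstrap sample of row indices (the array values are purely illustrative):
import numpy as np
rng = np.random.default_rng(42)
indices = np.arange(10)  # pretend these are the row indices of the training set
bootstrap_idx = rng.choice(indices, size=indices.size, replace=True)
print("Bootstrap sample:", bootstrap_idx)  # some rows repeat, others are left out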
Python example (BaggingClassifier)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
bagging = BaggingClassifier(
estimator=DecisionTreeClassifier(),  # called base_estimator in scikit-learn < 1.2
n_estimators=50,
random_state=42
)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
print("Bagging accuracy:", accuracy_score(y_test, y_pred))
Real-life use cases
- Medical diagnostics where you want more stable predictions than a single tree
- Stabilizing high-variance learners on noisy datasets
1.2 Boosting
Concept: build models sequentially; each new model focuses on mistakes of the previous ones.
Effect: reduces bias and often yields very high accuracy, but can overfit if not regularized.
Common boosting algorithms
- AdaBoost — reweights samples iteratively based on errors
- Gradient Boosting (GBM) — fits new models to the residuals (negative gradients)
- XGBoost / LightGBM / CatBoost — optimized, faster, often higher accuracy in practice
Python example (AdaBoost)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
adaboost = AdaBoostClassifier(n_estimators=100, random_state=42)
adaboost.fit(X_train, y_train)
y_pred = adaboost.predict(X_test)
print("AdaBoost accuracy:", accuracy_score(y_test, y_pred))
When to use boosting
- When you need better predictive performance and have enough data
- Common in tabular data competitions and production models (XGBoost / LightGBM)
1.3 Random Forests
Concept: an ensemble of decision trees trained on bootstrap samples where each split considers a random subset of features.
Why it helps: reduces variance (bagging) and decorrelates trees via feature randomness, giving robust performance and built-in feature importance.
Python example (RandomForestClassifier)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
Notes & tips
- Random Forests are strong off-the-shelf models for many problems
- They handle numerical and (suitably encoded) categorical features well; missing values usually need imputation first in scikit-learn
- Use feature_importances_ to inspect important predictors, as sketched below
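A short sketch of inspecting importances from the rf model fitted above (the synthetic make_classification features have no names, so indices are printed):
import numpy as np
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]  # most important first
for idx in order[:5]:
    print(f"Feature {idx}: importance {importances[idx]:.3f}")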
2. Support Vector Machines (SVM)
SVMs find a hyperplane that best separates classes by maximizing the margin between them. They work well in high-dimensional spaces and with clear margin separation.
2.1 Intuition & math (high level)
The margin width is 2 / ||w||, so maximizing the margin is equivalent to minimizing ||w||. The soft-margin formulation allows some misclassifications, with the trade-off controlled by the parameter C: larger C penalizes margin violations more heavily.
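As a small illustration, a sketch that fits a linear-kernel SVM on a toy 2-D dataset and computes the margin width from the learned weight vector (the blob parameters and variable names are assumptions for illustration):
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
# Two well-separated blobs give an (almost) linearly separable toy problem
X_toy, y_toy = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=42)
linear_svm = SVC(kernel='linear', C=1.0)
linear_svm.fit(X_toy, y_toy)
w = linear_svm.coef_[0]  # weight vector of the separating hyperplane
print("||w|| =", np.linalg.norm(w))
print("Margin width = 2 / ||w|| =", 2 / np.linalg.norm(w))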
2.2 Kernels
When data are not linearly separable in input space, kernels map inputs into a higher-dimensional space:
- Linear — when data are linearly separable
- RBF (Gaussian) — popular default; handles curved boundaries
- Polynomial — for polynomial decision boundaries
2.3 Python example (SVM)
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=42)
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
2.4 When to use SVM
- Small-to-medium sized datasets with clear margins or high-dimensional feature spaces
- Problems where outliers are not dominant (SVMs can be sensitive to noise)
- Text classification with TF-IDF vectors (often effective)
Tune the hyperparameters C and gamma (or use an automated search like GridSearchCV).
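A minimal tuning sketch with GridSearchCV on the wine split above (the grid values and the added StandardScaler step are illustrative assumptions; SVMs are scale-sensitive, so scaling usually helps):
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
# Illustrative grid; adjust the ranges for your data
param_grid = {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': ['scale', 0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)  # wine training split from the example above
print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)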
3. Feature Engineering & Feature Selection
Transforming raw data into features that better represent the underlying problem can substantially improve model performance. Feature selection simplifies models, reduces overfitting and speeds up training.
3.1 Feature Engineering techniques
- Encoding categorical variables: One-Hot, Label Encoding, Ordinal Encoding, Target Encoding (careful with leakage)
- Scaling numeric features: StandardScaler, MinMaxScaler, RobustScaler
- Create interaction features: multiplication or concatenation of features
- Polynomial / basis expansions for non-linear models
- Time features: extract hour/day/month, cyclic encoding for time-of-day (sketched after the example below)
- Text features: TF-IDF, embeddings
Python example: encoding and scaling
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
df = pd.DataFrame({
'Color': ['Red','Blue','Green','Blue'],
'Size': [10, 20, 30, 25]
})
# One-hot encode Color
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['Color']))
# Scale Size
scaler = StandardScaler()
scaled_size = scaler.fit_transform(df[['Size']])
print("Encoded:\n", encoded_df)
print("Scaled Size:\n", scaled_size)
3.2 Feature Selection techniques
Three broad families:
- Filter methods — independent of model (correlation threshold, chi-square, mutual information)
- Wrapper methods — use a model to evaluate subsets (RFE)
- Embedded methods — selection occurs during model training (Lasso, tree-based importance)
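A quick filter-method sketch using mutual information via SelectKBest (assuming the synthetic X, y defined in the boosting example; k is an illustrative choice):
from sklearn.feature_selection import SelectKBest, mutual_info_classif
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_filtered = selector.fit_transform(X, y)
print("Kept feature indices:", selector.get_support(indices=True))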
Python example: Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=500)
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y) # X,y from your dataset
print("Selected features mask:", rfe.support_)
print("Feature ranking:", rfe.ranking_)
Notes on feature selection
- Always perform selection inside cross-validation to avoid selection bias (see the pipeline sketch after this list)
- Tree-based models give good feature importance estimates (but can be biased for high-cardinality categorical features)
- L1 regularization (Lasso) can zero out features — useful for high-dimensional sparse data
- Start with domain-aware feature engineering (dates, aggregations, group stats)
- Apply simple filters (remove constant or near-constant features)
- Use embedded or wrapper methods inside CV to refine selected features
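A minimal sketch of performing selection inside cross-validation by wrapping it in a Pipeline, so the RFE step is refit on each training fold (step names and fold count are assumptions):
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ('select', RFE(LogisticRegression(max_iter=500), n_features_to_select=5)),
    ('clf', LogisticRegression(max_iter=500))
])
scores = cross_val_score(pipe, X, y, cv=5)  # selection happens within each fold only
print("CV accuracy: %.3f" % scores.mean())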
4. Summary Table
| Technique | Type | Metrics Used | Key Advantages / Typical Uses |
|---|---|---|---|
| Bagging | Ensemble | Accuracy, RMSE, R² | Reduces variance; use with high-variance base learners (e.g., trees) |
| Boosting (AdaBoost, GBM, XGBoost) | Ensemble | Accuracy, F1-score, AUC | Reduces bias; very powerful for tabular data; careful tuning required |
| Random Forest | Ensemble | Accuracy, Feature Importance | Robust, handles large feature sets, built-in importance |
| Support Vector Machine (SVM) | Classifier / Regressor | Accuracy, Precision/Recall, F1, AUC | Effective in high-dimensional spaces; kernel trick handles non-linearities |
| Feature Engineering | Preprocessing | Depends on downstream model metrics | Transforms raw data into predictive signals; essential step in pipelines |
| Feature Selection (RFE, Lasso, tree-based) | Preprocessing/Embedded | Model performance metrics (AUC, R², RMSE) | Reduces dimensionality, improves generalization and speed |
5. Final notes & best practices
- Always start with a simple baseline model (e.g., logistic regression / small tree) before moving to complex ensembles; a minimal baseline sketch appears after this list.
- Carefully split data into train / validation / test; use cross-validation for robust estimates.
- Scale features when required (SVM, KNN, linear models with regularization).
- Be mindful of data leakage: fit scalers, encoders, and feature-selection steps on the training data only, and never let target information from the validation/test sets influence feature construction.
- Use feature importance and model-agnostic explainability tools (SHAP, LIME) to interpret ensemble models.
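A minimal baseline sketch tying these points together: a scaled logistic regression evaluated with cross-validation (the pipeline composition and fold count are assumptions; X, y as defined earlier):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
scores = cross_val_score(baseline, X, y, cv=5)
print("Baseline CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))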