Prerequisites
- Module 1 (Math essentials) and Module 2 (Supervised learning) fundamentals
- Comfort with Python, NumPy, Pandas, Matplotlib / Seaborn
- Experience using scikit-learn (train/test split, basic model API)
1. Ensemble Methods
Ensemble methods combine multiple base models to improve stability and predictive performance. We'll cover Bagging, Boosting, and Random Forests.
1.1 Bagging (Bootstrap Aggregating)
Concept: train multiple models on different bootstrap samples (sampling with replacement) and aggregate predictions.
Why it helps: reduces variance (averaging many high-variance learners makes predictions more stable).
Workflow
- Create many bootstrap samples from the training data.
- Train a base learner (often decision trees) on each sample.
- Aggregate predictions: average for regression, majority vote for classification.
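To make "sampling with replacement" concrete, here is a tiny NumPy sketch of drawing one bootstrap sample of row indices (the array values are purely illustrative):
import numpy as np
rng = np.random.default_rng(42)
indices = np.arange(10)  # pretend these are the row indices of the training set
bootstrap_idx = rng.choice(indices, size=indices.size, replace=True)
print("Bootstrap sample:", bootstrap_idx)  # some rows repeat, others are left out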
Python example (BaggingClassifier)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
bagging = BaggingClassifier(
estimator=DecisionTreeClassifier(),  # called base_estimator in scikit-learn < 1.2
n_estimators=50,
random_state=42
)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)
print("Bagging accuracy:", accuracy_score(y_test, y_pred))
Real-life use cases
- Medical diagnostics where you want more stable predictions than a single tree
- Stabilizing high-variance learners on noisy datasets
1.2 Boosting
Concept: build models sequentially; each new model focuses on mistakes of the previous ones.
Effect: reduces bias and often yields very high accuracy, but can overfit if not regularized.
Common boosting algorithms
- AdaBoost — reweights samples iteratively based on errors
- Gradient Boosting (GBM) — fits new models to the residuals (negative gradients)
- XGBoost / LightGBM / CatBoost — optimized, faster, often higher accuracy in practice
Python example (AdaBoost)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
adaboost = AdaBoostClassifier(n_estimators=100, random_state=42)
adaboost.fit(X_train, y_train)
y_pred = adaboost.predict(X_test)
print("AdaBoost accuracy:", accuracy_score(y_test, y_pred))
When to use boosting
- When you need better predictive performance and have enough data
- Common in tabular data competitions and production models (XGBoost / LightGBM)
1.3 Random Forests
Concept: an ensemble of decision trees trained on bootstrap samples where each split considers a random subset of features.
Why it helps: reduces variance (bagging) and decorrelates trees via feature randomness, giving robust performance and built-in feature importance.
Python example (RandomForestClassifier)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
Notes & tips
- Random Forests are strong off-the-shelf models for many problems
- They handle numerical and (suitably encoded) categorical features well; missing values usually need imputation first in scikit-learn
- Use feature_importances_ to inspect important predictors, as sketched below
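A short sketch of inspecting importances from the rf model fitted above (the synthetic make_classification features have no names, so indices are printed):
import numpy as np
importances = rf.feature_importances_
order = np.argsort(importances)[::-1]  # most important first
for idx in order[:5]:
    print(f"Feature {idx}: importance {importances[idx]:.3f}")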
2. Support Vector Machines (SVM)
SVMs find a hyperplane that best separates classes by maximizing the margin between them. They work well in high-dimensional spaces and with clear margin separation.
2.1 Intuition & math (high level)
The margin width is 2 / ||w||, so maximizing the margin is equivalent to minimizing ||w||. The soft-margin formulation allows some misclassifications, with the trade-off controlled by the parameter C: larger C penalizes margin violations more heavily.
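As a small illustration, a sketch that fits a linear-kernel SVM on a toy 2-D dataset and computes the margin width from the learned weight vector (the blob parameters and variable names are assumptions for illustration):
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
# Two well-separated blobs give an (almost) linearly separable toy problem
X_toy, y_toy = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=42)
linear_svm = SVC(kernel='linear', C=1.0)
linear_svm.fit(X_toy, y_toy)
w = linear_svm.coef_[0]  # weight vector of the separating hyperplane
print("||w|| =", np.linalg.norm(w))
print("Margin width = 2 / ||w|| =", 2 / np.linalg.norm(w))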
2.2 Kernels
When data are not linearly separable in input space, kernels map inputs into a higher-dimensional space:
- Linear — when data are linearly separable
- RBF (Gaussian) — popular default; handles curved boundaries
- Polynomial — for polynomial decision boundaries
2.3 Python example (SVM)
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=42)
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
2.4 When to use SVM
- Small-to-medium sized datasets with clear margins or high-dimensional feature spaces
- Problems where outliers are not dominant (SVMs can be sensitive to noise)
- Text classification with TF-IDF vectors (often effective)
Tune the hyperparameters C and gamma (or use an automated search like GridSearchCV).
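A minimal tuning sketch with GridSearchCV on the wine split above (the grid values and the added StandardScaler step are illustrative assumptions; SVMs are scale-sensitive, so scaling usually helps):
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
# Illustrative grid; adjust the ranges for your data
param_grid = {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': ['scale', 0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)  # wine training split from the example above
print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)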
3. Feature Engineering & Feature Selection
Transforming raw data into features that better represent the underlying problem can substantially improve model performance. Feature selection simplifies models, reduces overfitting and speeds up training.
3.1 Feature Engineering techniques
- Encoding categorical variables: One-Hot, Label Encoding, Ordinal Encoding, Target Encoding (careful with leakage)
- Scaling numeric features: StandardScaler, MinMaxScaler, RobustScaler
- Create interaction features: multiplication or concatenation of features
- Polynomial / basis expansions for non-linear models
- Time features: extract hour/day/month, cyclic encoding for time-of-day (sketched after the example below)
- Text features: TF-IDF, embeddings
Python example: encoding and scaling
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
df = pd.DataFrame({
'Color': ['Red','Blue','Green','Blue'],
'Size': [10, 20, 30, 25]
})
# One-hot encode Color
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['Color']))
# Scale Size
scaler = StandardScaler()
scaled_size = scaler.fit_transform(df[['Size']])
print("Encoded:\n", encoded_df)
print("Scaled Size:\n", scaled_size)
3.2 Feature Selection techniques
Three broad families:
- Filter methods — independent of model (correlation threshold, chi-square, mutual information)
- Wrapper methods — use a model to evaluate subsets (RFE)
- Embedded methods — selection occurs during model training (Lasso, tree-based importance)
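A quick filter-method sketch using mutual information via SelectKBest (assuming the synthetic X, y defined in the boosting example; k is an illustrative choice):
from sklearn.feature_selection import SelectKBest, mutual_info_classif
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_filtered = selector.fit_transform(X, y)
print("Kept feature indices:", selector.get_support(indices=True))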
Python example: Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=500)
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y) # X,y from your dataset
print("Selected features mask:", rfe.support_)
print("Feature ranking:", rfe.ranking_)
Notes on feature selection
- Always perform selection inside cross-validation to avoid selection bias (see the pipeline sketch after this list)
- Tree-based models give good feature importance estimates (but can be biased for high-cardinality categorical features)
- L1 regularization (Lasso) can zero out features — useful for high-dimensional sparse data
- Start with domain-aware feature engineering (dates, aggregations, group stats)
- Apply simple filters (remove constant or near-constant features)
- Use embedded or wrapper methods inside CV to refine selected features
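A minimal sketch of performing selection inside cross-validation by wrapping it in a Pipeline, so the RFE step is refit on each training fold (step names and fold count are assumptions):
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ('select', RFE(LogisticRegression(max_iter=500), n_features_to_select=5)),
    ('clf', LogisticRegression(max_iter=500))
])
scores = cross_val_score(pipe, X, y, cv=5)  # selection happens within each fold only
print("CV accuracy: %.3f" % scores.mean())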
4. Summary Table
| Technique | Type | Metrics Used | Key Advantages / Typical Uses |
|---|---|---|---|
| Bagging | Ensemble | Accuracy, RMSE, R² | Reduces variance; use with high-variance base learners (e.g., trees) |
| Boosting (AdaBoost, GBM, XGBoost) | Ensemble | Accuracy, F1-score, AUC | Reduces bias; very powerful for tabular data; careful tuning required |
| Random Forest | Ensemble | Accuracy, Feature Importance | Robust, handles large feature sets, built-in importance |
| Support Vector Machine (SVM) | Classifier / Regressor | Accuracy, Precision/Recall, F1, AUC | Effective in high-dimensional spaces; kernel trick handles non-linearities |
| Feature Engineering | Preprocessing | Depends on downstream model metrics | Transforms raw data into predictive signals; essential step in pipelines |
| Feature Selection (RFE, Lasso, tree-based) | Preprocessing/Embedded | Model performance metrics (AUC, R², RMSE) | Reduces dimensionality, improves generalization and speed |
5. Final notes & best practices
- Always start with a simple baseline model (e.g., logistic regression / small tree) before moving to complex ensembles; a minimal baseline sketch appears after this list.
- Carefully split data into train / validation / test; use cross-validation for robust estimates.
- Scale features when required (SVM, KNN, linear models with regularization).
- Be mindful of data leakage: fit scalers, encoders, and feature-selection steps on the training data only, and never let target information from the validation/test sets influence feature construction.
- Use feature importance and model-agnostic explainability tools (SHAP, LIME) to interpret ensemble models.
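A minimal baseline sketch tying these points together: a scaled logistic regression evaluated with cross-validation (the pipeline composition and fold count are assumptions; X, y as defined earlier):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
scores = cross_val_score(baseline, X, y, cv=5)
print("Baseline CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))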