1) Prerequisites
- Python basics; NumPy, Pandas, Matplotlib, Seaborn
- Data preprocessing: missing values, encoding, scaling (from Module 1)
- Math essentials: vectors, dot product; mean/variance; basic probability
2) What is Supervised Learning?
We learn a mapping from features X to a labeled target y, then use it to predict on new data.
- Regression — predict a continuous value (price, temperature)
- Classification — predict a category/label (spam vs not spam)
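Before the details, here is a minimal sketch of the fit-then-predict workflow. The numbers are toy values, not the course data; scikit-learn is assumed, as in the rest of the module.
# ---------- Supervised learning in a nutshell (toy sketch) ----------
from sklearn.linear_model import LinearRegression
X_train = [[1], [2], [3]]                          # features (one column)
y_train = [2.0, 4.1, 6.2]                          # continuous labels → a regression task
model = LinearRegression().fit(X_train, y_train)   # learn the mapping X → y
print(model.predict([[4]]))                        # apply it to unseen data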
3) Regression
3.1 Linear Regression — Theory
We assume a linear relationship:
y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε.
Parameters β are learned by minimizing a loss (typically MSE).
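For intuition, the MSE-minimizing β has a closed form, the normal equation β = (XᵀX)⁻¹Xᵀy. Below is a rough NumPy sketch on four rows of the house data used later in this section (a column of ones stands in for the intercept); it is for intuition only, not how scikit-learn is called.
# ---------- Normal equation (least squares) — intuition only ----------
import numpy as np
X_demo = np.array([[1, 650], [1, 800], [1, 1200], [1, 1500]])  # first column of ones = intercept term
y_demo = np.array([200000, 250000, 350000, 400000])
beta = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)   # solves (XᵀX)β = Xᵀy
print("β₀ (intercept), β₁ (slope):", beta)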
3.2 Metrics for Regression
- MSE (Mean Squared Error): average of the squared prediction errors; large misses are penalized heavily.
- RMSE (Root MSE): square root of MSE, so the error is expressed in the target's own units (e.g., dollars).
- R² (coefficient of determination): fraction of the target's variance the model explains; 1.0 is perfect, 0 means no better than predicting the mean.
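To connect these definitions to code, a small sketch computes each metric directly from its formula and checks the result against scikit-learn (made-up numbers):
# ---------- Regression metrics from their formulas (toy numbers) ----------
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])
mse = np.mean((y_true - y_pred) ** 2)             # MSE: average squared error
rmse = np.sqrt(mse)                               # RMSE: back in the target's units
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                          # R²: fraction of variance explained
print(mse, "==", mean_squared_error(y_true, y_pred))   # should match
print(r2, "==", r2_score(y_true, y_pred))              # should match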
3.3 Real-world Uses
- House/car price estimation
- Energy consumption prediction
- Sales/demand forecasting
3.4 Code: Simple & Multiple Linear Regression
# ---------- Linear Regression (simple & multiple) ----------
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np
# Example data (replace with your CSV/DB as needed)
df = pd.DataFrame({
"Area":[650, 800, 1200, 1500, 1800, 2100, 2500],
"Bedrooms":[1,2,3,3,4,4,5],
"Age":[30, 20, 15, 10, 8, 5, 3],
"Price":[200000, 250000, 350000, 400000, 460000, 520000, 600000]
})
# --- 1) Simple regression: Price ~ Area ---
X_simple = df[["Area"]]
y = df["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_simple, y, test_size=0.3, random_state=42)
simple = LinearRegression().fit(X_tr, y_tr)
pred_simple = simple.predict(X_te)
mse = mean_squared_error(y_te, pred_simple) # MSE explained above
rmse = np.sqrt(mse) # RMSE explained above
r2 = r2_score(y_te, pred_simple) # R² explained above
print("Simple Regression → MSE:", mse, "RMSE:", rmse, "R²:", r2)
# --- 2) Multiple regression: Price ~ Area + Bedrooms + Age ---
X_multi = df[["Area","Bedrooms","Age"]]
X_tr, X_te, y_tr, y_te = train_test_split(X_multi, y, test_size=0.3, random_state=42)
multi = LinearRegression().fit(X_tr, y_tr)
pred_multi = multi.predict(X_te)
mse2 = mean_squared_error(y_te, pred_multi)
rmse2 = np.sqrt(mse2)
r22 = r2_score(y_te, pred_multi)
print("Multiple Regression → MSE:", mse2, "RMSE:", rmse2, "R²:", r22)
print("Coefficients (β):", multi.coef_, "Intercept (β₀):", multi.intercept_)
3.5 Polynomial Regression (non-linear patterns)
# ---------- Polynomial Regression ----------
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X_p = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)   # fresh names so the house-price X/y above are not overwritten
y_p = np.array([1, 4, 9, 16, 25])                # y = x², a non-linear pattern
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_p)
lin = LinearRegression().fit(X_poly, y_p)
yhat = lin.predict(X_poly)
print("Coefficients:", lin.coef_, "Intercept:", lin.intercept_)
3.6 Regularization (overfitting guard)
# ---------- Ridge & Lasso ----------
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0).fit(X_multi, y) # L2 penalty
lasso = Lasso(alpha=0.1).fit(X_multi, y) # L1 penalty
print("Ridge β:", ridge.coef_)
print("Lasso β:", lasso.coef_) # some may be zero
4) Classification
4.1 Concept
Predict a discrete label: binary (0/1) or multi-class (A/B/C...). Often we model class probabilities and pick the highest.
4.2 Core Metrics for Classification
- Confusion Matrix: counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
- Accuracy: (TP + TN) / all predictions; can be misleading on imbalanced data.
- Precision: TP / (TP + FP), of everything predicted positive, how much really was positive.
- Recall: TP / (TP + FN), of all actual positives, how many were found.
- F1: harmonic mean of Precision and Recall; a single balanced number.
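A small sketch connecting these definitions to the confusion-matrix counts (toy labels, not the course data):
# ---------- Classification metrics from first principles (toy labels) ----------
import numpy as np
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)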
4.3 Code: Logistic Regression (binary)
# ---------- Logistic Regression (binary example) ----------
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
# Example data (study hours → pass/fail)
df = pd.DataFrame({"Hours":[1,2,3,4,5,6,7], "Passed":[0,0,0,1,1,1,1]})
X = df[["Hours"]]
y = df["Passed"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression()
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
# --- Metrics (defined above) ---
cm = confusion_matrix(y_te, pred) # Confusion Matrix
acc = accuracy_score(y_te, pred) # Accuracy
pre = precision_score(y_te, pred) # Precision
rec = recall_score(y_te, pred) # Recall
f1 = f1_score(y_te, pred) # F1
print("Confusion Matrix:\n", cm)
print("Accuracy:", acc, "Precision:", pre, "Recall:", rec, "F1:", f1)
4.4 Other Classifiers
# ---------- Decision Tree ----------
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
print("Tree Accuracy:", tree_clf.score(X_te, y_te))
# ---------- k-Nearest Neighbors ----------
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("kNN Accuracy:", knn.score(X_te, y_te))
# ---------- Support Vector Machine ----------
from sklearn.svm import SVC
svm = SVC(kernel="rbf", probability=True, random_state=42).fit(X_tr, y_tr)
print("SVM Accuracy:", svm.score(X_te, y_te))
5) Model Validation, Overfitting & Underfitting
5.1 Cross-Validation
# ---------- k-Fold Cross-Validation ----------
from sklearn.model_selection import cross_val_score
# y was reassigned to the pass/fail labels in Section 4, so rebuild the house-price
# target from Section 3.4 here; cv=3 because the toy set has only 7 rows.
y_price = pd.Series([200000, 250000, 350000, 400000, 460000, 520000, 600000], name="Price")
scores = cross_val_score(LinearRegression(), X_multi, y_price, cv=3, scoring="r2")
print("CV R² (mean ± std):", scores.mean(), "±", scores.std())
5.2 Overfitting vs Underfitting
- Overfitting: low train error, high test error → too complex; fix via regularization, more data, simpler model, early stopping.
- Underfitting: high train & test error → too simple; fix via more features, non-linear transforms, more complex model.
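A quick way to see the difference in practice is to compare train vs test scores as model complexity grows; a hedged sketch on noisy toy data (a much larger train-than-test R² suggests overfitting):
# ---------- Diagnosing over/underfitting via train vs test score (toy data) ----------
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.RandomState(0)
X_toy = np.sort(rng.uniform(0, 3, 30)).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.2, size=30)   # non-linear signal + noise
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
for degree in (1, 10):   # too simple vs likely too flexible
    fitted = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr2, y_tr2)
    print(f"degree={degree}  train R²={fitted.score(X_tr2, y_tr2):.3f}  test R²={fitted.score(X_te2, y_te2):.3f}")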
6) End-to-End Mini Example
# Goal: Predict price; compare Linear vs Ridge; report metrics with clear meaning.
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
# 1) Load/prepare
df = pd.read_csv("your_prices.csv") # <-- replace with your data path
X = df[["area","rooms","age"]]
y = df["price"]
# 2) Split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
# 3) Scale numeric features (often helps)
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)
# 4) Train two models
lin = LinearRegression().fit(X_tr_s, y_tr)
ridge = Ridge(alpha=1.0).fit(X_tr_s, y_tr)
# 5) Predict & evaluate with explained metrics
for name, m in [("Linear", lin), ("Ridge", ridge)]:
    pred = m.predict(X_te_s)
    mse = mean_squared_error(y_te, pred)   # avg squared error
    rmse = np.sqrt(mse)                    # error in target units
    r2 = r2_score(y_te, pred)              # variance explained
    print(f"{name}: RMSE={rmse:.2f}, R²={r2:.3f}")
7) Summary Table — What to Use & How to Measure
| Type | Typical Algorithms | Main Metrics | Notes / When to Use |
|---|---|---|---|
| Regression | Linear, Polynomial, Ridge/Lasso, Random Forest Regr. | MSE, RMSE, R² | Continuous targets (price, demand). Use regularization to reduce overfitting. |
| Classification | Logistic Reg., Decision Tree/Forest, kNN, SVM | Accuracy, Precision, Recall, F1, Confusion Matrix | Imbalanced data → prefer Precision/Recall/F1, PR-AUC/ROC-AUC; tune thresholds. |
Key Takeaways
- Pick regression for continuous outputs; classification for categories.
- Understand metrics where they appear: MSE/RMSE/R² for regression; Confusion Matrix → Accuracy/Precision/Recall/F1 for classification.
- Validate with train/test & cross-validation; watch out for over/underfitting.
- Regularization (Ridge/Lasso) improves generalization; scaling often helps.