1) Prerequisites
- Python basics; NumPy, Pandas, Matplotlib, Seaborn
- Data preprocessing: missing values, encoding, scaling (from Module 1)
- Math essentials: vectors, dot product; mean/variance; basic probability
2) What is Supervised Learning?
We learn a mapping from features X to a labeled target y, then use it to predict on new data.
- Regression — predict a continuous value (price, temperature)
- Classification — predict a category/label (spam vs not spam)
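Before the details, here is a minimal sketch of the fit-then-predict workflow. The numbers are toy values, not the course data; scikit-learn is assumed, as in the rest of the module.
# ---------- Supervised learning in a nutshell (toy sketch) ----------
from sklearn.linear_model import LinearRegression
X_train = [[1], [2], [3]]                          # features (one column)
y_train = [2.0, 4.1, 6.2]                          # continuous labels → a regression task
model = LinearRegression().fit(X_train, y_train)   # learn the mapping X → y
print(model.predict([[4]]))                        # apply it to unseen data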
3) Regression
3.1 Linear Regression — Theory
We assume a linear relationship:
y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε.
Parameters β are learned by minimizing a loss (typically MSE).
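For intuition, the MSE-minimizing β has a closed form, the normal equation β = (XᵀX)⁻¹Xᵀy. Below is a rough NumPy sketch on four rows of the house data used later in this section (a column of ones stands in for the intercept); it is for intuition only, not how scikit-learn is called.
# ---------- Normal equation (least squares) — intuition only ----------
import numpy as np
X_demo = np.array([[1, 650], [1, 800], [1, 1200], [1, 1500]])  # first column of ones = intercept term
y_demo = np.array([200000, 250000, 350000, 400000])
beta = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)   # solves (XᵀX)β = Xᵀy
print("β₀ (intercept), β₁ (slope):", beta)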
3.2 Metrics for Regression
- MSE (Mean Squared Error): average of the squared prediction errors; large misses are penalized heavily.
- RMSE (Root MSE): square root of MSE, so the error is expressed in the target's own units (e.g., dollars).
- R² (coefficient of determination): fraction of the target's variance the model explains; 1.0 is perfect, 0 means no better than predicting the mean.
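To connect these definitions to code, a small sketch computes each metric directly from its formula and checks the result against scikit-learn (made-up numbers):
# ---------- Regression metrics from their formulas (toy numbers) ----------
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])
mse = np.mean((y_true - y_pred) ** 2)             # MSE: average squared error
rmse = np.sqrt(mse)                               # RMSE: back in the target's units
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                          # R²: fraction of variance explained
print(mse, "==", mean_squared_error(y_true, y_pred))   # should match
print(r2, "==", r2_score(y_true, y_pred))              # should match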
3.3 Real-world Uses
- House/car price estimation
- Energy consumption prediction
- Sales/demand forecasting
3.4 Code: Simple & Multiple Linear Regression
# ---------- Linear Regression (simple & multiple) ----------
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np
# Example data (replace with your CSV/DB as needed)
df = pd.DataFrame({
"Area":[650, 800, 1200, 1500, 1800, 2100, 2500],
"Bedrooms":[1,2,3,3,4,4,5],
"Age":[30, 20, 15, 10, 8, 5, 3],
"Price":[200000, 250000, 350000, 400000, 460000, 520000, 600000]
})
# --- 1) Simple regression: Price ~ Area ---
X_simple = df[["Area"]]
y = df["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_simple, y, test_size=0.3, random_state=42)
simple = LinearRegression().fit(X_tr, y_tr)
pred_simple = simple.predict(X_te)
mse = mean_squared_error(y_te, pred_simple) # MSE explained above
rmse = np.sqrt(mse) # RMSE explained above
r2 = r2_score(y_te, pred_simple) # R² explained above
print("Simple Regression → MSE:", mse, "RMSE:", rmse, "R²:", r2)
# --- 2) Multiple regression: Price ~ Area + Bedrooms + Age ---
X_multi = df[["Area","Bedrooms","Age"]]
X_tr, X_te, y_tr, y_te = train_test_split(X_multi, y, test_size=0.3, random_state=42)
multi = LinearRegression().fit(X_tr, y_tr)
pred_multi = multi.predict(X_te)
mse2 = mean_squared_error(y_te, pred_multi)
rmse2 = np.sqrt(mse2)
r22 = r2_score(y_te, pred_multi)
print("Multiple Regression → MSE:", mse2, "RMSE:", rmse2, "R²:", r22)
print("Coefficients (β):", multi.coef_, "Intercept (β₀):", multi.intercept_)
3.5 Polynomial Regression (non-linear patterns)
# ---------- Polynomial Regression ----------
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X_p = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)   # fresh names so the house-price X/y above are not overwritten
y_p = np.array([1, 4, 9, 16, 25])                # y = x², a non-linear pattern
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_p)
lin = LinearRegression().fit(X_poly, y_p)
yhat = lin.predict(X_poly)
print("Coefficients:", lin.coef_, "Intercept:", lin.intercept_)
3.6 Regularization (overfitting guard)
# ---------- Ridge & Lasso ----------
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0).fit(X_multi, y) # L2 penalty
lasso = Lasso(alpha=0.1).fit(X_multi, y) # L1 penalty
print("Ridge β:", ridge.coef_)
print("Lasso β:", lasso.coef_) # some may be zero
4) Classification
4.1 Concept
Predict a discrete label: binary (0/1) or multi-class (A/B/C...). Often we model class probabilities and pick the highest.
4.2 Core Metrics for Classification
- Confusion Matrix: counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
- Accuracy: (TP + TN) / all predictions; can be misleading on imbalanced data.
- Precision: TP / (TP + FP), of everything predicted positive, how much really was positive.
- Recall: TP / (TP + FN), of all actual positives, how many were found.
- F1: harmonic mean of Precision and Recall; a single balanced number.
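A small sketch connecting these definitions to the confusion-matrix counts (toy labels, not the course data):
# ---------- Classification metrics from first principles (toy labels) ----------
import numpy as np
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)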
4.3 Code: Logistic Regression (binary)
# ---------- Logistic Regression (binary example) ----------
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
# Example data (study hours → pass/fail)
df = pd.DataFrame({"Hours":[1,2,3,4,5,6,7], "Passed":[0,0,0,1,1,1,1]})
X = df[["Hours"]]
y = df["Passed"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression()
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
# --- Metrics (defined above) ---
cm = confusion_matrix(y_te, pred) # Confusion Matrix
acc = accuracy_score(y_te, pred) # Accuracy
pre = precision_score(y_te, pred) # Precision
rec = recall_score(y_te, pred) # Recall
f1 = f1_score(y_te, pred) # F1
print("Confusion Matrix:\n", cm)
print("Accuracy:", acc, "Precision:", pre, "Recall:", rec, "F1:", f1)
4.4 Other Classifiers
# ---------- Decision Tree ----------
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
print("Tree Accuracy:", tree_clf.score(X_te, y_te))
# ---------- k-Nearest Neighbors ----------
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("kNN Accuracy:", knn.score(X_te, y_te))
# ---------- Support Vector Machine ----------
from sklearn.svm import SVC
svm = SVC(kernel="rbf", probability=True, random_state=42).fit(X_tr, y_tr)
print("SVM Accuracy:", svm.score(X_te, y_te))
5) Model Validation, Overfitting & Underfitting
5.1 Cross-Validation
# ---------- k-Fold Cross-Validation ----------
from sklearn.model_selection import cross_val_score
# y was reassigned to the pass/fail labels in Section 4, so rebuild the house-price
# target from Section 3.4 here; cv=3 because the toy set has only 7 rows.
y_price = pd.Series([200000, 250000, 350000, 400000, 460000, 520000, 600000], name="Price")
scores = cross_val_score(LinearRegression(), X_multi, y_price, cv=3, scoring="r2")
print("CV R² (mean ± std):", scores.mean(), "±", scores.std())
5.2 Overfitting vs Underfitting
- Overfitting: low train error, high test error → too complex; fix via regularization, more data, simpler model, early stopping.
- Underfitting: high train & test error → too simple; fix via more features, non-linear transforms, more complex model.
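A quick way to see the difference in practice is to compare train vs test scores as model complexity grows; a hedged sketch on noisy toy data (a much larger train-than-test R² suggests overfitting):
# ---------- Diagnosing over/underfitting via train vs test score (toy data) ----------
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.RandomState(0)
X_toy = np.sort(rng.uniform(0, 3, 30)).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.2, size=30)   # non-linear signal + noise
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
for degree in (1, 10):   # too simple vs likely too flexible
    fitted = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr2, y_tr2)
    print(f"degree={degree}  train R²={fitted.score(X_tr2, y_tr2):.3f}  test R²={fitted.score(X_te2, y_te2):.3f}")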
6) End-to-End Mini Example
# Goal: Predict price; compare Linear vs Ridge; report metrics with clear meaning.
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
# 1) Load/prepare
df = pd.read_csv("your_prices.csv") # <-- replace with your data path
X = df[["area","rooms","age"]]
y = df["price"]
# 2) Split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
# 3) Scale numeric features (often helps)
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)
# 4) Train two models
lin = LinearRegression().fit(X_tr_s, y_tr)
ridge = Ridge(alpha=1.0).fit(X_tr_s, y_tr)
# 5) Predict & evaluate with explained metrics
for name, m in [("Linear", lin), ("Ridge", ridge)]:
    pred = m.predict(X_te_s)
    mse = mean_squared_error(y_te, pred)   # avg squared error
    rmse = np.sqrt(mse)                    # error in target units
    r2 = r2_score(y_te, pred)              # variance explained
    print(f"{name}: RMSE={rmse:.2f}, R²={r2:.3f}")
7) Summary Table — What to Use & How to Measure
| Type | Typical Algorithms | Main Metrics | Notes / When to Use |
|---|---|---|---|
| Regression | Linear, Polynomial, Ridge/Lasso, Random Forest Regr. | MSE, RMSE, R² | Continuous targets (price, demand). Use regularization to reduce overfitting. |
| Classification | Logistic Reg., Decision Tree/Forest, kNN, SVM | Accuracy, Precision, Recall, F1, Confusion Matrix | Imbalanced data → prefer Precision/Recall/F1, PR-AUC/ROC-AUC; tune thresholds. |
Key Takeaways
- Pick regression for continuous outputs; classification for categories.
- Understand metrics where they appear: MSE/RMSE/R² for regression; Confusion Matrix → Accuracy/Precision/Recall/F1 for classification.
- Validate with train/test & cross-validation; watch out for over/underfitting.
- Regularization (Ridge/Lasso) improves generalization; scaling often helps.