1. Introduction
Feature engineering transforms raw data into features that ML algorithms can use effectively. Scaling puts numerical features on comparable ranges so that features with large values do not dominate those with small ones.
2. Feature Scaling
Some ML algorithms are sensitive to feature magnitude, so we scale features using one of two common techniques:
- Normalization (min-max scaling): rescales values to the range [0, 1].
- Standardization (z-score scaling): centers each feature at mean 0 with standard deviation 1 (formulas below).
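Concretely, min-max normalization maps each value x to x' = (x - x_min) / (x_max - x_min), while standardization maps it to z = (x - μ) / σ, where μ and σ are the feature's mean and standard deviation.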
Normalization Example
```python
from sklearn.preprocessing import MinMaxScaler

# Rescale Age and Fare to the [0, 1] range
scaler = MinMaxScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
print(df[['Age', 'Fare']].head())
```
Standardization Example
```python
from sklearn.preprocessing import StandardScaler

# Alternative to min-max scaling: center Age and Fare at mean 0, std 1
# (in practice you would apply one technique or the other, not both)
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
print(df[['Age', 'Fare']].head())
```
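One caveat worth noting: when you have separate training and test data, fit the scaler on the training split only and reuse it to transform the test split, so that information from the test set does not leak into the scaling parameters. A minimal sketch, assuming the same df as above:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then learn the scaling parameters from the training portion only
X_train, X_test = train_test_split(df[['Age', 'Fare']], test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data
X_test_scaled = scaler.transform(X_test)        # apply the same mean/std to the test data
```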
3. Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that projects data onto the directions of maximum variance. PCA works best on standardized data.
Why PCA?
- Reduce feature dimensionality
- Improve computation efficiency
- Visualize high-dimensional data
Python Example
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize all features first; PCA is sensitive to feature scale
features = StandardScaler().fit_transform(df[['Age', 'Fare', 'FamilySize']])

pca = PCA(n_components=2)
principal_components = pca.fit_transform(features)
print(principal_components[:5])
```
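To decide how many components are worth keeping, you can inspect how much of the total variance each component explains via the fitted PCA's explained_variance_ratio_ attribute:

```python
# Fraction of total variance explained by each principal component
print(pca.explained_variance_ratio_)
# Cumulative share, useful for choosing how many components to retain
print(pca.explained_variance_ratio_.cumsum())
```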
4. Summary Table
| Technique | Purpose | When to Use | Python Snippet |
|---|---|---|---|
| Normalization | Rescale features to [0, 1] | KNN, Neural Networks | MinMaxScaler |
| Standardization | Mean=0, Std=1 | SVM, Logistic Regression, PCA | StandardScaler |
| PCA | Reduce dimensionality | High-dimensional datasets, visualization | PCA(n_components=2) |
5. Overfitting and Underfitting
When building machine learning models, it’s important to understand how well your model generalizes to new data. Two common issues are overfitting and underfitting.
Underfitting
Underfitting occurs when the model is too simple to capture the underlying patterns in the data. It results in poor performance on both training and test datasets.
Overfitting
Overfitting occurs when the model learns the training data too well, including noise and outliers, and performs poorly on unseen data.
Visual Illustration
Imagine fitting a curve to a set of data points: an underfit model draws a straight line through clearly curved data and misses the trend, an overfit model draws a wiggly curve that passes through every single point, noise included, while a well-fit model follows the overall shape without chasing individual points.
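The same idea can be shown in code. The sketch below uses synthetic data (a noisy sine curve, not the Titanic features) and fits polynomials of degree 1, 3, and 15; typically the degree-1 model underfits (high error on both sets), the degree-15 model overfits (very low training error, higher test error), and the degree-3 model lands in between.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)
X_test = rng.uniform(0, 1, size=(30, 1))
y_test = np.sin(2 * np.pi * X_test).ravel() + rng.normal(scale=0.2, size=30)

for degree in (1, 3, 15):
    # Polynomial regression: expand features to the given degree, then fit a linear model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```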
Tips to Avoid Overfitting
- Use more training data
- Reduce model complexity
- Use regularization (L1, L2)
- Apply cross-validation (see the sketch after this list for both)
- Use ensemble methods (Random Forest, Gradient Boosting)
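As a small illustration of the regularization and cross-validation tips, the sketch below uses a synthetic classification dataset and cross-validates a logistic regression while varying C, scikit-learn's inverse regularization strength (smaller C means stronger L2 regularization); the C values here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a real feature matrix and target
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

for C in (0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(C=C, max_iter=1000)  # L2 penalty by default
    scores = cross_val_score(clf, X, y, cv=5)     # 5-fold cross-validation
    print(f"C={C:<5}  mean CV accuracy={scores.mean():.3f}")
```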
Tips to Avoid Underfitting
- Use a more complex model
- Use better features (feature engineering)
- Reduce regularization
- Train longer or with more data if possible
6. Key Takeaways
- Feature scaling is crucial for distance-based algorithms and models sensitive to feature magnitude.
- Normalization rescales values to the range [0, 1]; standardization centers data at mean 0 with standard deviation 1.
- PCA reduces dimensionality and helps visualize high-dimensional datasets.
- Always scale or standardize data before applying PCA.