

Lesson 3: Feature Engineering & Scaling

Learn feature scaling techniques, normalization, standardization, and PCA for machine learning.

1. Introduction

Feature engineering transforms raw data into meaningful features suitable for ML algorithms. Scaling ensures numerical features are comparable and prevents certain features from dominating others.
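
For instance, new features can be derived from existing columns. A minimal sketch, assuming a Titanic-style DataFrame df with SibSp and Parch columns (the resulting FamilySize column is used again in the PCA example below):

import pandas as pd

# Hypothetical Titanic-style rows; in the lesson, df comes from the course dataset
df = pd.DataFrame({
    'Age':   [22, 38, 26, 35],
    'Fare':  [7.25, 71.28, 7.92, 53.10],
    'SibSp': [1, 1, 0, 1],
    'Parch': [0, 0, 0, 0],
})

# Engineered feature: family size = siblings/spouses + parents/children + the passenger
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
print(df.head())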

2. Feature Scaling

Some ML algorithms are sensitive to feature magnitude, so we scale features using:

  • Normalization: Rescales values between 0 and 1.
  • Standardization: Centers data around mean=0 and std=1.

Normalization Example

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age','Fare']] = scaler.fit_transform(df[['Age','Fare']])
print(df[['Age','Fare']].head())
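
Under the hood, min-max normalization computes (x - min) / (max - min) for each column. A manual pandas equivalent of the scaler above (a sketch, assuming the same df) is:

cols = ['Age', 'Fare']
# Same result as MinMaxScaler: subtract the column minimum, divide by the column range
df[cols] = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
print(df[cols].head())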

Standardization Example

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age','Fare']] = scaler.fit_transform(df[['Age','Fare']])
print(df[['Age','Fare']].head())
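
Likewise, standardization computes the z-score (x - mean) / std for each column. The manual pandas equivalent (again a sketch on the same df) is:

cols = ['Age', 'Fare']
# Same result as StandardScaler, which uses the population std (ddof=0)
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
print(df[cols].head())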

Tip: Use normalization for KNN or neural networks; standardization for SVM, logistic regression, or PCA.

3. Principal Component Analysis (PCA)

PCA is a dimensionality-reduction technique that projects data onto the directions of maximum variance. It works best on standardized data.

Why PCA?

  • Reduce feature dimensionality
  • Improve computational efficiency
  • Visualize high-dimensional data

Python Example

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize all three features first, since PCA is sensitive to feature scale
features = StandardScaler().fit_transform(df[['Age','Fare','FamilySize']])

pca = PCA(n_components=2)
principal_components = pca.fit_transform(features)
print(principal_components[:5])
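
Each principal component captures a share of the total variance, exposed via explained_variance_ratio_; checking it shows how much information the two components retain:

# Fraction of variance captured by each component, and the total retained
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())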

4. Summary Table

Technique        | Purpose                | When to Use                           | Python Snippet
Normalization    | Scale features to 0-1  | KNN, Neural Networks                  | MinMaxScaler
Standardization  | Mean=0, Std=1          | SVM, Logistic Regression, PCA         | StandardScaler
PCA              | Reduce dimensionality  | High-feature datasets, visualization  | PCA(n_components=2)

5. Overfitting and Underfitting

When building machine learning models, it’s important to understand how well your model generalizes to new data. Two common issues are overfitting and underfitting.

Underfitting

Underfitting occurs when the model is too simple to capture the underlying patterns in the data. It results in poor performance on both training and test datasets.

Example: Using a linear model to predict a highly non-linear relationship.
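
A minimal sketch of this on synthetic data (illustrative, not part of the lesson's dataset): a straight line fit to a sine-shaped relationship scores poorly on both the training and test sets.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Non-linear ground truth: y = sin(x) plus a little noise
rng = np.random.RandomState(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A plain linear model is too simple: low R^2 on both splits
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_train, y_train), model.score(X_test, y_test))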

Overfitting

Overfitting occurs when the model learns the training data too well, including noise and outliers, and performs poorly on unseen data.

Example: A decision tree with unlimited depth perfectly predicting training data but failing on test data.
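
A minimal sketch of that decision-tree behaviour, again on synthetic data (make_moons is an illustrative stand-in): an unrestricted tree memorizes the training set, while capping max_depth keeps train and test accuracy close.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unlimited depth: the tree memorizes the training set (overfitting)
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("deep    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))

# Limited depth: slightly lower train accuracy, better generalization
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("shallow train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))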

Visual Illustration

Figure (TBD): Overfitting vs. Underfitting, illustrating curves of increasing complexity fit to the same data points.

Tips to Avoid Overfitting

  • Use more training data
  • Reduce model complexity
  • Use regularization (L1, L2)
  • Apply cross-validation (see the sketch after this list)
  • Use ensemble methods (Random Forest, Gradient Boosting)
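
A minimal sketch combining two of these tips, L2 regularization (Ridge) and 5-fold cross-validation, on illustrative synthetic data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Illustrative regression data: 5 features with known weights plus noise
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# Ridge adds an L2 penalty (alpha sets its strength);
# cross_val_score averages performance over 5 train/validation splits
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
print(scores.mean(), scores.std())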

Tips to Avoid Underfitting

  • Use a more complex model
  • Use better features (feature engineering); see the sketch after this list
  • Reduce regularization
  • Train longer or with more data if possible
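
A minimal sketch of the first two tips, on synthetic data: adding polynomial features gives a linear model enough capacity to fit a curved relationship.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Curved relationship that a plain straight line underfits
rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# R^2 improves once the model is given the quadratic feature it needs
print("linear:", linear.score(X, y))
print("poly:  ", poly.score(X, y))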

6. Key Takeaways

  • Feature scaling is crucial for distance-based algorithms and models sensitive to feature magnitude.
  • Normalization rescales values between 0-1; standardization centers data around 0 with std=1.
  • PCA reduces dimensionality and helps visualize high-dimensional datasets.
  • Always scale or standardize data before applying PCA.