1. Introduction
Feature engineering transforms raw data into features that ML algorithms can use effectively. Scaling puts numerical features on comparable ranges so that features with large values do not dominate those with small ones.
2. Feature Scaling
Some ML algorithms are sensitive to feature magnitude, so we scale features using one of two common techniques:
- Normalization (min-max scaling): rescales values to the range [0, 1].
- Standardization (z-score scaling): centers each feature at mean 0 with standard deviation 1 (formulas below).
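Concretely, min-max normalization maps each value x to x' = (x - x_min) / (x_max - x_min), while standardization maps it to z = (x - μ) / σ, where μ and σ are the feature's mean and standard deviation.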
Normalization Example
```python
from sklearn.preprocessing import MinMaxScaler

# Rescale Age and Fare to the [0, 1] range
scaler = MinMaxScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
print(df[['Age', 'Fare']].head())
```
Standardization Example
```python
from sklearn.preprocessing import StandardScaler

# Alternative to min-max scaling: center Age and Fare at mean 0, std 1
# (in practice you would apply one technique or the other, not both)
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
print(df[['Age', 'Fare']].head())
```
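One caveat worth noting: when you have separate training and test data, fit the scaler on the training split only and reuse it to transform the test split, so that information from the test set does not leak into the scaling parameters. A minimal sketch, assuming the same df as above:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then learn the scaling parameters from the training portion only
X_train, X_test = train_test_split(df[['Age', 'Fare']], test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data
X_test_scaled = scaler.transform(X_test)        # apply the same mean/std to the test data
```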
3. Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that projects data onto the directions of maximum variance. PCA works best on standardized data.
Why PCA?
- Reduce feature dimensionality
- Improve computation efficiency
- Visualize high-dimensional data
Python Example
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize all features first; PCA is sensitive to feature scale
features = StandardScaler().fit_transform(df[['Age', 'Fare', 'FamilySize']])

pca = PCA(n_components=2)
principal_components = pca.fit_transform(features)
print(principal_components[:5])
```
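To decide how many components are worth keeping, you can inspect how much of the total variance each component explains via the fitted PCA's explained_variance_ratio_ attribute:

```python
# Fraction of total variance explained by each principal component
print(pca.explained_variance_ratio_)
# Cumulative share, useful for choosing how many components to retain
print(pca.explained_variance_ratio_.cumsum())
```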
4. Summary Table
| Technique | Purpose | When to Use | Python Snippet |
|---|---|---|---|
| Normalization | Rescale features to [0, 1] | KNN, Neural Networks | MinMaxScaler |
| Standardization | Mean=0, Std=1 | SVM, Logistic Regression, PCA | StandardScaler |
| PCA | Reduce dimensionality | High-dimensional datasets, visualization | PCA(n_components=2) |
5. Overfitting and Underfitting
When building machine learning models, it’s important to understand how well your model generalizes to new data. Two common issues are overfitting and underfitting.
Underfitting
Underfitting occurs when the model is too simple to capture the underlying patterns in the data. It results in poor performance on both training and test datasets.
Overfitting
Overfitting occurs when the model learns the training data too well, including noise and outliers, and performs poorly on unseen data.
Visual Illustration
Imagine fitting a curve to a set of data points: an underfit model draws a straight line through clearly curved data and misses the trend, an overfit model draws a wiggly curve that passes through every single point, noise included, while a well-fit model follows the overall shape without chasing individual points.
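The same idea can be shown in code. The sketch below uses synthetic data (a noisy sine curve, not the Titanic features) and fits polynomials of degree 1, 3, and 15; typically the degree-1 model underfits (high error on both sets), the degree-15 model overfits (very low training error, higher test error), and the degree-3 model lands in between.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)
X_test = rng.uniform(0, 1, size=(30, 1))
y_test = np.sin(2 * np.pi * X_test).ravel() + rng.normal(scale=0.2, size=30)

for degree in (1, 3, 15):
    # Polynomial regression: expand features to the given degree, then fit a linear model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```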
Tips to Avoid Overfitting
- Use more training data
- Reduce model complexity
- Use regularization (L1, L2)
- Apply cross-validation (see the sketch after this list for both)
- Use ensemble methods (Random Forest, Gradient Boosting)
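As a small illustration of the regularization and cross-validation tips, the sketch below uses a synthetic classification dataset and cross-validates a logistic regression while varying C, scikit-learn's inverse regularization strength (smaller C means stronger L2 regularization); the C values here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a real feature matrix and target
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

for C in (0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(C=C, max_iter=1000)  # L2 penalty by default
    scores = cross_val_score(clf, X, y, cv=5)     # 5-fold cross-validation
    print(f"C={C:<5}  mean CV accuracy={scores.mean():.3f}")
```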
Tips to Avoid Underfitting
- Use a more complex model
- Use better features (feature engineering)
- Reduce regularization
- Train longer or with more data if possible
6. Key Takeaways
- Feature scaling is crucial for distance-based algorithms and models sensitive to feature magnitude.
- Normalization rescales values to the range [0, 1]; standardization centers data at mean 0 with standard deviation 1.
- PCA reduces dimensionality and helps visualize high-dimensional datasets.
- Always scale or standardize data before applying PCA.