
Feature Engineering — Selection, Extraction, Scaling & Encoding

Transforming raw data into features that machines can actually learn from.


Definition

Feature engineering is the process of using domain knowledge to create, select, and transform input variables (features) so that ML models learn more accurately. Practitioners often report that the bulk of an ML project's time goes into feature engineering rather than model selection. The main steps: feature selection (choosing which variables to use), feature extraction (creating new features from existing ones), feature scaling (normalising magnitudes), and encoding (converting categorical values to numbers). Good features make simple models powerful; bad features make complex models useless.
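
The four steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not a prescription — the column names ('Age', 'Salary', 'City') and the Ridge model are hypothetical choices for the sketch:

```python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import Ridge

df = pd.DataFrame({
    'Age':    [25, 32, 47, 51, 38],
    'Salary': [40000, 65000, 120000, 95000, 70000],
    'City':   ['Pune', 'Delhi', 'Pune', 'Mumbai', 'Delhi'],
})
y = np.array([1.2, 2.0, 4.1, 3.5, 2.4])

# Scaling for numeric columns, one-hot encoding for the categorical one
pre = ColumnTransformer([
    ('scale', StandardScaler(), ['Age', 'Salary']),
    ('ohe', OneHotEncoder(handle_unknown='ignore'), ['City']),
])
model = Pipeline([('prep', pre), ('reg', Ridge())])
model.fit(df, y)
print(model.predict(df).round(2))
```

Wrapping preprocessing and model in one Pipeline keeps every transformation fitted on training data only, which matters once cross-validation enters the picture.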

Real-life analogy: The chef with raw ingredients

A machine learning model is like a recipe — and your raw data is unprocessed ingredients. Feature engineering is the chef's preparation: chopping (binning continuous variables), marinating (scaling), combining (interaction features), filtering out rotten bits (removing irrelevant features), and converting whole spices to powder (encoding categoricals). The best ML practitioners are not model-tuners — they are feature engineers.

Feature selection — choosing the right variables

Feature selection methods: filter, wrapper, and embedded

from sklearn.feature_selection import (SelectKBest, f_regression, mutual_info_regression,
    RFE, RFECV, SelectFromModel)
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

# ── Simulated data ──
np.random.seed(42)
X = pd.DataFrame(np.random.randn(500, 15),
    columns=[f'feature_{i}' for i in range(15)])
y = 3*X['feature_0'] - 2*X['feature_1'] + 0.5*X['feature_2'] + np.random.randn(500)
# Only features 0, 1, 2 are truly relevant

# METHOD 1: Filter methods — statistical correlation (fast, model-independent)
# F-statistic: linear correlation with target
selector_f = SelectKBest(score_func=f_regression, k=5)
selector_f.fit(X, y)
f_scores   = pd.Series(selector_f.scores_, index=X.columns).sort_values(ascending=False)
print("Top 5 by F-statistic:", f_scores.head().index.tolist())

# Mutual Information: non-linear correlation (detects any relationship)
selector_mi = SelectKBest(score_func=mutual_info_regression, k=5)
selector_mi.fit(X, y)
mi_scores   = pd.Series(selector_mi.scores_, index=X.columns).sort_values(ascending=False)
print("Top 5 by Mutual Info:", mi_scores.head().index.tolist())

# METHOD 2: Wrapper methods — train model, evaluate subsets
# Recursive Feature Elimination: remove least important features one at a time
rfe = RFE(estimator=Ridge(), n_features_to_select=5, step=1)
rfe.fit(X, y)
selected_rfe = X.columns[rfe.support_].tolist()
print("RFE selected:", selected_rfe)

# RFECV: RFE with cross-validation to find optimal number of features
rfecv = RFECV(estimator=Ridge(), min_features_to_select=1, cv=5, scoring='r2')
rfecv.fit(X, y)
print(f"RFECV optimal features: {rfecv.n_features_} = {X.columns[rfecv.support_].tolist()}")

# METHOD 3: Embedded methods — feature importance from model training
# Lasso: L1 regularisation naturally zeroes out irrelevant features
lasso = Lasso(alpha=0.1).fit(X, y)
lasso_selected = X.columns[lasso.coef_ != 0].tolist()
print("Lasso selected:", lasso_selected)

# Random Forest feature importance
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("RF top 5:", importances.head().index.tolist())

sfm = SelectFromModel(rf, threshold='mean')
sfm.fit(X, y)
print("SelectFromModel:", X.columns[sfm.get_support()].tolist())

Feature scaling — normalisation vs standardisation

When scaling is needed and which method to use

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer
import pandas as pd

data = pd.DataFrame({
    'Age':       [22, 35, 45, 55, 80, 28, 42],
    'Salary':    [25000, 80000, 150000, 250000, 500000, 45000, 120000],
    'YearsExp':  [1, 5, 10, 15, 30, 3, 8]
})

# StandardScaler: z-score — mean=0, std=1
# Use when: algorithm assumes Gaussian (linear regression, SVM, PCA, LDA)
std = StandardScaler()
data_std = pd.DataFrame(std.fit_transform(data), columns=data.columns)
print("StandardScaler — mean:", data_std.mean().round(2).to_dict())
print("StandardScaler — std: ", data_std.std().round(2).to_dict())

# MinMaxScaler: scale to [0, 1] or [-1, 1]
# Use when: bounded range needed (neural networks, image pixels), no Gaussian assumption
mm = MinMaxScaler(feature_range=(0, 1))
data_mm = pd.DataFrame(mm.fit_transform(data), columns=data.columns)

# RobustScaler: median and IQR — robust to outliers
# Use when: dataset has significant outliers
rob = RobustScaler()
data_rob = pd.DataFrame(rob.fit_transform(data), columns=data.columns)

# Normalizer: scales each ROW to unit norm (not each column!)
# Use when: text data (TF-IDF vectors), cosine similarity needed
norm = Normalizer(norm='l2')
data_norm = pd.DataFrame(norm.fit_transform(data), columns=data.columns)

# Models that NEED scaling:     KNN, SVM, Linear/Logistic Regression, PCA, Neural Nets
# Models that DON'T need it:    Decision Trees, Random Forests, XGBoost (tree-based)
# Critical rule: fit scaler on training data ONLY, transform both train and test
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(data, test_size=0.3, random_state=42)
X_train_scaled = std.fit_transform(X_train)   # fit + transform
X_test_scaled  = std.transform(X_test)        # ONLY transform (no refit!)
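
A safer way to enforce the train-only-fit rule is to put the scaler and model in a Pipeline, so cross-validation refits the scaler inside each fold automatically. A minimal sketch with synthetic data (KNeighborsRegressor is an arbitrary scale-sensitive choice):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1, 100, 10000]   # wildly different scales
y = X[:, 0] + X[:, 1] / 100 + rng.normal(size=200)

# The scaler is re-fitted on the training portion of every CV fold —
# the validation fold never influences the scaling statistics
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print("CV R^2:", scores.round(3))
```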

Encoding categorical variables

One-hot encoding, label encoding, target encoding

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Colour':   ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size':     ['Small', 'Medium', 'Large', 'Small', 'Large'],   # Ordinal
    'Brand':    ['Nike', 'Adidas', 'Puma', 'Nike', 'Reebok'],
    'Price':    [100, 150, 80, 120, 90]
})

# 1. ONE-HOT ENCODING (OHE): for NOMINAL categories (no natural order)
# Creates one binary column per category
# Use for: colour, brand, country — where no order exists
ohe = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity
colour_encoded = ohe.fit_transform(df[['Colour']])
colour_df = pd.DataFrame(colour_encoded, columns=ohe.get_feature_names_out(['Colour']))
print("One-Hot:\n", colour_df)
# Colour_Green, Colour_Red (Blue dropped to avoid dummy variable trap)

# pandas get_dummies (simpler for exploration)
ohe_pd = pd.get_dummies(df['Colour'], prefix='Colour', drop_first=True)
print("Pandas OHE:\n", ohe_pd)

# 2. LABEL ENCODING: assign integer to each category
# Use ONLY for ordinal data or tree-based models (arbitrary order is fine for trees)
le = LabelEncoder()
df['Brand_Label'] = le.fit_transform(df['Brand'])   # Adidas=0, Nike=1, Puma=2, Reebok=3
print("Label Encoded:", df[['Brand', 'Brand_Label']])
# WARNING: implies Nike > Adidas which is meaningless for linear models!

# 3. ORDINAL ENCODING: encode with meaningful order
ordinal = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Ordinal'] = ordinal.fit_transform(df[['Size']])
print("Ordinal Encoded:", df[['Size', 'Size_Ordinal']])  # Small=0, Medium=1, Large=2

# 4. TARGET ENCODING (mean encoding): replace category with mean of target
# Powerful for high-cardinality features (postal codes, user IDs, product IDs)
target_means = df.groupby('Brand')['Price'].mean()
df['Brand_TargetEnc'] = df['Brand'].map(target_means)
print("Target Encoded:", df[['Brand', 'Brand_TargetEnc']])
# WARNING: causes data leakage if done on full dataset — always do within CV folds

# 5. FREQUENCY ENCODING: replace category with its frequency
freq = df['Brand'].value_counts(normalize=True)
df['Brand_FreqEnc'] = df['Brand'].map(freq)

# High cardinality (1000+ categories) — use embeddings instead of OHE
# For tree models: use label/ordinal encoding (ignore artificial order)
# For linear models: ALWAYS use OHE for nominal categories
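
The leakage warning on target encoding deserves a concrete fix. A common remedy is out-of-fold encoding: each row's value comes from the means of the *other* folds, never from its own target. A minimal sketch using KFold (the fallback to the global mean for categories unseen in a training fold is one reasonable convention):

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'Brand': ['Nike', 'Adidas', 'Puma', 'Nike', 'Reebok', 'Adidas', 'Puma', 'Nike'],
    'Price': [100, 150, 80, 120, 90, 140, 85, 110],
})

global_mean = df['Price'].mean()
encoded = pd.Series(index=df.index, dtype=float)
kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Means computed on the training fold only — no row sees its own target
    fold_means = df.iloc[train_idx].groupby('Brand')['Price'].mean()
    # Categories absent from the training fold fall back to the global mean
    encoded.iloc[val_idx] = (df['Brand'].iloc[val_idx]
                             .map(fold_means).fillna(global_mean).values)
df['Brand_TargetEnc_OOF'] = encoded
print(df[['Brand', 'Brand_TargetEnc_OOF']])
```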

Feature extraction — creating new features

Feature extraction: interaction, polynomial, domain-specific

import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    'Length': [10, 20, 15, 8, 25],
    'Width':  [5, 10, 7, 4, 12],
    'Height': [3, 8, 6, 2, 10],
    'Price':  [150, 1600, 630, 64, 3000]
})

# Interaction features (domain knowledge: volume = L × W × H)
df['Volume']       = df['Length'] * df['Width'] * df['Height']
df['LengthToWidth'] = df['Length'] / df['Width']   # Aspect ratio
df['SurfaceArea']  = 2 * (df['Length']*df['Width'] + df['Width']*df['Height'] + df['Length']*df['Height'])

# Polynomial features (automated interaction and power terms)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['Length', 'Width']])
poly_names = poly.get_feature_names_out(['Length', 'Width'])
print("Polynomial features:", poly_names)
# ['Length', 'Width', 'Length^2', 'Length Width', 'Width^2']

# Date/time feature extraction
dates = pd.DataFrame({'Date': pd.date_range('2024-01-01', periods=100)})
dates['DayOfWeek']    = dates['Date'].dt.dayofweek      # 0=Monday, 6=Sunday
dates['Month']        = dates['Date'].dt.month
dates['Quarter']      = dates['Date'].dt.quarter
dates['IsWeekend']    = dates['Date'].dt.dayofweek >= 5
dates['DaysSinceEpoch'] = (dates['Date'] - pd.Timestamp('2000-01-01')).dt.days

# Text feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
texts = ['machine learning is great', 'deep learning advances fast', 'ML and AI']
tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
X_text = tfidf.fit_transform(texts)

Practice questions

  1. A dataset has a "Country" column with 150 unique values. Why is one-hot encoding problematic here? (Answer: OHE creates 150 new columns — high dimensionality, sparse matrix, may overfit with small datasets. Better alternatives: target encoding (replace with mean target value per country), frequency encoding, or learned embeddings.)
  2. When should you use ordinal encoding vs label encoding for a categorical variable? (Answer: Ordinal encoding: when the categories have a meaningful order (Small < Medium < Large). Label encoding: for tree-based models where the arbitrary numeric order is irrelevant. NEVER use label encoding for nominal categories (colour, brand) with linear/distance-based models.)
  3. Why must feature scaling be fitted on training data only? (Answer: Fitting on the full dataset includes test set statistics — information leakage. The model effectively "sees" test data during training. Fit scaler on X_train only, then apply the same transformation to X_test using the training statistics.)
  4. What is the dummy variable trap and how does drop='first' in OHE solve it? (Answer: With k categories, OHE creates k columns that sum to 1 — perfect multicollinearity. The model cannot uniquely determine all coefficients. Dropping one column removes the redundancy: the dropped category's effect is captured in the intercept.)
  5. Name three methods for feature selection and classify them as filter, wrapper, or embedded. (Answer: Filter: mutual information, F-statistic, correlation coefficient — fast, model-independent. Wrapper: RFE, forward/backward selection — model-dependent, computationally expensive. Embedded: Lasso (L1 zeroes irrelevant features), Random Forest importance — simultaneous training and selection.)

On LumiChats

LumiChats can audit your feature engineering pipeline, suggest encoding strategies for high-cardinality categoricals, detect data leakage risks, and generate complete preprocessing code. Describe your dataset columns and LumiChats designs the full pipeline.
