
Data Transformation — Normalisation, Discretisation, Sampling & Compression

Preparing raw data for analysis — cleaning, reshaping, and compressing at scale.


Definition

Data transformation converts raw, messy source data into clean, analysis-ready datasets. The four core transformations are: Normalisation (scaling numeric values to a standard range), Discretisation (converting continuous values into categorical bins), Sampling (selecting representative subsets of large datasets), and Compression (reducing data size without losing essential information). All four belong to the ETL process and are used in both data warehousing and machine-learning preprocessing.

Real-life analogy: The food preparation kitchen

Raw ingredients from farms (source systems) arrive dirty, in different sizes, in bulk. Before cooking (analysis): wash and peel (cleaning), cut to uniform size (normalisation), group into categories (discretisation), taste-test a sample (sampling), vacuum-seal to reduce volume (compression). Data transformation is the preparation kitchen that makes raw ingredients usable.

1. Data Normalisation (feature scaling)

Three scaling methods. Min-Max: scales to [0, 1], but sensitive to outliers. Z-score: zero mean and unit variance; less distorted by outliers than Min-Max, though still affected by them. Robust: uses the median and IQR, best when many outliers exist.

Normalisation methods with sklearn

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

data = pd.DataFrame({
    'Age':    [22, 35, 45, 55, 80, 28, 42],
    'Salary': [25000, 80000, 150000, 250000, 500000, 45000, 120000],
    'Score':  [65, 72, 88, 91, 95, 70, 85]
})

# Min-Max: scales to [0, 1]
mm  = MinMaxScaler()
std = StandardScaler()   # Z-score: mean=0, std=1
rob = RobustScaler()     # Median + IQR: robust to outliers

X_mm  = pd.DataFrame(mm.fit_transform(data),  columns=data.columns)
X_std = pd.DataFrame(std.fit_transform(data), columns=data.columns)
X_rob = pd.DataFrame(rob.fit_transform(data), columns=data.columns)

print("Min-Max range:", X_mm.min().round(2).to_dict(), "to", X_mm.max().round(2).to_dict())
print("Z-score mean:", X_std.mean().round(2).to_dict())
print("Z-score std: ", X_std.std(ddof=0).round(2).to_dict())  # ddof=0: sklearn divides by the population std

# IMPORTANT: Fit on TRAINING data only, transform both train and test
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(data, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled  = scaler.transform(X_test)        # ONLY transform (no refit!)

DBMS normalisation vs data normalisation — different concepts!

DBMS normalisation = structuring tables to eliminate redundancy (1NF/2NF/3NF/BCNF). Data transformation normalisation = scaling numerical feature values to a standard range for machine learning. Same word, completely different meanings. GATE tests both in different contexts.
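The scaling formulas behind those sklearn classes are simple enough to compute by hand. A minimal numpy sketch, using the Salary column from the example above:

```python
import numpy as np

# Salary values from the example DataFrame above
x = np.array([25000, 80000, 150000, 250000, 500000, 45000, 120000], dtype=float)

# Min-Max: x' = (x - min) / (max - min)  -> every value lands in [0, 1]
x_mm = (x - x.min()) / (x.max() - x.min())

# Z-score: z = (x - mean) / std  -> mean 0, std 1 (population std, ddof=0, as sklearn uses)
x_z = (x - x.mean()) / x.std()

# Robust: r = (x - median) / IQR  -> the 500k outlier no longer dominates the scale
q1, q3 = np.percentile(x, [25, 75])
x_rob = (x - np.median(x)) / (q3 - q1)

print("Min-Max:", x_mm.round(3))
print("Z-score mean/std:", x_z.mean().round(6), x_z.std().round(6))
```

On a single column these reproduce MinMaxScaler, StandardScaler, and RobustScaler (sklearn's RobustScaler also centres on the median and scales by the 25th-75th percentile IQR by default).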

2. Discretisation — continuous to categorical

Equal-width, equal-frequency, and custom binning

import pandas as pd
import numpy as np

ages = pd.Series([18, 22, 25, 31, 35, 42, 47, 55, 62, 68, 72, 85])

# METHOD 1: Equal-width (uniform interval size)
bins_ew = pd.cut(ages, bins=4, labels=['Young','Adult','Middle','Senior'])
print("Equal-width:", pd.cut(ages, bins=4).value_counts().sort_index())
# (17.933, 34.75], (34.75, 51.5], (51.5, 68.25], (68.25, 85] — each ~16.75 years wide

# METHOD 2: Equal-frequency (same count per bin — quantile-based)
bins_ef = pd.qcut(ages, q=4, labels=['Q1','Q2','Q3','Q4'])
print("Equal-freq:", pd.qcut(ages, q=4).value_counts().sort_index())
# Each bin has ~3 values regardless of range

# METHOD 3: Custom domain-based bins (most meaningful)
bins_custom = pd.cut(ages,
    bins=[0, 19, 29, 39, 59, 100],
    labels=['Teen','Young Adult','Adult','Middle Aged','Senior'])
print("Custom:", bins_custom.value_counts())

# Income discretisation for data warehouse reporting
salary = pd.Series([15000, 35000, 75000, 120000, 200000, 500000])
salary_cat = pd.cut(salary,
    bins=[0, 25000, 50000, 100000, 250000, float('inf')],
    labels=['Low','Lower-Middle','Middle','Upper-Middle','High'])
print("Salary categories:", salary_cat.tolist())

| Method | Bin sizes | When to use | Problem |
|---|---|---|---|
| Equal-width | Same range per bin | Uniform distribution data | Skewed data: most values land in one bin |
| Equal-frequency | Same count per bin | Skewed data; ensures coverage | May group very different values together |
| Custom/Domain | Business-defined boundaries | When business rules define categories | Requires domain knowledge |
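The difference between the first two rows of the table is just how the bin edges are chosen. A small numpy sketch, reusing the ages from the example above:

```python
import numpy as np

ages = np.array([18, 22, 25, 31, 35, 42, 47, 55, 62, 68, 72, 85])

# Equal-width: 5 evenly spaced edges over the data's range -> 4 bins of equal width
ew_edges = np.linspace(ages.min(), ages.max(), num=5)   # [18., 34.75, 51.5, 68.25, 85.]

# Equal-frequency: edges at the quartiles -> each bin holds ~25% of the values
ef_edges = np.quantile(ages, [0, 0.25, 0.5, 0.75, 1.0])

counts_ew, _ = np.histogram(ages, bins=ew_edges)
counts_ef, _ = np.histogram(ages, bins=ef_edges)
print("equal-width counts: ", counts_ew)   # uneven when the data are skewed
print("equal-freq counts:  ", counts_ef)   # 3 per bin, whatever the spread
```

This is what `pd.cut` and `pd.qcut` do under the hood: `cut` works from `linspace`-style edges over the range, `qcut` from quantiles of the data.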

3. Sampling — working with representative subsets

Sampling strategies for large datasets

import pandas as pd
import numpy as np

np.random.seed(42)
n = 100_000
df = pd.DataFrame({
    'CustomerID': range(n),
    'Age':        np.random.randint(18, 80, n),
    'Region':     np.random.choice(['North','South','East','West'], n, p=[0.4,0.3,0.2,0.1]),
    'Purchased':  np.random.choice([0, 1], n, p=[0.8, 0.2])  # 80/20 imbalance
})

# 1. Simple Random Sampling (SRS)
srs = df.sample(n=5000, random_state=42)
print(f"SRS: {len(srs)} rows, Region: {srs.Region.value_counts(normalize=True).round(2).to_dict()}")

# 2. Stratified Sampling (preserve group proportions, here by Region)
stratified = df.groupby('Region', group_keys=False).sample(frac=0.05, random_state=42)
print(f"Stratified: {len(stratified)} rows — region distribution preserved")

# 3. Systematic Sampling — every k-th record
k = 20
systematic = df.iloc[::k]
print(f"Systematic (every {k}th): {len(systematic)} rows")

# 4. Over/Under-sampling for imbalanced datasets (fraud, churn)
from sklearn.utils import resample
minority = df[df.Purchased == 1]   # ~20,000 rows (≈20%)
majority = df[df.Purchased == 0]   # ~80,000 rows (≈80%)

# Oversample minority to balance classes
minority_up = resample(minority, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(f"Balanced dataset: {balanced.Purchased.value_counts().to_dict()}")

# 5. Reservoir sampling for streaming data (unknown total size)
def reservoir_sample(stream, k):
    """Uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)           # fill the reservoir first
        else:
            j = np.random.randint(0, i + 1)  # current item kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(iter(range(1_000_000)), k=100)
print(f"Reservoir: {len(sample)} items sampled from a 1M-item stream in one pass")

4. Data Compression

| Technique | How it works | Best for |
|---|---|---|
| Run-Length Encoding (RLE) | Replace runs: AAABBB → 3A3B | Sorted, repeated values (e.g. a Region column sorted by region) |
| Dictionary Encoding | Map frequent values to short codes: {North:0, South:1} | Low-cardinality string columns (Status, Region, Category) |
| Delta Encoding | Store differences: [100,102,105] → [100,+2,+3] | Sequential numeric data (timestamps, IDs) |
| Parquet/ORC (columnar) | Store columns together + per-column compression | Data lakes, BigQuery, Redshift, Spark — often 10-50x smaller |
| Gzip/Snappy/Zstd | General-purpose byte-stream compression | File storage, network transfer, S3 objects |
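The first three techniques in the table are simple enough to sketch as toy encoders (illustrative only, not production codecs):

```python
from itertools import groupby

# Run-Length Encoding: collapse consecutive repeats into (value, count) pairs
def rle_encode(values):
    return [(v, len(list(run))) for v, run in groupby(values)]

print(rle_encode(list("AAABBBCCCC")))  # [('A', 3), ('B', 3), ('C', 4)]

# Dictionary encoding: map each distinct value to a small integer code
def dict_encode(values):
    codes = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return codes, [codes[v] for v in values]

codes, encoded = dict_encode(['North', 'South', 'North', 'East'])
print(codes, encoded)  # {'North': 0, 'South': 1, 'East': 2} [0, 1, 0, 2]

# Delta encoding: store the first value, then successive differences
def delta_encode(nums):
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]

print(delta_encode([100, 102, 105, 106]))  # [100, 2, 3, 1]
```

Note how RLE only pays off when equal values are adjacent (hence "sorted" in the table), while dictionary and delta encoding work on unsorted data.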

Columnar storage compression comparison

import pandas as pd, io

# 100k rows of realistic data
df = pd.DataFrame({
    'Region':    (['North','South','East','West'] * 25000),
    'Status':    (['Active','Inactive','Pending'] * 33333 + ['Active']),
    'Amount':    pd.Series(range(100000)) * 0.99,
    'OrderDate': pd.date_range('2020-01-01', periods=100000, freq='1h'),
})

# CSV (row-based) — StringIO, since to_csv writes text
csv_buf = io.StringIO(); df.to_csv(csv_buf); csv_size = csv_buf.tell()

# Parquet + Snappy (columnar + light compression)
p_snap = io.BytesIO(); df.to_parquet(p_snap, compression='snappy'); snap_size = p_snap.tell()

# Parquet + Gzip (columnar + heavy compression)
p_gz   = io.BytesIO(); df.to_parquet(p_gz,   compression='gzip');   gz_size = p_gz.tell()

print(f"CSV (row store):      {csv_size:>10,} bytes (baseline)")
print(f"Parquet + Snappy:     {snap_size:>10,} bytes ({csv_size/snap_size:.1f}x smaller, fast)")
print(f"Parquet + Gzip:       {gz_size:>10,} bytes ({csv_size/gz_size:.1f}x smaller, slower)")

# Dictionary encoding: Region as 'category' dtype shrinks memory dramatically
str_mem = df['Region'].memory_usage(deep=True)
cat_mem = df['Region'].astype('category').memory_usage(deep=True)
print(f"\nRegion as string:   {str_mem:,} bytes")
print(f"Region as category: {cat_mem:,} bytes ({str_mem//cat_mem}x smaller)")

Practice questions

  1. When to use Min-Max vs Z-score normalisation? (Answer: Min-Max: need fixed [0,1] range, no significant outliers (neural networks, image pixels). Z-score: data has outliers or algorithm assumes Gaussian distribution (linear regression, PCA, LDA). Robust scaler: many extreme outliers.)
  2. Equal-width vs equal-frequency discretisation — when does each fail? (Answer: Equal-width fails for skewed data — most values fall in one bin. Equal-frequency ensures uniform bin population but may group very different values if data has gaps.)
  3. Why is stratified sampling preferred for imbalanced datasets? (Answer: If 1% of data is fraud, SRS of 1000 records may have 0-20 fraud cases by chance. Stratified guarantees approximately 10 fraud cases (1%), preserving class distribution for reliable model training.)
  4. Run-length encoding gives best compression when: (Answer: Sorted, low-cardinality columns with many consecutive repeated values. Example: Region column sorted as AAAA...BBBB compresses dramatically. Unsorted or high-cardinality columns may expand.)
  5. What is the key advantage of columnar storage (Parquet/ORC) over row-based storage (CSV) for analytics? (Answer: Analytics queries read only needed columns — columnar stores skip irrelevant columns entirely. Also enables highly effective per-column compression (all values in a column have same type/range). Row stores must read full rows even if only 2 of 50 columns are needed.)
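Question 3's claim can be checked empirically. A small sketch on synthetic data (a hypothetical 1% fraud rate), comparing repeated simple random draws against sklearn's `stratify` option:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 100k labels, ~1% positive (fraud)
rng = np.random.default_rng(0)
y = pd.Series(rng.choice([0, 1], size=100_000, p=[0.99, 0.01]), name='fraud')

# SRS: draw 1000 records 20 times and count how many fraud cases land in each draw
srs_counts = [int(y.sample(n=1000, random_state=s).sum()) for s in range(20)]

# Stratified: the 1000-record split is forced to preserve the 1% class ratio
_, strat = train_test_split(y, test_size=1000, stratify=y, random_state=0)

print("SRS fraud counts over 20 draws:", min(srs_counts), "to", max(srs_counts))
print("Stratified fraud count:        ", int(strat.sum()))  # ≈10, matching the 1% base rate
```

The SRS counts bounce around by chance, while the stratified split pins the fraud count to the base rate every time.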

On LumiChats

LumiChats can write complete ETL pipeline code in Python (pandas, sklearn, PySpark) including all transformation types. Describe your data: 'CSV with age, skewed income, region, and 95% non-fraud labels' — LumiChats writes the full preprocessing pipeline.

