Chapter 6: Basic Ideas of Machine Learning
Algorithms That Learn from Data to Make Predictions and Decisions
1. What is Machine Learning?
Learning Without Explicit Programming
Machine Learning (ML) is a subset of Artificial Intelligence that enables computers to learn from data and improve their performance on a task without being explicitly programmed. Instead of hand-coding every rule, ML algorithms discover patterns automatically from example input-output pairs.
Traditional Programming vs Machine Learning
In traditional programming, a developer hand-writes the rules: rules plus data go in, answers come out. Machine learning inverts this process: data plus answers go in, and the algorithm infers the rules.
Why Machine Learning for Big Data?
- Pattern Discovery: Find hidden patterns in massive, complex datasets that humans cannot manually inspect.
- Prediction at Scale: Forecast future trends, customer behaviors, and fraud across millions of records per second.
- Automation: Automate decision-making pipelines without per-rule engineering at scale.
- Personalization: Deliver unique, customized experiences dynamically for millions of users simultaneously.
- Anomaly Detection: Identify fraud, quality errors, or unusual events in real-time sensor streams.
2. The Three Branches of Machine Learning
How Algorithms Learn
Machine learning algorithms are categorized based on the type of feedback signal they receive during training. Each branch unlocks a distinct class of problems.
| Branch | Data Type | Goal | Classic Examples |
|---|---|---|---|
| Supervised | Labeled (X, y pairs) | Predict output for new inputs | Spam detection, house price prediction |
| Unsupervised | Unlabeled (X only) | Discover hidden structure | Customer segmentation, anomaly detection |
| Reinforcement | Environment + reward signal | Maximize cumulative reward | AlphaGo, autonomous driving, robotics |
Reinforcement Learning Key Concepts
- Agent: The learner / decision-maker operating inside an environment.
- State: The current observable snapshot of the environment the agent receives.
- Action: A move the agent can take; the action space is the set of moves available in a given state.
- Reward: The scalar feedback signal returned by the environment after each action.
- Policy: The learned mapping from states to optimal actions.
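The loop these pieces form can be sketched with tabular Q-learning on a toy corridor environment. Everything below (the 5-state corridor, rewards, and hyperparameters) is an illustrative assumption, not taken from any real system:

```python
import random

# Tabular Q-learning on a toy 1-D corridor: states 0..4, reward for reaching state 4.
N_STATES, ACTIONS = 5, [-1, +1]           # actions: move left or right
alpha, gamma, epsilon = 0.5, 0.9, 0.1     # learning rate, discount, exploration rate
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(42)
for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy policy: mostly exploit the best known action, sometimes explore
        a = (random.choice(ACTIONS) if random.random() < epsilon
             else max(ACTIONS, key=lambda act: Q[(s, act)]))
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Q-learning update: move Q(s,a) toward reward + discounted best next value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The learned policy maps each state to its best action (here: always move right)
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

The agent is the loop, the state is `s`, the reward is `r`, and the final dictionary is the learned policy.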
3. Supervised Learning Algorithms
Learning from Labeled Examples
Supervised learning is the most widely deployed ML paradigm in industry. The algorithm receives input–output pairs and derives a generalizable model. Two fundamental sub-tasks exist: Classification (predicting categories) and Regression (predicting continuous values).
1. Linear Regression — Predicting Continuous Values
Models the relationship between inputs and a continuous numeric target as a weighted sum: ŷ = w₁x₁ + w₂x₂ + ... + b. Training minimizes the Mean Squared Error (MSE) across all samples.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load housing dataset
df = pd.read_csv("housing.csv")
X = df[["square_feet", "bedrooms", "age"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print(f"R² Score: {model.score(X_test, y_test):.3f}")

# Predict a new house: 2000 sqft, 3 bed, 10 years old
price = model.predict([[2000, 3, 10]])
print(f"Predicted Price: ${price[0]:,.2f}")
```
2. Logistic Regression — Binary Classification
Despite the name, Logistic Regression solves classification by squashing a linear combination through the sigmoid function to produce a probability. The output is a probability in [0,1], thresholded at 0.5 by default.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

emails = ["Win money now!", "Meeting tomorrow", "Free prize!!!", "Project update"]
labels = [1, 0, 1, 0]  # 1=spam, 0=not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

clf = LogisticRegression()
clf.fit(X, labels)

new_email = vectorizer.transform(["Claim your free gift now!"])
prob = clf.predict_proba(new_email)
print(f"Spam confidence: {prob[0][1]:.2%}")
```
3. Decision Trees & Random Forests
A Decision Tree partitions the feature space using a hierarchy of binary questions (splits). Each internal node tests one feature; leaves assign class labels or regression values. The tree structure naturally explains its own decisions — making it highly interpretable.
Random Forests overcome the brittleness of single trees by building an ensemble of hundreds of trees, each trained on a random bootstrap sample and random feature subset. Predictions are averaged (regression) or voted (classification), drastically reducing variance and improving generalization.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(
    n_estimators=100,   # 100 trees
    max_depth=10,       # limit depth to prevent overfit
    n_jobs=-1,          # use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

# Inspect which features drove decisions most
for feat, imp in zip(X_train.columns, rf.feature_importances_):
    print(f"{feat}: {imp:.3f}")
```
SVM Kernel Trick: Support Vector Machines find the widest possible margin hyperplane separating classes. When data is not linearly separable, the Kernel Trick implicitly maps data to a higher-dimensional space (using RBF, polynomial, or sigmoid kernels) where a linear separator exists — without explicitly computing the transformation.
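A quick sketch of the kernel trick in action, using scikit-learn's SVC on a synthetic concentric-circles dataset (the dataset and parameter choices here are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line in 2-D can separate the two classes
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=42)

# A linear SVM struggles, while the RBF kernel separates the classes
# by implicitly operating in a higher-dimensional feature space
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

print(f"Linear kernel accuracy: {linear_svm.score(X, y):.2f}")
print(f"RBF kernel accuracy:    {rbf_svm.score(X, y):.2f}")
```

The RBF model fits the circular boundary almost perfectly without ever computing the high-dimensional mapping explicitly.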
4. Unsupervised Learning
Discovering Hidden Structure Without Labels
Unsupervised learning receives only raw input data with no target labels. The algorithm must independently discover the underlying structure — groupings, compressed representations, or co-occurrence rules hidden within the data.
1. K-Means Clustering
Partitions N data points into exactly K non-overlapping clusters by iteratively assigning each point to its nearest centroid, then recomputing centroids. Converges when assignments no longer change. The Elbow Method helps choose the optimal K by plotting inertia (within-cluster sum-of-squares) against K values.
```python
from sklearn.cluster import KMeans

X = customer_df[["frequency", "avg_spend", "recency"]]

# Elbow method to find optimal K
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto")
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Apply with chosen K=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init="auto")
customer_df["segment"] = kmeans.fit_predict(X)

for seg in range(4):
    grp = customer_df[customer_df["segment"] == seg]
    print(f"Segment {seg}: {len(grp)} customers")
    print(grp[["frequency", "avg_spend", "recency"]].mean())
```
2. Principal Component Analysis (PCA)
PCA finds the axes of maximum variance (principal components) in the data and projects it onto a lower-dimensional subspace. It's used for visualization, noise reduction, and alleviating the curse of dimensionality before applying other ML algorithms.
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to scale — always standardize first!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total: {sum(pca.explained_variance_ratio_):.2%}")
```
3. Association Rules — Apriori Algorithm
Discovers co-occurrence rules of the form "if {bread, butter} then {jam}" in transactional data. Three core metrics guide rule quality:
- Support: How frequently the itemset appears across all transactions.
- Confidence: How often the rule fires correctly (conditional probability).
- Lift > 1: Indicates a genuine positive association, not random co-occurrence.
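These three metrics can be computed directly on a toy basket of transactions; the items and the rule below are made up for illustration:

```python
# Toy transactional data (each set is one customer's basket)
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "bread", "butter", "jam"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {bread, butter} -> {jam}
antecedent, consequent = {"bread", "butter"}, {"jam"}
supp = support(antecedent | consequent)   # how often the full itemset occurs
conf = supp / support(antecedent)         # P(jam | bread, butter)
lift = conf / support(consequent)         # confidence vs. jam's base rate

print(f"support={supp:.2f}  confidence={conf:.2f}  lift={lift:.2f}")
```

Here the lift comes out above 1, so buying bread and butter genuinely raises the chance of buying jam rather than merely co-occurring by chance.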
5. The End-to-End ML Workflow
From Raw Data to Production Model
Successful ML projects follow a repeatable, structured pipeline. Skipping or rushing any phase typically leads to poor generalization or silent production failures.
Data Preprocessing Essentials
| Technique | Purpose | Common Implementation |
|---|---|---|
| Missing Value Imputation | Prevent errors and bias from null entries | Mean/median fill, KNN imputer, indicator flag column |
| Categorical Encoding | Convert text to numeric features | One-hot encoding, label encoding, target encoding |
| Feature Scaling | Equalize feature contribution magnitude | StandardScaler (z-score), MinMaxScaler, RobustScaler |
| Train/Test Split | Honest evaluation on unseen data | 80/20 or 70/30 split; stratified for class balance |
| Cross-Validation | Robust performance estimation | K-Fold (K=5 or 10), StratifiedKFold, TimeSeriesSplit |
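The techniques in the table are typically chained so that the same fitted transformations apply identically at train and predict time. A minimal sketch with scikit-learn's Pipeline and ColumnTransformer, on a small hypothetical dataset (the column names and values are invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical dataset with missing numeric values and a categorical column
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29, 55, None, 38],
    "income": [40e3, 60e3, 52e3, 80e3, 45e3, 120e3, 70e3, 65e3],
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro", "pro", "basic"],
    "churned": [0, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns="churned"), df["churned"]

# Impute and scale numeric columns, one-hot encode the categorical column,
# all inside the pipeline so nothing leaks from test data into training
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2f}")
```

Because imputation and scaling statistics are fit only on the training split, this structure also guards against the data leakage pitfall discussed later in the chapter.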
6. Machine Learning at Big Data Scale
When scikit-learn Doesn't Fit in RAM
Traditional single-machine ML libraries (scikit-learn, XGBoost) hit hard limits when datasets exceed available RAM — typically 100GB+. Distributed frameworks partition both data and computation across dozens or hundreds of nodes.
The Scale Wall: Training a model on billions of rows on a single 64GB laptop is simply impossible. Distributed frameworks like Spark MLlib shard the dataset across a cluster, each node training on its local shard and contributing to a shared global gradient or model update.
Apache Spark MLlib — Distributed ML Pipeline
```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataML").getOrCreate()

# Load from HDFS — could be terabytes across hundreds of nodes
data = spark.read.parquet("hdfs:///data/churn_dataset.parquet")

# Build a preprocessing pipeline
indexer = StringIndexer(inputCol="plan_type", outputCol="planIdx")
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_charges", "planIdx"],
    outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(data)

# Predictions executed in parallel across the cluster
predictions = model.transform(spark.read.parquet("hdfs:///data/churn_test.parquet"))
predictions.select("prediction", "probability", "churned").show(10)
```
Frameworks at a Glance
| Framework | Strength | Best For |
|---|---|---|
| Spark MLlib | Tight Hadoop ecosystem integration, pipelines | Classical ML (regression, trees, clustering) at petabyte scale |
| TensorFlow/PyTorch Distributed | GPU acceleration, deep learning primitives | Neural networks, computer vision, NLP |
| H2O.ai AutoML | Automatic model selection and hyperparameter tuning | Rapid prototyping, non-ML teams needing fast results |
Big Data ML Best Practices: (1) Always develop on a sampled subset first. (2) Use a Feature Store to centralize feature computation and prevent train/serve skew. (3) Monitor model performance continuously — data distributions shift slowly over time (concept drift), silently degrading accuracy.
7. Model Evaluation & Selection
Measuring What Matters
Choosing the wrong evaluation metric is one of the most dangerous mistakes in ML. A model with 99% accuracy on a fraud dataset where 99% of transactions are legitimate is completely useless — it simply predicts "not fraud" every time.
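A quick numeric sketch of this accuracy trap, with made-up counts matching the 99%-legitimate scenario:

```python
# 10,000 transactions, 1% fraud; a "model" that always predicts "not fraud"
y_true = [1] * 100 + [0] * 9_900   # 1 = fraud
y_pred = [0] * 10_000              # predicts legitimate every time

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(f"Accuracy: {accuracy:.1%}")  # 99.0%: looks impressive
print(f"Recall:   {recall:.1%}")    # 0.0%: catches no fraud at all
```

Accuracy rewards the majority-class guess; recall exposes that the model never identifies a single fraudulent transaction.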
Classification Metrics
| Metric | Formula | When to Prioritize |
|---|---|---|
| Accuracy | (TP+TN) / Total | Balanced classes only |
| Precision | TP / (TP+FP) | False positives are expensive (spam filter) |
| Recall | TP / (TP+FN) | False negatives are dangerous (disease screening) |
| F1-Score | 2 × (P × R) / (P + R) | Imbalanced classes — harmonic mean of P and R |
| ROC-AUC | Area under ROC curve | Comparing models across decision thresholds |
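The formulas in the table can be applied to a hypothetical set of confusion-matrix counts (the numbers below are invented for illustration):

```python
# Hypothetical confusion-matrix counts for a binary classifier
TP, FP, FN, TN = 80, 20, 40, 860

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)                          # of flagged, how many correct
recall    = TP / (TP + FN)                          # of actual positives, how many caught
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

Note how accuracy (0.940) looks far better than recall (0.667): the large TN count dominates it, which is exactly why imbalanced problems need precision, recall, and F1.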
Regression Metrics
| Metric | Formula | Key Property |
|---|---|---|
| MAE | mean(|y − ŷ|) | Robust to outliers; same units as target |
| RMSE | √mean((y − ŷ)²) | Penalizes large errors heavily; outlier-sensitive |
| R² Score | 1 − SS_res/SS_tot | 1 = perfect fit; 0 = baseline mean model |
| MAPE | mean(|y−ŷ|/y) × 100% | Percentage error — intuitive but undefined at y=0 |
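The same exercise for regression, computing each metric by hand on a small made-up sample of true and predicted values:

```python
# Hypothetical true vs. predicted values
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 230.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae  = sum(abs(e) for e in errors) / n              # mean absolute error
rmse = (sum(e**2 for e in errors) / n) ** 0.5       # root mean squared error

mean_y = sum(y_true) / n
ss_res = sum(e**2 for e in errors)                  # residual sum of squares
ss_tot = sum((t - mean_y)**2 for t in y_true)       # total sum of squares
r2 = 1 - ss_res / ss_tot

mape = sum(abs(e) / t for e, t in zip(errors, y_true)) / n * 100

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R²={r2:.3f}  MAPE={mape:.1f}%")
```

Notice that RMSE (≈13.2) exceeds MAE (12.5) because the single 20-unit error is squared before averaging, illustrating RMSE's outlier sensitivity from the table.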
Common Pitfalls to Avoid: Data Leakage — accidentally including future or test-time information during training. Overfitting — model memorizes noise instead of generalizing. Concept Drift — real-world distributions shift over months; models silently degrade without active monitoring and scheduled retraining.
Model Comparison with Cross-Validation
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf", probability=True, random_state=42)
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: F1 = {scores.mean():.3f} ± {scores.std():.3f}")

best = max(results, key=results.get)
print(f"Winner: {best}")
```