Chapter 6: Basic Ideas of Machine Learning
Algorithms That Learn from Data to Make Predictions and Decisions
1. What is Machine Learning?
Learning Without Explicit Programming
Machine Learning (ML) is a subset of Artificial Intelligence that enables computers to learn from data and improve their performance on a task without being explicitly programmed. Instead of hand-coding every rule, ML algorithms discover patterns automatically from example input-output pairs.
Traditional Programming vs Machine Learning
In traditional programming, a developer hand-writes the rules: rules plus data go in, answers come out. Machine learning inverts this process: data plus answers go in, and the algorithm infers the rules.
Why Machine Learning for Big Data?
- Pattern Discovery: Find hidden patterns in massive, complex datasets that humans cannot manually inspect.
- Prediction at Scale: Forecast future trends, customer behaviors, and fraud across millions of records per second.
- Automation: Automate decision-making pipelines without per-rule engineering at scale.
- Personalization: Deliver unique, customized experiences dynamically for millions of users simultaneously.
- Anomaly Detection: Identify fraud, quality errors, or unusual events in real-time sensor streams.
2. The Three Branches of Machine Learning
How Algorithms Learn
Machine learning algorithms are categorized based on the type of feedback signal they receive during training. Each branch unlocks a distinct class of problems.
| Branch | Data Type | Goal | Classic Examples |
|---|---|---|---|
| Supervised | Labeled (X, y pairs) | Predict output for new inputs | Spam detection, house price prediction |
| Unsupervised | Unlabeled (X only) | Discover hidden structure | Customer segmentation, anomaly detection |
| Reinforcement | Environment + reward signal | Maximize cumulative reward | AlphaGo, autonomous driving, robotics |
Reinforcement Learning Key Concepts
- Agent: The learner / decision-maker operating inside an environment.
- State: The current observable snapshot of the environment the agent receives.
- Action: A move the agent can take; the action space is the set of moves available in a given state.
- Reward: The scalar feedback signal returned by the environment after each action.
- Policy: The learned mapping from states to optimal actions.
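The loop these pieces form can be sketched with tabular Q-learning on a toy corridor environment. Everything below (the 5-state corridor, rewards, and hyperparameters) is an illustrative assumption, not taken from any real system:

```python
import random

# Tabular Q-learning on a toy 1-D corridor: states 0..4, reward for reaching state 4.
N_STATES, ACTIONS = 5, [-1, +1]           # actions: move left or right
alpha, gamma, epsilon = 0.5, 0.9, 0.1     # learning rate, discount, exploration rate
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(42)
for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy policy: mostly exploit the best known action, sometimes explore
        a = (random.choice(ACTIONS) if random.random() < epsilon
             else max(ACTIONS, key=lambda act: Q[(s, act)]))
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Q-learning update: move Q(s,a) toward reward + discounted best next value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The learned policy maps each state to its best action (here: always move right)
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

The agent is the loop, the state is `s`, the reward is `r`, and the final dictionary is the learned policy.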
3. Supervised Learning Algorithms
Learning from Labeled Examples
Supervised learning is the most widely deployed ML paradigm in industry. The algorithm receives input–output pairs and derives a generalizable model. Two fundamental sub-tasks exist: Classification (predicting categories) and Regression (predicting continuous values).
1. Linear Regression — Predicting Continuous Values
Models the relationship between inputs and a continuous numeric target as a weighted sum: ŷ = w₁x₁ + w₂x₂ + ... + b. Training minimizes the Mean Squared Error (MSE) across all samples.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load housing dataset
df = pd.read_csv("housing.csv")
X = df[["square_feet", "bedrooms", "age"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print(f"R² Score: {model.score(X_test, y_test):.3f}")

# Predict a new house: 2000 sqft, 3 bed, 10 years old
price = model.predict([[2000, 3, 10]])
print(f"Predicted Price: ${price[0]:,.2f}")
```
2. Logistic Regression — Binary Classification
Despite the name, Logistic Regression solves classification by squashing a linear combination through the sigmoid function to produce a probability. The output is a probability in [0,1], thresholded at 0.5 by default.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

emails = ["Win money now!", "Meeting tomorrow", "Free prize!!!", "Project update"]
labels = [1, 0, 1, 0]  # 1=spam, 0=not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

clf = LogisticRegression()
clf.fit(X, labels)

new_email = vectorizer.transform(["Claim your free gift now!"])
prob = clf.predict_proba(new_email)
print(f"Spam confidence: {prob[0][1]:.2%}")
```
3. Decision Trees & Random Forests
A Decision Tree partitions the feature space using a hierarchy of binary questions (splits). Each internal node tests one feature; leaves assign class labels or regression values. The tree structure naturally explains its own decisions — making it highly interpretable.
Random Forests overcome the brittleness of single trees by building an ensemble of hundreds of trees, each trained on a random bootstrap sample and random feature subset. Predictions are averaged (regression) or voted (classification), drastically reducing variance and improving generalization.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(
    n_estimators=100,   # 100 trees
    max_depth=10,       # limit depth to prevent overfit
    n_jobs=-1,          # use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

# Inspect which features drove decisions most
for feat, imp in zip(X_train.columns, rf.feature_importances_):
    print(f"{feat}: {imp:.3f}")
```
SVM Kernel Trick: Support Vector Machines find the widest possible margin hyperplane separating classes. When data is not linearly separable, the Kernel Trick implicitly maps data to a higher-dimensional space (using RBF, polynomial, or sigmoid kernels) where a linear separator exists — without explicitly computing the transformation.
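A quick sketch of the kernel trick in action, using scikit-learn's SVC on a synthetic concentric-circles dataset (the dataset and parameter choices here are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line in 2-D can separate the two classes
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=42)

# A linear SVM struggles, while the RBF kernel separates the classes
# by implicitly operating in a higher-dimensional feature space
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

print(f"Linear kernel accuracy: {linear_svm.score(X, y):.2f}")
print(f"RBF kernel accuracy:    {rbf_svm.score(X, y):.2f}")
```

The RBF model fits the circular boundary almost perfectly without ever computing the high-dimensional mapping explicitly.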
4. Unsupervised Learning
Discovering Hidden Structure Without Labels
Unsupervised learning receives only raw input data with no target labels. The algorithm must independently discover the underlying structure — groupings, compressed representations, or co-occurrence rules hidden within the data.
1. K-Means Clustering
Partitions N data points into exactly K non-overlapping clusters by iteratively assigning each point to its nearest centroid, then recomputing centroids. Converges when assignments no longer change. The Elbow Method helps choose the optimal K by plotting inertia (within-cluster sum-of-squares) against K values.
```python
from sklearn.cluster import KMeans

X = customer_df[["frequency", "avg_spend", "recency"]]

# Elbow method to find optimal K
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto")
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Apply with chosen K=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init="auto")
customer_df["segment"] = kmeans.fit_predict(X)

for seg in range(4):
    grp = customer_df[customer_df["segment"] == seg]
    print(f"Segment {seg}: {len(grp)} customers")
    print(grp[["frequency", "avg_spend", "recency"]].mean())
```
2. Principal Component Analysis (PCA)
PCA finds the axes of maximum variance (principal components) in the data and projects it onto a lower-dimensional subspace. It's used for visualization, noise reduction, and alleviating the curse of dimensionality before applying other ML algorithms.
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to scale — always standardize first!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total: {sum(pca.explained_variance_ratio_):.2%}")
```
3. Association Rules — Apriori Algorithm
Discovers co-occurrence rules of the form "if {bread, butter} then {jam}" in transactional data. Three core metrics guide rule quality:
- Support: How frequently the itemset appears across all transactions.
- Confidence: How often the rule fires correctly (conditional probability).
- Lift > 1: Indicates a genuine positive association, not random co-occurrence.
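These three metrics can be computed directly on a toy basket of transactions; the items and the rule below are made up for illustration:

```python
# Toy transactional data (each set is one customer's basket)
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "bread", "butter", "jam"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {bread, butter} -> {jam}
antecedent, consequent = {"bread", "butter"}, {"jam"}
supp = support(antecedent | consequent)   # how often the full itemset occurs
conf = supp / support(antecedent)         # P(jam | bread, butter)
lift = conf / support(consequent)         # confidence vs. jam's base rate

print(f"support={supp:.2f}  confidence={conf:.2f}  lift={lift:.2f}")
```

Here the lift comes out above 1, so buying bread and butter genuinely raises the chance of buying jam rather than merely co-occurring by chance.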
5. The End-to-End ML Workflow
From Raw Data to Production Model
Successful ML projects follow a repeatable, structured pipeline. Skipping or rushing any phase typically leads to poor generalization or silent production failures.
Data Preprocessing Essentials
| Technique | Purpose | Common Implementation |
|---|---|---|
| Missing Value Imputation | Prevent errors and bias from null entries | Mean/median fill, KNN imputer, indicator flag column |
| Categorical Encoding | Convert text to numeric features | One-hot encoding, label encoding, target encoding |
| Feature Scaling | Equalize feature contribution magnitude | StandardScaler (z-score), MinMaxScaler, RobustScaler |
| Train/Test Split | Honest evaluation on unseen data | 80/20 or 70/30 split; stratified for class balance |
| Cross-Validation | Robust performance estimation | K-Fold (K=5 or 10), StratifiedKFold, TimeSeriesSplit |
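The techniques in the table are typically chained so that the same fitted transformations apply identically at train and predict time. A minimal sketch with scikit-learn's Pipeline and ColumnTransformer, on a small hypothetical dataset (the column names and values are invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical dataset with missing numeric values and a categorical column
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29, 55, None, 38],
    "income": [40e3, 60e3, 52e3, 80e3, 45e3, 120e3, 70e3, 65e3],
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro", "pro", "basic"],
    "churned": [0, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns="churned"), df["churned"]

# Impute and scale numeric columns, one-hot encode the categorical column,
# all inside the pipeline so nothing leaks from test data into training
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2f}")
```

Because imputation and scaling statistics are fit only on the training split, this structure also guards against the data leakage pitfall discussed later in the chapter.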
6. Machine Learning at Big Data Scale
When scikit-learn Doesn't Fit in RAM
Traditional single-machine ML libraries (scikit-learn, XGBoost) hit hard limits when datasets exceed available RAM — typically 100GB+. Distributed frameworks partition both data and computation across dozens or hundreds of nodes.
The Scale Wall: Training a model on billions of rows on a single 64GB laptop is simply impossible. Distributed frameworks like Spark MLlib shard the dataset across a cluster, each node training on its local shard and contributing to a shared global gradient or model update.
Apache Spark MLlib — Distributed ML Pipeline
```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataML").getOrCreate()

# Load from HDFS — could be terabytes across hundreds of nodes
data = spark.read.parquet("hdfs:///data/churn_dataset.parquet")

# Build a preprocessing pipeline
indexer = StringIndexer(inputCol="plan_type", outputCol="planIdx")
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_charges", "planIdx"],
    outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(data)

# Predictions executed in parallel across the cluster
predictions = model.transform(spark.read.parquet("hdfs:///data/churn_test.parquet"))
predictions.select("prediction", "probability", "churned").show(10)
```
Frameworks at a Glance
| Framework | Strength | Best For |
|---|---|---|
| Spark MLlib | Tight Hadoop ecosystem integration, pipelines | Classical ML (regression, trees, clustering) at petabyte scale |
| TensorFlow/PyTorch Distributed | GPU acceleration, deep learning primitives | Neural networks, computer vision, NLP |
| H2O.ai AutoML | Automatic model selection and hyperparameter tuning | Rapid prototyping, non-ML teams needing fast results |
Big Data ML Best Practices: (1) Always develop on a sampled subset first. (2) Use a Feature Store to centralize feature computation and prevent train/serve skew. (3) Monitor model performance continuously — data distributions shift slowly over time (concept drift), silently degrading accuracy.
7. Model Evaluation & Selection
Measuring What Matters
Choosing the wrong evaluation metric is one of the most dangerous mistakes in ML. A model with 99% accuracy on a fraud dataset where 99% of transactions are legitimate is completely useless — it simply predicts "not fraud" every time.
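A quick numeric sketch of this accuracy trap, with made-up counts matching the 99%-legitimate scenario:

```python
# 10,000 transactions, 1% fraud; a "model" that always predicts "not fraud"
y_true = [1] * 100 + [0] * 9_900   # 1 = fraud
y_pred = [0] * 10_000              # predicts legitimate every time

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(f"Accuracy: {accuracy:.1%}")  # 99.0%: looks impressive
print(f"Recall:   {recall:.1%}")    # 0.0%: catches no fraud at all
```

Accuracy rewards the majority-class guess; recall exposes that the model never identifies a single fraudulent transaction.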
Classification Metrics
| Metric | Formula | When to Prioritize |
|---|---|---|
| Accuracy | (TP+TN) / Total | Balanced classes only |
| Precision | TP / (TP+FP) | False positives are expensive (spam filter) |
| Recall | TP / (TP+FN) | False negatives are dangerous (disease screening) |
| F1-Score | 2 × (P × R) / (P + R) | Imbalanced classes — harmonic mean of P and R |
| ROC-AUC | Area under ROC curve | Comparing models across decision thresholds |
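The formulas in the table can be applied to a hypothetical set of confusion-matrix counts (the numbers below are invented for illustration):

```python
# Hypothetical confusion-matrix counts for a binary classifier
TP, FP, FN, TN = 80, 20, 40, 860

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)                          # of flagged, how many correct
recall    = TP / (TP + FN)                          # of actual positives, how many caught
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

Note how accuracy (0.940) looks far better than recall (0.667): the large TN count dominates it, which is exactly why imbalanced problems need precision, recall, and F1.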
Regression Metrics
| Metric | Formula | Key Property |
|---|---|---|
| MAE | mean(|y − ŷ|) | Robust to outliers; same units as target |
| RMSE | √mean((y − ŷ)²) | Penalizes large errors heavily; outlier-sensitive |
| R² Score | 1 − SS_res/SS_tot | 1 = perfect fit; 0 = baseline mean model |
| MAPE | mean(|y−ŷ|/y) × 100% | Percentage error — intuitive but undefined at y=0 |
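The same exercise for regression, computing each metric by hand on a small made-up sample of true and predicted values:

```python
# Hypothetical true vs. predicted values
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 230.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae  = sum(abs(e) for e in errors) / n              # mean absolute error
rmse = (sum(e**2 for e in errors) / n) ** 0.5       # root mean squared error

mean_y = sum(y_true) / n
ss_res = sum(e**2 for e in errors)                  # residual sum of squares
ss_tot = sum((t - mean_y)**2 for t in y_true)       # total sum of squares
r2 = 1 - ss_res / ss_tot

mape = sum(abs(e) / t for e, t in zip(errors, y_true)) / n * 100

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R²={r2:.3f}  MAPE={mape:.1f}%")
```

Notice that RMSE (≈13.2) exceeds MAE (12.5) because the single 20-unit error is squared before averaging, illustrating RMSE's outlier sensitivity from the table.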
Common Pitfalls to Avoid: Data Leakage — accidentally including future or test-time information during training. Overfitting — model memorizes noise instead of generalizing. Concept Drift — real-world distributions shift over months; models silently degrade without active monitoring and scheduled retraining.
Model Comparison with Cross-Validation
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf", probability=True, random_state=42)
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: F1 = {scores.mean():.3f} ± {scores.std():.3f}")

best = max(results, key=results.get)
print(f"Winner: {best}")
```