Building a robust machine learning (ML) system means orchestrating far more than just model training. You need a repeatable, automated pipeline that ensures clean data flows into your model, training happens efficiently, and predictions are reliably served in production. This post delivers a practitioner’s guide to designing and implementing ML pipelines—covering data preparation, training, and model serving—with practical code, architecture choices, and real-world pitfalls to avoid.
Key Takeaways:
- Understand the three pillars of ML pipelines: data preparation, training, and serving
- Learn how to structure data pipelines for production-scale reliability
- Get hands-on code examples for feature engineering, training orchestration, and model deployment
- Compare batch, online, and streaming serving strategies with trade-offs
- Avoid common mistakes that derail ML projects before and after deployment
What Is an ML Pipeline? Architecture and Core Stages
ML pipelines are end-to-end workflows that automate the sequence of steps needed to move data from raw sources to actionable predictions. Unlike ad-hoc scripts, well-designed pipelines ensure consistency, reproducibility, and scalability for both experimentation and production.
Core Stages of an ML Pipeline
- Data Preparation: Ingest, clean, transform, and feature-engineer data
- Training: Train, validate, and tune models—often at scale
- Model Serving: Deploy trained models for inference (batch or real-time)
This modular approach lets you iterate on each stage independently and plug in different tools as needs evolve.
Typical ML Pipeline Architecture
Most mature ML workflows use orchestration frameworks (like Apache Airflow, Kubeflow Pipelines, or MLflow) to manage dependencies, trigger runs, and monitor failures.
| Stage | Main Tools | Key Outputs |
|---|---|---|
| Data Preparation | Pandas, Spark, sklearn.pipeline, Airflow, dbt | Cleaned datasets, feature matrices |
| Training | scikit-learn, PyTorch, TensorFlow, MLflow | Trained model artifacts, metrics, logs |
| Model Serving | FastAPI, Flask, TensorFlow Serving, Seldon Core, AWS SageMaker | REST endpoints, batch outputs, monitoring dashboards |
This separation isn’t just academic: it enables versioning, A/B testing, and rollback without tangled dependencies.
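To make the orchestration idea concrete, here is a minimal sketch of the three stages wired together as an Apache Airflow DAG. It assumes Airflow 2.x; the DAG id and the task functions (prepare_data, train_model, deploy_model) are placeholders you would replace with your own pipeline code.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def prepare_data():
    ...  # e.g., run the preprocessing pipeline and write a feature matrix

def train_model():
    ...  # e.g., fit the model and log the run to MLflow

def deploy_model():
    ...  # e.g., push the model artifact to the serving environment

with DAG(
    dag_id="loan_default_pipeline",   # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    prepare = PythonOperator(task_id="prepare_data", python_callable=prepare_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)

    # Explicit stage dependencies: data prep, then training, then deployment
    prepare >> train >> deploy
```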
Data Preparation: Cleaning, Feature Engineering, and Pipelines in Practice
Data preparation often consumes over 60% of the total project time (O’Reilly 2016 survey), and quality here determines downstream model performance. The two main challenges are reproducibility and scalability: you need to ensure that the same transformations apply in both training and production, and that they can scale to millions of rows.
Building a Robust Data Pipeline
Use sklearn’s Pipeline or similar abstractions to encapsulate preprocessing steps, making them portable and versioned. For production-scale data, Spark and dbt are often used, but the principles remain the same.
```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Example: Predicting loan default using tabular data
data = pd.read_csv('loans.csv')  # columns: amount, term, income, industry, default

numeric_features = ['amount', 'term', 'income']
categorical_features = ['industry']

# Define preprocessing for each column type
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine into a single preprocessor
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Usage: preprocessor.fit(X_train) -> preprocessor.transform(X_test)
```
This pipeline guarantees that the same cleaning and encoding logic is applied at both train and inference time, reducing risk of train/serve skew.
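One common way to enforce this is to make the model the final step of the same Pipeline object, so preprocessing and prediction are always fit, versioned, and deployed together. A minimal sketch; the RandomForestClassifier here simply mirrors the training example later in this post.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Preprocessing and model travel together: fit() and predict() always
# apply exactly the same transformations.
full_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(random_state=42)),
])

# Usage: full_pipeline.fit(X_train, y_train) -> full_pipeline.predict(X_new)
```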
Scaling Up: Distributed Data Preparation
For data sizes beyond memory (millions of rows), migrate these patterns to Spark ML Pipelines or dbt (for SQL-based data warehouses). The principle—encapsulate every transformation for automated, repeatable execution—remains the same.
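As a rough sketch of the same pattern in PySpark, the transformations become stages of a Spark ML Pipeline. Column names mirror the pandas example above; the file path and stage names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import (Imputer, StringIndexer, OneHotEncoder,
                                VectorAssembler, StandardScaler)

spark = SparkSession.builder.appName("loan-prep").getOrCreate()
df = spark.read.csv("loans.csv", header=True, inferSchema=True)

stages = [
    # Median imputation for numeric columns, as in the pandas pipeline
    Imputer(strategy="median",
            inputCols=["amount", "term", "income"],
            outputCols=["amount_f", "term_f", "income_f"]),
    # Index and one-hot encode the categorical column
    StringIndexer(inputCol="industry", outputCol="industry_idx", handleInvalid="keep"),
    OneHotEncoder(inputCols=["industry_idx"], outputCols=["industry_ohe"]),
    # Assemble and scale the feature vector
    VectorAssembler(inputCols=["amount_f", "term_f", "income_f", "industry_ohe"],
                    outputCol="features_raw"),
    StandardScaler(inputCol="features_raw", outputCol="features"),
]

prep_model = Pipeline(stages=stages).fit(df)
features = prep_model.transform(df)
```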
Feature Engineering Best Practices
- Version all transformation code with your model (use MLflow or DVC for data versioning)
- Automate validation: check for data drift, missing values, and schema changes before passing data downstream (a minimal sketch follows this list)
- Document every transformation: ambiguity here is a leading cause of broken production predictions
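Even a few plain pandas assertions catch many schema and missing-value problems before they reach training; dedicated tools (Great Expectations, Evidently) build on the same idea. A minimal sketch, with the expected schema and null threshold hard-coded as assumptions for the loans example:

```python
import pandas as pd

EXPECTED_COLUMNS = {"amount", "term", "income", "industry", "default"}
MAX_NULL_FRACTION = 0.05  # tolerate up to 5% missing values per column (arbitrary threshold)

def validate(df: pd.DataFrame) -> None:
    # Schema check: all expected columns must be present
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Schema check failed, missing columns: {missing}")
    # Null check: flag columns with too many missing values
    null_fractions = df[list(EXPECTED_COLUMNS)].isna().mean()
    too_sparse = null_fractions[null_fractions > MAX_NULL_FRACTION]
    if not too_sparse.empty:
        raise ValueError(f"Null check failed: {too_sparse.to_dict()}")

validate(pd.read_csv("loans.csv"))
```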
Model Training: Automation, Experiment Tracking, and Scaling
Model training is more than just fitting algorithms—it’s about reproducible experiments, hyperparameter tuning, and scalable execution. Manual training scripts quickly become brittle as data grows and requirements change. The solution is to codify training into pipeline steps and use tools for tracking and orchestration.
Automated Training Workflows
Let’s extend the previous data pipeline to include model training and hyperparameter search, using scikit-learn and MLflow for tracking.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
import mlflow
import mlflow.sklearn

X = data.drop('default', axis=1)
y = data['default']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the preprocessor on training data only, then transform both splits
preprocessor.fit(X_train)
X_train_prepared = preprocessor.transform(X_train)
X_test_prepared = preprocessor.transform(X_test)

# Model selection with grid search
param_grid = {'n_estimators': [100, 200], 'max_depth': [10, 20]}
clf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)

with mlflow.start_run():
    clf.fit(X_train_prepared, y_train)
    score = clf.score(X_test_prepared, y_test)
    mlflow.log_params(clf.best_params_)
    mlflow.log_metric('test_accuracy', score)
    mlflow.sklearn.log_model(clf.best_estimator_, "model")
```
This example demonstrates:
- Automated hyperparameter tuning via GridSearchCV
- Experiment tracking with MLflow—logging metrics, parameters, and the trained model artifact
- Reproducibility: you can rerun the pipeline with new data, and your previous runs are tracked for comparison
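Since the serving example later in this post loads rf_model.joblib and preprocessor.joblib from disk, remember to persist both artifacts at the end of the training run. A minimal sketch using joblib:

```python
import joblib

# Persist the fitted preprocessor and the best model so the serving
# layer can load exactly what was trained.
joblib.dump(preprocessor, "preprocessor.joblib")
joblib.dump(clf.best_estimator_, "rf_model.joblib")
```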
Scaling Training: Distributed and Cloud Approaches
For large models (e.g., BERT, ResNet), use distributed frameworks like PyTorch Distributed or TensorFlow MirroredStrategy. Managed platforms (AWS SageMaker, Azure ML, GCP Vertex AI) handle scaling and infrastructure, but always monitor training costs—training a GPT-3 scale model can cost millions, while a typical tabular ML model often costs under $100 on cloud instances.
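As one illustration, TensorFlow's MirroredStrategy spreads training across the GPUs on a single machine with only a small code change. This is a hedged sketch: build_model(), train_dataset, and val_dataset are placeholders you would define yourself.

```python
import tensorflow as tf

# Synchronous data-parallel training across all local GPUs
strategy = tf.distribute.MirroredStrategy()
print("Devices in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model and optimizer must be created inside the strategy scope
    model = build_model()  # placeholder: your own Keras model factory
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])

# train_dataset / val_dataset are placeholders for your tf.data pipelines
model.fit(train_dataset, validation_data=val_dataset, epochs=10)
```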
Key Metrics to Track
- Validation accuracy, loss, and AUC
- Training/inference time (latency, throughput)
- Resource utilization (GPU/CPU hours, memory footprint)
Automated pipelines should log these for every run, enabling informed regression detection and rollback decisions.
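Within the MLflow run shown earlier, these metrics can be recorded in a single call. A small sketch, where per-row latency is simply measured around a test-set prediction as an approximation:

```python
import time
import mlflow
from sklearn.metrics import roc_auc_score

# Rough inference latency: time one pass over the test set
start = time.perf_counter()
proba = clf.best_estimator_.predict_proba(X_test_prepared)[:, 1]
latency_ms = (time.perf_counter() - start) / len(X_test_prepared) * 1000

mlflow.log_metrics({
    "test_auc": roc_auc_score(y_test, proba),
    "inference_latency_ms_per_row": latency_ms,
})
```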
Model Serving: Batch, Online, and Streaming Approaches
Once a model is trained, you need to serve predictions reliably—whether as a nightly batch job, a low-latency API, or a streaming system. Each approach has trade-offs in complexity, cost, and latency. The right choice depends on your business SLA, data velocity, and operational maturity.
Serving Methods Compared
| Approach | Latency | Use Case | Example Tools | Pros | Cons |
|---|---|---|---|---|---|
| Batch | Minutes–hours | Risk scoring, churn prediction | Airflow, Spark, dbt | Simple, easy to scale, cheap | Not suitable for real-time |
| Online (API) | ms–sec | Personalization, fraud detection | FastAPI, Flask, TensorFlow Serving, Seldon | Low latency, interactive | Needs robust infra, higher ops cost |
| Streaming | Sub-second | Ad targeting, anomaly detection | Kafka, Flink, Ray Serve | Handles high data velocity, real-time | Complex, harder to test |
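For reference, batch serving can be as simple as a scheduled script that loads the trained artifacts, scores a day's worth of records, and writes the results back for downstream consumers. A minimal sketch; the input and output file names are placeholders.

```python
import joblib
import pandas as pd

# Load the artifacts persisted by the training step
model = joblib.load("rf_model.joblib")
preprocessor = joblib.load("preprocessor.joblib")

# Score a nightly extract and write predictions for downstream consumers
batch = pd.read_csv("loans_to_score.csv")  # placeholder extract
batch["default_probability"] = model.predict_proba(preprocessor.transform(batch))[:, 1]
batch.to_csv("scored_loans.csv", index=False)
```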
Example: Deploying a Model as a REST API with FastAPI
For many use cases, a REST API is the fastest path to production. Below is a minimal FastAPI app that loads the trained model and preprocessor and exposes a /predict endpoint. This is suitable for demos or staging; in production, run it behind a production ASGI server (uvicorn, typically managed by gunicorn with uvicorn workers) and container orchestration (Kubernetes, Seldon Core).
```python
from fastapi import FastAPI
import joblib
import pandas as pd

app = FastAPI()

# Load the artifacts persisted by the training pipeline
model = joblib.load('rf_model.joblib')
preprocessor = joblib.load('preprocessor.joblib')

@app.post("/predict")
def predict(input_data: dict):
    # Incoming JSON keys must match the training feature columns
    # (amount, term, income, industry)
    df = pd.DataFrame([input_data])
    X_prepared = preprocessor.transform(df)
    prediction = model.predict(X_prepared)
    return {"prediction": int(prediction[0])}
```
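Once the app is running (for example with uvicorn, assuming the file is named main.py), a client call might look like the sketch below; the URL assumes a local deployment and the field values are illustrative only.

```python
import requests

payload = {"amount": 12000, "term": 36, "income": 55000, "industry": "retail"}
response = requests.post("http://localhost:8000/predict", json=payload, timeout=5)
print(response.json())  # e.g. {"prediction": 0}
```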
To serve at scale:
- Containerize with Docker
- Run multiple uvicorn workers (or gunicorn with the uvicorn worker class) for concurrency
- Add health checks, logging, and monitoring (Prometheus, Grafana, Sentry)
- Automate deployment with CI/CD pipelines
Advanced: Model Versioning and Canary Deployments
For mission-critical systems, serve multiple model versions side-by-side. Tools like Seldon Core and MLflow Model Registry support A/B or canary testing, enabling you to route a fraction of traffic to new models and monitor for regressions before full rollout.
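With MLflow, registering a new model version is a single call once a run has been logged. The run ID and model name below are placeholders, and traffic splitting for the actual canary rollout is left to your deployment tooling (for example Seldon Core).

```python
import mlflow

# Register the model artifact logged in a previous run under a named
# entry in the MLflow Model Registry; "<run_id>" is a placeholder.
result = mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name="loan-default-classifier",  # placeholder registry name
)
print(result.version)  # new registry version to target in a canary rollout
```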
Common Pitfalls and Pro Tips for Every Stage
Even seasoned teams encounter subtle pipeline bugs that undermine business value. Here are the most common issues and how to avoid them:
Data Preparation
- Train/serve skew: Preprocessing code drift between training and inference. Solution: Package transformations as reusable, versioned functions or pipelines.
- Silent data drift: Upstream schema or distribution changes silently break models. Solution: Add automated validation and drift detection (e.g., Evidently); see the sketch after this list.
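A lightweight first line of defense is a statistical comparison between the training data and recent production data. The sketch below uses a two-sample Kolmogorov-Smirnov test on the numeric loan features; recent_production_df is a placeholder for whatever window of live data you collect, and the significance threshold is arbitrary.

```python
from scipy.stats import ks_2samp

def detect_drift(train_col, prod_col, alpha=0.01):
    """Flag a numeric feature whose production distribution differs from training."""
    statistic, p_value = ks_2samp(train_col.dropna(), prod_col.dropna())
    return p_value < alpha

for col in ["amount", "term", "income"]:
    if detect_drift(X_train[col], recent_production_df[col]):  # placeholder live data
        print(f"Possible drift detected in '{col}'")
```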
Training
- Non-reproducible experiments: Untracked random seeds, missing code/data versioning. Solution: Log every run with tools like MLflow; fix seeds; snapshot data and code.
- Overfitting due to leakage: Feature engineering leaks future data into training. Solution: Time-aware pipelines and careful aggregation windows.
Model Serving
- Latency spikes: Loading large models synchronously or inefficient pre/post-processing. Solution: Warm up models, optimize serialization, batch requests if possible.
- Unmonitored drift: Deployed model accuracy degrades due to changing production data. Solution: Monitor input distributions and prediction quality; automate retraining triggers.
- Dependency hell: Mismatched library versions between training and serving. Solution: Use Docker containers with explicit version pinning; CI/CD for deployment validation.
General Pro Tips
- Automate everything: manual steps introduce errors
- Test pipelines end-to-end, not just in unit tests
- Document assumptions and data contracts at every stage
- Review costs: unnecessary pipeline complexity leads to spiraling cloud bills
Conclusion and Next Steps
Production ML success depends on robust, automated pipelines for data preparation, training, and serving—not just model accuracy. Building these workflows with modular, testable, and versioned components pays off in reliability, reproducibility, and scalability. Start by refactoring your next project to use explicit pipelines for data and features, add experiment tracking for training, and deploy with a serving architecture that matches your latency/scale needs. For deeper dives, explore orchestration tools like Airflow, model registry best practices, or advanced monitoring for ML in production.