From Raw Data to Predictions: A Hands-On Guide to Building a PySpark ML Pipeline

Akram Chauhan

Let's be honest. Hearing the words "Apache Spark" can be a little intimidating. You imagine massive server clusters, complex configurations, and a steep learning curve. For a long time, I thought it was this mythical beast only accessible to data engineers at huge tech companies.

But what if I told you that you could tame this beast right from your browser? That you could build a complete, end-to-end data and machine learning pipeline using the power of Spark, all within a simple Google Colab notebook?

That’s exactly what we’re going to do today. We're going on a journey, starting with a messy little pile of raw data and ending with a working machine learning model that can make predictions. No complicated setup, no server clusters. Just you, me, and a little bit of PySpark magic.

Think of this as a guided project. I'll walk you through every step, explaining not just the what but the why behind each piece of code. Let's get started.

First Things First: Waking Up the Spark Engine

Before we can do anything, we need to get Spark running in our Colab environment. This used to be a bit of a hassle, but now it's surprisingly straightforward.

First, we'll install PySpark, which is the Python library that lets us talk to Spark.

%pip install -q pyspark==3.5.1

With that installed, our next move is to create a SparkSession. You can think of this as turning on the engine of a powerful data-processing car. It’s the entry point to all of Spark's functionality. We'll give our session a name and tell it to run locally; the local[*] setting means Spark runs on this one machine with as many worker threads as it has cores.

from pyspark.sql import SparkSession, functions as F, Window
from pyspark.sql.types import IntegerType, StringType, StructType, StructField, FloatType

spark = (SparkSession.builder
    .appName("ColabSparkProject")
    .master("local[*]")
    .getOrCreate())

print("Spark is ready to go! Version:", spark.version)

Now that our engine is running, we need some fuel—our data. Instead of loading a file, let's just create a small dataset of user information right here. It includes things like their name, country, income, and which subscription plan they're on.

# Our raw user data
data = [
    (1, "Alice", "IN", "2025-10-01", 56000.0, "premium"),
    (2, "Bob", "US", "2025-10-03", 43000.0, "standard"),
    (3, "Carlos", "IN", "2025-09-27", 72000.0, "premium"),
    (4, "Diana", "UK", "2025-09-30", 39000.0, "standard"),
    (5, "Esha", "IN", "2025-10-02", 85000.0, "premium"),
    (6, "Farid", "AE", "2025-10-02", 31000.0, "basic"),
    (7, "Gita", "IN", "2025-09-29", 46000.0, "standard"),
    (8, "Hassan", "PK", "2025-10-01", 52000.0, "premium"),
]

# Defining the structure of our data (the "schema")
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("country", StringType(), True),
    StructField("signup_date", StringType(), True),
    StructField("income", FloatType(), True),
    StructField("plan", StringType(), True),
])

# Creating our first DataFrame
df = spark.createDataFrame(data, schema)
df.show()

And just like that, we have a Spark DataFrame! It looks like a table, but under the hood, it's a powerful, distributed data structure ready for some serious work.
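
One detail of the schema is worth a quick look. FloatType stores 32-bit floats, while Python's own floats are 64-bit, so some values change slightly when they enter the DataFrame. The sketch below needs no Spark at all: it uses the standard library's struct module to simulate 32-bit storage, and to_float32 is a hypothetical helper name, not a Spark function.

```python
import struct

def to_float32(value: float) -> float:
    """Round-trip a Python float through 32-bit storage (illustrative helper)."""
    return struct.unpack("f", struct.pack("f", value))[0]

income = to_float32(56000.0)   # whole numbers of this size survive unchanged
fraction = to_float32(1.42)    # most decimal fractions pick up a tiny error

print(income == 56000.0)
print(fraction == 1.42, abs(fraction - 1.42))
```

For a small demo this does not matter. If you need more precision, for example for monetary amounts, DoubleType is a common alternative.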

Shaping the Clay: Transforming Our Data

Raw data is rarely perfect. Our job as data professionals is to clean it, shape it, and add new information to make it more useful. This is where Spark really starts to shine.

Let’s perform a few transformations. The signup_date is currently just a string, which isn't very helpful. We'll convert it to a proper timestamp and then extract the year and month from it. We'll also add a handy little column that's 1 if the user is from India and 0 otherwise.

The withColumn() function is your best friend for this. It returns a new DataFrame with the column added or replaced, so your original DataFrame stays untouched.

df2 = (df.withColumn("signup_ts", F.to_timestamp("signup_date"))
    .withColumn("year", F.year("signup_ts"))
    .withColumn("month", F.month("signup_ts"))
    .withColumn("is_india", (F.col("country") == "IN").cast("int")))

df2.show()

See those new columns on the right? Pretty cool, right?
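
If you want to sanity-check what those four columns should contain, here is a plain-Python mirror of the same logic for a single row. It is only a sketch: derive_columns is a made-up helper, and it assumes the dates always follow the yyyy-MM-dd pattern used in our sample data. It also mirrors one Spark subtlety: comparing a null country with "IN" gives null, not 0.

```python
from datetime import datetime
from typing import Optional

def derive_columns(country: Optional[str], signup_date: str) -> dict:
    """Plain-Python mirror of the four withColumn steps for one row."""
    ts = datetime.strptime(signup_date, "%Y-%m-%d")  # date-only string -> midnight timestamp
    # A null country compares to null in Spark, so we return None instead of 0
    is_india = None if country is None else int(country == "IN")
    return {"signup_ts": ts, "year": ts.year, "month": ts.month, "is_india": is_india}

print(derive_columns("IN", "2025-10-01"))
print(derive_columns(None, "2025-09-30"))
```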

Speaking a Familiar Language: Spark SQL

One of the best things about Spark is that you don't have to learn a whole new way of thinking to query your data. If you know SQL, you're already halfway there. We can register our DataFrame as a temporary table and then run standard SQL queries on it.

Let's find out how many users we have in each country and what their average income is.

df2.createOrReplaceTempView("users")

spark.sql("""
    SELECT 
        country, 
        COUNT(*) AS user_count, 
        AVG(income) AS avg_income
    FROM users
    GROUP BY country
    ORDER BY user_count DESC
""").show()

This makes it incredibly easy to do quick aggregations and explorations.
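
With a dataset this small, you can cross-check the query by hand. The sketch below repeats the same grouping in plain Python on the country and income values copied from our sample rows. It is only a verification aid, not how Spark executes the query.

```python
from collections import defaultdict

# (country, income) pairs copied from the sample rows
rows = [("IN", 56000.0), ("US", 43000.0), ("IN", 72000.0), ("UK", 39000.0),
        ("IN", 85000.0), ("AE", 31000.0), ("IN", 46000.0), ("PK", 52000.0)]

incomes_by_country = defaultdict(list)
for country, income in rows:
    incomes_by_country[country].append(income)

# country -> (user_count, avg_income), like COUNT(*) and AVG(income)
summary = {
    country: (len(incomes), sum(incomes) / len(incomes))
    for country, incomes in incomes_by_country.items()
}

# Largest group first, like ORDER BY user_count DESC
for country, (user_count, avg_income) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(country, user_count, avg_income)
```

India should come out on top with four users; every other country has exactly one.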

A Window into Deeper Insights

Okay, let's try something a bit more advanced: a window function. Imagine you want to rank users by income, but you want to do it within each country. So, who's the highest earner in India? Who's the highest in the US?

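A simple GROUP BY can't do that, because it collapses each group into a single row. A window function keeps every row and computes a value across a related set of rows. Think of it like looking through a "window" at a specific partition of your data (in our case, each country).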

# Define the "window" - partition by country, order by income
w = Window.partitionBy("country").orderBy(F.col("income").desc())

# Add a new column with the rank over that window
df_ranked = df2.withColumn("income_rank_in_country", F.rank().over(w))
df_ranked.show()

Now you can clearly see that Esha is #1 in India, Carlos is #2, and Alice is #3. This is a super powerful technique for more complex analysis.
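
To double-check the ranking for India, here is a small plain-Python sketch of the same idea. The rank_desc helper is hypothetical and only mimics the behavior of rank(): rows with the same income share a rank, and the next rank skips ahead. Our sample has no ties, so the ranks simply run from 1 to 4.

```python
from itertools import groupby

# (name, country, income) for the four users in India, copied from the sample data
india = [("Alice", "IN", 56000.0), ("Carlos", "IN", 72000.0),
         ("Esha", "IN", 85000.0), ("Gita", "IN", 46000.0)]

def rank_desc(rows):
    """Mimic rank() over income descending: ties share a rank, then ranks skip."""
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    ranks = {}
    position = 0
    for _, group in groupby(ordered, key=lambda r: r[2]):
        tied = list(group)
        for name, _, _ in tied:
            ranks[name] = position + 1
        position += len(tied)
    return ranks

print(rank_desc(india))
```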

Creating Our Own Tools: User-Defined Functions (UDFs)

Sometimes, Spark's built-in functions aren't enough. What if we want to apply some custom logic? Let's say we want to assign a numerical priority to each subscription plan: premium is 3, standard is 2, and so on.

We can write a regular Python function and then wrap it as a User-Defined Function (UDF) so Spark can use it across our entire dataset. A word of caution: Python UDFs are usually slower than native Spark functions, because each row has to be passed between Spark's JVM and a Python process. They are incredibly flexible when you need them, though.

# ML imports (used later, in the machine learning section)
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def plan_priority(plan):
    if plan == "premium": return 3
    if plan == "standard": return 2
    if plan == "basic": return 1
    return 0

# Register our Python function as a UDF
plan_priority_udf = F.udf(plan_priority, IntegerType())

# Apply it to create a new column
df_udf = df_ranked.withColumn("plan_priority", plan_priority_udf(F.col("plan")))
df_udf.show()
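
A nice side effect of UDFs is that the logic is just Python, so you can test it before Spark ever sees it. Here is a minimal sketch of that idea using a dictionary-based variant of the function. The names PLAN_PRIORITY and plan_priority_lookup are my own, not part of the pipeline above. Spark passes None to a Python UDF for null cells, so that case is worth testing too.

```python
PLAN_PRIORITY = {"premium": 3, "standard": 2, "basic": 1}

def plan_priority_lookup(plan):
    """Dictionary-based variant of plan_priority; unknown or missing plans map to 0."""
    return PLAN_PRIORITY.get(plan, 0)

# Try the known plans, an unknown plan, and a missing value
for value in ["premium", "standard", "basic", "enterprise", None]:
    print(value, plan_priority_lookup(value))
```

Once the function behaves as expected, it can be wrapped with F.udf in the same way as plan_priority.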

Connecting the Dots: Enriching Data with Joins

Our user data is looking good, but we can make it even better. What if we had another dataset with information about each country, like its continent and population? We could join that information into our main table.

This is a fundamental concept in data engineering—combining different datasets to create a richer, more complete picture.

# A second DataFrame with country info
country_data = [
    ("IN", "Asia", 1.42),
    ("US", "North America", 0.33),
    ("UK", "Europe", 0.07),
    ("AE", "Asia", 0.01),
    ("PK", "Asia", 0.24),
]
country_schema = StructType([
    StructField("country", StringType(), True),
    StructField("region", StringType(), True),
    StructField("population_bn", FloatType(), True),
])
country_df = spark.createDataFrame(country_data, country_schema)

# Now, let's join them on the 'country' column
joined_df = df_udf.join(country_df, on="country", how="left")
joined_df.show()

With this single join operation, we've instantly added region and population data to every user record. Now we can ask new questions, like "What's the average income for premium users in Asia?"

(joined_df.groupBy("region", "plan")
    .agg(
        F.count("*").alias("user_count"),
        F.round(F.avg("income"), 2).alias("avg_income")
    )
    .orderBy("region", "plan")
).show()
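
One note on the how="left" argument in the join. A left join keeps every user even when their country has no match in the lookup table; the missing columns are filled with nulls. In our data every country has a match, so the sketch below uses an invented user and a deliberately incomplete lookup to show the difference from an inner join. Everything in it is hypothetical illustration data.

```python
# Hypothetical lookup table; "BR" is deliberately missing to show the unmatched case
regions = {"IN": "Asia", "US": "North America", "UK": "Europe"}

users = [("Alice", "IN"), ("Bob", "US"), ("Zed", "BR")]  # "Zed" is an invented row

def left_join(rows, lookup):
    """Keep every left-hand row; use None when the key has no match."""
    return [(name, country, lookup.get(country)) for name, country in rows]

def inner_join(rows, lookup):
    """Keep only rows whose key exists on the right-hand side."""
    return [(name, country, lookup[country]) for name, country in rows if country in lookup]

print(left_join(users, regions))   # Zed is kept, with region None
print(inner_join(users, regions))  # Zed is dropped
```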

Building Our Crystal Ball: Machine Learning in PySpark

Alright, our data is clean, transformed, and enriched. Now for the fun part. Let's use this data to build a predictive model. Our goal: predict whether a user will sign up for a "premium" plan based on their income and country.

This is where we transition from data engineering to machine learning, and Spark's MLlib library makes this seamless.

Step 1: Feature Engineering

Machine learning models don't understand text like "IN" or "US." They need numbers. So, our first job is to prepare the data for the model. This is called feature engineering.

  1. Create a Label: We need a target to predict. We'll create a label column that is 1 if the plan is "premium" and 0 otherwise.
  2. Index Categorical Features: We'll use a StringIndexer to convert the country column from text to a numerical index. By default the most frequent value gets index 0, so here IN becomes 0.
  3. Assemble a Feature Vector: Models like to have all their input features bundled together in a single vector. We'll use a VectorAssembler to combine our income and country_idx columns into one features column. We leave plan_priority out on purpose: it is derived from plan, the same column the label comes from, so including it would hand the answer to the model.

# Create the binary label for our model
ml_df = joined_df.withColumn("label", (F.col("plan") == "premium").cast("int")).na.drop()

# Turn country strings into numbers
country_indexer = StringIndexer(inputCol="country", outputCol="country_idx", handleInvalid="keep")
ml_df2 = country_indexer.fit(ml_df).transform(ml_df)

# Bundle our features into a single vector (plan_priority is excluded to avoid label leakage)
assembler = VectorAssembler(inputCols=["income", "country_idx"], outputCol="features")
ml_final = assembler.transform(ml_df2)

# Split data into training and testing sets
train_df, test_df = ml_final.randomSplit([0.7, 0.3], seed=42)
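
A note on the StringIndexer step. As far as I understand its default ordering in Spark 3.x (frequencyDesc), the most frequent label gets index 0 and ties are broken alphabetically. The plain-Python sketch below reproduces that rule for our eight country values so you can predict the country_idx column before running Spark. The function frequency_desc_index is my own illustration, not a Spark API.

```python
from collections import Counter

countries = ["IN", "US", "IN", "UK", "IN", "AE", "IN", "PK"]  # from the sample rows

def frequency_desc_index(values):
    """Most frequent label gets 0.0; labels with equal counts are ordered alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

print(frequency_desc_index(countries))
```

Keep in mind that these indices are just identifiers. A linear model will still treat them as ordered numbers, which is why indexed categories are often one-hot encoded in larger projects.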

Step 2: Training and Evaluating the Model

With our data prepped, training the model is surprisingly simple. We'll use a LogisticRegression model, a classic choice for binary classification problems like this.

We'll fit the model on our training data and then use it to make predictions on our test data.
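
Before running it, it may help to see what a logistic regression computes for a single row: a weighted sum of the features, squashed into a probability by the sigmoid function. The sketch below is a simplified, plain-Python version of that idea. The coefficients are invented for illustration and are not the ones Spark will learn, and for brevity it uses just two of our features (income and country_idx).

```python
import math

def predict_proba(features, weights, intercept):
    """Logistic regression for one row: sigmoid of the weighted sum."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, NOT taken from a fitted model
weights = [0.0001, -0.2]   # one weight per feature: income, country_idx
intercept = -5.0

p = predict_proba([72000.0, 0.0], weights, intercept)
prediction = 1.0 if p > 0.5 else 0.0   # 0.5 is the usual default threshold
print(round(p, 3), prediction)
```

In Spark's output, the probability column holds a vector with the probability of class 0 and class 1, and the prediction column holds the class chosen after thresholding.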

# Define and train the model
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train_df)

# Make predictions on the test data
predictions = lr_model.transform(test_df)

predictions.select("name", "income", "plan", "label", "prediction", "probability").show(truncate=False)

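Look at that! We have our original data alongside the model's prediction. Finally, let's see how well it did. We'll use an evaluator to calculate the accuracy. Keep in mind that with only eight rows in total, the test split holds just a handful of users, so treat the number as a demonstration of the workflow rather than a real measure of model quality.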

# Evaluate the model's performance
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Our model's accuracy is: {accuracy:.2f}")
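
Accuracy itself is a simple metric: the share of rows where the prediction matches the label. The sketch below computes it by hand on invented labels and predictions (not our model's real output), mainly to show how coarse the number gets on a tiny test set. With three rows, accuracy can only move in steps of one third.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    if not labels:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)

# Invented labels and predictions for a three-row test split
labels = [1, 0, 1]
predictions = [1, 0, 0]
print(f"{accuracy(labels, predictions):.2f}")
```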

Saving Our Work and Wrapping Up

We've done a ton of work to process this data. It would be a shame to lose it. A common last step in a data pipeline is to save the results. We'll save our joined DataFrame to Parquet, a highly efficient, column-oriented file format that's perfect for big data analytics.

output_path = "/content/spark_users_parquet"
joined_df.write.mode("overwrite").parquet(output_path)

# Let's prove it worked by reading it back in
reloaded_df = spark.read.parquet(output_path)
print("Data reloaded from Parquet:")
reloaded_df.show()

And just to show that the users view we registered earlier is still available, let's run one final query to find the most recent signups.

spark.sql("""
    SELECT name, country, income, signup_ts
    FROM users
    WHERE signup_ts >= '2025-10-01'
    ORDER BY signup_ts DESC
""").show()

Finally, it's good practice to explicitly stop the Spark session when you're done. It’s like turning off the lights and locking the door when you leave.

spark.stop()

And there you have it. We went from a simple list of data to a fully transformed and enriched dataset, built a machine learning model, made predictions, and saved our work. We've completed an entire end-to-end pipeline, all within the friendly confines of a Colab notebook.

Hopefully, this little journey has shown you that Spark isn't some scary, unapproachable monster. It’s a powerful, flexible tool that you can start using today for your own data challenges. Now, what will you build?
