Data Mining with Python

Why Python is Your Data Mining Swiss Army Knife 🇨🇭

You've learned the concepts of data mining, pattern recognition, and scraping. Now it's time to actually *do* it. Python is one of the most widely used languages in data science for a few simple reasons: it's easy to read, it's supported by a large community, and it has powerful, free libraries that do the heavy lifting for you.

This lesson will walk you through a complete, practical data mining project in about 20 lines of code. No complex math, just a clear, repeatable workflow.

The Data Scientist's Holy Trinity

You only need to know three main libraries to get started:

🐼

Pandas

Your data spreadsheet. Use it to load, clean, and manipulate data from files like CSVs.

🔢

NumPy

The calculator. It handles all the fast, efficient math under the hood.

🤖

Scikit-learn

Your toolbox of models. Use it for pattern recognition, prediction, and clustering.

A 5-Step Workflow: Predicting Customer Churn

Let's solve a real business problem: "Which of our customers are most likely to cancel their subscription?"

Step 1: Get the Data

First, we use Pandas to load our customer data from a CSV file into a "DataFrame" (think of it as a smart spreadsheet).

import pandas as pd

# Load customer data from a CSV file
df = pd.read_csv('customer_data.csv')

print(df.head())
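
Before moving on, it can help to look a little closer at what was loaded. The sketch below is one way to do that. Since `customer_data.csv` is not included with this lesson, it reads a tiny made-up sample from an in-memory string; the column names are assumptions based on the features used in Step 2, and `sample_df` is a stand-in name so it does not overwrite `df`.

```python
import io

import pandas as pd

# Hypothetical sample standing in for customer_data.csv (values are made up)
sample_csv = io.StringIO(
    "age,monthly_spend,support_tickets,churned\n"
    "34,80,1,0\n"
    "45,150,5,1\n"
    "29,,0,0\n"
    "52,120,4,1\n"
)

sample_df = pd.read_csv(sample_csv)

# Column types and non-null counts
sample_df.info()

# Number of missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Share of customers who churned (a quick look at class balance)
churn_rate = sample_df["churned"].mean()
print(f"Churn rate: {churn_rate:.0%}")
```

Checks like these show early whether columns have gaps and whether churners are rare, both of which affect the later steps.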

Step 2: Prepare the Data

Real-world data is messy. We need to select the features (inputs) we think are important and define our target (what we want to predict). We'll also split our data into a training set (to teach the model) and a testing set (to see how well it learned).

from sklearn.model_selection import train_test_split

# Define our features (X) and target (y)
features = ['age', 'monthly_spend', 'support_tickets']
target = 'churned'

X = df[features]
y = df[target]

# Split data into 80% for training, 20% for testing
# random_state fixes the shuffle so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
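
The code above assumes the chosen columns have no gaps, which messy real-world data rarely guarantees. Here is a minimal, stand-alone sketch of one way to handle that: drop rows with missing values, then split with `stratify` so both sets keep a similar share of churners. The data is made up, and names such as `sample_df`, `clean_df`, and the `_s` suffixes are placeholders for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample data; column names assumed to match the lesson's CSV
sample_df = pd.DataFrame({
    "age": [34, 45, 29, 52, 41, 38, 60, 27, 33, 48, 36],
    "monthly_spend": [80, 150, None, 120, 95, 60, 200, 45, 70, 130, 85],
    "support_tickets": [1, 5, 0, 4, 2, 1, 6, 0, 1, 3, 2],
    "churned": [0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})

feature_cols = ["age", "monthly_spend", "support_tickets"]
target_col = "churned"

# Drop rows missing any feature or the target (one simple option among several)
clean_df = sample_df.dropna(subset=feature_cols + [target_col])

X_clean = clean_df[feature_cols]
y_clean = clean_df[target_col]

# stratify keeps the churn ratio similar in both splits;
# random_state makes the split repeatable
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_clean, y_clean, test_size=0.2, random_state=42, stratify=y_clean
)

print(len(X_train_s), len(X_test_s))
```

Dropping rows is the simplest choice; filling gaps with a typical value is another, and which one fits depends on how much data is missing.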

Step 3: Build a Model

Now for the magic. We'll import a Decision Tree model from Scikit-learn, which is great for predicting "yes/no" answers. We then "fit" (train) the model on our training data.

from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree model (random_state makes the result repeatable)
model = DecisionTreeClassifier(random_state=42)

# Train the model on our data
model.fit(X_train, y_train)
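
A trained decision tree is less of a black box than "magic" suggests: Scikit-learn can print the rules it learned. The sketch below is a stand-alone illustration on made-up data, with `max_depth=2` chosen only to keep the printed tree short. The names `train_df`, `feature_names`, and `shallow_model` are placeholders, not part of the workflow above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training data; column names assumed to match the lesson's CSV
train_df = pd.DataFrame({
    "age": [34, 45, 52, 41, 38, 60, 27, 33, 48, 36],
    "monthly_spend": [80, 150, 120, 95, 60, 200, 45, 70, 130, 85],
    "support_tickets": [1, 5, 4, 2, 1, 6, 0, 1, 3, 2],
    "churned": [0, 1, 1, 0, 0, 1, 0, 0, 1, 0],
})

feature_names = ["age", "monthly_spend", "support_tickets"]

# Limit the depth so the tree stays small and readable
shallow_model = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_model.fit(train_df[feature_names], train_df["churned"])

# Print the learned if/else rules as plain text
rules = export_text(shallow_model, feature_names=feature_names)
print(rules)
```

Reading the rules is a quick sanity check: if the tree splits on something that makes no business sense, the data or the features may need another look.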

Step 4: Check Accuracy

How good is our model? We use the testing data we set aside earlier to see how accurately it can predict churn on data it's never seen before.

# See how accurate the model is on the test data
accuracy = model.score(X_test, y_test)

print(f"Model Accuracy: {accuracy * 100:.2f}%")
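
Accuracy alone can be misleading when churners are rare. The stand-alone sketch below uses made-up labels to show why: a "model" that always predicts "stay" still scores 80% accuracy while catching none of the customers who actually churn. The lists `y_true` and `y_pred` are illustrative and not taken from the lesson's data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 1 = churned, 0 = stayed (2 churners out of 10)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A "model" that predicts "stay" for everyone
y_pred = [0] * 10

# Fraction of all predictions that were correct
accuracy = accuracy_score(y_true, y_pred)

# Fraction of actual churners that were caught
recall = recall_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {accuracy:.0%}")
print(f"Recall for churners: {recall:.0%}")
print(matrix)
```

For churn problems it is usually worth checking recall or the confusion matrix next to the accuracy figure.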

Step 5: Get an Answer

Finally, the payoff. Let's predict what a new customer might do. We create a new profile and ask the model to predict if they will churn (1 for Yes, 0 for No).

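# Profile of a new customer:
# Age 45, spends $150/month, has 5 support tickets
# Use a DataFrame with the same column names the model was trained on
new_customer = pd.DataFrame([[45, 150, 5]], columns=features)

# Get the prediction
prediction = model.predict(new_customer)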

if prediction[0] == 1:
    print("Prediction: This customer is likely to churn.")
else:
    print("Prediction: This customer is likely to stay.")
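
A plain yes/no answer hides how sure the model is. As a rough, stand-alone sketch on made-up data, `predict_proba` returns an estimated probability for each class instead. The names `train_df`, `feature_names`, and `demo_model` are placeholders for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data; column names assumed to match the lesson's CSV
train_df = pd.DataFrame({
    "age": [34, 45, 52, 41, 38, 60, 27, 33, 48, 36],
    "monthly_spend": [80, 150, 120, 95, 60, 200, 45, 70, 130, 85],
    "support_tickets": [1, 5, 4, 2, 1, 6, 0, 1, 3, 2],
    "churned": [0, 1, 1, 0, 0, 1, 0, 0, 1, 0],
})

feature_names = ["age", "monthly_spend", "support_tickets"]

demo_model = DecisionTreeClassifier(max_depth=2, random_state=42)
demo_model.fit(train_df[feature_names], train_df["churned"])

# Same new-customer profile as above, with matching column names
new_customer = pd.DataFrame([[45, 150, 5]], columns=feature_names)

# One row per customer, one column per class (ordered as in classes_)
proba = demo_model.predict_proba(new_customer)

# Pick the column that corresponds to class 1 (churned)
churn_column = list(demo_model.classes_).index(1)
churn_probability = proba[0, churn_column]

print(f"Estimated churn probability: {churn_probability:.0%}")
```

Keep in mind that a fully grown decision tree often returns probabilities of exactly 0 or 1, so these numbers are coarse estimates rather than calibrated risk scores.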

📚 Level Up Your Python Skills

10 Minutes to Pandas

Scikit-learn Basic Tutorial

Kaggle Datasets

Real Python
