Python · 90 min read · Intermediate

Python for Data Science Complete Guide

Master data analysis, visualization, and machine learning with Python, pandas, and scikit-learn through practical examples.

Table of Contents

  • Introduction to Data Science
  • Setting Up Your Environment
  • NumPy Fundamentals
  • Pandas for Data Manipulation
  • Data Visualization
  • Data Cleaning & Preprocessing
  • Machine Learning Basics
  • Real-World Project: Sales Analysis
  • Next Steps in Your Data Science Journey

Introduction to Data Science

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

Python has become the go-to language for data science due to its simplicity, extensive libraries, and strong community support. This tutorial will guide you through the essential tools and techniques.

What You'll Learn

  • Data manipulation with pandas
  • Numerical computing with NumPy
  • Data visualization with matplotlib and seaborn
  • Statistical analysis techniques
  • Machine learning with scikit-learn
  • Real-world data science project

Setting Up Your Environment

Before diving into data science, let's set up a proper Python environment with all necessary libraries.

Installing Anaconda

Anaconda is the easiest way to get started with data science in Python. It includes Python, Jupyter notebooks, and most popular data science libraries.

# Download Anaconda from https://www.anaconda.com/products/distribution
# Or install with package manager:

# On macOS with Homebrew:
brew install --cask anaconda

# On Windows with Chocolatey:
choco install anaconda3

# On Linux (substitute the current installer name listed at https://repo.anaconda.com/archive/):
wget https://repo.anaconda.com/archive/Anaconda3-<version>-Linux-x86_64.sh
bash Anaconda3-<version>-Linux-x86_64.sh

Essential Libraries

# Install additional libraries if needed
conda install pandas numpy matplotlib seaborn scikit-learn jupyter

# Or with pip:
pip install pandas numpy matplotlib seaborn scikit-learn jupyter plotly

# For interactive notebooks:
pip install ipywidgets
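
Once everything is installed, it is worth a quick sanity check. The snippet below (a minimal sketch, assuming the libraries above installed without errors) imports the core stack and prints each version:

# Sanity check: import the core data science libraries and print their versions
import numpy
import pandas
import matplotlib
import seaborn
import sklearn

for lib in (numpy, pandas, matplotlib, seaborn, sklearn):
    print(f"{lib.__name__:>12}: {lib.__version__}")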

Starting Jupyter Notebook

# Launch Jupyter Notebook
jupyter notebook

# Or Jupyter Lab (modern interface)
jupyter lab

# This will open your browser with the Jupyter interface

NumPy Fundamentals

NumPy (Numerical Python) is the foundation of data science in Python. It provides support for large multi-dimensional arrays and matrices, along with mathematical functions.

Creating Arrays

import numpy as np

# Creating arrays from lists
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print(f"1D Array: {arr1}")
print(f"2D Array:\n{arr2}")
print(f"Shape: {arr2.shape}")
print(f"Data type: {arr1.dtype}")

# Creating special arrays
zeros = np.zeros((3, 4))          # Array of zeros
ones = np.ones((2, 3))            # Array of ones
identity = np.eye(3)              # Identity matrix
random_array = np.random.rand(3, 3)  # Random values 0-1

# Creating ranges
range_array = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)    # 5 evenly spaced numbers 0-1

Array Operations

# Mathematical operations
arr = np.array([1, 2, 3, 4, 5])

# Element-wise operations
squared = arr ** 2              # [1, 4, 9, 16, 25]
added = arr + 10                # [11, 12, 13, 14, 15]
multiplied = arr * 2            # [2, 4, 6, 8, 10]

# Array operations
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

sum_arrays = arr1 + arr2        # [5, 7, 9]
dot_product = np.dot(arr1, arr2)  # 32

# Statistical operations
data = np.random.randn(100)  # 100 samples from a standard normal distribution
print(f"Mean: {np.mean(data):.2f}")
print(f"Standard deviation: {np.std(data):.2f}")
print(f"Min: {np.min(data):.2f}, Max: {np.max(data):.2f}")

# Boolean indexing
positive_values = data[data > 0]   # Keep only the positive values
print(f"Positive values: {len(positive_values)}")

Pandas for Data Manipulation

Pandas is the most important library for data manipulation and analysis. It provides data structures like DataFrame and Series that make working with structured data intuitive.
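
Before building full DataFrames, here is a minimal Series sketch (the values and labels are purely illustrative) showing the one-dimensional labeled structure that every DataFrame column is built from:

import pandas as pd

# A Series is a one-dimensional labeled array; each DataFrame column is a Series
prices = pd.Series([3.50, 4.25, 2.75], index=['coffee', 'latte', 'tea'], name='price')

print(prices)
print(prices['latte'])   # Label-based access
print(prices.mean())     # Aggregations work directly on a Series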

Creating DataFrames

import pandas as pd
import numpy as np

# Creating DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney'],
    'Salary': [70000, 80000, 75000, 85000, 90000]
}

df = pd.DataFrame(data)
print(df)

# Basic information about the DataFrame
print(f"\nShape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nBasic statistics:\n{df.describe()}")

# Viewing data
print(f"\nFirst 3 rows:\n{df.head(3)}")
print(f"\nLast 2 rows:\n{df.tail(2)}")
print(f"\nColumn names: {df.columns.tolist()}")

Data Selection and Filtering

# Selecting columns
names = df['Name']                    # Single column (Series)
subset = df[['Name', 'Age']]         # Multiple columns (DataFrame)

# Selecting rows
first_row = df.iloc[0]               # By position
first_three = df.iloc[0:3]          # Slice by position
over_30 = df.loc[df['Age'] > 30]     # Filter rows with a boolean condition via .loc

# Filtering data
high_earners = df[df['Salary'] > 80000]
young_and_rich = df[(df['Age'] < 30) & (df['Salary'] > 75000)]

# Adding new columns
df['Salary_USD'] = df['Salary']
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
df['Name_Length'] = df['Name'].str.len()

print(df)

Data Aggregation and Grouping

# Reading real data (example with CSV)
# df = pd.read_csv('sales_data.csv')

# Simulating sales data
np.random.seed(42)
sales_data = pd.DataFrame({
    'Product': np.random.choice(['A', 'B', 'C'], 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Sales': np.random.randint(100, 1000, 100),
    'Month': np.random.choice(range(1, 13), 100)
})

# Group by operations
product_sales = sales_data.groupby('Product')['Sales'].sum()
region_stats = sales_data.groupby('Region').agg({
    'Sales': ['mean', 'sum', 'count']
})

print("Sales by Product:")
print(product_sales)
print("\nRegion Statistics:")
print(region_stats)

# Pivot tables
pivot = sales_data.pivot_table(
    values='Sales',
    index='Product',
    columns='Region',
    aggfunc='mean'
)
print("\nPivot Table:")
print(pivot)
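
In a real project, a table like this would usually come from a file rather than be simulated. A minimal round trip (using sales_data.csv as an illustrative filename) writes the simulated data out and reads it back with pandas:

# Write the simulated data to a CSV file, then load it back as a real project would
sales_data.to_csv('sales_data.csv', index=False)

loaded = pd.read_csv('sales_data.csv')
print(loaded.head())
print(loaded.dtypes)   # read_csv infers column types; check them after loading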

Data Visualization

Visualization is crucial for understanding data patterns and communicating insights. We'll use matplotlib and seaborn for creating compelling visualizations.

Matplotlib Basics

import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better-looking plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Basic line plot
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y1, label='sin(x)', linewidth=2)
plt.plot(x, y2, label='cos(x)', linewidth=2)
plt.title('Trigonometric Functions')
plt.xlabel('x values')
plt.ylabel('y values')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['Age'], df['Salary'], c=df['Salary'], cmap='viridis', s=100)
plt.colorbar(label='Salary')
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()

Advanced Visualizations with Seaborn

# Create sample dataset
tips = sns.load_dataset('tips')  # Built-in seaborn dataset

# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Histogram
sns.histplot(data=tips, x='total_bill', bins=20, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Total Bill')

# Box plot
sns.boxplot(data=tips, x='day', y='total_bill', ax=axes[0, 1])
axes[0, 1].set_title('Total Bill by Day')

# Correlation heatmap
correlation_matrix = tips.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', ax=axes[1, 0])
axes[1, 0].set_title('Correlation Matrix')

# Scatter plot with regression line
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', ax=axes[1, 1])
sns.regplot(data=tips, x='total_bill', y='tip', scatter=False, ax=axes[1, 1])
axes[1, 1].set_title('Tips vs Total Bill')

plt.tight_layout()
plt.show()

# Advanced plot: Pairplot (pairplot creates its own figure, so no plt.figure() call is needed)
sns.pairplot(tips, hue='time', vars=['total_bill', 'tip', 'size'])
plt.show()

Data Cleaning & Preprocessing

Real-world data is messy, and cleaning it up is commonly said to take the majority of a data scientist's time. Let's learn how to handle missing values, outliers, and data inconsistencies.

# Create messy dataset
messy_data = pd.DataFrame({
    'Name': ['John Doe', 'jane smith', 'BOB JOHNSON', None, 'Alice Brown'],
    'Age': [25, None, 35, 28, 150],  # 150 is an outlier
    'Email': ['john@email.com', 'JANE@EMAIL.COM', 'bob@email', None, 'alice@email.com'],
    'Salary': [50000, 60000, None, 55000, 70000],
    'Date_Joined': ['2020-01-15', '2019/12/20', '2021-03-01', '2020-06-30', None]
})

print("Original messy data:")
print(messy_data)
print(f"\nMissing values:\n{messy_data.isnull().sum()}")

# Data cleaning steps
cleaned_data = messy_data.copy()

# 1. Handle missing values
# Fill missing ages with the median (assign back rather than using inplace=True,
# which is deprecated for chained column access in recent pandas versions)
cleaned_data['Age'] = cleaned_data['Age'].fillna(cleaned_data['Age'].median())

# Fill missing salaries with the mean
cleaned_data['Salary'] = cleaned_data['Salary'].fillna(cleaned_data['Salary'].mean())

# Drop rows with missing names (critical field)
cleaned_data = cleaned_data.dropna(subset=['Name'])

# 2. Fix data types and formats
# Standardize names
cleaned_data['Name'] = cleaned_data['Name'].str.title()

# Fix email formats
cleaned_data['Email'] = cleaned_data['Email'].str.lower()

# Convert dates
cleaned_data['Date_Joined'] = pd.to_datetime(
    cleaned_data['Date_Joined'], 
    errors='coerce'
)

# 3. Handle outliers
# Remove unrealistic ages (assuming max reasonable age is 100)
cleaned_data = cleaned_data[cleaned_data['Age'] <= 100]

# 4. Add derived columns
cleaned_data['Years_Experience'] = (
    pd.Timestamp.now() - cleaned_data['Date_Joined']
).dt.days / 365.25

print("\nCleaned data:")
print(cleaned_data)
print(f"\nRemaining missing values:\n{cleaned_data.isnull().sum()}")

Machine Learning Basics

Machine learning allows us to build predictive models from data. We'll cover supervised learning with classification and regression examples.

Classification Example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# Load the famous Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred = clf.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nFeature Importance:")
print(feature_importance)

Regression Example

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# load_boston was removed in scikit-learn 1.2, so import it inside the try block
# and fall back to the California housing dataset on newer versions
try:
    from sklearn.datasets import load_boston
    boston = load_boston()
    X, y = boston.data, boston.target
    feature_names = boston.feature_names
except ImportError:
    from sklearn.datasets import fetch_california_housing
    california = fetch_california_housing()
    X, y = california.data, california.target
    feature_names = california.feature_names

# Note: train_test_split and StandardScaler were imported in the classification example above

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a linear regression model
regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)

# Make predictions
y_pred = regressor.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Visualize predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()

Real-World Project: Sales Analysis

Let's put everything together in a complete data science project analyzing sales data.

# Complete Sales Analysis Project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Generate synthetic sales data
np.random.seed(42)
n_records = 1000

sales_data = pd.DataFrame({
    'Date': pd.date_range('2022-01-01', periods=n_records, freq='D'),
    'Product_Category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'Marketing_Spend': np.random.uniform(100, 1000, n_records),
    'Temperature': np.random.normal(20, 10, n_records),
    'Weekend': np.random.choice([0, 1], n_records, p=[0.7, 0.3])
})

# Create sales with some realistic patterns
base_sales = 100 + sales_data['Marketing_Spend'] * 0.5
seasonal_effect = 50 * np.sin(2 * np.pi * sales_data.index / 365)
category_effect = sales_data['Product_Category'].map({
    'Electronics': 100, 'Clothing': 80, 'Books': 40, 'Home': 60
})
weekend_effect = sales_data['Weekend'] * 30
noise = np.random.normal(0, 20, n_records)

sales_data['Sales'] = base_sales + seasonal_effect + category_effect + weekend_effect + noise

# Data Analysis
print("Sales Data Analysis")
print("=" * 50)
print(f"Dataset shape: {sales_data.shape}")
print(f"\nBasic statistics:")
print(sales_data.describe())

# Time series analysis
sales_data['Month'] = sales_data['Date'].dt.month
monthly_sales = sales_data.groupby('Month')['Sales'].mean()

plt.figure(figsize=(15, 10))

# Monthly sales trend
plt.subplot(2, 2, 1)
monthly_sales.plot(kind='bar')
plt.title('Average Sales by Month')
plt.xlabel('Month')
plt.ylabel('Average Sales')

# Sales by category
plt.subplot(2, 2, 2)
sns.boxplot(data=sales_data, x='Product_Category', y='Sales')
plt.title('Sales Distribution by Category')
plt.xticks(rotation=45)

# Marketing spend vs sales
plt.subplot(2, 2, 3)
plt.scatter(sales_data['Marketing_Spend'], sales_data['Sales'], alpha=0.6)
plt.xlabel('Marketing Spend')
plt.ylabel('Sales')
plt.title('Marketing Spend vs Sales')

# Weekend effect
plt.subplot(2, 2, 4)
weekend_sales = sales_data.groupby('Weekend')['Sales'].mean()
weekend_sales.plot(kind='bar')
plt.title('Average Sales: Weekday vs Weekend')
plt.xlabel('Weekend (0=Weekday, 1=Weekend)')
plt.ylabel('Average Sales')

plt.tight_layout()
plt.show()

# Predictive Modeling
print("\nPredictive Modeling")
print("=" * 30)

# Prepare features for machine learning (one-hot encode the categorical columns)
feature_cols = ['Marketing_Spend', 'Temperature', 'Weekend', 'Month', 'Product_Category', 'Region']
features_encoded = pd.get_dummies(sales_data[feature_cols])

X = features_encoded
y = sales_data['Sales']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)

print(f"Mean Absolute Error: {mae:.2f}")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False).head(10)

print(f"\nTop 10 Most Important Features:")
print(feature_importance)

# Business insights
print("\nBusiness Insights")
print("=" * 20)
print(f"1. Average daily sales: $\{sales_data['Sales'].mean():.2f\}")
print(f"2. Best performing category: \{sales_data.groupby('Product_Category')['Sales'].mean().idxmax()\}")
print(f"3. Marketing ROI: $\{(sales_data['Sales'].sum() / sales_data['Marketing_Spend'].sum()):.2f\} per dollar spent")
print(f"4. Weekend boost: \{((weekend_sales[1] - weekend_sales[0]) / weekend_sales[0] * 100):.1f\}% increase")

Next Steps in Your Data Science Journey

Congratulations! You've completed a comprehensive introduction to Python for data science. You now have the foundation to tackle real-world data problems.

What You've Learned

  • NumPy for numerical computing and array operations
  • Pandas for data manipulation and analysis
  • Matplotlib and Seaborn for data visualization
  • Data cleaning and preprocessing techniques
  • Machine learning fundamentals with scikit-learn
  • Complete end-to-end data science project

Advanced Topics to Explore

Deep Learning

  • TensorFlow and PyTorch
  • Neural networks
  • Computer vision
  • Natural language processing

Big Data & Cloud

  • Apache Spark with PySpark
  • AWS/GCP data services
  • Distributed computing
  • Real-time data processing

Recommended Resources

  • Books: "Python for Data Analysis" by Wes McKinney, "Hands-On Machine Learning" by Aurélien Géron
  • Online Courses: Coursera Machine Learning Course, edX MIT Introduction to Computer Science
  • Practice Platforms: Kaggle competitions, Google Colab notebooks
  • Communities: r/MachineLearning, Stack Overflow, GitHub

Keep practicing with real datasets, participate in Kaggle competitions, and don't hesitate to tackle challenging problems. Data science is a field where hands-on experience is invaluable!