Introduction to Data Science
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Python has become the go-to language for data science due to its simplicity, extensive libraries, and strong community support. This tutorial will guide you through the essential tools and techniques.
What You'll Learn
- Data manipulation with pandas
- Numerical computing with NumPy
- Data visualization with matplotlib and seaborn
- Statistical analysis techniques
- Machine learning with scikit-learn
- Real-world data science project
Setting Up Your Environment
Before diving into data science, let's set up a proper Python environment with all necessary libraries.
Installing Anaconda
Anaconda is the easiest way to get started with data science in Python. It bundles Python, Jupyter notebooks, and most of the popular data science libraries.
# Download Anaconda from https://www.anaconda.com/products/distribution
# Or install with package manager:
# On macOS with Homebrew:
brew install --cask anaconda
# On Windows with Chocolatey:
choco install anaconda3
# On Linux: download the installer for the version you want from
# https://repo.anaconda.com/archive/ (there is no "latest" filename), e.g.:
wget https://repo.anaconda.com/archive/Anaconda3-<version>-Linux-x86_64.sh
bash Anaconda3-<version>-Linux-x86_64.sh
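However you install it, it's good practice to keep each project in its own isolated environment rather than working in the base installation. A minimal sketch (the environment name ds-tutorial and the Python version are just examples):
# Create and activate a dedicated environment for this tutorial
conda create -n ds-tutorial python=3.11
conda activate ds-tutorial
# Deactivate it when you're done
conda deactivate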
Essential Libraries
# Install additional libraries if needed
conda install pandas numpy matplotlib seaborn scikit-learn jupyter
# Or with pip:
pip install pandas numpy matplotlib seaborn scikit-learn jupyter plotly
# For interactive notebooks:
pip install ipywidgets
Starting Jupyter Notebook
# Launch Jupyter Notebook
jupyter notebook
# Or Jupyter Lab (modern interface)
jupyter lab
# This will open your browser with the Jupyter interface
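Once Jupyter opens, a quick sanity-check cell confirms that the core libraries are importable and shows which versions you have (a minimal sketch; your version numbers will differ):
import sys
import numpy, pandas, matplotlib, seaborn, sklearn
print(f"Python: {sys.version.split()[0]}")
print(f"NumPy: {numpy.__version__}")
print(f"pandas: {pandas.__version__}")
print(f"matplotlib: {matplotlib.__version__}")
print(f"seaborn: {seaborn.__version__}")
print(f"scikit-learn: {sklearn.__version__}")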
NumPy Fundamentals
NumPy (Numerical Python) is the foundation of data science in Python. It provides support for large multi-dimensional arrays and matrices, along with mathematical functions.
Creating Arrays
import numpy as np
# Creating arrays from lists
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(f"1D Array: {arr1}")
print(f"2D Array:\n{arr2}")
print(f"Shape: {arr2.shape}")
print(f"Data type: {arr1.dtype}")
# Creating special arrays
zeros = np.zeros((3, 4)) # Array of zeros
ones = np.ones((2, 3)) # Array of ones
identity = np.eye(3) # Identity matrix
random_array = np.random.rand(3, 3) # Random values 0-1
# Creating ranges
range_array = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced numbers 0-1
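Arrays can also be reshaped into a different layout without copying the data, as long as the total number of elements stays the same. A quick sketch:
# Reshape a 1D range into a 3x4 matrix, then flatten it back
matrix = np.arange(12).reshape(3, 4)
print(matrix.shape) # (3, 4)
print(matrix.ravel()) # [ 0 1 2 ... 11]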
Array Operations
# Mathematical operations
arr = np.array([1, 2, 3, 4, 5])
# Element-wise operations
squared = arr ** 2 # [1, 4, 9, 16, 25]
added = arr + 10 # [11, 12, 13, 14, 15]
multiplied = arr * 2 # [2, 4, 6, 8, 10]
# Array operations
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
sum_arrays = arr1 + arr2 # [5, 7, 9]
dot_product = np.dot(arr1, arr2) # 32
# Statistical operations
data = np.random.randn(100) # 100 samples from a standard normal distribution
print(f"Mean: {np.mean(data):.2f}")
print(f"Standard deviation: {np.std(data):.2f}")
print(f"Min: {np.min(data):.2f}, Max: {np.max(data):.2f}")
# Boolean indexing
positive_values = data[data > 0] # Keep only the positive values
print(f"Positive values: {len(positive_values)}")
Pandas for Data Manipulation
Pandas is the most important library for data manipulation and analysis. It provides data structures like DataFrame and Series that make working with structured data intuitive.
Creating DataFrames
import pandas as pd
import numpy as np
# Creating DataFrame from dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [25, 30, 35, 28, 32],
'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney'],
'Salary': [70000, 80000, 75000, 85000, 90000]
}
df = pd.DataFrame(data)
print(df)
# Basic information about the DataFrame
print(f"\nShape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nBasic statistics:\n{df.describe()}")
# Viewing data
print(f"\nFirst 3 rows:\n{df.head(3)}")
print(f"\nLast 2 rows:\n{df.tail(2)}")
print(f"\nColumn names: {df.columns.tolist()}")
Data Selection and Filtering
# Selecting columns
names = df['Name'] # Single column (Series)
subset = df[['Name', 'Age']] # Multiple columns (DataFrame)
# Selecting rows
first_row = df.iloc[0] # By position
first_three = df.iloc[0:3] # Slice by position
over_30 = df.loc[df['Age'] > 30] # Boolean mask with .loc
# Filtering data
high_earners = df[df['Salary'] > 80000]
young_and_rich = df[(df['Age'] < 30) & (df['Salary'] > 75000)]
# Adding new columns
df['Salary_USD'] = df['Salary'] # Plain copy of an existing column (placeholder for a real conversion)
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
df['Name_Length'] = df['Name'].str.len()
print(df)
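Two other everyday operations are sorting rows and counting category frequencies. A short sketch using the same df:
# Sort by salary, highest first
print(df.sort_values('Salary', ascending=False))
# Count how many people fall into each age group
print(df['Age_Group'].value_counts())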
Data Aggregation and Grouping
# Reading real data (example with CSV)
# df = pd.read_csv('sales_data.csv')
# Simulating sales data
np.random.seed(42)
sales_data = pd.DataFrame({
'Product': np.random.choice(['A', 'B', 'C'], 100),
'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
'Sales': np.random.randint(100, 1000, 100),
'Month': np.random.choice(range(1, 13), 100)
})
# Group by operations
product_sales = sales_data.groupby('Product')['Sales'].sum()
region_stats = sales_data.groupby('Region').agg({
'Sales': ['mean', 'sum', 'count']
})
print("Sales by Product:")
print(product_sales)
print("\nRegion Statistics:")
print(region_stats)
# Pivot tables
pivot = sales_data.pivot_table(
values='Sales',
index='Product',
columns='Region',
aggfunc='mean'
)
print("\nPivot Table:")
print(pivot)
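Grouping and pivoting work within a single table; when related data lives in separate tables, you combine them with merge, pandas' SQL-style join. A small sketch with a hypothetical unit-price lookup table (the prices are made up):
# Hypothetical lookup table with one row per product
prices = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Unit_Price': [9.99, 14.99, 4.99]})
# Left join: keep every sales row and attach the matching unit price
merged = sales_data.merge(prices, on='Product', how='left')
print(merged.head())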
Data Visualization
Visualization is crucial for understanding data patterns and communicating insights. We'll use matplotlib and seaborn for creating compelling visualizations.
Matplotlib Basics
import matplotlib.pyplot as plt
import seaborn as sns
# Set style for better-looking plots ('seaborn-v0_8' needs matplotlib >= 3.6; use 'seaborn' on older versions)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
# Basic line plot
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y1, label='sin(x)', linewidth=2)
plt.plot(x, y2, label='cos(x)', linewidth=2)
plt.title('Trigonometric Functions')
plt.xlabel('x values')
plt.ylabel('y values')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['Age'], df['Salary'], c=df['Salary'], cmap='viridis', s=100)
plt.colorbar(label='Salary')
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
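If you want to reuse a plot outside the notebook (in a report or slide deck), save it to a file before calling plt.show(). A brief sketch (the filename is just an example):
plt.figure(figsize=(8, 6))
plt.scatter(df['Age'], df['Salary'])
plt.title('Age vs Salary')
plt.savefig('age_vs_salary.png', dpi=300, bbox_inches='tight') # Write the figure to disk
plt.show()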
Advanced Visualizations with Seaborn
# Create sample dataset
tips = sns.load_dataset('tips') # Built-in seaborn dataset
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Histogram
sns.histplot(data=tips, x='total_bill', bins=20, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Total Bill')
# Box plot
sns.boxplot(data=tips, x='day', y='total_bill', ax=axes[0, 1])
axes[0, 1].set_title('Total Bill by Day')
# Correlation heatmap
correlation_matrix = tips.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', ax=axes[1, 0])
axes[1, 0].set_title('Correlation Matrix')
# Scatter plot with regression line
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', ax=axes[1, 1])
sns.regplot(data=tips, x='total_bill', y='tip', scatter=False, ax=axes[1, 1])
axes[1, 1].set_title('Tips vs Total Bill')
plt.tight_layout()
plt.show()
# Advanced plot: Pairplot (pairplot creates its own figure, so plt.figure() isn't needed)
sns.pairplot(tips, hue='time', vars=['total_bill', 'tip', 'size'], height=2.5)
plt.show()
Data Cleaning & Preprocessing
Real-world data is messy, and cleaning it typically consumes a large share of a data scientist's time. Let's learn how to handle missing values, outliers, and data inconsistencies.
# Create messy dataset
messy_data = pd.DataFrame({
'Name': ['John Doe', 'jane smith', 'BOB JOHNSON', None, 'Alice Brown'],
'Age': [25, None, 35, 28, 150], # 150 is an outlier
'Email': ['john@email.com', 'JANE@EMAIL.COM', 'bob@email', None, 'alice@email.com'],
'Salary': [50000, 60000, None, 55000, 70000],
'Date_Joined': ['2020-01-15', '2019/12/20', '2021-03-01', '2020-06-30', None]
})
print("Original messy data:")
print(messy_data)
print(f"\nMissing values:\n{messy_data.isnull().sum()}")
# Data cleaning steps
cleaned_data = messy_data.copy()
# 1. Handle missing values
# Fill missing ages with the median (assigning back avoids pandas' chained-assignment pitfalls)
cleaned_data['Age'] = cleaned_data['Age'].fillna(cleaned_data['Age'].median())
# Fill missing salaries with the mean
cleaned_data['Salary'] = cleaned_data['Salary'].fillna(cleaned_data['Salary'].mean())
# Drop rows with missing names (critical field)
cleaned_data = cleaned_data.dropna(subset=['Name'])
# 2. Fix data types and formats
# Standardize names
cleaned_data['Name'] = cleaned_data['Name'].str.title()
# Fix email formats
cleaned_data['Email'] = cleaned_data['Email'].str.lower()
# Convert dates
cleaned_data['Date_Joined'] = pd.to_datetime(
cleaned_data['Date_Joined'],
errors='coerce'
)
# 3. Handle outliers
# Remove unrealistic ages (assuming max reasonable age is 100)
cleaned_data = cleaned_data[cleaned_data['Age'] <= 100]
# 4. Add derived columns
cleaned_data['Years_Experience'] = (
pd.Timestamp.now() - cleaned_data['Date_Joined']
).dt.days / 365.25
print("\nCleaned data:")
print(cleaned_data)
print(f"\nRemaining missing values:\n{cleaned_data.isnull().sum()}")
Machine Learning Basics
Machine learning allows us to build predictive models from data. We'll cover supervised learning with classification and regression examples.
Classification Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
# Load the famous Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)
# Make predictions
y_pred = clf.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Feature importance
feature_importance = pd.DataFrame({
'feature': iris.feature_names,
'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
print(f"\nFeature Importance:")
print(feature_importance)
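Accuracy from a single train/test split depends on which rows happened to land in the test set. Cross-validation repeats the split several times and averages the results, giving a more stable estimate. A minimal sketch on the same Iris data:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# A pipeline re-fits the scaler inside each fold, avoiding data leakage
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42))
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")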
Regression Example
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# The Boston housing dataset was removed in scikit-learn 1.2, so import it inside the
# try block and fall back to the California housing dataset when it isn't available
try:
    from sklearn.datasets import load_boston
    boston = load_boston()
    X, y = boston.data, boston.target
    feature_names = boston.feature_names
except ImportError:
    from sklearn.datasets import fetch_california_housing
    california = fetch_california_housing()
    X, y = california.data, california.target
    feature_names = california.feature_names
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train a linear regression model
regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)
# Make predictions
y_pred = regressor.predict(X_test_scaled)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
# Visualize predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()
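An R² score is easier to interpret next to a naive baseline. A quick sketch comparing the linear model against a dummy regressor that always predicts the mean of the training targets:
from sklearn.dummy import DummyRegressor
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train_scaled, y_train)
baseline_r2 = r2_score(y_test, baseline.predict(X_test_scaled))
print(f"Baseline R²: {baseline_r2:.2f} vs linear model R²: {r2:.2f}")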
Real-World Project: Sales Analysis
Let's put everything together in a complete data science project analyzing sales data.
# Complete Sales Analysis Project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Generate synthetic sales data
np.random.seed(42)
n_records = 1000
sales_data = pd.DataFrame({
'Date': pd.date_range('2022-01-01', periods=n_records, freq='D'),
'Product_Category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),
'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
'Marketing_Spend': np.random.uniform(100, 1000, n_records),
'Temperature': np.random.normal(20, 10, n_records),
'Weekend': np.random.choice([0, 1], n_records, p=[0.7, 0.3])
})
# Create sales with some realistic patterns
base_sales = 100 + sales_data['Marketing_Spend'] * 0.5
seasonal_effect = 50 * np.sin(2 * np.pi * sales_data.index / 365)
category_effect = sales_data['Product_Category'].map({
'Electronics': 100, 'Clothing': 80, 'Books': 40, 'Home': 60
})
weekend_effect = sales_data['Weekend'] * 30
noise = np.random.normal(0, 20, n_records)
sales_data['Sales'] = base_sales + seasonal_effect + category_effect + weekend_effect + noise
# Data Analysis
print("Sales Data Analysis")
print("=" * 50)
print(f"Dataset shape: {sales_data.shape}")
print(f"\nBasic statistics:")
print(sales_data.describe())
# Time series analysis
sales_data['Month'] = sales_data['Date'].dt.month
monthly_sales = sales_data.groupby('Month')['Sales'].mean()
plt.figure(figsize=(15, 10))
# Monthly sales trend
plt.subplot(2, 2, 1)
monthly_sales.plot(kind='bar')
plt.title('Average Sales by Month')
plt.xlabel('Month')
plt.ylabel('Average Sales')
# Sales by category
plt.subplot(2, 2, 2)
sns.boxplot(data=sales_data, x='Product_Category', y='Sales')
plt.title('Sales Distribution by Category')
plt.xticks(rotation=45)
# Marketing spend vs sales
plt.subplot(2, 2, 3)
plt.scatter(sales_data['Marketing_Spend'], sales_data['Sales'], alpha=0.6)
plt.xlabel('Marketing Spend')
plt.ylabel('Sales')
plt.title('Marketing Spend vs Sales')
# Weekend effect
plt.subplot(2, 2, 4)
weekend_sales = sales_data.groupby('Weekend')['Sales'].mean()
weekend_sales.plot(kind='bar')
plt.title('Average Sales: Weekday vs Weekend')
plt.xlabel('Weekend (0=Weekday, 1=Weekend)')
plt.ylabel('Average Sales')
plt.tight_layout()
plt.show()
# Predictive Modeling
print("\nPredictive Modeling")
print("=" * 30)
# Prepare features for machine learning (one-hot encode the categorical columns)
feature_columns = ['Marketing_Spend', 'Temperature', 'Weekend', 'Month', 'Product_Category', 'Region']
features_encoded = pd.get_dummies(sales_data[feature_columns])
X = features_encoded
y = sales_data['Sales']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
# Feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False).head(10)
print(f"\nTop 10 Most Important Features:")
print(feature_importance)
# Business insights
print("\nBusiness Insights")
print("=" * 20)
print(f"1. Average daily sales: $\{sales_data['Sales'].mean():.2f\}")
print(f"2. Best performing category: \{sales_data.groupby('Product_Category')['Sales'].mean().idxmax()\}")
print(f"3. Marketing ROI: $\{(sales_data['Sales'].sum() / sales_data['Marketing_Spend'].sum()):.2f\} per dollar spent")
print(f"4. Weekend boost: \{((weekend_sales[1] - weekend_sales[0]) / weekend_sales[0] * 100):.1f\}% increase")
Next Steps in Your Data Science Journey
Congratulations! You've completed a comprehensive introduction to Python for data science. You now have the foundation to tackle real-world data problems.
What You've Learned
- NumPy for numerical computing and array operations
- Pandas for data manipulation and analysis
- Matplotlib and Seaborn for data visualization
- Data cleaning and preprocessing techniques
- Machine learning fundamentals with scikit-learn
- Complete end-to-end data science project
Advanced Topics to Explore
Deep Learning
- TensorFlow and PyTorch
- Neural networks
- Computer vision
- Natural language processing
Big Data & Cloud
- Apache Spark with PySpark
- AWS/GCP data services
- Distributed computing
- Real-time data processing
Recommended Resources
- Books: "Python for Data Analysis" by Wes McKinney, "Hands-On Machine Learning" by Aurélien Géron
- Online Courses: Coursera Machine Learning Course, edX MIT Introduction to Computer Science
- Practice Platforms: Kaggle competitions, Google Colab notebooks
- Communities: r/MachineLearning, Stack Overflow, GitHub
Keep practicing with real datasets, participate in Kaggle competitions, and don't hesitate to tackle challenging problems. Data science is a field where hands-on experience is invaluable!