Introduction to Data Science
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Python has become the go-to language for data science due to its simplicity, extensive libraries, and strong community support. This tutorial will guide you through the essential tools and techniques.
What You'll Learn
- Data manipulation with pandas
- Numerical computing with NumPy
- Data visualization with matplotlib and seaborn
- Statistical analysis techniques
- Machine learning with scikit-learn
- Real-world data science project
Setting Up Your Environment
Before diving into data science, let's set up a proper Python environment with all necessary libraries.
Installing Anaconda
Anaconda is the easiest way to get started with data science in Python. It bundles Python, Jupyter notebooks, and most of the popular data science libraries.
# Download Anaconda from https://www.anaconda.com/products/distribution
# Or install with package manager:
# On macOS with Homebrew:
brew install --cask anaconda
# On Windows with Chocolatey:
choco install anaconda3
# On Linux: download the installer for the version you want from
# https://repo.anaconda.com/archive/ (there is no "latest" filename), e.g.:
wget https://repo.anaconda.com/archive/Anaconda3-<version>-Linux-x86_64.sh
bash Anaconda3-<version>-Linux-x86_64.sh
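However you install it, it's good practice to keep each project in its own isolated environment rather than working in the base installation. A minimal sketch (the environment name ds-tutorial and the Python version are just examples):
# Create and activate a dedicated environment for this tutorial
conda create -n ds-tutorial python=3.11
conda activate ds-tutorial
# Deactivate it when you're done
conda deactivate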
Essential Libraries
# Install additional libraries if needed
conda install pandas numpy matplotlib seaborn scikit-learn jupyter
# Or with pip:
pip install pandas numpy matplotlib seaborn scikit-learn jupyter plotly
# For interactive notebooks:
pip install ipywidgets
Starting Jupyter Notebook
# Launch Jupyter Notebook
jupyter notebook
# Or Jupyter Lab (modern interface)
jupyter lab
# This will open your browser with the Jupyter interface
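Once Jupyter opens, a quick sanity-check cell confirms that the core libraries are importable and shows which versions you have (a minimal sketch; your version numbers will differ):
import sys
import numpy, pandas, matplotlib, seaborn, sklearn
print(f"Python: {sys.version.split()[0]}")
print(f"NumPy: {numpy.__version__}")
print(f"pandas: {pandas.__version__}")
print(f"matplotlib: {matplotlib.__version__}")
print(f"seaborn: {seaborn.__version__}")
print(f"scikit-learn: {sklearn.__version__}")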
NumPy Fundamentals
NumPy (Numerical Python) is the foundation of data science in Python. It provides support for large multi-dimensional arrays and matrices, along with mathematical functions.
Creating Arrays
import numpy as np
# Creating arrays from lists
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(f"1D Array: {arr1}")
print(f"2D Array:\n{arr2}")
print(f"Shape: {arr2.shape}")
print(f"Data type: {arr1.dtype}")
# Creating special arrays
zeros = np.zeros((3, 4)) # Array of zeros
ones = np.ones((2, 3)) # Array of ones
identity = np.eye(3) # Identity matrix
random_array = np.random.rand(3, 3) # Random values 0-1
# Creating ranges
range_array = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced numbers 0-1
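Arrays can also be reshaped into a different layout without copying the data, as long as the total number of elements stays the same. A quick sketch:
# Reshape a 1D range into a 3x4 matrix, then flatten it back
matrix = np.arange(12).reshape(3, 4)
print(matrix.shape) # (3, 4)
print(matrix.ravel()) # [ 0 1 2 ... 11]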
Array Operations
# Mathematical operations
arr = np.array([1, 2, 3, 4, 5])
# Element-wise operations
squared = arr ** 2 # [1, 4, 9, 16, 25]
added = arr + 10 # [11, 12, 13, 14, 15]
multiplied = arr * 2 # [2, 4, 6, 8, 10]
# Array operations
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
sum_arrays = arr1 + arr2 # [5, 7, 9]
dot_product = np.dot(arr1, arr2) # 32
# Statistical operations
data = np.random.randn(100) # 100 samples from a standard normal distribution
print(f"Mean: {np.mean(data):.2f}")
print(f"Standard deviation: {np.std(data):.2f}")
print(f"Min: {np.min(data):.2f}, Max: {np.max(data):.2f}")
# Boolean indexing
positive_values = data[data > 0] # Keep only the positive values
print(f"Positive values: {len(positive_values)}")
Pandas for Data Manipulation
Pandas is the most important library for data manipulation and analysis. It provides data structures like DataFrame and Series that make working with structured data intuitive.
Creating DataFrames
import pandas as pd
import numpy as np
# Creating DataFrame from dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [25, 30, 35, 28, 32],
'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney'],
'Salary': [70000, 80000, 75000, 85000, 90000]
}
df = pd.DataFrame(data)
print(df)
# Basic information about the DataFrame
print(f"\nShape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nBasic statistics:\n{df.describe()}")
# Viewing data
print(f"\nFirst 3 rows:\n{df.head(3)}")
print(f"\nLast 2 rows:\n{df.tail(2)}")
print(f"\nColumn names: {df.columns.tolist()}")
Data Selection and Filtering
# Selecting columns
names = df['Name'] # Single column (Series)
subset = df[['Name', 'Age']] # Multiple columns (DataFrame)
# Selecting rows
first_row = df.iloc[0] # By position
first_three = df.iloc[0:3] # Slice by position
over_30 = df.loc[df['Age'] > 30] # Boolean mask with .loc
# Filtering data
high_earners = df[df['Salary'] > 80000]
young_and_rich = df[(df['Age'] < 30) & (df['Salary'] > 75000)]
# Adding new columns
df['Salary_USD'] = df['Salary'] # Plain copy of an existing column (placeholder for a real conversion)
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
df['Name_Length'] = df['Name'].str.len()
print(df)
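Two other everyday operations are sorting rows and counting category frequencies. A short sketch using the same df:
# Sort by salary, highest first
print(df.sort_values('Salary', ascending=False))
# Count how many people fall into each age group
print(df['Age_Group'].value_counts())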
Data Aggregation and Grouping
# Reading real data (example with CSV)
# df = pd.read_csv('sales_data.csv')
# Simulating sales data
np.random.seed(42)
sales_data = pd.DataFrame({
'Product': np.random.choice(['A', 'B', 'C'], 100),
'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
'Sales': np.random.randint(100, 1000, 100),
'Month': np.random.choice(range(1, 13), 100)
})
# Group by operations
product_sales = sales_data.groupby('Product')['Sales'].sum()
region_stats = sales_data.groupby('Region').agg({
'Sales': ['mean', 'sum', 'count']
})
print("Sales by Product:")
print(product_sales)
print("\nRegion Statistics:")
print(region_stats)
# Pivot tables
pivot = sales_data.pivot_table(
values='Sales',
index='Product',
columns='Region',
aggfunc='mean'
)
print("\nPivot Table:")
print(pivot)
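Grouping and pivoting work within a single table; when related data lives in separate tables, you combine them with merge, pandas' SQL-style join. A small sketch with a hypothetical unit-price lookup table (the prices are made up):
# Hypothetical lookup table with one row per product
prices = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Unit_Price': [9.99, 14.99, 4.99]})
# Left join: keep every sales row and attach the matching unit price
merged = sales_data.merge(prices, on='Product', how='left')
print(merged.head())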
Data Visualization
Visualization is crucial for understanding data patterns and communicating insights. We'll use matplotlib and seaborn for creating compelling visualizations.
Matplotlib Basics
import matplotlib.pyplot as plt
import seaborn as sns
# Set style for better-looking plots ('seaborn-v0_8' needs matplotlib >= 3.6; use 'seaborn' on older versions)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
# Basic line plot
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
plt.figure(figsize=(10, 6))
plt.plot(x, y1, label='sin(x)', linewidth=2)
plt.plot(x, y2, label='cos(x)', linewidth=2)
plt.title('Trigonometric Functions')
plt.xlabel('x values')
plt.ylabel('y values')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['Age'], df['Salary'], c=df['Salary'], cmap='viridis', s=100)
plt.colorbar(label='Salary')
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
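If you want to reuse a plot outside the notebook (in a report or slide deck), save it to a file before calling plt.show(). A brief sketch (the filename is just an example):
plt.figure(figsize=(8, 6))
plt.scatter(df['Age'], df['Salary'])
plt.title('Age vs Salary')
plt.savefig('age_vs_salary.png', dpi=300, bbox_inches='tight') # Write the figure to disk
plt.show()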
Advanced Visualizations with Seaborn
# Create sample dataset
tips = sns.load_dataset('tips') # Built-in seaborn dataset
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Histogram
sns.histplot(data=tips, x='total_bill', bins=20, ax=axes[0, 0])
axes[0, 0].set_title('Distribution of Total Bill')
# Box plot
sns.boxplot(data=tips, x='day', y='total_bill', ax=axes[0, 1])
axes[0, 1].set_title('Total Bill by Day')
# Correlation heatmap
correlation_matrix = tips.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', ax=axes[1, 0])
axes[1, 0].set_title('Correlation Matrix')
# Scatter plot with regression line
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', ax=axes[1, 1])
sns.regplot(data=tips, x='total_bill', y='tip', scatter=False, ax=axes[1, 1])
axes[1, 1].set_title('Tips vs Total Bill')
plt.tight_layout()
plt.show()
# Advanced plot: Pairplot (pairplot creates its own figure, so plt.figure() isn't needed)
sns.pairplot(tips, hue='time', vars=['total_bill', 'tip', 'size'], height=2.5)
plt.show()
Data Cleaning & Preprocessing
Real-world data is messy, and cleaning it typically consumes a large share of a data scientist's time. Let's learn how to handle missing values, outliers, and data inconsistencies.
# Create messy dataset
messy_data = pd.DataFrame({
'Name': ['John Doe', 'jane smith', 'BOB JOHNSON', None, 'Alice Brown'],
'Age': [25, None, 35, 28, 150], # 150 is an outlier
'Email': ['john@email.com', 'JANE@EMAIL.COM', 'bob@email', None, 'alice@email.com'],
'Salary': [50000, 60000, None, 55000, 70000],
'Date_Joined': ['2020-01-15', '2019/12/20', '2021-03-01', '2020-06-30', None]
})
print("Original messy data:")
print(messy_data)
print(f"\nMissing values:\n{messy_data.isnull().sum()}")
# Data cleaning steps
cleaned_data = messy_data.copy()
# 1. Handle missing values
# Fill missing ages with the median (assigning back avoids pandas' chained-assignment pitfalls)
cleaned_data['Age'] = cleaned_data['Age'].fillna(cleaned_data['Age'].median())
# Fill missing salaries with the mean
cleaned_data['Salary'] = cleaned_data['Salary'].fillna(cleaned_data['Salary'].mean())
# Drop rows with missing names (critical field)
cleaned_data = cleaned_data.dropna(subset=['Name'])
# 2. Fix data types and formats
# Standardize names
cleaned_data['Name'] = cleaned_data['Name'].str.title()
# Fix email formats
cleaned_data['Email'] = cleaned_data['Email'].str.lower()
# Convert dates
cleaned_data['Date_Joined'] = pd.to_datetime(
cleaned_data['Date_Joined'],
errors='coerce'
)
# 3. Handle outliers
# Remove unrealistic ages (assuming max reasonable age is 100)
cleaned_data = cleaned_data[cleaned_data['Age'] <= 100]
# 4. Add derived columns
cleaned_data['Years_Experience'] = (
pd.Timestamp.now() - cleaned_data['Date_Joined']
).dt.days / 365.25
print("\nCleaned data:")
print(cleaned_data)
print(f"\nRemaining missing values:\n{cleaned_data.isnull().sum()}")
Machine Learning Basics
Machine learning allows us to build predictive models from data. We'll cover supervised learning with classification and regression examples.
Classification Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
# Load the famous Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)
# Make predictions
y_pred = clf.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Feature importance
feature_importance = pd.DataFrame({
'feature': iris.feature_names,
'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
print(f"\nFeature Importance:")
print(feature_importance)
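Accuracy from a single train/test split depends on which rows happened to land in the test set. Cross-validation repeats the split several times and averages the results, giving a more stable estimate. A minimal sketch on the same Iris data:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# A pipeline re-fits the scaler inside each fold, avoiding data leakage
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42))
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")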
Regression Example
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# The Boston housing dataset was removed in scikit-learn 1.2, so import it inside the
# try block and fall back to the California housing dataset when it isn't available
try:
    from sklearn.datasets import load_boston
    boston = load_boston()
    X, y = boston.data, boston.target
    feature_names = boston.feature_names
except ImportError:
    from sklearn.datasets import fetch_california_housing
    california = fetch_california_housing()
    X, y = california.data, california.target
    feature_names = california.feature_names
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train a linear regression model
regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)
# Make predictions
y_pred = regressor.predict(X_test_scaled)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
# Visualize predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()
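An R² score is easier to interpret next to a naive baseline. A quick sketch comparing the linear model against a dummy regressor that always predicts the mean of the training targets:
from sklearn.dummy import DummyRegressor
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train_scaled, y_train)
baseline_r2 = r2_score(y_test, baseline.predict(X_test_scaled))
print(f"Baseline R²: {baseline_r2:.2f} vs linear model R²: {r2:.2f}")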
Real-World Project: Sales Analysis
Let's put everything together in a complete data science project analyzing sales data.
# Complete Sales Analysis Project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Generate synthetic sales data
np.random.seed(42)
n_records = 1000
sales_data = pd.DataFrame({
'Date': pd.date_range('2022-01-01', periods=n_records, freq='D'),
'Product_Category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),
'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
'Marketing_Spend': np.random.uniform(100, 1000, n_records),
'Temperature': np.random.normal(20, 10, n_records),
'Weekend': np.random.choice([0, 1], n_records, p=[0.7, 0.3])
})
# Create sales with some realistic patterns
base_sales = 100 + sales_data['Marketing_Spend'] * 0.5
seasonal_effect = 50 * np.sin(2 * np.pi * sales_data.index / 365)
category_effect = sales_data['Product_Category'].map({
'Electronics': 100, 'Clothing': 80, 'Books': 40, 'Home': 60
})
weekend_effect = sales_data['Weekend'] * 30
noise = np.random.normal(0, 20, n_records)
sales_data['Sales'] = base_sales + seasonal_effect + category_effect + weekend_effect + noise
# Data Analysis
print("Sales Data Analysis")
print("=" * 50)
print(f"Dataset shape: {sales_data.shape}")
print(f"\nBasic statistics:")
print(sales_data.describe())
# Time series analysis
sales_data['Month'] = sales_data['Date'].dt.month
monthly_sales = sales_data.groupby('Month')['Sales'].mean()
plt.figure(figsize=(15, 10))
# Monthly sales trend
plt.subplot(2, 2, 1)
monthly_sales.plot(kind='bar')
plt.title('Average Sales by Month')
plt.xlabel('Month')
plt.ylabel('Average Sales')
# Sales by category
plt.subplot(2, 2, 2)
sns.boxplot(data=sales_data, x='Product_Category', y='Sales')
plt.title('Sales Distribution by Category')
plt.xticks(rotation=45)
# Marketing spend vs sales
plt.subplot(2, 2, 3)
plt.scatter(sales_data['Marketing_Spend'], sales_data['Sales'], alpha=0.6)
plt.xlabel('Marketing Spend')
plt.ylabel('Sales')
plt.title('Marketing Spend vs Sales')
# Weekend effect
plt.subplot(2, 2, 4)
weekend_sales = sales_data.groupby('Weekend')['Sales'].mean()
weekend_sales.plot(kind='bar')
plt.title('Average Sales: Weekday vs Weekend')
plt.xlabel('Weekend (0=Weekday, 1=Weekend)')
plt.ylabel('Average Sales')
plt.tight_layout()
plt.show()
# Predictive Modeling
print("\nPredictive Modeling")
print("=" * 30)
# Prepare features for machine learning (one-hot encode the categorical columns)
feature_columns = ['Marketing_Spend', 'Temperature', 'Weekend', 'Month', 'Product_Category', 'Region']
features_encoded = pd.get_dummies(sales_data[feature_columns])
X = features_encoded
y = sales_data['Sales']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
# Feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False).head(10)
print(f"\nTop 10 Most Important Features:")
print(feature_importance)
# Business insights
print("\nBusiness Insights")
print("=" * 20)
print(f"1. Average daily sales: $\{sales_data['Sales'].mean():.2f\}")
print(f"2. Best performing category: \{sales_data.groupby('Product_Category')['Sales'].mean().idxmax()\}")
print(f"3. Marketing ROI: $\{(sales_data['Sales'].sum() / sales_data['Marketing_Spend'].sum()):.2f\} per dollar spent")
print(f"4. Weekend boost: \{((weekend_sales[1] - weekend_sales[0]) / weekend_sales[0] * 100):.1f\}% increase")
Next Steps in Your Data Science Journey
Congratulations! You've completed a comprehensive introduction to Python for data science. You now have the foundation to tackle real-world data problems.
What You've Learned
- NumPy for numerical computing and array operations
- Pandas for data manipulation and analysis
- Matplotlib and Seaborn for data visualization
- Data cleaning and preprocessing techniques
- Machine learning fundamentals with scikit-learn
- Complete end-to-end data science project
Advanced Topics to Explore
Deep Learning
- TensorFlow and PyTorch
- Neural networks
- Computer vision
- Natural language processing
Big Data & Cloud
- Apache Spark with PySpark
- AWS/GCP data services
- Distributed computing
- Real-time data processing
Recommended Resources
- Books: "Python for Data Analysis" by Wes McKinney, "Hands-On Machine Learning" by Aurélien Géron
- Online Courses: Coursera Machine Learning Course, edX MIT Introduction to Computer Science
- Practice Platforms: Kaggle competitions, Google Colab notebooks
- Communities: r/MachineLearning, Stack Overflow, GitHub
Keep practicing with real datasets, participate in Kaggle competitions, and don't hesitate to tackle challenging problems. Data science is a field where hands-on experience is invaluable!