5 Data Analysis in Python
5.1 Series
A Series is a one-dimensional array-like object provided by pandas that can hold data of any type (integers, floats, strings, Python objects, etc.). It is similar to a column in an Excel spreadsheet or a single-column DataFrame in pandas. Each element in a Series has an associated index, which is used to access individual elements.
5.1.1 Creating a Series
You can create a Series in pandas using various data structures such as lists, dictionaries, or NumPy arrays. You can also specify the index manually.
5.1.1.1 Creating a Series from a List
The simplest way to create a Series is by passing a list to the pd.Series() constructor.
import pandas as pd
# Creating a Series from a list
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)
Output:
0 10
1 20
2 30
3 40
dtype: int64
In this example, the default index starts at 0 and increments by 1.
5.1.1.2 Creating a Series with Custom Index
You can specify custom indices when creating a Series, making it more descriptive.
# Creating a Series with custom indices
data = [100, 200, 300]
index = ['a', 'b', 'c']
custom_series = pd.Series(data, index=index)
print(custom_series)
Output:
a 100
b 200
c 300
dtype: int64
5.1.1.3 Creating a Series from a Dictionary
You can also create a Series from a dictionary, where keys become the indices, and values become the Series values.
# Creating a Series from a dictionary
data_dict = {'apple': 50, 'banana': 30, 'cherry': 20}
fruit_series = pd.Series(data_dict)
print(fruit_series)
Output:
apple 50
banana 30
cherry 20
dtype: int64
5.1.2 Selecting Elements
Selecting elements from a Series can be done using indexing and slicing, similar to Python lists and NumPy arrays. The indexing can be done using integer-based indexing or label-based indexing if a custom index is set.
5.1.2.1 Accessing Single Elements
You can access single elements by using their index position or label.
# Accessing elements by index position
print(series[0]) # Output: 10
# Accessing elements by index label
print(custom_series['b']) # Output: 200
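Square-bracket access mixes positional and label semantics. To make the intent explicit, pandas provides the .iloc (position-based) and .loc (label-based) accessors; a minimal sketch:

```python
import pandas as pd

# A Series with a custom index, as above
custom_series = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

# .iloc is always position-based; .loc is always label-based
print(custom_series.iloc[0])   # 100
print(custom_series.loc['c'])  # 300
```

Preferring these accessors avoids ambiguity when an integer-valued custom index is in play.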
5.1.2.2 Slicing a Series
Slicing allows you to select a range of elements from a Series.
# Slicing using index positions
print(series[1:3]) # Output:
# 1 20
# 2 30
# dtype: int64
# Slicing using index labels
print(custom_series['a':'b']) # Output:
# a 100
# b 200
# dtype: int64
5.1.3 Series Operations
Series in pandas support vectorized operations, which means you can perform operations on entire Series without writing loops. These operations are element-wise, and pandas will align data using the index.
5.1.3.1 Arithmetic Operations
You can perform arithmetic operations like addition, subtraction, multiplication, and division directly on Series.
# Arithmetic operations
series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([10, 20, 30, 40])

# Adding two Series
sum_series = series1 + series2
print(sum_series)
# Output:
# 0 11
# 1 22
# 2 33
# 3 44
# dtype: int64
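Because operations align on index labels rather than positions, adding Series with different indices produces NaN for labels present in only one operand. A small sketch illustrating the alignment mentioned above:

```python
import pandas as pd

# Labels 'b' and 'c' appear in both Series; 'a' and 'd' in only one
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

aligned = s1 + s2
print(aligned)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64
```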
5.1.3.2 Mathematical Functions
You can use mathematical functions like mean(), sum(), max(), and others on Series.
# Using mathematical functions
values = pd.Series([5, 10, 15, 20])
print("Mean:", values.mean()) # Output: Mean: 12.5
print("Sum:", values.sum()) # Output: Sum: 50
print("Max:", values.max()) # Output: Max: 20
5.1.3.3 Applying Functions with apply()
The apply() function allows you to apply a custom function to each element in the Series.
# Applying a custom function to each element
def square(x):
    return x * x
squared_series = values.apply(square)
print(squared_series)
# Output:
# 0 25
# 1 100
# 2 225
# 3 400
# dtype: int64
5.1.3.4 Handling Missing Data
Series can handle missing data (NaN values), and you can manipulate these values using methods like fillna(), dropna(), and isna().
# Series with missing values
data_with_nan = pd.Series([1, 2, None, 4])
print(data_with_nan)
# Filling missing values
filled_series = data_with_nan.fillna(0)
print(filled_series)
# Output:
# 0 1.0
# 1 2.0
# 2 0.0
# 3 4.0
# dtype: float64
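The isna() and dropna() methods mentioned above can be sketched on the same kind of Series:

```python
import pandas as pd

# A Series containing a missing value
data_with_nan = pd.Series([1, 2, None, 4])

print(data_with_nan.isna())    # boolean mask, True where the value is missing
print(data_with_nan.dropna())  # the Series with missing entries removed
```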
5.1.4 Conclusion
Series in pandas is a powerful tool for one-dimensional data manipulation, allowing for efficient data access, arithmetic operations, and handling of missing data. Understanding Series is fundamental to working with more complex pandas data structures like DataFrames.
5.2 DataFrame
A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns) in pandas. Think of it as a table of data, similar to an Excel spreadsheet or a SQL table, which can hold data of various types (strings, integers, floats, etc.).
5.2.1 Creating and Modifying DataFrames
DataFrames can be created from various sources, including lists, dictionaries, NumPy arrays, and even other pandas objects like Series.
5.2.1.1 Creating a DataFrame from a Dictionary
The most common way to create a DataFrame is by using a dictionary, where the keys become column names, and the values are lists representing column data.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
5.2.1.2 Creating a DataFrame from a List of Lists
You can create a DataFrame from a list of lists, specifying column names separately.
# Creating a DataFrame from a list of lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
5.2.1.3 Creating a DataFrame from a NumPy Array
DataFrames can also be created from NumPy arrays. You should specify column names, since pandas otherwise assigns default integer column labels rather than inferring names.
import numpy as np
# Creating a DataFrame from a NumPy array
array_data = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(array_data, columns=['Column1', 'Column2'])
print(df)
5.2.1.4 Modifying a DataFrame
You can add, modify, or delete rows and columns in a DataFrame.
Adding a Column:
# Adding a new column to an existing DataFrame
df['Score'] = [90, 85, 88]
print(df)
Modifying a Column:
# Modifying an existing column
df['Age'] = df['Age'] + 1
print(df)
Deleting a Column:
# Deleting a column
df.drop('Score', axis=1, inplace=True)
print(df)
5.2.2 Selecting Rows and Columns
Selecting data from a DataFrame can be done using indexing, loc, and iloc.
5.2.2.1 Selecting Columns
Columns can be selected using square brackets, dot notation, or loc.
# Selecting a single column
names = df['Name']
print(names)
# Selecting multiple columns
subset = df[['Name', 'City']]
print(subset)
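Dot notation, mentioned above, is a convenient shorthand that works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. A brief sketch:

```python
import pandas as pd

# Dot notation: df.Name is equivalent to df['Name'] for well-behaved names
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

print(df.Name)  # same as df['Name']
```

For column names containing spaces or matching attributes like `count`, stick to bracket notation.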
5.2.2.2 Selecting Rows
Rows can be selected by index position using iloc or by label using loc.
Using iloc (Index Location):
iloc selects rows and columns by integer index positions.
# Selecting rows using iloc
print(df.iloc[0]) # First row
print(df.iloc[1:3]) # Slicing rows
Using loc (Label Location):
loc selects rows and columns by label (index and column names).
# Selecting rows using loc
df.set_index('Name', inplace=True) # Setting 'Name' column as index
print(df.loc['Alice']) # Select row with index 'Alice'
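loc also accepts a column label alongside the row label, selecting individual cells or sub-tables. A small self-contained sketch (using a fresh DataFrame already indexed by name):

```python
import pandas as pd

# loc can take both a row label and a column label: df.loc[row, column]
df = pd.DataFrame(
    {'Age': [25, 30], 'City': ['New York', 'Los Angeles']},
    index=['Alice', 'Bob']
)

print(df.loc['Alice', 'City'])     # a single cell
print(df.loc[['Alice'], ['Age']])  # a 1x1 sub-DataFrame
```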
5.2.3 Filtering Data in DataFrames
Filtering allows you to extract specific rows based on conditions.
Filtering Rows Based on Conditions:
# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
5.2.4 Adding and Removing Rows
You can add rows to a DataFrame using the append()
method and remove rows using drop()
.
5.2.4.1 Adding Rows
# Adding a new row using append() (deprecated; removed in pandas 2.0)
new_row = pd.DataFrame([['David', 28, 'Seattle']], columns=['Name', 'Age', 'City'])
df = df.append(new_row, ignore_index=True)
print(df)
Here’s the revised version using updated pandas methods such as pd.concat() instead of append():
import pandas as pd
# Creating the initial DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Adding a new row using pd.concat()
new_row = pd.DataFrame([['David', 28, 'Seattle']], columns=['Name', 'Age', 'City'])
df = pd.concat([df, new_row], ignore_index=True)
print(df)
5.2.4.2 Removing Rows
# Removing rows by index
df = df.drop(0) # Removes the first row
print(df)
In this updated code, rows are added using pd.concat(), which is the recommended method in newer versions of pandas; the drop() method is still valid and is used to remove rows by their index.
5.2.5 Handling Missing Data in DataFrames
Handling missing data is essential for clean and accurate data analysis.
5.2.5.1 Identifying Missing Data
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 80000]
}
df = pd.DataFrame(data)

# Checking for missing data
print(df.isna()) # Returns a DataFrame of booleans
print(df.isna().sum()) # Sum of missing values in each column
5.2.5.2 Filling Missing Data
You can fill missing data with a specific value, such as 0 or the mean of the column.
# Filling missing values with a specified value
df['Age'].fillna(df['Age'].mean(), inplace=True)
This code triggers a warning related to pandas’ chained-assignment behavior. In future versions, using inplace=True with column-level operations might not work as expected. To avoid the issue and stay compatible with future versions of pandas, do not use inplace=True in this context; instead, assign the result of the fillna() operation directly back to the DataFrame.
Here’s the corrected code:
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 80000]
}
df = pd.DataFrame(data)

# Filling missing values in the 'Salary' column and assigning back to the original DataFrame
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print(df)
5.2.5.3 Dropping Missing Data
Remove rows or columns that contain missing values.
# Dropping rows with missing values
df.dropna(inplace=True)
df
5.2.6 Sorting Data in DataFrames
Sorting can be performed based on one or multiple columns using the sort_values() method.
# Sorting DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
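Sorting by multiple columns works by passing lists to the by= and ascending= parameters; rows are ordered by the first column, with ties broken by the next. A short sketch with hypothetical data:

```python
import pandas as pd

# Sort by City ascending, then by Age descending within each city
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['Boston', 'Boston', 'Chicago', 'Chicago']
})

sorted_df = df.sort_values(by=['City', 'Age'], ascending=[True, False])
print(sorted_df)
```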
5.2.7 Grouping Data (cf. MS Excel ‘PivotTable’)
Grouping data is useful for performing aggregate operations on subsets of data.
5.2.7.1 Option 1: Select Numeric Columns Before Grouping
# Grouping by a column and calculating the mean
import pandas as pd
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'HR']
}
df = pd.DataFrame(data)
print(df)
# Grouping by the 'Department' column and calculating the mean of only numeric columns
grouped = df.groupby('Department')[['Age', 'Salary']].mean()

print(grouped)
5.2.7.2 Option 2: Use numeric_only=True (if available)
import pandas as pd
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'HR']
}
df = pd.DataFrame(data)

# Grouping by the 'Department' column and calculating the mean, ignoring non-numeric columns
grouped = df.groupby('Department').mean(numeric_only=True)

print(grouped)
5.2.8 Merging DataFrames
Merging combines multiple DataFrames into one, similar to SQL JOIN operations.
5.2.8.1 Using merge()
# Merging two DataFrames on a common column
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 90]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Grade': ['A', 'B']})

print(df1)
print('---')
print(df2)
print('---')
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)
5.2.8.2 Using concat()
# Concatenating DataFrames
df3 = pd.concat([df1, df2], axis=1)
print(df3)
5.2.9 Exporting DataFrames
You can export DataFrames to various file formats, including CSV, Excel, and JSON.
# Install the library required for Excel export
!pip install openpyxl
import pandas as pd
import openpyxl
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'HR']
}
df = pd.DataFrame(data)
# Saving DataFrame to a CSV file
df.to_csv('output.csv', index=False)

# Saving DataFrame to an Excel file
df.to_excel('output.xlsx', index=False)
5.2.10 Conclusion
DataFrames in pandas are versatile and powerful for handling tabular data in Python. They provide robust functionality for data manipulation, cleaning, and analysis, making them an essential tool for data scientists and analysts.
5.3 Loading and Saving Data
Pandas provides powerful methods to load data from various file formats and save it back after manipulation. Common formats include CSV, Excel, JSON, and SQL databases.
5.3.1 Loading Data
Pandas offers various functions to read data from external sources into DataFrames, allowing for quick data analysis and manipulation.
5.3.1.1 Reading CSV Files
The read_csv() function reads data from a CSV file into a DataFrame. This is one of the most common methods to import data.
import pandas as pd
# Reading a CSV file into a DataFrame
df = pd.read_csv('data.csv')
print(df.head()) # Display the first 5 rows
5.3.1.2 Reading Excel Files
To read Excel files, use the read_excel() function. You can specify the sheet name if the Excel file contains multiple sheets.
# Reading data from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())
5.3.1.3 Reading JSON Files
JSON files can be read using the read_json() function. JSON is commonly used for data exchange between systems.
# Reading data from a JSON file
df = pd.read_json('data.json')
print(df.head())
5.3.2 Saving Data
Once you have manipulated your data, you can save it back to your preferred format using pandas functions.
5.3.2.1 Saving to CSV
The to_csv() function saves the DataFrame to a CSV file. You can choose whether to include the index or not.
# Saving DataFrame to a CSV file
df.to_csv('output.csv', index=False)
5.3.2.2 Saving to Excel
To save data to an Excel file, use the to_excel() function. You can also specify the sheet name.
# Saving DataFrame to an Excel file
df.to_excel('output.xlsx', index=False, sheet_name='Results')
5.3.2.3 Saving to JSON
You can save data to JSON format using the to_json() function, which is useful for data interchange between applications.
# Saving DataFrame to a JSON file
df.to_json('output.json')
5.4 Exploring Summary Information and Statistical Values
Exploring data is an essential step in understanding the structure, distribution, and characteristics of your dataset. Pandas provides several methods to inspect and summarize DataFrames.
5.4.1 Viewing Basic Information
The info() method provides a concise summary of the DataFrame, including the index, column names, non-null counts, and data types.
import pandas as pd
# Expanding the data with more people and overlapping cities
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Hannah', 'Ian', 'Jack'],
    'Age': [25, 30, 35, 28, 22, 40, 33, 26, 29, 31],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'San Francisco', 'Chicago', 'Boston', 'New York', 'Boston']
}
# Creating the DataFrame
df = pd.DataFrame(data)
df
# Viewing basic information about the DataFrame
print(df.info())
5.4.2 Viewing Statistical Summaries
The describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of the data distribution, excluding NaN values.
# Viewing statistical summary of numerical columns
print(df.describe())
You can also include non-numerical data by specifying include='all'.
# Including all columns, including non-numeric
print(df.describe(include='all'))
5.4.3 Counting Unique Values
The value_counts() function counts the unique values in a column, which is particularly useful for categorical data.
# Counting unique values in a column
print(df['City'].value_counts())
5.4.4 Displaying First and Last Rows
You can use head() and tail() to view the first and last few rows of the DataFrame.
# Displaying the first 5 rows
print(df.head())
# Displaying the last 5 rows
print(df.tail())
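Both head() and tail() accept an optional row count (the default is 5). A quick sketch:

```python
import pandas as pd

# head(n) and tail(n) show the first/last n rows
df = pd.DataFrame({'X': range(10)})

print(df.head(3))  # first three rows
print(df.tail(2))  # last two rows
```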
5.5 Handling Missing Data
Handling missing data is a crucial aspect of data cleaning, as missing values can lead to incorrect analysis results.
5.5.1 Identifying Missing Data
Use isna() or isnull() to detect missing values. These methods return a DataFrame of booleans indicating the presence of missing values.
# Creating a DataFrame with missing values for demonstrating missing data handling
data_with_missing = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Hannah', 'Ian', 'Jack'],
    'Age': [25, 30, None, 28, 22, 40, 33, None, 29, 31], # Introducing missing values in 'Age'
    'City': ['New York', 'Los Angeles', 'Chicago', None, 'Los Angeles', 'San Francisco', 'Chicago', 'Boston', 'New York', None] # Introducing missing values in 'City'
}
# Creating the DataFrame with missing data
df_missing = pd.DataFrame(data_with_missing)
df_missing
# Identifying missing data
print(df_missing.isna())
# Counting missing values in each column
print(df_missing.isna().sum())
5.5.2 Removing Missing Data
You can remove rows or columns with missing values using the dropna() method.
# Dropping rows with any missing values
df_cleaned1 = df_missing.dropna()
print(df_cleaned1)
# Dropping columns with missing values
df_cleaned2 = df_missing.dropna(axis=1)
print(df_cleaned2)
5.5.3 Filling Missing Data
Instead of dropping missing data, you can fill it with specified values using fillna(). Common strategies include filling with zeros, the mean, or forward/backward filling.
# Filling missing values with a specified value
df_missing['Age'].fillna(0)

# Filling missing values with the column mean
df_missing['Age'].fillna(df_missing['Age'].mean())

# Note: without assignment, the calls above return new Series and leave df_missing unchanged
df_missing

# Forward filling missing values (ffill() replaces the deprecated fillna(method='ffill'))
df_missing.ffill()
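Backward filling, mentioned above, pulls the next valid observation backward; in recent pandas versions the ffill() and bfill() methods are the preferred spellings. A minimal sketch:

```python
import pandas as pd

# ffill() propagates the last valid value forward;
# bfill() pulls the next valid value backward
s = pd.Series([1.0, None, None, 4.0])

print(s.ffill())  # 1.0, 1.0, 1.0, 4.0
print(s.bfill())  # 1.0, 4.0, 4.0, 4.0
```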
5.6 Working with Indices
The index of a DataFrame is used to uniquely identify rows. Managing indices effectively allows for more efficient data manipulation and retrieval.
5.6.1 Setting and Resetting Index
You can set a column as the index using set_index() and reset it back to the default integer index using reset_index().
# Setting 'Name' as the index
df.set_index('Name', inplace=True)
print(df)
# Resetting the index to default
df.reset_index(inplace=True)
print(df)
5.6.2 Sorting by Index
The sort_index() method sorts the DataFrame by its index.
# Sorting the DataFrame by index
df.sort_index(inplace=True)
print(df)
5.6.3 Changing the Index Name
You can rename the index by assigning a new name directly.
# Renaming the index
df.index.name = 'ID'
print(df)
5.7 Filtering Data
Filtering allows you to select rows based on specific conditions, enabling targeted analysis and manipulation.
5.7.1 Boolean Indexing
Boolean indexing uses conditional statements to filter data.
# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
5.7.2 Using the isin() Method
The isin() method filters rows based on multiple values in a column.
# Filtering rows where Name is Alice or Bob
filtered_df = df[df['Name'].isin(['Alice', 'Bob'])]
print(filtered_df)
5.7.3 Filtering with Multiple Conditions
You can combine multiple conditions using & (and) or | (or).
# Filtering with multiple conditions
filtered_df = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
print(filtered_df)
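The | operator works the same way for a logical OR; each condition still needs its own parentheses because & and | bind more tightly than comparisons. A self-contained sketch:

```python
import pandas as pd

# Keep rows that satisfy either condition
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

filtered_df = df[(df['Age'] < 28) | (df['City'] == 'Chicago')]
print(filtered_df)  # Alice (Age < 28) and Charlie (City == 'Chicago')
```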
5.8 Merging DataFrames
Merging DataFrames is an essential operation when working with multiple datasets that need to be combined. Pandas offers several functions for merging, joining, and concatenating DataFrames.
5.8.1 Using the concat() Function
The concat() function concatenates DataFrames along rows (default) or columns. It is useful when you have DataFrames with the same columns and want to stack them.
import pandas as pd
# Creating sample DataFrames
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})

# Concatenating DataFrames vertically
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)
# Output:
# Name Age
# 0 Alice 25
# 1 Bob 30
# 2 Charlie 35
# 3 David 40
You can also concatenate DataFrames horizontally by specifying axis=1.
# Concatenating DataFrames horizontally
df_concat_horizontal = pd.concat([df1, df2], axis=1)
print(df_concat_horizontal)
5.8.2 Using the merge() Function
The merge() function combines two DataFrames based on a key column or index. It works similarly to SQL JOIN operations (inner, outer, left, right).
# Creating sample DataFrames for merging
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 90]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Grade': ['A', 'B']})

# Merging DataFrames on the 'Name' column
df_merged = pd.merge(df1, df2, on='Name', how='inner')
print(df_merged)
# Output:
# Name Score Grade
# 0 Alice 85 A
# 1 Bob 90 B
5.8.2.1 SQL Join Methods
5.8.2.1.1 1. Inner Join
An inner join returns rows when there is a match in both tables.
5.8.2.1.2 2. Left Join (or Left Outer Join)
A left join returns all rows from the left table, and the matched rows from the right table. If no match is found, the result is NULL on the right side.
5.8.2.1.3 3. Right Join (or Right Outer Join)
A right join returns all rows from the right table, and the matched rows from the left table. If no match is found, the result is NULL on the left side.
5.8.2.1.4 4. Full Join (or Full Outer Join)
A full join returns rows when there is a match in either table. It returns all rows from both tables and fills NULL where there is no match.
5.8.2.2 Diagram of SQL Joins:
1. INNER JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| common | | common |
+--------+ +--------+
2. LEFT JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| all | | common |
| from | +--------+
| left |
+--------+
3. RIGHT JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| common | | all |
+--------+ | from |
| right |
+--------+
4. FULL JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| all | | all |
| from | | from |
| both | | both |
+--------+ +--------+
5.8.2.3 Python Example with DataFrames
Let’s create two DataFrames and perform different types of joins using the merge() function in pandas, which simulates SQL joins.
import pandas as pd
# Creating two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40]
})
df2 = pd.DataFrame({
    'Name': ['Alice', 'Charlie', 'Eve'],
    'City': ['New York', 'Chicago', 'Los Angeles']
})
# Display the DataFrames
print("DataFrame 1:")
print(df1)
print("DataFrame 2:")
print(df2)
# Inner Join
inner_join = pd.merge(df1, df2, on='Name', how='inner')
print("\nInner Join:")
print(inner_join)
# Left Join
left_join = pd.merge(df1, df2, on='Name', how='left')
print("\nLeft Join:")
print(left_join)
# Right Join
right_join = pd.merge(df1, df2, on='Name', how='right')
print("\nRight Join:")
print(right_join)
# Full Outer Join
full_join = pd.merge(df1, df2, on='Name', how='outer')
print("\nFull Outer Join:")
print(full_join)
5.8.3 Using the join() Function
The join() function is used to join two DataFrames based on their indices. It is particularly useful for merging DataFrames with overlapping indices.
# Creating sample DataFrames with indices
df1 = pd.DataFrame({'Score': [85, 90]}, index=['Alice', 'Bob'])
df2 = pd.DataFrame({'Grade': ['A', 'B']}, index=['Alice', 'Bob'])

# Joining DataFrames on indices
df_joined = df1.join(df2)
print(df_joined)
# Output:
# Score Grade
# Alice 85 A
# Bob 90 B
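join() defaults to a left join on the index; when the indices only partially overlap, the how= parameter selects the join type, just as with merge(). A small sketch:

```python
import pandas as pd

# Indices overlap only on 'Alice'
df1 = pd.DataFrame({'Score': [85, 90]}, index=['Alice', 'Bob'])
df2 = pd.DataFrame({'Grade': ['A', 'C']}, index=['Alice', 'Charlie'])

left = df1.join(df2)               # left join: keeps Alice and Bob; Bob's Grade is NaN
inner = df1.join(df2, how='inner') # inner join: keeps only Alice
print(left)
print(inner)
```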
5.9 Reshaping Data
Reshaping allows you to change the layout of a DataFrame, making it suitable for different types of analysis.
5.9.1 Using the melt() Function
The melt() function unpivots a DataFrame from a wide format to a long format, making it easier to analyze and visualize.
# Creating a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Math': [90, 80],
    'Science': [85, 95]
})
# Melting the DataFrame
df_melted = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'],
                    var_name='Subject', value_name='Score')
print(df_melted)
# Output:
# Name Subject Score
# 0 Alice Math 90
# 1 Bob Math 80
# 2 Alice Science 85
# 3 Bob Science 95
5.9.2 Using the pivot_table() Function
The pivot_table() function reshapes data by creating a new summary table, which is useful for aggregating data.
# Creating a DataFrame for pivoting
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Science', 'Science'],
    'Score': [90, 80, 85, 95]
})
print(df)
print('---')
# Pivoting the DataFrame
df_pivot = df.pivot_table(values='Score', index='Name', columns='Subject',
                          aggfunc='mean')
print(df_pivot)
# Output:
# Subject Math Science
# Name
# Alice 90.0 85.0
# Bob 80.0 95.0
5.9.3 Using stack() and unstack()
stack() compresses columns into rows, while unstack() does the opposite, expanding rows into columns.
# Stacking and unstacking a DataFrame
df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
}, index=['X', 'Y'])

# Stacking the DataFrame
stacked = df.stack()
print(stacked)
# Unstacking the stacked DataFrame
unstacked = stacked.unstack()
print(unstacked)
5.10 Applying Functions to DataFrames
Applying functions to DataFrames allows you to transform data easily, using built-in or custom functions.
5.10.1 Applying Functions to Series
The apply() function applies a function along an axis of the DataFrame.
# Creating a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
# Applying a function to each column
df_squared = df.apply(lambda x: x ** 2)
print(df_squared)
# Output:
# A B
# 0 1 16
# 1 4 25
# 2 9 36
# Applying a function to each column with a named function
# def square(x):
#     y = x ** 2
#     return y
# df_squared = df.apply(square)
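By default apply() passes each column to the function (axis=0); with axis=1 it passes each row instead, which is handy for combining values across columns. A brief sketch:

```python
import pandas as pd

# axis=1 hands each row to the function as a Series
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

row_sums = df.apply(lambda row: row['A'] + row['B'], axis=1)
print(row_sums)  # 5, 7, 9
```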
5.10.2 Applying Functions to Entire DataFrames
The applymap() function applies a function to each element of the DataFrame. (In pandas 2.1 and later, DataFrame.map() is the preferred name for this element-wise operation.)
# Applying a function to each element
df_negated = df.applymap(lambda x: -x)
print(df_negated)
# Output:
# A B
# 0 -1 -4
# 1 -2 -5
# 2 -3 -6