5 Data Analysis in Python
5.1 Series
A Series is a one-dimensional array-like object provided by pandas that can hold data of any type (integers, floats, strings, Python objects, etc.). It is similar to a column in an Excel spreadsheet or a single-column DataFrame in pandas. Each element in a Series has an associated index, which is used to access individual elements.
5.1.1 Creating a Series
You can create a Series in pandas using various data structures such as lists, dictionaries, or NumPy arrays. You can also specify the index manually.
5.1.1.1 Creating a Series from a List
The simplest way to create a Series is by passing a list to the pd.Series() constructor.
import pandas as pd
# Creating a Series from a list
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)
Output:
0 10
1 20
2 30
3 40
dtype: int64
In this example, the default index starts at 0 and increments by 1.
5.1.1.2 Creating a Series with Custom Index
You can specify custom indices when creating a Series, making it more descriptive.
# Creating a Series with custom indices
data = [100, 200, 300]
index = ['a', 'b', 'c']
custom_series = pd.Series(data, index=index)
print(custom_series)
Output:
a 100
b 200
c 300
dtype: int64
5.1.1.3 Creating a Series from a Dictionary
You can also create a Series from a dictionary, where keys become the indices, and values become the Series values.
# Creating a Series from a dictionary
data_dict = {'apple': 50, 'banana': 30, 'cherry': 20}
fruit_series = pd.Series(data_dict)
print(fruit_series)
Output:
apple 50
banana 30
cherry 20
dtype: int64
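As mentioned above, a Series can also wrap a NumPy array directly; pandas reuses the array's dtype. A minimal sketch (the data and labels are illustrative):

```python
import numpy as np
import pandas as pd

# Creating a Series from a NumPy array with custom labels
arr = np.array([1.5, 2.5, 3.5])
np_series = pd.Series(arr, index=['x', 'y', 'z'])
print(np_series)
```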
5.1.2 Selecting Elements
Selecting elements from a Series can be done using indexing and slicing, similar to Python lists and NumPy arrays. The indexing can be done using integer-based indexing or label-based indexing if a custom index is set.
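The two styles can also be made explicit with .iloc (position-based) and .loc (label-based), which avoids ambiguity when the index itself contains integers. A short sketch with illustrative data:

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s.iloc[0])   # position-based: first element
print(s.loc['c'])  # label-based: element with label 'c'
```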
5.1.2.1 Accessing Single Elements
You can access single elements by using their index position or label.
# Accessing elements by index position
print(series[0]) # Output: 10
# Accessing elements by index label
print(custom_series['b']) # Output: 200
5.1.2.2 Slicing a Series
Slicing allows you to select a range of elements from a Series. Note that label-based slices include the endpoint, whereas position-based slices exclude it.
# Slicing using index positions
print(series[1:3]) # Output:
# 1 20
# 2 30
# dtype: int64
# Slicing using index labels
print(custom_series['a':'b']) # Output:
# a 100
# b 200
# dtype: int64
5.1.3 Series Operations
Series in pandas support vectorized operations, which means you can perform operations on entire Series without writing loops. These operations are element-wise, and pandas will align data using the index.
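Alignment means elements are matched by index label, not by position; where labels do not overlap, the result is NaN. A small illustration with assumed data:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
# 'a' and 'd' appear in only one Series, so those positions become NaN
aligned = s1 + s2
print(aligned)
```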
5.1.3.1 Arithmetic Operations
You can perform arithmetic operations like addition, subtraction, multiplication, and division directly on Series.
# Arithmetic operations
series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([10, 20, 30, 40])
# Adding two Series
sum_series = series1 + series2
print(sum_series)
# Output:
# 0 11
# 1 22
# 2 33
# 3 44
# dtype: int64
5.1.3.2 Mathematical Functions
You can use mathematical functions like mean(), sum(), max(), and others on Series.
# Using mathematical functions
values = pd.Series([5, 10, 15, 20])
print("Mean:", values.mean()) # Output: Mean: 12.5
print("Sum:", values.sum()) # Output: Sum: 50
print("Max:", values.max()) # Output: Max: 20
5.1.3.3 Applying Functions with apply()
The apply() function allows you to apply a custom function to each element in the Series.
# Applying a custom function to each element
def square(x):
return x * x
squared_series = values.apply(square)
print(squared_series)
# Output:
# 0 25
# 1 100
# 2 225
# 3 400
# dtype: int64
5.1.3.4 Handling Missing Data
Series can handle missing data (NaN values), and you can manipulate these values using methods like fillna(), dropna(), and isna().
# Series with missing values
data_with_nan = pd.Series([1, 2, None, 4])
print(data_with_nan)
# Filling missing values
filled_series = data_with_nan.fillna(0)
print(filled_series)
# Output:
# 0 1.0
# 1 2.0
# 2 0.0
# 3 4.0
# dtype: float64
5.1.4 Conclusion
The pandas Series is a powerful tool for one-dimensional data manipulation, supporting efficient data access, arithmetic operations, and handling of missing data. Understanding Series is fundamental to working with more complex pandas data structures like DataFrames.
5.2 DataFrame
A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns) in pandas. Think of it as a table of data, similar to an Excel spreadsheet or a SQL table, which can hold data of various types (strings, integers, floats, etc.).
5.2.1 Creating and Modifying DataFrames
DataFrames can be created from various sources, including lists, dictionaries, NumPy arrays, and even other pandas objects like Series.
5.2.1.1 Creating a DataFrame from a Dictionary
The most common way to create a DataFrame is by using a dictionary, where the keys become column names, and the values are lists representing column data.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
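As noted above, a DataFrame can also be assembled from other pandas objects such as Series, with one Series per column; the shared index labels the rows. A sketch with illustrative data:

```python
import pandas as pd

ages = pd.Series([25, 30], index=['Alice', 'Bob'])
cities = pd.Series(['New York', 'Los Angeles'], index=['Alice', 'Bob'])
# Each Series becomes a column; rows are aligned on the shared index
df_from_series = pd.DataFrame({'Age': ages, 'City': cities})
print(df_from_series)
```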
5.2.1.2 Creating a DataFrame from a List of Lists
You can create a DataFrame from a list of lists, specifying column names separately.
# Creating a DataFrame from a list of lists
data = [
['Alice', 25, 'New York'],
['Bob', 30, 'Los Angeles'],
['Charlie', 35, 'Chicago']
]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
5.2.1.3 Creating a DataFrame from a NumPy Array
DataFrames can also be created from NumPy arrays. If you do not supply column names, pandas falls back to default integer labels, so it is good practice to specify them explicitly.
import numpy as np
# Creating a DataFrame from a NumPy array
array_data = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(array_data, columns=['Column1', 'Column2'])
print(df)
5.2.1.4 Modifying a DataFrame
You can add, modify, or delete rows and columns in a DataFrame.
Adding a Column:
# Adding a new column to an existing DataFrame
df['Score'] = [90, 85, 88]
print(df)
Modifying a Column:
# Modifying an existing column
df['Age'] = df['Age'] + 1
print(df)
Deleting a Column:
# Deleting a column
df.drop('Score', axis=1, inplace=True)
print(df)
5.2.2 Selecting Rows and Columns
Selecting data from a DataFrame can be done using indexing, loc, and iloc.
5.2.2.1 Selecting Columns
Columns can be selected using square brackets, dot notation, or loc.
# Selecting a single column
names = df['Name']
# Dot notation also works when the column name is a valid Python identifier: df.Name
print(names)
# Selecting multiple columns
subset = df[['Name', 'City']]
print(subset)
5.2.2.2 Selecting Rows
Rows can be selected by index position using iloc or by label using loc.
Using iloc (Index Location):
iloc selects rows and columns by integer index positions.
# Selecting rows using iloc
print(df.iloc[0]) # First row
print(df.iloc[1:3]) # Slicing rows
Using loc (Label Location):
loc selects rows and columns by label (index and column names).
# Selecting rows using loc
df.set_index('Name', inplace=True) # Setting 'Name' column as index
print(df.loc['Alice']) # Select row with index 'Alice'
5.2.3 Filtering Data in DataFrames
Filtering allows you to extract specific rows based on conditions.
Filtering Rows Based on Conditions:
# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
5.2.4 Adding and Removing Rows
You can remove rows from a DataFrame using drop(). Rows were traditionally added with the append() method, but DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat() is now the way to add rows.
5.2.4.1 Adding Rows
# Adding a new row using append() (deprecated; removed in pandas 2.0)
new_row = pd.DataFrame([['David', 28, 'Seattle']], columns=['Name', 'Age', 'City'])
df = df.append(new_row, ignore_index=True)  # raises AttributeError in pandas >= 2.0
print(df)
Here is the revised version using updated pandas methods, with pd.concat() instead of append():
import pandas as pd
# Creating the initial DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# Adding a new row using pd.concat()
new_row = pd.DataFrame([['David', 28, 'Seattle']], columns=['Name', 'Age', 'City'])
df = pd.concat([df, new_row], ignore_index=True)
print(df)
5.2.4.2 Removing Rows
# Removing rows by index
df = df.drop(0) # Removes the first row
print(df)
In this updated code:
- Rows are added using pd.concat(), which is the recommended method in newer versions of pandas.
- The drop() method is still valid and is used to remove rows by their index.
5.2.5 Handling Missing Data in DataFrames
Handling missing data is essential for clean and accurate data analysis.
5.2.5.1 Identifying Missing Data
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, np.nan, 40],
'Salary': [50000, 60000, np.nan, 80000]
}
df = pd.DataFrame(data)
# Checking for missing data
print(df.isna()) # Returns a DataFrame of booleans
print(df.isna().sum()) # Sum of missing values in each column
5.2.5.2 Filling Missing Data
You can fill missing data with a specific value, such as 0 or the mean of the column.
# Filling missing values with a specified value
df['Age'].fillna(df['Age'].mean(), inplace=True)
This pattern triggers a warning related to pandas' behavior with chained assignments: in future pandas versions, using inplace=True on a single column may not modify the original DataFrame. To fix this and ensure compatibility with future versions, avoid inplace=True in this context and instead assign the result of the fillna() operation directly back to the DataFrame.
Here’s the corrected code:
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, np.nan, 40],
'Salary': [50000, 60000, np.nan, 80000]
}
df = pd.DataFrame(data)
# Filling missing values in the 'Salary' column and assigning back to the original DataFrame
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print(df)
5.2.5.3 Dropping Missing Data
Remove rows or columns that contain missing values.
# Dropping rows with missing values
df.dropna(inplace=True)
df
5.2.6 Sorting Data in DataFrames
Sorting can be performed based on one or multiple columns using the sort_values() method.
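For multiple columns, pass lists to by and ascending; rows are ordered by the first column, with ties broken by the next. A short sketch with assumed data:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Age': [30, 25, 35, 40]
})
# Sort by Department ascending, then by Age descending within each department
multi_sorted = df.sort_values(by=['Department', 'Age'], ascending=[True, False])
print(multi_sorted)
```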
# Sorting DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
5.2.7 Grouping Data (cf. MS Excel ‘PivotTable’)
Grouping data is useful for performing aggregate operations on subsets of data.
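Beyond a single mean, groupby() can compute several aggregates at once via agg(); a sketch with assumed data:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 80000]
})
# Multiple aggregations per group in one call
stats = df.groupby('Department')['Salary'].agg(['mean', 'max'])
print(stats)
```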
5.2.7.1 Option 1: Select Numeric Columns Before Grouping
# Grouping by a column and calculating the mean
import pandas as pd
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 28],
'Salary': [50000, 60000, 70000, 80000, 55000],
'Department': ['HR', 'IT', 'IT', 'HR', 'HR']
}
df = pd.DataFrame(data)
print(df)
# Grouping by the 'Department' column and calculating the mean of only numeric columns
grouped = df.groupby('Department')[['Age', 'Salary']].mean()
print(grouped)
5.2.7.2 Option 2: Use numeric_only=True (if available)
import pandas as pd
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 28],
'Salary': [50000, 60000, 70000, 80000, 55000],
'Department': ['HR', 'IT', 'IT', 'HR', 'HR']
}
df = pd.DataFrame(data)
# Grouping by the 'Department' column and calculating the mean, ignoring non-numeric columns
grouped = df.groupby('Department').mean(numeric_only=True)
print(grouped)
5.2.8 Merging DataFrames
Merging combines multiple DataFrames into one, similar to SQL JOIN operations.
5.2.8.1 Using merge()
# Merging two DataFrames on a common column
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 90]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Grade': ['A', 'B']})
print(df1)
print('---')
print(df2)
print('---')
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)
5.2.8.2 Using concat()
# Concatenating DataFrames side by side
# (note: both DataFrames have a 'Name' column, so it appears twice in the result)
df3 = pd.concat([df1, df2], axis=1)
print(df3)
5.2.9 Exporting DataFrames
You can export DataFrames to various file formats, including CSV, Excel, and JSON.
# Install the correct library
!pip install openpyxl
import pandas as pd
import openpyxl
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 28],
'Salary': [50000, 60000, 70000, 80000, 55000],
'Department': ['HR', 'IT', 'IT', 'HR', 'HR']
}
df = pd.DataFrame(data)
# Saving DataFrame to a CSV file
df.to_csv('output.csv', index=False)
# Saving DataFrame to an Excel file
df.to_excel('output.xlsx', index=False)
5.2.10 Conclusion
DataFrames in pandas are versatile and powerful for handling tabular data in Python. They provide robust functionality for data manipulation, cleaning, and analysis, making them an essential tool for data scientists and analysts.
5.3 Loading and Saving Data
Pandas provides powerful methods to load data from various file formats and save it back after manipulation. Common formats include CSV, Excel, JSON, and SQL databases.
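CSV, Excel, and JSON are covered below; for SQL databases, pandas pairs to_sql() and read_sql() with a database connection. A minimal sketch using Python's built-in sqlite3 module (the table name and data are illustrative):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df.to_sql('people', conn, index=False)              # write the DataFrame as a table
loaded = pd.read_sql('SELECT * FROM people', conn)  # read it back into a DataFrame
print(loaded)
conn.close()
```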
5.3.1 Loading Data
Pandas offers various functions to read data from external sources into DataFrames, allowing for quick data analysis and manipulation.
5.3.1.1 Reading CSV Files
The read_csv() function reads data from a CSV file into a DataFrame. This is one of the most common methods to import data.
import pandas as pd
# Reading a CSV file into a DataFrame
df = pd.read_csv('data.csv')
print(df.head()) # Display the first 5 rows
5.3.1.2 Reading Excel Files
To read Excel files, use the read_excel() function. You can specify the sheet name if the Excel file contains multiple sheets.
# Reading data from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())
5.3.1.3 Reading JSON Files
JSON files can be read using the read_json() function. JSON is commonly used for data exchange between systems.
# Reading data from a JSON file
df = pd.read_json('data.json')
print(df.head())
5.3.2 Saving Data
Once you have manipulated your data, you can save it back to your preferred format using pandas functions.
5.3.2.1 Saving to CSV
The to_csv() function saves the DataFrame to a CSV file. You can choose whether to include the index or not.
# Saving DataFrame to a CSV file
df.to_csv('output.csv', index=False)
5.3.2.2 Saving to Excel
To save data to an Excel file, use the to_excel() function. You can also specify the sheet name.
# Saving DataFrame to an Excel file
df.to_excel('output.xlsx', index=False, sheet_name='Results')
5.3.2.3 Saving to JSON
You can save data to JSON format using the to_json() function, which is useful for data interchange between applications.
# Saving DataFrame to a JSON file
df.to_json('output.json')
5.4 Exploring Summary Information and Statistical Values
Exploring data is an essential step in understanding the structure, distribution, and characteristics of your dataset. Pandas provides several methods to inspect and summarize DataFrames.
5.4.1 Viewing Basic Information
The info() method provides a concise summary of the DataFrame, including the index, column names, non-null counts, and data types.
import pandas as pd
# Expanding the data with more people and overlapping cities
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Hannah', 'Ian', 'Jack'],
'Age': [25, 30, 35, 28, 22, 40, 33, 26, 29, 31],
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'San Francisco', 'Chicago', 'Boston', 'New York', 'Boston']
}
# Creating the DataFrame
df = pd.DataFrame(data)
df
# Viewing basic information about the DataFrame
print(df.info())
5.4.2 Viewing Statistical Summaries
The describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of the data distribution, excluding NaN values.
# Viewing statistical summary of numerical columns
print(df.describe())
You can also include non-numerical data by specifying include='all'.
# Including all columns, including non-numeric
print(df.describe(include='all'))
5.4.3 Counting Unique Values
The value_counts() function counts the unique values in a column, which is particularly useful for categorical data.
# Counting unique values in a column
print(df['City'].value_counts())
5.4.4 Displaying First and Last Rows
You can use head() and tail() to view the first and last few rows of the DataFrame.
# Displaying the first 5 rows
print(df.head())
# Displaying the last 5 rows
print(df.tail())
5.5 Handling Missing Data
Handling missing data is a crucial aspect of data cleaning, as missing values can lead to incorrect analysis results.
5.5.1 Identifying Missing Data
Use isna() or isnull() to detect missing values. These methods return a DataFrame of booleans indicating the presence of missing values.
# Creating a DataFrame with missing values for demonstrating missing data handling
data_with_missing = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Hannah', 'Ian', 'Jack'],
'Age': [25, 30, None, 28, 22, 40, 33, None, 29, 31], # Introducing missing values in 'Age'
'City': ['New York', 'Los Angeles', 'Chicago', None, 'Los Angeles', 'San Francisco', 'Chicago', 'Boston', 'New York', None] # Introducing missing values in 'City'
}
# Creating the DataFrame with missing data
df_missing = pd.DataFrame(data_with_missing)
df_missing
# Identifying missing data
print(df_missing.isna())
# Counting missing values in each column
print(df_missing.isna().sum())
5.5.2 Removing Missing Data
You can remove rows or columns with missing values using the dropna() method.
# Dropping rows with any missing values
df_cleaned1 = df_missing.dropna()
print(df_cleaned1)
# Dropping columns with missing values
df_cleaned2 = df_missing.dropna(axis=1)
print(df_cleaned2)
5.5.3 Filling Missing Data
Instead of dropping missing data, you can fill it with specified values using fillna(). Common strategies include filling with zeros, the mean, or forward/backward filling.
# Filling missing values with a specified value
# (fillna() returns a new Series; assign it back if you want to keep the result)
filled_zero = df_missing['Age'].fillna(0)
# Filling missing values with the column mean
filled_mean = df_missing['Age'].fillna(df_missing['Age'].mean())
df_missing
# Forward filling missing values
# (fillna(method='ffill') is deprecated in recent pandas; use ffill() instead)
df_missing.ffill()
5.6 Working with Indices
The index of a DataFrame is used to uniquely identify rows. Managing indices effectively allows for more efficient data manipulation and retrieval.
5.6.1 Setting and Resetting Index
You can set a column as the index using set_index() and reset it back to the default integer index using reset_index().
# Setting 'Name' as the index
df.set_index('Name', inplace=True)
print(df)
# Resetting the index to default
df.reset_index(inplace=True)
print(df)
5.6.2 Sorting by Index
The sort_index() method sorts the DataFrame by its index.
# Sorting the DataFrame by index
df.sort_index(inplace=True)
print(df)
5.6.3 Changing the Index Name
You can rename the index by assigning a new name directly.
# Renaming the index
df.index.name = 'ID'
print(df)
5.7 Filtering Data
Filtering allows you to select rows based on specific conditions, enabling targeted analysis and manipulation.
5.7.1 Boolean Indexing
Boolean indexing uses conditional statements to filter data.
# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
5.7.2 Using isin() Method
The isin() method filters rows based on multiple values in a column.
# Filtering rows where Name is Alice or Bob
filtered_df = df[df['Name'].isin(['Alice', 'Bob'])]
print(filtered_df)
5.7.3 Filtering with Multiple Conditions
You can combine multiple conditions using & (and) or | (or).
# Filtering with multiple conditions
filtered_df = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
print(filtered_df)
5.8 Merging DataFrames
Merging DataFrames is an essential operation when working with multiple datasets that need to be combined. Pandas offers several functions for merging, joining, and concatenating DataFrames.
5.8.1 Using concat() Function
The concat() function concatenates DataFrames along rows (default) or columns. It is useful when you have DataFrames with the same columns and want to stack them.
import pandas as pd
# Creating sample DataFrames
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})
# Concatenating DataFrames vertically
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)
# Output:
# Name Age
# 0 Alice 25
# 1 Bob 30
# 2 Charlie 35
# 3 David 40
You can also concatenate DataFrames horizontally by specifying axis=1.
# Concatenating DataFrames horizontally
df_concat_horizontal = pd.concat([df1, df2], axis=1)
print(df_concat_horizontal)
5.8.2 Using merge() Function
The merge() function combines two DataFrames based on a key column or index. It works similarly to SQL JOIN operations (inner, outer, left, right).
# Creating sample DataFrames for merging
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 90]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Grade': ['A', 'B']})
# Merging DataFrames on the 'Name' column
df_merged = pd.merge(df1, df2, on='Name', how='inner')
print(df_merged)
# Output:
# Name Score Grade
# 0 Alice 85 A
# 1 Bob 90 B
5.8.2.1 SQL Join Methods
5.8.2.1.1 Inner Join
An inner join returns rows when there is a match in both tables.
5.8.2.1.2 Left Join (or Left Outer Join)
A left join returns all rows from the left table, and the matched rows from the right table. If no match is found, the result is NULL on the right side.
5.8.2.1.3 Right Join (or Right Outer Join)
A right join returns all rows from the right table, and the matched rows from the left table. If no match is found, the result is NULL on the left side.
5.8.2.1.4 Full Join (or Full Outer Join)
A full join returns rows when there is a match in either table. It returns all rows from both tables and fills NULL where there is no match.
5.8.2.2 Diagram of SQL Joins:
1. INNER JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| common | | common |
+--------+ +--------+
2. LEFT JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| all | | common |
| from | +--------+
| left |
+--------+
3. RIGHT JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| common | | all |
+--------+ | from |
| right |
+--------+
4. FULL JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| all | | all |
| from | | from |
| both | | both |
+--------+ +--------+
5.8.2.3 Python Example with DataFrames
Let’s create two DataFrames and perform different types of joins using the merge() function in pandas, which simulates SQL joins.
import pandas as pd
# Creating two DataFrames
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40]
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Charlie', 'Eve'],
'City': ['New York', 'Chicago', 'Los Angeles']
})
# Display the DataFrames
print("DataFrame 1:")
print(df1)
print("DataFrame 2:")
print(df2)
# Inner Join
inner_join = pd.merge(df1, df2, on='Name', how='inner')
print("\nInner Join:")
print(inner_join)
# Left Join
left_join = pd.merge(df1, df2, on='Name', how='left')
print("\nLeft Join:")
print(left_join)
# Right Join
right_join = pd.merge(df1, df2, on='Name', how='right')
print("\nRight Join:")
print(right_join)
# Full Outer Join
full_join = pd.merge(df1, df2, on='Name', how='outer')
print("\nFull Outer Join:")
print(full_join)
5.8.3 Using join() Function
The join() function is used to join two DataFrames based on their indices. It is particularly useful for merging DataFrames with overlapping indices.
# Creating sample DataFrames with indices
df1 = pd.DataFrame({'Score': [85, 90]}, index=['Alice', 'Bob'])
df2 = pd.DataFrame({'Grade': ['A', 'B']}, index=['Alice', 'Bob'])
# Joining DataFrames on indices
df_joined = df1.join(df2)
print(df_joined)
# Output:
# Score Grade
# Alice 85 A
# Bob 90 B
5.9 Reshaping Data
Reshaping allows you to change the layout of a DataFrame, making it suitable for different types of analysis.
5.9.1 Using melt() Function
The melt() function unpivots a DataFrame from a wide format to a long format, making it easier to analyze and visualize.
# Creating a sample DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Math': [90, 80],
'Science': [85, 95]
})
# Melting the DataFrame
df_melted = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'],
var_name='Subject', value_name='Score')
print(df_melted)
# Output:
# Name Subject Score
# 0 Alice Math 90
# 1 Bob Math 80
# 2 Alice Science 85
# 3 Bob Science 95
5.9.2 Using pivot_table() Function
The pivot_table() function reshapes data by creating a new summary table, which is useful for aggregating data.
# Creating a DataFrame for pivoting
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
'Subject': ['Math', 'Math', 'Science', 'Science'],
'Score': [90, 80, 85, 95]
})
print(df)
print('---')
# Pivoting the DataFrame
df_pivot = df.pivot_table(values='Score', index='Name', columns='Subject',
aggfunc='mean')
print(df_pivot)
# Output:
# Subject Math Science
# Name
# Alice 90 85
# Bob 80 95
5.9.3 Using stack() and unstack()
stack() compresses columns into rows, while unstack() does the opposite, expanding rows into columns.
# Stacking and unstacking a DataFrame
df = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
}, index=['X', 'Y'])
# Stacking the DataFrame
stacked = df.stack()
print(stacked)
# Unstacking the stacked DataFrame
unstacked = stacked.unstack()
print(unstacked)
5.10 Applying Functions to DataFrames
Applying functions to DataFrames allows you to transform data easily, using built-in or custom functions.
5.10.1 Applying Functions to Series
The apply() function applies a function along an axis of the DataFrame.
# Creating a sample DataFrame
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
# Applying a function to each column
df_squared = df.apply(lambda x: x ** 2)
print(df_squared)
# Output:
# A B
# 0 1 16
# 1 4 25
# 2 9 36
# Applying a function to each column
# def square(x):
# y = x ** 2
# return y
# df_squared = df.apply(square)
5.10.2 Applying Functions to Entire DataFrames
The applymap() function applies a function to each element of the DataFrame. (Note that applymap() is deprecated since pandas 2.1 in favor of the equivalent DataFrame.map().)
# Applying a function to each element
df_negated = df.applymap(lambda x: -x)
print(df_negated)
# Output:
# A B
# 0 -1 -4
# 1 -2 -5
# 2 -3 -6
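When the element-wise transform applies to a single column, the long-standing Series.map() does the same job and avoids the applymap() deprecation entirely; a short sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Element-wise transform of one column with Series.map()
df['A_neg'] = df['A'].map(lambda x: -x)
print(df)
```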