5 Data Analysis in Python
5.1 Series
A Series is a one-dimensional array-like object provided by pandas that can hold data of any type (integers, floats, strings, Python objects, etc.). It is similar to a column in an Excel spreadsheet or a single-column DataFrame in pandas. Each element in a Series has an associated index, which is used to access individual elements.
5.1.1 Creating a Series
You can create a Series in pandas using various data structures such as lists, dictionaries, or NumPy arrays. You can also specify the index manually.
5.1.1.1 Creating a Series from a List
The simplest way to create a Series is by passing a list to the pd.Series() constructor.
import pandas as pd
# Creating a Series from a list
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)
Output:
0 10
1 20
2 30
3 40
dtype: int64
In this example, the default index starts at 0 and increments by 1.
5.1.1.2 Creating a Series with Custom Index
You can specify custom indices when creating a Series, making it more descriptive.
# Creating a Series with custom indices
data = [100, 200, 300]
index = ['a', 'b', 'c']
custom_series = pd.Series(data, index=index)
print(custom_series)
Output:
a 100
b 200
c 300
dtype: int64
5.1.1.3 Creating a Series from a Dictionary
You can also create a Series from a dictionary, where keys become the indices, and values become the Series values.
# Creating a Series from a dictionary
data_dict = {'apple': 50, 'banana': 30, 'cherry': 20}
fruit_series = pd.Series(data_dict)
print(fruit_series)
Output:
apple 50
banana 30
cherry 20
dtype: int64
5.1.2 Selecting Elements
Selecting elements from a Series can be done using indexing and slicing, similar to Python lists and NumPy arrays. The indexing can be done using integer-based indexing or label-based indexing if a custom index is set.
5.1.2.1 Accessing Single Elements
You can access single elements by using their index position or label.
# Accessing elements by index position
print(series[0]) # Output: 10
# Accessing elements by index label
print(custom_series['b']) # Output: 200
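Square-bracket access mixes positional and label semantics. To make the intent explicit, pandas provides the .iloc (position-based) and .loc (label-based) accessors; a minimal sketch:

```python
import pandas as pd

# A Series with a custom index, as above
custom_series = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

# .iloc is always position-based; .loc is always label-based
print(custom_series.iloc[0])   # 100
print(custom_series.loc['c'])  # 300
```

Preferring these accessors avoids ambiguity when an integer-valued custom index is in play.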
5.1.2.2 Slicing a Series
Slicing allows you to select a range of elements from a Series.
# Slicing using index positions
print(series[1:3]) # Output:
# 1 20
# 2 30
# dtype: int64
# Slicing using index labels
print(custom_series['a':'b']) # Output:
# a 100
# b 200
# dtype: int64
5.1.3 Series Operations
Series in pandas support vectorized operations, which means you can perform operations on entire Series without writing loops. These operations are element-wise, and pandas will align data using the index.
5.1.3.1 Arithmetic Operations
You can perform arithmetic operations like addition, subtraction, multiplication, and division directly on Series.
# Arithmetic operations
series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([10, 20, 30, 40])

# Adding two Series
sum_series = series1 + series2
print(sum_series)
# Output:
# 0 11
# 1 22
# 2 33
# 3 44
# dtype: int64
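Because operations align on index labels rather than positions, adding Series with different indices produces NaN for labels present in only one operand. A small sketch illustrating the alignment mentioned above:

```python
import pandas as pd

# Labels 'b' and 'c' appear in both Series; 'a' and 'd' in only one
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

aligned = s1 + s2
print(aligned)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64
```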
5.1.3.2 Mathematical Functions
You can use mathematical functions like mean(), sum(), max(), and others on Series.
# Using mathematical functions
values = pd.Series([5, 10, 15, 20])
print("Mean:", values.mean()) # Output: Mean: 12.5
print("Sum:", values.sum()) # Output: Sum: 50
print("Max:", values.max()) # Output: Max: 20
5.1.3.3 Applying Functions with apply()
The apply() function allows you to apply a custom function to each element in the Series.
# Applying a custom function to each element
def square(x):
    return x * x
squared_series = values.apply(square)
print(squared_series)
# Output:
# 0 25
# 1 100
# 2 225
# 3 400
# dtype: int64
5.1.3.4 Handling Missing Data
Series can handle missing data (NaN values), and you can manipulate these values using methods like fillna(), dropna(), and isna().
# Series with missing values
data_with_nan = pd.Series([1, 2, None, 4])
print(data_with_nan)
# Filling missing values
filled_series = data_with_nan.fillna(0)
print(filled_series)
# Output:
# 0 1.0
# 1 2.0
# 2 0.0
# 3 4.0
# dtype: float64
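The isna() and dropna() methods mentioned above can be sketched on the same kind of Series:

```python
import pandas as pd

# A Series containing a missing value
data_with_nan = pd.Series([1, 2, None, 4])

print(data_with_nan.isna())    # boolean mask, True where the value is missing
print(data_with_nan.dropna())  # the Series with missing entries removed
```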
5.1.4 Conclusion
Series in pandas is a powerful tool for one-dimensional data manipulation, allowing for efficient data access, arithmetic operations, and handling of missing data. Understanding Series is fundamental to working with more complex pandas data structures like DataFrames.
5.2 DataFrame
A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns) in pandas. Think of it as a table of data, similar to an Excel spreadsheet or a SQL table, which can hold data of various types (strings, integers, floats, etc.).
5.2.1 Creating and Modifying DataFrames
DataFrames can be created from various sources, including lists, dictionaries, NumPy arrays, and even other pandas objects like Series.
5.2.1.1 Creating a DataFrame from a Dictionary
The most common way to create a DataFrame is by using a dictionary, where the keys become column names, and the values are lists representing column data.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
5.2.1.2 Creating a DataFrame from a List of Lists
You can create a DataFrame from a list of lists, specifying column names separately.
# Creating a DataFrame from a list of lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
5.2.1.3 Creating a DataFrame from a NumPy Array
DataFrames can also be created from NumPy arrays. You should specify column names, since pandas otherwise assigns default integer column labels rather than inferring names.
import numpy as np
# Creating a DataFrame from a NumPy array
array_data = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(array_data, columns=['Column1', 'Column2'])
print(df)
5.2.1.4 Modifying a DataFrame
You can add, modify, or delete rows and columns in a DataFrame.
Adding a Column:
# Adding a new column to an existing DataFrame
df['Score'] = [90, 85, 88]
print(df)
Modifying a Column:
# Modifying an existing column
df['Age'] = df['Age'] + 1
print(df)
Deleting a Column:
# Deleting a column
df.drop('Score', axis=1, inplace=True)
print(df)
5.2.2 Selecting Rows and Columns
Selecting data from a DataFrame can be done using indexing, loc, and iloc.
5.2.2.1 Selecting Columns
Columns can be selected using square brackets, dot notation, or loc.
# Selecting a single column
names = df['Name']
print(names)
# Selecting multiple columns
subset = df[['Name', 'City']]
print(subset)
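Dot notation, mentioned above, is a convenient shorthand that works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. A brief sketch:

```python
import pandas as pd

# Dot notation: df.Name is equivalent to df['Name'] for well-behaved names
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

print(df.Name)  # same as df['Name']
```

For column names containing spaces or matching attributes like `count`, stick to bracket notation.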
5.2.2.2 Selecting Rows
Rows can be selected by index position using iloc or by label using loc.
Using iloc (Index Location):
iloc selects rows and columns by integer index positions.
# Selecting rows using iloc
print(df.iloc[0]) # First row
print(df.iloc[1:3]) # Slicing rows
Using loc (Label Location):
loc selects rows and columns by label (index and column names).
# Selecting rows using loc
df.set_index('Name', inplace=True) # Setting 'Name' column as index
print(df.loc['Alice']) # Select row with index 'Alice'
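loc also accepts a column label alongside the row label, selecting individual cells or sub-tables. A small self-contained sketch (using a fresh DataFrame already indexed by name):

```python
import pandas as pd

# loc can take both a row label and a column label: df.loc[row, column]
df = pd.DataFrame(
    {'Age': [25, 30], 'City': ['New York', 'Los Angeles']},
    index=['Alice', 'Bob']
)

print(df.loc['Alice', 'City'])     # a single cell
print(df.loc[['Alice'], ['Age']])  # a 1x1 sub-DataFrame
```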
5.2.3 Filtering Data in DataFrames
Filtering allows you to extract specific rows based on conditions.
Filtering Rows Based on Conditions:
# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
5.2.4 Adding and Removing Rows
You can add rows to a DataFrame using the append()
method and remove rows using drop()
.
5.2.4.1 Adding Rows
# Adding a new row using append() (deprecated; removed in pandas 2.0)
new_row = pd.DataFrame([['David', 28, 'Seattle']], columns=['Name', 'Age', 'City'])
df = df.append(new_row, ignore_index=True)
print(df)
Here’s the revised version using updated pandas methods such as pd.concat() instead of append():
import pandas as pd
# Creating the initial DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Adding a new row using pd.concat()
new_row = pd.DataFrame([['David', 28, 'Seattle']], columns=['Name', 'Age', 'City'])
df = pd.concat([df, new_row], ignore_index=True)
print(df)
5.2.4.2 Removing Rows
# Removing rows by index
df = df.drop(0) # Removes the first row
print(df)
In this updated code, rows are added using pd.concat(), which is the recommended method in newer versions of pandas; the drop() method is still valid and is used to remove rows by their index.
5.2.5 Handling Missing Data in DataFrames
Handling missing data is essential for clean and accurate data analysis.
5.2.5.1 Identifying Missing Data
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 80000]
}
df = pd.DataFrame(data)

# Checking for missing data
print(df.isna()) # Returns a DataFrame of booleans
print(df.isna().sum()) # Sum of missing values in each column
5.2.5.2 Filling Missing Data
You can fill missing data with a specific value, such as 0 or the mean of the column.
# Filling missing values with a specified value
df['Age'].fillna(df['Age'].mean(), inplace=True)
This code triggers a warning related to pandas’ chained-assignment behavior. In future versions, using inplace=True with column-level operations might not work as expected. To avoid the issue and stay compatible with future versions of pandas, do not use inplace=True in this context; instead, assign the result of the fillna() operation directly back to the DataFrame.
Here’s the corrected code:
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 40],
    'Salary': [50000, 60000, np.nan, 80000]
}
df = pd.DataFrame(data)

# Filling missing values in the 'Salary' column and assigning back to the original DataFrame
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print(df)
5.2.5.3 Dropping Missing Data
Remove rows or columns that contain missing values.
# Dropping rows with missing values
df.dropna(inplace=True)
df
5.2.6 Sorting Data in DataFrames
Sorting can be performed based on one or multiple columns using the sort_values() method.
# Sorting DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
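Sorting by multiple columns works by passing lists to the by= and ascending= parameters; rows are ordered by the first column, with ties broken by the next. A short sketch with hypothetical data:

```python
import pandas as pd

# Sort by City ascending, then by Age descending within each city
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['Boston', 'Boston', 'Chicago', 'Chicago']
})

sorted_df = df.sort_values(by=['City', 'Age'], ascending=[True, False])
print(sorted_df)
```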
5.2.7 Grouping Data (cf. MS Excel ‘PivotTable’)
Grouping data is useful for performing aggregate operations on subsets of data.
5.2.7.1 Option 1: Select Numeric Columns Before Grouping
# Grouping by a column and calculating the mean
import pandas as pd
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'HR']
}
df = pd.DataFrame(data)
print(df)
# Grouping by the 'Department' column and calculating the mean of only numeric columns
grouped = df.groupby('Department')[['Age', 'Salary']].mean()

print(grouped)
5.2.7.2 Option 2: Use numeric_only=True (if available)
import pandas as pd
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'HR']
}
df = pd.DataFrame(data)

# Grouping by the 'Department' column and calculating the mean, ignoring non-numeric columns
grouped = df.groupby('Department').mean(numeric_only=True)

print(grouped)
5.2.8 Merging DataFrames
Merging combines multiple DataFrames into one, similar to SQL JOIN operations.
5.2.8.1 Using merge()
# Merging two DataFrames on a common column
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 90]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Grade': ['A', 'B']})

print(df1)
print('---')
print(df2)
print('---')
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)
5.2.8.2 Using concat()
# Concatenating DataFrames
df3 = pd.concat([df1, df2], axis=1)
print(df3)
5.2.9 Exporting DataFrames
You can export DataFrames to various file formats, including CSV, Excel, and JSON.
# Install the library required for Excel export
!pip install openpyxl
import pandas as pd
import openpyxl
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Department': ['HR', 'IT', 'IT', 'HR', 'HR']
}
df = pd.DataFrame(data)
# Saving DataFrame to a CSV file
df.to_csv('output.csv', index=False)

# Saving DataFrame to an Excel file
df.to_excel('output.xlsx', index=False)
5.2.10 Conclusion
DataFrames in pandas are versatile and powerful for handling tabular data in Python. They provide robust functionality for data manipulation, cleaning, and analysis, making them an essential tool for data scientists and analysts.
5.3 Loading and Saving Data
Pandas provides powerful methods to load data from various file formats and save it back after manipulation. Common formats include CSV, Excel, JSON, and SQL databases.
5.3.1 Loading Data
Pandas offers various functions to read data from external sources into DataFrames, allowing for quick data analysis and manipulation.
5.3.1.1 Reading CSV Files
The read_csv() function reads data from a CSV file into a DataFrame. This is one of the most common methods to import data.
import pandas as pd
# Reading a CSV file into a DataFrame
df = pd.read_csv('data.csv')
print(df.head()) # Display the first 5 rows
5.3.1.2 Reading Excel Files
To read Excel files, use the read_excel() function. You can specify the sheet name if the Excel file contains multiple sheets.
# Reading data from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())
5.3.1.3 Reading JSON Files
JSON files can be read using the read_json() function. JSON is commonly used for data exchange between systems.
# Reading data from a JSON file
df = pd.read_json('data.json')
print(df.head())
5.3.2 Saving Data
Once you have manipulated your data, you can save it back to your preferred format using pandas functions.
5.3.2.1 Saving to CSV
The to_csv() function saves the DataFrame to a CSV file. You can choose whether to include the index or not.
# Saving DataFrame to a CSV file
df.to_csv('output.csv', index=False)
5.3.2.2 Saving to Excel
To save data to an Excel file, use the to_excel() function. You can also specify the sheet name.
# Saving DataFrame to an Excel file
df.to_excel('output.xlsx', index=False, sheet_name='Results')
5.3.2.3 Saving to JSON
You can save data to JSON format using the to_json() function, which is useful for data interchange between applications.
# Saving DataFrame to a JSON file
df.to_json('output.json')
5.4 Exploring Summary Information and Statistical Values
Exploring data is an essential step in understanding the structure, distribution, and characteristics of your dataset. Pandas provides several methods to inspect and summarize DataFrames.
5.4.1 Viewing Basic Information
The info() method provides a concise summary of the DataFrame, including the index, column names, non-null counts, and data types.
import pandas as pd
# Expanding the data with more people and overlapping cities
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Hannah', 'Ian', 'Jack'],
    'Age': [25, 30, 35, 28, 22, 40, 33, 26, 29, 31],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles', 'San Francisco', 'Chicago', 'Boston', 'New York', 'Boston']
}
# Creating the DataFrame
df = pd.DataFrame(data)
df
# Viewing basic information about the DataFrame
print(df.info())
5.4.2 Viewing Statistical Summaries
The describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of the data distribution, excluding NaN values.
# Viewing statistical summary of numerical columns
print(df.describe())
You can also include non-numerical data by specifying include='all'.
# Including all columns, including non-numeric
print(df.describe(include='all'))
5.4.3 Counting Unique Values
The value_counts() function counts the unique values in a column, which is particularly useful for categorical data.
# Counting unique values in a column
print(df['City'].value_counts())
5.4.4 Displaying First and Last Rows
You can use head() and tail() to view the first and last few rows of the DataFrame.
# Displaying the first 5 rows
print(df.head())
# Displaying the last 5 rows
print(df.tail())
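Both head() and tail() accept an optional row count (the default is 5). A quick sketch:

```python
import pandas as pd

# head(n) and tail(n) show the first/last n rows
df = pd.DataFrame({'X': range(10)})

print(df.head(3))  # first three rows
print(df.tail(2))  # last two rows
```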
5.5 Handling Missing Data
Handling missing data is a crucial aspect of data cleaning, as missing values can lead to incorrect analysis results.
5.5.1 Identifying Missing Data
Use isna() or isnull() to detect missing values. These methods return a DataFrame of booleans indicating the presence of missing values.
# Creating a DataFrame with missing values for demonstrating missing data handling
data_with_missing = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Hannah', 'Ian', 'Jack'],
    'Age': [25, 30, None, 28, 22, 40, 33, None, 29, 31], # Introducing missing values in 'Age'
    'City': ['New York', 'Los Angeles', 'Chicago', None, 'Los Angeles', 'San Francisco', 'Chicago', 'Boston', 'New York', None] # Introducing missing values in 'City'
}
# Creating the DataFrame with missing data
df_missing = pd.DataFrame(data_with_missing)
df_missing
# Identifying missing data
print(df_missing.isna())
# Counting missing values in each column
print(df_missing.isna().sum())
5.5.2 Removing Missing Data
You can remove rows or columns with missing values using the dropna() method.
# Dropping rows with any missing values
df_cleaned1 = df_missing.dropna()
print(df_cleaned1)
# Dropping columns with missing values
df_cleaned2 = df_missing.dropna(axis=1)
print(df_cleaned2)
5.5.3 Filling Missing Data
Instead of dropping missing data, you can fill it with specified values using fillna(). Common strategies include filling with zeros, the mean, or forward/backward filling.
# Filling missing values with a specified value
df_missing['Age'].fillna(0)

# Filling missing values with the column mean
df_missing['Age'].fillna(df_missing['Age'].mean())

# Note: without assignment, the calls above return new Series and leave df_missing unchanged
df_missing

# Forward filling missing values (ffill() replaces the deprecated fillna(method='ffill'))
df_missing.ffill()
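Backward filling, mentioned above, pulls the next valid observation backward; in recent pandas versions the ffill() and bfill() methods are the preferred spellings. A minimal sketch:

```python
import pandas as pd

# ffill() propagates the last valid value forward;
# bfill() pulls the next valid value backward
s = pd.Series([1.0, None, None, 4.0])

print(s.ffill())  # 1.0, 1.0, 1.0, 4.0
print(s.bfill())  # 1.0, 4.0, 4.0, 4.0
```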
5.6 Working with Indices
The index of a DataFrame is used to uniquely identify rows. Managing indices effectively allows for more efficient data manipulation and retrieval.
5.6.1 Setting and Resetting Index
You can set a column as the index using set_index() and reset it back to the default integer index using reset_index().
# Setting 'Name' as the index
df.set_index('Name', inplace=True)
print(df)
# Resetting the index to default
df.reset_index(inplace=True)
print(df)
5.6.2 Sorting by Index
The sort_index() method sorts the DataFrame by its index.
# Sorting the DataFrame by index
df.sort_index(inplace=True)
print(df)
5.6.3 Changing the Index Name
You can rename the index by assigning a new name directly.
# Renaming the index
df.index.name = 'ID'
print(df)
5.7 Filtering Data
Filtering allows you to select rows based on specific conditions, enabling targeted analysis and manipulation.
5.7.1 Boolean Indexing
Boolean indexing uses conditional statements to filter data.
# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
5.7.2 Using the isin() Method
The isin() method filters rows based on multiple values in a column.
# Filtering rows where Name is Alice or Bob
filtered_df = df[df['Name'].isin(['Alice', 'Bob'])]
print(filtered_df)
5.7.3 Filtering with Multiple Conditions
You can combine multiple conditions using & (and) or | (or).
# Filtering with multiple conditions
filtered_df = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
print(filtered_df)
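The | operator works the same way for a logical OR; each condition still needs its own parentheses because & and | bind more tightly than comparisons. A self-contained sketch:

```python
import pandas as pd

# Keep rows that satisfy either condition
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

filtered_df = df[(df['Age'] < 28) | (df['City'] == 'Chicago')]
print(filtered_df)  # Alice (Age < 28) and Charlie (City == 'Chicago')
```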
5.8 Merging DataFrames
Merging DataFrames is an essential operation when working with multiple datasets that need to be combined. Pandas offers several functions for merging, joining, and concatenating DataFrames.
5.8.1 Using the concat() Function
The concat() function concatenates DataFrames along rows (default) or columns. It is useful when you have DataFrames with the same columns and want to stack them.
import pandas as pd
# Creating sample DataFrames
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})

# Concatenating DataFrames vertically
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)
# Output:
# Name Age
# 0 Alice 25
# 1 Bob 30
# 2 Charlie 35
# 3 David 40
You can also concatenate DataFrames horizontally by specifying axis=1.
# Concatenating DataFrames horizontally
df_concat_horizontal = pd.concat([df1, df2], axis=1)
print(df_concat_horizontal)
5.8.2 Using the merge() Function
The merge() function combines two DataFrames based on a key column or index. It works similarly to SQL JOIN operations (inner, outer, left, right).
# Creating sample DataFrames for merging
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 90]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Grade': ['A', 'B']})

# Merging DataFrames on the 'Name' column
df_merged = pd.merge(df1, df2, on='Name', how='inner')
print(df_merged)
# Output:
# Name Score Grade
# 0 Alice 85 A
# 1 Bob 90 B
5.8.2.1 SQL Join Methods
5.8.2.1.1 1. Inner Join
An inner join returns rows when there is a match in both tables.
5.8.2.1.2 2. Left Join (or Left Outer Join)
A left join returns all rows from the left table, and the matched rows from the right table. If no match is found, the result is NULL on the right side.
5.8.2.1.3 3. Right Join (or Right Outer Join)
A right join returns all rows from the right table, and the matched rows from the left table. If no match is found, the result is NULL on the left side.
5.8.2.1.4 4. Full Join (or Full Outer Join)
A full join returns rows when there is a match in either table. It returns all rows from both tables and fills NULL where there is no match.
5.8.2.2 Diagram of SQL Joins:
1. INNER JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| common | | common |
+--------+ +--------+
2. LEFT JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| all | | common |
| from | +--------+
| left |
+--------+
3. RIGHT JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| common | | all |
+--------+ | from |
| right |
+--------+
4. FULL JOIN:
+--------+ +--------+
| LEFT | | RIGHT |
+--------+ +--------+
+--------+ +--------+
| all | | all |
| from | | from |
| both | | both |
+--------+ +--------+
5.8.2.3 Python Example with DataFrames
Let’s create two DataFrames and perform different types of joins using the merge() function in pandas, which simulates SQL joins.
import pandas as pd
# Creating two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40]
})
df2 = pd.DataFrame({
    'Name': ['Alice', 'Charlie', 'Eve'],
    'City': ['New York', 'Chicago', 'Los Angeles']
})
# Display the DataFrames
print("DataFrame 1:")
print(df1)
print("DataFrame 2:")
print(df2)
# Inner Join
inner_join = pd.merge(df1, df2, on='Name', how='inner')
print("\nInner Join:")
print(inner_join)
# Left Join
left_join = pd.merge(df1, df2, on='Name', how='left')
print("\nLeft Join:")
print(left_join)
# Right Join
right_join = pd.merge(df1, df2, on='Name', how='right')
print("\nRight Join:")
print(right_join)
# Full Outer Join
full_join = pd.merge(df1, df2, on='Name', how='outer')
print("\nFull Outer Join:")
print(full_join)
5.8.3 Using the join() Function
The join() function is used to join two DataFrames based on their indices. It is particularly useful for merging DataFrames with overlapping indices.
# Creating sample DataFrames with indices
df1 = pd.DataFrame({'Score': [85, 90]}, index=['Alice', 'Bob'])
df2 = pd.DataFrame({'Grade': ['A', 'B']}, index=['Alice', 'Bob'])

# Joining DataFrames on indices
df_joined = df1.join(df2)
print(df_joined)
# Output:
# Score Grade
# Alice 85 A
# Bob 90 B
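join() defaults to a left join on the index; when the indices only partially overlap, the how= parameter selects the join type, just as with merge(). A small sketch:

```python
import pandas as pd

# Indices overlap only on 'Alice'
df1 = pd.DataFrame({'Score': [85, 90]}, index=['Alice', 'Bob'])
df2 = pd.DataFrame({'Grade': ['A', 'C']}, index=['Alice', 'Charlie'])

left = df1.join(df2)               # left join: keeps Alice and Bob; Bob's Grade is NaN
inner = df1.join(df2, how='inner') # inner join: keeps only Alice
print(left)
print(inner)
```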
5.9 Reshaping Data
Reshaping allows you to change the layout of a DataFrame, making it suitable for different types of analysis.
5.9.1 Using the melt() Function
The melt() function unpivots a DataFrame from a wide format to a long format, making it easier to analyze and visualize.
# Creating a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Math': [90, 80],
    'Science': [85, 95]
})
# Melting the DataFrame
df_melted = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'],
                    var_name='Subject', value_name='Score')
print(df_melted)
# Output:
# Name Subject Score
# 0 Alice Math 90
# 1 Bob Math 80
# 2 Alice Science 85
# 3 Bob Science 95
5.9.2 Using the pivot_table() Function
The pivot_table() function reshapes data by creating a new summary table, which is useful for aggregating data.
# Creating a DataFrame for pivoting
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Science', 'Science'],
    'Score': [90, 80, 85, 95]
})
print(df)
print('---')
# Pivoting the DataFrame
df_pivot = df.pivot_table(values='Score', index='Name', columns='Subject',
                          aggfunc='mean')
print(df_pivot)
# Output:
# Subject Math Science
# Name
# Alice 90.0 85.0
# Bob 80.0 95.0
5.9.3 Using stack() and unstack()
stack() compresses columns into rows, while unstack() does the opposite, expanding rows into columns.
# Stacking and unstacking a DataFrame
df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
}, index=['X', 'Y'])

# Stacking the DataFrame
stacked = df.stack()
print(stacked)
# Unstacking the stacked DataFrame
unstacked = stacked.unstack()
print(unstacked)
5.10 Applying Functions to DataFrames
Applying functions to DataFrames allows you to transform data easily, using built-in or custom functions.
5.10.1 Applying Functions to Series
The apply() function applies a function along an axis of the DataFrame.
# Creating a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
# Applying a function to each column
df_squared = df.apply(lambda x: x ** 2)
print(df_squared)
# Output:
# A B
# 0 1 16
# 1 4 25
# 2 9 36
# Applying a function to each column with a named function
# def square(x):
#     y = x ** 2
#     return y
# df_squared = df.apply(square)
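By default apply() passes each column to the function (axis=0); with axis=1 it passes each row instead, which is handy for combining values across columns. A brief sketch:

```python
import pandas as pd

# axis=1 hands each row to the function as a Series
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

row_sums = df.apply(lambda row: row['A'] + row['B'], axis=1)
print(row_sums)  # 5, 7, 9
```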
5.10.2 Applying Functions to Entire DataFrames
The applymap() function applies a function to each element of the DataFrame. (In pandas 2.1 and later, DataFrame.map() is the preferred name for this element-wise operation.)
# Applying a function to each element
df_negated = df.applymap(lambda x: -x)
print(df_negated)
# Output:
# A B
# 0 -1 -4
# 1 -2 -5
# 2 -3 -6