9 Text Preprocessing
Text preprocessing is an essential step in preparing raw textual data for analysis, especially when working with tasks like topic modeling, sentiment analysis, or classification. The goal is to convert the raw text into a clean and consistent format that can be analyzed using machine learning models. This section outlines the key steps involved in text preprocessing and demonstrates these steps through a Python code example.
9.1 Key Steps in Text Preprocessing
Lowercasing: Converting all text to lowercase ensures uniformity. Without this, words like “Book” and “book” would be treated as different entities, leading to unnecessary complexity in the analysis.
Tokenization: Tokenization is the process of splitting text into individual words or tokens. These tokens are the basic units on which further processing steps are applied.
Removing Punctuation: Punctuation marks often do not contribute to the meaning of text for analytical purposes. Stripping punctuation ensures a cleaner input for the model.
Stopword Removal: Stopwords are common words like “the,” “is,” “in,” “of,” etc., that appear frequently but contribute little to the overall meaning of the text. Removing stopwords helps reduce noise and improve the focus on the core content.
Lemmatization (Optional): Lemmatization is the process of reducing words to their base or dictionary form. While this is an optional step, it is useful for ensuring that words like “running” and “run” are treated as the same entity. The first code example below omits lemmatization for simplicity, while the second includes it; a short sketch of all five steps applied to a single sentence follows this list.
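To make these steps concrete, the minimal sketch below applies them to one example sentence using only Python’s standard library. The sentence and the tiny stopword list are illustrative choices, and lemmatization is omitted, mirroring the first code example.

import re

text = "The Cats are Sitting in the Garden!"

text = text.lower()                     # lowercasing
text = re.sub(r'[^\w\s]', '', text)     # remove punctuation
tokens = text.split()                   # tokenization
stop_words = {'the', 'are', 'in'}       # toy stopword list for illustration
tokens = [w for w in tokens if w not in stop_words]

print(tokens)  # ['cats', 'sitting', 'garden']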
9.1.1 Python Code Example 1
Below is a Python script that demonstrates how to preprocess a dataset containing raw text, specifically in the “Product Description” column. This dataset includes information about books, and we aim to clean the text data to prepare it for further analysis.
The code performs the following operations:
- Loads the CSV file that contains the product descriptions.
- Converts the text to lowercase.
- Removes punctuation marks.
- Tokenizes the text into individual words.
- Removes common stopwords.
- Saves the processed text in a new column named “preprocessed_text” and exports the updated CSV file.
import pandas as pd
import re

# Load the CSV file
file_path = '/mnt/data/books_scraped_with_descriptions_all_pages.csv'
data = pd.read_csv(file_path)

# Preprocess function (without external libraries)
def preprocess_text_no_external(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize by splitting on whitespace
    tokens = text.split()

    # Simple list of stopwords
    stop_words = set(['a', 'the', 'and', 'is', 'in', 'it', 'to', 'this', 'of', 'that'])

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Join tokens back into a string
    return ' '.join(tokens)

# Apply preprocessing to the "Product Description" column
data['cleaned_description'] = data['Product Description'].dropna().apply(preprocess_text_no_external)

# Add the "preprocessed_text" column to the original data
data['preprocessed_text'] = data['cleaned_description']

# Save the updated data to a new CSV file, including the new column
updated_file_path = '/mnt/data/books_scraped_with_preprocessed_descriptions.csv'
data.to_csv(updated_file_path, index=False)

print("Preprocessing complete! The updated file has been saved with preprocessed text.")
9.1.2 Explanation of Code Example 1
Data Loading: The CSV file containing book data is loaded into a pandas DataFrame, and the column “Product Description” is selected for preprocessing.
Text Cleaning: The preprocess_text_no_external function handles all preprocessing tasks, including:
- Lowercasing the text.
- Removing punctuation using regular expressions.
- Tokenizing by splitting the text on whitespace.
- Removing stopwords by filtering out a small hand-coded list of common English stopwords.
Processed Output: The cleaned text is stored in a new column called “preprocessed_text” in the DataFrame.
File Export: Finally, the updated DataFrame is saved back to a CSV file, which includes both the original product description and the newly preprocessed text.
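As a quick sanity check, the snippet below (a hypothetical usage example, assuming the function defined above is available in the session) shows what the function returns for a single made-up description:

# Illustrative input; any string works
sample = "This is a Great Book about Python!"
print(preprocess_text_no_external(sample))
# Output: great book about python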
9.1.3 Python Code Example 2
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Ensure required NLTK resources are downloaded
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file
file_path = 'books_scraped_with_descriptions_all_pages.csv'
data = pd.read_csv(file_path)

# Select the "Product Description" column
descriptions = data['Product Description'].dropna()

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Preprocess function
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove punctuation (keep only alphanumeric tokens)
    tokens = [word for word in tokens if word.isalnum()]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join tokens back into a string
    return ' '.join(tokens)

# Apply preprocessing to the "Product Description" column
data['cleaned_description'] = descriptions.apply(preprocess_text)

# Add the "preprocessed_text" column to the original data
data['preprocessed_text'] = data['cleaned_description']

# Save the updated data to a new CSV file, including the new column
updated_file_path = 'books_scraped_with_preprocessed_descriptions.csv'
data.to_csv(updated_file_path, index=False)

# Display the updated DataFrame
data
9.1.4 Explanation of Code Example 2
The provided code performs text preprocessing on a CSV file that contains book descriptions and saves the cleaned text back to a new CSV file. Let’s break it down step by step:
9.1.4.1 Loading the CSV File
# Load the CSV file
file_path = 'books_scraped_with_descriptions_all_pages.csv'
data = pd.read_csv(file_path)
- The pd.read_csv() function loads the CSV file located at file_path into a pandas DataFrame called data. This DataFrame holds all the data from the file, including the book descriptions.
9.1.4.2 Selecting the ‘Product Description’ Column
# Select the "Product Description" column
descriptions = data['Product Description'].dropna()
- The code extracts the ‘Product Description’ column from the DataFrame using data['Product Description'].
- The dropna() method removes any rows with missing values (i.e., NaN values) in this column so that only valid descriptions are processed.
9.1.4.3 Initializing the Lemmatizer
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
- The WordNetLemmatizer() is initialized here. Lemmatization is the process of reducing words to their base or root form (e.g., “running” → “run”). This helps standardize different forms of the same word.
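One caveat worth knowing: WordNetLemmatizer assumes a noun part of speech by default, so verb forms are only reduced when a pos argument is supplied. The short sketch below (assuming the wordnet resource has been downloaded) illustrates this:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('books'))             # book (default pos is noun)
print(lemmatizer.lemmatize('running'))           # running (treated as a noun)
print(lemmatizer.lemmatize('running', pos='v'))  # run (explicitly tagged as a verb)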
9.1.4.4 Defining the Preprocessing Function
# Preprocess function
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove punctuation (keep only alphanumeric tokens)
    tokens = [word for word in tokens if word.isalnum()]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join tokens back into a string
    return ' '.join(tokens)
This function processes the raw text through the following steps:
- Lowercasing: The text is converted to lowercase using text.lower() to ensure uniformity (e.g., “Book” and “book” are treated the same).
- Tokenization: word_tokenize(text) splits the text into individual tokens (words). For example, “This is a book” becomes ['This', 'is', 'a', 'book'].
- Removing Punctuation: word.isalnum() checks whether each token is alphanumeric. Tokens that contain punctuation marks or symbols are excluded.
- Removing Stopwords: A set of common stopwords (e.g., “the,” “is,” “and”) is obtained from stopwords.words('english'). These words are removed from the tokenized text since they usually carry little meaning in most natural language tasks.
- Lemmatization: Each token is lemmatized using lemmatizer.lemmatize(word) so that different forms of the same word are reduced to their root (e.g., “books” → “book”). As noted in 9.1.4.3, verbs such as “running” are only reduced to “run” when a part-of-speech tag is supplied.
- Rejoining the Tokens: After processing, the tokens are rejoined into a single string using ' '.join(tokens). A worked example of the whole function follows this list.
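Assuming the NLTK resources above are downloaded and the function is defined, a single call shows the whole pipeline at work on a made-up sentence. Note that “running” survives unchanged because of the lemmatizer’s default noun part of speech:

sample = "The books are running low in this store!"
print(preprocess_text(sample))
# Output: book running low store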
9.1.4.5 Applying the Preprocessing Function
# Apply preprocessing to the "Product Description" column
data['cleaned_description'] = descriptions.apply(preprocess_text)
- The preprocess_text() function is applied to each entry in the ‘Product Description’ column using descriptions.apply(preprocess_text).
- The cleaned and preprocessed text is then stored in a new column in the DataFrame called 'cleaned_description'.
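One subtlety: because descriptions was built with dropna(), it covers only part of the DataFrame’s index, and pandas aligns on that index during assignment. Rows whose original description was missing therefore receive NaN in the new column. If empty strings are preferred there, one optional follow-up (an assumption about the desired behavior, not part of the original code) is:

# Replace NaN entries left by the index-aligned assignment with empty strings
data['cleaned_description'] = data['cleaned_description'].fillna('')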
9.1.4.6 Adding the ‘Preprocessed Text’ Column
# Add the "preprocessed_text" column to the original data
data['preprocessed_text'] = data['cleaned_description']
- This line adds a new column called 'preprocessed_text' to the data DataFrame, which contains the preprocessed (cleaned) descriptions; it simply copies 'cleaned_description' under the column name used for the rest of the analysis.
9.1.4.7 Saving the Updated Data to a CSV File
# Save the updated data to a new CSV file, including the new column
updated_file_path = 'books_scraped_with_preprocessed_descriptions.csv'
data.to_csv(updated_file_path, index=False)
- The updated data DataFrame, now containing both the original and preprocessed descriptions, is saved to a new CSV file named 'books_scraped_with_preprocessed_descriptions.csv' using to_csv().
- The index=False parameter ensures that the DataFrame index is not written to the file.
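As an optional check (not part of the original code), the exported file can be read back to confirm that the new column is present:

# Reload the exported file and inspect the new column
check = pd.read_csv(updated_file_path)
print(check.columns.tolist())
print(check['preprocessed_text'].head())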
9.1.4.8 Final Output
- The output CSV file contains all the original data, with the cleaned descriptions stored in the preprocessed_text column.
- This file can be used for further analysis, such as topic modeling, sentiment analysis, or other natural language processing tasks.
9.1.5 Importance of Preprocessing
Effective text preprocessing is crucial for generating meaningful insights from unstructured data. By reducing noise and standardizing the text, preprocessing enhances the performance of topic models and other natural language processing tasks.
In the context of topic modeling, preprocessing ensures that the algorithm focuses on meaningful words rather than irrelevant tokens, making the model’s output more interpretable and accurate.
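As a pointer toward that next step, the sketch below shows one common way to feed the preprocessed_text column into a topic model, using scikit-learn’s CountVectorizer and LatentDirichletAllocation. The file and column names match the examples above; the number of topics and other parameters are illustrative choices, not recommendations.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the preprocessed file produced earlier in this section
data = pd.read_csv('books_scraped_with_preprocessed_descriptions.csv')
docs = data['preprocessed_text'].dropna()

# Build a document-term matrix from the cleaned text
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

# Fit a small LDA model; n_components=5 is an arbitrary illustrative choice
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(dtm)

# Print the five highest-weighted words for each topic
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")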