9 Text Preprocessing
Text preprocessing is an essential step in preparing raw textual data for analysis, especially when working with tasks like topic modeling, sentiment analysis, or classification. The goal is to convert the raw text into a clean and consistent format that can be analyzed using machine learning models. This section outlines the key steps involved in text preprocessing and demonstrates these steps through a Python code example.
9.1 Key Steps in Text Preprocessing
Lowercasing: Converting all text to lowercase ensures uniformity. Without this, words like “Book” and “book” would be treated as different entities, leading to unnecessary complexity in the analysis.
Tokenization: Tokenization is the process of splitting text into individual words or tokens. These tokens are the basic units on which further processing steps are applied.
Removing Punctuation: Punctuation marks often do not contribute to the meaning of text for analytical purposes. Stripping punctuation ensures a cleaner input for the model.
Stopword Removal: Stopwords are common words like “the,” “is,” “in,” “of,” etc., that appear frequently but contribute little to the overall meaning of the text. Removing stopwords helps reduce noise and improve the focus on the core content.
Lemmatization (Optional): Lemmatization is the process of reducing words to their base or dictionary form. While this is an optional step, it is useful for ensuring that words like “running” and “run” are treated as the same entity. The first code example below omits lemmatization for simplicity, while the second includes it; a short sketch of all five steps applied to a single sentence follows this list.
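To make these steps concrete, the minimal sketch below applies them to one example sentence using only Python’s standard library. The sentence and the tiny stopword list are illustrative choices, and lemmatization is omitted, mirroring the first code example.

import re

text = "The Cats are Sitting in the Garden!"

text = text.lower()                     # lowercasing
text = re.sub(r'[^\w\s]', '', text)     # remove punctuation
tokens = text.split()                   # tokenization
stop_words = {'the', 'are', 'in'}       # toy stopword list for illustration
tokens = [w for w in tokens if w not in stop_words]

print(tokens)  # ['cats', 'sitting', 'garden']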
9.1.1 Python Code Example 1
Below is a Python script that demonstrates how to preprocess a dataset containing raw text, specifically in the “Product Description” column. This dataset includes information about books, and we aim to clean the text data to prepare it for further analysis.
The code performs the following operations:
- Loads the CSV file that contains the product descriptions.
- Converts the text to lowercase.
- Removes punctuation marks.
- Tokenizes the text into individual words.
- Removes common stopwords.
- Saves the processed text in a new column named “preprocessed_text” and exports the updated CSV file.
import pandas as pd
import re

# Load the CSV file
file_path = '/mnt/data/books_scraped_with_descriptions_all_pages.csv'
data = pd.read_csv(file_path)

# Preprocess function (without external libraries)
def preprocess_text_no_external(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize by splitting on whitespace
    tokens = text.split()

    # Simple list of stopwords
    stop_words = set(['a', 'the', 'and', 'is', 'in', 'it', 'to', 'this', 'of', 'that'])

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Join tokens back into a string
    return ' '.join(tokens)

# Apply preprocessing to the "Product Description" column
data['cleaned_description'] = data['Product Description'].dropna().apply(preprocess_text_no_external)

# Add the "preprocessed_text" column to the original data
data['preprocessed_text'] = data['cleaned_description']

# Save the updated data to a new CSV file, including the new column
updated_file_path = '/mnt/data/books_scraped_with_preprocessed_descriptions.csv'
data.to_csv(updated_file_path, index=False)

print("Preprocessing complete! The updated file has been saved with preprocessed text.")
9.1.2 Explanation of Code Example 1
Data Loading: The CSV file containing book data is loaded into a pandas DataFrame, and the column “Product Description” is selected for preprocessing.
Text Cleaning: The preprocess_text_no_external function handles all preprocessing tasks, including:
- Lowercasing the text.
- Removing punctuation using regular expressions.
- Tokenizing by splitting the text on whitespace.
- Removing stopwords by filtering out a small hand-coded list of common English stopwords.
Processed Output: The cleaned text is stored in a new column called “preprocessed_text” in the DataFrame.
File Export: Finally, the updated DataFrame is saved back to a CSV file, which includes both the original product description and the newly preprocessed text.
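As a quick sanity check, the snippet below (a hypothetical usage example, assuming the function defined above is available in the session) shows what the function returns for a single made-up description:

# Illustrative input; any string works
sample = "This is a Great Book about Python!"
print(preprocess_text_no_external(sample))
# Output: great book about python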
9.1.3 Python Code Example 2
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Ensure required NLTK resources are downloaded
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file
file_path = 'books_scraped_with_descriptions_all_pages.csv'
data = pd.read_csv(file_path)

# Select the "Product Description" column
descriptions = data['Product Description'].dropna()

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Preprocess function
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove punctuation (keep only alphanumeric tokens)
    tokens = [word for word in tokens if word.isalnum()]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join tokens back into a string
    return ' '.join(tokens)

# Apply preprocessing to the "Product Description" column
data['cleaned_description'] = descriptions.apply(preprocess_text)

# Add the "preprocessed_text" column to the original data
data['preprocessed_text'] = data['cleaned_description']

# Save the updated data to a new CSV file, including the new column
updated_file_path = 'books_scraped_with_preprocessed_descriptions.csv'
data.to_csv(updated_file_path, index=False)

# Display the updated DataFrame
data
9.1.4 Explanation of Code Example 2
The provided code performs text preprocessing on a CSV file that contains book descriptions and saves the cleaned text back to a new CSV file. Let’s break it down step by step:
9.1.4.1 Loading the CSV File
# Load the CSV file
file_path = 'books_scraped_with_descriptions_all_pages.csv'
data = pd.read_csv(file_path)
- The pd.read_csv() function loads the CSV file located at file_path into a pandas DataFrame called data. This DataFrame holds all the data from the file, including the book descriptions.
9.1.4.2 Selecting the ‘Product Description’ Column
# Select the "Product Description" column
descriptions = data['Product Description'].dropna()
- The code extracts the ‘Product Description’ column from the DataFrame using data['Product Description'].
- The dropna() method removes any rows with missing values (i.e., NaN values) in this column so that only valid descriptions are processed.
9.1.4.3 Initializing the Lemmatizer
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
- The WordNetLemmatizer() is initialized here. Lemmatization is the process of reducing words to their base or root form (e.g., “running” → “run”). This helps standardize different forms of the same word.
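One caveat worth knowing: WordNetLemmatizer assumes a noun part of speech by default, so verb forms are only reduced when a pos argument is supplied. The short sketch below (assuming the wordnet resource has been downloaded) illustrates this:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('books'))             # book (default pos is noun)
print(lemmatizer.lemmatize('running'))           # running (treated as a noun)
print(lemmatizer.lemmatize('running', pos='v'))  # run (explicitly tagged as a verb)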
9.1.4.4 Defining the Preprocessing Function
# Preprocess function
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove punctuation (keep only alphanumeric tokens)
    tokens = [word for word in tokens if word.isalnum()]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join tokens back into a string
    return ' '.join(tokens)
This function processes the raw text through the following steps:
- Lowercasing: The text is converted to lowercase using text.lower() to ensure uniformity (e.g., “Book” and “book” are treated the same).
- Tokenization: word_tokenize(text) splits the text into individual tokens (words). For example, “This is a book” becomes ['This', 'is', 'a', 'book'].
- Removing Punctuation: word.isalnum() checks whether each token is alphanumeric. Tokens that contain punctuation marks or symbols are excluded.
- Removing Stopwords: A set of common stopwords (e.g., “the,” “is,” “and”) is obtained from stopwords.words('english'). These words are removed from the tokenized text since they usually carry little meaning in most natural language tasks.
- Lemmatization: Each token is lemmatized using lemmatizer.lemmatize(word) so that different forms of the same word are reduced to their root (e.g., “books” → “book”). As noted in 9.1.4.3, verbs such as “running” are only reduced to “run” when a part-of-speech tag is supplied.
- Rejoining the Tokens: After processing, the tokens are rejoined into a single string using ' '.join(tokens). A worked example of the whole function follows this list.
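Assuming the NLTK resources above are downloaded and the function is defined, a single call shows the whole pipeline at work on a made-up sentence. Note that “running” survives unchanged because of the lemmatizer’s default noun part of speech:

sample = "The books are running low in this store!"
print(preprocess_text(sample))
# Output: book running low store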
9.1.4.5 Applying the Preprocessing Function
# Apply preprocessing to the "Product Description" column
data['cleaned_description'] = descriptions.apply(preprocess_text)
- The preprocess_text() function is applied to each entry in the ‘Product Description’ column using descriptions.apply(preprocess_text).
- The cleaned and preprocessed text is then stored in a new column in the DataFrame called 'cleaned_description'.
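One subtlety: because descriptions was built with dropna(), it covers only part of the DataFrame’s index, and pandas aligns on that index during assignment. Rows whose original description was missing therefore receive NaN in the new column. If empty strings are preferred there, one optional follow-up (an assumption about the desired behavior, not part of the original code) is:

# Replace NaN entries left by the index-aligned assignment with empty strings
data['cleaned_description'] = data['cleaned_description'].fillna('')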
9.1.4.6 Adding the ‘Preprocessed Text’ Column
# Add the "preprocessed_text" column to the original data
data['preprocessed_text'] = data['cleaned_description']
- This line adds a new column called 'preprocessed_text' to the data DataFrame, which contains the preprocessed (cleaned) descriptions; it simply copies 'cleaned_description' under the column name used for the rest of the analysis.
9.1.4.7 Saving the Updated Data to a CSV File
# Save the updated data to a new CSV file, including the new column
updated_file_path = 'books_scraped_with_preprocessed_descriptions.csv'
data.to_csv(updated_file_path, index=False)
- The updated data DataFrame, now containing both the original and preprocessed descriptions, is saved to a new CSV file named 'books_scraped_with_preprocessed_descriptions.csv' using to_csv().
- The index=False parameter ensures that the DataFrame index is not written to the file.
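As an optional check (not part of the original code), the exported file can be read back to confirm that the new column is present:

# Reload the exported file and inspect the new column
check = pd.read_csv(updated_file_path)
print(check.columns.tolist())
print(check['preprocessed_text'].head())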
9.1.4.8 Final Output
- The output CSV file contains all the original data, with the cleaned descriptions stored in the preprocessed_text column.
- This file can be used for further analysis, such as topic modeling, sentiment analysis, or other natural language processing tasks.
9.1.5 Importance of Preprocessing
Effective text preprocessing is crucial for generating meaningful insights from unstructured data. By reducing noise and standardizing the text, preprocessing enhances the performance of topic models and other natural language processing tasks.
In the context of topic modeling, preprocessing ensures that the algorithm focuses on meaningful words rather than irrelevant tokens, making the model’s output more interpretable and accurate.
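As a pointer toward that next step, the sketch below shows one common way to feed the preprocessed_text column into a topic model, using scikit-learn’s CountVectorizer and LatentDirichletAllocation. The file and column names match the examples above; the number of topics and other parameters are illustrative choices, not recommendations.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the preprocessed file produced earlier in this section
data = pd.read_csv('books_scraped_with_preprocessed_descriptions.csv')
docs = data['preprocessed_text'].dropna()

# Build a document-term matrix from the cleaned text
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

# Fit a small LDA model; n_components=5 is an arbitrary illustrative choice
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(dtm)

# Print the five highest-weighted words for each topic
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")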