7 Regular Expressions (RegEx) in Python

Regular expressions (often abbreviated as regex or RegEx) are patterns used to match sequences of characters within strings. They provide a flexible and powerful way to search, edit, or manipulate text based on defined patterns.

For example, you can use regular expressions to:

Search for specific words or sequences in a text
Validate input formats (e.g., email addresses, phone numbers)
Extract information from structured or semi-structured data
Replace or modify parts of a string based on certain criteria

7.1 Basics of Regular Expressions

Here are some basic symbols in regular expressions:

. : Matches any single character except newline.
^ : Asserts the start of a string.
$ : Asserts the end of a string.
[]: Defines a set of characters. For example, [abc] matches either “a”, “b”, or “c”.
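\d: Matches any digit. For text (str) patterns in Python 3 this includes digits from other scripts; it is equivalent to [0-9] only when the re.ASCII flag is set.
\w: Matches any word character (letters, digits, and underscore). For str patterns this includes non-ASCII letters such as “가”; it is equivalent to [a-zA-Z0-9_] only when the re.ASCII flag is set.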
\s: Matches any whitespace character (spaces, tabs, etc.).
*: Matches 0 or more occurrences of the preceding element.
+: Matches 1 or more occurrences.
?: Matches 0 or 1 occurrence.
{n,m}: Matches between n and m occurrences of the preceding element.
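
As a quick illustration of the symbols above, the following sketch tries a few of them on a made-up sample string. The string and the variable names are assumptions chosen only for this example.

```python
import re

# Hypothetical sample string, chosen only to exercise the symbols above
sample = "cat cot c9t ct coot"

# . matches exactly one character (any character except a newline)
dot_matches = re.findall(r"c.t", sample)

# [] restricts the middle character to a set
set_matches = re.findall(r"c[ao]t", sample)

# ? makes the preceding element optional (0 or 1 occurrence)
optional_matches = re.findall(r"co?t", sample)

# {n,m} bounds the number of repetitions
bounded_matches = re.findall(r"co{1,2}t", sample)

# ^ and $ anchor the pattern to the start and end of the string
starts_with_cat = bool(re.search(r"^cat", sample))
ends_with_coot = bool(re.search(r"coot$", sample))

print(dot_matches)       # ['cat', 'cot', 'c9t']
print(set_matches)       # ['cat', 'cot']
print(optional_matches)  # ['cot', 'ct']
print(bounded_matches)   # ['cot', 'coot']
print(starts_with_cat, ends_with_coot)  # True True
```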

7.2 `findall()`: Finding All Matches

The re.findall() function returns a list of all non-overlapping matches of a pattern, scanning the string from left to right.

import re

data = "1234 abc가나다ABC_555_6 13 43435 2213433577869 23ab"

# Finding single digits
print(re.findall("[0-9]", data))  # ['1', '2', '3', '4', '5', '5', '5', ...]

# Finding sequences of digits
print(re.findall("[0-9]+", data))  # ['1234', '555', '6', '13', '43435', ...]

# Finding pairs of digits (longer runs are cut into non-overlapping pairs)
print(re.findall("[0-9]{2}", data))  # ['12', '34', '55', '13', '43', ...]

# Finding runs of 2 to 6 digits (longer runs are cut into chunks of up to 6)
print(re.findall("[0-9]{2,6}", data))  # ['1234', '555', '13', '43435', ...]
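
Because {2} on its own also matches inside longer runs, a pattern for numbers that are exactly two digits long needs to rule out neighboring digits. One possible way, sketched here as an illustration rather than the only approach, is to add negative lookarounds; the variable names are assumptions for this example.

```python
import re

data = "1234 abc가나다ABC_555_6 13 43435 2213433577869 23ab"

# {2} alone takes non-overlapping pairs out of longer digit runs
pairs = re.findall(r"[0-9]{2}", data)

# The lookarounds require that no digit sits directly before or after the
# pair, so only standalone two-digit runs are returned
exact_two = re.findall(r"(?<![0-9])[0-9]{2}(?![0-9])", data)

print(pairs[:4])  # ['12', '34', '55', '13']
print(exact_two)  # ['13', '23']
```

Lookarounds are used instead of \b here because letters and underscores count as word characters, so \b would not see a boundary between "23" and "ab".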

7.3 Matching Specific Patterns

data = "1234 abc가나다ABC_555_6 mbc kbs sbs 58672 newsline kkbc dreamair air airline air"

# Finding every 'a' followed by any two characters (a match may start mid-word)
print(re.findall("a..", data))  # ['abc', 'ama', 'air', 'air', 'air']

# Finding "air" only at the very end of the string
print(re.findall("air$", data))  # ['air']
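
The patterns above match fragments anywhere in the string. To match whole words instead, the word-boundary symbol \b can be combined with \w. The sketch below is one way to do this; the variable names are assumptions for illustration.

```python
import re

data = "1234 abc가나다ABC_555_6 mbc kbs sbs 58672 newsline kkbc dreamair air airline air"

# \b marks a word boundary, so this matches whole words that end in "air"
words_ending_in_air = re.findall(r"\w*air\b", data)

# The same idea for whole words that start with "air"
words_starting_with_air = re.findall(r"\bair\w*", data)

print(words_ending_in_air)      # ['dreamair', 'air', 'air']
print(words_starting_with_air)  # ['air', 'airline', 'air']
```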

7.4 Finding Numeric Patterns

\d is used to represent any digit, and it is one of the most commonly used patterns.

data = "johnson 80, Bong 100, David 50"
print(re.findall(r"\d", data))  # ['8', '0', '1', '0', '0', '5', '0']
# Matches do not overlap, so the last 0 of 100 is left over
print(re.findall(r"\d{2}", data))  # ['80', '10', '50']
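
To keep each number in one piece, \d+ can be used, and capturing groups can pair every number with the name in front of it. This is a small sketch under the assumption that each name is a run of ASCII letters followed by a single space; score_by_name is a hypothetical variable name.

```python
import re

data = "johnson 80, Bong 100, David 50"

# \d+ takes each full run of digits, so 100 stays in one piece
scores = re.findall(r"\d+", data)

# With two capturing groups, findall returns one (name, score) tuple per match
pairs = re.findall(r"([A-Za-z]+) (\d+)", data)

# Convert the captured score strings to integers
score_by_name = {name: int(score) for name, score in pairs}

print(scores)         # ['80', '100', '50']
print(score_by_name)  # {'johnson': 80, 'Bong': 100, 'David': 50}
```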

7.5 `split()`: Splitting Strings

Regular expressions can be used to split strings based on various patterns.

# Splitting on whitespace (the default for str.split)
print("mbc,kbs sbs:ytn".split())  # ['mbc,kbs', 'sbs:ytn']

# Splitting based on commas
print("mbc,kbs sbs:ytn".split(","))  # ['mbc', 'kbs sbs:ytn']

# Splitting using a regex pattern (\W matches any character that is not a letter, digit, or underscore)
print(re.split(r"\W+", "mbc,kbs sbs:ytn"))  # ['mbc', 'kbs', 'sbs', 'ytn']
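
Two details of re.split() are easy to miss: a delimiter at the very start or end of the string yields empty strings, and the maxsplit argument caps the number of splits. The sketch below shows both on a slightly messier input; the input string and variable names are assumptions for illustration.

```python
import re

# Hypothetical input with delimiters at both ends
raw = " mbc,kbs  sbs:ytn, "

# A delimiter at the start or end of the string produces empty strings
parts = re.split(r"\W+", raw)

# Dropping the empty strings leaves only the tokens
tokens = [p for p in parts if p]

# maxsplit limits the number of splits; the remainder stays in one piece
first_split = re.split(r"\W+", "mbc,kbs sbs:ytn", maxsplit=1)

print(parts)        # ['', 'mbc', 'kbs', 'sbs', 'ytn', '']
print(tokens)       # ['mbc', 'kbs', 'sbs', 'ytn']
print(first_split)  # ['mbc', 'kbs sbs:ytn']
```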

7.6 `sub()`: Substituting Strings

The sub() function is used to replace parts of a string that match a pattern with another string.

number = "1234 abc가나다ABC_567_34234"
print(number)  # Original string

# Replacing each run of digits with '888'
m = re.sub("[0-9]+", "888", number)
print(m)  # '888 abc가나다ABC_888_888'
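
Beyond a fixed replacement string, re.sub() also accepts a function as the replacement and a count argument, and re.subn() reports how many replacements were made. The following sketch illustrates these options on the same string; the masking with '#' is an assumption chosen only as an example.

```python
import re

number = "1234 abc가나다ABC_567_34234"

# A function as the replacement receives each match object;
# here every digit run is masked with '#' characters of the same length
masked = re.sub(r"[0-9]+", lambda m: "#" * len(m.group()), number)

# count limits how many matches are replaced (left to right)
first_only = re.sub(r"[0-9]+", "888", number, count=1)

# subn returns the new string together with the number of replacements
replaced, n = re.subn(r"[0-9]+", "888", number)

print(masked)       # #### abc가나다ABC_###_#####
print(first_only)   # 888 abc가나다ABC_567_34234
print(replaced, n)  # 888 abc가나다ABC_888_888 3
```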

7.7 Text Preprocessing Examples

This section walks through common preprocessing steps on a short sample text that contains placeholders, line breaks, and inconsistent punctuation:

7.7.1 Removing Special Characters

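The text contains placeholders like “@@@” and line breaks inside the running text. The placeholder can be removed outright, while each line break is better replaced with a single space; deleting it would glue neighboring sentences together (“doll.This”).

Regex Patterns: @@@ and \s*\n\s*
- The first pattern matches the placeholder; the second matches a line break together with any whitespace around it.

import re

text = "Hello. I'm @@@.\nToday I brought that bear doll.\nThis is a bear doll that I like...\n thank you. Thank you all."
# Remove the placeholder, then turn each line break into one space
clean_text = re.sub(r"@@@", "", text)
clean_text = re.sub(r"\s*\n\s*", " ", clean_text)
print(clean_text)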

7.7.2 Lowercasing Text

Uniformity in text processing often requires converting all text to lowercase.

Method: str.lower() (a built-in string method; no regex is needed)

lower_text = clean_text.lower()

7.7.3 Correcting Simple Typos

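The text contains phrases like “thank you,” whose casing or trailing punctuation may need to be normalized. A regex can match either capitalization followed by a period or comma and rewrite it in one consistent form.

Regex Pattern: [Tt]hank you[.,]

# The trailing punctuation is required: making it optional ([.,]?) would also
# rewrite "thank you all" as "Thank you. all"
fixed_text = re.sub(r"[Tt]hank you[.,]", "Thank you.", lower_text)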

7.7.4 Identifying Sentences or Key Phrases

If you want to extract specific phrases or sentences for further analysis:

Regex Pattern: r"\b(?:brought|present|special)\b"

matches = re.findall(r"\b(?:brought|present|special)\b", fixed_text)
print(matches)  # Extracts words like 'brought', 'present', 'special'

7.7.5 Sentence Tokenization

You can split sentences based on punctuation using regex:

Regex Pattern: r"[.!?]\s+"

sentences = re.split(r"[.!?]\s+", fixed_text)
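
Splitting on the punctuation itself removes it from every sentence except the last. If the punctuation should be kept, one option is a lookbehind, which splits on the whitespace only. The sketch below compares the two; the sample text and variable names are assumptions for illustration.

```python
import re

# Hypothetical sample text
text = "I brought a bear doll. Is it special? Yes! Thank you."

# Splitting on the punctuation consumes it, except at the very end,
# where no whitespace follows the final period
plain = re.split(r"[.!?]\s+", text)

# A lookbehind splits on the whitespace only, so each sentence keeps its mark
kept = re.split(r"(?<=[.!?])\s+", text)

print(plain)  # ['I brought a bear doll', 'Is it special', 'Yes', 'Thank you.']
print(kept)   # ['I brought a bear doll.', 'Is it special?', 'Yes!', 'Thank you.']
```

Neither pattern handles abbreviations such as "e.g. this" correctly; for real text a dedicated sentence tokenizer is usually a safer choice.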

7.8 Apply Regular Expression in DataFrame

7.8.1 Step 1: Create a DataFrame

First, we create a DataFrame from a few rows of sample text.

import pandas as pd

# Sample data
data = {
    'NO': [1, 2, 3],
    'Text_Typo': [
        "Hello. I'm @@@.\nToday I brought that bear doll.\nThis is a bear doll that I like.\nThis bear doll is special to me because my aunt gave it to me.\nI remember that it was my birthday present from my aunt.\nI love my aunt.\nThank you for listening to my presentation.",
        "Hello, I'm @@@. Today I brought a baseball bat. This is a bat that makes me funny. This is special to me because I wanted to be a baseball player when I was a child. And I was a baseball player in baseball club. I remember that I played baseball with my friend. I was very happy at that time. Thank you.",
        "Hello, I'm @@@. Let me show you this music box. It's special to me because it's my first music box. I was interested in the music."
    ]
}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

7.8.2 Step 2: Apply Regular Expressions

Let’s apply some regular expressions to clean up the text.

Remove the placeholder “@@@” and extra line breaks \n.
Lowercase the text for uniformity.
Replace certain common typos like missing or incorrect punctuation around “thank you”.
Extract key sentences using regular expressions.

7.8.2.1 Applying Regular Expression to Each Row

import re

# Function to clean text
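def clean_text(text):
    # Remove the placeholder
    text = re.sub(r"@@@", "", text)

    # Replace each line break (and surrounding whitespace) with one space,
    # so that sentences on separate lines do not run together
    text = re.sub(r"\s*\n\s*", " ", text)

    # Normalize "thank you" followed by a period or comma; the punctuation is
    # required so that "Thank you for listening" is left untouched
    text = re.sub(r"[Tt]hank you[.,]", "Thank you.", text)

    # Convert to lowercase
    text = text.lower()

    return text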

# Apply the function to each row in the 'Text_Typo' column
df['Cleaned_Text'] = df['Text_Typo'].apply(clean_text)

# Display the updated DataFrame
df[['NO', 'Text_Typo','Cleaned_Text']]
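
When the same patterns are applied to many rows, they can be compiled once with re.compile() and reused. The sketch below is a standalone variant of the cleaning step written that way. The names PLACEHOLDER, LINE_BREAK, and clean_text_compiled are hypothetical, and replacing line breaks with a space is an assumption of this sketch. The re module also caches recently used patterns, so the benefit is often more about readability than speed.

```python
import re

# Compile once, reuse for every row (hypothetical module-level names)
PLACEHOLDER = re.compile(r"@@@")
LINE_BREAK = re.compile(r"\s*\n\s*")

def clean_text_compiled(text):
    """Remove the placeholder, flatten line breaks to spaces, and lowercase."""
    text = PLACEHOLDER.sub("", text)
    text = LINE_BREAK.sub(" ", text)
    return text.lower()

# Two short rows in the style of the sample data
rows = [
    "Hello. I'm @@@.\nToday I brought that bear doll.",
    "Hello, I'm @@@. Let me show you this music box.",
]
cleaned = [clean_text_compiled(row) for row in rows]

for line in cleaned:
    print(line)
```

With a DataFrame, the same function could be passed to apply() in place of the version above.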

7.8.3 Step 3: Extracting Specific Patterns

If you want to extract certain patterns like sentences containing the word “special” or phrases where the speaker brought an item, you can do this with regex as well.

7.8.3.1 Extract Sentences with the Word “Special”

# Function to extract sentences with the word "special"
def extract_special_sentences(text):
    # Split sentences based on punctuation
    sentences = re.split(r"[.!?]\s+", text)
    
    # Find sentences containing the word "special"
    special_sentences = [sent for sent in sentences if re.search(r"\bspecial\b", sent)]

    # Equivalent explicit loop, for readers new to list comprehensions:
    # special_sentences = []
    # for sent in sentences:
    #     if re.search(r"\bspecial\b", sent):
    #         special_sentences.append(sent)    
    
    return special_sentences

# Apply the function to extract sentences with "special"
df['Special_Sentences'] = df['Cleaned_Text'].apply(extract_special_sentences)

# Display the DataFrame with extracted sentences
df[['NO', 'Text_Typo','Cleaned_Text', 'Special_Sentences']]

This DataFrame now contains cleaned text and a column with sentences that mention the word “special.”
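
For simple checks, pandas also offers regex-aware string methods on a column through the .str accessor, which avoids writing a separate function. The sketch below uses a small made-up frame (df_demo and its column names are assumptions for illustration) to flag, count, and extract matches.

```python
import pandas as pd

# Small hypothetical frame so the sketch runs on its own
df_demo = pd.DataFrame({
    "text": [
        "this bear doll is special to me.",
        "i brought a baseball bat.",
        "it's special to me because it's my first music box.",
    ]
})

# Boolean mask: does the row contain the whole word "special"?
df_demo["has_special"] = df_demo["text"].str.contains(r"\bspecial\b", regex=True)

# Number of times the pattern occurs in each row
df_demo["n_special"] = df_demo["text"].str.count(r"\bspecial\b")

# All key words found in each row, as a list per row
df_demo["key_words"] = df_demo["text"].str.findall(r"\b(?:brought|present|special)\b")

print(df_demo[["has_special", "n_special", "key_words"]])
```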

7.9 Regular Expression Practice Website

RegExr (regexr.com) is an interactive website for practicing and learning regular expressions. It offers tools to create and test regex patterns, along with explanations, a pattern library, and a reference guide. The site suits both beginners and advanced users, with real-time feedback and community-contributed examples.