7 Regular Expressions (RegEx) in Python
Regular expressions (often abbreviated as regex or RegEx) are patterns used to match sequences of characters within strings. They provide a flexible and powerful way to search, edit, or manipulate text based on defined patterns.
For example, you can use regular expressions to:
- Search for specific words or sequences in a text
- Validate input formats (e.g., email addresses, phone numbers)
- Extract information from structured or semi-structured data
- Replace or modify parts of a string based on certain criteria
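As a small illustration of the validation use case, the sketch below checks whether a string has the shape of a phone number. The format (three digits, three or four digits, four digits, separated by hyphens) and the helper name is_valid_phone are assumptions made for this example, not a general-purpose validator.

```python
import re

# Hypothetical, simplified format: 3 digits, 3 or 4 digits, 4 digits,
# separated by hyphens (for example 010-1234-5678).
PHONE_PATTERN = re.compile(r"\d{3}-\d{3,4}-\d{4}")

def is_valid_phone(text):
    """Return True only if the whole string matches the simplified format."""
    return PHONE_PATTERN.fullmatch(text) is not None

print(is_valid_phone("010-1234-5678"))       # True
print(is_valid_phone("010-1234-567"))        # False: last group is too short
print(is_valid_phone("call 010-1234-5678"))  # False: extra text before the number
```

re.fullmatch() is used so that the entire input must fit the pattern; re.search() would also accept strings that merely contain a phone number somewhere.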
7.1 Basics of Regular Expressions
Here are some basic symbols in regular expressions:
- . : Matches any single character except a newline.
- ^ : Asserts the start of the string.
- $ : Asserts the end of the string.
- [] : Defines a set of characters. For example, [abc] matches either “a”, “b”, or “c”.
- \d : Matches any digit (equivalent to [0-9] for ASCII text).
- \w : Matches any word character (letters, digits, and underscore; equivalent to [a-zA-Z0-9_] for ASCII text). In Python 3 string patterns, \w also matches non-ASCII letters such as 가.
- \s : Matches any whitespace character (spaces, tabs, newlines, etc.).
- * : Matches 0 or more occurrences of the preceding element.
- + : Matches 1 or more occurrences of the preceding element.
- ? : Matches 0 or 1 occurrence of the preceding element.
- {n,m} : Matches between n and m occurrences of the preceding element.
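The symbols above can be tried out one at a time. The following is a minimal sketch; the sample strings are made up for illustration.

```python
import re

sample = "cat hat 42 a_b"

print(re.findall(r"[ch]at", sample))   # ['cat', 'hat']  (character set)
print(re.findall(r"\d+", sample))      # ['42']          (one or more digits)
print(re.findall(r"\w+", sample))      # ['cat', 'hat', '42', 'a_b']
print(re.findall(r"^cat", sample))     # ['cat']         (only at the start)
print(re.findall(r"b$", sample))       # ['b']           (only at the end)
print(re.findall(r"c.t", "cat cot c\nt"))          # ['cat', 'cot']  (. skips the newline)
print(re.findall(r"ha?t", "ht hat haat"))          # ['ht', 'hat']   (0 or 1 'a')
print(re.findall(r"a{2,3}", "a aa aaa aaaa"))      # ['aa', 'aaa', 'aaa']
```

Raw strings (the r"..." prefix) keep Python from interpreting backslashes before the regex engine sees them, which is why they are used for every pattern here.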
7.2 findall(): Finding All Matches
The findall() function in Python's re module returns a list of all non-overlapping matches for a given pattern.
import re
data = "1234 abc가나다ABC_555_6 13 43435 2213433577869 23ab"
# Finding single digits
print(re.findall("[0-9]", data)) # ['1', '2', '3', '4', '5', '5', '5', ...]
# Finding sequences of digits
print(re.findall("[0-9]+", data)) # ['1234', '555', '6', '13', '43435', ...]
# Finding runs of exactly two digits (longer numbers are cut into pairs)
print(re.findall("[0-9]{2}", data)) # ['12', '34', '55', '13', '43', ...]
# Finding numbers with 2 to 6 digits
print(re.findall("[0-9]{2,6}", data)) # ['1234', '555', '13', '43435', ...]
7.3 Matching Specific Patterns
data = "1234 abc가나다ABC_555_6 mbc kbs sbs 58672 newsline kkbc dreamair air airline air"
# Finding 'a' followed by any two characters (a match may start inside a word, as in "dreamair")
print(re.findall("a..", data)) # ['abc', 'ama', 'air', 'air', 'air']
# Matching "air" only at the very end of the string
print(re.findall("air$", data)) # ['air']
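Because a pattern such as a.. can match in the middle of a word, it is often combined with the word boundary \b. The sketch below reuses the data string from this section; \b itself is not introduced until later in the chapter, so treat this as a preview rather than part of the original example.

```python
import re

data = "1234 abc가나다ABC_555_6 mbc kbs sbs 58672 newsline kkbc dreamair air airline air"

# 'a' plus two characters, but only where 'a' starts a word
starts = re.findall(r"\ba..", data)
print(starts)  # ['abc', 'air', 'air', 'air']

# The whole word "air" only (not "airline", not the tail of "dreamair")
whole = re.findall(r"\bair\b", data)
print(whole)   # ['air', 'air']
```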
7.4 Finding Numeric Patterns
\d represents any digit and is one of the most commonly used patterns.
data = "johnson 80, Bong 100, David 50"
print(re.findall(r"\d", data)) # ['8', '0', '1', '0', '0', '5', '0']
# Matches do not overlap, so "100" yields only '10'; the last '0' has no partner
print(re.findall(r"\d{2}", data)) # ['80', '10', '50']
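To pull out whole numbers rather than fixed-width pieces, \d+ is usually the better fit. The sketch below reuses the same data string; the capturing groups in the second pattern are an addition for illustration and assume each name is a single run of ASCII letters followed by one space.

```python
import re

data = "johnson 80, Bong 100, David 50"

# One or more digits: each score comes back whole
scores = [int(s) for s in re.findall(r"\d+", data)]
print(scores)                     # [80, 100, 50]
print(sum(scores) / len(scores))  # average score

# With two capturing groups, findall() returns (name, score) tuples
pairs = re.findall(r"([A-Za-z]+) (\d+)", data)
print(pairs)  # [('johnson', '80'), ('Bong', '100'), ('David', '50')]
```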
7.5 split(): Splitting Strings
Regular expressions can be used to split strings based on various patterns.
# Splitting based on space (default)
print("mbc,kbs sbs:ytn".split()) # ['mbc,kbs', 'sbs:ytn']
# Splitting based on commas
print("mbc,kbs sbs:ytn".split(",")) # ['mbc', 'kbs sbs:ytn']
# Splitting using a regex pattern (\W matches any non-word character, i.e. anything other than letters, digits, and underscore)
print(re.split(r"\W+", "mbc,kbs sbs:ytn")) # ['mbc', 'kbs', 'sbs', 'ytn']
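A few variations on re.split() are sketched below: an explicit set of delimiters, the maxsplit argument, and the empty string that appears when the input ends with a delimiter. The second input string is made up for illustration.

```python
import re

text = "mbc,kbs sbs:ytn"

# Split on an explicit set of delimiters: comma, colon, or space
by_set = re.split(r"[,: ]", text)
print(by_set)       # ['mbc', 'kbs', 'sbs', 'ytn']

# Stop after the first split
first_only = re.split(r"\W+", text, maxsplit=1)
print(first_only)   # ['mbc', 'kbs sbs:ytn']

# A trailing delimiter leaves an empty string at the end of the result
trailing = re.split(r"\W+", "mbc, kbs!")
print(trailing)     # ['mbc', 'kbs', '']

# Drop empty pieces if they are not wanted
non_empty = [part for part in trailing if part]
print(non_empty)    # ['mbc', 'kbs']
```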
7.6 sub(): Substituting Strings
The sub() function replaces every part of a string that matches a pattern with another string.
number = "1234 abc가나다ABC_567_34234"
print(number) # Original string
# Replacing each run of digits with '888'
m = re.sub("[0-9]+", "888", number)
print(m) # 888 abc가나다ABC_888_888
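re.sub() also accepts a count limit and a function as the replacement, and re.subn() reports how many replacements were made. The sketch below reuses the number string from this section; the choice to replace each run of digits with its length is purely illustrative.

```python
import re

number = "1234 abc가나다ABC_567_34234"

# Replace only the first run of digits
first = re.sub(r"[0-9]+", "888", number, count=1)
print(first)    # 888 abc가나다ABC_567_34234

# subn() returns the new string together with the number of replacements
replaced, n = re.subn(r"[0-9]+", "888", number)
print(replaced, n)  # 888 abc가나다ABC_888_888 3

# A function can compute each replacement from the match object
lengths = re.sub(r"[0-9]+", lambda m: str(len(m.group())), number)
print(lengths)  # 4 abc가나다ABC_3_5
```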
7.7 Text Preprocessing Examples
The following examples apply regular expressions to preprocess sample text taken from an Excel sheet. Each subsection covers one step:
7.7.1 Removing Special Characters
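The text contains placeholders like “@@@” and line breaks, which can be removed or standardized. For instance:
- Regex Patterns: @@@ (removed) and \s*\n\s* (replaced with a single space)
- Replacing each line break with a space, rather than deleting it, keeps adjacent sentences from running together.
import re
text = "Hello. I'm @@@.\nToday I brought that bear doll.\nThis is a bear doll that I like...\n thank you. Thank you all."
clean_text = re.sub(r"@@@", "", text) # remove the placeholder
clean_text = re.sub(r"\s*\n\s*", " ", clean_text) # turn each line break into one space
print(clean_text)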
7.7.2 Lowercasing Text
Uniform text processing often requires converting all text to lowercase. This step needs no regular expression; the built-in string method is enough.
- Method: .lower()
lower_text = clean_text.lower()
7.7.3 Correcting Simple Typos
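The text contains phrases like “thank you.” whose casing or punctuation may need correcting. A regex can match both casings together with the trailing punctuation and replace them with one normalized form.
- Regex Pattern: [Tt]hank you[.,]
- The punctuation is required here: with an optional [.,]? the pattern would also match the “thank you” in “thank you all” and insert a period in the middle of the sentence.
fixed_text = re.sub(r"[Tt]hank you[.,]", "Thank you.", lower_text)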
7.7.4 Identifying Sentences or Key Phrases
If you want to extract specific words or phrases for further analysis:
- Regex Pattern: \b(?:brought|present|special)\b
matches = re.findall(r"\b(?:brought|present|special)\b", fixed_text)
print(matches) # ['brought'] for the sample text; 'present' and 'special' do not occur in it
7.7.5 Sentence Tokenization
You can split text into sentences at sentence-ending punctuation using regex:
- Regex Pattern: [.!?]\s+
sentences = re.split(r"[.!?]\s+", fixed_text)
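Splitting on [.!?]\s+ consumes the punctuation mark along with the whitespace. If the marks should stay attached to their sentences, a lookbehind is one option. This is a minimal sketch; the sample sentence is made up for illustration.

```python
import re

text = "Today I brought that bear doll. Is it special? Yes! Thank you all."

# The punctuation is part of the separator, so it is dropped
plain = re.split(r"[.!?]\s+", text)
print(plain)
# ['Today I brought that bear doll', 'Is it special', 'Yes', 'Thank you all.']

# (?<=[.!?]) only checks that punctuation precedes the whitespace,
# so the split removes the whitespace and keeps the punctuation
kept = re.split(r"(?<=[.!?])\s+", text)
print(kept)
# ['Today I brought that bear doll.', 'Is it special?', 'Yes!', 'Thank you all.']
```

Neither version handles abbreviations such as "Dr." or decimal numbers; for that kind of text a dedicated sentence tokenizer is usually a better choice.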
7.8 Apply Regular Expression in DataFrame
7.8.1 Step 1: Create a DataFrame
First, we create a DataFrame from the sample data.
import pandas as pd
# Sample data
data = {
    'NO': [1, 2, 3],
'Text_Typo': [
"Hello. I'm @@@.\nToday I brought that bear doll.\nThis is a bear doll that I like.\nThis bear doll is special to me because my aunt gave it to me.\nI remember that it was my birthday present from my aunt.\nI love my aunt.\nThank you for listening to my presentation.",
"Hello, I'm @@@. Today I brought a baseball bat. This is a bat that makes me funny. This is special to me because I wanted to be a baseball player when I was a child. And I was a baseball player in baseball club. I remember that I played baseball with my friend. I was very happy at that time. Thank you.",
"Hello, I'm @@@. Let me show you this music box. It's special to me because it's my first music box. I was interested in the music."
]
}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
7.8.2 Step 2: Apply Regular Expressions
Let’s apply some regular expressions to clean up the text.
- Remove the placeholder “@@@” and extra line breaks (\n).
- Lowercase the text for uniformity.
- Replace certain common typos, such as missing or incorrect punctuation around “thank you”.
- Extract key sentences using regular expressions.
7.8.2.1 Applying Regular Expression to Each Row
import re
# Function to clean text
def clean_text(text):
    # Remove the placeholder and turn each line break into a single space
    # (deleting line breaks outright would glue adjacent sentences together)
    text = re.sub(r"@@@", "", text)
    text = re.sub(r"\s*\n\s*", " ", text)
    # Normalize the punctuation after "thank you"; the punctuation is required
    # so that "Thank you for listening" is left untouched
    text = re.sub(r"[Tt]hank you[.,]", "Thank you.", text)
    # Convert to lowercase
    text = text.lower()
    return text
# Apply the function to each row in the 'Text_Typo' column
df['Cleaned_Text'] = df['Text_Typo'].apply(clean_text)
# Display the updated DataFrame
df[['NO', 'Text_Typo', 'Cleaned_Text']]
7.8.3 Step 3: Extracting Specific Patterns
If you want to extract certain patterns like sentences containing the word “special” or phrases where the speaker brought an item, you can do this with regex as well.
7.8.3.1 Extract Sentences with the Word “Special”
# Function to extract sentences with the word "special"
def extract_special_sentences(text):
    # Split sentences based on punctuation
    sentences = re.split(r"[.!?]\s+", text)
    # Find sentences containing the word "special"
    special_sentences = [sent for sent in sentences if re.search(r"\bspecial\b", sent)]
    # Equivalent explicit loop:
    # special_sentences = []
    # for sent in sentences:
    #     if re.search(r"\bspecial\b", sent):
    #         special_sentences.append(sent)
    return special_sentences
# Apply the function to extract sentences with "special"
df['Special_Sentences'] = df['Cleaned_Text'].apply(extract_special_sentences)
# Display the DataFrame with extracted sentences
df[['NO', 'Text_Typo', 'Cleaned_Text', 'Special_Sentences']]
This DataFrame now contains cleaned text and a column with sentences that mention the word “special.”
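For simple checks, pandas string methods accept regex patterns directly, so a custom function with apply() is not always needed. The sketch below builds its own small DataFrame with shortened, made-up cleaned text; the column names Has_Special and Sentence_Count are assumptions for this example.

```python
import pandas as pd

# Shortened stand-in for the cleaned text (illustrative only)
df = pd.DataFrame({
    "NO": [1, 2, 3],
    "Cleaned_Text": [
        "this bear doll is special to me. i love my aunt.",
        "i brought a baseball bat. thank you.",
        "it's special to me because it's my first music box.",
    ],
})

# Flag rows whose text contains the whole word "special"
df["Has_Special"] = df["Cleaned_Text"].str.contains(r"\bspecial\b", regex=True)

# Count sentence-ending punctuation followed by whitespace or the end of the text
df["Sentence_Count"] = df["Cleaned_Text"].str.count(r"[.!?](?:\s+|$)")

print(df[["NO", "Has_Special", "Sentence_Count"]])
```

The apply() approach shown earlier remains the more flexible option when a step needs several operations or returns a list per row.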
7.9 Regular Expression Practice Website
Regexr.com is an interactive website for practicing and learning regular expressions (regex). It offers tools to create and test regex patterns, along with explanations, a pattern library, and a reference guide. The site is suitable for both beginners and advanced users, providing real-time feedback and community-contributed examples to enhance learning.