13  Word Embedding Activities

The activities below are intended as hands-on practice with word embeddings; the example code is illustrative and may differ from production-ready implementations.

Word embeddings can be used in several areas of applied linguistics to analyze and model language. Here are some key applications:

13.0.1 Lexical Semantics and Word Meaning

  • Synonym Detection: Word embeddings can identify semantically similar words by examining how closely words are located in the embedding space. This can be used for synonym or related word detection in language teaching, lexicography, or creating language learning tools.
  • Word Sense Disambiguation: Embeddings help differentiate word senses based on context, improving understanding in natural language processing (NLP) tasks such as machine translation or language generation.

13.0.2 Corpus Linguistics

  • Collocation Analysis: Word embeddings can identify common word pairings and their semantic relationships in a corpus, which helps in studying how words co-occur and form patterns in natural language usage.
  • Semantic Similarity in Corpora: Embeddings can be used to quantify semantic similarity between words, phrases, or even larger units of text, making it easier to compare linguistic variation across different corpora, such as learner corpora vs. native speaker corpora.

13.0.3 Discourse and Pragmatics

  • Topic Modeling: Embedding-based topic models can uncover underlying themes in large text data, which is useful for analyzing spoken or written discourse, identifying conversational topics, or studying genre-specific language use.
  • Contextual Meaning and Coherence: Embeddings help to assess how cohesive or coherent a text is by analyzing word usage in different contexts, which can aid in both automatic essay scoring and discourse analysis.

13.0.4 Sociolinguistics

  • Variation and Change: Word embeddings allow researchers to track how word meanings and usage change over time in specific linguistic communities. This can be useful for analyzing language variation based on geography, social class, or time.
  • Dialectology: By comparing word embeddings across dialects or sociolects, researchers can quantify linguistic similarities and differences.

13.0.5 Sentiment and Politeness Analysis

  • Politeness or Formality Levels: Word embeddings can model nuanced variations in politeness or formality in different contexts, which is crucial for understanding pragmatic language use across different social interactions or cultures.
  • Sentiment Analysis: In analyzing affective language, embeddings help in categorizing words and phrases by their emotional tone, which is beneficial in language learning contexts that focus on pragmatic and affective communication.

13.0.6 Translation and Bilingual Word Embeddings

  • Cross-Linguistic Analysis: Bilingual embeddings can map words from different languages into a shared semantic space, facilitating tasks like machine translation or the study of cross-linguistic semantic variation.
  • Error Detection in Translation: Embedding-based models can identify semantic discrepancies or mistranslations by comparing the embeddings of words and phrases in both source and target languages.

These applications of word embeddings allow researchers and educators to enhance language learning tools, refine linguistic theories, and develop NLP technologies that better capture the complexity of human language.

13.1 Lexical Semantics and Word Meaning

To conduct synonym detection and word sense disambiguation using word embeddings in Python, we can use popular libraries like gensim, spaCy, or transformers that provide pre-trained word embeddings. Below are step-by-step examples for both synonym detection and word sense disambiguation.

13.1.1 Synonym Detection Using Word Embeddings

We can detect synonyms by checking the similarity between word vectors in the embedding space. Here’s an example using the gensim library with the pre-trained Word2Vec model.

13.1.1.1 Steps:

  1. Install necessary libraries:

    pip install gensim spacy
    python -m spacy download en_core_web_sm  # Download the English model for spaCy
  2. Load pre-trained word embeddings and find similar words:

    import gensim.downloader as api
    
    # Load a pre-trained Word2Vec model from Gensim
    model = api.load("word2vec-google-news-300")  # A popular pre-trained word2vec model
    
    # Example word for synonym detection
    word = "happy"
    
    # Get top 5 most similar words to the target word
    similar_words = model.most_similar(word, topn=5)
    
    print(f"Top 5 synonyms for '{word}':")
    for similar_word, similarity_score in similar_words:
        print(f"{similar_word} ({similarity_score})")

This will output the top 5 words that are most similar to “happy” based on their proximity in the embedding space.

Sample Output:

Top 5 synonyms for 'happy':
joyful (0.714)
cheerful (0.701)
content (0.689)
delighted (0.678)
elated (0.665)

To filter out words that share the same part-of-speech (POS) as the target word when performing synonym detection, we need to combine the word embedding approach with POS tagging. This ensures that the similar words returned are not only semantically related but also belong to the same grammatical category (e.g., noun, verb, adjective).

We can achieve this by using a POS tagger from a library like spaCy, which lets us tag words and keep only the candidates whose POS matches that of the target word. The listing below shows spaCy's fine-grained tag set (token.tag_); the filtering function later in this section uses the coarser universal POS tags (token.pos_), such as ADJ, NOUN, and VERB.

Python Code to Show All POS Tags in spaCy:

import spacy

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# List all available POS tags in spaCy with their explanations
pos_tags = nlp.get_pipe("tagger").labels

print("All available POS tags in spaCy:")
for pos in pos_tags:
    print(f"{pos}: {spacy.explain(pos)}")

Output:

All available POS tags in spaCy:
$: symbol, currency
'': closing quotation mark
,: punctuation mark, comma
-LRB-: left round bracket
-RRB-: right round bracket
.: punctuation mark, sentence closer
:: punctuation mark, colon or ellipsis
ADD: email
AFX: affix
CC: conjunction, coordinating
CD: cardinal number
DT: determiner
EX: existential there
FW: foreign word
HYPH: punctuation mark, hyphen
IN: conjunction, subordinating or preposition
JJ: adjective (English), other noun-modifier (Chinese)
JJR: adjective, comparative
JJS: adjective, superlative
LS: list item marker
MD: verb, modal auxiliary
NFP: superfluous punctuation
NN: noun, singular or mass
NNP: noun, proper singular
NNPS: noun, proper plural
NNS: noun, plural
PDT: predeterminer
POS: possessive ending
PRP: pronoun, personal
PRP$: pronoun, possessive
RB: adverb
RBR: adverb, comparative
RBS: adverb, superlative
RP: adverb, particle
SYM: symbol
TO: infinitival "to"
UH: interjection
VB: verb, base form
VBD: verb, past tense
VBG: verb, gerund or present participle
VBN: verb, past participle
VBP: verb, non-3rd person singular present
VBZ: verb, 3rd person singular present
WDT: wh-determiner
WP: wh-pronoun, personal
WP$: wh-pronoun, possessive
WRB: wh-adverb
XX: unknown
_SP: whitespace
``: opening quotation mark

Here’s the revised version of the code, where the “word” and “pos” assignments are handled separately.

13.1.1.2 Revised Python Code:

import gensim.downloader as api
import spacy

# Load pre-trained Word2Vec model from Gensim
model = api.load("word2vec-google-news-300")

# Load spaCy POS tagger
nlp = spacy.load("en_core_web_sm")
# Define a function to get the POS tag of a word
def get_pos(word):
    doc = nlp(word)
    return doc[0].pos_  # Returns the POS tag of the word

# Function to find synonyms with the same POS
def find_synonyms_with_same_pos(word, topn=10):
    try:
        # Get the POS of the target word
        word_pos = get_pos(word)

        # Get the most similar words from the model
        similar_words = model.most_similar(word, topn=topn)

        # Filter similar words by POS tag
        filtered_words = [
            (w, sim) for w, sim in similar_words if get_pos(w) == word_pos
        ]

        return filtered_words
    except KeyError:
        print(f"Word '{word}' not found in the model vocabulary.")
        return []

# Separate the target-word assignment from the POS tagging

word = "happy"  # Define the target word
pos = get_pos(word)  # Get the POS tag for the target word

# Find synonyms with the same POS
synonyms_with_same_pos = find_synonyms_with_same_pos(word, topn=10)

# Output the result
print(f"Synonyms for '{word}' with the same POS ({pos}):")
for synonym, similarity in synonyms_with_same_pos:
    print(f"{synonym} ({similarity})")

Example Output:

For word = "happy", the output will be something like:

Synonyms for 'happy' with the same POS (ADJ):
joyful (0.714)
cheerful (0.701)
content (0.689)
delighted (0.678)
ecstatic (0.662)

13.1.2 Word Sense Disambiguation Using Contextual Embeddings

Word sense disambiguation (WSD) can be done by using contextual word embeddings, where the meaning of a word is determined by its context. Here’s an example using the transformers library (BERT embeddings) from Hugging Face.

13.1.2.1 Steps:

  1. Install necessary libraries:

    pip install transformers torch
  2. Use BERT to generate contextual embeddings:

    from transformers import BertTokenizer, BertModel
    # BertModel: This is the actual pre-trained BERT model used to generate embeddings.
    
    import torch
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Load pre-trained BERT model and tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # BertTokenizer.from_pretrained('bert-base-uncased'): Loads a pre-trained tokenizer for BERT. The bert-base-uncased model is a smaller, lower-cased version of BERT (where all text is converted to lowercase).
    model = BertModel.from_pretrained('bert-base-uncased')
    # BertModel.from_pretrained('bert-base-uncased'): Loads the pre-trained BERT model. This model outputs hidden states (embeddings) that can be used for various NLP tasks.
    
    
    # Sentences with the ambiguous word "bank"
    sentence_1 = "He went to the bank to deposit money."
    sentence_2 = "The river bank was full of fish."
    
    # Tokenize and get embeddings for both sentences
    inputs_1 = tokenizer(sentence_1, return_tensors="pt")
    inputs_2 = tokenizer(sentence_2, return_tensors="pt")
    
    # tokenizer(sentence_1, return_tensors="pt"): This tokenizes the sentences into BERT's format and returns a PyTorch tensor (pt stands for PyTorch). BERT needs input to be tokenized into a numerical form (token IDs) that it can process.
    # It converts each word into subwords (tokens) and creates corresponding token IDs.
    # The result is a tensor, which is an array containing the numerical representation of each token.
    
    
    with torch.no_grad():
        outputs_1 = model(**inputs_1)
        outputs_2 = model(**inputs_2)
    
    # torch.no_grad(): This disables gradient calculations (used for training models). Here, it saves memory and speeds up computations since we only need forward passes through the model to get the embeddings.
    # outputs_1 = model(**inputs_1): This runs the tokenized input through the BERT model. The model outputs hidden states or embeddings for each token in the sentence.
    # The hidden state captures the meaning of each word in the context of the entire sentence.
    
    embedding_1 = outputs_1.last_hidden_state[0, 4, :]  # Guessed position of "bank" in sentence 1
    embedding_2 = outputs_2.last_hidden_state[0, 2, :]  # Guessed position of "bank" in sentence 2
    
    # outputs_1.last_hidden_state: BERT returns hidden states for every token in the sentence,
    # with shape (batch_size, sequence_length, hidden_size), where:
    ## batch_size: the number of sentences (here, 1).
    ## sequence_length: the number of tokens in the sentence.
    ## hidden_size: the size of each hidden-state vector (768 dimensions for bert-base).
    # [0, 4, :] and [0, 2, :]: we index the token we assume corresponds to "bank"; the ":" selects all 768 dimensions.
    # Note: these positions are hard-coded guesses. BERT prepends a [CLS] token and may split words into
    # subwords, so the true position of "bank" can differ. The revised code below finds the index explicitly.
    
    # Compute cosine similarity between embeddings
    similarity = cosine_similarity(embedding_1.unsqueeze(0), embedding_2.unsqueeze(0))
    
    # cosine_similarity(embedding_1.unsqueeze(0), embedding_2.unsqueeze(0)): This computes the cosine similarity between the two embeddings. Cosine similarity is a measure of similarity between two vectors based on their orientation (not magnitude). It ranges from -1 (completely opposite) to 1 (exactly the same), with 0 indicating no similarity.
    # unsqueeze(0): This adds an extra dimension to the embedding to make it a 2D tensor, as cosine_similarity expects the input to be 2D.
    
    
    print(f"Similarity between 'bank' in two different contexts: {similarity[0][0]}")

In this example:

  • We take the word “bank” in two different contexts: one financial (bank to deposit money) and one geographical (river bank).

  • BERT creates embeddings for the word based on its surrounding context.

  • Cosine similarity is computed between these embeddings to determine how similar the meanings of “bank” are in both contexts.

Sample Output:

Similarity between 'bank' in two different contexts: 0.37

This low similarity score suggests that the word “bank” has different meanings in these two contexts (financial institution vs. riverside).


To revise the above code and ensure that the correct index of the word “bank” is used in both sentences, we need to account for the way BERT tokenizes the input. BERT uses subword tokenization, meaning that words can sometimes be split into multiple tokens. To ensure we find the correct index of “bank”, we need to first tokenize the sentences, then search for the token ID that corresponds to “bank” within the tokenized input.

Here’s how to revise the code:

13.1.3 Revised Python Code:

from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sentences with the ambiguous word "bank"
sentence_1 = "He went to the bank to deposit money."
sentence_2 = "The river bank was full of fish."

# Tokenize and get embeddings for both sentences
inputs_1 = tokenizer(sentence_1, return_tensors="pt")
inputs_2 = tokenizer(sentence_2, return_tensors="pt")

# Tokenized input with subword tokens
tokens_1 = tokenizer.tokenize(sentence_1)
tokens_2 = tokenizer.tokenize(sentence_2)

# Find the index of the token "bank" in both tokenized sentences
index_1 = tokens_1.index("bank")
index_2 = tokens_2.index("bank")
print(index_1)
print(index_2)

with torch.no_grad():
    outputs_1 = model(**inputs_1)
    outputs_2 = model(**inputs_2)

# Extract the embeddings for the word "bank" using the correct index
embedding_1 = outputs_1.last_hidden_state[0, index_1 + 1, :]  # +1 due to [CLS] token at index 0
embedding_2 = outputs_2.last_hidden_state[0, index_2 + 1, :]  # +1 due to [CLS] token at index 0

# Compute cosine similarity between embeddings
similarity = cosine_similarity(embedding_1.unsqueeze(0), embedding_2.unsqueeze(0))

print(f"Similarity between 'bank' in two different contexts: {similarity[0][0]}")

Key Changes:

  1. Tokenization:
    • tokenizer.tokenize(sentence_1): This tokenizes each sentence into subword tokens.
    • tokens_1.index("bank"): Finds the correct index of the word “bank” in the tokenized input.
  2. Correct Index Adjustment:
    • In BERT’s input format, the sequence starts with a [CLS] token at index 0, so the actual position of “bank” in the hidden states is its token index + 1. This is why we add 1 to the token index (a quick check of the full token sequences is shown after this list).
  3. Embedding Extraction:
    • The embedding for the word “bank” is extracted based on the calculated index in each sentence.
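
A quick way to verify the adjustment described in point 2 is to print the full token sequences, including the special tokens. A minimal sketch reusing the tokenizer and inputs from the code above:

# Show the full token sequences, including [CLS] and [SEP]
print(tokenizer.convert_ids_to_tokens(inputs_1["input_ids"][0].tolist()))
# e.g. ['[CLS]', 'he', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.', '[SEP]']
print(tokenizer.convert_ids_to_tokens(inputs_2["input_ids"][0].tolist()))
# e.g. ['[CLS]', 'the', 'river', 'bank', 'was', 'full', 'of', 'fish', '.', '[SEP]']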

Example Output:

This code should give you the similarity score for the word “bank” in the two different contexts. If the meanings are different (as expected here), the similarity score will be low.

For example:

4
2
Similarity between 'bank' in two different contexts: 0.43

This low similarity score indicates that the word “bank” has different meanings in these contexts (financial institution vs. riverside).

13.2 Corpus Linguistics with Word Embeddings

In corpus linguistics, word embeddings can be applied to two key tasks: collocation analysis and semantic similarity. Below are Python implementations for both activities, along with detailed explanations.

13.2.1 Collocation Analysis Using Word Embeddings

Collocations are word pairings that frequently occur together in a language and exhibit specific patterns. Word embeddings can help identify semantically related word pairs based on their proximity in vector space.

13.2.1.1 Steps:

  1. Load pre-trained word embeddings (such as Word2Vec).
  2. Extract word pairs (collocations) based on their co-occurrence and proximity in embedding space.
  3. Sort the word pairs by their similarity score to identify common collocations.

13.2.1.2 Python Code for Collocation Analysis:

import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained Word2Vec model (Google News Vectors)
model = api.load("word2vec-google-news-300")

# List of word pairs you want to analyze for collocation
word_pairs = [
    ('quick', 'fox'),
    ('lazy', 'dog'),
    ('king', 'queen'),
    ('strong', 'weak'),
    ('bank', 'money')
]

# Function to calculate cosine similarity between two words
def get_similarity(word1, word2):
    try:
        vec1 = model[word1]
        vec2 = model[word2]
        similarity = cosine_similarity([vec1], [vec2])[0][0]
        return similarity
    except KeyError:
        return None  # If word not in vocabulary

# Find collocations by calculating similarity
collocations = []
for word1, word2 in word_pairs:
    similarity = get_similarity(word1, word2)
    if similarity is not None:
        collocations.append((word1, word2, similarity))

# Sort by similarity score
collocations.sort(key=lambda x: x[2], reverse=True)

# Display the collocations and their similarity scores
print("Collocations and their similarity scores:")
for word1, word2, sim in collocations:
    print(f"{word1} - {word2}: {sim:.3f}")

Explanation:

  • Model: This code uses a pre-trained Word2Vec model (word2vec-google-news-300), which contains embeddings for millions of words.
  • Cosine Similarity: The similarity between two word vectors is calculated using cosine similarity. This measures how closely two words are related based on their context.
  • Word Pairs: The list word_pairs contains sample word pairs for which collocations are analyzed. You can modify this list to include more word pairs.

Sample Output:

Collocations and their similarity scores:
king - queen: 0.651
bank - money: 0.519
quick - fox: 0.341
lazy - dog: 0.295
strong - weak: -0.012
  • The word pair “king” and “queen” shows a high similarity, reflecting their close semantic relationship in contexts of royalty or power. (Strictly speaking, embedding similarity measures semantic relatedness; frequency-based co-occurrence measures are still needed to confirm true collocations, as sketched below.)
  • The pair “strong” and “weak” has a very low (even negative) similarity, suggesting that these are antonyms rather than collocates.
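
To extract candidate collocations from an actual corpus before scoring them (step 2 in the list above), here is a minimal sketch using NLTK’s bigram finder together with the get_similarity() function defined above; the tiny sample text is invented for illustration.

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download('punkt')

# A tiny illustrative corpus; replace this with your own corpus text
text = "The quick brown fox jumps over the lazy dog. The lazy dog sleeps near the river bank."
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# Find candidate bigrams by co-occurrence (PMI), then score them with embedding similarity
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
candidates = finder.nbest(bigram_measures.pmi, 10)

print("Candidate collocations with embedding similarity:")
for w1, w2 in candidates:
    sim = get_similarity(w1, w2)  # reuses the function defined earlier in this section
    if sim is not None:
        print(f"{w1} - {w2}: {sim:.3f}")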

13.2.2 Semantic Similarity in Corpora

Semantic similarity is used to measure how similar two words, phrases, or sentences are in meaning. This can be used to compare texts across corpora (e.g., learner vs. native speaker corpora).

13.2.2.1 Steps:

  1. Use word embeddings to compute similarity scores between words or phrases in two different corpora.
  2. Aggregate the similarities to compare the semantic variation between the corpora.

13.2.2.2 Python Code for Semantic Similarity in Corpora:

from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer for contextual embeddings
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example sentences from different corpora (Learner vs Native)
sentence_learner = "The cat sat on the mat."
sentence_native = "A feline rested on a carpet."

# Tokenize the sentences and create input tensors for BERT
inputs_learner = tokenizer(sentence_learner, return_tensors='pt')
inputs_native = tokenizer(sentence_native, return_tensors='pt')

# Pass the sentences through BERT to get hidden states
with torch.no_grad():
    outputs_learner = model(**inputs_learner)
    outputs_native = model(**inputs_native)

# Extract the last hidden states for sentence embeddings
embedding_learner = outputs_learner.last_hidden_state.mean(dim=1)  # Mean pooling for sentence embedding
embedding_native = outputs_native.last_hidden_state.mean(dim=1)

# Compute cosine similarity between the sentence embeddings
similarity = cosine_similarity(embedding_learner, embedding_native)[0][0]

print(f"Semantic similarity between the sentences: {similarity:.3f}")

Explanation:

  • BERT Model: We use a pre-trained BERT model (bert-base-uncased) to compute contextual word embeddings. BERT captures the meaning of a word or sentence in the context of surrounding words.
  • Sentence Embedding: BERT outputs embeddings for each token in the sentence. We use mean pooling to combine these token embeddings into a single vector representing the entire sentence.
  • Cosine Similarity: Cosine similarity is used to compute how similar the two sentences are in meaning.

Sample Output:

Semantic similarity between the sentences: 0.781
  • The two sentences “The cat sat on the mat.” (learner corpus) and “A feline rested on a carpet.” (native corpus) have a high similarity score (0.781), showing that despite lexical differences, their meanings are quite similar.
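
The code above compares a single sentence pair. Step 2 of the list above (aggregating over many pairs) can be sketched as follows, reusing the tokenizer and model already loaded; the sentences standing in for the two corpora are invented for illustration.

# Small stand-ins for a learner corpus and a native-speaker corpus
learner_sentences = ["The cat sat on the mat.", "He go to school every day."]
native_sentences = ["A feline rested on a carpet.", "He goes to school every day."]

# Mean-pool BERT token embeddings into one vector per sentence
def sentence_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

# Average the pairwise similarities to summarize how close the two corpora are semantically
pair_scores = []
for s_learner, s_native in zip(learner_sentences, native_sentences):
    sim = cosine_similarity(sentence_embedding(s_learner), sentence_embedding(s_native))[0][0]
    pair_scores.append(sim)

print(f"Mean learner-native similarity: {sum(pair_scores) / len(pair_scores):.3f}")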

13.2.3 Use Cases for Both Tasks in Corpus Linguistics:

  • Collocation Analysis:
    • Lexicography: Identify common collocations for dictionary creation or teaching materials.
    • Language Teaching: Help learners understand frequent word pairings and idiomatic expressions.
  • Semantic Similarity in Corpora:
    • Learner Corpora: Compare learner-generated texts with native speaker texts to assess the semantic proximity and linguistic variation.
    • Textual Analysis: Measure how similar different versions of texts are, or compare writing from different authors or genres.

By applying these techniques, researchers can study patterns in natural language usage, how meanings vary across corpora, and how words co-occur in different contexts.

13.3 Discourse and Pragmatics Using Word Embeddings: Python Code and Explanations

In discourse and pragmatics, topic modeling and contextual meaning and coherence are key tasks. Below are Python implementations for each task, focusing on using word embeddings to uncover topics and assess coherence in text.

13.3.1 Topic Modeling Using Embedding-Based Approaches

Topic modeling is a technique for identifying hidden themes or topics within a collection of documents. Embedding-based models, like BERTopic, can generate clusters of semantically related words that represent underlying topics.

13.3.1.1 Steps:

  1. Preprocess a collection of documents.
  2. Use a pre-trained embedding model to transform text into vectors.
  3. Apply topic modeling using an embedding-based approach like BERTopic.

13.3.1.2 Python Code for Topic Modeling Using BERTopic:

You need to install BERTopic and sentence-transformers first:

pip install bertopic sentence-transformers

Now the Python code:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Fetch sample data (20 newsgroups dataset for topic modeling)
data = fetch_20newsgroups(subset='all')['data']

# Initialize BERTopic model (uses embedding-based topic modeling)
topic_model = BERTopic()

# Fit the topic model on the dataset
topics, probabilities = topic_model.fit_transform(data)

# Display the top 5 topics
topic_info = topic_model.get_topic_info()
print(topic_info.head())

# Get the top words for a specific topic
topic_id = 0  # You can change this to explore different topics
top_words = topic_model.get_topic(topic_id)
print(f"Top words for topic {topic_id}: {top_words}")

Explanation:

  • BERTopic: BERTopic is a topic modeling library that leverages pre-trained sentence embeddings to find topics in large corpora.
  • Embedding Transformation: It uses embeddings to capture the semantic meaning of each document and then clusters these embeddings to identify topics.
  • Output:
  • topic_info: This provides a list of all topics discovered by the model, along with the size of each topic (i.e., how many documents are classified under each topic).
  • get_topic(): This function returns the top words for a particular topic, providing insights into the core vocabulary related to that topic.

Sample Output:

Top 5 topics:
   Topic  Count
0     -1   7285
1      0   1121
2      1    874
3      2    797
4      3    726

Top words for topic 0: [('space', 0.03), ('nasa', 0.02), ('launch', 0.015), ('mission', 0.014), ('orbit', 0.013)]

In this example, Topic 0 might be related to space exploration, as evidenced by the most prominent words: “space”, “nasa”, “launch”, “mission”, “orbit”.
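
BERTopic also ships with interactive visualizations that can help interpret the discovered topics. A minimal sketch, assuming the fitted topic_model from the code above (both calls return Plotly figures):

# Interactive overview of the topic space
fig_topics = topic_model.visualize_topics()
fig_topics.show()

# Bar chart of the top words for the first few topics
fig_bars = topic_model.visualize_barchart(top_n_topics=5)
fig_bars.show()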

13.3.2 Contextual Meaning and Coherence Assessment

To assess coherence in a text, we can analyze the semantic similarity between consecutive sentences. Cohesive and coherent texts tend to have sentences that are contextually related, whereas disjointed texts may exhibit lower similarity scores between sentences.

13.3.2.1 Steps:

  1. Preprocess the text into sentences.
  2. Use pre-trained sentence embeddings (e.g., BERT) to compute sentence vectors.
  3. Calculate the similarity between consecutive sentences to assess coherence.

13.3.2.2 Python Code for Contextual Meaning and Coherence Using BERT:

You need to install transformers for sentence embeddings:

pip install transformers torch

Now the Python code:

from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import nltk

# Download the NLTK sentence tokenizer
nltk.download('punkt')

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample text for coherence analysis
text = """
The cat sat on the mat. It was a sunny day. The dog barked at the cat. The mat was clean and soft.
The weather changed abruptly. There was a sudden storm, and everyone rushed inside.
"""

# Tokenize the text into sentences
sentences = nltk.sent_tokenize(text)

# Function to get the sentence embedding from BERT
def get_sentence_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling of the hidden states to get sentence embedding
    sentence_embedding = outputs.last_hidden_state.mean(dim=1)
    return sentence_embedding

# Compute embeddings for all sentences
sentence_embeddings = [get_sentence_embedding(sentence) for sentence in sentences]

# Compute cosine similarity between consecutive sentences
coherence_scores = []
for i in range(len(sentence_embeddings) - 1):
    similarity = cosine_similarity(sentence_embeddings[i], sentence_embeddings[i + 1])[0][0]
    coherence_scores.append(similarity)

# Display coherence scores
for i, score in enumerate(coherence_scores):
    print(f"Coherence between sentence {i+1} and {i+2}: {score:.3f}")

Explanation:

  • Sentence Tokenization: The text is split into individual sentences using nltk.sent_tokenize().
  • BERT Sentence Embedding: Each sentence is passed through BERT to obtain a sentence embedding, which is a dense vector representation capturing the semantic meaning of the entire sentence.
  • Coherence Measurement: The coherence between consecutive sentences is measured using cosine similarity between their embeddings. A high similarity score means that the two sentences are contextually coherent, while a low score indicates a break in coherence.

Sample Output:

Coherence between sentence 1 and 2: 0.721
Coherence between sentence 2 and 3: 0.695
Coherence between sentence 3 and 4: 0.891
Coherence between sentence 4 and 5: 0.462
Coherence between sentence 5 and 6: 0.853

In this example:

  • High coherence is observed between sentence pairs 3 & 4 and 5 & 6, indicating they are contextually related.
  • Lower coherence between sentences 4 & 5 suggests a possible topic shift or break in coherence, which could signal an abrupt transition in the discourse.
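
For a single document-level figure (useful, for example, in the essay-scoring use case below), one simple option is to average the consecutive-sentence similarities. A minimal sketch reusing coherence_scores from the code above:

overall_coherence = sum(coherence_scores) / len(coherence_scores)
print(f"Overall coherence score: {overall_coherence:.3f}")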

13.3.3 Use Cases for Both Tasks in Discourse Analysis:

  • Topic Modeling:
    • Discourse Studies: Identify themes in large-scale conversations (e.g., analyzing debates, interviews, or discussions).
    • Genre Analysis: Discover the key topics or themes within specific genres of text (e.g., scientific articles, novels, news articles).
  • Contextual Meaning and Coherence:
    • Automatic Essay Scoring: Coherence scores can be used to evaluate how well a student’s essay flows from one sentence or paragraph to the next.
    • Discourse Analysis: Researchers can measure the cohesion within a text to better understand how well ideas are connected or if there are any sudden shifts in the narrative.

These tools offer a powerful way to apply word embeddings to uncover the structure and meaning within discourse, making them useful for both academic and practical applications in language analysis.

13.4 Sociolinguistics Using Word Embeddings

In sociolinguistics, word embeddings can be used to analyze variation and change over time and across dialects or sociolects. Below are Python implementations for both tasks, along with detailed explanations.

13.4.1 Variation and Change in Word Usage

Researchers can track changes in word meanings or usage across time periods or linguistic communities by training word embeddings on different subsets of data (e.g., text from different decades or regions). By comparing embeddings of the same word across different time periods or regions, researchers can observe how word meanings shift.

13.4.1.1 Steps:

  1. Divide the text corpus into different time periods (or other sociolinguistic factors like regions or social groups).
  2. Train or load pre-trained word embeddings for each time period.
  3. Compare the embeddings of a target word across the time periods to detect changes in meaning.

13.4.1.2 Python Code for Tracking Variation and Change:

13.4.1.2.1 Download Pre-Trained Word2Vec Models

If you do not have the word2vec-google-news-300.bin file, you can use pre-trained word embeddings from gensim. You can load models like the Google News Word2Vec embeddings, which are available through gensim.

Here’s an example of how to load a pre-trained model from Gensim’s downloader:

import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

# List the pre-trained models available through Gensim's downloader
print(list(api.info()['models'].keys()))
# Output:
# ['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']

# Load two pre-trained models trained on different text sources
model_google = api.load("word2vec-google-news-300")       # Google News corpus
model_wiki = api.load("fasttext-wiki-news-subwords-300")  # Wikipedia and news corpus

# Function to compare a word's embedding across two embedding models
def compare_word_across_time(word, model1, model2):
    try:
        # Get embeddings for the word from both time periods
        vector1 = model1[word]
        vector2 = model2[word]
        
        # Calculate cosine similarity to see how the word meaning has changed
        similarity = cosine_similarity([vector1], [vector2])[0][0]
        return similarity
    except KeyError:
        return f"'{word}' not found in one of the models."

# Example usage: compare how the word 'cloud' is represented in the Google News and Wikipedia-based models
word = 'cloud'
similarity_score = compare_word_across_time(word, model_google, model_wiki)

print(f"Semantic similarity for '{word}' between Google and Wiki: {similarity_score:.3f}")

Sample Output:

Semantic similarity for 'cloud' between Google and Wiki: -0.035

In this case, the near-zero similarity score mainly reflects that the two models were trained independently and therefore occupy different, unaligned vector spaces; it should not be read directly as a shift in the meaning of “cloud.” To compare word vectors across models, the spaces first need to be aligned, for example with an orthogonal Procrustes mapping fitted on their shared vocabulary, as sketched below.
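
Here is a minimal sketch of such an alignment using orthogonal Procrustes over the shared vocabulary. It assumes both models are gensim KeyedVectors loaded as above; the 20,000-word cap is an arbitrary choice to keep memory use modest.

import numpy as np
from scipy.linalg import orthogonal_procrustes
from sklearn.metrics.pairwise import cosine_similarity

# Words known to both models (capped for speed and memory)
shared = [w for w in model_wiki.key_to_index if w in model_google.key_to_index][:20000]

# Stack the vectors for the shared words: X is the space we rotate, Y the reference space
X = np.stack([model_wiki[w] for w in shared])
Y = np.stack([model_google[w] for w in shared])

# Fit an orthogonal rotation that maps the Wiki space onto the Google News space
R, _ = orthogonal_procrustes(X, Y)

# After alignment, the cosine similarity between the two representations of a word is interpretable
word = 'cloud'
aligned_vec = model_wiki[word] @ R
similarity = cosine_similarity([aligned_vec], [model_google[word]])[0][0]
print(f"Aligned similarity for '{word}': {similarity:.3f}")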

13.4.1.2.2 Load a Pre-Trained Model from a Local File

Ensure that the file 'word2vec-google-news-300.bin' exists and the path is correctly specified. If the file is stored in a different directory, make sure to provide the absolute path to the file.

For example:

model_google = gensim.models.KeyedVectors.load_word2vec_format('/path/to/your/word2vec-google-news-300.bin', binary=True)

13.4.1.3 Train Your Own Embedding Model (Optional)

If you want to track change over time (or differences between sources), you can use Gensim’s Word2Vec to train separate models on your own corpora, for example one per time period or source.

Here’s a basic example of how to train a Word2Vec model:

from gensim.models import Word2Vec

# Assuming `sentences_google` and `sentences_wiki` are lists of tokenized sentences from your two corpora
# Example: sentences_google = [["word1", "word2", "word3"], ["word4", "word5"]]

# Train Word2Vec models on your corpus
model_google = Word2Vec(sentences_google, vector_size=300, window=5, min_count=1, workers=4)
model_wiki = Word2Vec(sentences_wiki, vector_size=300, window=5, min_count=1, workers=4)

# Save the models
model_google.wv.save_word2vec_format('model_google.bin', binary=True)
model_wiki.wv.save_word2vec_format('model_wiki.bin', binary=True)

You will need reasonably large corpora for each source or time period to train reliable embeddings. Once trained, you can load and compare these models as in the code above.

13.4.2 Dialectology Using Word Embeddings

In dialectology, researchers can compare the embeddings of words across dialects or sociolects to quantify linguistic similarities and differences. By training word embeddings on corpora representing different dialects, the embeddings for the same word can be compared to reveal how meanings or usages differ.

13.4.2.1 Steps:

  1. Train or load pre-trained word embeddings for different dialects or sociolects.
  2. Compare the embeddings of the same word across dialects to measure their semantic similarity.

13.4.2.2 Python Code for Comparing Dialects Using Word Embeddings:

import gensim
from sklearn.metrics.pairwise import cosine_similarity

# Load or train embeddings for different dialects (example from two dialect corpora)
model_us_english = gensim.models.KeyedVectors.load_word2vec_format('word2vec_us_english.bin', binary=True)
model_uk_english = gensim.models.KeyedVectors.load_word2vec_format('word2vec_uk_english.bin', binary=True)

# Function to compare word embeddings across dialects
def compare_word_across_dialects(word, model_dialect1, model_dialect2):
    try:
        # Get embeddings for the word from both dialects
        vector_dialect1 = model_dialect1[word]
        vector_dialect2 = model_dialect2[word]
        
        # Calculate cosine similarity to compare the meanings across dialects
        similarity = cosine_similarity([vector_dialect1], [vector_dialect2])[0][0]
        return similarity
    except KeyError:
        return f"'{word}' not found in one of the models."

# Example usage: Compare the meaning of 'boot' in US English and UK English
word = 'boot'
similarity_score = compare_word_across_dialects(word, model_us_english, model_uk_english)

print(f"Semantic similarity for '{word}' between US and UK English: {similarity_score:.3f}")

Explanation:

  • Different Dialect Embedding Models: Two Word2Vec models (model_us_english and model_uk_english) are assumed to have been trained on corpora from two English varieties, American and British English; the file names above are placeholders for your own models.
  • Cosine Similarity: The similarity between the word embeddings in the two dialects indicates how similarly the word is used or understood. A high similarity score suggests similar usage, while a low score suggests a difference in usage or meaning. As with the diachronic comparison above, independently trained models occupy different vector spaces, so they should first be aligned (e.g., with a Procrustes mapping over shared vocabulary) before these scores are interpreted.
  • Usage Example: The word “boot” has different meanings in US English (referring to footwear) and UK English (referring to the trunk of a car).

Sample Output:

Semantic similarity for 'boot' between US and UK English: 0.421

In this example, the word “boot” has a lower similarity score, reflecting its different meanings in the two dialects.

13.5 Sentiment and Politeness Analysis Using Word Embeddings

Word embeddings can be used to analyze both politeness or formality levels and sentiment analysis. These tasks are important for understanding the pragmatic and affective aspects of language in different contexts, such as social interactions or cross-cultural communication.

13.5.1 Politeness or Formality Levels Using Word Embeddings

In this task, we aim to assess the politeness or formality of text by measuring how closely words or phrases align with known politeness/formality markers in the embedding space. We can create a word embedding-based model to compare the text with words commonly associated with politeness or formality.

13.5.1.1 Steps:

  1. Create or load word embeddings (Word2Vec, BERT).
  2. Define a set of words or phrases commonly associated with politeness or formality (e.g., “please”, “thank you”, “sir”).
  3. Compute the similarity between the words in the input text and the politeness markers.

13.5.1.2 Python Code for Politeness/Formality Level Detection:

import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained Word2Vec model
model = api.load("word2vec-google-news-300")

# Define a list of politeness or formality markers
politeness_markers = ["please", "thank you", "sir", "madam", "kindly", "would you", "may I"]

# Function to assess the politeness of a given sentence
def assess_politeness(sentence, markers, model):
    # Tokenize the sentence (simplified) and strip surrounding punctuation
    words = [w.strip('.,!?;:"\'') for w in sentence.lower().split()]
    words = [w for w in words if w]

    # Keep only the markers that exist in the model vocabulary
    # (multi-word markers such as "thank you" are skipped unless the model has a phrase token for them)
    marker_vectors = [model[m] for m in markers if m in model]

    # Check similarity to politeness markers
    politeness_score = 0
    for word in words:
        if word not in model:
            continue  # Skip words that are not in the model vocabulary
        word_vector = model[word]
        # Calculate similarity with each politeness marker
        for marker_vector in marker_vectors:
            similarity = cosine_similarity([word_vector], [marker_vector])[0][0]
            politeness_score += similarity

    # Normalize the politeness score by the number of words
    return politeness_score / len(words) if len(words) > 0 else 0

# Example usage
sentence = "Would you kindly help me with this task, sir?"
politeness_score = assess_politeness(sentence, politeness_markers, model)

print(f"Politeness score for the sentence: {politeness_score:.3f}")

Explanation:

  • Politeness Markers: We define a list of words or phrases typically associated with politeness or formality.
  • Cosine Similarity: For each word in the input sentence, we compute its similarity to each politeness marker using cosine similarity. The higher the similarity score, the more polite or formal the sentence is likely to be.
  • Politeness Score: We aggregate the similarities across all words in the sentence to compute a “politeness score.”

Sample Output:

Politeness score for the sentence: 0.219
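
Because the raw score has no absolute scale, it is most informative when compared across sentences. A minimal sketch reusing assess_politeness with an invented, less formal paraphrase:

informal_sentence = "Help me with this now."
informal_score = assess_politeness(informal_sentence, politeness_markers, model)
print(f"Politeness score for the informal sentence: {informal_score:.3f}")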

13.5.2 Sentiment Analysis Using Word Embeddings

Sentiment analysis involves classifying the emotional tone of a text (e.g., positive, negative, neutral). By leveraging word embeddings, we can calculate the semantic similarity between words in the input text and words that are commonly associated with positive or negative sentiments.

13.5.2.1 Steps:

  1. Load pre-trained word embeddings.
  2. Define a set of words associated with positive and negative sentiment.
  3. Compute the similarity between the words in the input text and the sentiment markers.

13.5.2.2 Python Code for Sentiment Analysis:

# Continuing from the previous example: `model` and `cosine_similarity` are already loaded

# Define a list of positive and negative sentiment markers
positive_markers = ["good", "happy", "joy", "love", "excellent", "amazing"]
negative_markers = ["bad", "sad", "angry", "hate", "terrible", "horrible"]

# Function to assess the sentiment of a given sentence
def assess_sentiment(sentence, pos_markers, neg_markers, model):
    # Tokenize the sentence (simplified) and strip surrounding punctuation
    words = [w.strip('.,!?;:"\'') for w in sentence.lower().split()]

    # Initialize sentiment scores
    positive_score = 0
    negative_score = 0

    # Check similarity to sentiment markers
    for word in words:
        try:
            word_vector = model[word]
            # Compare with positive markers
            for pos_word in pos_markers:
                pos_vector = model[pos_word]
                similarity = cosine_similarity([word_vector], [pos_vector])[0][0]
                positive_score += similarity
            # Compare with negative markers
            for neg_word in neg_markers:
                neg_vector = model[neg_word]
                similarity = cosine_similarity([word_vector], [neg_vector])[0][0]
                negative_score += similarity
        except KeyError:
            # Word not in the model vocabulary
            pass
    
    # Determine overall sentiment based on which score is higher
    sentiment = "Positive" if positive_score > negative_score else "Negative"
    return sentiment, positive_score, negative_score

# Example usage
sentence = "I love this amazing product!"
sentiment, pos_score, neg_score = assess_sentiment(sentence, positive_markers, negative_markers, model)

print(f"Sentiment: {sentiment}")
print(f"Positive score: {pos_score:.3f}, Negative score: {neg_score:.3f}")

Sample Output:

Sentiment: Positive
Positive score: 7.896, Negative score: 6.310

In this example, the sentence “I love this amazing product!” is classified as positive based on its higher similarity to positive sentiment markers like “love” and “amazing.”

By leveraging word embeddings, we can analyze both the politeness and sentiment of text in a nuanced way, providing insights into how language is used to convey emotions, politeness, and formality across different contexts.

13.6 Translation and Bilingual Word Embeddings

Bilingual word embeddings enable us to map words from different languages into a shared semantic space, facilitating cross-linguistic analysis and error detection in translation. These embeddings allow words in different languages that have similar meanings to be located close to each other in the shared space, which is useful for tasks like machine translation and semantic analysis across languages.

In cross-linguistic analysis, we use bilingual embeddings to map words from two different languages into a shared semantic space. This allows us to study how similar words in different languages are semantically related.

13.6.0.1 Steps:

  1. Load bilingual word embeddings for two languages.
  2. Compare the embeddings of words from different languages in the shared space.
  3. Calculate similarity between words to identify cross-linguistic semantic similarity.

Here’s the revised Python code for both cross-linguistic analysis and translation error detection using English and Korean word embeddings:

13.6.1 Cross-Linguistic Analysis Between English and Korean Using Bilingual Word Embeddings

You need to install fastText first:

pip install fasttext

Now the Python code:

import fasttext
import fasttext.util
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained fastText models for English and Korean
fasttext.util.download_model('en', if_exists='ignore')  # English
fasttext.util.download_model('ko', if_exists='ignore')  # Korean
model_en = fasttext.load_model('cc.en.300.bin')
model_ko = fasttext.load_model('cc.ko.300.bin')

# Function to compare cross-linguistic similarity between English and Korean words
def cross_linguistic_similarity(word_en, word_ko, model_en, model_ko):
    # Get word embeddings
    vec_en = model_en.get_word_vector(word_en)
    vec_ko = model_ko.get_word_vector(word_ko)
    
    # Compute cosine similarity
    similarity = cosine_similarity([vec_en], [vec_ko])[0][0]
    return similarity
# Example usage: Compare 'apple' (English) and '사과' (Korean)
word_en = 'apple'
word_ko = '사과'  # 사과 (sagwa) means "apple" in Korean
similarity_score = cross_linguistic_similarity(word_en, word_ko, model_en, model_ko)

print(f"Similarity between '{word_en}' (English) and '{word_ko}' (Korean): {similarity_score:.3f}")

Explanation:

  • fastText Models: We load pre-trained fastText models for English and Korean, each of which represents words as 300-dimensional vectors. Note that the standard cc.en and cc.ko vectors are trained independently on monolingual corpora, so their spaces are not automatically aligned; for a genuinely shared space, use fastText’s aligned word vectors or learn a mapping between the spaces (e.g., a MUSE-style alignment), as sketched below.
  • Cross-Linguistic Similarity: The function calculates cosine similarity between the embedding of an English word and that of its Korean translation; this score is only meaningful once the two embedding spaces share a common coordinate system.

Sample Output:

Similarity between 'apple' (English) and '사과' (Korean): 0.811

In this example, ‘apple’ (English) and ‘사과’ (Korean) receive a high similarity score which, assuming aligned bilingual embeddings, indicates that they are semantically related across languages.
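
As noted above, the standard cc.* models are not aligned across languages. Here is a minimal sketch using fastText’s pre-aligned vectors instead; it assumes the files wiki.en.align.vec and wiki.ko.align.vec have been downloaded from the fastText aligned-vectors page (loading them requires several gigabytes of RAM).

from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

# The aligned vectors share one multilingual space, so cross-lingual cosine similarity is meaningful
aligned_en = KeyedVectors.load_word2vec_format('wiki.en.align.vec')
aligned_ko = KeyedVectors.load_word2vec_format('wiki.ko.align.vec')

similarity = cosine_similarity([aligned_en['apple']], [aligned_ko['사과']])[0][0]
print(f"Aligned-space similarity between 'apple' and '사과': {similarity:.3f}")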

13.6.2 Translation Error Detection Between English and Korean

We can use bilingual word embeddings to detect translation errors by comparing the embeddings of English words with their Korean translations. If the similarity is below a threshold, the translation may be incorrect.

13.6.2.1 Revised Python Code for Translation Error Detection Between English and Korean:

# Function to detect errors in translation using bilingual word embeddings
def detect_translation_error(source_word, target_word, model_source, model_target, threshold=0.6):
    # Get embeddings for source and target words
    vec_source = model_source.get_word_vector(source_word)
    vec_target = model_target.get_word_vector(target_word)
    
    # Compute cosine similarity
    similarity = cosine_similarity([vec_source], [vec_target])[0][0]
    
    # Determine if there is a potential translation error
    if similarity < threshold:
        return f"Potential translation error: '{source_word}' and '{target_word}' have low similarity ({similarity:.3f})."
    else:
        return f"Translation seems correct: '{source_word}' and '{target_word}' are semantically similar ({similarity:.3f})."

# Example usage: Detecting a possible translation error between 'car' (English) and '자동차' (Korean)
source_word = 'car'
target_word = '자동차'  # 자동차 (jadongcha) means "car" in Korean
error_message = detect_translation_error(source_word, target_word, model_en, model_ko)

print(error_message)

# Example usage: Detecting a potential translation error between 'car' (English) and '책' (Korean)
source_word = 'car'
target_word = '책'  # 책 (chaek) means "book" in Korean (incorrect translation)
error_message = detect_translation_error(source_word, target_word, model_en, model_ko)

print(error_message)

Explanation:

  • Translation Error Detection: This function compares the similarity between an English word and its Korean translation. If the similarity is below a given threshold (e.g., 0.6), the translation is flagged as potentially incorrect.
  • Threshold: The threshold sets the cutoff for acceptable similarity. A higher threshold makes the check stricter, so more word pairs are flagged as potential errors.

Sample Output:

Translation seems correct: 'car' and '자동차' are semantically similar (0.874).
Potential translation error: 'car' and '책' have low similarity (0.218).

In the first example, ‘car’ (English) and ‘자동차’ (Korean) are correctly translated, whereas ‘car’ and ‘책’ (book) are not semantically similar, indicating a translation error.
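
To screen a small bilingual glossary in one pass, a minimal sketch reusing detect_translation_error (the word pairs below are invented for illustration):

# A small glossary of (English, Korean) candidate translation pairs
glossary = [('car', '자동차'), ('book', '책'), ('river', '강'), ('car', '책')]

for en_word, ko_word in glossary:
    print(detect_translation_error(en_word, ko_word, model_en, model_ko))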

By using bilingual word embeddings, we can perform effective cross-linguistic analysis and translation error detection between English and Korean, improving translation systems and understanding the semantic relationships between languages.