17  Topic Modeling with R (Example)

17.1 Why Use Topic Modeling in Linguistics?

In linguistics, topic modeling is particularly valuable because it allows researchers to:

  • Identify patterns and themes in large collections of texts.
  • Analyze thematic structures without needing pre-labeled data.
  • Gain insights into the linguistic or cultural elements underlying different topics.

Common applications include examining corpora such as:

  • Academic articles to find recurring themes.
  • Social media data to analyze discourse.
  • Historical texts to trace the evolution of ideas over time.

17.2 Getting Started with R for Topic Modeling

To conduct topic modeling in R, we use the topicmodels package, which implements Latent Dirichlet Allocation (LDA), the most popular topic modeling algorithm. Here’s how to get started.

17.3 Install and Load Necessary Packages

First, we need to install and load several packages:

  • tm for text mining and preprocessing,
  • topicmodels for creating the LDA model, and
  • tidyverse and tidytext for data manipulation and visualization.

# Install packages if not already installed
install.packages("tm")
install.packages("topicmodels")
install.packages("tidyverse")
install.packages("tidytext")
# Load the libraries
library(tm)
Warning: package 'tm' was built under R version 4.3.3
Loading required package: NLP
library(topicmodels)
Warning: package 'topicmodels' was built under R version 4.3.3
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::annotate() masks NLP::annotate()
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::lag()        masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)

17.4 Data Import and Preparation

Before we dive into topic modeling, let’s learn how to import data and prepare it for analysis. This step is essential for linguistics research since it ensures that we start with clean, manageable text data.

17.4.1 Importing the Data

For this example, we assume we have a CSV file named data_topicmodeling.csv. This file contains a collection of academic texts, with each row representing a document. The CSV file should be in the same working directory as your R script or project, or you should specify the full path.

# Import the dataset
data <- read_csv("data_topicmodeling.csv")
Rows: 1526 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): author, title_eng, journal_short, abstract
dbl (1): year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Add a unique identifier for each document
data <- tibble::rowid_to_column(data, "doc_id")

This code reads in the CSV file and creates a new column called doc_id, assigning a unique identifier to each document. The doc_id column will help us keep track of documents when we analyze and interpret the results.
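To confirm that the import worked as expected, a quick structural check is useful. Here is a minimal sketch using glimpse() from dplyr (loaded with the tidyverse):

# Optional: a quick look at the imported data
glimpse(data)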

17.4.2 Adding Custom Stop Words

In academic texts, there may be specific words or phrases frequently used across documents but not meaningful for topic modeling. For example, terms like “study,” “results,” or “analysis” may appear often but do not contribute to distinct thematic content. In this case, we use a custom stop words file.

# Load custom stop words
stop_words <- readLines("academic_stopwords.txt", encoding = "UTF-8")

Here, academic_stopwords.txt is a text file with one stop word per line, such as “method,” “study,” or “analysis.” Make sure this file is in your working directory or specify its full path. Loading these stop words allows us to exclude them during preprocessing, helping to focus the model on more meaningful terms.
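If you want to verify that the file loaded correctly, or experiment before preparing a full stop word file, a small check or inline vector works. This is a sketch; the words below are placeholders, not the actual file contents:

# Inspect the first few custom stop words
head(stop_words)

# Alternatively, define a small vector inline for testing
# stop_words <- c("study", "result", "analysis", "method", "paper")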

17.5 Pre-processing with POS Tagging

Pre-processing with part-of-speech (POS) tagging helps us target specific types of words relevant to topic analysis. In linguistics, POS tagging reveals the syntactic category of each word (e.g., noun, verb, adjective), allowing us to select only the most meaningful words for topic modeling.

17.5.1 Install and Load udpipe for POS Tagging

udpipe is an R package that provides access to pre-trained language models for POS tagging and dependency parsing. First, ensure udpipe is installed, then load it and download the English language model.

# Install and load the udpipe package
# install.packages('udpipe')
library(udpipe)

# Download the English language model
# ud_model <- udpipe_download_model(language = "english")

# Load the language model
# ud_model <- udpipe_load_model(ud_model$file_model)
ud_model <- udpipe_load_model("english-ewt-ud-2.5-191206.udpipe")

17.5.2 Define the POS Tagging Function

The function ud_tagger_f below uses udpipe to annotate each document in our dataset. This function:

  • Applies POS tagging on the abstract column of our dataset.
  • Filters out terms listed in the stop_words file.
  • Selects only words classified as verbs, nouns, adjectives, or adverbs (since these words typically carry more thematic meaning).
# Define the POS tagging function
ud_tagger_f <- function(input) {
    udpipe_annotate(ud_model, x = input$abstract, doc_id = input$doc_id) %>%  # year is re-attached later by merging on doc_id
    as.data.frame() %>%
    filter(!lemma %in% stop_words) %>%
    subset(upos %in% c("VERB", "NOUN", "ADJ", "ADV"))
}

17.5.3 Apply POS Tagging to the Dataset

The next step is to apply our ud_tagger_f function to the dataset and save the results. This tagged dataset will focus only on the filtered POS categories and exclude irrelevant words.

# Apply POS tagging to the data
data_tag <- ud_tagger_f(data)

17.5.4 Clean the Tokenized Data

With the POS-tagged data, we can now proceed to additional cleaning steps:

  1. Remove numeric characters.
  2. Delete punctuation.
  3. Filter out single-character words, as they typically add little value.
  4. Delete empty lemmas (words that become empty after cleaning).
  5. Convert all lemmas to lowercase.

# Select relevant columns and clean tokens
text_cleaning_tokens <- data_tag %>% select(doc_id, lemma)

# Remove numbers and punctuation
text_cleaning_tokens$lemma <- gsub('[[:digit:]]+', '', text_cleaning_tokens$lemma)  # Remove numbers
text_cleaning_tokens$lemma <- gsub('[[:punct:]]+', '', text_cleaning_tokens$lemma)  # Remove punctuation

# Filter out one-letter words
text_cleaning_tokens <- text_cleaning_tokens %>% filter(!(nchar(lemma) == 1))

# Remove empty lemmas
tokens <- text_cleaning_tokens %>% filter(!(lemma == ""))

# Convert lemmas to lowercase
tokens$lemma <- tolower(tokens$lemma)

# Add an index column for tracking purposes
tokens <- tokens %>% mutate(ind = row_number())
head(tokens, 3)
  doc_id      lemma ind
1      1    purpose   1
2      1    explain   2
3      1 difference   3

In the above code:

  • gsub('[[:digit:]]+', '', text_cleaning_tokens$lemma) removes any numeric characters.

  • gsub('[[:punct:]]+', '', text_cleaning_tokens$lemma) removes punctuation.

  • filter(!(nchar(lemma) == 1)) filters out any single-character words.

  • filter(!(lemma == "")) removes any empty rows created by filtering.

  • tolower(tokens$lemma) converts all lemmas to lowercase.

The result is a cleaned dataset called tokens, ready for topic modeling.

17.6 Model Building: Creating the Document-Feature Matrix (DFM)

After pre-processing, the next step in topic modeling is to create a DFM, which transforms our tokenized text data into a matrix suitable for topic modeling.

17.6.1 Aggregate Token Counts per Document

First, we need to calculate the frequency of each term (lemma) within each document. This step ensures that the DFM accurately reflects the number of times each term appears in each document.

# Aggregate token counts per document
library(dplyr)
tokens_doc <- tokens %>%
  group_by(doc_id, lemma) %>%
  summarise(count = n())
`summarise()` has grouped output by 'doc_id'. You can override using the
`.groups` argument.

Here, we use group_by(doc_id, lemma) to group the data by each document and term, and summarise(count = n()) to calculate the count of each term in each document.

17.6.2 Convert to Document-Feature Matrix (DFM)

We use the quanteda and tidytext packages to convert our data into a DFM format, which is required for topic modeling. The cast_dfm function in tidytext helps convert a tidy data frame into a DFM.

# Load the necessary packages
library(quanteda)
Warning: package 'quanteda' was built under R version 4.3.3
Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"ndiMatrix" of class "replValueSp"; definition not updated
Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.

Attaching package: 'quanteda'
The following object is masked from 'package:tm':

    stopwords
The following objects are masked from 'package:NLP':

    meta, meta<-
library(tidytext)

# Convert to Document-Feature Matrix
dfm <- tokens_doc %>%
  cast_dfm(doc_id, lemma, count)

The cast_dfm function takes three arguments:

  • doc_id: the document identifier column,

  • lemma: the term or feature column, and

  • count: the frequency count of each term in each document.

The resulting DFM will have documents as rows, terms as columns, and frequency counts as the matrix values.
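Before checking its dimensions, you can print a small corner of the matrix to get a feel for its structure; max_ndoc and max_nfeat are quanteda printing options that limit the output:

# Peek at the first documents and features
print(dfm, max_ndoc = 3, max_nfeat = 6)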

To check the dimensions of the DFM:

# Check dimensions of the DFM
dim(dfm)
[1] 1517 6476
# Expected output: [1517, 6476]

Here, 1517 represents the number of documents, and 6476 represents the unique terms (or features) across all documents.

17.6.3 Trim the DFM

To reduce sparsity in the DFM and improve the model’s focus, we apply frequency thresholds. Removing terms that appear very infrequently helps in reducing noise.

# Trim the DFM
trimmed_dfm <- dfm_trim(dfm, min_termfreq = 5, min_docfreq = 10)

In this command:

  • min_termfreq = 5 removes terms that appear fewer than 5 times across all documents.
  • min_docfreq = 10 removes terms that appear in fewer than 10 documents.

After trimming, we check the new dimensions of the DFM:

# Check dimensions after trimming
dim(trimmed_dfm)
[1] 1517 1286
# Expected output: [1517, 1286]

In this example, the DFM is reduced to 1286 terms after removing low-frequency terms.

17.6.4 Save the Trimmed DFM

Saving the trimmed DFM allows us to use this matrix as input for our topic model in the next steps. It’s often helpful to save it as a CSV file for future use or verification.

# Convert the DFM to a data frame using `convert`
dfm_dataframe <- convert(trimmed_dfm, to = "data.frame")

# Save the data frame to a CSV file
write.csv(dfm_dataframe, "trimmed_dfm.csv")

This file, trimmed_dfm.csv, is now ready for further analysis or to serve as the input for topic modeling.

To display the top features (most frequent terms) in the trimmed DFM, we can use the topfeatures function from the quanteda package. This function will show the 20 most frequent terms in trimmed_dfm.

# Display the top 20 features in the trimmed DFM
top_terms <- topfeatures(trimmed_dfm, n = 20)
top_terms
   student    english    teacher        use      learn   language     school 
      3765       2520       2311       2111       1936       1515       1383 
   learner      class       test elementary   teaching   activity      write 
      1246       1027        985        985        931        921        916 
    effect    purpose vocabulary  classroom  education    program 
       746        728        703        682        661        646 
  • topfeatures(trimmed_dfm, n = 20) retrieves the 20 most frequent terms (features) in the trimmed_dfm.
  • top_terms will be a named vector where the names are terms, and the values are their frequencies across the corpus.
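Because the tidyverse is already loaded, a quick bar chart makes the frequency distribution of these terms easier to read. This is a minimal sketch based on the top_terms vector above:

# Visualize the top 20 terms in the trimmed DFM
tibble(term = names(top_terms), freq = as.numeric(top_terms)) %>%
  ggplot(aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 20 Terms in the Trimmed DFM")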

17.7 Building and Evaluating the LDA Model

This section covers Latent Dirichlet Allocation (LDA) with Perplexity Calculation. In this approach, we implement an LDA model and use cross-validation to determine the optimal number of topics by minimizing perplexity, a measure of model quality. This is done through 5-fold cross-validation, testing multiple topic counts to find the most appropriate configuration for the topic model.
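As a reminder, perplexity is the exponentiated negative average log-likelihood per token on held-out documents, so lower values indicate better generalization:

  Perplexity(D_test) = exp( -(Σ_d log p(w_d)) / (Σ_d N_d) )

where p(w_d) is the likelihood of the words in held-out document d under the fitted model and N_d is the number of tokens in that document.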

17.7.1 Convert DFM to Document-Term Matrix (DTM)

To run an LDA model using the topicmodels package, we need the data in a DocumentTermMatrix format. The convert function from the quanteda package simplifies this process, converting the trimmed_dfm to a DTM format compatible with topicmodels.

# Load the necessary package
library(topicmodels)

# Convert the trimmed DFM to a Document-Term Matrix for topic modeling
dtm <- convert(trimmed_dfm, to = "topicmodels", omit_empty = TRUE)

# Check the dimensions of the DTM
dim(dtm) # Expected output: [1517, 1286]
[1] 1517 1286

17.7.2 Set Up Parallel Processing for Cross-Validation

To optimize the topic model, we use 5-fold cross-validation across multiple values of k (the number of topics). Parallel processing speeds up this process by allowing each fold or candidate k value to be processed on a separate CPU core.

# Load libraries for parallel processing
library(doParallel)
Loading required package: foreach

Attaching package: 'foreach'
The following objects are masked from 'package:purrr':

    accumulate, when
Loading required package: iterators
Loading required package: parallel
library(scales)

Attaching package: 'scales'
The following object is masked from 'package:purrr':

    discard
The following object is masked from 'package:readr':

    col_factor
# Set up a cluster with one fewer core than available
cluster <- makeCluster(detectCores(logical = TRUE) - 1)
registerDoParallel(cluster)

# Load `topicmodels` on all cluster nodes
clusterEvalQ(cluster, {
   library(topicmodels)
})
[[1]]
[1] "topicmodels" "stats"       "graphics"    "grDevices"   "utils"      
[6] "datasets"    "methods"     "base"       

(The same list is returned for each of the seven worker nodes.)

17.7.3 Define LDA Model Parameters and Cross-Validation Settings

Set the parameters for the LDA model:

  • burnin, iter, and keep control the Gibbs sampling process.

  • folds sets the number of cross-validation folds.

  • candidate_k provides the values of k (number of topics) to test.

# Define LDA model parameters
burnin <- 1000
iter <- 1000
keep <- 50

# Define cross-validation settings
folds <- 5
n <- nrow(dtm)
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(10:30) # different candidate values for the number of topics

# Export necessary objects to all cluster nodes
clusterExport(cluster, c("dtm", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))

17.7.4 Perform 5-Fold Cross-Validation for Perplexity Calculation

The foreach loop runs cross-validation for each candidate value of k in parallel. For each candidate k, it:

  1. Splits the data into training and validation sets.
  2. Fits an LDA model on the training set.
  3. Calculates the perplexity on the validation set.

# Run cross-validation for each candidate number of topics
system.time({
  results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar% {
     k <- candidate_k[j]
     results_1k <- matrix(0, nrow = folds, ncol = 2)
     colnames(results_1k) <- c("k", "perplexity")
     for(i in 1:folds){
        train_set <- dtm[splitfolds != i, ]
        valid_set <- dtm[splitfolds == i, ]
  
        fitted <- LDA(train_set, k = k, method = "Gibbs",
                      control = list(burnin = burnin, iter = iter, keep = keep))
        results_1k[i,] <- c(k, perplexity(fitted, newdata = valid_set))
        }
     return(results_1k)
  }
})
# Stop the cluster after cross-validation
stopCluster(cluster)

# Convert results to a data frame
results_df <- as.data.frame(results)

17.7.5 Visualize the Perplexity Scores for Different Topic Counts

To determine the optimal number of topics, plot the perplexity scores against the candidate values of k. The ggplot2 package provides a straightforward way to visualize this relationship.

# Load ggplot2 for plotting
library(ggplot2)

# Plot perplexity scores for each candidate number of topics
ggplot(results_df, aes(x = k, y = perplexity)) +
   geom_point() +
   geom_smooth(se = FALSE) +
   ggtitle("5-Fold Cross-Validation of Topic Modeling") +
   labs(x = "Candidate Number of Topics", y = "Perplexity")

In the plot, the x-axis represents the number of topics, and the y-axis shows the perplexity scores. The optimal number of topics minimizes perplexity, indicating a balance between model complexity and generalization to unseen data.
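Rather than reading the minimum off the plot alone, you can compute the average perplexity per candidate k directly. A short sketch using the results_df object created above:

# Average perplexity across folds for each k, then pick the minimum
mean_perplexity <- results_df %>%
  group_by(k) %>%
  summarise(mean_perplexity = mean(perplexity))
best_k <- mean_perplexity$k[which.min(mean_perplexity$mean_perplexity)]
best_k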

17.8 Training and Analyzing the LDA Model

After determining the optimal number of topics, we can train a final LDA model using the entire dataset. This section demonstrates how to build the model, interpret its results, and save relevant matrices for further analysis.

17.8.1 Define the Number of Topics and Train the LDA Model

In this example, we set K = 10 to define the number of topics and train the model using the Gibbs sampling method, a common approach for LDA.

# Load the topicmodels package
library(topicmodels)

# Define the number of topics
K <- 10

# Train the LDA model
lda_model <- LDA(dtm, k = K, method = "Gibbs", 
                 control = list(alpha = 50/K, delta = 0.1, verbose = 25L, 
                                seed = 123, burnin = 100, iter = 500))
K = 10; V = 1286; M = 1517
Sampling 600 iterations!
Iteration 25 ...
Iteration 50 ...
...
Iteration 575 ...
Iteration 600 ...
Gibbs sampling completed!

In this code:

  • alpha = 50/K and delta = 0.1 are hyperparameters that control the distribution of topics in documents and of words in topics, respectively.
  • seed = 123 ensures reproducibility.
  • burnin = 100 and iter = 500 control the number of Gibbs sampling iterations.

17.8.2 Extract Posterior Distributions

The posterior function extracts the posterior distributions from the trained LDA model, giving us insight into the topics and terms.

# Obtain the posterior distributions of the LDA model
tmResult <- posterior(lda_model)

# Check the attributes of the result
attributes(tmResult)
$names
[1] "terms"  "topics"
# Save the document-topic distribution to a CSV file
write.csv(tmResult$topics, "tmResult.csv")

The posterior function returns a list with:

  • tmResult$topics: a document-topic distribution matrix (theta), where each row represents a document and each column represents a topic.
  • tmResult$terms: a topic-term distribution matrix (beta), where each row represents a topic and each column represents a term.

17.8.3 Analyze the Topic-Term Distribution (beta)

The beta matrix shows the probability of each word appearing in each topic. This information can help us interpret the topics by identifying the most likely words for each one.

# Extract the topic-term distribution matrix
beta <- tmResult$terms

# Save the beta matrix to a CSV file
write.csv(beta, 'beta.csv')

# Check the dimensions of the beta matrix
dim(beta)  # Expected output: K x ncol(DTM)
[1]   10 1286

To verify that each row of beta represents a probability distribution (i.e., sums to 1):

# Verify that each row in beta sums to 1
rowSums(beta)
 1  2  3  4  5  6  7  8  9 10 
 1  1  1  1  1  1  1  1  1  1 

17.8.4 Analyze the Document-Topic Distribution (theta)

Each row in the theta matrix represents a document, and each column represents a topic. The values indicate the probability of each topic in each document.

# Extract the document-topic distribution matrix
theta <- tmResult$topics

# Save the theta matrix to a CSV file
write.csv(theta, 'theta.csv')
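As with beta, each row of theta should sum to 1, and it is often useful to identify the single most probable topic for each document. A brief sketch using the theta matrix above:

# Verify that each document's topic proportions sum to 1
head(rowSums(theta))

# Identify the dominant topic for each document
dominant_topic <- apply(theta, 1, which.max)
head(dominant_topic)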

17.8.5 Create and Save a Matrix of Top Words for Each Topic

To facilitate interpretation, it’s often helpful to list the top words for each topic in a more human-readable format. We can transpose the beta matrix and round values to simplify the results.

# Round the beta values to three decimal places for readability
phi_matrix <- round(beta, 3)

# Transpose and convert to data frame for saving
T_phi_matrix <- as.data.frame(t(phi_matrix))
names(T_phi_matrix) <- paste("Topic", 1:K)

# Save the transposed matrix with topic words
write.csv(T_phi_matrix, "lda_topic_words.csv")

# Display the first K rows of the matrix (the first 10 words)
T_phi_matrix[1:K, ]
               Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
able             0.000   0.000   0.004   0.000       0   0.000       0   0.000
acquire          0.000   0.003   0.000   0.000       0   0.000       0   0.000
arise            0.000   0.000   0.000   0.000       0   0.000       0   0.000
basis            0.000   0.005   0.000   0.000       0   0.000       0   0.001
characteristic   0.000   0.009   0.000   0.000       0   0.000       0   0.000
correct          0.000   0.000   0.000   0.000       0   0.000       0   0.003
difference       0.000   0.000   0.000   0.013       0   0.000       0   0.006
education        0.045   0.009   0.000   0.000       0   0.000       0   0.000
effectively      0.002   0.000   0.000   0.000       0   0.002       0   0.000
effectivly       0.000   0.000   0.000   0.000       0   0.001       0   0.000
               Topic 9 Topic 10
able             0.000    0.000
acquire          0.000    0.000
arise            0.000    0.000
basis            0.000    0.000
characteristic   0.000    0.000
correct          0.000    0.000
difference       0.014    0.000
education        0.000    0.000
effectively      0.001    0.001
effectivly       0.001    0.000

In this transposed matrix, each row corresponds to a word and each column represents a topic; the values give each word's probability under each topic.

This analysis yields several matrices:

  • beta: Topic-term distribution, showing the most probable words per topic.

  • theta: Document-topic distribution, showing topic proportions for each document.

  • Transposed beta: A matrix of top words per topic, which can be saved for easy inspection.

These outputs allow us to interpret and label topics meaningfully, bringing us closer to understanding the thematic structure in our dataset.

17.9 Extracting and Visualizing Topics

17.9.1 Extract the Top 15 Words for Each Topic

The terms function can be used to identify the top terms in each topic. In this example, we extract the 15 most probable words for each topic and save the results to a CSV file for easy reference.

# Extract the top 15 words for each topic
top_words <- terms(lda_model, 15)

# Save the top words to a CSV file
write.csv(top_words, "lda_topic_15words.csv")

This code produces a table where each column represents a topic, and each row lists one of the top 15 words associated with that topic. This table can be helpful for interpreting and labeling topics.
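As an alternative, the tidytext package can extract the same information in a tidy one-row-per-topic-term format, which is convenient for filtering and plotting. A sketch using tidy() with matrix = "beta":

# Tidy view of the top 15 terms per topic
top_beta <- tidy(lda_model, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 15) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
head(top_beta)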

17.9.2 Visualize the LDA Model with LDAvis

Visualizing the LDA model provides a dynamic way to explore topics. The LDAvis package, combined with the topicmodels2LDAvis function below, generates an interactive visualization that displays:

  • The distance between topics (e.g., similar topics will appear closer together).
  • The most influential words within each topic.

17.9.2.1 Install and Load Required Libraries

Ensure that LDAvis and servr are installed, as they are required for creating and serving the visualization.

# Install and load required packages
# install.packages("LDAvis")
library(LDAvis)

# install.packages("servr")
library(servr)

17.9.2.2 Define a Helper Function for LDAvis

To visualize a model created with the topicmodels package, we need a custom function to convert the LDA model to the format required by LDAvis. The following topicmodels2LDAvis function extracts the necessary information from the lda_model.

# Helper function to convert the LDA model to LDAvis format
topicmodels2LDAvis <- function(x, ...){
    post <- topicmodels::posterior(x)
    if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
    mat <- x@wordassignments
    LDAvis::createJSON(
        phi = post[["terms"]], 
        theta = post[["topics"]],
        vocab = colnames(post[["terms"]]),
        doc.length = slam::row_sums(mat, na.rm = TRUE),
        term.frequency = slam::col_sums(mat, na.rm = TRUE)
    )
}

This function extracts:

  • phi (topic-term distribution matrix).

  • theta (document-topic distribution matrix).

  • vocab (list of terms).

  • doc.length (number of terms per document).

  • term.frequency (frequency of each term in the entire corpus).

17.9.2.3 Generate and Display the Visualization

Finally, we use the serVis function from LDAvis to create and serve the visualization locally in a web browser.

# Generate and view the LDA visualization
LDAvis::serVis(topicmodels2LDAvis(lda_model))

This command opens an interactive visualization in your browser, where:

  • Each circle represents a topic.

  • The size of each circle indicates the prevalence of the topic in the corpus.

  • Words within each topic are displayed on the right, ranked by their importance.
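By default, serVis serves the visualization from a temporary directory. To keep a copy for later viewing or sharing, serVis accepts an out.dir argument; "ldavis_output" below is an arbitrary folder name:

# Save the visualization files to a local folder
# LDAvis::serVis(topicmodels2LDAvis(lda_model), out.dir = "ldavis_output")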

17.10 Topic Proportions Over Time

To observe trends over time, we analyze topic proportions across years to identify “hot” (increasing) and “cold” (decreasing) topics.

17.10.1 Step 1: Prepare Data for Temporal Analysis

We first convert tokenized data to a “wide” format to facilitate yearly aggregation. This transformation makes each document a single row, with terms as columns, simplifying further analysis.

# Spread tokens to wide format
library(dplyr)

tokens_year <- tokens %>% 
  group_by(doc_id) %>% 
  mutate(ind = row_number()) %>% 
  tidyr::spread(key = ind, value = lemma)

# Replace NA values with empty strings
tokens_year[is.na(tokens_year)] <- "" 

# Unite terms back into a single column for each document
tokens_year <- tidyr::unite(tokens_year, abstract, -doc_id, sep = " ")
tokens_year$abstract <- trimws(tokens_year$abstract)

# Merge with original data to add additional document information (e.g., year)
tokens_year <- merge(x = tokens_year, y = data, by = "doc_id")

17.10.2 Step 2: Calculate Topic Proportions Over Time

By aggregating topic proportions by year, we can observe how each topic’s importance changes over time.

# Install and load reshape if not already installed
# install.packages('reshape')
library(reshape)

Attaching package: 'reshape'
The following object is masked from 'package:lubridate':

    stamp
The following object is masked from 'package:dplyr':

    rename
The following objects are masked from 'package:tidyr':

    expand, smiths
# Calculate the mean topic proportion per year
topic_proportion_per_year <- aggregate(theta, by = list(year = tokens_year$year), mean)

# Save topic proportions per year to a CSV
write.csv(topic_proportion_per_year, 'topic_proportion_per_year.csv')

This code creates a table where each row represents a year and each column represents the average proportion of a topic for that year.
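Since reshape is loaded above, one way to inspect these yearly proportions visually is a stacked bar chart. This is a sketch: melt() reshapes the table from wide to long form before plotting:

# Reshape to long format and plot yearly topic proportions
viz_df <- melt(topic_proportion_per_year, id.vars = "year")
ggplot(viz_df, aes(x = year, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  labs(x = "Year", y = "Mean topic proportion", fill = "Topic")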

17.11 Hot and Cold Topics Analysis

In this section, we conduct Hot and Cold Topics Analysis using a time-series approach to identify topics that have increased (hot topics) or decreased (cold topics) in prominence over time. This analysis involves calculating topic trends across years, assessing statistical significance, and visualizing the most prominent hot and cold topics.

17.11.1 Step 1: Calculate Mean Topic Proportions by Year

To identify hot and cold topics, we first calculate the mean proportion of each topic per year. This provides a time series of topic prevalence that we can analyze for trends.

library(tm)

# Extract unique years
years <- levels(factor(tokens_year$year))  # Check years, e.g., 2000 ~ 2019

# Set the number of topics
topics_n <- K    # In this case, 10

# Extract the document-topic distribution matrix
theta_2 <- posterior(lda_model)$topics

# Calculate the mean topic proportions by year
theta_mean_by_year_by <- by(theta_2, tokens_year$year, colMeans)
theta_mean_by_year <- do.call("rbind", theta_mean_by_year_by)  # Combine results into a matrix

# Rename columns and create a time series object
colnames(theta_mean_by_year) <- paste(1:topics_n)
theta_mean_by_year_ts <- ts(theta_mean_by_year, start = as.integer(years[1]))
theta_mean_by_year_time <- time(theta_mean_by_year)  # time index (1 to number of years) used as the regression predictor

# Save results to CSV
write.csv(theta_mean_by_year, "theta_mean_by_year.csv")
theta_mean_by_year_time
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
attr(,"tsp")
[1]  1 20  1

17.11.2 Step 2: Identify Hot and Cold Topics with Linear Regression

To determine which topics are trending up (hot) or down (cold) over time, we fit a linear regression model for each topic’s time series. The slope of each model indicates the direction and magnitude of change.

# Perform linear regression on topic proportions over time
theta_mean_lm <- apply(theta_mean_by_year, 2, function(x) lm(x ~ theta_mean_by_year_time))

# Extract regression coefficients and p-values
theta_mean_lm_coef <- lapply(theta_mean_lm, function(x) coef(summary(x)))

# Extract significance (p-values) and slopes
theta_mean_lm_coef_sign <- sapply(theta_mean_lm_coef, '[', "theta_mean_by_year_time", "Pr(>|t|)")
theta_mean_lm_coef_slope <- sapply(theta_mean_lm_coef, '[', "theta_mean_by_year_time", "Estimate")

# Separate positive and negative slopes for analysis
theta_mean_lm_coef_slope_pos <- theta_mean_lm_coef_slope[theta_mean_lm_coef_slope >= 0]
theta_mean_lm_coef_slope_neg <- theta_mean_lm_coef_slope[theta_mean_lm_coef_slope < 0]

# Save positive and negative slopes to CSV
write.csv(theta_mean_lm_coef_slope_pos, 'theta_mean_lm_coef_slope_positive.csv')
write.csv(theta_mean_lm_coef_slope_neg, 'theta_mean_lm_coef_slope_negative.csv')

17.11.3 Step 3: Evaluate Significance Levels

We test the statistical significance of each trend using various p-value thresholds to determine which trends are likely meaningful.

# Set p-value thresholds
p_level <- c(0.05, 0.01, 0.001, 0.0001)

# Identify topics with significant slopes at each p-level
significance_total <- sapply(p_level, function(x) (theta_mean_lm_coef_sign[theta_mean_lm_coef_sign < x]))
significance_neg <- sapply(1:length(p_level), function(x) intersect(names(theta_mean_lm_coef_slope_neg), names(significance_total[[x]])))
significance_pos <- sapply(1:length(p_level), function(x) intersect(names(theta_mean_lm_coef_slope_pos), names(significance_total[[x]])))

17.11.4 Step 4: Visualize Hot and Cold Topics

Using the lattice package, we can plot the time series of hot and cold topics to observe their changes over time.

library(lattice)

# Select top hot and cold topics for visualization
topics_cold <- as.numeric(names(sort(theta_mean_lm_coef_slope[significance_neg[[2]]], decreasing = FALSE)))
topics_hot <- as.numeric(names(sort(theta_mean_lm_coef_slope[significance_pos[[2]]], decreasing = TRUE)))

# Combine selected hot and cold topics into a single time series matrix
cold_and_hot_ts <- cbind(theta_mean_by_year_ts[, topics_cold[1:2]],
                         theta_mean_by_year_ts[, topics_hot[1:3]], deparse.level = 0)  # Adjust numbers as needed

# Rename columns to indicate topic numbers
colnames(cold_and_hot_ts) <- as.character(c(topics_cold[1:2], topics_hot[1:3]))

# Plot hot and cold topics over time
print(xyplot(cold_and_hot_ts, layout = c(2, 1), 
             screens = c(rep("Cold topics", 2), rep("Hot topics", 3)),  # Adjust as needed
             superpose = TRUE, ylim = c(0.05, 0.2), 
             ylab = expression(paste("Mean ", theta)), 
             xlab = "Year", type = c("l", "g"), 
             auto.key = list(space = "right"),  
             scales = list(x = list(alternating = FALSE))
))

This plot displays the time trends for the top hot and cold topics. Hot topics are increasing over time, while cold topics are decreasing, providing a visual representation of shifting thematic focus in the corpus.

17.11.5 Step 5: Summarize Time-Series Analysis Results

To further analyze, extract regression summaries for each topic’s time series.

# Extract detailed regression summaries for all topics
reg_result <- lapply(theta_mean_lm, function(x) summary(x))
reg_result
$`1`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.013061 -0.008160 -0.004247  0.006060  0.025992 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.1067095  0.0053069  20.108 8.78e-14 ***
theta_mean_by_year_time -0.0004004  0.0004430  -0.904    0.378    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01142 on 18 degrees of freedom
Multiple R-squared:  0.04342,   Adjusted R-squared:  -0.009723 
F-statistic: 0.817 on 1 and 18 DF,  p-value: 0.378


$`2`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.015217 -0.007411 -0.002695  0.008784  0.014994 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.1201246  0.0046455  25.858  1.1e-15 ***
theta_mean_by_year_time -0.0016232  0.0003878  -4.186 0.000556 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01 on 18 degrees of freedom
Multiple R-squared:  0.4932,    Adjusted R-squared:  0.4651 
F-statistic: 17.52 on 1 and 18 DF,  p-value: 0.0005558


$`3`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.015343 -0.005375 -0.002519  0.004338  0.019769 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.0716781  0.0041228  17.386 1.07e-12 ***
theta_mean_by_year_time 0.0022019  0.0003442   6.398 5.04e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.008875 on 18 degrees of freedom
Multiple R-squared:  0.6946,    Adjusted R-squared:  0.6776 
F-statistic: 40.93 on 1 and 18 DF,  p-value: 5.042e-06


$`4`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.015483 -0.006485 -0.001401  0.008384  0.012499 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.0842500  0.0042190  19.969 9.89e-14 ***
theta_mean_by_year_time 0.0015020  0.0003522   4.265 0.000466 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.009082 on 18 degrees of freedom
Multiple R-squared:  0.5026,    Adjusted R-squared:  0.475 
F-statistic: 18.19 on 1 and 18 DF,  p-value: 0.0004661


$`5`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0094718 -0.0032604 -0.0003354  0.0039621  0.0100595 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.0952110  0.0026812  35.511   <2e-16 ***
theta_mean_by_year_time 0.0003921  0.0002238   1.752   0.0968 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.005772 on 18 degrees of freedom
Multiple R-squared:  0.1457,    Adjusted R-squared:  0.09823 
F-statistic:  3.07 on 1 and 18 DF,  p-value: 0.09678


$`6`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0170167 -0.0056107 -0.0003921  0.0017193  0.0190515 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.1281042  0.0037994  33.717  < 2e-16 ***
theta_mean_by_year_time -0.0024305  0.0003172  -7.663  4.5e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.008179 on 18 degrees of freedom
Multiple R-squared:  0.7654,    Adjusted R-squared:  0.7524 
F-statistic: 58.73 on 1 and 18 DF,  p-value: 4.496e-07


$`7`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0124647 -0.0048475 -0.0009775  0.0037884  0.0177997 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.1003627  0.0036462  27.525 3.66e-16 ***
theta_mean_by_year_time -0.0003204  0.0003044  -1.053    0.306    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.007849 on 18 degrees of freedom
Multiple R-squared:  0.05798,   Adjusted R-squared:  0.005644 
F-statistic: 1.108 on 1 and 18 DF,  p-value: 0.3065


$`8`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.015223 -0.005032  0.001569  0.004744  0.010206 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.0895604  0.0035924   24.93 2.08e-15 ***
theta_mean_by_year_time 0.0008726  0.0002999    2.91  0.00934 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.007733 on 18 degrees of freedom
Multiple R-squared:  0.3199,    Adjusted R-squared:  0.2821 
F-statistic: 8.467 on 1 and 18 DF,  p-value: 0.009345


$`9`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.019007 -0.003908  0.001191  0.004708  0.012692 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             0.089360   0.003606  24.782 2.31e-15 ***
theta_mean_by_year_time 0.001012   0.000301   3.361  0.00348 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.007762 on 18 degrees of freedom
Multiple R-squared:  0.3856,    Adjusted R-squared:  0.3515 
F-statistic:  11.3 on 1 and 18 DF,  p-value: 0.003476


$`10`

Call:
lm(formula = x ~ theta_mean_by_year_time)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.016030 -0.002640  0.002092  0.003406  0.016762 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.1146395  0.0040032  28.637  < 2e-16 ***
theta_mean_by_year_time -0.0012061  0.0003342  -3.609  0.00201 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.008618 on 18 degrees of freedom
Multiple R-squared:  0.4198,    Adjusted R-squared:  0.3876 
F-statistic: 13.03 on 1 and 18 DF,  p-value: 0.002006

The reg_result object contains detailed regression summaries for each topic, including coefficients, p-values, and R-squared values.
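To condense these summaries into a single overview, the slopes and p-values extracted in Step 2 can be combined into one table. A minimal sketch using those objects:

# Build a compact table of trend direction and significance per topic
trend_summary <- data.frame(
  topic = names(theta_mean_lm_coef_slope),
  slope = theta_mean_lm_coef_slope,
  p_value = theta_mean_lm_coef_sign
)
trend_summary[order(trend_summary$slope, decreasing = TRUE), ]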

Summary

This hot and cold topics analysis highlights the changing prominence of topics over time:

  • Hot Topics: Topics with significantly positive trends.

  • Cold Topics: Topics with significantly negative trends.

These trends provide insights into the evolution of topics, revealing which areas of discussion have gained or lost interest. This analysis is especially valuable for diachronic studies and can uncover shifts in focus across different time periods in the dataset.