21 Assignment #03: Topic Modeling

21.1 Instructions

In this assignment, you’ll perform a topic modeling analysis on a selected corpus to explore and interpret themes. Follow the steps below to complete your work, and submit your results as instructed. Provide clear explanations and visualizations to support your analysis.

21.2 Steps for the Assignment

Step 1. Select Your Corpus (5 points)

Corpus Selection:

Choose a text corpus that interests you. Potential sources include:
- Project Gutenberg: a collection of public domain books.
- Kaggle Datasets: various datasets that may include text.
- Other text-based datasets are also acceptable.

Corpus Details:

Provide a brief description of your chosen corpus (1-2 sentences).
Mention the source and why you selected this corpus (1-2 sentences).

Step 2. Conduct Topic Modeling Analysis (15 points)

Preprocessing (5 points):

Prepare the text for analysis by following standard preprocessing steps:
- Tokenization (splitting text into words or phrases).
- Stop word removal.
- Lemmatization or stemming.
- Vectorization (transforming text into numerical form, if needed).
Describe these preprocessing steps in your final submission.
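
The preprocessing steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a required implementation: the stop word list, the `crude_stem` suffix stripper, and all function names are assumptions made for this example. In practice you would likely use a fuller stop word list and a real stemmer or lemmatizer from a library such as NLTK or spaCy.

```python
import re
from collections import Counter

# Illustrative stop word list (an assumption; real projects usually take a
# fuller list from a library such as NLTK or spaCy).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "on", "was", "were"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


def vectorize(docs_tokens):
    """Build a sorted vocabulary and a document-term count matrix."""
    vocab = sorted({t for doc in docs_tokens for t in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for doc in docs_tokens:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


docs = ["The cats are chasing the mice.", "A cat chased a mouse in the garden."]
tokens = [preprocess(d) for d in docs]
vocab, matrix = vectorize(tokens)
print(vocab)
print(matrix)
```

Note that the crude stemmer maps "chasing" and "chased" to the same stem but leaves "mice" and "mouse" distinct, which is one reason a proper lemmatizer is often preferred.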

Topic Modeling (5 points):

Implement a topic modeling technique, such as Latent Dirichlet Allocation (LDA), to identify topics within your corpus.
Provide a summary of each identified topic, including representative keywords or phrases.
Include a visualization of the identified topics.
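
To make the mechanics of LDA concrete, here is a small collapsed Gibbs sampler written in plain Python. It is a teaching sketch under stated assumptions: the function names (`lda_gibbs`, `top_words`), the toy documents, and the hyperparameter defaults (`alpha`, `beta`, `iterations`) are all chosen for illustration. For the assignment itself, an established implementation such as scikit-learn's `LatentDirichletAllocation` or gensim's `LdaModel` is likely a better choice.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on already-tokenized documents.

    Returns (vocab, doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=3):
    """Return the n highest-count words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Toy corpus (hypothetical), already preprocessed into token lists.
docs = [
    ["cat", "dog", "pet", "cat"],
    ["dog", "pet", "vet", "dog"],
    ["stock", "market", "trade", "stock"],
    ["market", "trade", "bank", "stock"],
]
vocab, doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words(vocab, topic_word)):
    print(f"Topic {k}: {', '.join(words)}")
```

The printed keyword lists are the kind of per-topic summary the report asks for. Results depend on the random seed and the number of topics, so it is worth trying a few settings and reporting which you used.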

Topic Interpretation and Analysis (5 points):

Interpret the themes or patterns within the identified topics. Explain any notable or recurring themes and how they relate to the overall structure of your corpus.
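
One way to relate topics to the structure of the corpus is to find each document's dominant topic and count how often each topic dominates. The sketch below assumes you already have per-document topic counts from your model; the `counts` values, the function names, and the `alpha` smoothing value are hypothetical and only illustrate the calculation.

```python
from collections import Counter


def topic_proportions(doc_topic_counts, alpha=0.1):
    """Convert per-document topic counts into smoothed proportions."""
    proportions = []
    for counts in doc_topic_counts:
        total = sum(counts) + alpha * len(counts)
        proportions.append([(c + alpha) / total for c in counts])
    return proportions


def dominant_topics(proportions):
    """Return the index of the highest-proportion topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in proportions]


# Hypothetical counts: 4 documents, 2 topics.
counts = [[4, 0], [3, 1], [0, 4], [1, 3]]
props = topic_proportions(counts)
dominant = dominant_topics(props)
prevalence = Counter(dominant)

for d, (topic, row) in enumerate(zip(dominant, props)):
    print(f"Document {d}: dominant topic {topic} ({row[topic]:.0%})")
print("Documents per dominant topic:", dict(prevalence))
```

A summary like this can support statements such as "half of the documents are mainly about topic 0", which you can then connect to the keywords for that topic.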

21.3 Submission Guidelines

Report Format: Submit a PDF (or MS Word) report (2 pages, approximately 500 words), including screenshots of visualizations.
Code: Submit your Python or R code.

Due Date: Submit by November 28, 2024, via the course portal.

Total: 20 points