19 Assignment #01
19.1 Assignment 01: Web Scraping and YouTube Subtitle Extraction
19.1.1 Total Marks: 20
19.1.2 Objective:
In this assignment, you will use Python to scrape a website and to extract subtitles from a YouTube playlist. You will scrape text from multiple pages of a website, preprocess it, and save it in an Excel file. You will then extract and preprocess subtitles from videos in a YouTube playlist and save the results in Excel format as well.
19.1.3 Assignment Instructions:
19.2 Task 1: Web Scraping (10 Marks)
19.2.1 Step 1: Select a Website for Crawling (2 Marks)
Select a website from which you will scrape data. The website must have multiple pages that can be crawled.
Explain why you selected this website (e.g., relevance, data availability, structure).
19.2.2 Step 2: Crawl Data from More than 2 Pages (4 Marks)
Write a Python script using requests, BeautifulSoup, or other relevant libraries to scrape data from at least two pages of the selected website. Ensure that your script can navigate between pages (pagination).
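The pagination requirement can be sketched as a loop that follows a "next page" link until none is left or a page limit is reached. This is only a minimal sketch under stated assumptions: the CSS selectors (`p` for the text, `li.next a` for the next-page link) are hypothetical and must be adapted to the site you choose, and the function names `fetch_html`, `parse_page`, and `crawl` are illustrative. The third-party imports sit inside the functions so the crawl loop itself can be tried with stand-in functions.

```python
from urllib.parse import urljoin


def fetch_html(url):
    """Download one page and return its HTML text."""
    import requests  # third-party: pip install requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return response.text


def parse_page(html, base_url):
    """Return (list of paragraph texts, absolute URL of the next page or None)."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    texts = [p.get_text(" ", strip=True) for p in soup.select("p")]
    # Assumption: the site marks its next-page link as <li class="next"><a href="...">
    next_link = soup.select_one("li.next a")
    next_url = urljoin(base_url, next_link["href"]) if next_link else None
    return texts, next_url


def crawl(start_url, max_pages=3, fetch=fetch_html, parse=parse_page):
    """Follow next-page links from start_url, collecting one row per text item."""
    rows, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        texts, next_url = parse(fetch(url), url)
        rows.extend({"page_url": url, "text": t} for t in texts)
        url = next_url
    return rows
```

Calling `crawl(start_url)` with the first page of your chosen site would return a list of dictionaries, one per scraped paragraph, each recording the page it came from. The `seen` set guards against a site whose last page links back to an earlier one.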
19.2.3 Step 3: Preprocess the Scraped Text and Save to Excel (xlsx) (4 Marks)
- Preprocess the scraped text data (e.g., remove unnecessary characters, clean formatting).
- Save the processed text in an Excel file (`.xlsx` format) using `pandas` or `openpyxl`.
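One possible shape for the preprocessing and saving step is shown below. It is a sketch, not a required approach: it assumes the scraped data is a list of dictionaries with a `text` key, and the cleaning rules (Unicode normalization, removal of control characters, whitespace collapsing, de-duplication) are examples of "unnecessary characters" that you may need to extend for your site. The function names are illustrative.

```python
import re
import unicodedata


def clean_text(text):
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    # Remove control characters other than tab, newline, and carriage return
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()


def save_rows_to_xlsx(rows, path):
    """Clean the 'text' column and write the rows to an .xlsx file."""
    import pandas as pd  # third-party; writing .xlsx also needs openpyxl installed

    df = pd.DataFrame(rows)
    df["text"] = df["text"].map(clean_text)
    # Drop rows that became empty after cleaning, then exact duplicates
    df = df[df["text"] != ""].drop_duplicates(subset="text")
    df.to_excel(path, index=False)
```

Keeping `clean_text` separate from the file-writing function makes it easy to check the cleaning rules on a few sample strings before producing the final Excel file.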
19.3 Task 2: YouTube Playlist Subtitle Extraction (10 Marks)
19.3.1 Step 1: Select a YouTube Playlist (2 Marks)
- Select a YouTube playlist that contains videos with subtitles.
- Explain why you chose this playlist (e.g., topic relevance, video variety).
19.3.2 Step 2: Extract Subtitles from More than 2 Videos (4 Marks)
- Write a Python script using `pytube`, `youtube-transcript-api`, or other relevant libraries to extract subtitles from at least two videos in the playlist.
- Print or log the extracted subtitles.
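The extraction step might look like the following sketch. Several parts are assumptions: `fetch_transcript` uses the static `get_transcript` call from `youtube-transcript-api` releases before 1.0 (newer releases use an instance method, `YouTubeTranscriptApi().fetch(video_id)`), the subtitle language is assumed to be English, and the helper names are illustrative. Both libraries depend on YouTube's current page structure, so check the version you have installed.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Extract the video ID from a watch URL or a youtu.be short URL."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query)["v"][0]


def playlist_video_urls(playlist_url):
    """List the watch URLs of every video in a playlist."""
    from pytube import Playlist  # third-party: pip install pytube

    return list(Playlist(playlist_url).video_urls)


def fetch_transcript(video_id):
    """Return a list of {'text', 'start', 'duration'} subtitle segments."""
    # Assumption: youtube-transcript-api < 1.0 (static get_transcript method)
    from youtube_transcript_api import YouTubeTranscriptApi

    return YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])


def extract_subtitles(video_urls, fetch=fetch_transcript, limit=3):
    """Collect subtitle segments for up to `limit` videos, one row per segment."""
    rows = []
    for url in video_urls[:limit]:
        video_id = video_id_from_url(url)
        try:
            segments = fetch(video_id)
        except Exception as exc:  # e.g. subtitles disabled or video unavailable
            print(f"Skipping {video_id}: {exc}")
            continue
        print(f"{video_id}: {len(segments)} subtitle segments")
        for seg in segments:
            rows.append({"video_id": video_id, "start": seg["start"],
                         "duration": seg["duration"], "text": seg["text"]})
    return rows
```

A typical call would be `extract_subtitles(playlist_video_urls(playlist_url))`, where `playlist_url` is the playlist you selected. Videos without subtitles are skipped with a printed message rather than stopping the whole run.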
19.3.3 Step 3: Preprocess Subtitles and Save to Excel (xlsx) (4 Marks)
- Preprocess the extracted subtitles (e.g., remove unnecessary characters, clean formatting, timestamp adjustments).
- Save the cleaned subtitles in an Excel file (`.xlsx` format) using `pandas` or `openpyxl`.
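As an illustration of subtitle preprocessing, the sketch below removes bracketed cues such as `[Music]`, collapses whitespace, and converts start times from seconds to `HH:MM:SS`. It assumes each subtitle segment is a dictionary with `video_id`, `start` (in seconds), and `text` keys. Those key names, the function names, and the choice of cleaning rules are assumptions for the example, not a prescribed format.

```python
import re


def clean_subtitle(text):
    """Remove bracketed cues like [Music] and collapse whitespace."""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def format_timestamp(seconds):
    """Convert a start time in seconds to an HH:MM:SS string."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def prepare_rows(rows):
    """Clean each segment and drop those that are empty after cleaning."""
    cleaned = []
    for row in rows:
        text = clean_subtitle(row["text"])
        if text:
            cleaned.append({"video_id": row["video_id"],
                            "timestamp": format_timestamp(row["start"]),
                            "text": text})
    return cleaned


def save_subtitles(rows, path):
    """Write the cleaned subtitle rows to an .xlsx file."""
    import pandas as pd  # third-party; writing .xlsx also needs openpyxl installed

    pd.DataFrame(prepare_rows(rows)).to_excel(path, index=False)
```

For example, a segment starting at 3725.9 seconds with the text `"[Music]\nhello   world"` becomes a row with timestamp `01:02:05` and text `hello world`.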
19.3.4 Submission Guidelines:
- Task 1:
  - Submit the Python script used for web scraping and text processing. The `.ipynb` file should include an explanation of why you chose the website using markdown.
  - Submit the Excel file containing the processed text data.
- Task 2:
  - Submit the Python script used for YouTube subtitle extraction and processing. The `.ipynb` file should include an explanation of why you chose the playlist using markdown.
  - Submit the Excel file containing the processed subtitles.
19.3.5 Grading Criteria:
| Task | Marks |
|---|---|
| Task 1: Website Selection & Explanation | 2 |
| Task 1: Crawl Data from Multiple Pages | 4 |
| Task 1: Preprocess and Save to Excel | 4 |
| Task 2: YouTube Playlist Selection & Explanation | 2 |
| Task 2: Extract Subtitles from Multiple Videos | 4 |
| Task 2: Preprocess Subtitles and Save to Excel | 4 |
| Total | 20 |
19.3.6 Submission Deadline:
- Please submit your assignment by Oct. 10, 2024 via the course portal.