19 Assignment #01
19.1 Assignment 01: Web Scraping and YouTube Subtitle Extraction
19.1.1 Total Marks: 20
19.1.2 Objective:
In this assignment, you will use Python to perform web scraping and extract subtitles from a YouTube playlist. You will scrape text from multiple pages of a website, process the text, and save it in an Excel file. Additionally, you will extract and process subtitles from videos in a YouTube playlist, saving the results in Excel format.
19.1.3 Assignment Instructions:
19.2 Task 1: Web Scraping (10 Marks)
19.2.1 Step 1: Select a Website for Crawling (2 Marks)
Select a website from which you will scrape data. The website must have multiple pages that can be crawled.
Explain why you selected this website (e.g., relevance, data availability, structure).
19.2.2 Step 2: Crawl Data from More than 2 Pages (4 Marks)
Write a Python script using `requests`, `BeautifulSoup`, or other relevant libraries to scrape data from at least two pages of the selected website. Ensure that your script can navigate between pages (pagination).
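As a starting point, the pagination loop can be sketched as follows. This is a minimal sketch, not a reference solution: it assumes the chosen site wraps its text in `<p>` elements and marks the link to the following page with `rel="next"`, and the names `fetch_html`, `parse_page`, and `crawl` are hypothetical. Adjust the selectors to the structure of your own website.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download one page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_page(html, base_url):
    """Return (paragraph texts, absolute URL of the next page or None)."""
    soup = BeautifulSoup(html, "html.parser")
    texts = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    # Assumption: the site marks its pagination link as <a rel="next" href="...">.
    next_link = soup.find("a", rel="next")
    href = next_link.get("href") if next_link else None
    next_url = urljoin(base_url, href) if href else None
    return texts, next_url


def crawl(start_url, max_pages=3, fetch=fetch_html):
    """Follow 'next' links from start_url and collect one row per paragraph."""
    rows = []
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        texts, next_url = parse_page(fetch(url), url)
        rows.extend({"url": url, "text": text} for text in texts)
        url = next_url
    return rows
```

Passing the downloader in as `fetch` keeps the loop easy to try on saved HTML before running it against the live site. The `seen` set and `max_pages` limit stop the crawl if the site links back to an earlier page.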
19.2.3 Step 3: Preprocess the Scraped Text and Save to Excel (xlsx) (4 Marks)
- Preprocess the scraped text data (e.g., remove unnecessary characters, clean formatting).
- Save the processed text in an Excel file (`.xlsx` format) using `pandas` or `openpyxl`.
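One possible shape for this step, using `pandas`, is sketched below. The cleaning rules (Unicode normalization, dropping control characters, collapsing whitespace) are only examples of "unnecessary characters", and the `url`/`text` column names and the function names are assumptions made for illustration. Writing `.xlsx` through `pandas` also requires `openpyxl` to be installed.

```python
import re
import unicodedata

import pandas as pd


def clean_text(text):
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


def to_frame(rows):
    """Build a DataFrame of cleaned, non-empty, de-duplicated rows.

    Assumption: each row is a dict with 'url' and 'text' keys.
    """
    df = pd.DataFrame(rows, columns=["url", "text"])
    df["text"] = df["text"].map(clean_text)
    df = df[df["text"] != ""].drop_duplicates().reset_index(drop=True)
    return df


def save_excel(df, path="scraped_text.xlsx"):
    """Write the DataFrame to an .xlsx file without the index column."""
    df.to_excel(path, index=False)
```

For example, `save_excel(to_frame(rows))` would write the cleaned rows to `scraped_text.xlsx` in the working directory.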
19.3 Task 2: YouTube Playlist Subtitle Extraction (10 Marks)
19.3.1 Step 1: Select a YouTube Playlist (2 Marks)
- Select a YouTube playlist that contains videos with subtitles.
- Explain why you chose this playlist (e.g., topic relevance, video variety).
19.3.2 Step 2: Extract Subtitles from More than 2 Videos (4 Marks)
- Write a Python script using `pytube`, `youtube-transcript-api`, or other relevant libraries to extract subtitles from at least two videos in the playlist.
- Print or log the extracted subtitles.
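A rough sketch of the extraction loop is shown below. The function names are hypothetical, and the default fetcher assumes a `youtube-transcript-api` release that still provides the static `YouTubeTranscriptApi.get_transcript()` call, which returns a list of dicts with `text`, `start`, and `duration` keys. Newer releases use an instance-based interface instead, so check the documentation for the version you install.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Return the video ID from a watch or youtu.be URL, or None."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/") or None
    return parse_qs(parsed.query).get("v", [None])[0]


def fetch_segments(video_id):
    """Fetch raw caption segments for one video.

    Assumption: the installed youtube-transcript-api still offers the
    static get_transcript() call; adapt this function if yours does not.
    """
    from youtube_transcript_api import YouTubeTranscriptApi  # optional dependency

    return YouTubeTranscriptApi.get_transcript(video_id)


def extract_subtitles(video_urls, fetch=fetch_segments):
    """Map each video ID to its list of caption segments."""
    subtitles = {}
    for url in video_urls:
        video_id = video_id_from_url(url)
        if video_id is None:
            continue
        try:
            subtitles[video_id] = fetch(video_id)
        except Exception as exc:  # e.g. subtitles are disabled for this video
            print(f"Skipping {video_id}: {exc}")
    return subtitles
```

With `pytube` installed, `Playlist(playlist_url).video_urls` is one way to obtain the list of video URLs to pass in. Printing `subtitles[video_id][:5]` for each video covers the "print or log" requirement without flooding the notebook output.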
19.3.3 Step 3: Preprocess Subtitles and Save to Excel (xlsx) (4 Marks)
- Preprocess the extracted subtitles (e.g., remove unnecessary characters, clean formatting, timestamp adjustments).
- Save the cleaned subtitles in an Excel file (`.xlsx` format) using `pandas` or `openpyxl`.
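The sketch below illustrates one way to approach this step with `openpyxl`. It assumes each subtitle segment is a dict with `text` and `start` (seconds) keys, treats bracketed cues such as `[Music]` as the "unnecessary characters", and converts start times to `HH:MM:SS` as an example of a timestamp adjustment. The function names and column layout are hypothetical.

```python
import re

from openpyxl import Workbook


def format_timestamp(seconds):
    """Convert a start time in seconds to an HH:MM:SS string."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def clean_caption(text):
    """Remove bracketed cues such as [Music] and collapse whitespace."""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def subtitle_rows(video_id, segments):
    """Return (video_id, start, text) tuples for all non-empty segments."""
    rows = []
    for segment in segments:
        text = clean_caption(segment["text"])
        if text:
            rows.append((video_id, format_timestamp(segment["start"]), text))
    return rows


def save_rows(rows, path="subtitles.xlsx"):
    """Write a header row followed by the subtitle rows to an .xlsx file."""
    workbook = Workbook()
    sheet = workbook.active
    sheet.title = "subtitles"
    sheet.append(["video_id", "start", "text"])
    for row in rows:
        sheet.append(row)
    workbook.save(path)
```

Keeping one row per segment preserves the timestamps. If you prefer one row per video, join the cleaned texts with a space before saving.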
19.3.4 Submission Guidelines:
- Task 1:
  - Submit the Python script used for web scraping and text processing. The `.ipynb` file should include an explanation of why you chose the website using markdown.
  - Submit the Excel file containing the processed text data.
- Task 2:
  - Submit the Python script used for YouTube subtitle extraction and processing. The `.ipynb` file should include an explanation of why you chose the playlist using markdown.
  - Submit the Excel file containing the processed subtitles.
19.3.5 Grading Criteria:
Task | Marks |
---|---|
Task 1: Website Selection & Explanation | 2 |
Task 1: Crawl Data from Multiple Pages | 4 |
Task 1: Preprocess and Save to Excel | 4 |
Task 2: YouTube Playlist Selection & Explanation | 2 |
Task 2: Extract Subtitles from Multiple Videos | 4 |
Task 2: Preprocess Subtitles and Save to Excel | 4 |
Total | 20 |
19.3.6 Submission Deadline:
- Please submit your assignment by Oct. 10, 2024 via the course portal.