19  Assignment #01


19.1 Assignment 01: Web Scraping and YouTube Subtitle Extraction

19.1.1 Total Marks: 20

19.1.2 Objective:

In this assignment, you will use Python to perform web scraping and to extract subtitles from a YouTube playlist. You will scrape text from multiple pages of a website, preprocess the text, and save it in an Excel file. You will also extract and preprocess subtitles from videos in a YouTube playlist and save the results in Excel format.

19.1.3 Assignment Instructions:

19.2 Task 1: Web Scraping (10 Marks)

19.2.1 Step 1: Select a Website for Crawling (2 Marks)

  1. Select a website from which you will scrape data. The website must have multiple pages that can be crawled.
  2. Explain why you selected this website (e.g., relevance, data availability, structure).

19.2.2 Step 2: Crawl Data from More than 2 Pages (4 Marks)

Write a Python script using requests, BeautifulSoup, or other relevant libraries to scrape data from more than two pages of the selected website. Ensure that your script can navigate between pages (pagination).
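
A minimal sketch of the pagination loop, assuming a hypothetical site that marks its next-page link with rel="next" and keeps its text in <p> tags. The names PageParser, crawl, and fetch are illustrative, not part of the assignment. The standard library's html.parser stands in for BeautifulSoup here so that the page-to-page logic stays visible, and the downloader is passed in as an argument.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects paragraph text and the href of a rel="next" link from one page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.next_href = None
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._in_p = True
            self._buffer = []
        elif tag == "a" and "next" in (attrs.get("rel") or "").split():
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def crawl(start_url, fetch, max_pages=3):
    """Follow "next" links from start_url, returning one record per paragraph."""
    records, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(fetch(url))  # fetch(url) must return the page HTML as a string
        records.extend({"url": url, "text": p} for p in parser.paragraphs if p)
        # Resolve a relative "next" link against the current page URL.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return records
```

With requests installed, fetch could be `lambda url: requests.get(url, timeout=10).text`. Many sites paginate with a URL pattern such as `?page=N` instead, in which case the loop would build each page URL directly rather than follow links.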

19.2.3 Step 3: Preprocess the Scraped Text and Save to Excel (xlsx) (4 Marks)

  1. Preprocess the scraped text data (e.g., remove unnecessary characters, clean formatting).
  2. Save the processed text in an Excel file (.xlsx format) using pandas or openpyxl.
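
The two steps above could be sketched as follows. This is one possible cleaning routine rather than a required one: the helper names clean_text and save_rows_to_excel and the choice of which characters count as "unnecessary" are assumptions. The control characters removed here are ones that spreadsheet writers commonly reject.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    # NFKC turns look-alike characters (e.g. non-breaking spaces) into plain ones.
    text = unicodedata.normalize("NFKC", raw)
    # Remove control characters other than tab, newline, and carriage return.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Tabs, newlines, and repeated spaces become a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def save_rows_to_excel(rows, path):
    """Write a list of dicts to an .xlsx file, one dict per row."""
    # Imported here so clean_text stays usable without pandas installed.
    import pandas as pd

    # pandas writes .xlsx files through openpyxl, which must also be installed.
    pd.DataFrame(rows).to_excel(path, index=False)
```

A typical call would clean each scraped record first, for example `save_rows_to_excel([{"url": u, "text": clean_text(t)} for u, t in scraped], "scraped.xlsx")`, where scraped is a hypothetical list of (url, text) pairs.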

19.3 Task 2: YouTube Playlist Subtitle Extraction (10 Marks)

19.3.1 Step 1: Select a YouTube Playlist (2 Marks)

  1. Select a YouTube playlist that contains videos with subtitles.
  2. Explain why you chose this playlist (e.g., topic relevance, video variety).

19.3.2 Step 2: Extract Subtitles from More than 2 Videos (4 Marks)

  1. Write a Python script using pytube, youtube-transcript-api, or other relevant libraries to extract subtitles from more than two videos in the playlist.
  2. Print or log the extracted subtitles.
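
A minimal sketch of the extraction loop, under stated assumptions. The subtitle downloader is passed in as get_transcript, a callable that takes a video ID and returns a list of segments; collect_transcripts and video_id_from_url are illustrative names. Older releases of youtube-transcript-api exposed such a callable as `YouTubeTranscriptApi.get_transcript`, returning dicts with text, start, and duration keys, but the interface has changed between releases, so check the version you install. The list of video URLs could come from pytube's `Playlist(url).video_urls`.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Return the video ID from a watch URL or a youtu.be short link."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query)["v"][0]


def collect_transcripts(video_urls, get_transcript):
    """Fetch subtitles for each video; record failures instead of stopping."""
    results = {}
    for url in video_urls:
        video_id = video_id_from_url(url)
        try:
            results[video_id] = get_transcript(video_id)
        except Exception as exc:  # e.g. subtitles are disabled for this video
            print(f"Skipping {video_id}: {exc}")
            results[video_id] = []
        else:
            print(f"{video_id}: {len(results[video_id])} subtitle segments")
    return results
```

Catching the error per video keeps one video without subtitles from aborting the whole playlist, and the printed lines cover the "print or log" requirement in a basic form.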

19.3.3 Step 3: Preprocess Subtitles and Save to Excel (xlsx) (4 Marks)

  1. Preprocess the extracted subtitles (e.g., remove unnecessary characters, clean up formatting, adjust timestamps).
  2. Save the cleaned subtitles in an Excel file (.xlsx format) using pandas or openpyxl.
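
One way the preprocessing could look, assuming each subtitle segment is a dict with a text string and a start time in seconds (an assumed input shape, not something the assignment fixes). The helper names format_timestamp and clean_segments are illustrative. The sketch removes bracketed cues such as [Music], collapses line breaks, and converts start times to HH:MM:SS.

```python
import re


def format_timestamp(seconds):
    """Convert a start time in seconds to an HH:MM:SS string."""
    total = int(seconds)  # fractional seconds are truncated
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def clean_segments(segments):
    """Clean subtitle text and return rows ready for a spreadsheet."""
    rows = []
    for seg in segments:
        text = re.sub(r"\[[^\]]*\]", " ", seg["text"])  # drop cues like [Music]
        text = re.sub(r"\s+", " ", text).strip()  # collapse line breaks and spaces
        if text:  # skip segments that held only a cue
            rows.append({"start": format_timestamp(seg["start"]), "text": text})
    return rows
```

The resulting rows can be written with `pandas.DataFrame(rows).to_excel("subtitles.xlsx", index=False)`, which needs openpyxl installed for the .xlsx format; the file name is only an example.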

19.3.4 Submission Guidelines:

  1. Task 1:
    • Submit the Jupyter notebook (.ipynb) containing the Python code used for web scraping and text processing. Include a markdown cell explaining why you chose the website.
    • Submit the Excel file containing the processed text data.
  2. Task 2:
    • Submit the Jupyter notebook (.ipynb) containing the Python code used for YouTube subtitle extraction and processing. Include a markdown cell explaining why you chose the playlist.
    • Submit the Excel file containing the processed subtitles.

19.3.5 Grading Criteria:

Task                                                Marks
Task 1: Website Selection & Explanation             2
Task 1: Crawl Data from Multiple Pages              4
Task 1: Preprocess and Save to Excel                4
Task 2: YouTube Playlist Selection & Explanation    2
Task 2: Extract Subtitles from Multiple Videos      4
Task 2: Preprocess Subtitles and Save to Excel      4
Total                                               20

19.3.6 Submission Deadline:

  • Please submit your assignment by Oct. 10, 2024 via the course portal.