Web Scraping: Efficient Data Collection from the Web

May 6, 2024 / August 20, 2024

— by Vivian Qiu

From March to May 2024, HKU Libraries organised the Research Data Academy (RDA), a series of training sessions designed to strengthen participants’ data literacy skills across multiple areas of the research data lifecycle. In the coming months, several posts based on the RDA, covering key topics in the research lifecycle, will be developed and shared on this blog.

The Research Data Lifecycle

The research data lifecycle includes everything from planning how data will be collected, to publication, to long-term data preservation, to possible reuse of the data (National Library of Medicine, n.d.). It consists of data creation and deposit, managing active data, data repositories and archives, and data catalogues and registries (JISC, 2021). Research data management is an important practice for researchers: it aims to keep data well organised and documented, and easy to share with other researchers and the public. In principle, data management activities should cover every stage of the research data lifecycle.

Figure 1 – Research data lifecycle (JISC, 2021)

Data Collection

Data collection lays the foundation for data analysis. Researchers may find it difficult to collect large volumes of data from the Web efficiently. This blog post introduces some practical skills and tools for data collection via web scraping to help researchers address this challenge in their research journey.

Web Scraping

1. Web scraping and ethical considerations

Web scraping is an automated method for extracting large amounts of data from websites. Before scraping a website, researchers should carefully read its permissions and Terms of Service to confirm that scraping is permitted. To ensure that web scraping activities are both legal and ethical, seven key actions are recommended (Chung, 2024):

Check legal and ethical implications
Be transparent and honest about your identity and intentions
Respect website rules
Use moderate scraping rates and intervals
Provide proper attribution and citation to the original sources
Remove or anonymize unnecessary personal or sensitive data
Avoid using scraped data for illegal or malicious purposes
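
Two of the actions above, respecting website rules and using moderate scraping rates, can be sketched with Python's standard library. The example below is only an illustration under stated assumptions: the robots.txt content, the user agent name "example-research-bot" and the example.org URLs are all hypothetical, and a robots.txt file does not replace reading a site's Terms of Service.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration only.
USER_AGENT = "example-research-bot"
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that the robots.txt rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent=USER_AGENT, default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(urls, fetch, delay_seconds):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


parser = build_parser(SAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.org/articles/1",
    "https://example.org/private/data",
]
to_visit = allowed_urls(parser, candidates)
delay = polite_delay(parser)
print(to_visit, delay)
```

In a real project, the `fetch` argument passed to `crawl` would be a function that downloads a page, and the robots.txt text would come from the target site rather than from a string in the script.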

2. Recommended tool: Python

Python, an interpreted, object-oriented programming language, is highly recommended for web scraping due to its extensive collection of convenient packages specifically designed for this purpose. The following table lists some recommended packages in Python (Chung, 2024).

Package	Usage	Online resource for how to use the package
Requests	Establishing connections to the target website; sending HTTP requests and handling responses	Requests documentation: https://requests.readthedocs.io/en/latest/
BeautifulSoup	Parsing HTML content; extracting the desired data	BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
Scrapy	Handling pagination and iterating through multiple pages	Scrapy documentation: https://docs.scrapy.org/en/latest/
Pandas	Storing scraped data in a structured format	Pandas documentation: https://pandas.pydata.org/docs/index.html

Table 1 – Recommended Python Packages
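
As a rough illustration of how three of these packages can fit together, the sketch below fetches a page with Requests, parses it with BeautifulSoup and stores the result with Pandas. It is a minimal example under assumptions, not the workflow demonstrated in the RDA session: the HTML structure (`div.item`, `h2.title`), the contact address in the User-Agent header and the function names are all hypothetical, and the parsing step runs on an inline sample so that no real website is contacted.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical identity header; replace with your own project details.
HEADERS = {"User-Agent": "example-research-bot (contact: researcher@example.org)"}

# Hypothetical page structure, used here instead of a live website.
SAMPLE_HTML = """
<html><body>
  <div class="item"><h2 class="title">First post</h2><a href="/posts/1">Read</a></div>
  <div class="item"><h2 class="title">Second post</h2><a href="/posts/2">Read</a></div>
</body></html>
"""


def fetch_html(url):
    """Download a page with Requests and fail loudly on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


def parse_items(html):
    """Extract one record per <div class="item"> with BeautifulSoup."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="item"):
        title = item.find("h2", class_="title")
        link = item.find("a")
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "url": link.get("href") if link else None,
        })
    return rows


def to_dataframe(rows):
    """Store the scraped records in a structured Pandas table."""
    return pd.DataFrame(rows, columns=["title", "url"])


# Parse the inline sample; with a real site this would be
# parse_items(fetch_html(page_url)) after checking that scraping is permitted.
df = to_dataframe(parse_items(SAMPLE_HTML))
print(df)
```

From here, `df.to_csv("scraped.csv", index=False)` would save the table to a file. Sites that spread results across many pages need an additional loop over page URLs, which is the kind of task Scrapy is listed for in Table 1.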

3. Other web scraping tools for simple static webpages

While Python is a powerful tool for scraping dynamic websites or scraping at a large scale, non-technical researchers may find some handy tools useful for scraping simple static webpages. Below are two examples.

A. Power Query in Microsoft Excel

Power Query (also known as Get & Transform in Excel) lets you import external data and then shape that data to meet different needs.

Guide: https://support.microsoft.com/en-us/office/about-power-query-in-excel-7104fbee-9e62-4cb9-a02e-5bfb1a6c536a

B. Web Scraper browser plug-in

Web Scraper is a Chrome plugin (and Firefox add-on) designed for regular and scheduled use to extract large amounts of data.

Guide: https://webscraper.io/how-to-videos

Webinar recording for current HKU staff and students

HKU staff and students can access the recorded session on web scraping in Python, which includes a demonstration of the web scraping process using Python packages on Google Colab.

Introduction to Web Scraping & Text Preprocessing in Python
https://hku.zoom.us/rec/share/ADs_ZXhGNeTJ5vL0pMpG9Qy_FWl1KdNXKytydWP47O15Adz8Z4RSaHTwclTrtXsx.pGLkzeLtehvWnDX2

Notes:
Current HKU staff and students only. Please log in via “SSO”.
Valid for 180 days only.

References

Chung, T. (2024, March 25). Introduction to Web Scraping & Text Preprocessing in Python.

JISC. (2021). Research data management toolkit. https://www.jisc.ac.uk/guides/research-data-management-toolkit

National Library of Medicine. (n.d.). Research Lifecycle. https://www.nnlm.gov/guides/data-glossary/research-lifecycle

(This blog post is based on the Research Data Academy, RDA, organised by HKU Libraries in 2024. The RDA is a series of training sessions designed to strengthen participants’ data literacy skills, covering multiple areas of the research data lifecycle.)