
Web Scraping: Efficient Data Collection from the Web

From March to May 2024, HKU Libraries organised the Research Data Academy (RDA), a series of training sessions designed to strengthen participants’ data literacy skills across multiple areas of the research data lifecycle. In the coming months, several posts based on the RDA, covering key topics in the research data lifecycle, will be shared on this blog.

The Research Data Lifecycle 

The research data lifecycle includes everything from planning how data will be collected, to publication, to long-term data preservation, to possible reuse of the data (National Library of Medicine, n.d.). It consists of data creation and deposit, managing active data, data repositories and archives, and data catalogues and registries (JISC, 2021). Research data management is an important practice for researchers: it aims to keep data well organised and documented, and easy to share with other researchers and the public. In principle, data management activities should cover all aspects of the research data lifecycle.

Figure 1 – Research data lifecycle (JISC, 2021) 

Data Collection 

Data collection lays the foundation for data analysis. Researchers may find it difficult to collect large volumes of data from the Web efficiently. In this blog post, we introduce some practical skills and tools for data collection via web scraping to help researchers address this challenge in their research journey.

Web Scraping 

1. Web scraping and ethical considerations 

Web scraping is an automated method used to extract large amounts of data from websites. Before scraping a website, researchers should carefully read its permissions and Terms of Service to check whether scraping is permitted. To ensure that web scraping activities are both legal and ethical, seven key actions are recommended (Chung, 2024):

  • Check legal and ethical implications 
  • Be transparent and honest about your identity and intentions 
  • Respect website rules 
  • Use moderate scraping rates and intervals 
  • Provide proper attribution and citation to the original sources 
  • Remove or anonymize unnecessary personal or sensitive data 
  • Avoid using scraped data for illegal or malicious purposes 
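
Two of the actions above, respecting website rules and using moderate scraping rates, can be partly automated. The following is a minimal sketch using only the Python standard library, not a complete compliance check: it reads a robots.txt file, filters out disallowed URLs, and pauses between requests. The user-agent string, the sample robots.txt content, the example.org URLs and the helper names are all hypothetical, chosen for illustration. A robots.txt file does not replace reading the Terms of Service.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration only; replace with your own details.
USER_AGENT = "ExampleResearchBot/0.1 (contact: researcher@example.org)"
DEFAULT_DELAY_SECONDS = 2.0

# A made-up robots.txt, parsed from text so the sketch runs offline.
# For a real site, call parser.set_url("https://<site>/robots.txt")
# followed by parser.read() instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent=USER_AGENT, default=DEFAULT_DELAY_SECONDS):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def fetch_politely(urls, fetch, delay_seconds):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    candidates = [
        "https://example.org/articles/1",
        "https://example.org/private/data.html",
        "https://example.org/articles/2",
    ]
    delay = polite_delay(parser)
    for url in allowed_urls(parser, candidates):
        print(f"Allowed: {url} (wait {delay} s between requests)")
```

In practice, the `fetch` argument of `fetch_politely` would be a function that downloads one page, and `delay_seconds` would come from `polite_delay`.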

2. Recommended tool: Python 

Python, an interpreted, object-oriented programming language, is highly recommended for web scraping due to its extensive collection of convenient packages specifically designed for this purpose. The following table lists some recommended packages in Python (Chung, 2024). 

Package: Requests
Usage:
  • Establishing connections to the target website
  • Sending HTTP requests and handling responses
Online resource: Requests documentation
https://requests.readthedocs.io/en/latest/

Package: BeautifulSoup
Usage:
  • Parsing HTML content
  • Extracting desired data
Online resource: BeautifulSoup documentation
https://beautiful-soup-4.readthedocs.io/en/latest/

Package: Scrapy
Usage:
  • Handling pagination and iterating through multiple pages
Online resource: Scrapy documentation
https://docs.scrapy.org/en/latest/

Package: Pandas
Usage:
  • Storing scraped data in a structured format
Online resource: Pandas documentation
https://pandas.pydata.org/docs/index.html

Table 1 – Recommended Python Packages
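
To show how three of these packages might fit together, here is a minimal sketch, not the workflow demonstrated in the RDA session: Requests fetches a page, BeautifulSoup parses it, and Pandas stores the result. The HTML structure (`div.article`, `h2`, `span.date`), the user-agent string and the function names are assumptions made up for this example; a real website will need its own selectors. The demonstration runs on an inline HTML sample, so no website is contacted.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical identity string; being transparent about who is scraping
# is one of the recommended actions in section 1.
USER_AGENT = "ExampleResearchBot/0.1 (contact: researcher@example.org)"

# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <div class="article"><h2>First title</h2><span class="date">2024-03-25</span></div>
  <div class="article"><h2>Second title</h2><span class="date">2024-04-02</span></div>
</body></html>
"""


def fetch_html(url, timeout=10):
    """Download a page with Requests and return its HTML as text."""
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def parse_articles(html):
    """Extract one record per <div class="article"> using BeautifulSoup."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.select("div.article"):
        title = block.find("h2")
        date = block.find("span", class_="date")
        records.append({
            "title": title.get_text(strip=True) if title else None,
            "date": date.get_text(strip=True) if date else None,
        })
    return records


def to_dataframe(records):
    """Store the scraped records in a structured Pandas table."""
    return pd.DataFrame(records, columns=["title", "date"])


if __name__ == "__main__":
    # For a real page: html = fetch_html("https://example.org/articles")
    df = to_dataframe(parse_articles(SAMPLE_HTML))
    print(df)
    # df.to_csv("articles.csv", index=False)  # optional: save for analysis
```

Keeping fetching, parsing and storing in separate functions makes it possible to test the parsing step on saved HTML without sending repeated requests to the website.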

3. Other web scraping tools for simple static webpages 

While Python is a powerful tool for scraping dynamic websites or scraping at a large scale, non-technical researchers may find some handy tools useful for scraping simple static webpages. Below are two examples.

A. Power Query in Microsoft Excel 

Power Query (also known as Get & Transform in Excel) enables importing external data, and then shaping that data to meet different needs.   

Guide: https://support.microsoft.com/en-us/office/about-power-query-in-excel-7104fbee-9e62-4cb9-a02e-5bfb1a6c536a

B. Web Scraper browser plug-in 

Web Scraper is a Chrome plugin (and Firefox add-on) designed for regular and scheduled use to extract large amounts of data. 

Guide: https://webscraper.io/how-to-videos

Webinar recording for current HKU staff and students  

HKU staff and students can access the recorded session on web scraping in Python, which includes a demonstration of the web scraping process using Python packages on Google Colab.

Introduction to Web Scraping & Text Preprocessing in Python
https://hku.zoom.us/rec/share/ADs_ZXhGNeTJ5vL0pMpG9Qy_FWl1KdNXKytydWP47O15Adz8Z4RSaHTwclTrtXsx.pGLkzeLtehvWnDX2

Notes:
Current HKU staff and students only. Please log in via “SSO”.
Valid for 180 days only. 

References 

Chung, T. (2024, March 25). Introduction to Web Scraping & Text Preprocessing in Python.  

JISC. (2021). Research data management toolkit. https://www.jisc.ac.uk/guides/research-data-management-toolkit 

National Library of Medicine. (n.d.). Research Lifecycle. https://www.nnlm.gov/guides/data-glossary/research-lifecycle

(This blog post is based on the Research Data Academy, RDA, organised by HKU Libraries in 2024. The RDA is a series of training sessions designed to strengthen participants’ data literacy skills, covering multiple areas in the research data lifecycle.)
