Web Scraping: Efficient Data Collection from the Web
— by Vivian Qiu
From March to May 2024, HKU Libraries organised the Research Data Academy (RDA), a series of training sessions designed to strengthen participants’ data literacy skills across multiple areas of the research data lifecycle. Over the coming months, several posts on key topics in the research data lifecycle, based on the RDA, will be shared on this blog.
The Research Data Lifecycle
The research data lifecycle includes everything from planning how data will be collected, to publication, to long-term data preservation, to possible reuse of the data (National Library of Medicine, n.d.). It consists of data creation and deposit, managing active data, data repositories and archives, and data catalogues and registries (JISC, 2021). Research data management is an important practice for researchers: it aims to keep data well organised and documented, and easy to share with other researchers and the public. In principle, data management activities should cover every stage of the research data lifecycle.
Figure 1 – Research data lifecycle (JISC, 2021)
Data Collection
Data collection lays the foundation for data analysis. Researchers may find it difficult to collect large volumes of data from the Web efficiently. This blog post introduces some practical skills and tools for data collection via web scraping to help researchers address this challenge in their research journey.
Web Scraping
1. Web scraping and ethical considerations
Web scraping is an automated method for extracting large amounts of data from websites. Before scraping, researchers should carefully read the website’s permissions and Terms of Service to check whether scraping is permitted. To ensure that web scraping activities are both legal and ethical, seven key actions are recommended (Chung, 2024):
- Check legal and ethical implications
- Be transparent and honest about your identity and intentions
- Respect website rules
- Use moderate scraping rates and intervals
- Provide proper attribution and citation to the original sources
- Remove or anonymize unnecessary personal or sensitive data
- Avoid using scraped data for illegal or malicious purposes
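Two of the actions above, respecting website rules and using moderate scraping rates, can be sketched in Python using only the standard library. This is a minimal illustration rather than a complete compliance check: the user agent string, the sample robots.txt content, the example.org URLs and the helper names are all hypothetical, and a site's Terms of Service still need to be read separately.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifier: a descriptive user agent that says who is scraping
USER_AGENT = "example-research-bot/0.1 (contact: researcher@example.org)"

# Hypothetical robots.txt content, inlined so the sketch runs offline
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent, default_seconds=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


def fetch_politely(urls, parser, fetch, sleep=time.sleep):
    """Call fetch(url) for each permitted URL, pausing after every request."""
    delay = polite_delay(parser, USER_AGENT)
    results = {}
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # skip pages the site asks crawlers not to visit
        results[url] = fetch(url)
        sleep(delay)
    return results


parser = build_parser(SAMPLE_ROBOTS_TXT)
urls = [
    "https://example.org/articles/1",
    "https://example.org/private/data",
]
allowed = [u for u in urls if parser.can_fetch(USER_AGENT, u)]
print(allowed)
print(polite_delay(parser, USER_AGENT))
```

In real use, `fetch` would be a function that downloads the page, and the robots.txt text would come from the target site instead of an inline string.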
2. Recommended tool: Python
Python, an interpreted, object-oriented programming language, is highly recommended for web scraping due to its extensive collection of convenient packages specifically designed for this purpose. The following table lists some recommended packages in Python (Chung, 2024).
Packages | Usage | Online resource for how to use the package |
Requests | | Requests documentation: https://requests.readthedocs.io/en/latest/ |
BeautifulSoup | | BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/ |
Scrapy | | Scrapy documentation: https://docs.scrapy.org/en/latest/ |
Pandas | | Pandas documentation: https://pandas.pydata.org/docs/index.html |
Table 1 – Recommended Python Packages
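As a rough illustration of how three of the packages in Table 1 might fit together, the sketch below uses Requests to download a page, BeautifulSoup to pull items out of the HTML, and pandas to hold the result as a table. It is not taken from the training session: the sample HTML, its element names and the helper functions `fetch_html` and `parse_publications` are hypothetical. Scrapy is a fuller crawling framework and is not shown here.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical page content, inlined so the parsing step runs without a network
SAMPLE_HTML = """
<html><body>
  <ul id="publications">
    <li><a href="/paper/1">Open data in practice</a> <span class="year">2021</span></li>
    <li><a href="/paper/2">Sharing research data</a> <span class="year">2023</span></li>
  </ul>
</body></html>
"""


def fetch_html(url):
    """Download a page with Requests (defined for illustration, not called below)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def parse_publications(html):
    """Extract title, link and year from each list item into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find("ul", id="publications").find_all("li"):
        link = item.find("a")
        year = item.find("span", class_="year")
        rows.append(
            {
                "title": link.get_text(strip=True),
                "url": link["href"],
                "year": int(year.get_text(strip=True)),
            }
        )
    return pd.DataFrame(rows)


df = parse_publications(SAMPLE_HTML)
print(df)
```

For a live page, `parse_publications(fetch_html(url))` would replace the inline sample, with the tag names and attributes adjusted to match the target site's actual HTML.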
3. Other web scraping tools for simple static webpages
While Python is a powerful tool for scraping dynamic websites or scraping at a large scale, non-technical researchers may find some simpler tools handy for scraping simple static webpages. Below are two examples.
A. Power Query in Microsoft Excel
Power Query (also known as Get & Transform in Excel) enables importing external data, and then shaping that data to meet different needs.
B. Web Scraper browser plug-in
Web Scraper is a Chrome plugin (and Firefox add-on) designed for regular and scheduled use to extract large amounts of data.
Guide: https://webscraper.io/how-to-videos
Webinar recording for current HKU staff and students
HKU staff and students can access the recorded session on web scraping in Python, which includes a demonstration of the web scraping process using Python packages on Google Colab.
Introduction to Web Scraping & Text Preprocessing in Python
https://hku.zoom.us/rec/share/ADs_ZXhGNeTJ5vL0pMpG9Qy_FWl1KdNXKytydWP47O15Adz8Z4RSaHTwclTrtXsx.pGLkzeLtehvWnDX2
Notes:
Current HKU staff and students only. Please log in via “SSO”.
Valid for 180 days only.
References
Chung, T. (2024, March 25). Introduction to Web Scraping & Text Preprocessing in Python.
JISC. (2021). Research data management toolkit. https://www.jisc.ac.uk/guides/research-data-management-toolkit
National Library of Medicine. (n.d.). Research Lifecycle. https://www.nnlm.gov/guides/data-glossary/research-lifecycle
(The blog post is based on Research Data Academy, RDA, organised by HKU Libraries in 2024. The RDA is a series of training sessions designed to strengthen participants’ data literacy skills, covering multiple areas in the research data lifecycle.)