Data Cleaning: What and How
— by Vivian Qiu
Alongside data collection, data cleaning is a crucial process that lays the foundation for data analysis. Researchers may face challenges related to data quality, reusability, and obtaining reliable insights from the collected data. In this blog post, we introduce some practical skills and tools for data cleaning to help researchers address these challenges.
Data cleaning process
Data cleaning refers to the process of taking appropriate actions to identify and rectify inaccurate data (Mourya & Gupta, 2012). It is an essential step in the data processing workflow: it ensures the accuracy, consistency, and quality of data for analysis, which leads to more reliable results and insights and strengthens the integrity of research findings.
The cleaning process may include the following steps (Ramesh & Lee, 2024):
- Removing duplicates
- Removing leading or trailing whitespace
- Handling missing values
- Standardizing formats and data types
- Resolving inconsistent column names and values
- Identifying and removing outliers
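To make the steps listed above more concrete, here is a minimal sketch in Python that applies each of them to a tiny dataset. It uses only the standard library so it runs anywhere; libraries such as pandas offer higher-level equivalents. The records, the column names, the `clean` function, and the `max_age` threshold are all invented for illustration and do not come from the webinar material. Filling missing ages with the median and treating ages above 120 as outliers are simplifying assumptions; the right rules depend on your data.

```python
from statistics import median

# Hypothetical raw records (invented for illustration): note the messy
# column names, stray whitespace, a duplicate, a missing age and an
# implausible age.
raw_rows = [
    {"Name ": " Alice ", "AGE": "34", "Joined": "2024/01/15"},
    {"Name ": "Bob", "AGE": "", "Joined": "2024/02/01"},
    {"Name ": " Alice ", "AGE": "34", "Joined": "2024/01/15"},
    {"Name ": "Carol", "AGE": "29", "Joined": "2024/03/10"},
    {"Name ": "Dave", "AGE": "31", "Joined": "2024/03/12"},
    {"Name ": "Eve", "AGE": "410", "Joined": "2024/04/02"},
]


def clean(rows, max_age=120):
    """Apply the six cleaning steps to a list of dict records."""
    cleaned, seen = [], set()
    for row in rows:
        # Resolve inconsistent column names and remove leading or
        # trailing whitespace from both names and values.
        record = {key.strip().lower(): value.strip() for key, value in row.items()}

        # Standardize formats and data types: age becomes an integer
        # (or None when missing), dates use hyphens instead of slashes.
        record["age"] = int(record["age"]) if record["age"] else None
        record["joined"] = record["joined"].replace("/", "-")

        # Remove duplicates: skip records already seen after normalization.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(record)

    # Identify and remove outliers with a simple range rule (an assumption).
    in_range = [r for r in cleaned if r["age"] is None or 0 <= r["age"] <= max_age]

    # Handle missing values: fill missing ages with the median of known ages.
    known_ages = [r["age"] for r in in_range if r["age"] is not None]
    fill_value = median(known_ages) if known_ages else None
    for record in in_range:
        if record["age"] is None:
            record["age"] = fill_value
    return in_range


for record in clean(raw_rows):
    print(record)
```

The order of the steps matters here: whitespace and column names are normalized before duplicates are checked, so that " Alice " and "Alice" count as the same record, and outliers are removed before the median is computed, so that the implausible value does not distort the fill value.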
Data cleaning tools
The tables below list several recommended tools for data cleaning, covering both software applications and programming languages.
Software applications
Tool | Brief description | Pros | Cons |
Microsoft Excel | A widely used spreadsheet application that offers basic data cleaning functionalities, such as removing duplicates and correcting errors. | | |
Google Sheets | A web-based spreadsheet tool similar to Microsoft Excel, offering basic data cleaning features and collaborative capabilities. | | |
OpenRefine | An open-source tool designed specifically for data cleaning tasks. | | |
Alteryx | A comprehensive data preparation and analytics platform that includes data cleaning functionalities. | | |
KNIME | An open-source data analytics platform that supports data cleaning and pre-processing tasks. | | |
Table 1 – Software applications for data cleaning
Programming languages
Tool | Brief description | Pros | Cons |
Python | A popular programming language with numerous libraries and packages for data cleaning and pre-processing. | | |
R | A statistical programming language with extensive data manipulation and cleaning capabilities. | | |
MySQL | A popular open-source relational database management system that can be used for data cleaning and transformation tasks. | | |
Table 2 – Programming languages for data cleaning
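As a rough illustration of how a relational database can take on cleaning work, the sketch below runs a single SQL query that trims whitespace, standardizes letter case, fills missing values, and removes duplicates. It uses Python's built-in `sqlite3` module as a stand-in so the example runs without a database server; SQLite is not MySQL, although `TRIM`, `UPPER`, `COALESCE`, and `SELECT DISTINCT` are available in both. The table `survey_raw`, its columns, and the placeholder value `'UNKNOWN'` are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for illustration.
conn.execute("CREATE TABLE survey_raw (respondent TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO survey_raw VALUES (?, ?)",
    [
        (" Alice ", "hong kong"),
        ("Alice", "Hong Kong"),
        ("Bob", None),
        ("Carol ", " HONG KONG"),
    ],
)

# One query that trims whitespace, standardizes case, replaces missing
# cities with a placeholder, and drops the resulting duplicate rows.
cleaned_rows = conn.execute(
    """
    SELECT DISTINCT
        TRIM(respondent) AS respondent,
        COALESCE(UPPER(TRIM(city)), 'UNKNOWN') AS city
    FROM survey_raw
    ORDER BY respondent
    """
).fetchall()
conn.close()

for row in cleaned_rows:
    print(row)
```

Doing the cleaning in a query leaves the raw table untouched, which makes it easy to revisit the rules later without losing the original data.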
Webinar recording for current HKU staff and students
HKU staff and students can access the recorded sessions on data cleaning.
Introduction to Web Scraping & Text Preprocessing in Python
https://hku.zoom.us/rec/share/ADs_ZXhGNeTJ5vL0pMpG9Qy_FWl1KdNXKytydWP47O15Adz8Z4RSaHTwclTrtXsx.pGLkzeLtehvWnDX2
Mastering Data Cleaning Techniques
https://hku.zoom.us/rec/share/tJK94U0EuC672bBZOvIjYy8M5lIZqgfpIzkQjsQem8GlMbL9ukbnDaSyW-30sF-C.wzu73FXkR_OQjEGD
Notes:
For current HKU staff and students only. Please log in via “SSO”.
Valid for 180 days only.
Extended reading
Event Summary – Unlocking Research Potential: Effective Data Management for Transdisciplinary Success
https://blog-sc.hku.hk/event-summary-unlocking-research-potential-effective-data-management-for-transdisciplinary-success/
References
Chung, T. (2024, March 25). Introduction to Web Scraping & Text Preprocessing in Python.
Gelevska, A. (2023, March 23). Knime vs Alteryx: Difference and Software Comparison. https://redfield.ai/knime-vs-alteryx/
Mourya, S. K., & Gupta, S. (2012). Data Mining and Data Warehousing. Alpha Science International, Ltd.
Ramesh, V. R., & Lee, J. (2024). Mastering Data Cleaning Techniques.
(This blog post is based on the Research Data Academy (RDA), organised by HKU Libraries in 2024. The RDA is a series of training sessions designed to strengthen participants’ data literacy skills, covering multiple areas of the research data life cycle.)