Data Cleaning: What and How

Research Data Scholarly Communication Services

May 9, 2024August 20, 2024

Post Views: 1,899

— by Vivian Qiu

In addition to data collection, data cleaning is another crucial process to lay the foundation for data analysis. Researchers may face challenges related to data quality, reusability, and obtaining reliable insights from the collected data. In this blog post, we would introduce some practical skills and tools for data cleaning to help researchers address these challenges.

Data cleaning process

Data cleaning refers to the process of taking appropriate actions to identify and rectify inaccurate data (Mourya & Gupta, 2012). It is essential in the data processing workflow which ensures data accuracy, consistency, and quality for analysis, leading to more reliable results and insights, and enhanced integrity of research findings.

The cleaning processes may include (Ramesh & Lee, 2024):

Removing duplicates
Removing leading or trailing whitespace
Handling missing values
Standardizing formats and data types
Resolving inconsistent column names and values
Identifying and removing outliers

Data cleaning tools

The table below lists several recommended tools, including software applications and programming languages, for data cleaning.

Software applications

Tool	Brief description	Pros	Cons
Microsoft Excel	A widely used spreadsheet application that offers basic data cleaning functionalities such as removing duplicates and correcting errors, etc.	Familiar and user-friendly interface for most users. Supports a wide range of data formats. Easy-to-use functions for data cleaning.	Limited processing power and scalability for large datasets.
Google Sheets	A web-based spreadsheet tool similar to Microsoft Excel, offering basic data cleaning features and collaborative capabilities.	Accessible online and easy to use. Allows real-time collaboration, e.g., sharing and simultaneous editing. Integrates with other Google tools.	Limited functionality compared to desktop software.
Open Refine	An open-source tool designed specifically for data cleaning tasks.	Powerful data transformation and cleaning capabilities. Supports large datasets. Provides convenient interface in browser.	For single user. Do not support descriptive statistics of variable.
Alteryx	A comprehensive data preparation and analytics platform that includes data cleaning functionalities.	Offers a range of data cleaning and transformation tools. Provides a drag-and-drop interface, simple and user-friendly. Integrates with other tools for advanced analytics.	Commercial software. With a learning curve for beginners.
KNIME	An open-source data analytics platform that supports data cleaning and pre-processing tasks.	Offers a visual workflow interface. Provides pre-built data cleaning nodes, with higher customization and reliability. Integrates with other tools. (Gelevska, 2023)	Initial setup and learning curve may be challenging for beginners.

Table 1 – Software applications for data cleaning

Programming languages

Tool	Brief description	Pros	Cons
Python	A popular programming language with numerous libraries and packages for data cleaning and pre-processing.	Vast collection of packages (e.g., Pandas, NumPy) dedicated to data cleaning. Flexible and customizable for complex tasks. Widely used in data science community.	Requires programming knowledge and coding skills.
R	A statistical programming language with extensive data manipulation and cleaning capabilities.	Rich ecosystem of packages (e.g., dplyr, tidyr) designed for data cleaning. Popular among statisticians and researchers for data analysis.	Steeper learning curve compared to other tools for beginners.
MySQL	A popular open-source relational database management system that can be used for data cleaning and transformation tasks.	Efficient for handling large datasets and complex data cleaning operations. Supports SQL queries for data filtering, transformation and cleaning. Can be integrated with other tools and programming languages.	Requires knowledge of SQL language and database management.

Table 2 – Programming languages for data cleaning

Webinar recording for current HKU staff and students

HKU staff and students can access the recorded sessions on data cleaning.

Introduction to Web Scraping & Text Preprocessing in Python
https://hku.zoom.us/rec/share/ADs_ZXhGNeTJ5vL0pMpG9Qy_FWl1KdNXKytydWP47O15Adz8Z4RSaHTwclTrtXsx.pGLkzeLtehvWnDX2

Mastering Data Cleaning Techniques
https://hku.zoom.us/rec/share/tJK94U0EuC672bBZOvIjYy8M5lIZqgfpIzkQjsQem8GlMbL9ukbnDaSyW-30sF-C.wzu73FXkR_OQjEGD

Notes:
For HKU current staff and students only. Please login via “SSO”.
Valid for 180 days only.

Extended reading

Event Summary – Unlocking Research Potential: Effective Data Management for Transdisciplinary Success
https://blog-sc.hku.hk/event-summary-unlocking-research-potential-effective-data-management-for-transdisciplinary-success/

References

Chung, T. (2024, March 25). Introduction to Web Scraping & Text Preprocessing in Python.

Gelevska, A. (2023, March 23). Knime vs Alteryx: Difference and Software Comparison. https://redfield.ai/knime-vs-alteryx/

Mourya, S. K., & Gupta, S. (2012). Data Mining and Data Warehousing. Alpha Science International, Ltd.

Ramesh, V. R., & Lee, J. (2024). Mastering Data Cleaning Techniques.

(The blog post is based on Research Data Academy, RDA, organised by HKU Libraries in 2024. The RDA is a series of training sessions designed to strengthen participants’ data literacy skills, covering multiple areas in the research data life cycle.)