Data Cleaning: What and How

In addition to data collection, data cleaning is another crucial process to lay the foundation for data analysis. Researchers may face challenges related to data quality, reusability, and obtaining reliable insights from the collected data. In this blog post, we would introduce some practical skills and tools for data cleaning to help researchers address these challenges. 

Data cleaning process

Data cleaning refers to the process of taking appropriate actions to identify and rectify inaccurate data (Mourya & Gupta, 2012). It is essential in the data processing workflow which ensures data accuracy, consistency, and quality for analysis, leading to more reliable results and insights, and enhanced integrity of research findings.  

The cleaning processes may include (Ramesh & Lee, 2024):

  • Removing duplicates
  • Removing leading or trailing whitespace
  • Handling missing values
  • Standardizing formats and data types
  • Resolving inconsistent column names and values
  • Identifying and removing outliers
 

Data cleaning tools

The table below lists several recommended tools, including software applications and programming languages, for data cleaning. 

Software applications

Tool Brief description Pros Cons

Microsoft Excel
A widely used spreadsheet application that offers basic data cleaning functionalities such as removing duplicates and correcting errors, etc.
  1. Familiar and user-friendly interface for most users.
  2. Supports a wide range of data formats.
  3. Easy-to-use functions for data cleaning.
  1. Limited processing power and scalability for large datasets.

Google Sheets
A web-based spreadsheet tool similar to Microsoft Excel, offering basic data cleaning features and collaborative capabilities.
  1. Accessible online and easy to use.
  2. Allows real-time collaboration, e.g., sharing and simultaneous editing.
  3. Integrates with other Google tools.
  1. Limited functionality compared to desktop software.

Open Refine
An open-source tool designed specifically for data cleaning tasks.
  1. Powerful data transformation and cleaning capabilities.
  2. Supports large datasets.
  3. Provides convenient interface in browser.
  1. For single user.
  2. Do not support descriptive statistics of variable.

Alteryx
A comprehensive data preparation and analytics platform that includes data cleaning functionalities.
  1. Offers a range of data cleaning and transformation tools.
  2. Provides a drag-and-drop interface, simple and user-friendly.
  3. Integrates with other tools for advanced analytics.
  1. Commercial software.
  2. With a learning curve for beginners.

KNIME
An open-source data analytics platform that supports data cleaning and pre-processing tasks.
  1. Offers a visual workflow interface.
  2. Provides pre-built data cleaning nodes, with higher customization and reliability.
  3. Integrates with other tools. (Gelevska, 2023)
  1. Initial setup and learning curve may be challenging for beginners.

Table 1 – Software applications for data cleaning 

Programming languages

Tool Brief description Pros Cons

Python
A popular programming language with numerous libraries and packages for data cleaning and pre-processing.
  1. Vast collection of packages (e.g., Pandas, NumPy) dedicated to data cleaning.
  2. Flexible and customizable for complex tasks.
  3. Widely used in data science community.
  1. Requires programming knowledge and coding skills.

R
A statistical programming language with extensive data manipulation and cleaning capabilities. 
  1. Rich ecosystem of packages (e.g., dplyr, tidyr) designed for data cleaning.
  2. Popular among statisticians and researchers for data analysis.
  1. Steeper learning curve compared to other tools for beginners.

MySQL
A popular open-source relational database management system that can be used for data cleaning and transformation tasks.
  1. Efficient for handling large datasets and complex data cleaning operations.
  2. Supports SQL queries for data filtering, transformation and cleaning.
  3. Can be integrated with other tools and programming languages.
  1. Requires knowledge of SQL language and database management.

Table 2 – Programming languages for data cleaning

Webinar recording for current HKU staff and students  

HKU staff and students can access the recorded sessions on data cleaning.  

Introduction to Web Scraping & Text Preprocessing in Python 
https://hku.zoom.us/rec/share/ADs_ZXhGNeTJ5vL0pMpG9Qy_FWl1KdNXKytydWP47O15Adz8Z4RSaHTwclTrtXsx.pGLkzeLtehvWnDX2

Mastering Data Cleaning Techniques 
https://hku.zoom.us/rec/share/tJK94U0EuC672bBZOvIjYy8M5lIZqgfpIzkQjsQem8GlMbL9ukbnDaSyW-30sF-C.wzu73FXkR_OQjEGD

Notes:  
For HKU current staff and students only. Please login via “SSO”.  
Valid for 180 days only. 

Extended reading 

Event Summary – Unlocking Research Potential: Effective Data Management for Transdisciplinary Success 
https://blog-sc.hku.hk/event-summary-unlocking-research-potential-effective-data-management-for-transdisciplinary-success/ 

References 

Chung, T. (2024, March 25). Introduction to Web Scraping & Text Preprocessing in Python.  

Gelevska, A. (2023, March 23). Knime vs Alteryx: Difference and Software Comparison. https://redfield.ai/knime-vs-alteryx/

Mourya, S. K., & Gupta, S. (2012). Data Mining and Data Warehousing. Alpha Science International, Ltd.  

Ramesh, V. R., & Lee, J. (2024). Mastering Data Cleaning Techniques. 

(The blog post is based on Research Data Academy, RDA, organised by HKU Libraries in 2024. The RDA is a series of training sessions designed to strengthen participants’ data literacy skills, covering multiple areas in the research data life cycle.) 

Share