Data Discovery: Where to Find the Right Datasets
— by Chloe Ng
Every day, vast amounts of data are collected or generated by scholars, governments, NGOs, businesses, and other organizations. By discovering existing datasets, researchers can reuse data to save time and costs in data collection, conduct replication studies, compare findings, or incorporate validated components of research design (CESSDA, 2022). This post provides an overview of resources for data discovery.
Literature search
Research articles
For researchers unfamiliar with available data resources, conducting a literature review is a helpful first step. Reviewing research articles and their supporting data can reveal which datasets other researchers have used for similar topics. In published journal articles, data availability statements tell readers where and how to access the data that support the results and analysis. Even when the datasets themselves are not shared, these statements may name the organizations, government agencies, or research bodies that collect related data.

Data journals
While journal articles share data that support research findings, some datasets that have not yet been analysed are published in data journals. Data journals are peer-reviewed journals that focus on publishing documented datasets rather than research findings derived from those datasets. Their primary aim is to facilitate data discovery and reuse by providing descriptions of new research datasets (Nature). The articles they publish, known as data papers or data descriptors, document the creation method, data records, and technical validation, but do not report whether the datasets support any specific hypotheses or conclusions. Examples of data journals include:
- Scientific Data (ISSN: 2052-4463)
- Data in Brief (ISSN: 2352-3409)
- Research Data Journal for the Humanities and Social Sciences (ISSN: 2452-3666)
A list of data journals can be found in the appendix of the article by Walters (2020): https://insights.uksg.org/articles/10.1629/uksg.510#appendix.
Data repositories
Data repositories are another convenient channel for discovering research data. They provide structured systems for storing datasets, ensuring long-term accessibility and compliance with data management standards.
Disciplinary repositories
To locate research data within a specific field, researchers can look for disciplinary repositories, which follow discipline-specific metadata standards and data curation practices. Sharing data through these repositories increases visibility among subject experts and potential collaborators, which makes disciplinary repositories the recommended route for publishing data within a specific field. For example, the Sequence Read Archive (SRA), hosted by the National Center for Biotechnology Information (NCBI), is a repository for raw sequencing data.

Researchers can identify appropriate discipline-specific repositories by consulting a data repository registry such as re3data.org, FAIRsharing.org, OpenDOAR, or OpenAIRE Explore. Some of these registries provide a searchable overview of data repositories and allow filtering by subject area, data license, and other criteria, making it easier to locate repositories that meet specific research needs.
Institutional data repository – DataHub
At the institutional level, DataHub serves as the data repository for HKU researchers and students to share their research data. Researchers can explore DataHub to discover research data and to find potential collaborators within the HKU research community.

Search engines or metadata aggregators
Google Dataset Search
Google offers a dedicated search engine for datasets, Google Dataset Search, which enables users to discover datasets hosted in thousands of repositories across the web through simple keyword searches. However, its scope is limited to indexed sources, so the absence of results does not necessarily mean that relevant data does not exist.

Alternatively, a general Google search can help identify organizations that host datasets. Researchers can include topic-specific keywords along with terms such as “datasets”, “data archive”, or “open data”.
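The keyword approach above can be sketched in a few lines of Python. This is only an illustration: the function names build_queries and search_url are hypothetical, the topic is a made-up example, and the data-related terms are the three suggested in the text.

```python
from urllib.parse import urlencode

# Data-related terms suggested in the text above.
DATA_TERMS = ["datasets", "data archive", "open data"]


def build_queries(topic: str, data_terms=DATA_TERMS) -> list[str]:
    """Pair a topic with each data-related term, quoting multi-word terms as phrases."""
    queries = []
    for term in data_terms:
        phrase = f'"{term}"' if " " in term else term
        queries.append(f"{topic} {phrase}")
    return queries


def search_url(query: str) -> str:
    """Turn a query string into a Google search URL that can be opened in a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Example topic (hypothetical); print one search URL per query.
for query in build_queries("air pollution Hong Kong"):
    print(search_url(query))
```

Quoting the multi-word terms asks the search engine to match them as exact phrases, which tends to surface pages that describe themselves as data archives or open data portals.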
Metadata aggregators
While Google Dataset Search covers a broader range of sources, metadata aggregators like DataCite or OpenAIRE Explore offer a more structured search for documented datasets.
DataCite collects metadata for every Digital Object Identifier (DOI) registered through it for a research object, forming an index that can be queried to locate datasets, view usage metrics, and explore related works. All metadata is freely accessible through the data discovery platform, DataCite Commons. To look for research data specifically, users can refine results by filtering for “Dataset” as the work type.
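For readers who prefer a scripted search, the same metadata can also be queried through the public DataCite REST API. The sketch below is a minimal example, not an official client: it assumes the https://api.datacite.org/dois endpoint with its query, resource-type-id, and page[size] parameters, and the helper names build_query_url, summarize, and search_datasets are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.datacite.org/dois"  # DataCite REST API endpoint (assumed here)


def build_query_url(keywords: str, page_size: int = 5) -> str:
    """Build a search URL restricted to records whose work type is Dataset."""
    params = {
        "query": keywords,
        "resource-type-id": "dataset",  # mirrors the "Dataset" filter in DataCite Commons
        "page[size]": page_size,
    }
    return f"{API_BASE}?{urllib.parse.urlencode(params)}"


def summarize(payload: dict) -> list[dict]:
    """Extract the DOI, first title, and publication year from a response body."""
    rows = []
    for record in payload.get("data", []):
        attrs = record.get("attributes", {})
        titles = attrs.get("titles") or [{}]
        rows.append({
            "doi": record.get("id"),
            "title": titles[0].get("title", "(untitled)"),
            "year": attrs.get("publicationYear"),
        })
    return rows


def search_datasets(keywords: str, page_size: int = 5) -> list[dict]:
    """Send the query over the network and return the summarized records."""
    url = build_query_url(keywords, page_size)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return summarize(json.load(resp))


# Example (needs network access):
# for row in search_datasets("dissolved organic carbon streams"):
#     print(row["year"], row["doi"], row["title"])
```

Keeping the URL builder and the response parser separate from the network call makes it easy to check the query before sending it, and to reuse the parser on saved responses.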

Subscribed databases
Some data and statistics are only available through databases subscribed to by the Libraries. Examples include, but are not limited to, ICPSR, LSEG Workspace, Passport, and Wharton Research Data Services (WRDS). These databases usually provide data specific to a discipline or data type, such as financial data, demographic information, and marketing statistics.
The complete list of subscribed databases is available on the Libraries’ website. For subject-specific recommendations, please refer to the subject guides, which provide curated resources tailored to different disciplines.
Lawful use of research data
The data owner determines how the data can be accessed, shared, or reused. Researchers must follow the data license and terms of use associated with any dataset. Common types of data licenses include:
- Public Domain: data can be freely used, modified, and shared without restriction;
- Copyright: all rights are reserved unless permission is granted; and
- Creative Commons (CC) licenses: some rights are reserved and specific conditions apply.
More information on data licensing is available in the How to License Research Data guide by the Digital Curation Centre (Ball, 2014).
Evaluating data sources
Researchers should always evaluate the trustworthiness of a data source, considering whether the dataset is hosted by a reputable organization or maintained by an authoritative body. Reliable data resources often provide both project-level and data-level documentation, which helps users understand the context and structure of the data and supports proper interpretation and reuse.
Extended readings
- Selecting a Repository for Data Sharing — Researcher Connect
- Access, Author Rights, and Agreements 1: Which Creative Commons License Works the Best for an Author? — Researcher Connect
References
Ball, A. (2014). How to License Research Data. DCC How-to Guides. https://www.dcc.ac.uk/guidance/how-guides/license-research-data
CESSDA. (2022). CESSDA Data Management Expert Guide. CESSDA ERIC. https://dmeg.cessda.eu/
Liu, D., Chang, R., & Tang, H. (2026). A near-global dataset of dissolved organic carbon concentrations and yields in forested headwater streams. Scientific Data. https://doi.org/10.1038/s41597-025-06522-3
Nature. Scientific Data: Aims and scope. https://www.nature.com/sdata/aims-and-scope
Walters, W. H. (2020). Data journals: incentivizing data access and documentation within the scholarly communication system. Insights: the UKSG journal, 33. https://doi.org/10.1629/uksg.510
Declaration of Generative AI use
I acknowledge the use of Generative AI tools in writing this post. I used:
- Microsoft Copilot to refine the language.
I declare that I reviewed and edited the contents as needed, that I take full responsibility for the content of the post, and that the information provided is complete and accurate.
