Data Citation: How to Credit Datasets
— by Chloe Ng
Following our previous blog post on data discovery, this post focuses on how to cite the data we have discovered and reused. Data citation refers to the practice of formally referencing data, in the same way that researchers provide bibliographic citations for articles, books, or other scholarly sources.
Reasons to cite datasets
Just as journal publications are acknowledged through formal references, datasets that inform research findings must also be credited. Citing data properly ensures compliance with the terms and conditions of use, especially when the data is shared under a licence requiring attribution. Proper data citation also enables others to locate the data used, replicate analyses, and verify results.
Referencing data with Data Availability Statements
The simplest way to reference the data supporting a publication is through a Data Availability Statement, which describes how to access the underlying data and provides a URL or identifier linking to the dataset.
While such a statement meets the basic requirement for referencing data, it does not give proper credit to dataset creators and does not treat data as a formally published research output (Ball & Duke, 2015). It is also difficult to track how datasets are reused. The issues can be addressed by providing a full data citation, including both an in‑text reference and a complete entry in the reference list.
Components in a data citation
Although there is no single universal standard and practices differ across disciplines, the six elements below are generally considered the minimum information required for data identification and retrieval (Bornatici & Fedrigo, 2023).
- Author(s)
- Title of the data
- Year of publication
- Version
The version number is important to allow for retrieval of the exact data version, as new data may be added (e.g., longitudinal data) or existing data may be corrected. When a version is not explicitly provided, the publication or access date should be used instead.
- Data publisher
The organisation responsible for disseminating and preserving the dataset, such as a data repository.
- Persistent identifier
A unique electronic identifier used to locate and access the data, such as a Digital Object Identifier (DOI). It should be persistent, machine-actionable, and globally unique. A DOI is preferred over a URL because it never breaks or expires, ensuring long-term access to the resource.
Below is an example of a data citation in APA style:
Fu, K. W. (2023). Weibo Data and Replication Code in the paper “Propagandization of Relative Gratification: How Chinese State Media Portray the International Pandemic” (Version 2). HKU DataHub. https://doi.org/10.25442/hku.22339684.v2
It includes the six key components:
- Author(s): Fu, K. W.
- Title of the data: Weibo Data and Replication Code in the paper “Propagandization of Relative Gratification: How Chinese State Media Portray the International Pandemic”
- Year of publication: 2023
- Version: Version 2
- Data publisher: HKU DataHub
- Persistent identifier: https://doi.org/10.25442/hku.22339684.v2
Citation styles
These core components should be arranged and formatted according to the requirements of the relevant citation style, or the specific guidelines of the journal publisher or data repository. If guidelines for citing datasets are not available, the citation format used for books is often treated as a generic template that can be adapted for other types of sources (IASSIST, 2012).
Creating citations with data repositories or citation management tools
Many data repositories provide recommended citation formats directly on the dataset’s record page, or offer downloadable citation files for easy import into reference management software.
In DataHub, the institutional data repository of the University of Hong Kong, a data citation can be copied by clicking “Cite” on the item record page and selecting from a range of citation styles, including but not limited to APA, Chicago, and MLA.

DataHub also allows users to export citations in formats such as RefWorks, BibTeX, and EndNote. These downloaded citation files can then be imported into citation management tools and organised alongside other references.

Alternatively, citation management tools such as Zotero and EndNote allow users to manually create a reference by selecting “Dataset” as the item type. This approach is useful when a data publisher does not provide a ready‑made citation.
Proper citation has always been crucial to academic integrity, and research data is no exception. Researchers should ensure that any datasets used are cited accurately in a way that supports long-term access and scholarly credit.
Extended Readings
- Data Discovery: Where to Find the Right Datasets
- Select the Citation Management Tool That Best Suits Your Needs: EndNote vs. Zotero — Researcher Connect
References
Ball, A., & Duke, M. (2015). How to Cite Datasets and Link to Publications (DCC How-to Guides, Issue. https://www.dcc.ac.uk/guidance/how-guides/cite-datasets
Bornatici, C., & Fedrigo, N. (2023). Data Citation: How and Why Citing (Your Own) Data. FORS Guides. https://doi.org/10.24449/FG-2023-00019
IASSIST. (2012). Quick Guide to Data Citation: Identify, Retrieve, Attribute. . https://www.icpsr.umich.edu/files/ICPSR/enewsletters/iassist.html
Declaration of Generative AI use
I acknowledge the use of Generative AI tools in writing this post. I used:
- Microsoft Copilot to refine the language.
I declare that I reviewed and edited the contents as needed, and take full responsibility for the content of the post; And the information provided is complete and accurate.
