A Large-Scale Open Dataset of Computer Science Research Papers (2020–2025)

Zaid Mundher, Manar Talat Ahmad

Abstract


The rapid growth of publications in fields such as computer science requires well-structured datasets to support data-driven research. This paper presents an open, large-scale dataset of computer science research papers published between 2020 and 2025, collected from Crossref metadata using the Crossref REST API. A structured keyword-based retrieval framework was developed to collect papers and their associated metadata. Preprocessing steps, including cleaning, normalization, and validation, were then applied to the collected data. The resulting dataset contains 4,313,328 research paper records, making it one of the largest structured collections of computer science publications for the specified period. The dataset provides comprehensive metadata fields that enable large-scale analysis, research trend identification, collaboration network exploration, and the development of recommendation systems.
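As a rough illustration of the kind of pipeline the abstract describes, the following minimal sketch builds a keyword query against the Crossref REST API `works` endpoint and applies simple cleaning, validation, and deduplication to returned items. It is not the authors' implementation: the function names, the specific date bounds, the page size, and the choice of retained fields (`doi`, `title`, `year`) are all assumptions made for the example. The `query`, `filter`, `rows`, and `cursor` parameters and the `DOI`, `title`, and `issued` fields are part of the public Crossref API.

```python
from urllib.parse import urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_query_url(keyword, cursor="*", rows=1000):
    """Build a Crossref works URL for one keyword (hypothetical helper).

    The date window mirrors the 2020-2025 period stated in the abstract;
    cursor="*" starts deep paging, and later pages would pass the
    "next-cursor" value returned by the API.
    """
    params = {
        "query": keyword,
        "filter": "from-pub-date:2020-01-01,until-pub-date:2025-12-31",
        "rows": rows,
        "cursor": cursor,
    }
    return f"{CROSSREF_WORKS_URL}?{urlencode(params)}"


def normalize_record(item):
    """Clean one Crossref item; return None if it fails validation.

    Assumed rules: the DOI is trimmed and lowercased, title whitespace
    is collapsed, and records lacking a DOI or a title are rejected.
    """
    doi = (item.get("DOI") or "").strip().lower()
    titles = item.get("title") or []
    title = " ".join(titles[0].split()) if titles else ""
    if not doi or not title:
        return None
    date_parts = (item.get("issued") or {}).get("date-parts") or [[None]]
    year = date_parts[0][0] if date_parts[0] else None
    return {"doi": doi, "title": title, "year": year}


def deduplicate(records):
    """Keep the first record seen for each DOI."""
    seen = {}
    for record in records:
        seen.setdefault(record["doi"], record)
    return list(seen.values())
```

In such a setup, deduplication by DOI matters because several keywords can match the same paper, so the same record may be retrieved more than once.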

Keywords


computer science research papers; Crossref; large-scale dataset; REST API; Zenodo






DOI: https://doi.org/10.32520/stmsi.v15i4.6255



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.