Structuring a data lake for the management of scientific information in Brazil
DOI:
https://doi.org/10.47909/978-9916-9331-4-5.112
Keywords:
data gap, data interoperability, scientific information management, academic databases
Abstract
This study outlines the initial steps in establishing a data lake, Laguna, fed with structured data from the data ecosystem of the Brazilian Current Research Information System (BrCris). The data lake was developed to manage scientific information and aggregate it into an accessible system. A substantial amount of data was collected and processed across five phases: (1) collection; (2) selection and separation; (3) transformation and connection; (4) organization, classification, and indexing; and (5) retrieval and visualization. Data were extracted from several platforms using SQL queries or APIs, depending on the source. A set of scientific journals was identified by stratification, with the highest percentage belonging to the A1 category. An initial integration of OpenAlex and DOAJ data was carried out, a significant milestone in the development of the platform. Author data were disambiguated and cross-checked by DOI to identify citing and cited authors. The resulting data support robust inferences, including the standardized number of journals by stratum, the integration of separate databases such as OpenAlex and DOAJ, the ontological system used to address author disambiguation, and the representation of cited authors in relation to journals and future authority records.
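The DOI cross-check mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the Laguna implementation: the record layout, the author identifiers, the example DOIs, and the function names `normalize_doi` and `link_citing_to_cited` are all assumptions made for illustration. It assumes each work already carries disambiguated author identifiers and that citations are available as pairs of DOIs.

```python
from collections import defaultdict

# Prefixes commonly found in front of a bare DOI (assumed list, not exhaustive).
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Lowercase a DOI and strip a leading URL or 'doi:' prefix."""
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def link_citing_to_cited(works, citations):
    """Map each citing author to the set of authors they cite.

    works: list of dicts with 'doi' and 'authors' (disambiguated author IDs).
    citations: list of (citing_doi, cited_doi) pairs.
    """
    authors_by_doi = {normalize_doi(w["doi"]): set(w["authors"]) for w in works}
    links = defaultdict(set)
    for citing_doi, cited_doi in citations:
        citing_authors = authors_by_doi.get(normalize_doi(citing_doi))
        cited_authors = authors_by_doi.get(normalize_doi(cited_doi))
        if not citing_authors or not cited_authors:
            continue  # skip pairs where either DOI is not in the lake
        for author in citing_authors:
            links[author].update(cited_authors)
    return dict(links)


# Hypothetical records; the DOIs and author IDs are made up.
works = [
    {"doi": "https://doi.org/10.1000/A1", "authors": ["author:1", "author:2"]},
    {"doi": "10.1000/b2", "authors": ["author:3"]},
]
citations = [
    ("doi:10.1000/a1", "HTTPS://DOI.ORG/10.1000/B2"),
    ("10.1000/a1", "10.9999/missing"),
]
links = link_citing_to_cited(works, citations)
```

Normalizing DOIs before matching matters because the same identifier may arrive from different sources as a bare string, a URL, or with different letter case. Pairs whose DOI is absent from the collected works are skipped rather than guessed.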
License
Copyright (c) 2025 Washington Luís Ribeiro de Carvalho Segundo, Fábio Lorensi do Canto, Patrícia da Silva Neubert, Adilson Luiz Pinto, Carlos Luis González-Valiente

This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), which permits copying and redistributing the material in any medium or format, and adapting, transforming, and building upon it, as long as the license terms are followed.