References

  • European Union Agency for Fundamental Rights (2019). Data quality and artificial intelligence: mitigating bias and error to protect fundamental rights. Publications Office. https://data.europa.eu/doi/10.2811/546219
  • Austermühl, F. (2001). Electronic tools for translators. Routledge.
  • Bonet-Jover, A., Sepúlveda-Torres, R., Saquete, E., Martínez-Barco, P., and Nieto-Pérez, M. (2024). RUN-AS: a novel approach to annotate news reliability for disinformation detection. Language Resources and Evaluation, 58(2), 609-639.
  • Botella-Gil, B., Espinosa-Zaragoza, I., Moreda, P., and Palomar, M. (2024). GPLSI: Corpus ClearSim.
  • Botella-Gil, B., Sepúlveda-Torres, R., Bonet-Jover, A., Martínez-Barco, P., and Saquete, E. (2024). Semi-automatic dataset annotation applied to automatic violent message detection. IEEE Access, 12, 19651-19664.
  • Cooke, A. (2001). A guide to finding quality information on the Internet: selection and evaluation strategies (2nd ed.). Library Association.
  • Creswell, J. W., and Plano Clark, V. L. (2018). Designing and Conducting Mixed Methods Research (3rd ed.). Thousand Oaks, CA: SAGE.
  • Jiménez Piano, M., and Ortiz-Repiso Jiménez, V. (2007). Evaluación y calidad de sedes web. Ediciones Trea.
  • Li, J., Fang, A., Smyrnis, G., Ivgi, M., Jordan, M., Gadre, S. Y., Bansal, H., Guha, E., Keh, S., Arora, K., Garg, S., Xin, R., Muennighoff, N., Heckel, R., Mercat, J., Chen, M., Gururangan, S., Wortsman, M., Albalak, A., Bitton, Y., Nezhurina, M., Abbas, A., Hsieh, C.-Y., Ghosh, D., Gardner, J., Kilian, M., Zhang, H., Shao, R., Pratt, S., Sanyal, S., Ilharco, G., Daras, G., Marathe, K., Gokaslan, A., Zhang, J., Chandu, K., Nguyen, T., Vasiljevic, I., Kakade, S., Song, S., Sanghavi, S., Faghri, F., Oh, S., Zettlemoyer, L., Lo, K., El-Nouby, A., Pouransari, H., Toshev, A., Wang, S., Groeneveld, D., Soldaini, L., Koh, P. W., Jitsev, J., Kollar, T., Dimakis, A. G., Carmon, Y., Dave, A., Schmidt, L., and Shankar, V. (2024). DataComp-LM: In search of the next generation of training sets for language models. Advances in Neural Information Processing Systems, 37, 14200-14282.
  • Miró-Maestre, M., Estevanell-Valladares, E. L., Sepúlveda-Torres, R., and Suárez-Cueto, A. (2025). Enhancing Pragmatic Processing: A Two-Dimension Approach to Detecting Intentions in Spanish. Procesamiento del Lenguaje Natural, 74, 263-276.
  • Miró-Maestre, M., Martínez-Murillo, I., Lloret, E., Moreda, P., and Suárez-Cueto, A. (2024). COCOTEROS: A Spanish corpus with contextual knowledge for natural language generation. In 40th Annual Conference of the Spanish Association for Natural Language Processing (p. 2024).
  • Penedo, G., Kydlíček, H., Lozhkov, A., Mitchell, M., Raffel, C. A., Von Werra, L., and Wolf, T. (2024a). The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37, 30811-30849.
  • Penedo, G., Kydlíček, H., Sabolčec, V., Messmer, B., Foroutan, N., Jaggi, M., von Werra, L., and Wolf, T. (2024b). FineWeb2: A sparkling update with 1000s of languages. HuggingFace. https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
  • Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. (2023). The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67.
  • Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama/
  • Together Computer. (2023). RedPajama: An open-source reproduction of LLaMA training dataset. https://www.together.xyz/blog/redpajama
  • Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. (2023). LIMA: Less is more for alignment (arXiv:2305.11206). arXiv. https://arxiv.org/abs/2305.11206