A methodological framework for the use of AI tools in automated workflows for generating validated and structured historical datasets

José Antonio Motilla-Chávez, Diego Espitia, Diego Perez-Martinez, Edgardo Galán Vásquez, Edgardo Ugalde, Marcela Lomelí-Jasso, Eduardo Perez-Martinez, Hector G. Perez-Gonzalez, Fernando Carlín-Loza, Valeria Martínez-Ramírez, Martín Zumaya Hernández

enero 17170

Schematic overview of the automated workflow for OCR processing, LLM-assisted correction, and entity extraction, implemented through the n8n orchestration framework

Resumen

The growing availability of digitized historical collections has enabled large-scale computational research; however, transforming heterogeneous and noisy textual data into structured and analyzable formats remains a major challenge within the Digital Humanities. This article presents a reproducible workflow for historical text processing that integrates Optical Character Recognition (OCR), Large Language Models (LLM)-assisted post-correction, and Named Entity Recognition (NER) into a unified pipeline. Implemented within the n8n automation framework, the workflow emphasizes transparency, modularity, and human-in-the-loop validation, enabling scholars to maintain interpretive control over data transformation. The pipeline is evaluated on five historical corpora in Spanish (eighteenth and nineteenth centuries), demonstrating significant reductions in Character and Word Error Rates (up to -93.5% and -66.8% respectively) through lightweight, open-weight LLMs (Gemma 3 27B, Qwen 2.5 32B, LLaMA 3 70B). NER performance using FLAIR achieves F1 scores above 0.96 for persons and organizations, and a semantic-level similarity evaluation is added through CoNES to assess distributed lexical recovery. Beyond reporting benchmarks, the study reflects on the epistemic implications of automated processing in historical research and argues that reproducible data pipelines are essential infrastructures for scaling relational analyses such as co-entity networks and computational historiography. All results contribute toward a transparent methodological model that bridges humanistic inquiry and computational automation while preserving scholarly traceability.

Tipo

Artículo de revista

Publicación

International Journal of Digital Humanities

A methodological framework for the use of AI tools in automated workflows for generating validated and structured historical datasets

Resumen

Edgardo Galán Vásquez

Investigador Asociado C