With over 14,500 items and 28 million words—the equivalent of more than 300 books—the Islam West Africa Collection has long exceeded what any individual researcher could process manually. What began as years of fieldwork accumulation risked becoming "digital clutter": thousands of documents gathered but never fully accessible.

These open-source Python pipelines address that challenge by integrating Large Language Models (Google Gemini, OpenAI, Mistral) into document processing workflows. They automate labor-intensive tasks that would otherwise consume months of work—or simply never get done. The goal is not to replace scholarly judgment, but to make large documentary collections tractable so researchers can focus on interpretation rather than data wrangling.
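To make the pattern concrete, here is what a single correction step might look like. This is a minimal sketch using the OpenAI Python client; the model name, prompt, and function are illustrative placeholders rather than the project's actual code, and the equivalent call could be made against Gemini or Mistral.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def correct_ocr(raw_text: str) -> str:
        """Ask the model to fix recognition errors without rephrasing."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            temperature=0,        # deterministic output suits correction tasks
            messages=[
                {"role": "system",
                 "content": ("You correct OCR errors in French-language documents. "
                             "Fix misrecognized characters only; do not rewrite.")},
                {"role": "user", "content": raw_text},
            ],
        )
        return response.choices[0].message.content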

What the pipelines do

  • OCR extraction & correction: Extract text from PDF scans using multimodal vision models, and correct errors in legacy OCR output, including ALTO XML files, while preserving word coordinates (the sketch above illustrates the correction step)
  • Named entity recognition: Identify people, places, and organizations, with authority reconciliation and fuzzy matching to handle the variant spellings common in West African names (see the fuzzy-matching sketch after this list)
  • Summarization: Generate French summaries for document discovery across large collections
  • Audio & video processing: Transcribe interviews and oral histories (including Hausa content) and summarize videos with visual descriptions (a transcription sketch also follows the list)
  • Handwritten text recognition: Read manuscripts in French, Arabic, or mixed scripts
  • Magazine indexing: Extract and index individual articles from digitized periodicals with complex layouts
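
To illustrate the fuzzy-matching step mentioned in the named-entity bullet, the sketch below uses difflib from the Python standard library; the authority list, names, and similarity cutoff are hypothetical stand-ins for whatever matcher and reference data the pipelines actually use.

    import difflib

    # Hypothetical authority list; the real pipelines reconcile against
    # the collection's own records.
    AUTHORITY = ["Usman dan Fodio", "Ahmadou Bamba", "El Hadj Omar Tall"]

    def reconcile(name: str, cutoff: float = 0.8) -> str | None:
        """Return the closest authority entry for a variant spelling, if any."""
        matches = difflib.get_close_matches(name, AUTHORITY, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    print(reconcile("Ousmane dan Fodio"))  # -> Usman dan Fodio

A cutoff near 0.8 tolerates common variant spellings while still rejecting unrelated names.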
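
For the audio bullet, transcription reduces to a single call against a hosted speech model. Below is a minimal sketch using OpenAI's Whisper endpoint, with a hypothetical file name; the project's actual provider and model choice may differ.

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical file name; interviews in the collection include Hausa
    # content, which low-resource speech models may transcribe imperfectly.
    with open("interview_hausa.mp3", "rb") as audio:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

    print(transcript.text)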

AI-NER-Validator

A companion web application (AI-NER-Validator) provides quality control for automatically extracted entities. Researchers can review each article alongside its AI-extracted entities, validate or reject each entity, add missing ones manually, and export clean CSV files. The tool keeps human oversight central to the workflow.
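
The export step itself is simple: write the accepted entities to a flat file. A minimal sketch, assuming a hypothetical record schema (the validator's real field names may differ):

    import csv

    # Hypothetical records and field names; the validator's real schema may differ.
    validated = [
        {"article_id": "iwac-001", "entity": "Kano", "type": "place", "status": "accepted"},
        {"article_id": "iwac-001", "entity": "Bello", "type": "person", "status": "rejected"},
    ]

    with open("entities_clean.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["article_id", "entity", "type", "status"])
        writer.writeheader()
        writer.writerows(r for r in validated if r["status"] == "accepted")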

Open source

The code is open source and documented so that others can adapt it to their own collections. As lightweight open-source models improve, these workflows could eventually run locally without relying on commercial APIs, an important consideration for data sovereignty and long-term sustainability.

Limitations

These tools are research aids, not replacements for scholarly judgment. LLMs remain black boxes: their outputs cannot be fully traced back to the evidence that produced them. Models trained predominantly on Western data may misrepresent African contexts and naming conventions. And unlike traditional OCR, which signals failure through garbled text, AI-generated errors read as fluent prose, shifting the burden from fixing visible mistakes to detecting hidden ones. Effective use therefore requires domain expertise and familiarity with the source material.

View on GitHub