Building AI Pipelines for African Digital Collections: Lessons from the Islam West Africa Collection
How do you make 26 million words of West African history searchable?
The Islam West Africa Collection spans six countries and comprises over 14,500 documents. Managing this digital database involves grappling with scale.
I recently discussed this challenge with the African Literary Metadata (ALMEDA) team at Uppsala University, at the invitation of Ashleigh Harris. In my talk, "Building AI Pipelines for African Digital Collections: Lessons from the Islam West Africa Collection", I mapped out four specific workflows that I have developed to manage this volume of data:
- OCR correction: Cleaning up "noisy" text from newspaper scans to improve readability.
- Text extraction: Deploying multimodal AI to interpret complex magazine layouts and handwritten notes.
- Named Entity Recognition (NER): Automating metadata to locate specific people, places, and organisations.
- Indexing: Building rich, searchable indices that open up entire publications to researchers.
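To make the indexing step concrete, here is a minimal sketch of an inverted index, the data structure behind most full-text search. It is not the pipeline presented in the talk: it uses only the Python standard library, the function names (`normalise`, `build_index`, `search`) are hypothetical, and the two sample documents are invented for illustration. It assumes that folding accents and case is acceptable for search, which helps when OCR output is inconsistent about diacritics.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(text: str) -> str:
    """Strip diacritics and fold case so 'Mosquée' and 'mosquee' match."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()


def tokenise(text: str) -> list[str]:
    """Split normalised text into word tokens."""
    return re.findall(r"\w+", normalise(text))


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenise(text):
            index[token].add(doc_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of documents containing every token in the query."""
    tokens = tokenise(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Invented sample documents, for illustration only.
sample_documents = {
    "doc1": "La mosquée de Ouagadougou",
    "doc2": "Mosquee et école à Cotonou",
}
sample_index = build_index(sample_documents)
print(sorted(search(sample_index, "mosquée")))
print(sorted(search(sample_index, "Mosquee Cotonou")))
```

Because the index is a plain dictionary, it can be serialised to JSON and served from a static site, which fits the low-cost, low-maintenance approach discussed below. A production index would also need language-aware tokenisation and phrase search, which this sketch leaves out.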
We also raised a critical point: technology must not become a barrier. To support African libraries in a meaningful way, we must prioritise "minimal computing" and "digital sobriety". These approaches keep metadata workflows cost-effective and open source, rather than locking institutions into expensive, unmaintainable systems.
Finally, we discussed the messy reality of the archive. Unstructured data, mixed languages, and irregular layouts call for tools that are not just powerful, but pragmatic.