IWAC Chatbot | Digital Humanities

The IWAC Chatbot offers a way to explore the Islam West Africa Collection (IWAC) through semantic search built on Retrieval-Augmented Generation (RAG). Users can move beyond traditional keyword searches and engage with the archive more intuitively by asking complex questions in natural language.

Core Principle: The chatbot connects a Large Language Model (LLM) to the specific knowledge base of the IWAC collection at the moment of response generation. This aims to address the inherent limitations of LLMs, such as fixed knowledge cutoffs, potential for hallucinations, and lack of domain-specific depth, by grounding their responses in reliable documentary sources from the Collection.

Benefits include:

- Querying the Collection for concepts, nuances, and contexts that are difficult to capture with keywords.
- Revealing subtle thematic connections between documents, even without shared vocabulary.
- Receiving summarized information drawn from multiple archive sources, with references provided for scholarly rigor.
- Making the Collection's wealth of information accessible to a broader audience, including users without traditional archival research expertise.

The chatbot operates through a three-stage process:

Preliminary Preparation (One-time):
- Documents from the collection (e.g., JSON files) are ingested and digitally cleaned.
- Content is segmented into logical chunks (e.g., paragraphs).
- Each chunk is assigned a unique semantic "signature" (vector embedding) representing its meaning.
- These chunks and their embeddings are stored in a specialized vector database (e.g., Chroma) for efficient similarity searching.
Real-time Information Retrieval (Per Question):
- The user's question is also converted into a semantic embedding.
- The system compares this question embedding against the stored document chunk embeddings to find the closest matches.
- The most relevant text segments are selected.
Response Generation:
- A comprehensive prompt is constructed for the LLM, combining the original question with the retrieved relevant text segments.
- The LLM generates a natural language response based on this contextualized information.
- Critically, the response is grounded in information drawn directly from the IWAC documents, and the sources used are listed with links for verification.
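
The preparation and retrieval stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chatbot's actual code: the word-count "embedding" is a toy stand-in for a real embedding model (it matches shared words, not meaning), the in-memory list stands in for a vector database such as Chroma, and the names `chunk_document`, `embed`, `cosine`, and `InMemoryIndex` are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_document(text):
    """Split a document into paragraph chunks on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def embed(text):
    """Toy embedding: lowercase word counts.

    Assumption: a real system would call an embedding model here so that
    texts with similar meaning but different vocabulary end up close together.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.entries = []  # (source_id, chunk_text, embedding)

    def add_document(self, source_id, text):
        # Stage 1: chunk, embed, and store (done once per document).
        for chunk in chunk_document(text):
            self.entries.append((source_id, chunk, embed(chunk)))

    def search(self, question, k=2):
        # Stage 2: embed the question and rank stored chunks by similarity.
        q = embed(question)
        scored = [(cosine(q, emb), sid, chunk) for sid, chunk, emb in self.entries]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


# Usage with invented placeholder text (not taken from the Collection).
index = InMemoryIndex()
index.add_document(
    "doc-1",
    "The mosque was inaugurated in the capital.\n\nPilgrims travelled to Mecca that year.",
)
index.add_document("doc-2", "A new school opened near the market.")
for score, source_id, chunk in index.search("Which mosque was inaugurated?", k=1):
    print(source_id, round(score, 2), chunk)
```

Swapping `embed` for a real embedding model and `InMemoryIndex` for a vector database changes the quality and scale of retrieval, but the flow stays the same: store chunk embeddings once, then compare each question against them.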

The quality of the instructional prompt given to the chatbot is fundamental to its performance.
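
To make the role of the instructional prompt concrete, here is a minimal sketch of how a question and retrieved passages might be combined before being sent to an LLM. The function name `build_prompt`, the instruction wording, and the example URL are all assumptions made for illustration; the prompt actually used by the IWAC Chatbot is not reproduced here.

```python
def build_prompt(question, retrieved):
    """Assemble an LLM prompt from a question and retrieved passages.

    `retrieved` is a list of (source_id, url, text) tuples. The instruction
    wording below is hypothetical, not the chatbot's real prompt.
    """
    lines = [
        "Answer the question using only the numbered sources below.",
        "Cite sources by number, and say so if they do not contain the answer.",
        "",
        "Sources:",
    ]
    for number, (source_id, url, text) in enumerate(retrieved, start=1):
        # Numbering each passage lets the response point back to its source.
        lines.append(f"[{number}] {source_id} ({url}): {text}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Usage with invented placeholder data.
prompt = build_prompt(
    "Which mosque was inaugurated?",
    [("doc-1", "https://example.org/doc-1", "The mosque was inaugurated in the capital.")],
)
print(prompt)
```

Keeping the instructions, the sources, and the question in clearly separated parts of the prompt makes it easier to adjust one of them, for example tightening the citation rule, without touching the others.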

Skills