Data / AI

Edo Corpus

A machine-readable corpus of the Edo language, featuring conversations and folk songs with English translations, to combat language extinction.

Year :

2025

Industry :

Arts & Culture / Data

Client :

Ubini

Project Duration :

Ongoing

Problem :

Edo, a language spoken by over 2 million people in southern Nigeria, faces accelerating extinction in the digital age. The pool of monolingual Edo speakers is shrinking rapidly, particularly among younger generations who favour English in urban and semi-urban settings. Without machine-readable language data, Edo remains invisible to modern technology: no translation tools, no voice recognition, no digital learning resources exist.

This digital exclusion accelerates language death by disconnecting younger speakers and diaspora communities from their linguistic heritage. More critically, we're losing the culturally rich discourse preserved by elderly rural speakers: ceremonial language, traditional knowledge systems, proverbs, and folktales that cannot be recovered once these speakers are gone.

The window for capturing authentic, "pure" Edo (free from extensive code-switching) is closing within this generation.

Solution :

I am leading efforts to build a comprehensive parallel multimedia corpus of the Edo language: 2 million transcribed words (approximately 250 hours of recorded speech) with English translations by December 2027. This corpus prioritises recording elderly rural speakers and traditional specialists in high-yield contexts: funeral orations, ceremonial events, elder conversations, and traditional storytelling sessions.

Our sampling framework targets eight distinct language contexts, from traditional ceremonies with village elders to contemporary daily interactions, ensuring we capture Edo's full linguistic and cultural range.

The project operates in two phases: Phase 1 (through June 2026) focuses on recording 500,000 words from the highest-priority contexts, whilst Phase 2 completes the remaining 1.5 million words. Beyond the corpus itself, we'll create an Edo-English dictionary of 5,000 to 8,000 headwords and make all materials freely accessible through digital archives and community distribution, providing the foundation for future language technologies and educational resources.

Challenge :

Creating a robust endangered language corpus requires navigating significant logistical, financial, and temporal challenges. We're racing against time to record elderly speakers (70+) in rural areas before this irreplaceable linguistic knowledge disappears. Operating on a self-funded model with modest community donations, we require £250,000 (£0.125 per word) to cover recording equipment, volunteer coordination, speaker honoraria, transcription, translation, and digital archiving. The work demands coordinating 6 to 10 volunteers across rural communities, often in areas with limited infrastructure, whilst maintaining rigorous quality standards: 90% Edo content minimum, complete metadata, and clear audio.
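The quality threshold and the per-word budget figure above can be sketched in code. This is a minimal illustration rather than the project's actual tooling: the `Recording` class, its field names, and the `REQUIRED_METADATA` list are assumptions made for the example, and audio clarity is left out because it cannot be judged from word counts alone.

```python
from dataclasses import dataclass, field

# Threshold stated in the text: at least 90% of a recording must be Edo.
MIN_EDO_RATIO = 0.90

# Budget arithmetic from the text: £250,000 for 2 million words.
COST_PER_WORD_GBP = 250_000 / 2_000_000  # 0.125

# Hypothetical metadata fields; the project's real schema is not specified.
REQUIRED_METADATA = ("speaker_age", "location", "context", "recording_date")


@dataclass
class Recording:
    edo_words: int      # words transcribed as Edo
    other_words: int    # words in English or another language (code-switching)
    metadata: dict = field(default_factory=dict)


def edo_ratio(rec: Recording) -> float:
    """Share of transcribed words that are Edo (0.0 for an empty transcript)."""
    total = rec.edo_words + rec.other_words
    return rec.edo_words / total if total else 0.0


def passes_quality_gate(rec: Recording) -> bool:
    """True when every required metadata field is filled and the Edo share meets the minimum."""
    has_metadata = all(
        rec.metadata.get(key) not in (None, "") for key in REQUIRED_METADATA
    )
    return has_metadata and edo_ratio(rec) >= MIN_EDO_RATIO
```

A recording with 950 Edo words and 50 English words has a ratio of 0.95 and passes if its metadata is complete; one with 800 Edo words out of 1,000 falls below the threshold and would be flagged for review.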

Technical challenges include managing dialectal variations, time-aligning 500 hours of audio with transcripts, and preparing data in formats suitable for computational linguistics. We must also navigate cultural protocols, ensure ethical consent procedures, and build trust within communities. Our starting point is nine folk songs and three dialogues from our documentary film project; scaling from these initial resources to 2 million words is an enormous undertaking that depends entirely on community support and volunteer commitment.
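To make "time-aligned" and "formats suitable for computational linguistics" concrete, here is one possible shape for a single corpus segment, written out as JSON Lines. It is a sketch under assumptions: the `Segment` fields, the helper names, and the choice of JSON Lines are illustrative and not the project's confirmed format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Segment:
    recording_id: str   # hypothetical identifier for the source audio file
    start_s: float      # segment start, in seconds from the start of the audio
    end_s: float        # segment end, in seconds
    edo: str            # Edo transcription
    english: str        # English translation
    dialect: str        # dialect label, to keep variation traceable


def find_alignment_errors(segments: list[Segment]) -> list[str]:
    """Report segments with non-positive duration or that overlap the
    preceding segments of the same recording (input in time order)."""
    errors = []
    previous_end: dict[str, float] = {}
    for i, seg in enumerate(segments):
        if seg.end_s <= seg.start_s:
            errors.append(f"segment {i}: non-positive duration")
        last_end = previous_end.get(seg.recording_id, 0.0)
        if seg.start_s < last_end:
            errors.append(f"segment {i}: overlaps previous segment")
        previous_end[seg.recording_id] = max(last_end, seg.end_s)
    return errors


def to_jsonl(segments: list[Segment]) -> str:
    """Serialise one segment per line; ensure_ascii=False keeps diacritics readable."""
    return "\n".join(json.dumps(asdict(s), ensure_ascii=False) for s in segments)


# Placeholder text only, not real corpus data.
example = [
    Segment("rec-001", 0.0, 4.2, "<Edo transcript 1>", "<English translation 1>", "<dialect>"),
    Segment("rec-001", 4.2, 9.8, "<Edo transcript 2>", "<English translation 2>", "<dialect>"),
]
```

Keeping the transcription, translation, and timestamps together in one record means the same file can serve dictionary work, translation models, and speech tools without re-alignment.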

Summary :

Our Edo Language Corpus Project is a community-driven race against time to preserve one of Nigeria's most culturally significant languages through comprehensive digital documentation. With a goal of 2 million transcribed words by December 2027, we're capturing authentic Edo discourse, from ceremonial orations to grandmother-grandchild conversations, before the generation of fluent speakers is lost. Self-funded with community donations and supported by dedicated volunteers, we prioritise recording the highest-quality linguistic data: elderly rural speakers, traditional knowledge holders, and cultural specialists whose language represents Edo at its richest and most endangered.

The resulting parallel corpus, complete with English translations and cultural context, will enable the development of dictionaries, translation technologies, and educational resources whilst serving as a permanent archive of Edo's irreplaceable linguistic and cultural heritage. This is more than language preservation: it's safeguarding centuries of wisdom, identity, and expression for future generations.
