Open Source · Open Data

Somali NLP Initiative

Building the dataset that Somali language AI has been missing.

Somali is spoken by 25 million people across the Horn of Africa and the diaspora, yet almost no AI language tools support it. We're assembling an open, high-quality text corpus and a modeling pipeline to train the first open Somali dialect classifier.

Records Collected: across all sources
Sources Active: Wikipedia, BBC, HuggingFace, Språkbanken, TikTok
In Curated Corpus: passed all quality filters
Quality Pass Rate: share retained after review

Dataset Overview

How much text we've collected, which sources it comes from, and what passed quality review.

Records Collected: 0
Quality Pass Rate: 0%
Ingestion Success Rate: 0%
Active Sources: 0

Processing Summary

Ingestion Velocity

Records per orchestrated run (latest 8 ingestion cycles)

Latest total
0
Waiting for pipeline runs…

Source Balance

Actual share vs target mix

Largest gap
Targets reflect planned dataset volumes from project documentation.
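
As a rough illustration of the "actual share vs target mix" comparison, the sketch below computes each source's share of collected records, subtracts its target share, and reports the source with the largest absolute gap. The function name `source_balance` and every count and target in it are invented for the example; they are not the project's real figures or code.

```python
def source_balance(actual_counts, target_share):
    """Return (gap per source, source with the largest absolute gap).

    gap = actual share of records - target share, both as fractions of 1.
    """
    total = sum(actual_counts.values())
    gaps = {}
    for source, target in target_share.items():
        actual = actual_counts.get(source, 0) / total if total else 0.0
        gaps[source] = actual - target
    largest = max(gaps, key=lambda s: abs(gaps[s]))
    return gaps, largest


# Illustrative numbers only (assumed, not taken from the dashboard).
example_counts = {"Wikipedia": 600, "BBC": 250, "HuggingFace": 150}
example_targets = {"Wikipedia": 0.5, "BBC": 0.3, "HuggingFace": 0.2}

example_gaps, example_largest = source_balance(example_counts, example_targets)
for name, gap in example_gaps.items():
    print(f"{name}: {gap:+.1%}")
print("Largest gap:", example_largest)
```

A positive gap means the source is over-represented relative to its target; a negative gap means it is under-represented.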

Pipeline Efficiency

Retention from collected to curated

Corpus yield
    Retention values update after each orchestration run.

    Source | Status | Records | Percentage | Quality Rate
    Wikipedia | Complete | 0 | 0% | 0%
    BBC | Complete | 0 | 0% | 0%
    HuggingFace | Complete | 0 | 0% | 0%
    Språkbanken | Complete | 0 | 0% | 0%
    TikTok | Ingesting | 0 | 0% | 0%

    Data Sources

    Five sources, each with distinct licenses, acquisition methods, and quality profiles.

    Active Sources

    Planned Adds

    Pipeline Types

    Coverage vs Target

    Pipeline Stage Allocation

    Share of records at each collection stage per source

    Acquisition Method Treemap

    Relative volume by acquisition approach

    Collection Timeline

    Quota Status

    Daily quota usage and limits per source

    Collection Runs

    Orchestration run history and statistics

    Run ID | Timestamp | Records | Quota Hits

    Source Readiness Checklist

    Filter by acquisition method or SLA to focus on workstreams.

    Source Readiness Comparison: volume, quality, and update status for all data sources
    Source | Records | Share | Avg Length | Acquisition | Refresh SLA | Owner | Stage | Last Updated

    Integration Roadmap

    Upcoming sources and sunset decisions

    Data Quality

    We apply five filters to every collected record. Only text that passes all five enters the corpus.
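
A minimal sketch of an all-filters-must-pass curation step, under stated assumptions. The page does not list the individual filters, so the two shown here (`min_length` and `not_duplicate`) and the five-word threshold are hypothetical stand-ins. The point is only the pass-all logic and the per-filter rejection counts that a retention funnel could be built from.

```python
import hashlib


def min_length(text, seen):
    # Hypothetical threshold: reject fragments shorter than five words.
    return len(text.split()) >= 5


def not_duplicate(text, seen):
    # Exact-duplicate check on case- and whitespace-normalized text.
    normalized = " ".join(text.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True


# Order matters: the first failing filter is recorded as the rejection reason.
FILTERS = [("min_length", min_length), ("not_duplicate", not_duplicate)]


def curate(records):
    """Keep only records that pass every filter; count rejections by filter."""
    seen = set()
    kept, rejected = [], {}
    for text in records:
        for name, check in FILTERS:
            if not check(text, seen):
                rejected[name] = rejected.get(name, 0) + 1
                break
        else:
            # The inner loop finished without a break: all filters passed.
            kept.append(text)
    return kept, rejected


# Placeholder strings, not real corpus text.
sample = ["a b c d e", "A  b c d e", "too short", "f g h i j k"]
kept, rejected = curate(sample)
print(f"pass rate: {len(kept) / len(sample):.0%}", rejected)
```

Recording only the first failing filter keeps the rejection counts additive, so they sum to the number of rejected records.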

    Corpus Yield

    No runs yet

    Records Rejected

    No runs yet

    Non-Somali Blocks

    No runs yet

    Dedup Yield

    No runs yet

    Quality Criteria

    Filter families by source with share of rejections

    Retention Funnel

    Records collected to curated corpus, with quality filter annotations

    Retention funnel updates after each run.

    Filter Drilldown

    Ranked rejections with policy context

    Top Filters

    Rejections by filter reason

    Quality Issues

    Active alerts and recommended actions

    Severity | Source | Alert | Recommendation

    Quality Decisions

    Active waivers and manual review queues

    What's Next

    The initial round of data collection is complete. Here's the plan for building and releasing the dialect classifier.

    Complete

    Data Collection

    25,959 records collected from 5 sources. Quality filters applied. Curated corpus ready for preprocessing.

    Completed May 2026 · View sources →

    Up Next

    Preprocessing

    Tokenization, train/validation/test splitting (80/10/10), and dialect label assignment for annotated samples.
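
The 80/10/10 split could look something like the sketch below: shuffle with a fixed seed for reproducibility, then slice. The function name `split_80_10_10` and the default seed are assumptions for illustration, not the project's actual preprocessing code.

```python
import random


def split_80_10_10(records, seed=0):
    """Shuffle deterministically, then slice into train/validation/test."""
    items = list(records)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test split
    return train, val, test


train, val, test = split_80_10_10(range(1000), seed=13)
print(len(train), len(val), len(test))
```

Once dialect labels exist, a split stratified by label may be preferable so that each partition keeps a similar dialect mix. That is a design suggestion, not something the roadmap specifies.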

    Target: after corpus reaches 50,000 records

    Planned

    Model Training

    Fine-tune XLM-R and multilingual BERT as dialect classification baselines. Target: weighted F1 ≥ 0.80 across the three dialect classes (Northern, Southern, Central).
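
For reference, weighted F1 is the average of the per-class F1 scores, weighted by how many true examples each class has, so it yields one number for the whole classifier. The sketch below computes both the per-class scores and the weighted average in plain Python. The helper names and the toy labels are made up for the example and are not project results.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred, labels):
    """Average of per-class F1, weighted by each class's true-example count."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred, labels)
    total = sum(support[label] for label in labels)
    if total == 0:
        return 0.0
    return sum(scores[label] * support[label] for label in labels) / total


# Toy predictions (invented): N = Northern, S = Southern, C = Central.
labels = ["N", "S", "C"]
y_true = ["N", "N", "N", "N", "S", "S", "S", "C", "C", "C"]
y_pred = ["N", "N", "N", "S", "S", "S", "C", "C", "C", "N"]

print(per_class_f1(y_true, y_pred, labels))
print(round(weighted_f1(y_true, y_pred, labels), 3))
```

Because the weighted average can hide a weak minority class, it is worth reporting the per-class scores next to it.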

    Planned Q3 2026

    Planned

    Evaluation & Error Analysis

    Per-dialect confusion matrices, error analysis on misclassified samples, and bias audit across data sources.
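
A small sketch of the two evaluation artifacts named above: a confusion matrix over dialect labels and a per-source accuracy breakdown as a first pass at a source bias audit. The function names, the toy labels, and the source assignments are hypothetical and only illustrate the bookkeeping.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy_by_source(y_true, y_pred, sources):
    """Fraction of correct predictions for each data source."""
    correct, total = {}, {}
    for t, p, s in zip(y_true, y_pred, sources):
        total[s] = total.get(s, 0) + 1
        correct[s] = correct.get(s, 0) + (1 if t == p else 0)
    return {s: correct[s] / total[s] for s in total}


# Toy data (invented): N = Northern, S = Southern, C = Central.
labels = ["N", "S", "C"]
y_true = ["N", "N", "N", "N", "S", "S", "S", "C", "C", "C"]
y_pred = ["N", "N", "N", "S", "S", "S", "C", "C", "C", "N"]
sources = ["Wikipedia"] * 5 + ["BBC"] * 5

for label, row in zip(labels, confusion_matrix(y_true, y_pred, labels)):
    print(label, row)
print(accuracy_by_source(y_true, y_pred, sources))
```

Off-diagonal cells show which dialect pairs get confused, and a large accuracy difference between sources would suggest the model is picking up source style rather than dialect.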

    Planned Q4 2026

    Planned

    Public Release

    Model card, HuggingFace Hub release, citation guide, and full reproducibility documentation. All training code, data, and model weights open-sourced.

    Planned Q1 2027

    Help Us Get There Faster

    The corpus needs more text, and the dialect samples need annotation. Here's where you can help:

    • Linguists & native speakers: Help annotate dialect samples. Contact us to join the labeling effort.
    • Researchers: Cite this dataset in your work and share findings with the community.
    • Developers: Contribute to the pipeline, add new data sources, or improve quality filters.
    Get in Touch

    Help Build the Future of Somali NLP

    The Somali Dialect Classifier is an open-source project that thrives on community contributions. Whether you're a developer, linguist, researcher, or simply passionate about Somali language technology, there are many ways to get involved and make an impact.

    Submit Text

    Share Somali text samples, articles, or documents to expand our corpus.

    Improve Pipeline

    Contribute code to enhance data collection, processing, or quality checks.

    Add Sources

    Identify and integrate new high-quality Somali language data sources.

    Research

    Use the corpus for research and share insights on Somali dialects.

    View on GitHub