Fully Open Source

Somali Dialect Classifier

An Open, End-to-End Somali Dialect ML Platform

The Somali NLP Initiative coordinates data acquisition, quality pipelines, annotation planning, modeling, and deployment pathways so researchers and partners can build dialect-aware language tools from a single open source foundation.

0 Total Records
0 Data Sources
0 Collection Methods
0 Insights Live
1. Data Ingestion (Current • Complete)
2. Preprocessing (Upcoming)
3. Model Training (Upcoming)
4. Evaluation (Upcoming)
5. Deployment (Upcoming)
6. Monitoring (Upcoming)
Overall Progress: 17% Complete
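
One plausible reading of the 17% figure is that it is the share of completed stages (1 of 6), rounded to a whole percent. The sketch below assumes that rule; the `STAGES` list and `overall_progress` function are illustrative names, not part of the project's code.

```python
# Stage names and statuses mirror the tracker above.
STAGES = [
    ("Data Ingestion", "complete"),
    ("Preprocessing", "upcoming"),
    ("Model Training", "upcoming"),
    ("Evaluation", "upcoming"),
    ("Deployment", "upcoming"),
    ("Monitoring", "upcoming"),
]


def overall_progress(stages):
    """Return the share of completed stages as a whole-number percentage."""
    if not stages:
        return 0
    done = sum(1 for _, status in stages if status == "complete")
    return round(100 * done / len(stages))


print(f"Overall Progress: {overall_progress(STAGES)}% Complete")
```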

The Journey From Raw Data to AI-Ready Corpus

Low-resource languages like Somali face a critical barrier in NLP: a lack of quality training data. We built an automated, production-grade data pipeline that systematically collects, cleans, and unifies Somali text from the web's most reliable sources. This is the foundation for advancing Somali language AI and making dialect classification accessible to researchers worldwide.

0 Records Collected

Through systematic crawling and API integration, we're assembling an open Somali language corpus, providing researchers with access to diverse linguistic data.

0% Avg Quality Rate

Filtering, validation, and source-specific quality checks run across all pipeline types to keep low-quality records out of the corpus.

0 Data Sources

By combining encyclopedic knowledge, journalism, ML datasets, and academic corpora, we ensure comprehensive coverage of Somali dialects, registers, and domains.

Open Source Impact

Released under an open license, this corpus democratizes access to Somali NLP research, enabling global collaboration and accelerating innovation for low-resource languages.

Data Ingestion Overview


Total Records: 0
Avg Quality Rate: 0%
Success Rate: 0%
Active Sources: 0

Processing Summary

Ingestion Velocity

Records per orchestrated run (latest 8 ingestion cycles)

Latest total: 0
Waiting for pipeline runs…

Source Balance

Actual share vs target mix

Largest gap
Targets reflect planned dataset volumes from project documentation.
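
As a rough sketch of how an actual-versus-target comparison could be computed, the function below takes record counts per source and target shares, then reports the gap for each source and the source with the largest absolute gap. The function name, the counts, and the target shares are all hypothetical; the real targets come from the project documentation mentioned above.

```python
def source_balance(actual_counts, target_shares):
    """Compare each source's actual share of records with its target share.

    Returns (rows, largest) where rows maps source -> (actual, target, gap),
    gap = actual share - target share (fractions of 1), and largest is the
    source with the biggest absolute gap (None if there are no targets).
    """
    total = sum(actual_counts.values())
    rows = {}
    for source, target in target_shares.items():
        actual = actual_counts.get(source, 0) / total if total else 0.0
        rows[source] = (actual, target, actual - target)
    largest = max(rows, key=lambda s: abs(rows[s][2])) if rows else None
    return rows, largest


# Hypothetical numbers for illustration only; not project data.
counts = {"Wikipedia": 700, "BBC": 100, "HuggingFace": 200}
targets = {"Wikipedia": 0.40, "BBC": 0.30, "HuggingFace": 0.30}
rows, largest = source_balance(counts, targets)
```

A positive gap means the source is over-represented relative to its target, a negative gap that it is under-represented.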

Pipeline Efficiency

Retention through discovery → silver dataset

Silver yield
    Retention values update after each orchestration run.
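
A minimal sketch of a retention calculation, assuming retention is measured as each stage's record count divided by the count at discovery. The stage names follow the Discovery → Extraction → Silver flow named on this page; the function name and the counts are hypothetical.

```python
def retention(stage_counts):
    """Given ordered (stage, record_count) pairs, return per-stage retention.

    Each entry is (stage, count, share_of_first_stage). When the final stage
    is the silver dataset, the last entry's share is the silver yield.
    """
    if not stage_counts:
        return []
    first = stage_counts[0][1]
    return [
        (stage, count, count / first if first else 0.0)
        for stage, count in stage_counts
    ]


# Hypothetical counts for illustration only.
funnel = retention([("discovery", 1000), ("extraction", 800), ("silver", 600)])
silver_yield = funnel[-1][2]
```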

    Source | Status | Records | Percentage | Quality Rate
    Wikipedia | Complete | 0 | 0% | 0%
    BBC | Complete | 0 | 0% | 0%
    HuggingFace | Complete | 0 | 0% | 0%
    Språkbanken | Complete | 0 | 0% | 0%
    TikTok | Ingesting | 0 | 0% | 0%

    Source Portfolio & ETL Readiness


    Active Sources

    Planned Adds

    Pipeline Types

    Coverage vs Target

    Pipeline Stage Allocation

    Share of records at Discovery → Extraction → Silver per source

    Acquisition Method Treemap

    Relative volume by acquisition approach

    Ingestion Cadence


    Quota Status

    Daily quota usage and limits per source
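
To illustrate what a per-source quota check might look like, here is a small sketch. It assumes each source has a daily record limit and a running count of records used; the function name, the dictionary keys, and the numbers are hypothetical.

```python
def quota_status(used, limit):
    """Summarise daily quota usage for one source.

    Returns the remaining allowance, the fraction of the quota used
    (capped at 1.0), and whether the limit has been reached.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return {
        "remaining": max(limit - used, 0),
        "used_fraction": min(used / limit, 1.0),
        "exhausted": used >= limit,
    }


# Hypothetical (used, limit) pairs; real values would come from the pipeline.
usage = {"BBC": (350, 350), "TikTok": (40, 100)}
status = {
    source: quota_status(used, limit)
    for source, (used, limit) in usage.items()
}
```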

    Recent Ingestion Runs

    Orchestration run history and statistics

    Run ID | Timestamp | Records | Quota Hits

    Source Readiness Checklist

    Filter by acquisition method or SLA to focus on workstreams.

    Source Readiness Comparison: volume, quality, and update status for all data sources
    Source | Records | Share | Avg Length | Acquisition | Refresh SLA | Owner | Stage | Last Updated

    Integration Roadmap

    Upcoming sources and sunset decisions

    Quality Guardrails & Retention


    Silver Yield

    Awaiting data.

    Records Rejected

    Awaiting data.

    Non-Somali Blocks

    Awaiting data.

    Dedup Yield

    Awaiting data.

    Guardrail Coverage

    Filter families by source with share of rejections

    Retention Funnel

    Discovery to Silver with guardrail annotations

    Retention funnel updates after each run.

    Filter Drilldown

    Ranked rejections with policy context

    Top Filters

    Rejections by filter reason
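
A ranking of rejections by filter reason can be sketched with `collections.Counter`. The input shape, a list of `(source, reason)` pairs, and the reason names below are assumptions made for illustration; the project's actual filter names are not listed on this page.

```python
from collections import Counter


def top_filters(rejections, n=5):
    """Rank filter reasons by rejection count.

    rejections: iterable of (source, reason) pairs.
    Returns up to n tuples of (reason, count, share_of_all_rejections).
    """
    counts = Counter(reason for _, reason in rejections)
    total = sum(counts.values())
    return [(reason, c, c / total) for reason, c in counts.most_common(n)]


# Hypothetical rejection log; the reason names are placeholders.
rejected = [
    ("BBC", "too_short"),
    ("Wikipedia", "non_somali"),
    ("BBC", "too_short"),
    ("TikTok", "duplicate"),
]
ranking = top_filters(rejected)
```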

    Exception Feed

    Active alerts and recommended actions

    Severity | Source | Alert | Recommendation

    Policy & Waiver Log

    Active waivers and manual review queues

    Pipeline Performance Metrics

    Throughput
    Duration
    Quality Pass Rate
    Retries

    Throughput Trend

    Records processed per minute over the last 10 runs with 7-day moving average
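
The trend described above could be computed roughly as follows. This sketch assumes each run records a finish time, a record count, and a duration in minutes, and that the 7-day moving average is the mean throughput of all runs finishing within the 7 days up to and including a given run. The tuple layout and function name are hypothetical.

```python
from datetime import datetime, timedelta


def throughput_trend(runs, window_days=7, last_n=10):
    """Compute records per minute and a trailing moving average per run.

    runs: list of (finished_at, records, duration_minutes), in any order.
    Returns (finished_at, records_per_minute, moving_average) tuples for
    the last `last_n` runs, oldest first.
    """
    ordered = sorted(runs, key=lambda r: r[0])
    # Skip runs with a zero duration to avoid division by zero.
    rates = [(ts, rec / mins) for ts, rec, mins in ordered if mins > 0]
    window = timedelta(days=window_days)
    trend = []
    for ts, rate in rates:
        recent = [r for t, r in rates if ts - window < t <= ts]
        trend.append((ts, rate, sum(recent) / len(recent)))
    return trend[-last_n:]


# Hypothetical runs for illustration only.
trend = throughput_trend([
    (datetime(2024, 1, 1), 600, 10),
    (datetime(2024, 1, 3), 1200, 10),
    (datetime(2024, 1, 12), 300, 10),
])
```

In this example the third run falls more than seven days after the first two, so its moving average covers only itself.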

    Quality Pass Rate by Source

    Percentage of records passing validation checks per data source
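
A per-source pass rate is a simple grouped ratio. The sketch below assumes each record carries its source name and a boolean validation result; both the input shape and the function name are hypothetical.

```python
def pass_rate_by_source(records):
    """Return source -> percentage of records that passed validation.

    records: iterable of (source, passed) pairs, where passed is a bool.
    """
    totals, passed = {}, {}
    for source, ok in records:
        totals[source] = totals.get(source, 0) + 1
        passed[source] = passed.get(source, 0) + (1 if ok else 0)
    return {s: 100 * passed[s] / totals[s] for s in totals}


# Hypothetical validation results for illustration only.
rates = pass_rate_by_source([
    ("BBC", True),
    ("BBC", False),
    ("Wikipedia", True),
])
```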

    Stage Latency Waterfall

    Time spent across discovery, fetch, extract, quality, and write stages
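
A waterfall view needs a start and end offset for each stage. The sketch below assumes the five stages named above run one after another with no overlap, which may not hold for a concurrent pipeline; the function name and the durations are hypothetical.

```python
STAGE_ORDER = ("discovery", "fetch", "extract", "quality", "write")


def waterfall(durations):
    """Return (stage, start, end) offsets in seconds for back-to-back stages.

    durations: dict mapping stage name -> seconds spent in that stage.
    Stages missing from the dict are treated as taking zero seconds.
    """
    rows, start = [], 0.0
    for stage in STAGE_ORDER:
        end = start + durations.get(stage, 0.0)
        rows.append((stage, start, end))
        start = end
    return rows


# Hypothetical stage durations in seconds.
rows = waterfall(
    {"discovery": 2, "fetch": 10, "extract": 5, "quality": 3, "write": 1}
)
```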

    Pipeline Stage Durations

    Breakdown of time spent in each pipeline stage for the latest run

    Source SLA Monitor

    Live throughput vs SLA for each ingestion source

    Run Timeline & Backlog

    Recent orchestrations with duration, throughput, and retries

    Pipeline Run History

    Duration and retry status for the last 10 pipeline executions

    Retry & Error Heatmap

    Sources vs error types to spotlight hotspots

    Resource Utilization

    Worker concurrency, queue depth, and bandwidth

    Runbook Alerts

    Operator actions required for upcoming runs

    Severity | Scope | Alert | Recommendation

    Observation Log

    Known throttles and operational notes

    Technical Documentation

    Access technical reports, data schemas, API documentation, and export options.

    Metrics Report (JSON)

    Complete metrics data including success rates, processing times, and quality statistics for all sources.

    Download JSON

    Data Schema

    Comprehensive schema documentation for the Somali dialect corpus including field definitions and data types.

    View Schema

    API Documentation

    REST API endpoints for programmatic access to the corpus, including authentication and rate limits.

    View API Docs

    Export Formats

    Download the corpus in multiple formats: CSV, JSON, Parquet, or HuggingFace Datasets format.
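
For readers who want to convert records between formats themselves, here is a minimal sketch covering CSV and JSON using only the Python standard library. The record fields (`text`, `source`) are hypothetical placeholders, not the corpus schema. Parquet and HuggingFace Datasets output need third-party libraries and are not shown.

```python
import csv
import io
import json


def to_csv(records, fields):
    """Serialise a list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records):
    """Serialise a list of dicts to JSON, keeping non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


# Hypothetical records; the field names are assumptions, not the real schema.
sample = [
    {"text": "Nabad", "source": "Wikipedia"},
    {"text": "Soo dhawoow", "source": "BBC"},
]
csv_text = to_csv(sample, ["text", "source"])
json_text = to_json(sample)
```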

    Export Options

    Usage Guide

    Step-by-step guide for loading and using the corpus in popular ML frameworks like PyTorch and TensorFlow.

    Read Guide

    Citation

    Academic citation information for referencing this corpus in research papers and publications.

    Copy Citation

    Help Build the Future of Somali NLP

    The Somali Dialect Classifier is an open-source project that thrives on community contributions. Whether you're a developer, linguist, researcher, or simply passionate about Somali language technology, there are many ways to get involved and make an impact.

    Submit Text

    Share Somali text samples, articles, or documents to expand our corpus.

    Improve Pipeline

    Contribute code to enhance data collection, processing, or quality checks.

    Add Sources

    Identify and integrate new high-quality Somali language data sources.

    Research

    Use the corpus for research and share insights on Somali dialects.

    View on GitHub