Data Collection & Processing Insights
Watch how raw web content transforms into structured language datasets. These interactive visualizations track source performance, processing speed, and quality metrics across 130,000+ Somali text records.
Collection Timeline: When Did We Gather Data?
Track collection velocity across Wikipedia, BBC Somali, HuggingFace, and Språkbanken. Larger bubbles show higher-volume periods; colors distinguish sources.
Which Sources Deliver the Most Data?
Radial comparison of total record contributions by source, with embedded trend sparklines showing collection momentum.
Dataset Growth: Cumulative Records Over Time
Watch the dataset expand as each source contributes records. Hover to see exact counts, dates, and cumulative totals at any point in the timeline.
Success Rate Trends: Performance Against 90% Baseline
Horizon charts reveal how each source performs against the 90% success threshold. Green areas exceed targets, red areas signal processing issues requiring attention.
Speed vs. Quality: How Sources Compare
Scatter plot correlating success rates with deduplication rates. Bubble size represents total records collected. Zoomed to actual data range for clarity.
Source Quality at a Glance: Uptime, Success, Records, Dedup
Quick health check across all sources. Each cell shows current value, status indicator, and sparkline trend. Color-coded for rapid diagnostics.
About This Project
The Somali Dialect Classifier addresses a fundamental challenge in natural language processing: building robust datasets for low-resource languages. While Somali is spoken by over 16 million people across the Horn of Africa and diaspora communities worldwide, it lacks the extensive datasets that power modern language technologies.
This project creates an automated, production-grade data pipeline that continuously aggregates, validates, and refines Somali text from diverse sources—enabling researchers and developers to build more accurate dialect classifiers, translation systems, and language models.
Key Features
- Multi-Source Aggregation: Combines Wikipedia, BBC Somali, HuggingFace corpora, and academic collections
- Quality-First Architecture: Comprehensive language detection, filtering, and validation at every stage
- Intelligent Deduplication: The MinHash LSH algorithm removes duplicates while preserving dialectal variations (see the sketch after this list)
- Production-Grade Pipeline: Bronze-silver-gold layered architecture ensures data quality and traceability
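For illustration, here is a minimal sketch of MinHash LSH deduplication using the open-source `datasketch` library; the 0.8 similarity threshold and word 3-shingles are assumptions chosen for the example, not necessarily the pipeline's exact configuration:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # hash permutations per signature (assumed setting)

def signature(text: str) -> MinHash:
    """Build a MinHash signature from word 3-shingles of the text."""
    words = text.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - 2, 1)):
        m.update(" ".join(words[i:i + 3]).encode("utf8"))
    return m

def deduplicate(records: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Keep each record unless it is a near-duplicate of one already kept."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for rec_id, text in records.items():
        sig = signature(text)
        if not lsh.query(sig):  # no near-duplicate indexed yet
            lsh.insert(rec_id, sig)
            kept.append(rec_id)
    return kept
```

Querying before inserting keeps the first occurrence and drops later near-duplicates; the threshold controls how similar two texts must be before they count as duplicates.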
Understanding the Metrics
Success rate measures how effectively our pipeline fetches and processes web content. It's calculated as the percentage of fetched URLs that successfully make it through our quality filters and validation checks.
Success Rate = (Records Successfully Processed / Total URLs Fetched) × 100%
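For example, if 1,000 URLs are fetched and 985 of them pass validation and filtering, the success rate is (985 / 1,000) × 100% = 98.5%.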
Important context: This metric focuses on fetch and processing success, not discovery-to-completion conversion. Many discovered URLs are intentionally filtered out before fetching—navigation elements, asset links, and known duplicates never enter the pipeline.
For file-based sources like Wikipedia dumps and academic corpora, you'll typically see 100% success rates because files are pre-validated. For web scraping and streaming sources, expect 95-100% success rates, with minor failures due to network issues or content that doesn't meet quality thresholds.
A success rate below 100% is normal and expected for web-based data collection. The internet is dynamic, and several legitimate issues can cause processing failures:
- 404 Errors: Content moved or deleted since discovery
- Access Restrictions: robots.txt rules or authentication requirements
- Network Issues: Timeouts, connection failures, or rate limiting
- Quality Filtering: Content too short, wrong language, or insufficient information
- Format Problems: Malformed HTML or unsupported content types
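As a hedged sketch of how these failure categories can be counted during fetching (the helper below is hypothetical and uses the `requests` library; it is not the pipeline's actual code):

```python
import requests

def fetch(url: str, error_counts: dict[str, int], timeout: float = 10.0) -> str | None:
    """Fetch a URL, tallying the failure category if anything goes wrong."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # raises on 4xx/5xx responses, e.g. 404
        return resp.text
    except requests.HTTPError as exc:
        key = f"http_{exc.response.status_code}"  # e.g. "http_404"
    except requests.Timeout:
        key = "timeout"
    except requests.ConnectionError:
        key = "connection_error"
    error_counts[key] = error_counts.get(key, 0) + 1
    return None
```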
Our current 98.5% success rate indicates excellent pipeline health—we're successfully processing the vast majority of targets while maintaining high quality standards.
Our metrics tell the story of data flowing through multiple quality gates:
- URLs Discovered: Total links found during source exploration
- URLs Fetched: Subset of discovered URLs that passed initial filters and were actually downloaded
- URLs Processed: Successfully validated and added to the dataset
- Records Written: Final deduplicated records stored in the gold tier
Success rate is calculated as (URLs Processed / URLs Fetched) × 100%, measuring how well we convert fetched content into valid records. Deduplication rate shows what percentage of processed records were duplicates of existing data, helping us understand content overlap across sources.
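A minimal sketch of these two calculations (the field names are assumptions that mirror the funnel above, not the pipeline's actual schema):

```python
def pipeline_rates(metrics: dict) -> dict:
    """Compute success and deduplication rates from funnel counts."""
    fetched = metrics["urls_fetched"]        # assumed field names
    processed = metrics["urls_processed"]
    written = metrics["records_written"]
    return {
        "success_rate": 100.0 * processed / fetched if fetched else 0.0,
        # Share of processed records dropped as duplicates before the gold tier
        "dedup_rate": 100.0 * (processed - written) / processed if processed else 0.0,
    }

# Example: 1,000 fetched, 985 processed, 930 written
print(pipeline_rates({"urls_fetched": 1000, "urls_processed": 985, "records_written": 930}))
# {'success_rate': 98.5, 'dedup_rate': 5.583756345177665}
```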
We provide multiple layers of diagnostic information to help you understand pipeline behavior and troubleshoot any issues:
- Quality Reports: Detailed breakdowns of HTTP status codes, error types, and failure patterns
- Metrics Files: Raw JSON data in `data/metrics/*_processing.json` with granular statistics
- Visualization Drill-Downs: Click on charts to explore source-specific performance
- Health Matrix: Real-time indicators showing uptime, latency, and error rates per source
Each metrics file contains `http_status_codes`, `error_types`, and `filter_reasons` fields that provide full transparency into what happened during processing. Use these to optimize your pipeline configuration or identify sources that may need attention.
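For example, a few lines of Python can aggregate failure patterns across all sources, assuming for illustration that `error_types` maps error names to counts:

```python
import json
from collections import Counter
from pathlib import Path

# Tally error types across every per-source metrics file.
errors = Counter()
for path in Path("data/metrics").glob("*_processing.json"):
    metrics = json.loads(path.read_text(encoding="utf-8"))
    errors.update(metrics.get("error_types", {}))

for error_type, count in errors.most_common(5):
    print(f"{error_type}: {count}")
```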
Contribute to Somali NLP
Join a growing community of researchers, developers, and language enthusiasts building the largest open-source Somali language dataset. Whether you're contributing code, identifying new data sources, or improving documentation—your expertise helps advance language technology for millions of Somali speakers worldwide.