Dataset Overview
How much text we've collected, which sources it comes from, and what passed quality review.
Processing Summary
Ingestion Velocity
Records per orchestrated run (latest 8 ingestion cycles)
Source Balance
Actual share vs target mix
Pipeline Efficiency
Retention from collected to curated
BBC: ✓ Complete
HuggingFace: ✓ Complete
Språkbanken: ✓ Complete
TikTok: ⏳ Ingesting
Data Sources
Five sources, each with distinct licenses, acquisition methods, and quality profiles.
Active Sources
Planned Adds
Pipeline Types
Coverage vs Target
Pipeline Stage Allocation
Share of records at each collection stage per source
Acquisition Method Treemap
Relative volume by acquisition approach
Collection Timeline
Quota Status
Daily quota usage and limits per source
Collection Runs
Orchestration run history and statistics
| Run ID | Timestamp | Records | Quota Hits |
|---|---|---|---|
Source Readiness Checklist
Filter by acquisition method or SLA to focus on workstreams.
| Source | Records | Share | Avg Length | Acquisition | Refresh SLA | Owner | Stage | Last Updated |
|---|---|---|---|---|---|---|---|---|
Integration Roadmap
Upcoming sources and sunset decisions
Data Quality
We apply 5 filters to every collected record. Only text that passes all criteria enters the corpus.
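The "passes all criteria" rule can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the page does not name the five filters, so the two shown here (`min_length` and `symbol_ratio`) and their thresholds are hypothetical stand-ins.

```python
from collections import Counter
from typing import Callable

# Hypothetical filters: the real five are not listed on this page.
def long_enough(text: str) -> bool:
    """Keep records with at least five whitespace-separated tokens."""
    return len(text.split()) >= 5

def not_mostly_symbols(text: str) -> bool:
    """Keep records where at least 80% of characters are letters or spaces."""
    letters = sum(ch.isalpha() or ch.isspace() for ch in text)
    return letters / max(len(text), 1) >= 0.8

FILTERS: dict[str, Callable[[str], bool]] = {
    "min_length": long_enough,
    "symbol_ratio": not_mostly_symbols,
}

def curate(records):
    """Return (kept records, rejection counts keyed by first failing filter)."""
    kept = []
    rejections = Counter()
    for text in records:
        # A record enters the corpus only if every filter accepts it.
        failed = next(
            (name for name, check in FILTERS.items() if not check(text)), None
        )
        if failed is None:
            kept.append(text)
        else:
            rejections[failed] += 1
    return kept, rejections
```

Recording the first failing filter per record is one way to produce the kind of rejection counts a "Top Filters" view could show; a real pipeline might record every failing filter instead.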
Quality Criteria
Filter families by source with share of rejections
Retention Funnel
Records collected to curated corpus, with quality filter annotations
Retention funnel updates after each run.
Filter Drilldown
Ranked rejections with policy context
Top Filters
Rejections by filter reason
Trend Benchmarks
Track retention stability by run
Quality Trend
Record-weighted pass rate over recent runs
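One plausible reading of "record-weighted" is that each run's pass rate is total records passed divided by total records collected, so large sources count for more than small ones. The sketch below assumes that reading; the function name and the sample numbers are illustrative only.

```python
def record_weighted_pass_rate(per_source):
    """per_source: list of (collected, passed) pairs for one run."""
    collected = sum(c for c, _ in per_source)
    passed = sum(p for _, p in per_source)
    return passed / collected if collected else 0.0

def unweighted_pass_rate(per_source):
    """Plain mean of per-source pass rates, shown for comparison."""
    rates = [p / c for c, p in per_source if c]
    return sum(rates) / len(rates) if rates else 0.0

# Made-up run: one large clean source, one small noisy source.
run = [(1000, 900), (100, 50)]
weighted = record_weighted_pass_rate(run)   # 950 / 1100
unweighted = unweighted_pass_rate(run)      # (0.9 + 0.5) / 2
```

The two numbers diverge whenever source sizes differ, which is why the weighting choice matters for a trend chart.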
Success vs Quality
Per-source stability across runs
Quality Issues
Active alerts and recommended actions
| Severity | Source | Alert | Recommendation |
|---|---|---|---|
Quality Decisions
Active waivers and manual review queues
What's Next
Data collection is complete. Here's the plan for building and releasing the dialect classifier.
Data Collection
25,959 records collected from 5 sources. Quality filters applied. Curated corpus ready for preprocessing.
Preprocessing
Tokenization, train/validation/test splitting (80/10/10), and dialect label assignment for annotated samples.
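A minimal sketch of an 80/10/10 split using only the standard library. The seed value and the choice of a plain shuffle are assumptions; the project may well stratify by dialect or by source instead.

```python
import random

def split_80_10_10(records, seed=13):
    """Shuffle deterministically, then cut into train/validation/test."""
    items = list(records)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder, so nothing is dropped
    return train, val, test

train, val, test = split_80_10_10(range(1000))
```

Giving the remainder to the test set means the three parts always cover every record exactly once, even when the total is not divisible by ten.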
Model Training
Fine-tune XLM-R and multilingual BERT as dialect classification baselines. Target: weighted F1 ≥ 0.80 per dialect class (Northern, Southern, Central).
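To make the target concrete, the sketch below computes per-class F1 and the support-weighted average for the three dialect labels. Weighted F1 is a single aggregate score, so the per-class values are returned alongside it and either reading of the 0.80 target can be checked. The labels come from the text above; the predictions are made up for illustration.

```python
from collections import Counter

def f1_report(y_true, y_pred):
    """Return (per-class F1 dict, support-weighted F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    # Weight each class's F1 by its share of the true labels.
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return per_class, weighted

y_true = ["Northern", "Northern", "Southern", "Southern", "Central", "Central"]
y_pred = ["Northern", "Southern", "Southern", "Southern", "Central", "Northern"]
per_class, weighted_f1 = f1_report(y_true, y_pred)
meets_target = all(score >= 0.80 for score in per_class.values())
```

In practice a library routine such as scikit-learn's `f1_score` with `average="weighted"` would likely be used; the hand-rolled version here only shows what that number means.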
Evaluation & Error Analysis
Per-dialect confusion matrices, error analysis on misclassified samples, and bias audit across data sources.
Public Release
Model card, HuggingFace Hub release, citation guide, and full reproducibility documentation. All training code, data, and model weights will be open-sourced.
Help Us Get There Faster
The corpus needs more text and the models need annotated data. Here's where you can help:
- Linguists & native speakers: Help annotate dialect samples. Contact us to get involved in the labeling effort.
- Researchers: Cite this dataset in your work and share findings with the community.
- Developers: Contribute to the pipeline, add new data sources, or improve quality filters.