Upload PLAN_AND_PROGRESS.md with huggingface_hub
PLAN_AND_PROGRESS.md CHANGED (+497 -54)
@@ -101,68 +101,511 @@ We use a balanced mix of datasets to cover various audio qualities and transcrip
---

## Progress Log

- **2026-01-11:** Initial project setup
- **2026-02-08:** Redesigned data pipeline for 4 real datasets (CSS10, FLEURS, VoxPopuli, Common Voice).
- **2026-02-
- **2026-02-
- **Deployment:** Uploaded all final models and documentation to [RASMUS/Finnish-ASR-Canary-v2](https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2).

---

## Training Data Analysis

This section documents the composition and length distribution of our training data (from `RASMUS/canary-finnish-asr-data`, accessed 2026-02-26).

### Dataset Summary

| Dataset | Samples | Mean Duration | Max Duration | Total Hours |
|---------|---------|--------------|-------------|-------------|
| **Common Voice v24** | 9,086 | 4.5s | 10.5s | 11.2h |
| **VoxPopuli** | 8,164 | 10.1s | 50.5s | 23.0h |
| **CSS10** | 3,226 | 7.7s | 20.2s | 6.9h |
| **FLEURS** | 2,704 | 11.7s | 43.2s | 8.8h |
| **TOTAL** | **23,180** | **7.8s** | **50.5s** | **~50h** |

### Duration Distribution (Training Set)

```
0–5s   : 33.3%  (7,725 samples)   ████████████████████████████████████████
5–10s  : 43.7%  (10,139 samples)  ████████████████████████████████████████████████████
10–15s : 15.0%  (3,473 samples)   ██████████████████
15–20s :  5.4%  (1,241 samples)   ██████
20–30s :  2.4%  (562 samples)     ███
>30s   :  0.2%  (40 samples)
```

**Key insight:** 77% of training samples are shorter than 10 seconds. The model has very little exposure to longer audio segments (only 0.2% are >30s). This has direct implications for long-form inference stability.
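
The distribution above can be recomputed from the `duration` field of each manifest line. The following is a minimal sketch, not the script that produced the table: the function names are hypothetical, and it assumes half-open buckets `[lo, hi)`, so a clip of exactly 30.0s lands in the last bucket.

```python
from collections import Counter

# Bucket edges in seconds, matching the distribution table above.
BUCKETS = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 30), (30, float("inf"))]


def bucket_label(duration):
    """Return the label of the half-open bucket [lo, hi) containing duration."""
    for lo, hi in BUCKETS:
        if lo <= duration < hi:
            return f">{lo}s" if hi == float("inf") else f"{lo}-{hi}s"
    raise ValueError(f"negative duration: {duration}")


def duration_histogram(durations):
    """Map each non-empty bucket label to its share of samples, in percent."""
    durations = list(durations)
    counts = Counter(bucket_label(d) for d in durations)
    return {label: 100.0 * n / len(durations) for label, n in counts.items()}


# Toy durations standing in for the `duration` values read from a manifest.
demo = [2.0, 4.5, 7.8, 9.9, 12.0, 31.0]
print(duration_histogram(demo))
```

In real use, `durations` would come from `json.loads(line)["duration"]` for every line of the training manifest.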

### Evaluation Set Durations

| Eval Set | Samples | Mean Duration | Max Duration |
|----------|---------|--------------|-------------|
| FLEURS | 918 | 13.0s | 33.7s |
| Common Voice | 1,554 | 5.1s | 10.5s |
| CSS10 | 170 | 7.5s | 10.2s |
| VoxPopuli | 430 | 10.6s | 47.5s |

---

## Number Handling Analysis

### Live Inference Results: Base vs Finetuned (2026-02-26)

We ran both models on 5 FLEURS test samples to determine each model's number output style.

| # | Scenario | Reference | Base Canary-v2 | Our Finetuned |
|---|----------|-----------|----------------|---------------|
| 1 | Spoken "sata" (hundred) | `yli sata vuotta` | `yli 100 vuotta` (differs from reference) | `yli 100 vuotta` (differs from reference) |
| 2 | Spoken "seitsemäntoista" (17) | `surmaten seitsemäntoista henkeä` | `surmaten 17 henkeä` (differs from reference) | `surmaten seitsemäntoista henkeä` (matches reference) |
| 3 | Digits in reference (15, 2011, 2017) | `15 metriä... 2011... 2017` | Correct | Correct |
| 4 | Abbreviation "jKr." (AD) | `400 jKr.` | `400 jälkeen Kristuksen` | `400 jälkeen Kristuksen` |
| 5 | Range "25–30" (en-dash U+2013) | `25–30 vuodella` | `25-30 vuodella` (ASCII hyphen) | `25 ⁇ 30 vuodella` (UNK token) |

**Key findings:**

1. **Base model outputs digits.** When the speaker says "sata" (hundred) or "seitsemäntoista" (seventeen), the base Canary-v2 outputs `100` and `17`. This is NVIDIA's built-in text normalisation: in every sample we tested, Canary output the digit form for numbers.

2. **Finetuning introduced inconsistency.** Our finetuning partially reversed this: for `seitsemäntoista` the finetuned model now outputs the written word (because FLEURS training transcripts used written-out numbers), but still outputs `100` for `sata`. This inconsistency is worse than either consistent policy.

3. **En-dash produces a UNK token in the finetuned model.** The character `–` (U+2013 en-dash) in `25–30` causes the finetuned model to emit `⁇` (SentencePiece UNK). The base model degrades gracefully to an ASCII hyphen `25-30`. This is a regression introduced by finetuning, likely because the en-dash was absent or inconsistently encoded in our training data.

4. **Abbreviations are expanded by both models.** `jKr.` → `jälkeen Kristuksen` in both; this is model behaviour, not a finetuning artifact.

### Policy Decision

**We want digit output** (not written-out Finnish number words). The base model's behaviour is correct here. The finetuned model regressed on consistency because our FLEURS training transcripts used written-out numbers.

### Training Data Issues Found

- Only **2.5% (578 / 23,180)** of training samples contain digit characters at all.
- FLEURS transcripts use written-out numbers (`sata vuotta`) while VoxPopuli and Common Voice use digits. This gives the model conflicting signal.
- En-dash (`–` U+2013) may be absent or mis-encoded in training manifests, causing UNK tokens at inference time.
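
Counts like the ones above can be gathered with a short audit over the manifest transcripts. This is a sketch under assumptions: `audit_texts` is a hypothetical helper, and the transcripts are assumed to have been read from the `text` field of each manifest line already.

```python
import re

DIGIT_RE = re.compile(r"\d")
DASHES = ("\u2013", "\u2014")  # en-dash, em-dash


def audit_texts(texts):
    """Count transcripts that contain digits and transcripts that contain en/em dashes."""
    texts = list(texts)
    with_digits = sum(1 for t in texts if DIGIT_RE.search(t))
    with_dashes = sum(1 for t in texts if any(d in t for d in DASHES))
    return {"total": len(texts), "with_digits": with_digits, "with_dashes": with_dashes}


# Toy transcripts; the third contains an en-dash written as an escape.
demo = ["yli sata vuotta", "15 metriä", "25\u201330 vuodella"]
print(audit_texts(demo))
```

Running the same function per source dataset would show which corpora contribute digits and which contribute dashes.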

### Action Plan: Numbers & UNK Token

#### Step 1: Normalise training transcripts to digit form

Run a pre-processing pass on `train_manifest.json` before the next training run:

- Convert Finnish written-out numbers to digits, e.g. `sata` → `100`, `seitsemäntoista` → `17`. Note that the Python library `num2words` (locale `fi`) only converts digits to words, so it cannot do this conversion directly; at best it can generate a reverse lookup table of number words.
- OR (simpler / safer): replace the FLEURS transcripts in the manifest with the FLEURS reference column that already has digits. FLEURS provides both `raw_transcription` and `transcription` columns; we currently use `raw_transcription`, which has written-out numbers.
- Target: **all numeric quantities consistently in digit form** across all four datasets.

#### Step 2: Fix en-dash encoding (ROOT CAUSE CONFIRMED)

**Confirmed via tokenizer inspection (2026-02-26):**

```python
m.tokenizer.text_to_ids("25–30")  # → [16053, 1125, 1128, 0, 1126, 1123]
                                  #   id 0 = UNK for the en-dash!
m.tokenizer.text_to_ids("25-30")  # → [16053, 1125, 1128, 16107, 1126, 1123]
                                  #   ASCII hyphen tokenises correctly
```

- **En-dash `–` (U+2013) and em-dash (U+2014) are NOT in the CanaryBPETokenizer vocabulary** (both map to UNK id 0).
- Training data contains **85 entries with en-dash** (83 FLEURS, 2 Common Voice). During training, the en-dash in the TARGET text was encoded as UNK, so the model learned to produce UNK for the corresponding speech sounds.
- **Fix: replace every en-dash and em-dash with ASCII hyphen `-` in all training transcripts** before the next training run. This is a one-line preprocessing step.

```python
# In manifest preprocessing:
text = text.replace('\u2013', '-').replace('\u2014', '-')
```

#### Step 3: Re-evaluate after normalisation

After normalising transcripts, re-run the 5-sample live inference test to verify:

- `sata vuotta` audio → model outputs `100 vuotta`
- `seitsemäntoista` audio → model outputs `17`
- `25–30` audio → model outputs `25-30` or `25–30` (no UNK)

---

## Long-Form Audio: Root Cause Analysis

Our test file `moo.wav` is **30 minutes** (1,800s) of continuous Finnish speech. This reveals a core gap vs. our finetuned Whisper model.

### How Canary-v2 Handles Long Audio (Natively)

- NVIDIA's Canary-v2 uses **dynamic chunking** with 1-second overlap between chunks.
- This is automatically triggered for audio longer than **40 seconds**.
- The model was pre-trained on a 1.7M-hour multilingual corpus with this chunking strategy baked in.

### Our Current Approach (`inference_vad.py`)

1. Silero VAD detects speech segments.
2. Segments are merged into chunks up to `chunk_len` seconds (default: **15s**).
3. Each chunk is transcribed **independently**, with no shared context between chunks.
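
Step 2 of this approach can be sketched as a greedy merge. This is an illustration only: the actual logic in `inference_vad.py` is not shown in this document, `merge_segments` is a hypothetical name, and segments are assumed to be `(start, end)` pairs in seconds rather than the sample indices a VAD may return.

```python
def merge_segments(segments, chunk_len=15.0):
    """Greedily merge (start, end) speech segments into chunks of at most chunk_len seconds.

    A single segment longer than chunk_len is kept as its own chunk rather than split.
    """
    chunks = []
    for start, end in segments:
        if chunks and end - chunks[-1][0] <= chunk_len:
            chunks[-1] = (chunks[-1][0], end)  # extend the current chunk
        else:
            chunks.append((start, end))        # start a new chunk
    return chunks


# Toy VAD output: four speech segments with short pauses between them.
speech = [(0.0, 4.0), (4.5, 9.0), (9.5, 16.0), (17.0, 20.0)]
print(merge_segments(speech, chunk_len=15.0))
print(merge_segments(speech, chunk_len=8.0))
```

With the 15s default the four segments collapse into two chunks; at 8s every segment stays separate, which illustrates how `chunk_len` controls the length of audio the decoder sees at once.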

### Root Causes of Degradation on Long-Form

| Issue | Detail |
|-------|--------|
| **Training length mismatch** | 77% of fine-tuning data is <10s and 92% is <15s. Inference chunks at 15s are therefore longer than about 92% of training examples, creating distribution shift. |
| **No cross-chunk context** | Each 15s chunk is transcribed in isolation. Canary's attention decoder has no memory of previous chunks, so topic/speaker continuity is lost at boundaries. |
| **VAD vs. native chunking** | Our VAD-based approach differs from Canary's built-in dynamic chunking. The model was not fine-tuned with this chunking strategy. |
| **Repetition / hallucination** | At chunk boundaries with silence or music, the decoder can loop. This is worsened when segments are near the edge of the model's training length distribution. |
| **No overlap** | Without overlap between chunks, words at segment boundaries can be dropped or doubled. |

### Comparison: Canary vs. Our Finetuned Whisper on Long-Form

Whisper was explicitly designed and trained for long-form audio with:

- Sliding window inference with overlap
- Previous-chunk text as conditioning (prompt-based context)
- Timestamps for alignment

Canary's AED architecture does not use previous-chunk text as input, making long-form continuity fundamentally harder to achieve without careful chunk overlap and stitching.

---

## Progress & Results

### Current Status: **Model Released & Repository Consolidated**

We have successfully completed the finetuning, KenLM integration, and repository consolidation phases. The model and its associated language models are now hosted on Hugging Face at `RASMUS/Finnish-ASR-Canary-v2`.

- **Infrastructure:** Finetuned on **RTX PRO 6000 Blackwell** (96 GB VRAM) on Verda.com platform in Finland.
- **Model Suite:** Acoustic model + 3 KenLM variants (1M, 2M, 5M sentences).
- **Best Performance (with KenLM 5M):**
  - **FLEURS:** 7.86% WER
  - **Common Voice:** 4.70% WER
  - **CSS10:** 7.07% WER
  - **VoxPopuli:** 11.65% WER
- **Deployment:** Integrated Silero VAD-based inference for robust long-form audio processing.

### Next Steps:

1. **Long-form Tuning:** Reduce default `chunk_len` to 8–10s (closer to the bulk of the training distribution) and add 0.5–1s overlap between chunks to reduce boundary artifacts.
2. **Data Quality Audit:** Fix 28 confirmed corrupted Common Voice entries where raw TSV metadata (client ID hashes, gender tags) was accidentally written into the `text` field. Audit VoxPopuli for missing capitalisation (all-lowercase transcripts despite `pnc: yes`).
3. **Number Handling:** Add Finnish-specific training data with numeric content. Consider TTS-synthesised samples covering phone numbers, years, statistics, and measurements (both digit and written-out forms paired).
4. **Long-form Training Data:** Incorporate longer audio segments, such as TTS synthetic long-form audio (`fbc_monolog_processed`, parliament data), into the training manifest to shift the duration distribution toward 15–30s.
5. **KenLM Refinement:** Re-train KenLM with high-quality punctuated text. The current LM was trained on mixed-quality data.
6. **Advanced Evaluation:** Implement CER evaluation on non-normalised test sets to better capture punctuation/casing accuracy.
7. **Repetition Penalty:** Explore repetition penalty in decoding if chunk-level loops persist after chunk length tuning.
8. **Real-world Evaluation:** Benchmark on diverse long-form samples (podcasts, meetings, call-centre audio).
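
For item 6, character error rate on non-normalised text can be computed with a plain Levenshtein distance, so that casing and punctuation differences count as errors. The sketch below uses only the standard library; the function names are hypothetical, and a production evaluation would more likely use an established metrics package.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# A lowercase, unpunctuated hypothesis is penalised on a non-normalised reference.
print(cer("Yli 100 vuotta.", "yli 100 vuotta"))
```

Here the hypothesis differs only in the capital letter and the final full stop, giving 2 edits over 15 reference characters.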

---

## Action Plan: Next Training Run

This section details the concrete steps for the next finetuning iteration, based on the root-cause analysis above.

### Priority 1: Fix Training Data (before re-training)

#### 1a. Normalise numbers to digit form (Gemini Flash)

Finnish written-out numbers in FLEURS transcripts cause the finetuned model to output inconsistent number forms. We will use the Gemini Flash API to convert all training transcripts in a single batch pass:

```python
# Sketch: run once on train_manifest.json before the next training run
import json
import os

import google.generativeai as genai

# Assumption: the API key is provided through the GEMINI_API_KEY environment variable.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

SYSTEM_PROMPT = """You are a Finnish text normalizer.
Convert any written-out Finnish numbers, ordinals, or number words in the text to digit form.
Examples:
"yli sata vuotta" → "yli 100 vuotta"
"seitsemäntoista henkeä" → "17 henkeä"
"vuonna tuhat yhdeksänsataa" → "vuonna 1900"
Keep all other text exactly as-is. Return only the modified text, nothing else."""

# The manifest is JSON Lines (one JSON object per line), so stream it entry by entry.
with open('manifests/train_manifest.json', encoding='utf-8') as src, \
        open('manifests/train_manifest_normalised.json', 'w', encoding='utf-8') as dst:
    for line in src:
        d = json.loads(line)
        response = model.generate_content(f"{SYSTEM_PROMPT}\n\n{d['text']}")
        d['text'] = response.text.strip()
        dst.write(json.dumps(d, ensure_ascii=False) + '\n')
```

Cost estimate: 23,180 entries × ~50 tokens average = ~1.2M tokens. At Gemini Flash pricing (~$0.075/1M tokens input) → **< $0.10 total** for input tokens; output tokens are billed separately.

#### 1b. Fix en-dash UNK token (confirmed root cause)

The en-dash `–` (U+2013) is NOT in the tokenizer vocabulary; it maps to UNK (id 0). Replace it with ASCII hyphen before training:

```python
# Add to the manifest preprocessing step
text = text.replace('\u2013', '-').replace('\u2014', '-')
```

This affects **85 entries** in `train_manifest.json` (83 FLEURS, 2 Common Voice).

#### 1c. Fix 28 corrupted Common Voice entries

Replace entries where the `text` field contains raw TSV metadata (tabs + client_id hashes). Strip everything after the first tab character.
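
Both deterministic fixes (1b and 1c) fit in one small function. This is a minimal sketch, assuming the transcript always precedes the first tab in a corrupted entry, as described above; `clean_text` is a hypothetical name and the metadata in the demo string is invented for illustration.

```python
def clean_text(text):
    """Apply the deterministic fixes: drop TSV spill-over, then map dashes to '-'."""
    text = text.split("\t", 1)[0]       # keep only what precedes the first tab
    text = text.replace("\u2013", "-")  # en-dash
    text = text.replace("\u2014", "-")  # em-dash
    return text.strip()


print(clean_text("25\u201330 vuodella"))
# Hypothetical corrupted entry: transcript, then a made-up client hash and gender tag.
print(clean_text("Hyvää huomenta.\tabc123\tmale"))
```

Applying the function to the `text` field of every manifest line leaves clean entries unchanged, so it is safe to run over the whole manifest rather than only the 85 + 28 affected entries.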

---

### Priority 2: Add Long-Form Training Data

#### TTS Long-Form Dataset: `RASMUS/canary_asr_finetune_tts_long_data`

| Property | Value |
|----------|-------|
| Size | 8.0 GB zip |
| Format | FLAC audio + JSONL manifest |
| Mean duration | **16.5s** (vs 7.8s in current data) |
| Median duration | 15.9s |
| Max duration | 25.0s |
| Content | Finnish speech: lectures, podcasts, YouTube |
| Segments >20s | ~25% |

This dataset directly addresses the training length mismatch. Adding it will shift the duration distribution from a mean of 7.8s toward ~10–12s and significantly increase the proportion of 15–25s segments that match inference chunk lengths.

**Integration plan:**

```bash
# Download the dataset
curl -L -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/datasets/RASMUS/canary_asr_finetune_tts_long_data/resolve/main/canary_dataset.zip" \
  -o /workspace/data/tts_long_data.zip

# Extract
unzip /workspace/data/tts_long_data.zip -d /workspace/data/tts_long_data/

# Apply number normalisation and dash fix to canary_manifest.jsonl
# then merge with existing train_manifest_normalised.json
```

After applying number normalisation and dash fixes to the new manifest, concatenate with the existing training set. Expected combined size: ~23,180 + N (estimate 5,000–20,000+ entries depending on total dataset size).

---

### Priority 3: Inference Tuning (without re-training)

Even before re-training, we can improve `moo.wav` performance by adjusting `inference_vad.py`:

| Parameter | Current | Recommended |
|-----------|---------|-------------|
| `chunk_len` | 15s | 8–10s (close to the training mean of 7.8s) |
| chunk overlap | 0s | 0.5s (reduce boundary word drops) |
| `alpha` (KenLM) | 0.2 | Try 0.1–0.15 (current may over-constrain decoder) |
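
The overlap row can be illustrated by padding each chunk on both sides. This is a sketch under assumptions: chunks are `(start, end)` pairs in seconds, `add_overlap` is a hypothetical helper rather than an existing option of `inference_vad.py`, and merging the duplicated words that overlapping audio produces is not covered here.

```python
def add_overlap(chunks, overlap=0.5, total_duration=None):
    """Extend each (start, end) chunk by `overlap` seconds on both sides, clamped to the audio."""
    padded = []
    for start, end in chunks:
        new_start = max(0.0, start - overlap)
        new_end = end + overlap
        if total_duration is not None:
            new_end = min(total_duration, new_end)
        padded.append((new_start, new_end))
    return padded


# Two chunks from a 20-second file, padded by the recommended 0.5s.
print(add_overlap([(0.0, 9.0), (9.5, 20.0)], overlap=0.5, total_duration=20.0))
```

After padding, the two chunks share the audio between 9.0s and 9.5s, so a word cut at the original boundary is heard in full by at least one of them.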

---

## Round 2: Data Pipeline & Splits

This section documents the data preparation methodology for Round 2 finetuning, including all new eval sets, the TTS integration, and the final manifest composition.

### Overview of Changes vs Round 1

| Item | Round 1 | Round 2 |
|------|---------|---------|
| Base model | `canary-1b-v2.nemo` | `canary-1b-v2.nemo` (fresh start) |
| Training samples | 23,180 | **28,858** |
| Training hours | ~50h | **75.6h** |
| Mean duration | 7.8s | **9.4s** |
| Max duration allowed | 20.0s | **30.0s** |
| Transcripts normalised | No | **Yes (digits, dashes fixed)** |
| Eval sets | 4 | **6** |

### Step 1: Transcript Normalisation (`normalize_manifests.py`)

All training transcripts were cleaned in two layers:

**Deterministic fixes (no API call needed):**

- En-dash `–` (U+2013) and em-dash (U+2014) → ASCII hyphen `-` (fixes UNK token regression)
- Corrupted Common Voice entries (raw TSV metadata in `text` field) → strip everything after first tab

**Gemini 2.5 Flash API calls (2,586 of 23,180 entries needed conversion):**

- Pre-filtered with a Finnish number-word regex so only entries that actually contain written numbers are sent to the API (cost: ~$0.62)
- Written Finnish numbers converted to digit form: `sata vuotta` → `100 vuotta`, `seitsemäntoista` → `17`
- Explicit DO NOT CONVERT rules: ordinals (`ensimmäinen`, `toinen`), superlative constructions (`yksi tärkeimmistä`), and `toinen` as "another/other"
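
The pre-filter can be approximated with a regex over number-word stems. The pattern below is an assumption for illustration, not the regex used in `normalize_manifests.py`: it matches base stems only, so inflected forms such as `kahden` are missed, and it is deliberately permissive, leaving cases like `yksi tärkeimmistä` for the API rules to reject.

```python
import re

# Hypothetical stem list: cardinal stems plus tens/hundreds/thousands/millions.
NUMBER_WORD_RE = re.compile(
    r"\b(?:nolla|yksi|kaksi|kolme|neljä|viisi|kuusi|seitsemän|kahdeksan|yhdeksän"
    r"|kymmen|sata|tuhat|miljoon|miljard)\w*",
    re.IGNORECASE,
)


def needs_normalisation(text):
    """Return True when the transcript contains something that looks like a number word."""
    return NUMBER_WORD_RE.search(text) is not None


for sample in ["yli sata vuotta", "surmaten seitsemäntoista henkeä", "15 metriä"]:
    print(sample, needs_normalisation(sample))
```

False positives only cost an extra API call, while false negatives leave written-out numbers in the training data, so erring on the permissive side seems the safer trade-off.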

### Step 2: TTS Long-Form Data Integration

Downloaded `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB, 6,365 entries, mean 16.4s).

Aligned to NeMo training format:

- Path rewritten to relative style: `data/tts_long_data/audio/{filename}`
- Fields mapped: `language` → `source_lang`/`target_lang`, `task: "transcription"` → `taskname: "asr"`, added `pnc: "yes"`
- Same Gemini normalisation pass applied (888 entries converted)
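
The field mapping above can be sketched as a single function. The input keys `audio_filepath`, `duration`, and `text` are assumptions based on the usual NeMo manifest layout; `language` and `task` come from the mapping described above, and `to_nemo_entry` is a hypothetical name.

```python
import os


def to_nemo_entry(entry, audio_dir="data/tts_long_data/audio"):
    """Map one TTS manifest entry to the training fields listed above."""
    filename = os.path.basename(entry["audio_filepath"])
    return {
        "audio_filepath": f"{audio_dir}/{filename}",  # relative path style
        "duration": entry["duration"],
        "text": entry["text"],
        "source_lang": entry["language"],
        "target_lang": entry["language"],
        "taskname": "asr",                            # replaces task: "transcription"
        "pnc": "yes",
    }


# Hypothetical source entry, for illustration only.
demo = {
    "audio_filepath": "/tmp/export/clip_0001.flac",
    "duration": 16.4,
    "text": "Hei maailma.",
    "language": "fi",
    "task": "transcription",
}
print(to_nemo_entry(demo))
```

Because source and target language are the same, the entry describes plain transcription rather than translation.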

### Step 3: Eval Set Construction (TTS Data)

The 6,365 normalised TTS entries were split into train / eval / long-form-test:

```
All TTS entries (6,365)
│
├── Long-form pool (>20s):         1,501 entries
│   ├── eval_long_form (sampled):    200 entries  ← random.seed(42) shuffle → first 200
│   └── Returned to training pool: 1,301 entries
│
└── Medium pool (10–20s):          4,864 entries
    ├── eval_tts (10% hold-out):     487 entries  ← stratified by duration bucket
    └── tts_train:                 4,377 entries
```

**Why eval_long_form = 200 entries?**
The original 1,501 long-form entries (>20s) had a total duration of ~9.4 hours, far too long to run as a validation set every epoch. At batch_size=32 on a single GPU, each validation pass over 1,501 entries takes ~25 minutes, adding 2.5h per epoch. 200 entries (≈75 minutes of audio) provides a representative sample of the long-form distribution at reasonable cost: ~4 minutes of eval time per epoch.

**eval_tts construction:**
487 entries were held out from the 10–20s duration range (10% stratified sample). This tests the model's ability to handle medium-length audio and is separate from the original 4 eval sets.
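
The split can be sketched as follows. This is an illustration of the scheme, not the project's script: it uses a local `random.Random(seed)` instead of the global `random.seed(42)`, and the 5-second stratification buckets are an assumption, since the bucket width is not stated above.

```python
import math
import random
from collections import defaultdict


def split_tts(entries, n_long_eval=200, holdout=0.10, seed=42):
    """Split entries into (train, eval_tts, eval_long_form)."""
    rng = random.Random(seed)
    long_pool = [e for e in entries if e["duration"] > 20.0]
    medium_pool = [e for e in entries if e["duration"] <= 20.0]

    rng.shuffle(long_pool)
    eval_long = long_pool[:n_long_eval]
    train = long_pool[n_long_eval:]  # overflow returns to the training pool

    # Stratify the medium pool by 5-second duration bucket (assumed width).
    buckets = defaultdict(list)
    for e in medium_pool:
        buckets[int(e["duration"] // 5)].append(e)
    eval_tts = []
    for key in sorted(buckets):
        group = buckets[key]
        rng.shuffle(group)
        k = math.ceil(len(group) * holdout)
        eval_tts.extend(group[:k])
        train.extend(group[k:])
    return train, eval_tts, eval_long


# Synthetic entries with durations cycling through 10..24 seconds.
demo = [{"id": i, "duration": 10 + (i % 15)} for i in range(150)]
train, eval_tts, eval_long = split_tts(demo, n_long_eval=10)
print(len(train), len(eval_tts), len(eval_long))
```

Fixing the seed keeps the eval sets identical across re-runs, which matters when comparing Round 1 and Round 2 checkpoints on the same held-out audio.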

### Step 4: Combined Training Manifest

Final `train_manifest_combined.jsonl` composition:

| Source | Entries | Notes |
|--------|---------|-------|
| Original train (normalised) | 23,180 | Digits + dash fix applied |
| TTS train (10–20s) | 4,377 | Synthesised long-form speech |
| Long-form overflow | 1,301 | >20s entries not selected for eval_long_form |
| **Total** | **28,858** | Mean 9.4s, 75.6h |

### Final Eval Sets (Round 2)

| Set | File | Entries | Mean Duration | Purpose |
|-----|------|---------|--------------|---------|
| `eval_fleurs` | `eval_fleurs.json` | 918 | 13.0s | Primary benchmark (monitored for checkpointing) |
| `eval_common_voice` | `eval_common_voice.json` | 1,554 | 5.1s | Crowdsourced quality |
| `eval_css10` | `eval_css10.json` | 170 | 7.5s | Clean single-speaker |
| `eval_voxpopuli` | `eval_voxpopuli.json` | 430 | 10.6s | Formal/parliament speech |
| `eval_tts` | `eval_tts.jsonl` | 487 | 14.5s | Medium-length TTS (new) |
| `eval_long_form` | `eval_long_form.jsonl` | **200** | 22.5s | Long-form >20s sample (new) |

**Checkpoint monitoring:** `val_wer` tracks FLEURS (first validation set). All 6 WERs are logged independently to WandB.

### Round 2 Training Config

File: `configs/canary_finetune_finnish_v2.yaml`

Key settings:

- `init_from_nemo_model`: `/workspace/Finnish-ASR-Canary-v2/models/canary-1b-v2.nemo` (fresh start from base)
- `max_duration`: 30.0s (up from 20.0s to include TTS segments up to 25s)
- `max_steps`: 18,000 (scaled: 28,858 / 32 ≈ 902 steps/epoch × 20 epochs ≈ 18,040)
- `lr`: 1e-5, `WarmupAnnealing`, 500 warmup steps
- `precision`: bf16, single GPU, `strategy: auto`
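
The `max_steps` arithmetic can be checked in a few lines. The sketch assumes one optimizer step per fixed-size batch of 32 on a single GPU with no gradient accumulation; if batches are built by total duration instead of sample count, the real number of steps per epoch will differ. `steps_for` is a hypothetical helper.

```python
import math


def steps_for(num_samples, batch_size, epochs):
    """Return (steps per epoch, total steps), assuming one step per fixed-size batch."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


print(steps_for(28_858, 32, 20))
```

Under these assumptions the configured 18,000 steps correspond to just under 20 epochs.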

---

## Workflow Status Details

### 1. Data Preparation - DONE

- [x] Identify and inventory all 4 datasets
- [x] Create unified processing script (`scripts/prepare_all_manifests.py`)
- [x] Run `scripts/prepare_all_manifests.py` on devcontainer
- [x] Verify manifest sample counts and audio file integrity

### 2. Configuration Setup - DONE

- [x] Create Hydra training config (`configs/canary_finetune_finnish.yaml`)
- [x] Configure multi-validation with 4 eval datasets
- [x] Checkpoint monitors primary eval set (FLEURS) via `val_wer`
- [x] All 4 eval WERs logged independently to WandB

### 3. Training - DONE

- [x] Run finetuning via `run_training.sh`
- [x] Monitor per-dataset WER in WandB

### 4. KenLM / NGPU-LM Language Model Integration - DONE

- [x] Install KenLM tools (`install_beamsearch_decoders.sh`)
- [x] Gather Finnish text (ASR transcripts + Wikipedia + mc4)
- [x] Train 3 variants of KenLM (1M, 2M, 5M sentences)
- [x] Evaluate with LM fusion on all 4 test sets

### 5. Repository & Long-Form Inference - IN PROGRESS

- [x] Consolidate README and model metadata for Hugging Face release
- [x] Upload model checkpoints and KenLM bundles to HF Hub
- [x] Implement Silero VAD-based chunking for long-form audio (`inference_vad.py`)
- [x] Root-cause analysis of long-form degradation vs. Whisper (see above)
- [ ] Reduce `chunk_len` to 8–10s and add chunk overlap (Current Focus)
- [ ] Optimise `alpha` for stability on `moo.wav` (30 min test file)

### 6. Data Quality & Advanced Evaluation - PARTIALLY DONE

- [x] Fix 28 corrupted Common Voice manifest entries (raw TSV data in text field); done in normalisation pass.
- [x] Fix en-dash/em-dash UNK token regression; done in normalisation pass.
- [ ] Audit VoxPopuli transcripts for all-lowercase entries (capitalisation missing).
- [ ] Re-train KenLM with high-quality punctuated text.
- [ ] Evaluate CER on non-normalised test sets.

### 7. Number Normalisation & UNK Token Fix - DONE

- [x] Replace en-dash `–` (U+2013) and em-dash (U+2014) with ASCII hyphen `-` in all training manifests (85 train + 70 TTS entries fixed).
- [x] Use Gemini 2.5 Flash to normalise written-out Finnish numbers to digit form (2,586 train entries and 888 TTS entries converted).
- [ ] Re-evaluate on the 5-sample number test set after Round 2 training to verify consistency.

### 8. Long-Form Data Expansion - DONE

- [x] Download `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB zip, 6,365 entries, mean 16.4s).
- [x] Align TTS manifest to NeMo training format and integrate into combined training manifest.
- [x] Round 2 training configured and ready to launch (see Round 2 section above).
- [ ] Benchmark Round 2 model against Round 1 and finetuned Whisper on `moo.wav`.

---
|
| 519 |
+
|
| 520 |
+
## π οΈ NeMo Environment Setup
|
| 521 |
+
|
| 522 |
+
This section documents the exact steps to set up a working NeMo inference/training environment, including the fixes required for the `nvcr.io/nvidia/pytorch:25.01-py3` container.
|
| 523 |
+
|
| 524 |
+
### Installation (from scratch on pytorch:25.01-py3 base image)
|
| 525 |
+
|
| 526 |
+
```bash
|
| 527 |
+
# 1. Clone the HF model repo (contains NeMo source with patches applied)
|
| 528 |
+
# Skip LFS to avoid downloading the 3.6 GB model during clone
|
| 529 |
+
GIT_LFS_SKIP_SMUDGE=1 git clone \
|
| 530 |
+
"https://user:${HF_TOKEN}@huggingface.co/RASMUS/Finnish-ASR-Canary-v2" \
|
| 531 |
+
/workspace/Finnish-ASR-Canary-v2
|
| 532 |
+
|
| 533 |
+
# 2. Install NeMo in editable mode from the patched source
|
| 534 |
+
cd /workspace/Finnish-ASR-Canary-v2/NeMo
|
| 535 |
+
pip install -e ".[asr]"
|
| 536 |
+
|
| 537 |
+
# 3. Install pinned dependencies
|
| 538 |
+
pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' kaldialign wandb
|
| 539 |
+
```

### Required Compatibility Fixes

The pytorch:25.01-py3 container ships with packages that conflict with NeMo 2.8.0rc0:

```bash
# Fix 1: Downgrade lightning to the version NeMo requires (<=2.4.0)
# The container ships lightning 2.4.0 but pip may upgrade it, so pin it back.
pip install "lightning==2.4.0" "pytorch-lightning==2.4.0"

# Fix 2: Remove incompatible torchvision
# The container's torchvision (0.20.0a0) was built against torch 2.6.0a0 (the original
# container torch), but NeMo's install upgrades torch to ~2.10. torchvision then fails
# on import and blocks NeMo. ASR does not need torchvision.
pip uninstall -y torchvision
```
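
After applying the two fixes, it can help to confirm that the pins actually hold before launching a long job. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the expected versions are simply the ones pinned above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=None, forbidden=("torchvision",)):
    """Return a list of human-readable problems; an empty list means all pins hold."""
    if expected is None:
        # Versions taken from the pip commands above.
        expected = {"lightning": "2.4.0", "pytorch-lightning": "2.4.0"}
    problems = []
    for package, wanted in expected.items():
        found = installed_version(package)
        if found != wanted:
            problems.append(f"{package}: expected {wanted}, found {found}")
    for package in forbidden:
        if installed_version(package) is not None:
            problems.append(f"{package}: should be uninstalled")
    return problems
```

Calling `check_environment()` inside the container and printing the returned list gives a quick pass/fail signal without importing NeMo itself.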

### Downloading the Finetuned Model

```bash
# Download the finetuned acoustic model (3.6 GB)
curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/canary-finnish.nemo" \
  -o /workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo

# KenLM models are also LFS files; download the 5M variant (best WER):
curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/kenlm_5M.nemo" \
  -o /workspace/Finnish-ASR-Canary-v2/kenlm_5M.nemo
```

### Quick Inference Smoke Test

```python
import warnings
warnings.filterwarnings('ignore')  # silence warnings, including those raised while importing NeMo

from nemo.collections.asr.models import EncDecMultiTaskModel

# Restore the finetuned checkpoint onto the GPU.
model = EncDecMultiTaskModel.restore_from(
    '/workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo',
    map_location='cuda'
)
model.eval()

# Finnish-to-Finnish ASR with punctuation and capitalisation enabled.
results = model.transcribe(
    audio=['path/to/audio.wav'],
    task='asr', source_lang='fi', target_lang='fi', pnc='yes'
)
print(results[0].text)
```

### Loading the Base Model (for comparison)

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# Downloads ~3.6 GB on first run, cached in ~/.cache/huggingface/
model_base = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2", map_location='cuda')
```
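
To put a number on the difference between the base and finetuned transcripts, a word error rate can be computed with a plain edit distance. The sketch below is a dependency-free illustration rather than the project's evaluation code: it splits on whitespace and applies no text normalisation, so punctuation and casing differences count as errors.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Passing the same reference transcript with the `.text` output of each model yields two directly comparable scores.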

---

## Progress Log

- **2026-01-11:** Initial project setup.
- **2026-02-08:** Redesigned data pipeline for 4 real datasets (CSS10, FLEURS, VoxPopuli, Common Voice).
- **2026-02-10:** **Finetuning complete.** Epoch 11 reached `val_wer=0.1258` on FLEURS.
- **2026-02-13:** Mermaid diagrams and project documentation for DS team.
- **2026-02-18:** **KenLM benchmarks finished.** Consolidated repository structure. Applied NeMo patches for inference stability.
- **2026-02-20:** **Model Released.** Release of `Finnish-ASR-Canary-v2` on HF. Implemented VAD-based inference pipeline. Currently tuning for long-form stability on `moo.wav` with various `alpha` settings (0.0 - 0.4 tested).
- **2026-02-26:** **Root-cause analysis complete.** Investigated long-form gap vs. Whisper and number handling. Key findings: (1) 77% of training data is <10s, creating distribution shift at inference chunk lengths; (2) No cross-chunk context in Canary's AED architecture; (3) Only 2.5% of training samples contain digit characters, so numbers are a known weak point; (4) 28 corrupted Common Voice entries found (TSV metadata in text field); (5) `moo.wav` test file confirmed as 30 minutes. Action plan: shorten chunk_len, add chunk overlap, fix data corruption, and plan a long-form training data expansion round.
- **2026-02-26:** **Live number inference + tokenizer audit completed.** Ran base Canary-v2 vs. finetuned model on 5 FLEURS samples. Confirmed: (1) base model always outputs digits (`100`, `17`); (2) finetuned model regressed to mixed output (sometimes written words, sometimes digits) due to inconsistent training transcripts; (3) en-dash (`–`) produces UNK token `⁇` in the finetuned model, whereas the base model degrades gracefully to ASCII hyphen. Policy decision: **standardise on digit output** and fix en-dash encoding in training manifests before next training run. NeMo environment setup documented (with fixes for `torchvision` and `lightning` version conflicts). TTS long-form dataset (`canary_asr_finetune_tts_long_data`, 8GB, mean 16.5s/segment) identified as key data source for next training run. Action plan for next run: (1) normalise numbers to digits via Gemini Flash API, (2) fix en-dash → ASCII hyphen, (3) fix 28 corrupted CV entries, (4) add TTS long-form data.
- **2026-03-01:** **Round 2 data pipeline complete.** Ran `normalize_manifests.py`: 2,586 Gemini 2.5 Flash API calls (~$0.62), 1,137 number changes in train + 888 in TTS, 85 en-dash and 28 corrupted CV entries fixed. Downloaded and extracted TTS long-form dataset (6,365 entries, 4.8 GB). Split TTS data into train (4,377), eval_tts (487, mean 14.5s), and long-form pool (1,501 entries >20s). Sampled 200 entries into `eval_long_form.jsonl` (seed 42) and returned 1,301 to training, yielding `train_manifest_combined.jsonl` (28,858 entries, 75.6h). Round 2 training config created (`configs/canary_finetune_finnish_v2.yaml`). **Training ready to launch.**
- **2026-03-01:** **Training crash diagnosed and fixed.** Round 2 training ran 505 steps then crashed with CUDA `vectorized_gather_kernel index out of bounds`. Root cause: entry 14857 in `train_manifest_combined.jsonl` contained 11,247 chars of Python code (Gemini normalization returned a code block instead of a transcript for `voxpopuli_005371.wav`). When tokenized with the canary2 prompt format, the sequence far exceeded the decoder's `max_sequence_length=1024`, causing position-embedding OOB. Additionally, 4 entries in `eval_common_voice.json` had TSV metadata contamination (same v1 issue, not previously caught in the v2 eval set). Both manifests fixed. Config rewritten from full-architecture spec to minimal v1-style format (`tokenizer: update_tokenizer: false`) using `speech_to_text_finetune.py` (which restores the full model from the `.nemo` file). Training re-launched. Manifests synced to `canary-finnish-asr-data` HuggingFace dataset repo.
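
A pre-training sanity pass over the manifest could catch entries like the oversized one behind the crash before they reach the GPU. The sketch below is a heuristic illustration, not the fix that was applied. It works on characters rather than tokens, and both thresholds (`max_chars`, `max_chars_per_second`) are assumed values chosen for the example. The `text` and `duration` field names are also assumptions about the manifest layout.

```python
import json


def find_suspicious_entries(lines, max_chars=1000, max_chars_per_second=40.0):
    """Yield (line_number, reason) for manifest lines that look corrupted.

    Both thresholds are illustrative assumptions, not values from the project.
    """
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        entry = json.loads(line)
        text = entry.get("text", "")
        duration = float(entry.get("duration", 0.0))
        if len(text) > max_chars:
            # Catches cases such as a code block returned in place of a transcript.
            yield line_number, f"text has {len(text)} chars"
        elif duration > 0 and len(text) / duration > max_chars_per_second:
            # Far more text than could plausibly be spoken in the clip.
            yield line_number, f"{len(text) / duration:.1f} chars per second"
        elif "\t" in text:
            # Tabs in a transcript suggest leftover TSV metadata.
            yield line_number, "tab character in text (possible TSV residue)"
```

A token-level check against the decoder's `max_sequence_length` would be stricter, but it needs the model tokenizer, whereas a character-level pass like this one can run anywhere.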