RASMUS committed on
Commit 4d2ba80 · verified · 1 Parent(s): 3700d96

Upload PLAN_AND_PROGRESS.md with huggingface_hub

  1. PLAN_AND_PROGRESS.md +497 -54
PLAN_AND_PROGRESS.md CHANGED
@@ -101,68 +101,511 @@ We use a balanced mix of datasets to cover various audio qualities and transcrip
101
 
102
  ---
103
 
104
  ## 🚀 Progress & Results
105
 
106
- ### Current Status: **Project Completed & Models Uploaded**
107
- We have successfully completed the end-to-end pipeline: finetuning, KenLM integration (5M samples), full benchmarking, and deployment to the Hugging Face Hub.
108
 
109
- - **Hardware:** RTX 6000 Ada (96 GB VRAM)
110
- - **Acoustic Model:** Finetuned Canary-v2 (1B params)
111
- - **Language Model:** 6-gram KenLM (5M samples, token-aligned)
112
- - **Final Performance:** **7.51% Average WER** across four datasets (40.8% error reduction from baseline).
113
- - **Deployment:** [RASMUS/Finnish-ASR-Canary-v2](https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2)
114
 
115
  ---
116
 
117
- ## 🛠️ Implementation Details
118
-
119
- ### 1. Framework Setup (NVIDIA NeMo)
120
- We use a specific commit of NVIDIA NeMo to ensure compatibility with Canary-v2.
121
- - **Repository:** `https://github.com/NVIDIA/NeMo.git`
122
- - **Commit:** `557177a18d`
123
- - **Installation:** Cloned into `/workspace/NeMo` and installed in editable mode (`pip install -e .[asr]`).
124
- - **Patches Applied:**
125
- - Fixed `OneLogger` import issues in `nemo/lightning/callback_group.py`.
126
- - Fixed Canary-v2 EOS assertion in `nemo/collections/common/prompts/canary2.py` for inference compatibility.
127
-
128
- ### 2. Model Initialization & Conversion
129
- - **Base Model:** Downloaded `nvidia/canary-1b-v2` from HuggingFace using `scripts/init_canary_model.py`.
130
- - **Finetuned Model:** Downloaded the PyTorch Lightning checkpoint (`canary-finnish--epoch=11--val_wer=0.1258.ckpt`) from [RASMUS/Canary_finetune_trial](https://huggingface.co/RASMUS/Canary_finetune_trial/tree/main) and converted it to a portable `.nemo` file at `/workspace/models/canary-finetuned-finnish.nemo`.
131
-
132
- ### 3. Data Ingestion
133
- - **ASR Data:** Downloaded from HuggingFace (`RASMUS/canary-finnish-asr-data`) using `huggingface-cli download`.
134
- - **Structure:**
135
- - Manifests: `/workspace/Canary_finetune_trial/manifests/`
136
- - Audio: `/workspace/data/audio/` (Extracted from tars).
137
-
138
- ### 4. KenLM Language Model Integration
139
- To solve Finnish-specific challenges like long compound words, we integrated a token-aligned 6-gram KenLM model.
140
- - **Training Data:** 5 million lines of high-quality Finnish text filtered from Reddit, FinePDF, Wiki-Edu, and ASR transcripts.
141
- - **Data Integrity:** Removed 1,833 leaked sentences from the training corpus to ensure fair evaluation.
142
- - **Performance Optimization:** Converted the ARPA text models into binary `.nemo` LM bundles, reducing load time from 5+ minutes to ~7 seconds.
143
-
144
- ### 5. Benchmarking the Evolution
145
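- Across the development cycle the average WER fell at each stage, although individual datasets did not always improve (FLEURS regressed after finetuning, and VoxPopuli regressed once KenLM was added). Exact evaluation results (clean WER %):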
146
-
147
- | Dataset | Original Canary-v2 | Finetuned (Greedy) | **Finetuned + KenLM (5M, α=0.2)** | Improvement |
148
- | :--- | :---: | :---: | :---: | :---: |
149
- | **Common Voice** | 17.95% | 12.82% | **5.98%** | **-66.7%** |
150
- | **FLEURS** | 7.79% | 8.33% | **6.48%** | **-16.8%** |
151
- | **CSS10 (Audiobook)**| 17.07% | 12.19% | **11.85%** | **-30.6%** |
152
- | **VoxPopuli (Formal)**| 7.96% | 4.46% | **5.73%** | **-28.0%** |
153
- | **GLOBAL AVG** | 12.69% | 9.45% | **7.51%** | **-40.8%** |
154
 
155
  ---
156
 
157
  ## 📝 Progress Log
158
- - **2026-01-11:** Initial project setup with Chatterbox-TTS-10k + MCV test set.
159
  - **2026-02-08:** Redesigned data pipeline for 4 real datasets (CSS10, FLEURS, VoxPopuli, Common Voice).
160
- - **2026-02-14:**
161
- - Environment: Cloned NeMo (`557177a18d`), applied patches.
162
- - Models: Converted epoch 11 checkpoint to `.nemo` format.
163
- - Data: Downloaded full ASR dataset from Hub.
164
- - **2026-02-15:**
165
- - **KenLM Expansion:** Scaled from 500k to 5M training samples with high-quality filtering (FineWeb-Edu).
166
- - **Optimization:** Converted ARPA to NeMo LM bundles for fast GPU loading.
167
- - **Evaluation:** Completed full benchmarks on all four test sets.
168
- - **Deployment:** Uploaded all final models and documentation to [RASMUS/Finnish-ASR-Canary-v2](https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2).
 
101
 
102
  ---
103
 
104
+ ## 📊 Training Data Analysis
105
+
106
+ This section documents the composition and length distribution of our training data (from `RASMUS/canary-finnish-asr-data`, accessed 2026-02-26).
107
+
108
+ ### Dataset Summary
109
+
110
+ | Dataset | Samples | Mean Duration | Max Duration | Total Hours |
111
+ |---------|---------|--------------|-------------|-------------|
112
+ | **Common Voice v24** | 9,086 | 4.5s | 10.5s | 11.2h |
113
+ | **VoxPopuli** | 8,164 | 10.1s | 50.5s | 23.0h |
114
+ | **CSS10** | 3,226 | 7.7s | 20.2s | 6.9h |
115
+ | **FLEURS** | 2,704 | 11.7s | 43.2s | 8.8h |
116
+ | **TOTAL** | **23,180** | **7.8s** | **50.5s** | **~50h** |
117
+
118
+ ### Duration Distribution (Training Set)
119
+
120
+ ```
121
+ 0–5s : 33.3% (7,725 samples) ████████████████████████████████████████
122
+ 5–10s : 43.7% (10,139 samples) ████████████████████████████████████████████████████
123
+ 10–15s : 15.0% (3,473 samples) ██████████████████
124
+ 15–20s : 5.4% (1,241 samples) ██████
125
+ 20–30s : 2.4% (562 samples) ███
126
+ >30s : 0.2% (40 samples)
127
+ ```
128
+
129
+ **Key insight:** 77% of training samples are shorter than 10 seconds. The model has very little exposure to longer audio segments (only 0.2% are >30s). This has direct implications for long-form inference stability.
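
The bucket percentages above can be recomputed from a NeMo-style JSONL manifest. The following is a minimal sketch, assuming each manifest line is a JSON object with a numeric `duration` field in seconds. The function name `duration_histogram` and the bucket labels are illustrative and not part of the project's scripts.

```python
import json
from bisect import bisect_right

# Bucket edges in seconds, chosen to mirror the distribution above (assumption).
EDGES = [5, 10, 15, 20, 30]
LABELS = ["0-5s", "5-10s", "10-15s", "15-20s", "20-30s", ">30s"]


def duration_histogram(manifest_lines):
    """Count manifest entries per duration bucket.

    `manifest_lines` is any iterable of JSON strings, one entry per line,
    each with a numeric `duration` field (seconds).
    Returns {label: (count, percentage)}.
    """
    counts = [0] * len(LABELS)
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue
        duration = json.loads(line)["duration"]
        # bisect_right places a duration of exactly 5.0 in the 5-10s bucket.
        counts[bisect_right(EDGES, duration)] += 1
    total = sum(counts) or 1
    return {label: (n, 100.0 * n / total) for label, n in zip(LABELS, counts)}


if __name__ == "__main__":
    sample = ['{"duration": 4.2}', '{"duration": 7.9}', '{"duration": 31.0}']
    for label, (n, pct) in duration_histogram(sample).items():
        print(f"{label:>7}: {pct:5.1f}% ({n} samples)")
```

Passing an open manifest file object works as well, since a file iterates line by line.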
130
+
131
+ ### Evaluation Set Durations
132
+
133
+ | Eval Set | Samples | Mean Duration | Max Duration |
134
+ |----------|---------|--------------|-------------|
135
+ | FLEURS | 918 | 13.0s | 33.7s |
136
+ | Common Voice | 1,554 | 5.1s | 10.5s |
137
+ | CSS10 | 170 | 7.5s | 10.2s |
138
+ | VoxPopuli | 430 | 10.6s | 47.5s |
139
+
140
+ ---
141
+
142
+ ## 🔢 Number Handling Analysis
143
+
144
+ ### Live Inference Results: Base vs Finetuned (2026-02-26)
145
+
146
+ We ran both models on 5 FLEURS test samples to determine each model's number output style.
147
+
148
+ | # | Scenario | Reference | Base Canary-v2 | Our Finetuned |
149
+ |---|----------|-----------|----------------|---------------|
150
+ | 1 | Spoken "sata" (hundred) | `yli sata vuotta` | `yli 100 vuotta` ❌ | `yli 100 vuotta` ❌ |
151
+ | 2 | Spoken "seitsemäntoista" (17) | `surmaten seitsemäntoista henkeä` | `surmaten 17 henkeä` ❌ | `surmaten seitsemäntoista henkeä` ✅ |
152
+ | 3 | Digits in reference (15, 2011, 2017) | `15 metriä... 2011... 2017` | Correct ✅ | Correct ✅ |
153
+ | 4 | Abbreviation "jKr." (AD) | `400 jKr.` | `400 jälkeen Kristuksen` | `400 jälkeen Kristuksen` |
154
+ | 5 | Range "25–30" (en-dash U+2013) | `25–30 vuodella` | `25-30 vuodella` (ASCII hyphen) | `25 ⁇ 30 vuodella` ❌ UNK token |
155
+
156
+ **Key findings:**
157
+
158
+ 1. **Base model outputs digits.** When the speaker says "sata" (hundred) or "seitsemäntoista" (seventeen), the base Canary-v2 outputs `100` and `17`. This looks like NVIDIA's built-in text normalisation: in all of these samples Canary emitted digit form for numbers.
159
+
160
+ 2. **Finetuning introduced inconsistency.** Our finetuning partially reversed this: for `seitsemäntoista` the finetuned model now outputs the written word (because FLEURS training transcripts used written-out numbers), but still outputs `100` for `sata`. This inconsistency is worse than either consistent policy.
161
+
162
+ 3. **En-dash produces a UNK token in the finetuned model.** The character `–` (U+2013 en-dash) in `25–30` causes the finetuned model to emit `⁇` (SentencePiece UNK). The base model degrades gracefully to an ASCII hyphen `25-30`. This is a regression introduced by finetuning, likely because the en-dash was absent or inconsistently encoded in our training data.
163
+
164
+ 4. **Abbreviations are expanded by both models.** `jKr.` → `jälkeen Kristuksen` in both, so this is model behaviour, not a finetuning artifact.
165
+
166
+ ### Policy Decision
167
+ **We want digit output** (not written-out Finnish number words). The base model's behaviour is correct here. The finetuned model regressed on consistency because our FLEURS training transcripts used written-out numbers.
168
+
169
+ ### Training Data Issues Found
170
+ - Only **2.5% (578 / 23,180)** of training samples contain digit characters at all.
171
+ - FLEURS transcripts use written-out numbers (`sata vuotta`) while VoxPopuli and Common Voice use digits. This gives the model conflicting signal.
172
+ - En-dash (`–` U+2013) may be absent or mis-encoded in training manifests, causing UNK tokens at inference time.
173
+
174
+ ### Action Plan: Numbers & UNK Token
175
+
176
+ #### Step 1: Normalise training transcripts to digit form
177
+ Run a pre-processing pass on `train_manifest.json` before the next training run:
178
+ - Convert Finnish written-out numbers to digits, e.g. `sata` → `100`, `seitsemäntoista` → `17`. Note that the Python library `num2words` (with `lang='fi'`) only converts in the opposite direction, digits to words, so it can at most be used to build a reverse lookup table (generate the word form for each integer in a range, then match on it).
179
+ - OR (simpler / safer): replace the FLEURS transcripts in the manifest with a FLEURS reference column that already has digits (FLEURS provides both `raw_transcription` and `transcription` columns; we currently use `raw_transcription`, which has written-out numbers). Check first that the other column really carries digits.
180
+ - Target: **all numeric quantities consistently in digit form** across all four datasets.
181
+
182
+ #### Step 2: Fix en-dash encoding (ROOT CAUSE CONFIRMED)
183
+
184
+ **Confirmed via tokenizer inspection (2026-02-26):**
185
+
186
+ ```python
187
+ m.tokenizer.text_to_ids("25–30") # → [16053, 1125, 1128, 0, 1126, 1123]
188
+ # ↑ id 0 = UNK for the en-dash!
189
+ m.tokenizer.text_to_ids("25-30") # → [16053, 1125, 1128, 16107, 1126, 1123]
190
+ # ↑ ASCII hyphen tokenises correctly
191
+ ```
192
+
193
+ - **En-dash `–` (U+2013) and em-dash `—` (U+2014) are NOT in the CanaryBPETokenizer vocabulary** (both map to UNK id 0).
194
+ - Training data contains **85 entries with en-dash** (83 FLEURS, 2 Common Voice). During training, the en-dash in the TARGET text was encoded as UNK, so the model learned to produce UNK for the corresponding speech sounds.
195
+ - **Fix: replace all `–` and `—` with ASCII hyphen `-` in all training transcripts** before the next training run. This is a one-line preprocessing step.
196
+
197
+ ```python
198
+ # In manifest preprocessing:
199
+ text = text.replace('\u2013', '-').replace('\u2014', '-')
200
+ ```
201
+
202
+ #### Step 3: Re-evaluate after normalisation
203
+ After normalising transcripts, re-run the 5-sample live inference test to verify:
204
+ - `sata vuotta` audio → model outputs `100 vuotta`
205
+ - `seitsemäntoista` audio → model outputs `17`
206
+ - `25–30` audio → model outputs `25-30` (no UNK; the en-dash itself cannot be produced because it is not in the tokenizer vocabulary)
207
+
208
+ ---
209
+
210
+ ## 🔈 Long-Form Audio: Root Cause Analysis
211
+
212
+ Our test file `moo.wav` is **30 minutes** (1,800s) of continuous Finnish speech. This reveals a core gap vs. our finetuned Whisper model.
213
+
214
+ ### How Canary-v2 Handles Long Audio (Natively)
215
+ - NVIDIA's Canary-v2 uses **dynamic chunking** with 1-second overlap between chunks.
216
+ - This is automatically triggered for audio longer than **40 seconds**.
217
+ - The model was pre-trained on a 1.7M-hour multilingual corpus with this chunking strategy baked in.
218
+
219
+ ### Our Current Approach (`inference_vad.py`)
220
+ 1. Silero VAD detects speech segments.
221
+ 2. Segments are merged into chunks up to `chunk_len` seconds (default: **15s**).
222
+ 3. Each chunk is transcribed **independently**, with no shared context between chunks.
223
+
224
+ ### Root Causes of Degradation on Long-Form
225
+
226
+ | Issue | Detail |
227
+ |-------|--------|
228
+ | **Training length mismatch** | 77% of fine-tuning data is <10s and about 92% is <15s. Inference chunks at 15s are therefore longer than the vast majority of training examples, creating distribution shift. |
229
+ | **No cross-chunk context** | Each 15s chunk is transcribed in isolation. Canary's attention decoder has no memory of previous chunks, so topic/speaker continuity is lost at boundaries. |
230
+ | **VAD vs. native chunking** | Our VAD-based approach differs from Canary's built-in dynamic chunking. The model was not fine-tuned with this chunking strategy. |
231
+ | **Repetition / hallucination** | At chunk boundaries with silence or music, the decoder can loop. This is worsened when segments are near the edge of the model's training length distribution. |
232
+ | **No overlap** | Without overlap between chunks, words at segment boundaries can be dropped or doubled. |
233
+
234
+ ### Comparison: Canary vs. Our Finetuned Whisper on Long-Form
235
+
236
+ Whisper was explicitly designed and trained for long-form audio with:
237
+ - Sliding window inference with overlap
238
+ - Previous-chunk text as conditioning (prompt-based context)
239
+ - Timestamps for alignment
240
+
241
+ Canary's AED architecture does not use previous-chunk text as input, making long-form continuity fundamentally harder to achieve without careful chunk overlap and stitching.
242
+
243
+ ---
244
+
245
  ## 🚀 Progress & Results
246
 
247
+ ### Current Status: **Model Released & Repository Consolidated**
248
+ We have successfully completed the finetuning, KenLM integration, and repository consolidation phases. The model and its associated language models are now hosted on Hugging Face at `RASMUS/Finnish-ASR-Canary-v2`.
249
+
250
+ - **Infrastructure:** Finetuned on **RTX 6000 PRO Blackwell** (96 GB VRAM) on Verda.com platform in Finland.
251
+ - **Model Suite:** Acoustic model + 3 KenLM variants (1M, 2M, 5M sentences).
252
+ - **Best Performance (with KenLM 5M):**
253
+ - **FLEURS:** 7.86% WER
254
+ - **Common Voice:** 4.70% WER
255
+ - **CSS10:** 7.07% WER
256
+ - **VoxPopuli:** 11.65% WER
257
+ - **Deployment:** Integrated Silero VAD-based inference for robust long-form audio processing.
258
+
259
+ ### Next Steps:
260
+ 1. **Long-form Tuning:** Reduce default `chunk_len` to 8–10s (closer to training distribution median) and add 0.5–1s overlap between chunks to reduce boundary artifacts.
261
+ 2. **Data Quality Audit:** Fix 28 confirmed corrupted Common Voice entries where raw TSV metadata (client ID hashes, gender tags) was accidentally written into the `text` field. Audit VoxPopuli for missing capitalisation (all-lowercase transcripts despite `pnc: yes`).
262
+ 3. **Number Handling:** Add Finnish-specific training data with numeric content. Consider TTS-synthesised samples covering phone numbers, years, statistics, and measurements (both digit and written-out forms paired).
263
+ 4. **Long-form Training Data:** Incorporate longer audio segments: TTS synthetic long-form audio (`fbc_monolog_processed`, parliament data) into the training manifest to shift the duration distribution toward 15–30s.
264
+ 5. **KenLM Refinement:** Re-train KenLM with high-quality punctuated text. Current LM trained on mixed-quality data.
265
+ 6. **Advanced Evaluation:** Implement CER evaluation on non-normalised test sets to better capture punctuation/casing accuracy.
266
+ 7. **Repetition Penalty:** Explore repetition penalty in decoding if chunk-level loops persist after chunk length tuning.
267
+ 8. **Real-world Evaluation:** Benchmark on diverse long-form samples (podcasts, meetings, call-centre audio).
268
+
269
+ ---
270
+
271
+ ## 🗺️ Action Plan: Next Training Run
272
+
273
+ This section details the concrete steps for the next finetuning iteration, based on the root-cause analysis above.
274
+
275
+ ### Priority 1: Fix Training Data (before re-training)
276
+
277
+ #### 1a. Normalise numbers to digit form (Gemini Flash)
278
+ Finnish written-out numbers in FLEURS transcripts cause the finetuned model to output inconsistent number forms. We will use the Gemini Flash API to convert all training transcripts in a single batch pass:
279
+
280
+ ```python
281
+ # Sketch: run once on train_manifest.json before the next training run.
282
+ import json
283
+ import os
284
+ import google.generativeai as genai
285
+ genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # key read from the environment
286
+ model = genai.GenerativeModel("gemini-2.0-flash")
287
+
288
+ SYSTEM_PROMPT = """You are a Finnish text normalizer.
289
+ Convert any written-out Finnish numbers, ordinals, or number words in the text to digit form.
290
+ Examples:
291
+ "yli sata vuotta" → "yli 100 vuotta"
292
+ "seitsemäntoista henkeä" → "17 henkeä"
293
+ "vuonna tuhat yhdeksänsataa" → "vuonna 1900"
294
+ Keep all other text exactly as-is. Return only the modified text, nothing else."""
295
+
296
+ entries = []
297
+ with open('manifests/train_manifest.json', encoding='utf-8') as f:
298
+     for line in f:
299
+         d = json.loads(line)
300
+         response = model.generate_content(f"{SYSTEM_PROMPT}\n\n{d['text']}")
301
+         d['text'] = response.text.strip()
302
+         entries.append(d)
303
+
304
+ with open('manifests/train_manifest_normalised.json', 'w', encoding='utf-8') as f:
305
+     for e in entries:
306
+         f.write(json.dumps(e, ensure_ascii=False) + '\n')
307
+ ```
308
+
309
+ Cost estimate: 23,180 entries × ~50 tokens average = ~1.2M tokens. At Gemini Flash pricing (~$0.075/1M tokens input) ≈ **< $0.10 total**.
310
+
311
+ #### 1b. Fix en-dash UNK token (confirmed root cause)
312
+ The en-dash `–` (U+2013) is NOT in the tokenizer vocabulary; it maps to UNK (id 0). Replace it with ASCII hyphen before training:
313
+
314
+ ```python
315
+ # Add to the manifest preprocessing step
316
+ text = text.replace('\u2013', '-').replace('\u2014', '-')
317
+ ```
318
+
319
+ This affects **85 entries** in `train_manifest.json` (83 FLEURS, 2 Common Voice).
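
The number of affected entries can be re-checked while the fix is applied. A minimal sketch, assuming a JSONL manifest whose entries carry a `text` field; `fix_dashes` is a hypothetical helper name, not an existing script in the repository.

```python
import json

# En-dash and em-dash both map to the ASCII hyphen.
DASHES = {"\u2013": "-", "\u2014": "-"}


def fix_dashes(manifest_lines):
    """Return (fixed_entries, n_changed) for an iterable of JSON lines."""
    table = str.maketrans(DASHES)
    fixed, n_changed = [], 0
    for line in manifest_lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        new_text = entry["text"].translate(table)
        if new_text != entry["text"]:
            n_changed += 1
            entry["text"] = new_text
        fixed.append(entry)
    return fixed, n_changed


if __name__ == "__main__":
    lines = ['{"text": "25\\u201330 vuodella"}', '{"text": "yli 100 vuotta"}']
    entries, changed = fix_dashes(lines)
    print(changed, [e["text"] for e in entries])
```

If `n_changed` for the training manifest differs from the 85 entries reported here, that would point to a different manifest version or to an encoding issue worth investigating.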
320
+
321
+ #### 1c. Fix 28 corrupted Common Voice entries
322
+ Replace entries where the `text` field contains raw TSV metadata (tabs + client_id hashes). Strip everything after the first tab character.
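
As an illustration of the tab-stripping rule, the following sketch keeps only the transcript portion of a corrupted `text` field. The function name and the example hash are made up for the demonstration.

```python
def strip_tsv_residue(text):
    """Keep only the part of `text` before the first tab character."""
    return text.split("\t", 1)[0].strip()


def is_corrupted(text):
    """Flag entries whose text field still contains a tab."""
    return "\t" in text


if __name__ == "__main__":
    bad = "Hyvää päivää.\t3f2a9c\tmale"  # hypothetical corrupted entry
    print(is_corrupted(bad), repr(strip_tsv_residue(bad)))
```

Running `is_corrupted` over the manifest first gives a count that can be compared against the 28 entries found in the audit.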
323
+
324
+ ---
325
+
326
+ ### Priority 2: Add Long-Form Training Data
327
+
328
+ #### TTS Long-Form Dataset: `RASMUS/canary_asr_finetune_tts_long_data`
329
+
330
+ | Property | Value |
331
+ |----------|-------|
332
+ | Size | 8.0 GB zip |
333
+ | Format | FLAC audio + JSONL manifest |
334
+ | Mean duration | **16.5s** (vs 7.8s in current data) |
335
+ | Median duration | 15.9s |
336
+ | Max duration | 25.0s |
337
+ | Content | Finnish speech: lectures, podcasts, YouTube |
338
+ | Segments >20s | ~25% |
339
+
340
+ This dataset directly addresses the training length mismatch. Adding it will shift the duration distribution from a mean of 7.8s toward ~10–12s and significantly increase the proportion of 15–25s segments that match inference chunk lengths.
341
+
342
+ **Integration plan:**
343
+ ```bash
344
+ # Download the dataset
345
+ curl -L -H "Authorization: Bearer ${HF_TOKEN}" \
346
+ "https://huggingface.co/datasets/RASMUS/canary_asr_finetune_tts_long_data/resolve/main/canary_dataset.zip" \
347
+ -o /workspace/data/tts_long_data.zip
348
 
349
+ # Extract
350
+ unzip /workspace/data/tts_long_data.zip -d /workspace/data/tts_long_data/
351
+
352
+ # Apply number normalisation and dash fix to canary_manifest.jsonl
353
+ # then merge with existing train_manifest_normalised.json
354
+ ```
355
+
356
+ After applying number normalisation and dash fixes to the new manifest, concatenate with the existing training set. Expected combined size: ~23,180 + N (estimate 5,000–20,000+ entries depending on total dataset size).
357
 
358
  ---
359
 
360
+ ### Priority 3: Inference Tuning (without re-training)
361
+
362
+ Even before re-training, we can improve `moo.wav` performance by adjusting `inference_vad.py`:
363
+
364
+ | Parameter | Current | Recommended |
365
+ |-----------|---------|-------------|
366
+ | `chunk_len` | 15s | 8–10s (close to the training mean of 7.8s) |
367
+ | chunk overlap | 0s | 0.5s (reduce boundary word drops) |
368
+ | `alpha` (KenLM) | 0.2 | Try 0.1–0.15 (current may over-constrain decoder) |
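
One simple way to introduce overlap is to pad each chunk boundary before slicing the audio. The sketch below assumes chunks are `(start, end)` pairs in seconds; `add_overlap` is a hypothetical helper, and removing the words that then appear in both neighbouring transcripts is deliberately left out.

```python
def add_overlap(chunks, overlap=0.5, audio_len=None):
    """Extend every chunk by `overlap` seconds on both sides.

    Starts are clamped at 0.0 and, when `audio_len` is given, ends are
    clamped at the file length. Two chunks that touch exactly will share
    2 * overlap seconds after padding.
    """
    padded = []
    for start, end in chunks:
        new_start = max(0.0, start - overlap)
        new_end = end + overlap
        if audio_len is not None:
            new_end = min(audio_len, new_end)
        padded.append((new_start, new_end))
    return padded


if __name__ == "__main__":
    print(add_overlap([(0.0, 9.0), (9.5, 20.0)], overlap=0.5, audio_len=20.0))
```

Padding keeps the chunking logic untouched, so it can be switched on and off while comparing results on the same file.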
369
+
370
+ ---
371
+
372
+ ## 🔄 Round 2: Data Pipeline & Splits
373
+
374
+ This section documents the data preparation methodology for Round 2 finetuning, including all new eval sets, the TTS integration, and the final manifest composition.
375
+
376
+ ### Overview of Changes vs Round 1
377
+
378
+ | Item | Round 1 | Round 2 |
379
+ |------|---------|---------|
380
+ | Base model | `canary-1b-v2.nemo` | `canary-1b-v2.nemo` (fresh start) |
381
+ | Training samples | 23,180 | **28,858** |
382
+ | Training hours | ~50h | **75.6h** |
383
+ | Mean duration | 7.8s | **9.4s** |
384
+ | Max duration allowed | 20.0s | **30.0s** |
385
+ | Transcripts normalised | No | **Yes (digits, dashes fixed)** |
386
+ | Eval sets | 4 | **6** |
387
+
388
+ ### Step 1: Transcript Normalisation (`normalize_manifests.py`)
389
+
390
+ All training transcripts were cleaned in two layers:
391
+
392
+ **Deterministic fixes (no API call needed):**
393
+ - En-dash `–` (U+2013) and em-dash `—` (U+2014) → ASCII hyphen `-` (fixes UNK token regression)
394
+ - Corrupted Common Voice entries (raw TSV metadata in `text` field) → strip everything after first tab
395
+
396
+ **Gemini 2.5 Flash API calls (2,586 of 23,180 entries needed conversion):**
397
+ - Pre-filtered with a Finnish number-word regex so only entries that actually contain written numbers are sent to the API (cost: ~$0.62)
398
+ - Written Finnish numbers converted to digit form: `sata vuotta` → `100 vuotta`, `seitsemäntoista` → `17`
399
+ - Explicit DO NOT CONVERT rules: ordinals (`ensimmäinen`, `toinen`), superlative constructions (`yksi tärkeimmistä`), and `toinen` as "another/other"
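
The pre-filter could look roughly like the following. The word list is an assumption for illustration only; the actual regex in `normalize_manifests.py` is not shown in this document. The pattern deliberately over-matches (for example `kuusi` also means "spruce"), since a false positive only costs one extra API call.

```python
import re

# Illustrative stems for Finnish cardinal number words (assumed list).
_NUMBER_STEMS = (
    "nolla", "yksi", "kaksi", "kolme", "neljä", "viisi", "kuusi",
    "seitsemän", "kahdeksan", "yhdeksän", "kymmenen", "sata", "tuhat",
    "miljoona", "miljardi",
)

# First alternative: a word starting with a number stem (covers compounds
# such as "seitsemäntoista"). Second alternative: any word ending in a
# typical number suffix such as "-toista" or "-kymmentä".
_NUMBER_RE = re.compile(
    r"\b(?:" + "|".join(_NUMBER_STEMS) + r")\w*"
    r"|\b\w+(?:toista|kymmentä|sataa|tuhatta)\b",
    re.IGNORECASE,
)


def has_number_word(text):
    """Return True if `text` appears to contain a written-out number."""
    return _NUMBER_RE.search(text) is not None


if __name__ == "__main__":
    for sample in ["yli sata vuotta", "yli 100 vuotta"]:
        print(sample, has_number_word(sample))
```

Only entries for which `has_number_word` returns True would then be sent to the normalisation API.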
400
+
401
+ ### Step 2: TTS Long-Form Data Integration
402
+
403
+ Downloaded `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB, 6,365 entries, mean 16.4s).
404
+
405
+ Aligned to NeMo training format:
406
+ - Path rewritten to relative style: `data/tts_long_data/audio/{filename}`
407
+ - Fields mapped: `language` → `source_lang`/`target_lang`, `task: "transcription"` → `taskname: "asr"`, added `pnc: "yes"`
408
+ - Same Gemini normalisation pass applied (888 entries converted)
409
+
410
+ ### Step 3: Eval Set Construction (TTS Data)
411
+
412
+ The 6,365 normalised TTS entries were split into train / eval / long-form-test:
413
+
414
+ ```
415
+ All TTS entries (6,365)
416
+ │
417
+ ├── Long-form pool (>20s): 1,501 entries
418
+ │   ├── eval_long_form (sampled): 200 entries ← random.seed(42) shuffle → first 200
419
+ │   └── Returned to training pool: 1,301 entries
420
+ │
421
+ └── Medium pool (10–20s): 4,864 entries
422
+     ├── eval_tts (10% hold-out): 487 entries ← stratified by duration bucket
423
+     └── tts_train: 4,377 entries
424
+ ```
425
+
426
+ **Why eval_long_form = 200 entries?**
427
+ The original 1,501 long-form entries (>20s) had a total duration of ~9.4 hours, far too long to run as a validation set every epoch. At batch_size=32 on a single GPU, each validation pass over 1,501 entries takes ~25 minutes, adding 2.5h per epoch. 200 entries (≈75 minutes of audio) provides a representative sample of the long-form distribution at reasonable cost: ~4 minutes of eval time per epoch.
428
+
429
+ **eval_tts construction:**
430
+ 487 entries were held out from the 10–20s duration range (10% stratified sample). This tests the model's ability to handle medium-length audio and is separate from the original 4 eval sets.
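
The split described in this step can be sketched as follows. Assumptions: entries are dicts with a `duration` field, the stratification buckets are 2 seconds wide, and `split_tts` is a hypothetical name. The real script may bucket differently, so exact hold-out counts can differ from the 487 reported here.

```python
import random
from collections import defaultdict


def split_tts(entries, long_threshold=20.0, n_long_eval=200,
              holdout_frac=0.10, bucket_width=2.0, seed=42):
    """Split TTS entries into (train, eval_tts, eval_long_form)."""
    rng = random.Random(seed)

    long_pool = [e for e in entries if e["duration"] > long_threshold]
    medium_pool = [e for e in entries if e["duration"] <= long_threshold]

    # Long-form: shuffle once, take the first n for eval, keep the rest
    # for training.
    rng.shuffle(long_pool)
    eval_long = long_pool[:n_long_eval]
    train = long_pool[n_long_eval:]

    # Medium: hold out the same fraction from every duration bucket so the
    # eval set follows the duration distribution of the pool.
    buckets = defaultdict(list)
    for e in medium_pool:
        buckets[int(e["duration"] // bucket_width)].append(e)
    eval_tts = []
    for key in sorted(buckets):
        group = buckets[key]
        rng.shuffle(group)
        k = round(len(group) * holdout_frac)
        eval_tts.extend(group[:k])
        train.extend(group[k:])
    return train, eval_tts, eval_long


if __name__ == "__main__":
    demo = [{"duration": 10 + (i % 100) / 10} for i in range(1000)]
    demo += [{"duration": 21.0 + i * 0.001} for i in range(300)]
    train, eval_tts, eval_long = split_tts(demo)
    print(len(train), len(eval_tts), len(eval_long))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.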
431
+
432
+ ### Step 4: Combined Training Manifest
433
+
434
+ Final `train_manifest_combined.jsonl` composition:
435
+
436
+ | Source | Entries | Notes |
437
+ |--------|---------|-------|
438
+ | Original train (normalised) | 23,180 | Digits + dash fix applied |
439
+ | TTS train (10–20s) | 4,377 | Synthesised long-form speech |
440
+ | Long-form overflow | 1,301 | >20s entries not selected for eval_long_form |
441
+ | **Total** | **28,858** | Mean 9.4s, 75.6h |
442
+
443
+ ### Final Eval Sets (Round 2)
444
+
445
+ | Set | File | Entries | Mean Duration | Purpose |
446
+ |-----|------|---------|--------------|---------|
447
+ | `eval_fleurs` | `eval_fleurs.json` | 918 | 13.0s | Primary benchmark (monitored for checkpointing) |
448
+ | `eval_common_voice` | `eval_common_voice.json` | 1,554 | 5.1s | Crowdsourced quality |
449
+ | `eval_css10` | `eval_css10.json` | 170 | 7.5s | Clean single-speaker |
450
+ | `eval_voxpopuli` | `eval_voxpopuli.json` | 430 | 10.6s | Formal/parliament speech |
451
+ | `eval_tts` | `eval_tts.jsonl` | 487 | 14.5s | Medium-length TTS (new) |
452
+ | `eval_long_form` | `eval_long_form.jsonl` | **200** | 22.5s | Long-form >20s sample (new) |
453
+
454
+ **Checkpoint monitoring:** `val_wer` tracks FLEURS (first validation set). All 6 WERs are logged independently to WandB.
455
+
456
+ ### Round 2 Training Config
457
+
458
+ File: `configs/canary_finetune_finnish_v2.yaml`
459
+ Key settings:
460
+ - `init_from_nemo_model`: `/workspace/Finnish-ASR-Canary-v2/models/canary-1b-v2.nemo` (fresh start from base)
461
+ - `max_duration`: 30.0s (up from 20.0s to include TTS segments up to 25s)
462
+ - `max_steps`: 18,000 (scaled: 28,858 / 32 ≈ 902 steps/epoch × 20 epochs ≈ 18,040)
463
+ - `lr`: 1e-5, `WarmupAnnealing`, 500 warmup steps
464
+ - `precision`: bf16, single GPU, `strategy: auto`
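
The `max_steps` figure follows from simple arithmetic. A small sketch, assuming a fixed batch size of 32, a single GPU and no gradient accumulation; the helper name `training_steps` is illustrative, and NeMo's own batching options could change the real step count.

```python
import math


def training_steps(n_samples, batch_size, epochs):
    """Return (steps_per_epoch, total_steps) for fixed-size batches."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


if __name__ == "__main__":
    per_epoch, total = training_steps(28_858, 32, 20)
    print(per_epoch, total)  # 902 18040
```

Rounding the result down to 18,000 means the run stops slightly before 20 full epochs.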
465
+
466
+ ---
467
+
468
+ ## 🛠️ Workflow Status Details
469
+
470
+ ### 1. Data Preparation - DONE
471
+ - [x] Identify and inventory all 4 datasets
472
+ - [x] Create unified processing script (`scripts/prepare_all_manifests.py`)
473
+ - [x] Run `scripts/prepare_all_manifests.py` on devcontainer
474
+ - [x] Verify manifest sample counts and audio file integrity
475
+
476
+ ### 2. Configuration Setup - DONE
477
+ - [x] Create Hydra training config (`configs/canary_finetune_finnish.yaml`)
478
+ - [x] Configure multi-validation with 4 eval datasets
479
+ - [x] Checkpoint monitors primary eval set (FLEURS) via `val_wer`
480
+ - [x] All 4 eval WERs logged independently to WandB
481
+
482
+ ### 3. Training - DONE
483
+ - [x] Run finetuning via `run_training.sh`
484
+ - [x] Monitor per-dataset WER in WandB
485
+
486
+ ### 4. KenLM / NGPU-LM Language Model Integration - DONE
487
+ - [x] Install KenLM tools (`install_beamsearch_decoders.sh`)
488
+ - [x] Gather Finnish text (ASR transcripts + Wikipedia + mc4)
489
+ - [x] Train 3 variants of KenLM (1M, 2M, 5M sentences)
490
+ - [x] Evaluate with LM fusion on all 4 test sets
491
+
492
+ ### 5. Repository & Long-Form Inference - IN PROGRESS
493
+ - [x] Consolidate README and model metadata for Hugging Face release
494
+ - [x] Upload model checkpoints and KenLM bundles to HF Hub
495
+ - [x] Implement Silero VAD-based chunking for long-form audio (`inference_vad.py`)
496
+ - [x] Root-cause analysis of long-form degradation vs. Whisper (see above)
497
+ - [ ] Reduce `chunk_len` to 8–10s and add chunk overlap (Current Focus)
498
+ - [ ] Optimize `alpha` for stability on `moo.wav` (30 min test file)
499
+
500
+ ### 6. Data Quality & Advanced Evaluation - PARTIALLY DONE
501
+ - [x] Fix 28 corrupted Common Voice manifest entries (raw TSV data in text field): done in normalisation pass.
502
+ - [x] Fix en-dash/em-dash UNK token regression: done in normalisation pass.
503
+ - [ ] Audit VoxPopuli transcripts for all-lowercase entries (capitalisation missing).
504
+ - [ ] Re-train KenLM with high-quality punctuated text.
505
+ - [ ] Evaluate CER on non-normalized test sets.
506
+
507
+ ### 7. Number Normalisation & UNK Token Fix - DONE
508
+ - [x] Replace en-dash `–` and em-dash `—` with ASCII hyphen `-` in all training manifests (85 train + 70 TTS entries fixed).
509
+ - [x] Use Gemini 2.5 Flash to normalise written-out Finnish numbers to digit form (2,586 train entries and 888 TTS entries converted).
510
+ - [ ] Re-evaluate on the 5-sample number test set after Round 2 training to verify consistency.
511
+
512
+ ### 8. Long-Form Data Expansion - DONE
513
+ - [x] Download `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB zip, 6,365 entries, mean 16.4s).
514
+ - [x] Align TTS manifest to NeMo training format and integrate into combined training manifest.
515
+ - [x] Round 2 training configured and ready to launch (see the Round 2 section above).
516
+ - [ ] Benchmark Round 2 model against Round 1 and finetuned Whisper on `moo.wav`.
517
+
518
+ ---
519
+
520
+ ## 🛠️ NeMo Environment Setup
521
+
522
+ This section documents the exact steps to set up a working NeMo inference/training environment, including the fixes required for the `nvcr.io/nvidia/pytorch:25.01-py3` container.
523
+
524
+ ### Installation (from scratch on pytorch:25.01-py3 base image)
525
+
526
+ ```bash
527
+ # 1. Clone the HF model repo (contains NeMo source with patches applied)
528
+ # Skip LFS to avoid downloading the 3.6 GB model during clone
529
+ GIT_LFS_SKIP_SMUDGE=1 git clone \
530
+ "https://user:${HF_TOKEN}@huggingface.co/RASMUS/Finnish-ASR-Canary-v2" \
531
+ /workspace/Finnish-ASR-Canary-v2
532
+
533
+ # 2. Install NeMo in editable mode from the patched source
534
+ cd /workspace/Finnish-ASR-Canary-v2/NeMo
535
+ pip install -e ".[asr]"
536
+
537
+ # 3. Install pinned dependencies
538
+ pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' kaldialign wandb
539
+ ```
540
+
541
+ ### Required Compatibility Fixes
542
+
543
+ The pytorch:25.01-py3 container ships with packages that conflict with NeMo 2.8.0rc0:
544
+
545
+ ```bash
546
+ # Fix 1: Downgrade lightning to the version NeMo requires (<=2.4.0)
547
+ # The container ships lightning 2.4.0 but pip may upgrade it, so pin it back.
548
+ pip install "lightning==2.4.0" "pytorch-lightning==2.4.0"
549
+
550
+ # Fix 2: Remove incompatible torchvision
551
+ # The container's torchvision (0.20.0a0) was built against torch 2.6.0a0 (the original
552
+ # container torch), but NeMo's install upgrades torch to ~2.10. torchvision then fails
553
+ # on import and blocks NeMo. ASR does not need torchvision.
554
+ pip uninstall -y torchvision
555
+ ```
556
+
557
+ ### Downloading the Finetuned Model
558
+
559
+ ```bash
560
+ # Download the finetuned acoustic model (3.6 GB)
561
+ curl -L \
562
+ -H "Authorization: Bearer ${HF_TOKEN}" \
563
+ "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/canary-finnish.nemo" \
564
+ -o /workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo
565
+
566
+ # KenLM models are also LFS; download the 5M variant (best WER):
567
+ curl -L \
568
+ -H "Authorization: Bearer ${HF_TOKEN}" \
569
+ "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/kenlm_5M.nemo" \
570
+ -o /workspace/Finnish-ASR-Canary-v2/kenlm_5M.nemo
571
+ ```
572
+
573
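Because the repository was cloned with `GIT_LFS_SKIP_SMUDGE=1`, the `.nemo` files in the working tree are small Git LFS pointer stubs until the downloads above overwrite them. A minimal sketch for checking this follows; `is_lfs_pointer` and `check_downloads` are hypothetical helper names, and the check relies only on the standard first line of a Git LFS pointer file.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` is a Git LFS pointer stub, not real content."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def check_downloads(paths):
    """Return the subset of `paths` that are missing or still LFS pointers."""
    bad = []
    for path in paths:
        path = Path(path)
        if not path.is_file() or is_lfs_pointer(path):
            bad.append(str(path))
    return bad
```

Calling `check_downloads` with the two `.nemo` paths from the download step should return an empty list once both files have been fetched.
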
+ ### Quick Inference Smoke Test
+
+ ```python
+ import warnings
+ warnings.filterwarnings('ignore')
+ from nemo.collections.asr.models import EncDecMultiTaskModel
+
+ # Restore the finetuned checkpoint directly onto the GPU
+ model = EncDecMultiTaskModel.restore_from(
+     '/workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo',
+     map_location='cuda'
+ )
+ model.eval()
+
+ # Finnish ASR with punctuation and capitalisation enabled
+ results = model.transcribe(
+     audio=['path/to/audio.wav'],
+     task='asr', source_lang='fi', target_lang='fi', pnc='yes'
+ )
+ print(results[0].text)
+ ```
+
+ ### Loading the Base Model (for comparison)
+
+ ```python
+ from nemo.collections.asr.models import EncDecMultiTaskModel
+
+ # Downloads ~3.6 GB on first run, cached in ~/.cache/huggingface/
+ model_base = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2", map_location='cuda')
+ ```

  ---

  ## 📝 Progress Log
+ - **2026-01-11:** Initial project setup.
  - **2026-02-08:** Redesigned data pipeline for 4 real datasets (CSS10, FLEURS, VoxPopuli, Common Voice).
+ - **2026-02-10:** **Finetuning complete.** Epoch 11 reached `val_wer=0.1258` on FLEURS.
+ - **2026-02-13:** Mermaid diagrams and project documentation for DS team.
+ - **2026-02-18:** **KenLM benchmarks finished.** Consolidated repository structure. Applied NeMo patches for inference stability.
+ - **2026-02-20:** **Model Released.** Released `Finnish-ASR-Canary-v2` on HF. Implemented VAD-based inference pipeline. Currently tuning for long-form stability on `moo.wav` with various `alpha` settings (0.0 - 0.4 tested).
+ - **2026-02-26:** **Root-cause analysis complete.** Investigated the long-form gap vs. Whisper and number handling. Key findings: (1) 77% of training data is <10s, creating a distribution shift at inference chunk lengths; (2) there is no cross-chunk context in Canary's AED architecture; (3) only 2.5% of training samples contain digit characters, so numbers are a known weak point; (4) 28 corrupted Common Voice entries found (TSV metadata in the text field); (5) the `moo.wav` test file is confirmed to be 30 minutes long. Action plan: shorten chunk_len, add chunk overlap, fix data corruption, and plan a long-form training data expansion round.
+ - **2026-02-26:** **Live number inference + tokenizer audit completed.** Ran base Canary-v2 vs. the finetuned model on 5 FLEURS samples. Confirmed: (1) the base model always outputs digits (`100`, `17`); (2) the finetuned model regressed to mixed output (sometimes written words, sometimes digits) due to inconsistent training transcripts; (3) the en-dash (`–`) produces the UNK token `⁇` in the finetuned model, while the base model degrades gracefully to an ASCII hyphen. Policy decision: **standardise on digit output** and fix en-dash encoding in training manifests before the next training run. NeMo environment setup documented (with fixes for `torchvision` and `lightning` version conflicts). TTS long-form dataset (`canary_asr_finetune_tts_long_data`, 8GB, mean 16.5s/segment) identified as the key data source for the next training run. Action plan for the next run: (1) normalise numbers to digits via the Gemini Flash API, (2) fix en-dash → ASCII hyphen, (3) fix the 28 corrupted CV entries, (4) add TTS long-form data.
+ - **2026-03-01:** **Round 2 data pipeline complete.** Ran `normalize_manifests.py`: 2,586 Gemini 2.5 Flash API calls (~$0.62), 1,137 number changes in train + 888 in TTS, 85 en-dash and 28 corrupted CV entries fixed. Downloaded and extracted the TTS long-form dataset (6,365 entries, 4.8 GB). Split the TTS data into train (4,377), eval_tts (487, mean 14.5s), and a long-form pool (1,501 entries >20s). Sampled 200 entries into `eval_long_form.jsonl` (seed 42) and returned 1,301 to training, yielding `train_manifest_combined.jsonl` (28,858 entries, 75.6h). Round 2 training config created (`configs/canary_finetune_finnish_v2.yaml`). **Training ready to launch.**
+ - **2026-03-01:** **Training crash diagnosed and fixed.** Round 2 training ran 505 steps, then crashed with CUDA `vectorized_gather_kernel index out of bounds`. Root cause: entry 14857 in `train_manifest_combined.jsonl` contained 11,247 chars of Python code (Gemini normalization returned a code block instead of a transcript for `voxpopuli_005371.wav`). When tokenized with the canary2 prompt format, the sequence far exceeded the decoder's `max_sequence_length=1024`, causing an out-of-bounds position-embedding lookup. Additionally, 4 entries in `eval_common_voice.json` had TSV metadata contamination (the same issue as in v1, not previously caught in the v2 eval set). Both manifests fixed. Config rewritten from a full-architecture spec to the minimal v1-style format (`tokenizer: update_tokenizer: false`) using `speech_to_text_finetune.py` (which restores the full model from the `.nemo` file). Training re-launched. Manifests synced to the `canary-finnish-asr-data` HuggingFace dataset repo.
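
The data problems recorded in the log (TSV metadata in the text field, an 11,247-character entry, en-dashes that map to UNK) can all be detected before training starts. The sketch below is illustrative only and is not the project's `normalize_manifests.py`: it assumes a JSONL manifest whose entries carry a `text` field, and the function names and the `max_chars` threshold of 1000 are hypothetical choices.

```python
import json

EN_DASH = "\u2013"


def audit_manifest_lines(lines, max_chars=1000):
    """Yield (line_number, reason) for suspicious entries in a JSONL manifest."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        text = json.loads(line).get("text", "")
        # A tab inside a transcript suggests a whole TSV row leaked into the field.
        if "\t" in text:
            yield number, "tab character in text (possible TSV contamination)"
        # Very long text is a cheap proxy for sequences that overflow the decoder.
        if len(text) > max_chars:
            yield number, f"text has {len(text)} chars (limit {max_chars})"
        if EN_DASH in text:
            yield number, "en-dash in text"


def replace_en_dash(text):
    """Map the en-dash to an ASCII hyphen."""
    return text.replace(EN_DASH, "-")
```

Passing an open manifest file to `audit_manifest_lines` and printing each yielded pair gives a quick report. A character limit is only a rough stand-in for the token-level `max_sequence_length` check, so a stricter audit would count tokens with the model's tokenizer instead.
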