Open to Collab


Omar Kamali PRO

omarkamali

https://omarkama.li

AI & ML interests

NLP & LLMs for low resource languages.

Recent Activity


liked a dataset 1 day ago

omneity-labs/lid-benchmark

Viewer • Updated 10 days ago • 27k • 90 • 2

liked a dataset 3 days ago

abdelhaqueidali/Amazigh-Dictionary-Dataset

Preview • Updated 3 days ago • 172 • 1

published a dataset 3 days ago

omarkamali/spacy-nlp

Viewer • Updated 5 days ago • 6.47M • 7

updated a dataset 5 days ago

omarkamali/spacy-nlp

Viewer • Updated 5 days ago • 6.47M • 7

liked a dataset 5 days ago

omneity-labs/ipa-dict

Viewer • Updated 11 days ago • 5.3M • 80 • 1

replied to their post 9 days ago

This is super helpful, thanks! I'll get up to speed on the literature and keep your use case in mind :)

liked a Space 10 days ago

LID Benchmark — Language Identification Leaderboard

🌍

Compare 10 LID models across 8 benchmarks and 214 languages

updated a dataset 10 days ago

omneity-labs/lid-benchmark

Viewer • Updated 10 days ago • 27k • 90 • 2

updated a Space 10 days ago

LID Benchmark — Language Identification Leaderboard

🌍

Compare 10 LID models across 8 benchmarks and 214 languages

replied to their post 10 days ago

So you basically still want ASR-style transcription before the LLM kicks in (perhaps to reduce hallucination, or for another purpose?), but would like the representation to be richer so a downstream LLM can still reason about pronunciation, pauses and so on?

replied to their post 10 days ago

Hah yeah that rendering bug is for sure a meta joke (played on me :D).

Speech is for sure something I'd like to address. This work is deeply grounded in phonetics, as you guessed (I wrote a paper on this topic because I love wordplay, https://doi.org/10.14746/linpo.2025.67.1.8, and it's kinda a precursor to this method), so it must work with audio. I just have to figure out the right way and objective.

What are the most critical gaps you see in voice AI that need an improvement?

replied to their post 10 days ago

I knowww. Need to fix the video pipeline lol

Thanks @alfredo-ottomate! In principle, it should be faster than a conventional LLM at the same scale while also using less VRAM, mostly because it removes the softmax layer, which is one of the more expensive operations in standard language models. It also removes the embedding table, which usually accounts for roughly 10-20% of the parameters. For example, in Qwen 3.5 4B, that’s about 700M embedding parameters eliminated.

Raw performance-wise, I expect roughly a 10% per-token generation speedup, about 10% less VRAM usage, and better use of the context window, since each token is a full word rather than a subword piece.

The question then is how many parameters my replacement mechanism will ultimately need to stay competitive. The approach is already working surprisingly well at around 4M parameters, which is about 0.6% of the roughly 700M embedding parameters it replaces in a 4B model. Even if that number grows, the efficiency upside still looks very promising.

Fingers crossed! ✌︎
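The parameter figures quoted in the reply above can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the three counts are the rounded numbers from the post, and reading "the alternative" as the embedding table of a 4B model is an assumption, not something the post states outright.

```python
# Rounded figures quoted in the post (not measured values).
TOTAL_PARAMS = 4_000_000_000      # "4B total"
EMBEDDING_PARAMS = 700_000_000    # "about 700M embedding parameters"
REPLACEMENT_PARAMS = 4_000_000    # "around 4M parameters"

# Share of the whole model taken up by the embedding table.
embedding_share = EMBEDDING_PARAMS / TOTAL_PARAMS

# Size of the replacement mechanism relative to the table it replaces
# (assumed interpretation of "0.6% of the alternative").
replacement_ratio = REPLACEMENT_PARAMS / EMBEDDING_PARAMS

print(f"embedding share of model: {embedding_share:.1%}")
print(f"replacement vs. embedding table: {replacement_ratio:.2%}")
```

Under these assumptions the embedding table falls inside the quoted 10-20% range, and the 4M-parameter replacement comes out a little under 0.6% of the table it stands in for.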

replied to their post 11 days ago

Quick update, it seems to mostly work as intended 🤯

More details here:
https://x.com/OmarKamali/status/2036932984226320748

posted an update 11 days ago


Omneity Labs LID Benchmark is live 🔥

- 8 Evals
- 10 Models (GlotLID, OpenLID, our own Gherbal and others)
- 200+ Languages
- One Leaderboard To Rule Them All!

Come find your language and which LID model supports it best in this space 👇

omneity-labs/lid-benchmark

upvoted a collection 11 days ago

OLDI and friends

Collection

This collection groups the datasets that have been featured as part of WMT’s Open Language Data Initiative shared task. • 5 items • Updated 11 days ago • 5

updated a dataset 11 days ago

omneity-labs/ipa-dict

Viewer • Updated 11 days ago • 5.3M • 80 • 1

published 2 datasets 11 days ago

omneity-labs/ipa-dict

Viewer • Updated 11 days ago • 5.3M • 80 • 1

omneity-labs/lid-benchmark

Viewer • Updated 10 days ago • 27k • 90 • 2

published a Space 11 days ago

LID Benchmark — Language Identification Leaderboard

🌍

Compare 10 LID models across 8 benchmarks and 214 languages

replied to their post 11 days ago

I added a decoding head to the LLM, so the MLP generates a latent word vector that gets decoded by a GRU into a valid word.

I'm using the same input representation and training a joint encoder-decoder, which gets further fine-tuned as part of the "Next Latent Prediction"(?) objective, and it seems to be pretty decent for a first shot. Still working out some of the kinks.
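To make the "latent word vector decoded by a GRU" idea concrete, here is a minimal pure-Python sketch. It is not the author's implementation. Character-level output, the alphabet, the greedy decoding loop, the random untrained weights, and every name (`TinyGRUDecoder`, `decode`, `EOS`) are hypothetical choices made for illustration. It only shows the mechanism: the latent vector seeds the GRU hidden state, and the GRU emits one character per step until it picks an end-of-word symbol.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
EOS = len(ALPHABET)        # index of the hypothetical end-of-word symbol
VOCAB = len(ALPHABET) + 1  # characters plus EOS


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyGRUDecoder:
    """Character-level GRU decoder with random (untrained) weights."""

    def __init__(self, hidden, seed=0):
        rng = random.Random(seed)
        # Inputs are one-hot characters, so input matrices have VOCAB columns.
        self.Wz, self.Uz = rand_matrix(rng, hidden, VOCAB), rand_matrix(rng, hidden, hidden)
        self.Wr, self.Ur = rand_matrix(rng, hidden, VOCAB), rand_matrix(rng, hidden, hidden)
        self.Wh, self.Uh = rand_matrix(rng, hidden, VOCAB), rand_matrix(rng, hidden, hidden)
        self.Wo = rand_matrix(rng, VOCAB, hidden)  # hidden state -> character logits

    def step(self, x, h):
        # Standard GRU cell: update gate z, reset gate r, candidate state.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, gated))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def decode(self, latent, max_len=12):
        h = list(latent)   # the latent word vector seeds the hidden state
        x = [0.0] * VOCAB  # all-zero vector acts as the start-of-word input
        chars = []
        for _ in range(max_len):
            h = self.step(x, h)
            logits = matvec(self.Wo, h)
            idx = max(range(VOCAB), key=logits.__getitem__)  # greedy choice
            if idx == EOS:
                break
            chars.append(ALPHABET[idx])
            x = [1.0 if i == idx else 0.0 for i in range(VOCAB)]
        return "".join(chars)


decoder = TinyGRUDecoder(hidden=8)
latent = [0.1 * i for i in range(8)]  # stand-in for the MLP's output vector
word = decoder.decode(latent)
print(repr(word))
```

With untrained weights the output is gibberish, not a valid word. In the setup described in the post, the decoder would be trained jointly with an encoder so that latents map back to real words.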
