Thomas Wolf PRO

thomwolf

https://thomwolf.io

AI & ML interests

NLP and open-source :-)

Recent Activity

New activity in rl-llm-wiki/knowledge-base about 3 hours ago

topic: algorithms/grpo-and-group-relative — add §9 importance-sampling axis (CISPO/GSPO/ScaleRL)

#385 opened about 3 hours ago by

New activity in rl-llm-wiki/knowledge-base about 4 hours ago

source: arxiv:2305.10425 — SLiC-HF (Sequence Likelihood Calibration with Human Feedback)

#378 opened about 11 hours ago by

source: arxiv:2305.14483 — SIRLC (RL self-improvement by self-evaluation)

#377 opened about 11 hours ago by

source: arxiv:2305.17608 — Reward Collapse in Aligning LLMs

#376 opened about 11 hours ago by

source: arxiv:2304.05302 — RRHF

#375 opened about 11 hours ago by

fix: repackage 6 folder-shaped source records to the two-store model (remove internal folders incl. 2 raw PDFs, promote 5 flat summaries)

#383 opened about 11 hours ago by

source: arxiv:2301.11270 - Principled RLHF (Zhu-Jordan-Jiao: MLE converges but its policy fails; pessimistic MLE minimax-optimal; K-wise splitting consistent but inefficient)

#384 opened about 10 hours ago by

source: arxiv:2607.01612 - C3RL (PPO reward-shaping to fix RLVR's "calibrated but wrong" overconfidence failure mode)

#382 opened about 11 hours ago by

source: arxiv:2607.01715 - Distributionally Robust Listwise Preference Optimization (DPO: pairwise BT -> listwise PL + label-noise robustness)

#381 opened about 11 hours ago by

source: arxiv:2607.02390 - DecompRL (critic-free RLVR for hierarchical/modular code generation, formal variance-reduced estimator)

#380 opened about 11 hours ago by

source: arxiv:2607.02073 - MAVEN (GRPO + per-action Shapley-style evidence rewards for long-context reasoning)

#379 opened about 11 hours ago by

New activity in rl-llm-wiki/knowledge-base about 11 hours ago

source: arxiv:2607.01612 - C3RL (PPO reward-shaping to fix RLVR's "calibrated but wrong" overconfidence failure mode)

#361 opened 1 day ago by

source: arxiv:2607.01715 - Distributionally Robust Listwise Preference Optimization (DPO: pairwise BT -> listwise PL + label-noise robustness)

#360 opened 1 day ago by

source: arxiv:2607.02390 - DecompRL (critic-free RLVR for hierarchical/modular code generation, formal variance-reduced estimator)

#358 opened 1 day ago by

source: arxiv:2607.02073 - MAVEN (GRPO + per-action Shapley-style evidence rewards for long-context reasoning)

#357 opened 1 day ago by

source: arxiv:2510.13786 — The Art of Scaling RL Compute for LLMs (ScaleRL; sigmoid compute-scaling framework, CISPO adoption)

#370 opened about 14 hours ago by

source: arxiv:2506.13585 — MiniMax-M1 (CISPO: clipped IS-weight policy optimization + RL stability recipes)

#371 opened about 14 hours ago by

source: arxiv:2507.18071 — GSPO (sequence-level IS ratio + clipping; the Qwen3 RL loss)

#373 opened about 14 hours ago by

source: arxiv:2607.01763 — Denser ≠ Better (SDPO forgets/collapses in continual post-training; excess-KL theory)

#374 opened about 12 hours ago by

New activity in rl-llm-wiki/knowledge-base about 14 hours ago

source: arxiv:2207.14502 — LMs Can Teach Themselves to Program Better (verifier-filtered self-improvement)

#369 opened about 19 hours ago by