
Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring


Abstract:

Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.

Themis reward models are trained with the Bradley-Terry preference framework on top of a multi-stage data pipeline that mines, filters, scores, and assembles high-quality code preference pairs from open-source repositories. The models are evaluated on Themis-CodeRewardBench, a benchmark of 8,866 preference pairs spanning 5 quality aspects and 8 programming languages.
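To make the objective concrete, below is a minimal PyTorch sketch of a Bradley-Terry preference step. The auxiliary LM-regularisation and magnitude-penalty terms (named in the diagram below) are written in an assumed form, and all coefficient names are illustrative; this is not the released training code.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor,
                    r_rejected: torch.Tensor,
                    lm_loss: torch.Tensor | None = None,
                    lm_coef: float = 0.0,
                    mag_coef: float = 0.0) -> torch.Tensor:
    """Bradley-Terry loss with optional auxiliary terms.

    r_chosen / r_rejected: scalar rewards for the preferred and
    dispreferred response of each pair, shape (batch,). The exact
    form of the auxiliary terms is an assumption for illustration.
    """
    # Bradley-Terry: maximise P(chosen > rejected) = sigmoid(r_c - r_r)
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Magnitude penalty (assumed L2 form): keeps the reward scale bounded
    loss = loss + mag_coef * (r_chosen.pow(2) + r_rejected.pow(2)).mean()
    # Optional LM regularisation, e.g. next-token loss on the chosen response
    if lm_loss is not None:
        loss = loss + lm_coef * lm_loss
    return loss
```

In practice the scalar rewards come from an LM backbone with a single-logit value head (e.g. `AutoModelForSequenceClassification` with `num_labels=1`); the FSDP2 sharding and Liger kernels mentioned below are orthogonal to the loss itself.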

Pipeline Overview

The end-to-end pipeline has three phases: dataset construction, model training, and evaluation.

```
                          DATASET CONSTRUCTION
                          ────────────────────
  BigQuery (github_repos)
      │
      ▼
  ┌─────────────────────┐   ┌───────────────────┐   ┌──────────────────┐
  │ 1. Commit Mining    │──▶│ 2. Repo Filtering │──▶│ 3. Ext Filtering │
  │    (SQL)            │   │    (allowlists)   │   │    (lang → ext)  │
  └─────────────────────┘   └───────────────────┘   └──────────────────┘
                                                            │
      ┌─────────────────────────────────────────────────────┘
      ▼
  ┌──────────────────────┐   ┌──────────────────┐   ┌──────────────────┐
  │ 4. Content Retrieval │──▶│ 5. Deduplication │──▶│ 6. Aspect Filter │
  │    (git fetch)       │   │    (MinHash LSH) │   │    (ModernBERT)  │
  └──────────────────────┘   └──────────────────┘   └──────────────────┘
                                                            │
      ┌─────────────────────────────────────────────────────┘
      ▼
  ┌──────────────────────┐   ┌──────────────────┐   ┌──────────────────┐
  │ 7. LLM Scoring &     │──▶│ 8. LLM-as-a-Judge│──▶│ 9. Training Data │
  │    Instruction Synth │   │    (A/B voting)  │   │    Assembly      │
  └──────────────────────┘   └──────────────────┘   └──────────────────┘
                                                            │
                          MODEL TRAINING                    │
                          ──────────────                    │
      ┌─────────────────────────────────────────────────────┘
      ▼
  ┌───────────────────────────────────────────────────────────────────┐
  │ Bradley-Terry preference training with FSDP2 on multi-node GPUs   │
  │ (BT loss + LM regularisation + magnitude penalty, Liger kernels)  │
  └───────────────────────────────────┬───────────────────────────────┘
                                      │
                          EVALUATION  │
                          ──────────  │
      ┌───────────────────────────────┘
      ▼
  ┌───────────────────────────────────────────────────────────────────┐
  │ Themis-CodeRewardBench: 8,866 pairs × 5 aspects × 8 languages     │
  │ Evaluated across scalar, MoE, and generative RM architectures     │
  └───────────────────────────────────────────────────────────────────┘
```
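Step 5 relies on MinHash LSH for near-duplicate removal. A minimal sketch using the `datasketch` library follows; the shingle length, permutation count, and Jaccard threshold are illustrative assumptions, not the pipeline's actual settings.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, k: int = 5) -> MinHash:
    """MinHash over character k-shingles of a code snippet."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - k + 1, 1)):
        m.update(text[i:i + k].encode("utf8"))
    return m

def dedup(snippets: list[str], threshold: float = 0.8,
          num_perm: int = 128) -> list[str]:
    """Keep one representative per near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, text in enumerate(snippets):
        m = minhash(text, num_perm)
        if lsh.query(m):  # a near-duplicate was already kept
            continue
        lsh.insert(str(idx), m)
        kept.append(text)
    return kept
```

Querying before inserting keeps the first member of each cluster, so the pool shrinks in a single pass over the mined commits.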

Results

Themis-RM models achieve best-in-class accuracy on Themis-CodeRewardBench, a code-specific reward model benchmark, while also matching or exceeding much larger models on established general-domain benchmarks (RewardBench V1, RewardBench V2, JudgeBench). Models are grouped by parameter class; bold marks the best in each group.

| Model | Themis-CodeRewardBench | RewardBench V1 | RewardBench V2 | JudgeBench |
|---|---|---|---|---|
| **32B – 72B Class** | | | | |
| WorldPM-72B | 76.96 | 90.88 | 67.92 | 55.21 |
| Athene-RM-70B | 78.39 | 91.22 | 68.76 | 63.45 |
| Nemotron-70B-Reward | 81.19 | 93.88 | 70.49 | **73.47** |
| Themis-RM-32B | **91.82** | **94.89** | **72.34** | 71.65 |
| AceCodeRM-32B | 62.95 | 23.58 | 67.98 | 66.77 |
| **7B – 14B Class** | | | | |
| Themis-RM-14B | **91.19** | 94.11 | 71.44 | **70.85** |
| Themis-RM-8B | 89.78 | 93.69 | 65.87 | 69.97 |
| Athene-RM-8B | 76.58 | 87.48 | 62.96 | 61.12 |
| CodeScaler-8B | 79.12 | 94.66 | 76.51 | 70.05 |
| Skywork-Reward-V2-8B | 79.97 | **94.76** | **76.93** | 67.90 |
| AceCodeRM-7B | 71.11 | 22.74 | 63.16 | 61.09 |
| **0.6B – 4B Class** | | | | |
| Themis-RM-4B | **88.39** | 92.46 | 63.81 | 68.02 |
| CodeScaler-4B | 77.97 | **94.32** | **75.13** | **68.44** |
| Skywork-Reward-V2-4B | 79.27 | 94.06 | 74.26 | 65.43 |
| Themis-RM-1.7B | 83.04 | 89.17 | 56.22 | 63.29 |
| CodeScaler-1.7B | 73.75 | 91.13 | 68.44 | 66.17 |
| Skywork-Reward-V2-1.7B | 75.60 | 91.64 | 67.71 | 66.48 |
| Themis-RM-0.6B | 79.26 | 83.41 | 49.61 | 63.84 |
| Skywork-Reward-V2-0.6B | 72.77 | 86.32 | 60.83 | 63.65 |
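For reading the table: scores are accuracies, i.e. the fraction of comparisons on which the RM ranks the preferred response first. For a pairwise benchmark such as Themis-CodeRewardBench this reduces to the sketch below (tie-handling conventions vary; here ties count as errors).

```python
def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the RM scores the chosen response higher."""
    wins = sum(c > r for c, r in zip(chosen_scores, rejected_scores, strict=True))
    return wins / len(chosen_scores)

# e.g. pairwise_accuracy([1.3, 0.2, 2.1], [0.9, 0.7, 1.0]) == 2 / 3
```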

Datasets

All datasets are available on HuggingFace:

| Dataset | Description | Samples |
|---|---|---|
| Themis-CodeRewardBench | Code RM evaluation benchmark: 5 quality dimensions, 8 languages, 19 source subsets | 8,866 |
| Themis-CodePreference | Training data for the PM stage: code preferences across 5 criteria and 8 languages | 354,010 |
| Themis-GeneralPreference | Training data for the PT stage: general-domain and code retrieval preferences | 110,598 |
| Themis-Git-Commits-Merged | Single-file commits from merged PRs across 24 languages (intermediate, pre-classification) | ~8M |
| Themis-Git-Commits | Raw mined single-file commits from permissively licensed repos (full unfiltered pool) | ~28M |
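Any of these can be pulled with the `datasets` library. The org namespace and split names below are placeholders, so check the HuggingFace Hub pages for the exact repository IDs.

```python
from datasets import load_dataset

# Repository IDs and splits are placeholders; look up the exact
# paths on the HuggingFace Hub before running.
bench = load_dataset("Themis/Themis-CodeRewardBench", split="test")
prefs = load_dataset("Themis/Themis-CodePreference", split="train")

print(bench[0])  # inspect the preference-pair schema
```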

Related Work

Distributed Training Tutorial — A companion tutorial that walks through multi-node distributed training of scalar reward models on cloud GPU clusters. It covers cluster provisioning, high-speed networking, container management, and FSDP-based training, and works as a standalone guide for anyone looking to reproduce the Themis training setup or adapt it to their own reward-modelling workloads. The tutorial follows a simplified recipe that uses the Axolotl framework to train reward models with the Bradley-Terry loss.

Citation

```bibtex
@article{themis2025,
  title={Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring},
  author={Paul, Indraneil and Gurevych, Iryna and Glava\v{s}, Goran},
  journal={arXiv preprint arXiv:2605.00754},
  year={2025}
}
```