ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence Paper • 2605.26340 • Published 20 days ago • 36
Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback Paper • 2606.06113 • Published 10 days ago • 13
MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection Paper • 2605.30288 • Published 16 days ago • 22
RewardHarness: Self-Evolving Agentic Post-Training Paper • 2605.08703 • Published May 9 • 10 • 4
ClawBench Collection Benchmark dataset (V1+V2), live leaderboard Space, and full V1 execution traces: everything you need to run, regrade, or compare on ClawBench. • 5 items • Updated May 12
ClawBench - Browser Agent Benchmark Suite Collection Benchmark dataset (V1+V2), live leaderboard Space, and full V1 execution traces: everything you need to run, regrade, or compare on ClawBench. • 5 items • Updated May 12 • 1