WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models Paper • 2604.18224 • Published 4 days ago • 21
DR³-Eval: Towards Realistic and Reproducible Deep Research Evaluation Paper • 2604.14683 • Published 8 days ago • 35
Seedance 2.0: Advancing Video Generation for World Complexity Paper • 2604.14148 • Published 9 days ago • 151
From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models Paper • 2604.09459 • Published 11 days ago • 13
CutClaw: Agentic Hours-Long Video Editing via Music Synchronization Paper • 2603.29664 • Published 24 days ago • 48
UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation Paper • 2603.23500 • Published about 1 month ago • 35
InCoder-32B: Code Foundation Model for Industrial Scenarios Paper • 2603.16790 • Published Mar 17 • 308
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants Paper • 2603.09652 • Published Mar 10 • 15
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing Paper • 2603.09877 • Published Mar 10 • 48
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale Paper • 2602.23866 • Published Feb 27 • 88
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization Paper • 2602.22675 • Published Feb 26 • 23
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks Paper • 2602.12670 • Published Feb 13 • 60
REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents Paper • 2602.14234 • Published Feb 15 • 27
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies Paper • 2602.09514 • Published Feb 10 • 11
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters Paper • 2602.10604 • Published Feb 11 • 196