Self-Taught Self-Correction for Small Language Models Paper • 2503.08681 • Published Mar 11, 2025 • 15
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards Paper • 2605.14539 • Published 6 days ago • 4