Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures Paper • 2510.24081 • Published Oct 28 • 16
Good Intentions Beyond ACL: Who Does NLP for Social Good, and Where? Paper • 2510.04434 • Published Oct 6 • 5
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search Paper • 2509.25454 • Published Sep 29 • 140
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning Paper • 2509.22576 • Published Sep 26 • 134
The Majority is not always right: RL training for solution aggregation Paper • 2509.06870 • Published Sep 8 • 16
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification Paper • 2508.05629 • Published Aug 7 • 180
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training Paper • 2508.00414 • Published Aug 1 • 93
Phi-Ground Tech Report: Advancing Perception in GUI Grounding Paper • 2507.23779 • Published Jul 31 • 44
view article Article Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face +3 Jul 29 • 202
AGUVIS: Unified Pure Vision GUI Agents Collection https://aguvis-project.github.io • 3 items • Updated Dec 20, 2024 • 7
Through the Valley: Path to Effective Long CoT Training for Small Language Models Paper • 2506.07712 • Published Jun 9 • 18
HardTests: Synthesizing High-Quality Test Cases for LLM Coding Paper • 2505.24098 • Published May 30 • 43
DataDecide Collection A suite of models, data, and evals over 25 corpora, 14 sizes, and 3 seeds to measure how accurately small experiments predict rankings at large scale. • 358 items • Updated 10 days ago • 21
Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models Paper • 2505.20236 • Published May 26 • 2
The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models Paper • 2505.18497 • Published May 24 • 2