Robot Critics that Sweat the Small Stuff

Columbia University, Toyota Research Institute

Robot Critics that Sweat the Small Stuff can steer policies towards success.

Abstract

Large vision-language models contain several priors about the world and object interactions, making them useful critics during inference to steer robot policies towards success. However, closed-loop robot manipulation requires judging small visual differences between success and failure, which remains a challenge for current VLMs. We introduce a method to fine-tune critics by constructing pairwise progress supervision using success and failure rollouts obtained from a policy. Our fine-tuned critic excels at fine-grained progress reasoning and subtle failure detection, outperforming prior progress reasoning baselines. Additionally, we use an action-conditioned video model to predict the visual effect of several candidate actions sampled from a policy, and show that our critic can correctly identify successful candidates to execute, improving the average policy success rate by 11% across real-world tasks and 5.9% across simulation tasks.

How It Works

A brief explanation of Robot Critics that Sweat the Small Stuff.

Method

We fine-tune VLM critics for fine-grained progress/failure discrimination by generating paired success and failure rollouts from matched initial conditions, which lets us construct a fine-grained progress-comparison dataset. This only requires binary success/failure labels at the end of each trajectory, and naturally captures the failure modes of the deployed policy.
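The pairing idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released pipeline: the `ProgressPair` record, the two helper functions, the `max_gap` and `start` parameters, and the labeling rules (a later frame of a successful rollout counts as more progress; at the same timestep, the successful rollout counts as more progress than its matched failure) are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ProgressPair:
    frame_a: str  # identifier of the first frame (e.g. an image path)
    frame_b: str  # identifier of the second frame
    label: int    # 1 if frame_b shows more progress than frame_a, else 0


def within_success_pairs(success: Sequence[str], max_gap: int,
                         rng: random.Random) -> List[ProgressPair]:
    """Assumed rule: in a successful rollout, a later frame has more progress.

    Small values of max_gap (>= 1) give fine-grained comparisons.
    """
    pairs = []
    for t in range(len(success) - 1):
        gap = rng.randint(1, min(max_gap, len(success) - 1 - t))
        earlier, later = success[t], success[t + gap]
        # Randomize presentation order so the label is not always 1.
        if rng.random() < 0.5:
            pairs.append(ProgressPair(earlier, later, 1))
        else:
            pairs.append(ProgressPair(later, earlier, 0))
    return pairs


def cross_rollout_pairs(success: Sequence[str], failure: Sequence[str],
                        start: int, rng: random.Random) -> List[ProgressPair]:
    """Assumed rule: from timestep `start` on, the successful rollout is ahead
    of the failed rollout that began from the same initial condition."""
    pairs = []
    for t in range(start, min(len(success), len(failure))):
        if rng.random() < 0.5:
            pairs.append(ProgressPair(failure[t], success[t], 1))
        else:
            pairs.append(ProgressPair(success[t], failure[t], 0))
    return pairs
```

Only the end-of-trajectory success/failure label is used to decide which rollout is `success` and which is `failure`; no per-frame annotation is needed in this sketch.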

Real World Policy Rollouts Steered by Our Critic

Successful real-world rollouts showing how our critic steers the policy at test time.
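Test-time steering of this kind can be pictured as best-of-N selection: sample candidate actions from the base policy, predict each outcome with the action-conditioned video model, and execute the candidate the critic prefers. The sketch below assumes hypothetical `predict_future` and `progress_score` callables standing in for the video model and the critic, and assumes the critic returns a scalar score; the actual selection rule may differ.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Obs = TypeVar("Obs")
Action = TypeVar("Action")


def select_action(
    obs: Obs,
    candidates: Sequence[Action],
    predict_future: Callable[[Obs, Action], Obs],  # hypothetical video model
    progress_score: Callable[[Obs, Obs], float],   # hypothetical critic
) -> Tuple[Action, List[float]]:
    """Return the candidate whose predicted outcome the critic scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    scores = [progress_score(obs, predict_future(obs, a)) for a in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Toy usage: 1-D state, the goal is at 1.0, and an action is a displacement.
action, scores = select_action(
    0.0,
    [0.5, 1.0, -0.2],
    predict_future=lambda o, a: o + a,
    progress_score=lambda o, future: -abs(1.0 - future),
)
```

In the toy usage the critic is a distance-to-goal stand-in; in practice it would compare the current observation with the predicted frames.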

Ours vs. Diffusion Policy

With our critic in the loop (right), the policy recovers from subtle errors that cause the Diffusion Policy baseline (left) to fail.

Diffusion Policy: fails

Ours (critic-guided): succeeds

Quantitative Results

Policy Results

Task                     Diffusion Policy (Base)   Diffusion Policy (Ours)
Stacking                 23 / 50                   32 / 50
Push-Bowl                0.140 mIoU                0.321 mIoU
Pickup Lego              11 / 50                   16 / 50
PickPlace Lego To Bowl   6 / 50                    8 / 50

Baseline policies vs. our critic-guided policies on real-world and RoboCasa365 simulation tasks. The critic selects candidate proposals from the base policy that avoid failures, improving task success. For RoboCasa365, entries report mean success rate ± standard error.
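As a quick check on the real-world rows, the snippet below converts the success counts in the table into percentage-point gains. Push-Bowl is left out because it is reported in mIoU rather than as a success count. The resulting mean is consistent with the roughly 11% average real-world improvement quoted in the abstract, though whether the abstract's figure was computed exactly this way is an assumption.

```python
# (base successes, critic-guided successes, trials), copied from the table.
results = {
    "Stacking": (23, 32, 50),
    "Pickup Lego": (11, 16, 50),
    "PickPlace Lego To Bowl": (6, 8, 50),
}

# Gain in success rate, in percentage points, for each task.
gains = {
    task: 100 * (ours - base) / trials
    for task, (base, ours, trials) in results.items()
}
mean_gain = sum(gains.values()) / len(gains)

for task, gain in gains.items():
    print(f"{task}: +{gain:.1f} points")
print(f"mean: +{mean_gain:.1f} points")  # 18, 10, and 4 points average to ~10.7
```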

Critic Accuracy on RoboCasa

Method Cntr→Cab Cab→Cntr Cab→Micro Micro→Cab Cntr→Sink Sink→Cntr Cntr→Stove Stove→Cntr Avg.
Overall Accuracy
Qwen2.5-32B 0.495 0.508 0.502 0.489 0.502 0.491 0.512 0.490 0.499
Gemini 3 0.706 0.578 0.428 0.498 0.659 0.681 0.398 0.424 0.547
GVL 0.528 0.462 0.584 0.527 0.576 0.543 0.538 0.486 0.531
ROVER 0.707 0.677 0.798 0.758 0.727 0.657 0.710 0.737 0.721
ProgressLM 0.206 0.498 0.709 0.533 0.850 0.597 0.774 0.715 0.610
Ours (Success-Only) 0.927 0.876 0.874 0.892 0.892 0.858 0.886 0.885 0.886
Ours (Full Method) 0.966 0.931 0.908 0.942 0.965 0.918 0.916 0.943 0.936
OOD-Only Accuracy
ProgressLM 0.200 0.495 0.704 0.598 0.843 0.502 0.770 0.680 0.599
Ours (Success-Only) 0.940 0.851 0.829 0.872 0.894 0.792 0.897 0.896 0.871
Ours (Full Method) 0.970 0.909 0.863 0.916 0.958 0.870 0.903 0.924 0.914

Progress/failure detection accuracy across RoboCasa pick-and-place tasks (chance = 0.50). Overall accuracy is measured on both in-distribution and out-of-distribution (OOD) objects; OOD-only accuracy is restricted to OOD objects. Fine-tuning on both success and failure data (Ours, Full Method) outperforms success-only fine-tuning and all baselines.

Coarse vs. Fine-grained Critic Performance

Prior VLM critics degrade sharply when judging fine-grained visual differences (small frame intervals), while large-interval (coarse) differences are easy. Fine-tuning for fine-grained progress recognition keeps our critic strong in the subtle regime that practical policy guidance requires.
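A coarse-versus-fine breakdown like this can be computed by bucketing pairwise comparisons by the frame interval between the two compared frames. The sketch below is illustrative only: the `(frame_gap, is_correct)` record format and the bucket edges are assumptions, and the toy records are made up rather than taken from the evaluation.

```python
from bisect import bisect_left
from typing import Iterable, List, Optional, Sequence, Tuple


def accuracy_by_gap(
    records: Iterable[Tuple[int, bool]],
    edges: Sequence[int],
) -> List[Optional[float]]:
    """Accuracy per frame-gap bucket.

    `edges` are ascending inclusive upper bounds; a final open-ended bucket
    collects gaps larger than the last edge. Empty buckets yield None.
    """
    totals = [0] * (len(edges) + 1)
    hits = [0] * (len(edges) + 1)
    for gap, correct in records:
        i = bisect_left(edges, gap)  # first bucket whose upper edge >= gap
        totals[i] += 1
        hits[i] += int(correct)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Made-up records: (frame gap between the compared frames, critic was right?)
toy_records = [(1, True), (2, False), (5, True), (10, True), (40, True)]
per_bucket = accuracy_by_gap(toy_records, edges=[2, 10])
```

Here the first bucket (gaps of at most 2 frames) corresponds to the fine-grained regime and the last bucket to the coarse regime.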

Acknowledgments

We thank Matei Ciocarlie and Huy Ha for key discussions. This research is based on work partially supported by the Toyota Research Institute and NSF awards #2046910 and #2132519. This work is also partially supported by the funds provided by the National Science Foundation and by DoD OUSD (R&E) under Cooperative Agreement PHY-2229929 (The NSF AI Institute for Artificial and Natural Intelligence).

BibTeX

@misc{sudhakar2026robotcriticssweatsmall,
      title={Robot Critics that Sweat the Small Stuff}, 
      author={Sruthi Sudhakar and Junbang Liang and Sreehari Rammohan and Pavel Tokmakov and Richard Zemel and Carl Vondrick},
      year={2026},
      eprint={2606.21572},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.21572}, 
}