Robot Critics that Sweat the Small Stuff

Sruthi Sudhakar¹,², Junbang Liang¹, Sreehari Rammohan¹, Pavel Tokmakov², Richard Zemel, Carl Vondrick

¹Columbia University, ²Toyota Research Institute

Robot Critics that Sweat the Small Stuff can steer policies towards success.

Abstract

Large vision-language models contain several priors about the world and object interactions, making them useful critics during inference to steer robot policies towards success. However, closed-loop robot manipulation requires judging small visual differences between success and failure, which remains a challenge for current VLMs. We introduce a method to fine-tune critics by constructing pairwise progress supervision using success and failure rollouts obtained from a policy. Our fine-tuned critic excels at fine-grained progress reasoning and subtle failure detection, outperforming prior progress reasoning baselines. Additionally, we use an action-conditioned video model to predict the visual effect of several candidate actions sampled from a policy, and show that our critic can correctly identify successful candidates to execute, improving the average policy success rate by 11% across real-world tasks and 5.9% across simulation tasks.
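The test-time steering described in the abstract can be read as a best-of-N selection loop: sample candidate actions, imagine each outcome, and execute the one the critic rates highest. The sketch below is a minimal illustration of that idea, not the authors' implementation. The names `select_action`, `predict_future`, and `score_progress` are hypothetical, and the toy 1-D world model and critic are stand-ins for the action-conditioned video model and the fine-tuned VLM critic.

```python
from typing import Callable, Sequence, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")


def select_action(
    obs: Obs,
    candidates: Sequence[Act],
    predict_future: Callable[[Obs, Act], Obs],
    score_progress: Callable[[Obs, Obs], float],
) -> Act:
    """Pick the candidate whose predicted outcome the critic scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    # Imagine the outcome of each candidate, then let the critic score it.
    scores = [score_progress(obs, predict_future(obs, a)) for a in candidates]
    # max() returns the first maximal index, so ties go to the earliest candidate.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Toy stand-ins (assumptions for illustration): a 1-D position, goal at 1.0.
GOAL = 1.0


def toy_predict(pos: float, action: float) -> float:
    """Stand-in for the video model: the action simply shifts the position."""
    return pos + action


def toy_score(before: float, after: float) -> float:
    """Stand-in for the critic: how much closer to the goal the outcome is."""
    return abs(GOAL - before) - abs(GOAL - after)


chosen = select_action(0.0, [-0.2, 0.3, 0.9], toy_predict, toy_score)
```

In this toy setup the third candidate moves closest to the goal, so it is the one selected. In a real system the candidates would come from the base policy and the scoring call would query the critic on predicted frames.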

How It Works

A brief explanation of Robot Critics that Sweat the Small Stuff.

Method

We fine-tune VLM critics for fine-grained progress/failure discrimination by generating paired success and failure rollouts from matched initial conditions, which lets us construct a fine-grained progress-comparison dataset. This only requires binary success/failure labels at the end of each trajectory, and naturally captures the failure modes of the deployed policy.
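One way to picture how paired rollouts turn into progress-comparison examples is sketched below. The labeling rules are assumptions made for illustration and are not confirmed by the text above: within a successful rollout the later frame is taken to show more progress, and the final frame of the successful rollout is taken to be ahead of the final frame of the failed rollout that started from the same initial condition. `Rollout`, `ProgressPair`, and `build_pairs` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Rollout:
    frames: Sequence[str]  # frame identifiers (e.g. image paths) in time order
    success: bool          # the only label needed: binary outcome at episode end


@dataclass(frozen=True)
class ProgressPair:
    frame_a: str
    frame_b: str
    more_progress: str     # "a" or "b": which frame shows more task progress


def build_pairs(success: Rollout, failure: Rollout, interval: int = 1) -> List[ProgressPair]:
    """Build comparison pairs from two rollouts with matched initial conditions."""
    if not success.success or failure.success:
        raise ValueError("expected one successful and one failed rollout")
    if interval < 1:
        raise ValueError("interval must be at least 1")
    pairs: List[ProgressPair] = []
    frames = success.frames
    # Assumed rule 1: in a successful rollout, the later frame is further along.
    # A small interval yields the fine-grained comparisons.
    for t in range(len(frames) - interval):
        pairs.append(ProgressPair(frames[t], frames[t + interval], "b"))
    # Assumed rule 2: the successful episode ends ahead of the failed one.
    if frames and failure.frames:
        pairs.append(ProgressPair(frames[-1], failure.frames[-1], "a"))
    return pairs


pairs = build_pairs(
    Rollout(["s0", "s1", "s2", "s3"], success=True),
    Rollout(["f0", "f1", "f2"], success=False),
    interval=1,
)
```

The frame names here are placeholders. A real dataset would also likely shuffle which frame appears first, so that the critic cannot rely on position alone.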

Real World Policy Rollouts Steered by Our Critic

Successful real-world rollouts showing how our critic steers the policy at test time.

Ours vs. Diffusion Policy

With our critic in the loop (right), the policy recovers from subtle errors that cause the Diffusion Policy baseline (left) to fail.

Diffusion Policy — fails

Ours (critic-guided) — succeeds

Quantitative Results

Policy Results

Diffusion Policy (real-world tasks)
Task	Base	Ours
Stacking	23 / 50	32 / 50
Push-Bowl	0.140 mIoU	0.321 mIoU
Pickup Lego	11 / 50	16 / 50
PickPlace Lego To Bowl	6 / 50	8 / 50

Gr00t (RoboCasa365 simulation tasks)
Task	Base	Ours
NavigateKitchen	0.040 ± 0.009	0.040 ± 0.009
CloseBlenderLid	0.072 ± 0.016	0.013 ± 0.007
CloseFridge	0.772 ± 0.031	0.863 ± 0.014
CloseToasterOvenDoor	0.356 ± 0.017	0.623 ± 0.032
CoffeeSetupMug	0.160 ± 0.028	0.184 ± 0.031
OpenCabinet	0.500 ± 0.025	0.552 ± 0.019
OpenDrawer	0.544 ± 0.010	0.648 ± 0.019
OpenStandMixerHead	0.588 ± 0.023	0.713 ± 0.015
PickPlaceCounterToCabinet	0.760 ± 0.017	0.752 ± 0.030
PickPlaceCounterToStove	0.696 ± 0.015	0.568 ± 0.020
PickPlaceDrawerToCounter	0.116 ± 0.021	0.180 ± 0.021
PickPlaceSinkToCounter	0.888 ± 0.022	0.757 ± 0.022
PickPlaceToasterToCounter	0.780 ± 0.019	0.729 ± 0.022
SlideDishwasherRack	0.300 ± 0.021	0.530 ± 0.018
TurnOffStove	0.104 ± 0.021	0.203 ± 0.017
TurnOnElectricKettle	0.348 ± 0.016	0.620 ± 0.032
TurnOnMicrowave	0.384 ± 0.025	0.427 ± 0.029
TurnOnSinkFaucet	0.440 ± 0.017	0.500 ± 0.013
Average	0.436 ± 0.020	0.495 ± 0.021

Baseline policies vs. our critic-guided policies on real-world and RoboCasa365 simulation tasks. The critic selects candidate proposals from the base policy that avoid failures, improving task success. For RoboCasa365, entries report mean success rate ± standard error.
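The headline improvements quoted in the abstract can be reproduced from the tables above. The short check below assumes that the real-world figure averages the three count-based tasks and leaves out Push-Bowl, which is reported in mIoU rather than as a success count. That exclusion is an assumption, not something the page states.

```python
# Real-world successes out of 50 trials, copied from the Diffusion Policy table.
TRIALS = 50
real_world = {
    "Stacking": (23, 32),
    "Pickup Lego": (11, 16),
    "PickPlace Lego To Bowl": (6, 8),
}

# Per-task gain in success rate (absolute, as a fraction of trials).
gains = [(ours - base) / TRIALS for base, ours in real_world.values()]
real_gain = sum(gains) / len(gains)  # about 0.107, which rounds to 11 points

# Simulation gain taken from the "Average" row of the Gr00t table.
sim_base, sim_ours = 0.436, 0.495
sim_gain = sim_ours - sim_base       # about 0.059, i.e. 5.9 points
```

Under that assumption both numbers match the abstract, and both are absolute differences in success rate (percentage points) rather than relative improvements.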

Critic Accuracy on RoboCasa

Method	Cntr→Cab	Cab→Cntr	Cab→Micro	Micro→Cab	Cntr→Sink	Sink→Cntr	Cntr→Stove	Stove→Cntr	Avg.
Overall Accuracy
Qwen2.5-32B	0.495	0.508	0.502	0.489	0.502	0.491	0.512	0.490	0.499
Gemini 3	0.706	0.578	0.428	0.498	0.659	0.681	0.398	0.424	0.547
GVL	0.528	0.462	0.584	0.527	0.576	0.543	0.538	0.486	0.531
ROVER	0.707	0.677	0.798	0.758	0.727	0.657	0.710	0.737	0.721
ProgressLM	0.206	0.498	0.709	0.533	0.850	0.597	0.774	0.715	0.610
Ours (Success-Only)	0.927	0.876	0.874	0.892	0.892	0.858	0.886	0.885	0.886
Ours (Full Method)	0.966	0.931	0.908	0.942	0.965	0.918	0.916	0.943	0.936
OOD-Only Accuracy
ProgressLM	0.200	0.495	0.704	0.598	0.843	0.502	0.770	0.680	0.599
Ours (Success-Only)	0.940	0.851	0.829	0.872	0.894	0.792	0.897	0.896	0.871
Ours (Full Method)	0.970	0.909	0.863	0.916	0.958	0.870	0.903	0.924	0.914

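Progress/failure detection accuracy across RoboCasa pick-and-place tasks (chance = 0.50). Overall accuracy is measured on both in-distribution and out-of-distribution (OOD) objects; OOD-only accuracy is reported separately. Column abbreviations: Cntr = counter, Cab = cabinet, Micro = microwave. Fine-tuning on both success and failure data (Ours, Full Method) outperforms the success-only variant and all baselines.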

Coarse vs. Fine-grained Critic Performance

Prior VLM critics handle coarse differences (large frame intervals) easily but degrade sharply when judging fine-grained visual differences (small frame intervals). Fine-tuning for fine-grained progress recognition keeps our critic accurate in this subtle regime, which is the one that practical policy guidance requires.
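A coarse-versus-fine comparison of this kind amounts to grouping critic judgments by the frame interval between the two compared frames and reporting accuracy per group. The helper below is a minimal sketch of that bookkeeping. `accuracy_by_interval` is a hypothetical name, and the records are toy values chosen for illustration, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_interval(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Group (frame_interval, critic_was_correct) records and average each group."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for interval, was_correct in records:
        total[interval] += 1
        correct[interval] += int(was_correct)
    # Sort by interval so the fine-grained (small interval) buckets come first.
    return {k: correct[k] / total[k] for k in sorted(total)}


# Toy records (made up): interval 1 is fine-grained, interval 8 is coarse.
toy_records = [(1, True), (1, False), (8, True), (8, True)]
per_interval = accuracy_by_interval(toy_records)
```

With these toy records the fine-grained bucket sits at chance (0.5) while the coarse bucket is perfect, which mirrors the qualitative pattern described for prior critics.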

Acknowledgments

We thank Matei Ciocarlie and Huy Ha for key discussions. This research is based on work partially supported by the Toyota Research Institute and NSF awards #2046910 and #2132519. This work is also partially supported by the funds provided by the National Science Foundation and by DoD OUSD (R&E) under Cooperative Agreement PHY-2229929 (The NSF AI Institute for Artificial and Natural Intelligence).

BibTeX

@misc{sudhakar2026robotcriticssweatsmall,
      title={Robot Critics that Sweat the Small Stuff}, 
      author={Sruthi Sudhakar and Junbang Liang and Sreehari Rammohan and Pavel Tokmakov and Richard Zemel and Carl Vondrick},
      year={2026},
      eprint={2606.21572},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.21572}, 
}