SCAN: Self-Denoising Monte Carlo Annotation
for Robust Process Reward Learning

Soochow University   ▶ Tencent

We propose Self-Denoising Monte Carlo Annotation (SCAN),
an efficient Process Reward Model (PRM) data synthesis and noise-tolerant learning framework.

Background

  • Process Reward Model: Process reward models (PRMs) offer fine-grained, step-level evaluations that facilitate deeper reasoning processes in large language models (LLMs), proving effective in complex tasks like mathematical reasoning.
  • Data Scaling Bottleneck: The development of PRMs is constrained by the high cost and limited scalability of human annotations. Synthetic data generated via Monte Carlo (MC) estimation offers a scalable alternative, but its high noise ratio often leads to overfitting and hinders effective large-scale training (a sketch of this annotation scheme appears at the end of this section).
  • Preliminary Study of Noise Distribution

    The figure above illustrates the noise distribution of Monte Carlo estimation, where \( t_{pred} \) denotes the annotated error location (label) and \( t_{true} \) denotes the ground-truth error location.

    Here we list two important observations on the noise distribution:

  • False Positive Noise: For predicted positive samples (i.e., \( t_{pred} = \infty \)), the false-positive (noise) ratio is significantly lower among high self-confidence samples (left and middle columns), making them more suitable for training.
  • Inaccurate Negative Noise: The annotator model can roughly identify error locations but often overestimates them, i.e., \(t_{pred} > t_{true}\). The number of noisy samples decreases as the deviation increases (right column).
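
To make the noise source concrete, the following is a minimal sketch of one common Monte Carlo annotation scheme under which such noise arises; the rollout and judge callables and the rollout count k are hypothetical placeholders, and the exact procedure used here may differ.

    import math
    from typing import Callable, List

    def mc_annotate(question: str,
                    steps: List[str],
                    rollout: Callable[[str, List[str]], str],   # completes a solution from a step prefix
                    judge: Callable[[str], bool],               # checks the final answer of a completion
                    k: int = 8) -> float:
        """For each step prefix, sample k completions and mark the step correct
        if any completion reaches the reference answer; the first step with no
        successful completion is the annotated error position t_pred, and a
        response with no such step is annotated as fully correct (t_pred = inf)."""
        for t in range(len(steps)):
            hits = sum(judge(rollout(question, steps[: t + 1])) for _ in range(k))
            if hits == 0:
                return float(t)      # annotated error location t_pred
        return math.inf              # predicted positive: no error found

Noise enters exactly as described above: a strong completer can recover from an earlier mistake, so the error is annotated later than its true position (\(t_{pred} > t_{true}\)) or missed entirely (a false positive).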
Method Overview

Building on these insights, we propose the SCAN framework, which consists of two modules: (1) an efficient data synthesis framework that substantially reduces inference cost, and (2) robust training methods that mitigate the high noise ratio in synthetic data and enable robust learning with noisy labels.

  • Data Synthesis: SCAN first estimates the self-confidence of the annotator model \(\pi\) on a given question \(q_i\): $$ SC_{\pi}(q_i) = \frac{1}{N} \sum\limits_{j=1}^{N} \mathcal{J}(r_{i}^{(j)}, a_i),\quad \text{where } r_i^{(j)} \sim \pi(\cdot \mid q_i), $$ where \(\mathcal{J}(r_{i}^{(j)}, a_i)\) evaluates the correctness of the generated response \(r_i^{(j)}\) against the reference answer \(a_i\). As the preliminary study shows, positive samples in high self-confidence regions contain minimal noise, so we use them directly as positive training examples (see the first sketch after this list).
  • Robust Learning: The term \(SC_{\pi}(q)\) denotes the self-confidence score of the completer model \(\pi\) for question \(q\). We then train PRMs with reweighted step labels: $$ \begin{aligned} \mathcal{L}_{\text{SCAN}}(\theta) &= -\mathbb{E}_{(\mathbf{x}_{\leq t}, \hat{y}_t) \sim D_{\text{final}}}\left[\hat{y}_t\log P_{\theta}(y_t \mid q, \mathbf{x}_{\leq t}) + (1 - \hat{y}_t)\log\left(1 - P_{\theta}(y_t \mid q, \mathbf{x}_{\leq t})\right)\right], \\ \hat{y}_{t} &= \begin{cases} \min\left(c_t / SC_{\pi}(q),\, 1\right), & \text{if } t_{pred} - t \leq d \\ \mathbb{I}(c_t > 0), & \text{otherwise} \end{cases}, \quad\text{where } c_t = P_{\pi}(y_t = \text{correct} \mid q, \mathbf{x}_{\leq t}). \end{aligned} $$ The completer model tends to overestimate the correctness of the current step due to its strong self-correction capability. As errors continue to accumulate, the model eventually makes mistakes, leading to \(t_{pred} > t_{true}\), with a high probability that the error is annotated near its true position (from the preliminary study). To enable more robust learning with these noisy labels, we propose a noise-tolerant labeling strategy that applies soft labels to steps preceding the annotated error, within a tolerance distance \(d\) (see the second sketch below).
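
As a minimal sketch of the self-confidence estimate \(SC_{\pi}(q)\) used in the Data Synthesis step, the snippet below samples n full responses and averages their correctness; the generate and judge callables, the sample count, and the 0.75 cut-off for "high self-confidence" are illustrative assumptions rather than SCAN's exact settings.

    from typing import Callable

    def self_confidence(question: str,
                        reference_answer: str,
                        generate: Callable[[str], str],       # samples one response from the annotator model
                        judge: Callable[[str, str], bool],    # J(r, a): is the response's final answer correct?
                        n: int = 8) -> float:
        """SC_pi(q): fraction of n sampled responses judged correct."""
        hits = sum(judge(generate(question), reference_answer) for _ in range(n))
        return hits / n

    def keep_as_positive(sc_q: float, threshold: float = 0.75) -> bool:
        """Fully correct responses to high self-confidence questions carry
        little false-positive noise, so they can be kept directly as positive
        training examples (the threshold here is only illustrative)."""
        return sc_q >= threshold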
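
The reweighted label \(\hat{y}_t\) itself reduces to a few lines. In this sketch the tolerance distance d = 2 and the small epsilon guarding the division are assumptions for illustration; the case split is applied exactly as written above.

    from typing import List

    def noise_tolerant_labels(step_correct_probs: List[float],
                              sc_q: float,
                              t_pred: int,
                              d: int = 2) -> List[float]:
        """Reweighted step labels following the case split above.

        step_correct_probs : c_t = P_pi(y_t = correct | q, x_<=t) for each step t
        sc_q               : completer self-confidence SC_pi(q) on question q
        t_pred             : MC-annotated (first) error position
        d                  : tolerance distance around the annotated error
        """
        labels = []
        for t, c_t in enumerate(step_correct_probs):
            if t_pred - t <= d:                                 # within d steps of the annotated error
                labels.append(min(c_t / max(sc_q, 1e-8), 1.0))  # soft label
            else:                                               # far before it: hard label I(c_t > 0)
                labels.append(1.0 if c_t > 0 else 0.0)
        return labels

The resulting fractional targets can be plugged into a standard binary cross-entropy loss (for instance, torch.nn.functional.binary_cross_entropy accepts targets in [0, 1]), which matches the \(\mathcal{L}_{\text{SCAN}}\) objective above.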
Dataset

With SCAN, we construct two datasets using lightweight models:
  • SCAN-Base (101K samples generated by a 1.5B model): link
  • SCAN-Pro (197K samples generated by multiple models up to 7B): link
  • Full Huggingface Collections (Datasets with Models): link
Performance of SCAN

We evaluate the effectiveness of the Process Reward Model (PRM) from two key perspectives:
  • Best-of-N (BoN) Evaluation: In this evaluation, the PRM functions as a verifier to select the best response from multiple candidate answers generated by a policy model.
  • Step-wise Error Detection: We use ProcessBench as the evaluation benchmark, which measures the PRM's capability to identify the first error location in a given response (both evaluation settings are sketched below).
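
Both evaluation settings can be illustrated roughly as follows; prm_score, the aggregation choices, and the 0.5 threshold are hypothetical placeholders, not necessarily the exact protocol used in the experiments.

    from typing import Callable, List

    def best_of_n(question: str,
                  candidates: List[List[str]],                   # each candidate solution as a list of steps
                  prm_score: Callable[[str, List[str]], float],  # PRM score for the last step of a prefix
                  aggregate: str = "min") -> int:
        """Best-of-N selection: score every step of each candidate with the PRM
        and return the index of the candidate with the highest aggregated score."""
        best_idx, best = 0, float("-inf")
        for i, steps in enumerate(candidates):
            scores = [prm_score(question, steps[: t + 1]) for t in range(len(steps))]
            if not scores:
                continue
            agg = {"min": min(scores),
                   "mean": sum(scores) / len(scores),
                   "last": scores[-1]}[aggregate]
            if agg > best:
                best_idx, best = i, agg
        return best_idx

    def first_error(step_scores: List[float], threshold: float = 0.5) -> int:
        """ProcessBench-style detection: index of the first step whose PRM score
        falls below the threshold, or -1 if every step looks correct."""
        for t, s in enumerate(step_scores):
            if s < threshold:
                return t
        return -1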
BibTeX

    @article{ding2025scan,
      title={SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning},
      author={Ding, Yuyang and Shi, Xinyu and Li, Juntao and Liang, Xiaobo and Tu, Zhaopeng and Zhang, Min},
      journal={arXiv preprint arXiv:2509.16548},
      year={2025}
    }