SCAN: Self-Denoising Monte Carlo Annotation
for Robust Process Reward Learning

Soochow University   ▶ Tencent

We propose Self-Denoising Monte Carlo Annotation (SCAN),
an efficient Process Reward Model (PRM) data synthesis and noise-tolerant learning framework.

Background

  • Process Reward Model: Process reward models (PRMs) offer fine-grained, step-level evaluations that facilitate deeper reasoning processes in large language models (LLMs), proving effective in complex tasks like mathematical reasoning.
  • Data Scaling Bottleneck: The development of PRMs is constrained by the high cost and limited scalability of human annotations. Synthetic data generated via Monte Carlo (MC) estimation offers a scalable alternative, but its high noise ratio often leads to overfitting and hinders effective large-scale training (a sketch of this annotation scheme appears at the end of this section).
  • Preliminary Study of Noise Distribution

    The figure above illustrates the noise distribution of Monte Carlo estimation, where \( t_{pred} \) denotes the annotated error location (label) and \( t_{true} \) denotes the ground-truth error location.

    Here we list two important observations on the noise distribution:

  • False Positive Noise: For predicted positive samples (i.e., \( t_{pred} = \infty \)), the false-positive (noise) ratio is significantly lower among high self-confidence samples (left and middle columns), making them more suitable for training.
  • Inaccurate Negative Noise: The annotator model can roughly identify error locations but often overestimates them, i.e., \(t_{pred} > t_{true}\). The number of noisy samples decreases as the deviation increases (right column).
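
To make the noise source concrete, the following is a minimal sketch of one common Monte Carlo annotation scheme under which such noise arises; the rollout and judge callables and the rollout count k are hypothetical placeholders, and the exact procedure used here may differ.

    import math
    from typing import Callable, List

    def mc_annotate(question: str,
                    steps: List[str],
                    rollout: Callable[[str, List[str]], str],   # completes a solution from a step prefix
                    judge: Callable[[str], bool],               # checks the final answer of a completion
                    k: int = 8) -> float:
        """For each step prefix, sample k completions and mark the step correct
        if any completion reaches the reference answer; the first step with no
        successful completion is the annotated error position t_pred, and a
        response with no such step is annotated as fully correct (t_pred = inf)."""
        for t in range(len(steps)):
            hits = sum(judge(rollout(question, steps[: t + 1])) for _ in range(k))
            if hits == 0:
                return float(t)      # annotated error location t_pred
        return math.inf              # predicted positive: no error found

Noise enters exactly as described above: a strong completer can recover from an earlier mistake, so the error is annotated later than its true position (\(t_{pred} > t_{true}\)) or missed entirely (a false positive).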
Method Overview

Building on these insights, we propose the SCAN framework, which consists of two modules: (1) an efficient data synthesis framework that substantially reduces inference cost, and (2) robust training methods that mitigate the high noise ratio in synthetic data and enable robust learning with noisy labels.

  • Data Synthesis: SCAN first estimates the self-confidence of the annotator model \(\pi\) on a given question \(q_i\): $$ SC_{\pi}(q_i) = \frac{1}{N} \sum\limits_{j=1}^{N} \mathcal{J}(r_{i}^{(j)}, a_i),\quad \text{where } r_i^{(j)} \sim \pi(\cdot \mid q_i), $$ where \(\mathcal{J}(r_{i}^{(j)}, a_i)\) evaluates the correctness of the generated response \(r_i^{(j)}\) against the reference answer \(a_i\). As the preliminary study shows, positive samples in high self-confidence regions contain minimal noise, so we use them directly as positive training examples (see the first sketch after this list).
  • Robust Learning: The term \(SC_{\pi}(q)\) denotes the self-confidence score of the completer model \(\pi\) for question \(q\). We then train PRMs with reweighted step labels: $$ \begin{aligned} \mathcal{L}_{\text{SCAN}}(\theta) &= -\mathbb{E}_{(\mathbf{x}_{\leq t}, \hat{y}_t) \sim D_{\text{final}}}\left[\hat{y}_t\log P_{\theta}(y_t \mid q, \mathbf{x}_{\leq t}) + (1 - \hat{y}_t)\log\left(1 - P_{\theta}(y_t \mid q, \mathbf{x}_{\leq t})\right)\right], \\ \hat{y}_{t} &= \begin{cases} \min\left(c_t / SC_{\pi}(q),\, 1\right), & \text{if } t_{pred} - t \leq d \\ \mathbb{I}(c_t > 0), & \text{otherwise} \end{cases}, \quad\text{where } c_t = P_{\pi}(y_t = \text{correct} \mid q, \mathbf{x}_{\leq t}). \end{aligned} $$ The completer model tends to overestimate the correctness of the current step due to its strong self-correction capability. As errors continue to accumulate, the model eventually makes mistakes, leading to \(t_{pred} > t_{true}\), with a high probability that the error is annotated near its true position (from the preliminary study). To enable more robust learning with these noisy labels, we propose a noise-tolerant labeling strategy that applies soft labels to steps preceding the annotated error, within a tolerance distance \(d\) (see the second sketch below).
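
As a minimal sketch of the self-confidence estimate \(SC_{\pi}(q)\) used in the Data Synthesis step, the snippet below samples n full responses and averages their correctness; the generate and judge callables, the sample count, and the 0.75 cut-off for "high self-confidence" are illustrative assumptions rather than SCAN's exact settings.

    from typing import Callable

    def self_confidence(question: str,
                        reference_answer: str,
                        generate: Callable[[str], str],       # samples one response from the annotator model
                        judge: Callable[[str, str], bool],    # J(r, a): is the response's final answer correct?
                        n: int = 8) -> float:
        """SC_pi(q): fraction of n sampled responses judged correct."""
        hits = sum(judge(generate(question), reference_answer) for _ in range(n))
        return hits / n

    def keep_as_positive(sc_q: float, threshold: float = 0.75) -> bool:
        """Fully correct responses to high self-confidence questions carry
        little false-positive noise, so they can be kept directly as positive
        training examples (the threshold here is only illustrative)."""
        return sc_q >= threshold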
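
The reweighted label \(\hat{y}_t\) itself reduces to a few lines. In this sketch the tolerance distance d = 2 and the small epsilon guarding the division are assumptions for illustration; the case split is applied exactly as written above.

    from typing import List

    def noise_tolerant_labels(step_correct_probs: List[float],
                              sc_q: float,
                              t_pred: int,
                              d: int = 2) -> List[float]:
        """Reweighted step labels following the case split above.

        step_correct_probs : c_t = P_pi(y_t = correct | q, x_<=t) for each step t
        sc_q               : completer self-confidence SC_pi(q) on question q
        t_pred             : MC-annotated (first) error position
        d                  : tolerance distance around the annotated error
        """
        labels = []
        for t, c_t in enumerate(step_correct_probs):
            if t_pred - t <= d:                                 # within d steps of the annotated error
                labels.append(min(c_t / max(sc_q, 1e-8), 1.0))  # soft label
            else:                                               # far before it: hard label I(c_t > 0)
                labels.append(1.0 if c_t > 0 else 0.0)
        return labels

The resulting fractional targets can be plugged into a standard binary cross-entropy loss (for instance, torch.nn.functional.binary_cross_entropy accepts targets in [0, 1]), which matches the \(\mathcal{L}_{\text{SCAN}}\) objective above.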
Dataset

With SCAN, we construct two datasets using lightweight models:
  • SCAN-Base (101K samples generated by a 1.5B model): link
  • SCAN-Pro (197K samples generated by multiple models up to 7B): link
  • Full Huggingface Collections (Datasets with Models): link
Performance of SCAN

We evaluate the effectiveness of the Process Reward Model (PRM) from two key perspectives:
  • Best-of-N (BoN) Evaluation: In this evaluation, the PRM functions as a verifier to select the best response from multiple candidate answers generated by a policy model.
  • Step-wise Error Detection: We use ProcessBench as the evaluation benchmark, which measures the PRM's capability to identify the first error location in a given response (both evaluation settings are sketched below).
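
Both evaluation settings can be illustrated roughly as follows; prm_score, the aggregation choices, and the 0.5 threshold are hypothetical placeholders, not necessarily the exact protocol used in the experiments.

    from typing import Callable, List

    def best_of_n(question: str,
                  candidates: List[List[str]],                   # each candidate solution as a list of steps
                  prm_score: Callable[[str, List[str]], float],  # PRM score for the last step of a prefix
                  aggregate: str = "min") -> int:
        """Best-of-N selection: score every step of each candidate with the PRM
        and return the index of the candidate with the highest aggregated score."""
        best_idx, best = 0, float("-inf")
        for i, steps in enumerate(candidates):
            scores = [prm_score(question, steps[: t + 1]) for t in range(len(steps))]
            if not scores:
                continue
            agg = {"min": min(scores),
                   "mean": sum(scores) / len(scores),
                   "last": scores[-1]}[aggregate]
            if agg > best:
                best_idx, best = i, agg
        return best_idx

    def first_error(step_scores: List[float], threshold: float = 0.5) -> int:
        """ProcessBench-style detection: index of the first step whose PRM score
        falls below the threshold, or -1 if every step looks correct."""
        for t, s in enumerate(step_scores):
            if s < threshold:
                return t
        return -1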
BibTeX

    @article{ding2025scan,
      title={SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning},
      author={Ding, Yuyang and Shi, Xinyu and Li, Juntao and Liang, Xiaobo and Tu, Zhaopeng and Zhang, Min},
      journal={arXiv preprint arXiv:2509.16548},
      year={2025}
    }