MM-PRM: An open implementation of OmegaPRM and its corresponding training pipeline

Introduction

Multimodal Large Language Models (MLLMs) have shown promising performance on various reasoning tasks, yet their ability to reliably solve complex multi-step problems remains limited. Process Reward Models (PRMs) address this challenge by explicitly evaluating the correctness of each intermediate reasoning step, guiding models toward more robust solutions. However, effectively training PRMs typically requires substantial amounts of step-level supervision data, which are expensive and challenging to obtain.

In this work, we provide a complete implementation of OmegaPRM, an automated Monte Carlo Tree Search-based data pipeline, to generate scalable and high-quality multimodal step-level supervision data. Using this pipeline, we introduce a multimodal PRM based on the InternVL series.

Our contributions include the open implementation of OmegaPRM and the corresponding PRM training pipeline. We hope this implementation serves as a practical foundation, supporting and stimulating future research in multimodal reasoning models. We release our code and models at https://github.com/ModalMinds/MM-PRM.

Methodology

OmegaPRM

To achieve greater scalability and efficiency in collecting process supervision data, we employ **OmegaPRM** as our data pipeline. OmegaPRM utilizes a divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm to automate the generation of high-quality supervision data without human intervention. Specifically, the data generation for each sample involves three iterative stages: select, binary search, and maintain.

During the select stage, OmegaPRM prioritizes and selects the most valuable rollout from a pool of existing partial solutions, based on heuristic statistics such as Monte Carlo estimates of correctness and rollout length. Rollouts are chosen specifically to surface confident mistakes: incorrect continuations from intermediate states that the model otherwise completes correctly with high probability.
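
As a rough illustration, the sketch below scores candidate rollouts in the spirit of this stage: a value term that grows with the state's Monte Carlo correctness (a wrong rollout from a usually-correct state is a confident mistake) and shrinks with rollout length, plus a visit-count exploration bonus. The constants and the `pool` layout are illustrative assumptions, not the exact heuristics used in this repository.

```python
import math

def selection_score(mc, rollout_len, visits, total_visits,
                    alpha=0.5, beta=0.9, length_scale=500, c_puct=0.125):
    """Illustrative selection heuristic: prefer wrong rollouts from states the model
    usually completes correctly (high MC) and shorter rollouts, with an exploration bonus."""
    value = (alpha ** (1.0 - mc)) * (beta ** (rollout_len / length_scale))
    explore = c_puct * math.sqrt(total_visits) / (1.0 + visits)
    return value + explore

def select_rollout(pool):
    """pool: list of dicts with 'mc', 'length', and 'visits' for each candidate rollout."""
    total_visits = sum(c["visits"] for c in pool) or 1
    return max(pool, key=lambda c: selection_score(c["mc"], c["length"],
                                                   c["visits"], total_visits))
```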

In the subsequent binary search stage, the algorithm efficiently locates the first incorrect step within the selected rollout by iteratively dividing the sequence of reasoning steps and performing Monte Carlo rollouts. This approach reduces the complexity of error localization significantly, from linear to logarithmic with respect to the number of steps.
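
Concretely, this is a "first failing prefix" binary search. The sketch below assumes a caller-supplied `mc_estimate(prefix)` that runs a handful of policy rollouts from the given prefix and returns the fraction reaching the correct answer; the helper name and the monotonicity assumption are illustrative, not an API of this repository.

```python
from typing import Callable, List

def locate_first_error(steps: List[str],
                       mc_estimate: Callable[[List[str]], float]) -> int:
    """Find the 0-based index of the first incorrect step of a rollout whose final
    answer is wrong, using O(log n) Monte Carlo evaluations instead of O(n).
    Assumes the MC correctness estimate is (roughly) non-increasing along the prefix."""
    lo, hi = 1, len(steps)                  # candidate prefix lengths
    while lo < hi:
        mid = (lo + hi) // 2
        if mc_estimate(steps[:mid]) > 0:    # some completion from this prefix still succeeds,
            lo = mid + 1                    # so the first error lies after step `mid`
        else:
            hi = mid                        # no completion succeeds: error is at or before `mid`
    return lo - 1

# Toy check with a synthetic estimator: the first three steps are fine, step 4 breaks.
fake_mc = lambda prefix: 0.6 if len(prefix) <= 3 else 0.0
assert locate_first_error(["s1", "s2", "s3", "s4", "s5"], fake_mc) == 3
```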

Finally, the maintain stage updates the statistical information stored within the Monte Carlo tree, including state visit counts, correctness estimations, and heuristic scores used for future selections. This continuous update ensures balanced and high-quality supervision data.
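
A minimal sketch of the bookkeeping this implies, assuming a flat dictionary keyed by solution prefix (the actual tree structure and fields in this repository may differ): each visited state accumulates rollout outcomes, from which its Monte Carlo correctness estimate, and hence its future selection score, is recomputed.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class StateStats:
    """Per-state statistics kept in the Monte Carlo tree (illustrative fields)."""
    visits: int = 0            # how often this state has been selected for expansion
    correct_rollouts: int = 0  # rollouts from this state that reached the correct answer
    total_rollouts: int = 0

    @property
    def mc(self) -> float:
        """Monte Carlo estimate of reaching a correct final answer from this state."""
        return self.correct_rollouts / self.total_rollouts if self.total_rollouts else 0.0

def maintain(tree: Dict[str, StateStats], prefix_key: str,
             num_correct: int, num_total: int) -> None:
    """Fold the outcome of the latest rollouts back into the tree statistics."""
    node = tree.setdefault(prefix_key, StateStats())
    node.visits += 1
    node.correct_rollouts += num_correct
    node.total_rollouts += num_total
```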

Starting from a seed dataset consisting of approximately 10k K-12 mathematical problems, OmegaPRM efficiently expands this data into over 1M high-quality process supervision annotations.

Illustration of the data pipeline from OmegaPRM’s original paper. Each iteration involves three stages: Select (choosing the most valuable rollout), Binary Search (efficiently locating the first erroneous step), and Maintain (updating statistics for future selections).

PRM Formulation

To train the Process Reward Model (PRM), we concatenate each math problem with its corresponding multi-step solution into a single input sequence. A special token is inserted immediately after each reasoning step, and at these positions the model is trained to predict the correctness of the preceding step by outputting 'Yes' or 'No'. The softmax probability assigned to 'Yes' at each of these positions is used as the final step-level prediction.
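
For illustration, the sketch below scores a text-only example with a generic Hugging Face causal LM: a placeholder separator token is appended after each step, and the softmax over the 'Yes'/'No' logits at those positions gives the per-step score. The separator spelling, the 'Yes'/'No' token lookup, and the text-only loading path are assumptions made for the sketch; the released PRM is built on InternVL and also consumes the image.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STEP_TOKEN = "<step>"  # hypothetical separator; assumed to already be in the PRM's vocabulary

def score_steps(model, tokenizer, problem: str, steps: list[str]) -> list[float]:
    """Return P('Yes') for each reasoning step, read off at the step-separator positions."""
    text = problem + "\n" + "".join(step + STEP_TOKEN for step in steps)
    enc = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**enc).logits[0]              # (seq_len, vocab_size)

    step_id = tokenizer.convert_tokens_to_ids(STEP_TOKEN)
    yes_id = tokenizer.convert_tokens_to_ids("Yes")  # exact spellings depend on the tokenizer
    no_id = tokenizer.convert_tokens_to_ids("No")

    scores = []
    positions = (enc["input_ids"][0] == step_id).nonzero(as_tuple=True)[0]
    for pos in positions:
        pair_logits = logits[pos, [yes_id, no_id]]   # next-token logits at the separator
        scores.append(torch.softmax(pair_logits, dim=-1)[0].item())
    return scores

# Usage (checkpoint path is a placeholder):
# tokenizer = AutoTokenizer.from_pretrained("path/to/prm")
# model = AutoModelForCausalLM.from_pretrained("path/to/prm")
# score_steps(model, tokenizer, "Compute 2 + 3 * 4.", ["3 * 4 = 12.", "2 + 12 = 14."])
```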

Experiments

Training Pipeline

Our training pipeline consists of three main stages: policy model construction, process supervision data generation, and Process Reward Model (PRM) training.