2.8m Gmail.txt Info

: The model is tested on subsets ranging from 200k to 2.8 million samples.

: Qwen2.5-VL-72B-Instruct is used as the judge model for calculating visual rewards during training [11]. 4. Experimental Results

: Uses 22k data pairs focusing on textual accuracy ( 2.8M GMAIL.txt

: Uses 11k pairs with a balance of textual and visual rewards (

: The SFT stage requires 60 hours of training on 16 H800 GPUs . The RL stages take an additional 34 hours on 24 H800 GPUs [11]. : The model is tested on subsets ranging from 200k to 2

To break the plateau, the authors implement a two-stage Reinforcement Learning (RL) process [11].

) used in the RL stages or the used to measure the success of the 2.8M dataset? Experimental Results : Uses 22k data pairs focusing

: Increasing data from 2M to 2.8M results in no further performance gains, confirming the plateau [22]. Multimodal Structured Reinforcement Learning (MSRL) :