Abstract
Prior works in multi-objective reinforcement learning typically use linear reward scalarization with fixed weights, which provably fails to capture non-convex Pareto fronts and thus yields suboptimal results. This limitation becomes especially critical in online preference alignment for large language models, where stochastic trajectories generated by parameterized policies induce highly non-linear and non-convex mappings from parameters to objectives, for which no single static weighting scheme can find optimal trade-offs.
We address this limitation by introducing dynamic reward weighting, which adaptively adjusts reward weights during the online reinforcement learning process. Unlike existing approaches that rely on fixed-weight interpolation, our dynamic weighting continuously balances and prioritizes objectives in training, facilitating effective exploration of Pareto fronts in objective space.
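To make the setup concrete, here is a minimal sketch of how dynamic reward weighting slots into an online RL loop: per-objective rewards are scalarized with the current weights for the policy update, and the weights themselves are revised at every step. The policy interface and helper names (e.g., `policy.generate`, `update_weights`) are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch of an online RL loop with dynamic reward weighting.
# All helper names are hypothetical placeholders; the weight-update rule is left
# abstract and could be instantiated by either of the two approaches described next.
import numpy as np

def scalarize(rewards: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine per-objective rewards of shape (batch, num_objectives) into scalars."""
    return rewards @ weights

def train_online(policy, reward_fns, update_weights, num_steps):
    num_objectives = len(reward_fns)
    weights = np.full(num_objectives, 1.0 / num_objectives)  # start from uniform weights
    for step in range(num_steps):
        prompts = policy.sample_prompts()
        responses = policy.generate(prompts)
        # Per-objective rewards, e.g. accuracy, conciseness, clarity.
        rewards = np.stack([fn(prompts, responses) for fn in reward_fns], axis=-1)
        policy.rl_step(prompts, responses, scalarize(rewards, weights))  # GRPO / REINFORCE / RLOO
        weights = update_weights(weights, rewards, step)  # dynamic re-weighting hook
    return policy
```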
We introduce two approaches of increasing sophistication and generalizability: (1) hypervolume-guided weight adaptation and (2) gradient-based weight optimization, offering a versatile toolkit for online multi-objective alignment. Our extensive experiments demonstrate their compatibility with commonly used online reinforcement learning algorithms (including GRPO, REINFORCE, and RLOO), effectiveness across multiple mathematical reasoning datasets, and applicability to different model families, consistently achieving Pareto dominant solutions with fewer training steps than fixed-weight linear scalarization baselines.
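As one plausible, simplified reading of the hypervolume-guided variant, the sketch below nudges the weights toward objectives whose improvement would add the most hypervolume to the current front. The two-objective hypervolume routine, the perturbation size `eps`, and the mixing rate `lr` are assumptions chosen to keep the example short and exact; they are not taken from the paper.

```python
# Sketch of hypervolume-guided weight adaptation (assumed instantiation, not the
# paper's implementation): increase the weight of objectives whose improvement
# would add the most hypervolume to the current Pareto front.
import numpy as np

def hypervolume_2d(points: np.ndarray, ref: np.ndarray) -> float:
    """Area dominated by `points` relative to `ref` for two maximized objectives."""
    pts = points[np.all(points > ref, axis=1)]
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]      # sweep from the largest first objective
    area, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                     # only non-dominated strips add area
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area

def hypervolume_guided_update(weights, front, ref, eps=0.05, lr=0.5):
    """Shift weight toward objectives whose small improvement adds the most hypervolume."""
    base = hypervolume_2d(front, ref)
    gains = np.array([
        hypervolume_2d(front + eps * np.eye(front.shape[1])[j], ref) - base
        for j in range(front.shape[1])
    ])
    target = gains / (gains.sum() + 1e-8)  # normalize marginal gains onto the simplex
    new_w = (1.0 - lr) * weights + lr * target
    return new_w / new_w.sum()
```

In the paper's three-objective setting (accuracy, conciseness, clarity), the same idea carries over by swapping in an exact 3-D hypervolume computation, for example from an off-the-shelf indicator such as pymoo's.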

Objectives exhibit varying convergence rates. Validation performance of individual objectives during training: each objective saturates at a different training stage. For example, under the conciseness-only weight configuration, the model achieves optimal response brevity at approximately step 165 (b), whereas the clarity-only configuration reaches the best clarity after step 240 (c).

Pareto fronts obtained from hypervolume-guided weight adaptation compared to fixed-weight baselines. Hypervolume-guided weighting outperforms baselines across most objectives, weight configurations, and RL algorithms and, in certain cases, demonstrates full dominance with superior results on all three objectives.

Pareto fronts obtained from gradient-based weight optimization compared to fixed-weight baselines. The gradient-based weighting approach yields clearly superior Pareto fronts compared to all baselines under both GRPO and REINFORCE training setups, and achieves improved performance on two key objectives (conciseness and clarity) under RLOO training.

First row: Pareto fronts of the Qwen3-8B model trained on MATH algebra problems using GRPO. Second row: Pareto fronts of the Deepseek-7B model trained on the Math500 dataset using GRPO. Consistent with our main experimental results, both of our methods achieve superior trade-offs between accuracy, response length, and clarity compared to the baseline, with gradient-based weighting showing the best overall performance.

Learning dynamics differ across objectives. Under our gradient-based weighting method, the weight for conciseness rapidly converges to approximately 0.2, with the freed-up weight shifted mostly to accuracy. This aligns with our intuition that accuracy is a more challenging objective that requires continual learning. The higher weights for accuracy and clarity compared to conciseness also support our preliminary finding that the accuracy and clarity objectives are highly intertwined in the optimization process, effectively playing similar roles in model updates, while conciseness acts orthogonally.
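The "intertwined vs. orthogonal" picture can be made concrete via the Gram matrix of per-objective policy gradients. Below is a small illustrative stand-in, not the paper's actual update rule: an MGDA-style Frank-Wolfe solve that picks simplex weights minimizing the norm of the combined gradient, where the Gram matrix it relies on directly measures which objectives' gradients are aligned and which are orthogonal.

```python
# Illustrative MGDA-style weight solve (assumed stand-in, not the paper's update rule):
# choose simplex weights w minimizing || sum_j w_j g_j ||^2, where g_j are flattened
# per-objective policy gradients. The resulting combined gradient, if nonzero, is a
# common ascent direction for all objectives; the Gram matrix captures gradient alignment.
import numpy as np

def mgda_style_weights(grads: np.ndarray, iters: int = 200) -> np.ndarray:
    """grads: array of shape (num_objectives, num_params)."""
    m = grads.shape[0]
    gram = grads @ grads.T               # pairwise gradient inner products (alignment)
    w = np.full(m, 1.0 / m)              # start from uniform weights on the simplex
    for t in range(iters):               # Frank-Wolfe iterations
        j = int(np.argmin(gram @ w))     # simplex vertex minimizing the linearization
        gamma = 2.0 / (t + 2.0)          # standard diminishing step size
        w *= (1.0 - gamma)
        w[j] += gamma
    return w
```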
BibTeX
@misc{lu2025learningoptimizemultiobjectivealignment,
      title={Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting},
      author={Yining Lu and Zilong Wang and Shiyang Li and Xin Liu and Changlong Yu and Qingyu Yin and Zhan Shi and Zixuan Zhang and Meng Jiang},
      year={2025},
      eprint={2509.11452},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.11452},
}