ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Shengzhuang Chen, Xu Ouyang, Michael Arthur Leopold Pearce, Thomas Hartvigsen, Jonathan Richard Schwarz
📧 ftp8nr@virginia.edu,  📧 shengzhuang.chen@thomsonreuters.com,  📧 Michaelp@graphcore.ai,  📧 hartvigsen@virginia.edu,  📧 jonathan.schwarz@thomsonreuters.com
1University of Virginia,  2Thomson Reuters Foundational Research,  3Imperial College London,  4Graphcore
In Progress

*Indicates Equal Contribution

Abstract

Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a reliable solution. In this work, we propose to view the selection of training data mixtures as a black-box hyperparameter optimization problem, for which Bayesian Optimization provides a well-established family of algorithms.
Firstly, we cast data mixture learning as a sequential decision-making problem, in which we aim to find a suitable trade-off between the computational cost of training exploratory (proxy-) models and final mixture performance.
Secondly, we systematically explore the properties of transferring mixtures learned at a small scale to larger-scale experiments, providing insights and highlighting opportunities for research at a modest scale. By proposing Multi-fidelity Bayesian Optimization as a suitable method in this common scenario, we introduce a natural framework to balance experiment cost with model fit, avoiding the risks of overfitting to smaller scales while minimizing the number of experiments at high cost.
We present results for pre-training and instruction fine-tuning across models ranging from 1 million to 7 billion parameters, spanning simple architectures to state-of-the-art models, on benchmarks covering dozens of datasets. We demonstrate consistently strong results relative to a wide range of baselines, achieving speed-ups of over 500% in determining the best data mixture on our largest experiments. In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training & evaluation runs worth over 13,000 GPU hours, greatly reducing the cost of conducting research in this area. Finally, we highlight rich opportunities for future research, helping bridge the gap towards a comprehensive understanding of the broader effects of training data on model generalization.
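To make the notion of a data mixture concrete, the snippet below is a minimal, self-contained illustration (not the paper's code): a mixture is a weight vector on the simplex over training domains, and training examples are drawn in proportion to those weights. The domain names and the `sample_batch` helper are hypothetical.

```python
# Minimal illustration of a "data mixture": a weight vector over training domains
# that controls how often each domain is sampled during training.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy domains; in practice each would be a large corpus.
domains = {
    "code": ["def f(): ...", "class A: ..."],
    "math": ["2 + 2 = 4", "d/dx x^2 = 2x"],
    "web":  ["an example web document", "another web document"],
}

def sample_batch(weights, batch_size):
    """Pick a domain according to the mixture weights, then pick an example
    uniformly from that domain."""
    names = list(weights)
    probs = np.array([weights[n] for n in names], dtype=float)
    probs /= probs.sum()  # project onto the simplex
    picked = rng.choice(names, size=batch_size, p=probs)
    return [domains[d][rng.integers(len(domains[d]))] for d in picked]

# One candidate mixture; ADMIRE-BayesOpt searches over such weight vectors.
batch = sample_batch({"code": 0.5, "math": 0.3, "web": 0.2}, batch_size=8)
```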

Diagram of the ADMIRE-BayesOpt method

An overview of our method. We model the contribution of training domains to a target evaluation with a Gaussian Process. By finding the maximum of an acquisition function that provides a numeric trade-off between exploration and exploitation, ADMIRE-BayesOpt rapidly finds mixtures that outperform common practices such as random exploration.
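Below is a minimal sketch of such a Bayesian optimization loop, written with scikit-learn rather than the authors' implementation. `train_proxy_and_eval` is a hypothetical placeholder for training a proxy model on a candidate mixture and returning its validation loss, and the Dirichlet candidate pool and Expected Improvement acquisition are illustrative choices, not necessarily those used in ADMIRE-BayesOpt.

```python
# Sketch of a Gaussian-Process Bayesian optimization loop over mixture weights.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
n_domains = 5

def train_proxy_and_eval(w):
    # Placeholder objective; in practice this is an expensive proxy-model training run.
    target = np.array([0.4, 0.1, 0.2, 0.2, 0.1])
    return float(np.sum((w - target) ** 2))

def expected_improvement(mu, sigma, best, xi=0.01):
    # EI for minimization: expected improvement over the best loss observed so far.
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu - xi) / sigma
    return (best - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Seed with a few random mixtures drawn from a Dirichlet (i.e., on the simplex).
X = rng.dirichlet(np.ones(n_domains), size=4)
y = np.array([train_proxy_and_eval(w) for w in X])

for step in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    # Maximize the acquisition over a large pool of candidate mixtures.
    candidates = rng.dirichlet(np.ones(n_domains), size=2048)
    mu, sigma = gp.predict(candidates, return_std=True)
    w_next = candidates[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.vstack([X, w_next])
    y = np.append(y, train_proxy_and_eval(w_next))

print("best mixture found:", X[np.argmin(y)], "loss:", y.min())
```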

Violin plots of validation performance, panels (a) and (b).

Results obtained by running our data mixture optimization pipeline. (a) ADMIRE-BO compared to RegMix on the Tülu 3 SFT dataset; shown is the performance of a Qwen-2.5 7B model trained on a discovered mixture. (b) Experiment scheduling and performance for ADMIRE-MFBO; experimental runs are broadly divided into three phases.
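The sketch below illustrates the multi-fidelity idea in deliberately simplified form, assuming just two fidelities (cheap proxy-scale runs and expensive target-scale runs) that share a single Gaussian Process, with candidates scored by expected improvement per unit cost. This is an illustration, not the ADMIRE-MFBO acquisition itself; `run_at_scale` and the cost values are hypothetical.

```python
# Simplified multi-fidelity sketch: one GP over (mixture weights, fidelity),
# with candidate/fidelity pairs scored by expected improvement per unit cost.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
n_domains = 5
fidelities = (0.0, 1.0)            # 0 = proxy scale, 1 = target scale
costs = {0.0: 1.0, 1.0: 20.0}      # hypothetical relative training costs

def run_at_scale(w, s):
    # Placeholder: proxy runs (s=0) are noisy estimates of target runs (s=1).
    target = np.array([0.4, 0.1, 0.2, 0.2, 0.1])
    return float(np.sum((w - target) ** 2) + (1 - s) * rng.normal(0, 0.02))

# Seed with cheap proxy-scale evaluations only; the last input column is the fidelity.
W = rng.dirichlet(np.ones(n_domains), size=6)
X = np.hstack([W, np.zeros((6, 1))])
y = np.array([run_at_scale(x[:-1], x[-1]) for x in X])

for step in range(15):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    cand_w = rng.dirichlet(np.ones(n_domains), size=1024)
    has_target = (X[:, -1] == 1.0).any()
    best = y[X[:, -1] == 1.0].min() if has_target else y.min()
    scores, choices = [], []
    for s in fidelities:
        cand = np.hstack([cand_w, np.full((len(cand_w), 1), s)])
        mu, sd = gp.predict(cand, return_std=True)
        z = (best - mu) / np.maximum(sd, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)
        i = int(np.argmax(ei))
        scores.append(ei[i] / costs[s])   # expected improvement per unit cost
        choices.append(cand[i])
    x_next = choices[int(np.argmax(scores))]
    X = np.vstack([X, x_next])
    y = np.append(y, run_at_scale(x_next[:-1], x_next[-1]))
```

In this simplified setup, the cost-aware score naturally schedules many cheap proxy-scale runs early and reserves expensive target-scale runs for promising mixtures, mirroring the phased behaviour shown in panel (b).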

BibTeX

@misc{chen2025admirebayesoptaccelerateddatamixture,
      title={ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization}, 
      author={Shengzhuang Chen and Xu Ouyang and Michael Arthur Leopold Pearce and Thomas Hartvigsen and Jonathan Richard Schwarz},
      year={2025},
      eprint={2508.11551},
      archivePrefix={arXiv},
      primaryClass={stat.ML},
      url={https://arxiv.org/abs/2508.11551}, 
}