In this section, we discuss the growing capabilities of large language models (LLMs) and the challenges we face in reliably evaluating their long-form outputs. Recent studies indicate that these models, after being trained on a variety of tasks, can adapt to follow new human instructions. This adaptability makes them promising candidates for automatically rating model outputs. While human evaluation is valuable for understanding model performance, it often suffers from subjectivity, variability across raters, and the high cost of large-scale assessments. To ensure that our LLM autoraters align with human preferences, it is vital to train them on human judgments. However, gathering these judgments can be both expensive and time-consuming.
Although we can collect existing human evaluations from previous research, doing so raises issues such as a lack of standardization, varying evaluation criteria, insufficient documentation, and concerns about data privacy and proprietary information. Training autoraters on model outputs, on the other hand, can provide consistency, but it also risks reinforcing biases and inaccuracies and may breach the terms of use of proprietary LLM services.
To tackle these challenges, we have curated and standardized human evaluations from earlier studies to create FLAME, a collection of 102 quality assessment tasks comprising more than 5.3 million human judgments. FLAME covers a wide range of tasks, from evaluating machine translation quality to assessing how well AI assistants follow user instructions. We believe that training on this extensive and varied dataset will help LLM autoraters learn strong, generalized patterns of human judgment, reducing the influence of noisy or low-quality evaluations.
For transparency and reproducibility, we only use publicly available human evaluation data that comes with permissive licenses from prior studies. To address the difficulties in collecting such data, which often lacks standardization and documentation, we carefully reviewed the related research and consulted with the original authors to clarify any ambiguities, spending several hours on each dataset. We train our LLM autoraters through supervised multitask fine-tuning using our curated data. Following the unified task format inspired by T5, we convert all our quality assessment tasks into a text-to-text format complete with manually crafted task definitions and evaluation instructions.
Each training example is structured as an input-target pair, where the input provides task-specific context and the target contains the expected human evaluation. This format enables effective transfer learning across tasks and lets our models interpret inputs and produce evaluations in a consistent way. Our approach aims to develop general-purpose LLM autoraters capable of handling a variety of quality assessment tasks. We demonstrate that training an instruction-tuned LLM, specifically PaLM 2 24B, on our FLAME collection significantly enhances its ability to generalize across tasks, outperforming models like GPT-4, Claude 3, and LLaMA 3 in many instances. This indicates that large-scale multitask instruction tuning effectively equips the model with versatile quality assessment skills.
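To make this input-target structure concrete, here is a minimal Python sketch of how a single human judgment could be rendered into a text-to-text pair; the field names, template wording, and rating format are illustrative assumptions rather than the exact FLAME format.

```python
# Minimal sketch of rendering one human judgment into a text-to-text pair.
# Field names and template wording are illustrative, not the exact FLAME format.

def build_example(task_definition: str, evaluation_instruction: str,
                  context: str, response: str, human_rating: str) -> dict:
    """Render one quality assessment record as an input-target pair."""
    input_text = (
        f"Task: {task_definition}\n"
        f"Instructions: {evaluation_instruction}\n"
        f"Context: {context}\n"
        f"Response to evaluate: {response}"
    )
    # The target is simply the expected human evaluation, verbalized as text.
    target_text = human_rating
    return {"input": input_text, "target": target_text}


example = build_example(
    task_definition="Evaluate the coherence of a machine-generated summary.",
    evaluation_instruction="Rate coherence on a 1-5 scale, where 5 is fully coherent.",
    context="Article: The city council voted on Tuesday to expand the bike-lane network...",
    response="Summary: The council approved more bike lanes after a Tuesday vote.",
    human_rating="Coherence: 5",
)
print(example["input"])
print("->", example["target"])
```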
Encouraged by these findings, we further explore the benefits of using FLAME as a strong starting point for fine-tuning on specific downstream applications, using reward modeling evaluation as a case study. We fine-tune FLAME for just 50 steps on a mixture of four datasets containing human pairwise preference judgments covering areas like chat, reasoning, and safety. The resulting FLAME IRM 24B model shows a notable improvement on RewardBench, raising overall accuracy from 86.0% to 87.8%. It stands out as the top-performing generative model trained solely on permissively licensed data, surpassing both GPT-4 0125 and GPT-4o. Additionally, we introduce FLAME OPTM, which optimizes our FLAME multitask mixture for targeted reward modeling evaluation using a novel fine-tuning strategy.
We analyze how each dataset affects the targeted RewardBench distributions, allowing us to identify the best proportions of individual datasets in our mixture. After fine-tuning the initial instruction-tuned PaLM 2 24B checkpoint on this optimized mixture for 5,000 steps, we achieve competitive performance on RewardBench, reaching 87.0% accuracy while using roughly 25 times fewer training data points than FLAME.
Overall, our FLAME variants outperform all of the popular proprietary LLM-as-a-judge models we evaluated on 8 of 12 autorater benchmarks, spanning 53 quality assessment tasks including RewardBench and LLM-AggreFact. We also investigate potential biases in our LLM autoraters, a common concern with LLM-as-a-judge models, and their usefulness in AI development, particularly for identifying high-quality model responses. Our analysis shows that FLAME variants exhibit significantly less bias than popular LLM-as-a-judge models on the CoBBLEr autorater bias benchmark, demonstrating greater robustness to variations in pairwise ordering, response length, and irrelevant context. Furthermore, we find that FLAME effectively ranks LLM responses to Python programming prompts in the HumanEval benchmark, improving the pass rate by 6 to 10% across different settings.
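As an illustration of this response-ranking use case, the following Python sketch shows how an autorater's scores could be used to pick the best of several candidate programs and measure the resulting pass rate; the scoring and unit-test functions are hypothetical stand-ins, not the evaluation pipeline used in our experiments.

```python
# Sketch of best-response selection with an autorater, under assumed interfaces.
# `autorater_score` and `passes_unit_tests` are hypothetical stand-ins.
from typing import Callable

def rerank_pass_rate(prompts_with_candidates: list[tuple[str, list[str]]],
                     autorater_score: Callable[[str, str], float],
                     passes_unit_tests: Callable[[str, str], bool]) -> float:
    """Select the highest-scoring candidate per prompt and report the pass rate."""
    passed = 0
    for prompt, candidates in prompts_with_candidates:
        best = max(candidates, key=lambda c: autorater_score(prompt, c))
        passed += passes_unit_tests(prompt, best)
    return passed / len(prompts_with_candidates)

# Toy usage with dummy scorers (replace with a real autorater and test harness).
demo = [("Write f() returning 2.", ["def f(): return 1", "def f(): return 2"])]
print(rerank_pass_rate(demo,
                       autorater_score=lambda p, c: float(len(c)),
                       passes_unit_tests=lambda p, c: c.endswith("2")))
```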
In summary, our main contributions include the FLAME collection of 102 standardized quality assessment tasks with over 5.3 million human judgments, the FLAME, FLAME IRM, and FLAME OPTM autoraters trained on this collection, and analyses of autorater bias and of FLAME's usefulness for identifying high-quality model responses.
In this section, we discuss existing literature related to autoraters and how it connects to our work with FLAME. First, we look at automatic evaluation metrics. Traditional metrics like BLEU and ROUGE evaluate how much the output from a model overlaps with human-written references. With the introduction of BERT, new methods have emerged that utilize pre-trained models to assess how similar the distributions of words are or to calculate the probabilities of specific tokens. Some research has focused on statistical techniques to measure how different two text distributions are, while other studies have fine-tuned pre-trained models based on human ratings to develop automatic evaluation metrics for various tasks such as machine translation, text summarization, question answering, and text simplification. In contrast to these task-specific metrics, FLAME is trained on a wide range of detailed quality assessment tasks and can adapt to new tasks during inference.
Next, we explore the use of large language models (LLMs) as judges in autorating. With the rise of models like ChatGPT, recent studies have employed LLMs to evaluate model outputs across different benchmarks, including AlpacaEval, MT-Bench, and WildBench. However, LLM-as-a-judge autoraters often show a preference for their own generated responses and display biases related to factors like length, order, and preference for certain entities. Our models, in contrast, are trained on a large and varied set of human evaluations, enabling them to learn less biased, generalized patterns of human judgment. Unlike LLM-as-a-judge autoraters, our models do not evaluate their own outputs, which helps eliminate self-preference bias.
We also consider recent efforts to create general-purpose LLM autoraters. For instance, TIGERScore is a model based on LLaMA 2 that has been trained on error analysis data generated by GPT-4 across several tasks, including summarization and translation. Other similar models include InstructScore, Prometheus, and Prometheus 2. Our approach, however, relies exclusively on open-source human evaluations rather than model outputs. We demonstrate that FLAME significantly outperforms Prometheus 2 in our RewardBench experiments.
Our work is also related to the development of reward models (RMs) that align LLMs with human preferences through reinforcement learning from human feedback (RLHF). In RLHF, human preference data can either train standalone discriminative RMs or be integrated directly into LLM training using algorithms like DPO or SLiC. While we assess our models as RMs in our RewardBench experiments, there are important differences. RMs are typically trained on pairwise preference data, while our models are trained on a variety of task types in a unified format. Additionally, RMs produce an overall preference score, whereas our models can be prompted to evaluate specific aspects of model responses, such as safety.
We have created the FLAME collection, a diverse mixture of standardized human evaluations comprising 102 tasks and 5.3 million human judgments, and we train our autoraters by fine-tuning instruction-tuned LLMs on it. The collection is carefully curated to cover a wide range of LLM capabilities. We have manually developed task definitions and evaluation instructions, ensuring that all tasks are formatted consistently in a text-to-text style. When we refer to a task, we mean a specific assignment for the model: we present a piece of text, such as a machine-generated summary, along with its context, like the original article, and instruct the model to evaluate certain aspects of the text based on given criteria. Each task has its own definitions and evaluation guidelines, and different tasks can be derived from the same dataset.
Moreover, tasks that share similar definitions and evaluation criteria but come from different datasets are treated as separate tasks. Based on this understanding, the FLAME collection comprises a total of 102 distinct tasks.
In terms of our data collection principles, we prioritize using public open-source datasets to ensure transparency and reproducibility. We only utilize datasets that are permissively licensed from sources like Hugging Face datasets, TensorFlow datasets, or the original authors' GitHub repositories. We exclusively rely on datasets with human-labeled annotations, steering clear of those generated by models like GPT-4 due to potential inaccuracies and legal issues highlighted in recent research. To improve the generalizability of our models, we gather datasets from a wide variety of task types. These include pairwise evaluation (where we compare two responses to determine a preference), pointwise evaluation (which assesses specific attributes of individual responses), classification (where we categorize responses into predefined groups), and open-ended evaluation (which allows for unrestricted answers).
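To illustrate how these four task types can share one text-to-text convention, here is a rough Python sketch; the template wording and example content are hypothetical, not the actual FLAME templates.

```python
# Illustrative templates for the four task types; wording is hypothetical,
# but each follows the same text-to-text convention described above.
TEMPLATES = {
    "pairwise": ("Instructions: {instruction}\nPrompt: {prompt}\n"
                 "Response A: {response_a}\nResponse B: {response_b}\n"
                 "Which response is better, A or B?"),
    "pointwise": ("Instructions: {instruction}\nContext: {context}\n"
                  "Response: {response}\nRate the response on the given attribute."),
    "classification": ("Instructions: {instruction}\nDocument: {document}\n"
                       "Claim: {claim}\nIs the claim supported? Answer yes or no."),
    "open_ended": ("Instructions: {instruction}\nResponse: {response}\n"
                   "Describe the main quality issues, if any."),
}

pairwise_input = TEMPLATES["pairwise"].format(
    instruction="Judge overall helpfulness.",
    prompt="How do I reverse a list in Python?",
    response_a="Use reversed(my_list) or my_list[::-1].",
    response_b="You cannot reverse lists in Python.",
)
print(pairwise_input)  # the target would be the human preference, e.g. "A"
```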
In this section, we describe our approach to training general-purpose language model autoraters, which we refer to as FLAME. We begin with a baseline method of supervised multitask training, in which we train an instruction-tuned PaLM 2 24B model on a mixture of tasks for 30,000 training steps. To ensure a balanced representation of tasks, we use mixture weights proportional to the number of examples per task, with a cap of 65,536 examples per task to prevent oversampling from larger datasets.
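A minimal sketch of this examples-proportional weighting with a per-task cap is shown below; the task names and dataset sizes are made up for illustration.

```python
# Sketch of examples-proportional mixture weights with a per-task cap,
# as described above; task names and sizes are illustrative.
MAX_EXAMPLES_PER_TASK = 65_536

task_sizes = {"mt_quality": 250_000, "summary_coherence": 40_000, "hhh_pairwise": 120_000}

capped = {t: min(n, MAX_EXAMPLES_PER_TASK) for t, n in task_sizes.items()}
total = sum(capped.values())
mixture_weights = {t: n / total for t, n in capped.items()}

for task, w in mixture_weights.items():
    print(f"{task}: {w:.3f}")  # sampling probability during multitask training
```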
Our FLAME model shows a significant improvement in its ability to generalize across a variety of tasks that were not included in the training, outperforming models such as GPT-4, Claude 3, and LLaMA 3 on many of these tasks. This supports our idea that large-scale multitask instruction tuning can effectively provide the model with the ability to assess quality in a general sense. However, we also discover that this method is not the best for specialized applications, such as reward modeling evaluation, which leads us to develop targeted approaches for specific downstream tasks.
Next, we focus on fine-tuning FLAME for reward modeling evaluation, which we call FLAME IRM. Building on our earlier findings, we explore how FLAME can serve as a strong foundation for further fine-tuning on specific applications. For our case study, we fine-tune FLAME on a mixture of four pairwise evaluation datasets, HelpSteer, PRM800K, CommitPack, and HH-RLHF Harmlessness, mixed in equal proportions. Since FLAME has already been trained on these datasets, we only need to fine-tune it for 50 steps. The resulting FLAME IRM model shows a notable increase in overall RewardBench score, improving from 86.0% to 87.8% accuracy. Notably, FLAME IRM 24B becomes the top-performing generative model trained solely on permissively licensed data, surpassing both GPT-4 0125 and GPT-4o.
We then turn our attention to optimizing the FLAME multitask mixture for reward modeling evaluation, which we call FLAME OPTM. While our initial FLAME mixture performs well across various tasks, it requires extensive training to achieve strong results on certain specialized applications, such as RewardBench. We attribute this to suboptimal mixture weights that do not sample beneficial tasks often enough during training.
To tackle this issue, we introduce a new strategy called tail-patch ablation, which analyzes how each dataset impacts the targeted distributions. This allows us to determine the best proportions for each dataset in our multitask mixture, optimizing all mixing weights simultaneously. By fine-tuning the initial instruction-tuned PaLM 2 24B model on this optimized mixture for just 5,000 steps, we achieve competitive performance on RewardBench, reaching 87.0% accuracy while using roughly 25 times fewer training data points than our baseline FLAME approach. We note that our aim is not to achieve the highest possible RewardBench results, but to show how our multitask mixture can be optimized for specific distributions. We also observe that additional training or fine-tuning, as done for FLAME IRM, can further improve RewardBench performance, although we did not submit these FLAME OPT IRM results to the official leaderboard.
Furthermore, FLAME OPT IRM demonstrates strong performance across other held-out tasks, indicating that we have not overfitted to RewardBench and that our FLAME OPTM approach is broadly applicable to various tasks.
To determine the best tasks for our training mixture, we recognize that setting the right mixing weight for each task is challenging due to the large number of tasks involved. Therefore, we assess the impact of each task on the targeted distributions and use this information to assign weights. We start with a checkpoint that has been partially trained on our initial mixture, which shows decent but not optimal performance across RewardBench categories. We then conduct a brief fine-tuning phase, referred to as a tail patch, on each individual training task, limited to 3,000 training steps. The intuition is that tasks whose tail-patch training improves performance on the targeted distributions are beneficial and should be sampled more heavily. This process is done once for each downstream application and can be performed with smaller models to reduce computational costs.
After completing the tail-patch training, we rate how helpful each training task is for each RewardBench category. Based on these ratings, we group tasks into seven bundles, ranging from significantly helpful to harmful, and assign a fixed mixing weight to each bundle, with higher weights for generally helpful tasks and lower weights for the others. A task can belong to multiple bundles, and its final weight is the sum of the weights of all bundles it belongs to. For instance, if a task is generally helpful and also beneficial for two specific categories, its final weight is the sum of the general bundle weight and the two category-specific bundle weights. We also prioritize the two most helpful tasks for each of the three categories that performed poorly, assigning them a higher fixed weight. These weight values were chosen based on intuition and were not extensively tuned.
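The sketch below mirrors this additive bundle scheme in Python; the bundle names, weight values, and task assignments are hypothetical placeholders rather than the values we actually used.

```python
# Hypothetical bundle weights; the actual values and bundle names are not
# reproduced here, only the additive scheme described above.
BUNDLE_WEIGHTS = {
    "generally_helpful": 8.0,
    "helpful_for_chat": 2.0,
    "helpful_for_reasoning": 2.0,
    "helpful_for_safety": 2.0,
    "neutral": 1.0,
    "slightly_harmful": 0.5,
    "harmful": 0.0,
}

# Hypothetical bundle membership derived from tail-patch ratings.
task_bundles = {
    "helpsteer_helpfulness": ["generally_helpful", "helpful_for_chat"],
    "code_review_pairwise": ["helpful_for_reasoning"],
    "toxicity_classification": ["helpful_for_safety", "neutral"],
    "noisy_legacy_task": ["harmful"],
}

# A task's final weight is the sum of the weights of all bundles it belongs to.
final_weights = {
    task: sum(BUNDLE_WEIGHTS[b] for b in bundles)
    for task, bundles in task_bundles.items()
}
print(final_weights)
```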
Finally, we initialize FLAME OPTM with the instruction-tuned PaLM 2 24B and then fine-tune it using our reweighted multitask mixture.
In this section, we outline the training details for FLAME and FLAME OPTM, both of which we initialize from the instruction-tuned PaLM 2 24B model and train on the FLAME collection (with the optimized mixture weights in the case of FLAME OPTM). We train FLAME for 30,000 steps and FLAME OPTM for 5,000 steps; we then fine-tune FLAME for an additional 50 steps to obtain FLAME IRM. Our training process uses T5X with the Adam optimizer, a learning rate of 1e-5, and a dropout rate of 0.05. We employ 256 Cloud TPU chips with a batch size of 32 for FLAME, while FLAME IRM and FLAME OPTM use 128 Cloud TPU chips with a batch size of 8.
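For reference, these reported settings can be summarized in a small configuration sketch; this is only a restatement of the numbers above in Python form, not our actual T5X configuration files.

```python
# Summary of the reported training settings; not the actual T5X gin configs.
TRAIN_CONFIGS = {
    "FLAME":      {"init": "PaLM 2 24B (instruction-tuned)", "steps": 30_000,
                   "tpu_chips": 256, "batch_size": 32},
    "FLAME IRM":  {"init": "FLAME", "steps": 50,
                   "tpu_chips": 128, "batch_size": 8},
    "FLAME OPTM": {"init": "PaLM 2 24B (instruction-tuned)", "steps": 5_000,
                   "tpu_chips": 128, "batch_size": 8},
}
OPTIMIZER = {"name": "Adam", "learning_rate": 1e-5, "dropout": 0.05}
print(TRAIN_CONFIGS["FLAME"], OPTIMIZER)
```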
Next, we present our main experiments where we compare FLAME against several well-known LLM-as-a-judge autoraters. We use an evaluation suite that consists of 12 autorater benchmarks, including one held-in and 11 held-out benchmarks, which cover a total of 53 quality assessment tasks. Our findings indicate that our FLAME variants, which are trained solely on permissively licensed data, outperform LLMs trained on proprietary data such as GPT-4 and Claude 3 in 8 out of the 12 benchmarks.
To evaluate the general quality assessment capabilities of FLAME, we use a diverse set of held-in and held-out tasks. We format each task into the unified text-to-text format and prompt our models accordingly. For benchmarks with multiple categories, like RewardBench and LLM-AggreFact, we keep the prompt instructions consistent across categories. To manage model API costs, we randomly sample 256 examples for each evaluation task, except for RewardBench, where we use the complete evaluation sets.
For held-in evaluation, we assess FLAME's performance on helpfulness, correctness, coherence, complexity, and verbosity using the HelpSteer validation set. For held-out evaluation, we first consider RewardBench, a well-known benchmark for evaluating the capabilities and safety of reward models. In this benchmark, reward models must choose the better of two responses to a prompt; it includes four main categories, chat, chat hard, reasoning (math and coding), and safety, comprising 23 individual datasets.
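A small sketch of this pairwise evaluation protocol is given below; the judge function is a hypothetical stand-in for prompting an autorater, and a real evaluation would also swap the response order to control for position bias.

```python
# Sketch of per-category pairwise accuracy, with a stand-in judge function.
from collections import defaultdict
from typing import Callable

def pairwise_accuracy(examples: list[dict],
                      judge: Callable[[str, str, str], str]) -> dict:
    """examples: dicts with 'category', 'prompt', 'chosen', 'rejected'.
    judge returns 'A' or 'B' given (prompt, response_a, response_b)."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        # The human-preferred response is presented as option A for simplicity.
        pred = judge(ex["prompt"], ex["chosen"], ex["rejected"])
        correct[ex["category"]] += (pred == "A")
        total[ex["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy usage with a dummy judge that always prefers option A.
demo = [{"category": "chat", "prompt": "Hi", "chosen": "Hello!", "rejected": "Go away."}]
print(pairwise_accuracy(demo, judge=lambda p, a, b: "A"))  # {'chat': 1.0}
```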
We also evaluate FLAME on LLM-AggreFact, which measures the grounding capabilities of autoraters: given a reference document and a claim, the autorater must determine whether the claim is fully supported by the document. This comprehensive benchmark aggregates 10 attribution datasets used in recent studies of LLM factuality. In addition to RewardBench and LLM-AggreFact, we assess FLAME on a variety of other held-out benchmarks involving pairwise comparisons and pointwise evaluations. These include summary comparisons; helpful, honest, and harmless (HHH) alignment; AlpacaFarm; paraphrase evaluation; sequence continuation preference; poem preference; literary translation comparisons; long-form QA evaluation; and text continuation preference. Importantly, none of the tasks in these benchmarks were part of our training data.
For our baselines, we compare against several popular LLM-as-a-judge models, including LLaMA 3 70B Instruct, Mixtral 8x7B Instruct, Claude 3 Opus, GPT-3.5 Turbo 0125, GPT-4 0125, and GPT-4o. We also compare our results with several models on the official RewardBench leaderboard, notably Gemini 1.5 Pro, Prometheus 2 8x7B, NVIDIA's Nemotron-4-340B-Reward, and LLaMA 3 70B SteerLM-RM.
We evaluate all three of our FLAME variants (FLAME, FLAME IRM, and FLAME OPTM), as detailed in previous sections. Additionally, we include the initial instruction-tuned PaLM 2 24B checkpoint, which has not been trained on our FLAME data, to isolate the effect of FLAME training from that of instruction tuning. Finally, we present our main results across all evaluation benchmarks, with detailed results for RewardBench.