How to Train Your Vicuna - Finetuning, Evaluating, and Serving LLMs in the Wild - Hao Zhang

alpa.ai
lmsys.org
vllm.ai

Background: LLMs and Open Source LLMs

Topics

  • How we trained Vicuna
  • One core problem after Vicuna: Evaluation
  • Evaluation
    • LLM-as-a-judge
    • Chatbot Arena

Background

Meta Llama weights
Used ShareGPT to collect user-shared ChatGPT conversations (via its API) and fine-tune Vicuna
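This is not the exact Vicuna training recipe (which lives in the FastChat codebase); below is a minimal supervised fine-tuning sketch with Hugging Face transformers, assuming the ShareGPT conversations have already been flattened into plain text. The base-model identifier and the placeholder data are assumptions.

```python
# Minimal supervised fine-tuning sketch (NOT the official Vicuna recipe).
# Assumes conversations were already flattened into a list of strings.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import Dataset

base = "huggyllama/llama-7b"                 # assumed base-weight identifier
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token    # Llama tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

texts = ["USER: Hello\nASSISTANT: Hi! How can I help?"]  # placeholder data
ds = Dataset.from_dict({"text": texts}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vicuna-sft", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    # mlm=False -> labels are the input_ids, i.e. plain causal-LM training
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```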

Why is ShareGPT small yet effective?

ShareGPT data carries the user preferences implicit in conversations with ChatGPT, a model trained with RLHF (Reinforcement Learning from Human Feedback)

Llama + ShareGPT = Vicuna

Strongly preferred by users over the base models
Does not improve, and sometimes decreases, academic NLP metrics

"For better or worse, benchmarks shape a field" - David Patterson, Turing Award 2017

LLMs are extremely hard to evaluate

Unreliable: Benchmarks cannot capture human preference

  • Requires expertise to evaluate
  • Data contamination

Cost: Very expensive to evaluate

How to evaluate human preferences?

Use humans to rate chatbots

Human Evaluation

Chatbot answers are interpreted by humans, so it makes sense for humans to be the ultimate arbiters
Ideally, for every question we want to rank all LLMs
Ranking N choices is hard:

  • Easier to pick best of N
  • Even easier to pick best of two!

How to scale human evaluation?

Tournament of pairwise battles
Elo rating (a toy update is sketched below)

Chatbot Arena: a benchmark platform for LLMs that features anonymous, randomized battles in a crowdsourced manner
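As a rough illustration (not the production Chatbot Arena pipeline), crowdsourced pairwise votes can be turned into ratings with the standard Elo update rule. The K-factor, the initial rating, and the example battles below are assumptions.

```python
# Toy Elo updater for pairwise chatbot battles (illustrative only).
from collections import defaultdict

K = 32                                   # assumed K-factor
ratings = defaultdict(lambda: 1000.0)    # assumed initial rating

def expected(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(model_a, model_b, score_a):
    """score_a: 1 if A wins, 0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1 - score_a) - (1 - e_a))

# battles: (model_a, model_b, winner) tuples from crowdsourced votes
battles = [("vicuna-13b", "llama-13b", "model_a"),
           ("vicuna-13b", "alpaca-13b", "tie")]
for a, b, w in battles:
    update(a, b, {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[w])

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```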


How to evaluate it?

  • Human raters are slow and expensive
  • How about we use the strongest LLM to evaluate?

Evaluation: MT-Bench

Our initial method: LLM Exams and LLM-as-a-Judge

  • Let two chatbots generate responses to the same set of questions
  • Use GPT-4 to assess the model responses (a minimal judging sketch follows this list)
  • Simultaneously, let human raters assess the model responses
  • Study the alignment between GPT-4 and human raters
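A minimal sketch of the pairwise judging step, assuming the OpenAI v1 Python client; the prompt wording here is illustrative, not the exact MT-Bench judge prompt.

```python
# Pairwise LLM-as-a-judge sketch (illustrative prompt, not the exact
# MT-Bench judge prompt). Assumes the OpenAI v1 Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
assistant answers, decide which answer is better. Reply with exactly one of:
"A", "B", or "tie".

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 which of two chatbot answers is better."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content.strip()
```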

Can we really trust LLM as judge?

Limitations: not unlike humans

Position bias: prefers the answer in the first position (a swap-and-check mitigation is sketched after this list)
Verbosity bias: prefers longer answers
Self-enhancement bias: prefers answers generated by itself
Limited reasoning: not good at grading math questions
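One common way to blunt position bias is to judge both orderings and only accept consistent verdicts; everything else counts as a tie. A minimal sketch, where `judge_pair` is a hypothetical callable like the one above:

```python
# Position-bias mitigation sketch: judge both orderings, keep only
# consistent verdicts. `judge_pair(question, answer_a, answer_b)` is a
# hypothetical callable returning "A", "B", or "tie".
from typing import Callable

def judge_both_orders(question: str, ans_1: str, ans_2: str,
                      judge_pair: Callable[[str, str, str], str]) -> str:
    first = judge_pair(question, ans_1, ans_2)   # model 1 shown in position A
    second = judge_pair(question, ans_2, ans_1)  # positions swapped
    if first == "A" and second == "B":
        return "model_1"   # consistent win for model 1
    if first == "B" and second == "A":
        return "model_2"   # consistent win for model 2
    return "tie"           # inconsistent or tied -> treat as a tie
```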

High agreement despite limitations

Agreement between GPT-4 and humans is over 80%: the same as or better than human-human agreement
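For reference, an agreement rate can be computed as the fraction of battles on which two raters give the same verdict; a minimal sketch with made-up labels (the actual study's protocol may be more involved):

```python
# Simple agreement rate between two sets of pairwise verdicts (illustrative).
def agreement(votes_a, votes_b):
    """Fraction of battles on which both raters give the same verdict."""
    assert len(votes_a) == len(votes_b)
    matches = sum(1 for x, y in zip(votes_a, votes_b) if x == y)
    return matches / len(votes_a)

gpt4_votes  = ["model_a", "tie", "model_b", "model_a"]      # made-up labels
human_votes = ["model_a", "model_a", "model_b", "model_a"]  # made-up labels
print(f"agreement = {agreement(gpt4_votes, human_votes):.0%}")  # 75%
```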

Summary

  • LLM evaluation is extremely hard
  • Cracking this problem requires new techniques
    • LLMs as judges
    • Scalable human evaluation
  • Many challenges remain
    • Contamination: generating unique exam questions is difficult
    • Diversity: most questions are easy; need hard questions to differentiate between LLMs