How to Train Your Vicuna - Finetuning, Evaluating, and Serving LLMs in the Wild - Hao Zhang

alpa.ai
lmsys.org
vllm.ai

Background: LLMs and Open Source LLMs

Topics

  • How we trained Vicuna
  • One core problem after Vicuna: Evaluation
  • Evaluation
    • LLM-as-a-judge
    • Chatbot Arena

Background

Meta Llama weights
Used ShareGPT to collect user-shared ChatGPT conversations (via its API) and fine-tune Vicuna
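This is not the exact Vicuna training recipe (which lives in the FastChat codebase); below is a minimal supervised fine-tuning sketch with Hugging Face transformers, assuming the ShareGPT conversations have already been flattened into plain text. The base-model identifier and the placeholder data are assumptions.

```python
# Minimal supervised fine-tuning sketch (NOT the official Vicuna recipe).
# Assumes conversations were already flattened into a list of strings.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import Dataset

base = "huggyllama/llama-7b"                 # assumed base-weight identifier
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token    # Llama tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

texts = ["USER: Hello\nASSISTANT: Hi! How can I help?"]  # placeholder data
ds = Dataset.from_dict({"text": texts}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vicuna-sft", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    # mlm=False -> labels are the input_ids, i.e. plain causal-LM training
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```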

Why is ShareGPT small yet effective?

ShareGPT data carries the user preferences implicit in conversations with ChatGPT, a model trained with RLHF (Reinforcement Learning from Human Feedback)

Llama + ShareGPT = Vicuna

Strongly preferred by users over the base models
Does not improve, and sometimes decreases, academic NLP metrics

"For better or worse, benchmarks shape a field" - David Patterson, Turing Award 2017

LLMs are extremely hard to evaluate

Unreliable: Benchmarks cannot capture human preference

  • Requires expertise to evaluate
  • Data contamination

Cost: Very expensive to evaluate

How to evaluate human preferences?

Use humans to rate chatbots

Human Evaluation

Chatbot answers are interpreted by humans, so it makes sense for humans to be the ultimate arbiters
Ideally, for every question we want to rank all LLMs
Ranking N choices is hard:

  • Easier to pick best of N
  • Even easier to pick best of two!

How to scale human evaluation?

Tournament of pairwise battles
Elo rating (a toy update is sketched below)

Chatbot Arena: a benchmark platform for LLMs that features anonymous, randomized battles in a crowdsourced manner
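As a rough illustration (not the production Chatbot Arena pipeline), crowdsourced pairwise votes can be turned into ratings with the standard Elo update rule. The K-factor, the initial rating, and the example battles below are assumptions.

```python
# Toy Elo updater for pairwise chatbot battles (illustrative only).
from collections import defaultdict

K = 32                                   # assumed K-factor
ratings = defaultdict(lambda: 1000.0)    # assumed initial rating

def expected(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(model_a, model_b, score_a):
    """score_a: 1 if A wins, 0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1 - score_a) - (1 - e_a))

# battles: (model_a, model_b, winner) tuples from crowdsourced votes
battles = [("vicuna-13b", "llama-13b", "model_a"),
           ("vicuna-13b", "alpaca-13b", "tie")]
for a, b, w in battles:
    update(a, b, {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[w])

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```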


How to evaluate it?

  • Human raters are slow and expensive
  • How about we use the strongest LLM to evaluate?

Evaluation: MT-Bench

Our initial method: LLM Exams and LLM-as-a-Judge

  • Let two chatbots generate responses to the same set of questions
  • Use GPT-4 to assess the model responses (a minimal judging sketch follows this list)
  • Simultaneously, let human raters assess the model responses
  • Study the alignment between GPT-4 and human raters
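A minimal sketch of the pairwise judging step, assuming the OpenAI v1 Python client; the prompt wording here is illustrative, not the exact MT-Bench judge prompt.

```python
# Pairwise LLM-as-a-judge sketch (illustrative prompt, not the exact
# MT-Bench judge prompt). Assumes the OpenAI v1 Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
assistant answers, decide which answer is better. Reply with exactly one of:
"A", "B", or "tie".

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 which of two chatbot answers is better."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content.strip()
```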

Can we really trust LLM as judge?

Limitations: not unlike humans

Position bias: prefers the answer in the first position (a swap-and-check mitigation is sketched after this list)
Verbosity bias: prefers longer answers
Self-enhancement bias: prefers answers generated by itself
Limited reasoning: not good at grading math questions
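One common way to blunt position bias is to judge both orderings and only accept consistent verdicts; everything else counts as a tie. A minimal sketch, where `judge_pair` is a hypothetical callable like the one above:

```python
# Position-bias mitigation sketch: judge both orderings, keep only
# consistent verdicts. `judge_pair(question, answer_a, answer_b)` is a
# hypothetical callable returning "A", "B", or "tie".
from typing import Callable

def judge_both_orders(question: str, ans_1: str, ans_2: str,
                      judge_pair: Callable[[str, str, str], str]) -> str:
    first = judge_pair(question, ans_1, ans_2)   # model 1 shown in position A
    second = judge_pair(question, ans_2, ans_1)  # positions swapped
    if first == "A" and second == "B":
        return "model_1"   # consistent win for model 1
    if first == "B" and second == "A":
        return "model_2"   # consistent win for model 2
    return "tie"           # inconsistent or tied -> treat as a tie
```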

High agreement despite limitations

Agreement between GPT-4 and humans is over 80%: the same as or better than human-human agreement
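For reference, an agreement rate can be computed as the fraction of battles on which two raters give the same verdict; a minimal sketch with made-up labels (the actual study's protocol may be more involved):

```python
# Simple agreement rate between two sets of pairwise verdicts (illustrative).
def agreement(votes_a, votes_b):
    """Fraction of battles on which both raters give the same verdict."""
    assert len(votes_a) == len(votes_b)
    matches = sum(1 for x, y in zip(votes_a, votes_b) if x == y)
    return matches / len(votes_a)

gpt4_votes  = ["model_a", "tie", "model_b", "model_a"]      # made-up labels
human_votes = ["model_a", "model_a", "model_b", "model_a"]  # made-up labels
print(f"agreement = {agreement(gpt4_votes, human_votes):.0%}")  # 75%
```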

Summary

  • LLM evaluation is extremely hard
  • Cracking this problem requires new techniques
    • LLMs as judges
    • Scalable human evaluation
  • Many challenges remain
    • Contamination: generating unique exam questions is difficult
    • Diversity: most questions are easy; need hard questions to differentiate between LLMs