Machine Learning

What’s instructive about Instruct Fine-Tuning: a weightwatcher analysis – calculated

  • Published February 4, 2025


Are you Fine-Tuning an open-source LLM like Llama, Mistral, or Qwen? That is, Instruct Fine-Tuning? Whether you are using SFT, DPO, or PPO, this post is for you. Why?

Weightwatcher can help you determine if the Fine-Tuning went well, or if something weird happened that you need to look into. And you don’t need expensive evals to do this.

WeightWatcher provides Data-Free Diagnostics for Deep Learning models

And it’s free. WeightWatcher is open-source:

pip install weightwatcher

Analyzing Fine-Tuned Models

In an earlier post, we learned what to do when Evaluating Fine-Tuned LLMs with WeightWatcher. I encourage you to review this post to get started.

Or, if you like, I can run the analysis for you using the up-and-coming weightwatcher-pro SAAS product and provide you with a detailed write-up. Here’s a screenshot of some of what you will get.

Let’s do a deep dive into a few common LLMs and see what we can learn from weightwatcher:

Fine-Tuned Instruct Updates vs the Base

We examine weightwatcher results on several Instruct Fine-Tuned (FT) models including cases from the popular open-source base models: Mistral, Llama3.1, and Qwen2.5.

Below we plot the weightwatcher layer quality metric alpha ( \alpha ) for every layer as a histogram. As explained in our groundbreaking HTSR theory paper, the model is predicted to perform best when the layer alphas lie in the safe zone, ( \alpha \in [2, 6] ).
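In code, bucketing the layer alphas from the `analyze()` details DataFrame into these zones is a few lines of pandas. The toy alpha values below are made up for illustration and stand in for a real model's:

```python
import pandas as pd

# Toy per-layer alphas standing in for details["alpha"] from watcher.analyze();
# real values come from the power-law fit of each layer's eigenvalue spectrum.
details = pd.DataFrame({"alpha": [1.8, 2.5, 3.1, 4.0, 5.5, 6.8, 9.2]})

overfit  = details[details.alpha < 2]            # alpha < 2: possibly overfit
safe     = details[details.alpha.between(2, 6)]  # safe zone, 2 <= alpha <= 6
underfit = details[details.alpha > 6]            # alpha > 6: underfit

print(len(overfit), len(safe), len(underfit))    # → 1 4 2
```

A well fine-tuned model should have nearly all of its layer alphas land in the `safe` bucket.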

Notice the base Mistral model has more underfit layers than the FT case, but both models still have many underfit layers. This is unusual for a large, Instruction Fine-Tuned model.
For comparison, here are the Instruction FT parts of some other popular models of the same size, including Mistral-7B-Instruct itself, Llama-3.1-8B-Instruct, and Qwen2.5-7B-Instruct.

Here, while the base models have many underfit layers, the Instruction Fine Tuned updates have almost all layer alphas in the safe/white zone. Exactly as predicted by the HTSR theory.

Correlation Flow:

We now look at how the layer alphas vary from layer to layer as data passes through the model, from left to right for each architecture. This is called a Correlation Flow plot, and is described in the weightwatcher 2021 Nature Communications paper. As shown, most architectures (VGG, ResNet, etc.) show a distinct pattern that represents how correlations (i.e., information) flow in the model, from the data to the labels. Examining the Correlation Flow is critical to understanding whether a model architecture is likely to converge well in every layer.

Here we show Correlation Flow plots for Mistral-7B-v0.2, and for the Llama3.1-8B and 70B Instruct components. Notice that all 3 plots look similar: there are a few undertrained layers near the left (closer to the data), but most cluster towards the right (farther from the data). This is typical; correlations in the data flow through the layers from the data to the labels, but sometimes the information just does not make it all the way through.
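Stripped down, a Correlation Flow plot is just alpha vs. layer depth. The sketch below (with made-up alphas standing in for a real `details` DataFrame) checks programmatically whether the underfit layers sit in the back half of the network, nearer the labels:

```python
import pandas as pd

# Toy stand-in for details[["layer_id", "alpha"]] from watcher.analyze();
# a Correlation Flow plot reads left (data) to right (labels) by layer_id.
details = pd.DataFrame({
    "layer_id": range(8),
    "alpha":    [3.0, 2.8, 3.5, 4.2, 5.0, 6.5, 7.1, 8.0],
})

underfit = details[details.alpha > 6]
# do the underfit layers sit in the back half, closer to the labels?
in_back_half = underfit.layer_id >= details.layer_id.max() / 2
print(in_back_half.all())  # → True
```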

Notice that the underfit layers ( \alpha > 6 ) lie mostly towards the right, closer to the labels. This is common.

alpha vs. alpha:

Let us now compare the base model alphas to those of the fine-tuned updates. Here we examine the alpha-vs-alpha plot for the Llama3.1-70B-Instruct model. The x-axis is the base model Llama-70B, and the y-axis is the Instruct fine-tuned part.

In models like the Instruct-Fine-Tuned models shown above (Mistral, Llama, Qwen, etc), even when the base model layers are weakly trained, they can still be fine-tuned. And when comparing alpha-vs-alpha, we find:

Llama3.1-70B-Instruct

Lessons Learned:

  • The smaller the base-model alpha, the smaller the fine-tuned alpha
  • If the base model layer alpha ~ 2, this could lead to an FT alpha < 2, suggesting the layer is overfitting
  • Even when the base alpha ~15, the FT alphas are between 2 and 6.
  • These patterns are common among all large FT instruct models, so far.
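These checks are easy to script. Given two `analyze()` runs, one on the base model and one on the Instruct model, merge the details on `layer_id` and flag the risky layers. The alphas below are toy numbers chosen to mirror the lessons above:

```python
import pandas as pd

# Toy base vs. fine-tuned alphas (in practice: two watcher.analyze() runs,
# one per model, merged on layer_id).
base = pd.DataFrame({"layer_id": [0, 1, 2, 3], "alpha": [2.1, 3.0, 8.0, 15.0]})
ft   = pd.DataFrame({"layer_id": [0, 1, 2, 3], "alpha": [1.9, 2.7, 4.5, 5.8]})

both = base.merge(ft, on="layer_id", suffixes=("_base", "_ft"))

# smaller base alpha tends to mean smaller FT alpha (positive correlation)
print(both.alpha_base.corr(both.alpha_ft))

# flag layers whose FT alpha dropped below 2 (possible overfitting)
print(both[both.alpha_ft < 2].layer_id.tolist())  # → [0]
```

Note how the base layer with alpha ~ 2 is the one that slips below 2 after fine-tuning, while even the base alpha ~ 15 layer lands back in the safe zone.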

What about counterexamples?

There are some FT models where these results do not appear to hold: the smaller Llama3.2 1B and 3B models (but not the 90B variant!), and the recently developed Polish-language Bielik models. In a future post, we will look into these counterexamples, try to understand why this happens, and show what you can do to diagnose and remediate such issues.

Conclusions

Fine-Tuning LLMs is hard. WeightWatcher can help. WeightWatcher can tell you if your Instruct Fine-Tuned model is working as expected, or if something odd happened during training, something that cannot be detected with expensive evaluations or other known methods.

WeightWatcher is a one-of-a-kind must-have tool for anyone training, deploying, or monitoring Deep Neural Networks (DNNs).

WeightWatcher is an open-source, data-free diagnostic tool for analyzing Deep Neural Networks (DNNs), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, using the new Theory of Heavy-Tailed Self-Regularization (HT-SR), published in JMLR, Nature Communications, and NeurIPS2023.

I invented WeightWatcher to help my clients who are training and/or fine-tuning their own AI models. Need help with AI ? Reach out today. #talkToChuck #theAIguy


