Learning How to Fine-Tune LLMs, Using the Llama Model as an Example

2025-04-17T13:23:24+08:00 | 9 min read | Updated 2025-04-17T13:23:24+08:00

Macro Zhao

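This document takes an in-depth look at fine-tuning the Llama 3.1 model with the Unsloth library, focusing on Low-Rank Adaptation (LoRA), one of the parameter-efficient fine-tuning (PEFT) techniques. Unsloth provides models quantized to 4-bit precision, which makes them considerably more memory-efficient. We will use the 'unsloth/Meta-Llama-3.1-8B-bnb-4bit' model and the 'mlabonne/FineTome-100k' dataset. By the end, you will understand how to fine-tune a model effectively with limited resources.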

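Background

Before moving on to the implementation, it is essential to understand the key concepts behind large language models and the fine-tuning process.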

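Pretraining

Pretraining is the initial stage in which a large language model (LLM) is trained on a massive dataset to acquire general knowledge. This process is computationally expensive and requires a great deal of GPU resources and time. Once pretraining is complete, we obtain a base model, which serves as the starting point for later fine-tuning that updates the model's knowledge in a specific domain.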

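Fine-tuning

Fine-tuning trains a pretrained model on a new dataset so that it specializes in a particular domain or task. During pretraining the model learns from a wide variety of data, which may not provide deep knowledge of any specific topic. As a result, the base model may respond poorly to some user prompts. Fine-tuning on a specialized dataset lets the model produce more accurate and relevant responses, turning the base model into an "instruction model". In addition, to make the model handle conversational tasks, it can be fine-tuned further on a dataset of dialogues between an AI and a user, producing a "chat-based model".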

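Full fine-tuning

One way to fine-tune is to retrain the entire pretrained model on the new dataset, updating all of its parameters. However, this approach is expensive, slow, and risky, because it can cause the model to lose previously learned knowledge. What we need is a way to acquire new knowledge while preserving the pretrained parameters. This is where PEFT (parameter-efficient fine-tuning) comes in.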

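Parameter-efficient fine-tuning

PEFT is a fine-tuning approach intended for settings with limited compute, especially when working with large pretrained models. In PEFT, only a small subset of parameters is adjusted while most of the pretrained model's parameters stay frozen. This significantly lowers the compute requirements while keeping performance on the target task competitive.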

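Low-Rank Adaptation (LoRA)

LoRA is a specific PEFT technique that introduces two new, smaller matrices into existing layers of the pretrained model. Instead of training a full-rank update to a weight matrix, LoRA trains two small matrices whose product approximates the adjustment the model needs. During fine-tuning only these two matrices are updated, while the pretrained parameters stay frozen. This makes the method particularly well suited to fine-tuning large language models.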
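The idea above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the code Unsloth or PEFT actually runs: the helper names `matmul` and `lora_weight`, the toy sizes, and the zero initialisation of `b` are chosen for the example, and the `alpha / r` scaling follows the usual LoRA convention.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A, the effective weight with the adapter applied."""
    delta = matmul(b, a)          # d x k low-rank update
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

random.seed(0)
d, k, r, alpha = 6, 4, 2, 4       # toy sizes (assumed): W is d x k, adapter rank is r

w = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen pretrained weight
a = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
b = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, starts at zero

# With B initialised to zero, the adapter changes nothing before training starts.
unchanged = lora_weight(w, a, b, alpha, r) == w

# Parameter counts for a single 4096 x 4096 projection with rank 8 (sizes assumed).
full_params_4096 = 4096 * 4096
lora_params_4096 = 8 * (4096 + 4096)
print(unchanged, full_params_4096, lora_params_4096)
```

For a matrix of that size, the adapter trains 65,536 values instead of roughly 16.8 million, which is where the memory saving comes from.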

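Quantized LoRA (QLoRA)

QLoRA extends the efficiency of LoRA by adding quantization, which shrinks the size of the model's parameters. Model parameters are commonly stored as 32-bit or 16-bit floating-point values. Quantization reduces them to 8-bit or 4-bit values, lowering the model's size and compute requirements. In this implementation we will use the 4-bit quantized model provided by the Unsloth team.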
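A quick back-of-the-envelope calculation shows why the bit width matters. The sketch below assumes a round figure of 8 billion parameters and counts the weights only; it ignores quantization constants, activations, gradients, and optimizer state, so real memory use will be higher. The function name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # assumed round figure for an "8B" model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions the weights shrink from about 32 GB at 32-bit to about 4 GB at 4-bit, which is what lets an 8B model fit on a single Colab GPU.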

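The Unsloth library

Unsloth is an open-source platform designed specifically for efficient fine-tuning of large language models (LLMs). It focuses on making fine-tuning faster and less memory-hungry, which is especially valuable in resource-constrained environments. At the time of writing, Unsloth is optimized for single-GPU setups. We will use it on Google Colab, which provides access to GPUs such as the NVIDIA T4. Users should be aware of Colab's limitations, however, such as session time limits and resource availability.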

Installation

The library can be installed directly from its GitHub repository with the following command.

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Unsloth needs the Xformers library for memory-efficient training, and the Xformers version must match the installed PyTorch version:

from torch import __version__
from packaging.version import Version as V

# Pin xformers for older PyTorch builds; otherwise take the latest release.
if V(__version__) < V("2.4.0"):
    xformers = "xformers==0.0.27"
else:
    xformers = "xformers"

Unsloth also depends on several other libraries, which can be installed with the following command:

!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

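trl: This library combines reinforcement learning (RL) techniques with Transformer-based models. You may wonder why an RL library is involved when our main fine-tuning method is supervised learning. One practical reason is that it also provides the SFTTrainer class used below for supervised fine-tuning. More broadly, standard supervised learning may not be enough to fully align a model with specific goals such as output quality (for example coherence and creativity). Reinforcement learning lets the model learn from rewards and optimize those qualities beyond what labeled data alone can achieve.
accelerate: This library simplifies training and deploying models across different hardware configurations, such as single-GPU, multi-GPU, or TPU setups. It abstracts away the complexity of hardware management so that we can focus on the model and training code rather than the details of distributed computing.
bitsandbytes: This library makes it possible to train large models in low-bit precision, specifically 8-bit and 4-bit. This reduces memory use and can speed up training.
triton: A deep learning compiler library that allows highly optimized GPU kernels to be written in Python.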

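Implementation

In this section we walk step by step through fine-tuning the Llama 3.1 model with LoRA adapters, covering setup, data preparation, model training, and testing.

Setting up dependencies

To proceed with fine-tuning, you need the following imports: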

import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

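Loading the model and tokenizer

Unsloth offers a range of base and instruction-tuned models, in both 4-bit quantized and standard formats. For memory-efficient fine-tuning we will use the 4-bit version of the Llama 3.1 model: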

model, tokenizer = FastLanguageModel.from_pretrained(  
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",  
    max_seq_length = 1024,  
    dtype = None,  
    load_in_4bit = True  
)

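FastLanguageModel.from_pretrained() loads the pretrained model together with its tokenizer.
unsloth/Meta-Llama-3.1-8B-bnb-4bit specifies the 4-bit quantized version of Llama 3.1 provided by Unsloth. 8B means the model has 8 billion parameters.
max_seq_length is set to 1024 here, so at most 1024 tokens can be fed into the model.
dtype specifies the data type of the model's tensors, for example torch.float32. Setting it to None lets Unsloth pick a suitable type automatically (float16 on a T4, bfloat16 on GPUs that support it).
load_in_4bit is set to True, which means the model is loaded in 4-bit precision. This is a memory-efficient way to load a large language model.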

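Data preparation

We will load the fine-tuning dataset from Hugging Face and prepare it for training.

The get_chat_template() function modifies the tokenizer so that it can handle chat-style input. This lets the tokenizer structure the input data in a way that reflects a conversation between a human and an AI.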

tokenizer = get_chat_template(  
    tokenizer,  
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},  
    chat_template="chatml",  
)  
  
def apply_template(examples):  
    messages = examples["conversations"]  
    text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]  
    return {"text": text}  
  
dataset = load_dataset("mlabonne/FineTome-100k", split="train")  
dataset = dataset.map(apply_template, batched=True)

The mapping tells the tokenizer how to interpret the different parts of a conversation. Sample data:

[
   {"from": "human", "value": "What's your name?"},
   {"from": "gpt", "value": "I'm Daniel!"},
   {"from": "human", "value": "Ok! Nice!"},
   {"from": "gpt", "value": "What can I do for you?"},
   {"from": "human", "value": "Oh nothing :)"},
]

We preprocess the whole dataset, converting each conversation into the format required for training.
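To make the conversion concrete, here is a plain-Python approximation of what a ChatML layout looks like for a conversation in the from/value format. It is only a sketch: `ROLE_MAP` and `to_chatml` are hypothetical names for this example, and the text that tokenizer.apply_chat_template actually produces may differ in details such as special tokens or a system message.

```python
# Mirrors the role mapping passed to get_chat_template (human -> user, gpt -> assistant).
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def to_chatml(conversation):
    """Render a from/value conversation as ChatML-style text."""
    parts = []
    for turn in conversation:
        role = ROLE_MAP.get(turn["from"], turn["from"])
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>\n")
    return "".join(parts)

conversation = [
    {"from": "human", "value": "What's your name?"},
    {"from": "gpt", "value": "I'm Daniel!"},
]

print(to_chatml(conversation))
```

Each turn is wrapped in <|im_start|> and <|im_end|> markers with the role on the first line, so the model can tell who is speaking.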

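Model performance before fine-tuning

Let's evaluate how the pretrained model performs. We use a template to build the input, in which the instruction and the input are placed into dedicated {} slots in the prompt.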

prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.  
  
### Instruction:  
{}  
  
### Input:  
{}  
  
### Response:  
{}"""  
  
FastLanguageModel.for_inference(model)   
inputs = tokenizer(  
[  
    prompt.format(  
        "answer for this question", # instruction  
        "Is 9.11 larger than 9.9?", # input  
        "", # output—leave this blank for generation!  
    )  
], return_tensors = "pt").to("cuda")  
  
  
text_streamer = TextStreamer(tokenizer)  
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

This is the pretrained model's response.

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.  
  
### Instruction:  
answer for this question  
  
### Input:  
Is 9.11 larger than 9.9?  
  
### Response:  
yes  
<|end_of_text|>

As you can see, the model gives a wrong answer to this question. Let's check one more question.

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.  
  
### Instruction:  
{}  
  
### Input:  
{}  
  
### Response:  
{}"""  
  
# alpaca_prompt = Copied from above  
FastLanguageModel.for_inference(model) # Enable native 2x faster inference  
inputs = tokenizer(  
[  
    alpaca_prompt.format(  
        "answer for this question", # instruction  
        "How is it possible for a black hole to emit X-rays if its gravitational pull is strong enough to prevent light, including X-rays, from escaping?.", # input  
        "", # output—leave this blank for generation!  
    )  
], return_tensors = "pt").to("cuda")  
  
from transformers import TextStreamer  
text_streamer = TextStreamer(tokenizer)  
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

This is the response to that question.

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.  
  
### Instruction:  
answer for this question  
  
### Input:  
How is it possible for a black hole to emit X-rays if its gravitational pull is strong enough to prevent light, including X-rays, from escaping?.  
  
### Response:  
The X-rays are not coming from the black hole itself, but from the surrounding gas that is being heated by the black hole. The X-rays are produced when the gas is heated to high temperatures by the intense gravitational pull of the black hole.<|end_of_text|>

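Overall, the pretrained model does not handle pointed questions like these especially well; the first answer above is plainly wrong. To address this, we will fine-tune it on new data. Let's start the fine-tuning process!

Model with LoRA adapters

Unsloth lets us apply LoRA adapters to the model for efficient fine-tuning: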

model = FastLanguageModel.get_peft_model(
    model,
    r = 8,  # Any number greater than 0; suggested values are 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,  # Any value is supported, but 0 is optimized
    bias = "none",     # Any value is supported, but "none" is optimized
    # "unsloth" uses 30% less VRAM and fits 2x larger batch sizes
    use_gradient_checkpointing = "unsloth",  # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # Rank-stabilized LoRA is also supported
    loftq_config = None,  # As is LoftQ
)

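The get_peft_model() function modifies the original pretrained model to include the PEFT technique, which here means adding LoRA adapters to the model's layers. For example, the two small low-rank matrices are added to the multi-head attention projections and to the feed-forward network.
The rank r of each matrix is set to 8. A higher rank increases the model's capacity to adapt, at the cost of more memory and compute.
target_modules specifies which layers of the model LoRA is applied to. These usually include the projection layers and other key components.
lora_alpha controls how strongly the low-rank update affects the pretrained weights. After the adapters are added, Unsloth reports the following:

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.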
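As a sanity check on how small the adapters are, the number of trainable parameters can be worked out by hand. The sketch below assumes the usual Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, and 8 key/value heads of 128 dimensions each); these figures and the helper name `lora_param_count` are assumptions of this example, not values read from the loaded model.

```python
HIDDEN = 4096          # model width (assumed)
INTERMEDIATE = 14336   # MLP width (assumed)
KV_DIM = 1024          # 8 key/value heads x 128 dims (assumed, grouped-query attention)
NUM_LAYERS = 32
RANK = 8

# (input features, output features) of each linear layer that receives an adapter
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(in_features, out_features, rank):
    """Parameters in one adapter: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)

per_layer = sum(lora_param_count(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total comes to 20,971,520, which agrees with the "Number of trainable parameters" line in the training log further down, and is well under 1% of the model's 8 billion parameters.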

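Model training

We are now ready to start fine-tuning. With max_steps = 60, training runs for 60 optimizer steps, which covers only a small part of one epoch over the dataset: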

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 1024,
    dataset_num_proc = 2,
    packing = False,  # Packing can make training up to 5x faster for short sequences
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1,  # Set this instead for one full pass over the data
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer_stats = trainer.train()

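gradient_accumulation_steps is set to 4, so the weights are not updated after every batch. Gradients are accumulated over 4 batches before each update.
warmup_steps raises the learning rate gradually over the first 5 steps of training. Once it reaches its peak (2e-4 here), the linear scheduler lowers it again over the remaining steps.
To guard against overfitting, a small amount of weight decay is applied to the model weights.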
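The arithmetic behind these settings can be sketched as follows. This is an illustration of a linear warmup followed by linear decay, not the scheduler code that the trainer runs; the function name `linear_schedule_lr` is made up for the example, the decay is assumed to end at zero, and the dataset size of 100,000 is taken from the training log below.

```python
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
MAX_STEPS = 60
WARMUP_STEPS = 5
PEAK_LR = 2e-4
NUM_EXAMPLES = 100_000

def linear_schedule_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=MAX_STEPS):
    """Linear warmup to peak_lr, then linear decay to zero at `total` steps."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM   # examples per weight update on one GPU
examples_seen = effective_batch * MAX_STEPS       # examples used in the whole run
epoch_fraction = examples_seen / NUM_EXAMPLES

print(effective_batch, examples_seen, f"{epoch_fraction:.2%}")
for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

With these values each update uses 8 examples and the whole run sees 480 of them, so 60 steps is a quick demonstration rather than a full training run.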

The training output looks like this.

==((====))==  Unsloth: 2x faster free finetuning | Num GPUs = 1  
      /|    Num examples = 100,000 | Num Epochs = 1  
O^O/ _/     Batch size per device = 2 | Gradient Accumulation steps = 4  
        /    Total batch size = 8 | Total steps = 60  
 "-____-"     Number of trainable parameters = 20,971,520  
 [60/60 12:51, Epoch 0/1]  
Step Training Loss  
1 2.120500  
2 1.891200  
3 2.070200  
4 2.303700  
5 2.336400  
6 2.327600  
7 2.311000  
...
...
56 2.215800
57 2.476000
58 2.139000
59 1.921000
60 2.199500

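Model performance after fine-tuning

As before, let's test how the fine-tuned model performs. I used the same question as earlier; since the model has now been fine-tuned on questions of this kind, it produced the correct response.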

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.  
  
### Instruction:  
{}  
  
### Input:  
{}  
  
### Response:  
{}"""  
  
# alpaca_prompt = Copied from above  
FastLanguageModel.for_inference(model) # Enable native 2x faster inference  
inputs = tokenizer(  
[  
    alpaca_prompt.format(  
        "answer for this question", # instruction  
        "Is 9.11 larger than 9.9?", # input
        "", # output—leave this blank for generation!
    )  
], return_tensors = "pt").to("cuda")  
  
from transformers import TextStreamer  
text_streamer = TextStreamer(tokenizer)  
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Here the fine-tuned model correctly states that 9.9 is larger than 9.11.

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.  
  
### Instruction:  
answer for this question  
  
### Input:  
Is 9.11 larger than 9.9?  
  
### Response:  
No, 9.9 is larger than 9.11  
<|im_end|>

Saving the LoRA adapters locally

After fine-tuning, you can save the model with its adapters and the tokenizer locally as follows:

model.save_pretrained("llama-lora_model") # Local saving  
tokenizer.save_pretrained("llama-lora_model")

Saving the LoRA adapters to Hugging Face

If you want to save them to your Hugging Face account, follow the steps below. Make sure to use a token from your own Hugging Face account:

token = "<your Hugging Face token>"  # Replace with a token from your account
tokenizer.push_to_hub("priyanthan/FineTune-Llama-3.1-8B", token = token) # Online saving
model.push_to_hub("priyanthan/FineTune-Llama-3.1-8B", token = token)

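Conclusion

In this blog post we explored the concept of a pretrained base model and how fine-tuning turns it into an instruction model or a chat-based model. We discussed the challenges of fine-tuning large language models with limited resources and showed how the Unsloth library simplifies the process. Specifically, we demonstrated how to fine-tune the Llama 3.1 model with Unsloth in Google Colab. We also compared the model's performance before and after fine-tuning and saw the fine-tuned model answer the test question correctly. Finally, we saved the fine-tuned model and its tokenizer both locally and to the Hugging Face Hub.

PS: If you cannot use Google Colab, make sure your local machine has a GPU or TPU available. After all, training a model takes a long time.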
