Chengshuo Dai

Full fine-tuning of large language models (LLMs) requires updating billions of parameters, demanding immense computational resources and memory. Parameter-Efficient Fine-Tuning (PEFT) techniques address this bottleneck by freezing the pre-trained model weights and training only a small number of task-specific parameters. Among these techniques, Low-Rank Adaptation (LoRA) has become one of the most widely adopted.

LoRA operates on the hypothesis that the change in weights during fine-tuning has a low "intrinsic rank." Instead of updating a large weight matrix W directly, LoRA freezes W and injects two smaller trainable matrices, A and B, alongside it in the transformer architecture. If the original weight matrix has dimensions d x k, matrix B has dimensions d x r and matrix A has dimensions r x k, where the rank r is much smaller than d and k (e.g., 8 or 16), so the product BA has the same d x k shape as W. The forward pass is then computed as h = Wx + BAx. During training, only A and B receive gradient updates. For GPT-3 175B, the LoRA authors report up to 10,000 times fewer trainable parameters and roughly 3 times lower GPU memory requirements, because gradients and optimizer states are stored only for the small matrices.
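The shapes and the forward pass h = Wx + BAx can be illustrated with a minimal pure-Python sketch. The helper names (matvec, lora_forward, trainable_params) and the toy sizes are assumptions made for illustration. The initialization (small random A, all-zero B) follows the LoRA paper rather than anything stated above, and the scaling factor that practical implementations usually apply to BA is omitted.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x):
    """Compute h = Wx + B(Ax) without forming the d x k product BA."""
    base = matvec(W, x)               # frozen path: k -> d
    update = matvec(B, matvec(A, x))  # adapter path: k -> r -> d
    return [b + u for b, u in zip(base, update)]

rng = random.Random(0)
d, k, r = 6, 4, 2  # toy sizes (assumption); real layers are in the thousands

W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, zero init
x = [rng.gauss(0, 1) for _ in range(k)]

h = lora_forward(W, A, B, x)
starts_at_base = (h == matvec(W, x))  # True while B is still all zeros

def trainable_params(d, k, r):
    """Return (full fine-tuning count, LoRA count) for one d x k matrix."""
    return d * k, r * (d + k)

full, lora = trainable_params(4096, 4096, 8)
reduction = full / lora  # 256.0 for this single matrix
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training moves it away from that starting point. The 256x figure applies to one 4096 x 4096 matrix at r = 8; the overall reduction for a model depends on which matrices receive adapters.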

QLoRA (Quantized LoRA) extends this efficiency even further. It introduces 4-bit NormalFloat (NF4) quantization, a data type that is information-theoretically optimal for normally distributed weights. In QLoRA, the base model is stored in 4-bit precision, drastically reducing the memory footprint. To cut memory use further, QLoRA employs Double Quantization (quantizing the quantization constants themselves) and Paged Optimizers (using unified memory to absorb GPU memory spikes by paging optimizer states to CPU memory). The LoRA adapters remain in 16-bit precision (bfloat16) and are trained on top of the frozen 4-bit base model, whose weights are dequantized to bfloat16 on the fly for computation.
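The memory effect of 4-bit storage and Double Quantization can be worked through with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the block sizes (64 weights per first-level block, 256 constants per second-level block) and constant widths (32-bit, then 8-bit) are taken from the QLoRA paper's description, not from the text above. It counts weight storage only and ignores activations, adapters, and optimizer states.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Average extra bits per parameter spent on quantization constants."""
    if not double_quant:
        return 32 / block_size  # one 32-bit constant per block of weights
    # 8-bit constants per block, plus one 32-bit constant per group of blocks
    return 8 / block_size + 32 / (block_size * second_block_size)

n_params = 65e9  # the 65B model size mentioned in the text

bf16_gb = weight_memory_gb(n_params, 16)  # 130.0 GB in 16-bit precision
nf4_gb = weight_memory_gb(n_params, 4)    # 32.5 GB in 4-bit precision

plain_bits = constant_overhead_bits()                # 0.5 bits per parameter
dq_bits = constant_overhead_bits(double_quant=True)  # about 0.127 bits per parameter

# Memory saved on a 65B model by quantizing the constants a second time
saved_gb = weight_memory_gb(n_params, plain_bits - dq_bits)  # about 3.03 GB
```

Under these assumptions, the 4-bit weights plus their doubly quantized constants come to roughly 33.5 GB, which leaves headroom on a 48 GB device for the adapters, activations, and optimizer states.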

This combination allows a 65B-parameter model to be fine-tuned on a single 48 GB GPU, democratizing access to LLM customization. Once training is complete, the LoRA update can be merged back into the base model weights (W' = W + BA) for inference, so the merged model incurs no additional latency compared to the original model.
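The merge step can be checked numerically: folding BA into W once should give the same output as running the adapter path on every call. The following is a minimal pure-Python sketch with hypothetical helper names (matvec, matmul, merge_lora) and toy sizes. It assumes unquantized weights and omits the scaling factor that practical implementations usually apply to BA.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def merge_lora(W, A, B):
    """Return W' = W + BA as a new d x k matrix."""
    BA = matmul(B, A)  # (d x r) times (r x k) gives d x k
    return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]

rng = random.Random(1)
d, k, r = 5, 3, 2  # toy sizes (assumption)

W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # frozen base, d x k
A = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(r)]    # stands in for trained A, r x k
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]    # stands in for trained B, d x r
x = [rng.gauss(0, 1) for _ in range(k)]

# Unmerged inference: two extra small matrix-vector products per call
adapter_out = [b + u for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged inference: a single matrix-vector product, same cost as the base model
W_merged = merge_lora(W, A, B)
merged_out = matvec(W_merged, x)

max_diff = max(abs(p - q) for p, q in zip(adapter_out, merged_out))
```

The two outputs agree up to floating-point rounding, which is why the merged model needs no extra work at inference time.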

References:

Hugging Face Blog: PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models - https://huggingface.co/blog/peft
Sebastian Raschka: Understanding Parameter-Efficient Fine-Tuning - https://magazine.sebastianraschka.com/p/understanding-parameter-efficient