Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
Introduction
In this section, we discuss fully sparsely activated large language models (LLMs). We have observed that LLMs perform exceptionally well across various natural language processing tasks. However, we face challenges when deploying these models in real-world scenarios due to their significant computational demands and memory usage, particularly during the inference phase. To tackle this issue, recent studies have explored different methods to enhance the efficiency of LLMs, such as quantization, pruning, distillation, and improved decoding techniques.
One effective strategy we can employ is to leverage sparsity to decrease the number of active parameters in LLMs. Sparsity enhances efficiency in two main ways. First, it minimizes the computational load during matrix multiplication since we do not need to compute the zero elements. Second, it reduces the input-output (I/O) operations required to transfer parameters between memory and computation units, which is a major bottleneck during inference.
A common method for achieving sparsity in LLMs is through weight sparsity, which involves pruning model weights to save on computation. However, we find that unstructured weight sparsity is challenging to parallelize on GPU devices, while structured weight sparsity can significantly affect model accuracy. Alternatively, we can use activation sparsity, which decreases the number of active elements in the activation tensors. We can achieve activation sparsity through mechanisms like the mixture of experts (MoE), modifying the activation function, or predicting which positions to sparsify. Unfortunately, these methods do not fully enable activation sparsity in LLMs, which can limit the efficiency improvements during inference.
Additionally, we note that the scaling laws for sparsely activated LLMs compared to dense models have not been thoroughly investigated. To fully explore the potential of sparsity in LLMs, we introduce Q-Sparse, a straightforward yet effective method to achieve complete activation sparsity.
The primary change we make to LLMs involves the linear projection, specifically the matrix multiplication. For each linear projection, we implement a top-K sparsification function that selects the top K activations from the input tensor. During backpropagation, we utilize a straight-through estimator to calculate the gradients of the activations. We also introduce a squared ReLU function for the feed-forward layers to enhance activation sparsity further. Q-Sparse is compatible with both full-precision and quantized LLMs.
To investigate the scaling laws of sparsely activated LLMs, we conduct a series of experiments and derive an inference-optimal scaling law. Our findings indicate that sparsely activated models outperform dense baselines under the same inference compute budget, that is, the same number of activated parameters or floating-point operations. As the number of parameters grows, the performance gap between sparsely activated models and dense baselines narrows. Specifically, sparsely activated models with around 40% sparsity can match the performance of dense models of the same size trained on the same data. For a fixed inference budget, a sparsely activated full-precision model with a sparsity ratio of approximately 45.58% achieves optimal performance; for models with 1.58-bit weights, the optimal sparsity ratio is around 61.2%.
We also evaluate the effectiveness of Q-Sparse in various scenarios, including training from scratch, continuing training on existing LLMs, and fine-tuning. Our results demonstrate that Q-Sparse can deliver performance comparable to baseline LLMs while being significantly more efficient during inference.
Regarding the architecture of Q-Sparse, it is based on the Transformer architecture with modifications to facilitate activation sparsity. The Transformer uses a linear operation to perform projections in both attention and feed-forward layers. We introduce a top-K sparsity function that operates on this matrix multiplication, identifying the top K activations in the input tensor based on their absolute values. To further refine the results, we rescale the tensor by its L2 norm after applying the top-K sparsity function. We also present a quantized version of the top-K sparsity function, which reduces memory usage and computational costs without sacrificing performance. This quantized function converts the input tensor into an 8-bit representation, ensuring that we avoid division by zero by incorporating a small constant.
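To make the projection-level mechanism concrete, here is a minimal PyTorch sketch of a top-K sparsification step as described above: it keeps the K largest-magnitude activations, rescales the result by its L2 norm, and includes an 8-bit variant that adds a small constant to avoid division by zero. The function names, the ratio-based parameterization of K, and the epsilon values are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def topk_sparsify(x: torch.Tensor, k_ratio: float = 0.5) -> torch.Tensor:
    """Keep the K largest-magnitude activations along the last dimension,
    zero out the rest, and rescale the result by its L2 norm."""
    k = max(1, int(x.shape[-1] * k_ratio))
    _, idx = torch.topk(x.abs(), k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)   # 1 at the top-K positions
    y = x * mask
    return y / (y.norm(dim=-1, keepdim=True) + 1e-6)    # L2 rescaling

def topk_sparsify_int8(x: torch.Tensor, k_ratio: float = 0.5) -> torch.Tensor:
    """Quantized variant: map the input to an 8-bit range first; the small
    constant in the scale keeps the division away from zero."""
    scale = 127.0 / (x.abs().max(dim=-1, keepdim=True).values + 1e-6)
    x_q = torch.clamp((x * scale).round(), -128, 127)
    return topk_sparsify(x_q, k_ratio)
```

In a Q-Sparse-style layer, such a function would sit in front of each linear projection, e.g. `y = linear(topk_sparsify(x))`; the default `k_ratio` of 0.5 is only a placeholder.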
Q-Sparse can be applied to both full-precision and quantized LLMs. Specifically, when using Q-Sparse with one-bit LLMs, we perform quantization on the weight tensor to achieve a 1.58-bit representation. To enhance activation sparsity even further, we utilize a squared ReLU function in the feed-forward layers. This function is defined as the square of the ReLU output. Following the LLaMA architecture, we incorporate a gated linear unit (GLU) into the feed-forward layers, resulting in a combined ReLU-squared GLU function.
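The feed-forward modification can be sketched as follows, assuming a LLaMA-style gated layer in which the squared ReLU is applied on the gate branch; the module and parameter names are placeholders rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredReLUGLU(nn.Module):
    """Gated feed-forward block combining a GLU with a squared ReLU:
    out = W_down(ReLU(W_gate x)^2 * (W_up x)).  Layer names and the exact
    placement of the square are assumptions, not the reference code."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = F.relu(self.gate(x)) ** 2      # squared ReLU: zeros stay zero, so activations are sparser
        return self.down(g * self.up(x))
```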
Summary:
In this section, we introduce Q-Sparse, a method designed to enhance the efficiency of large language models (LLMs) by enabling full sparsity of activations, which reduces computational costs and memory usage during inference. Our approach incorporates a top-K sparsification function in linear projections and employs a squared ReLU activation to further optimize performance, demonstrating that sparsely activated models can outperform dense baselines while maintaining comparable training costs.
Training
In this section, we discuss the training of sparsely activated models. Most existing approaches use the standard backpropagation algorithm to compute the gradient through the sparsity function: a mask tensor marks the top-K activations in the input tensor, and the mask is applied by element-wise multiplication. However, standard backpropagation has a drawback: it sets the gradients of the non-activated elements to zero, which can cause the vanishing gradient problem, particularly when the sparsity ratio is high.
To address this issue, we propose using the straight-through estimator (STE) to backpropagate the gradients through the sparsity function. This method allows the gradients to pass through the sparsity function without being zeroed out.
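A common way to realize the straight-through estimator in PyTorch is the detach trick sketched below: the forward pass returns the masked (sparse) activations, while the backward pass treats the operation as the identity, so gradients also reach the non-activated elements. This is a minimal illustration of the idea, not the paper's exact implementation.

```python
import torch

def topk_ste(x: torch.Tensor, k_ratio: float = 0.5) -> torch.Tensor:
    """Forward: top-K masked activations.  Backward: identity gradient
    (straight-through), so non-activated positions still receive gradient."""
    k = max(1, int(x.shape[-1] * k_ratio))
    _, idx = torch.topk(x.abs(), k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    y = x * mask
    # x + (y - x).detach() equals y in the forward pass, but its gradient
    # with respect to x is the identity, which is the straight-through estimator.
    return x + (y - x).detach()
```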
We visualize the average L2 norm of the gradients for each projection across different layers for dense models and Q-Sparse models, both with and without the straight-through estimator. We set the top K to 50% for Q-Sparse. Our observations indicate that without the straight-through estimator, the gradients are significantly smaller in the lower layers, while the straight-through estimator helps maintain the magnitude of the gradients. This is illustrated in our figures, showing that the straight-through estimator effectively mitigates the vanishing gradient issue, especially in the lower layers. We provide additional visualizations for each component in the appendix.
Next, we explore Q-Sparse in various training scenarios, including training from scratch, continuing training, and fine-tuning. In the continuing-training and fine-tuning scenarios, we follow the same architecture and training procedures as in the training-from-scratch setting. The only difference is that we start from pre-trained weights and continue training with the sparsity function enabled. For pre-trained models whose feed-forward layers do not use the squared ReLU function, we apply the top-K sparsity function after the existing activation function (e.g., GELU) in those layers, as sketched below. This adjustment increases the sparsity of the activations without altering the model architecture.
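For these continued-training and fine-tuning settings, the adaptation can be sketched as a thin wrapper that keeps the pre-trained feed-forward weights and inserts the top-K mask right after the existing GELU; the wrapper and attribute names below are hypothetical, and during training the mask would be combined with the STE trick shown earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def topk_mask(x: torch.Tensor, k_ratio: float) -> torch.Tensor:
    """Binary mask marking the largest-magnitude entries along the last dimension."""
    k = max(1, int(x.shape[-1] * k_ratio))
    _, idx = torch.topk(x.abs(), k, dim=-1)
    return torch.zeros_like(x).scatter_(-1, idx, 1.0)

class SparsifiedFFN(nn.Module):
    """Wrap a pre-trained FFN: keep its weights, but apply top-K sparsification
    right after the existing GELU activation (a sketch; `fc1`/`fc2` are
    placeholders for the real projection layers of the pre-trained model)."""
    def __init__(self, fc1: nn.Linear, fc2: nn.Linear, k_ratio: float = 0.5):
        super().__init__()
        self.fc1, self.fc2, self.k_ratio = fc1, fc2, k_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.gelu(self.fc1(x))
        return self.fc2(h * topk_mask(h, self.k_ratio))
```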
We also examine scaling laws. Recent studies on large language models have demonstrated that their performance improves with both model size and the amount of training data. It has been suggested that the performance of a dense transformer model with a certain number of parameters follows a power law scaling law. In our work, we investigate the scaling law for sparsely activated large language models. We find that their performance also adheres to a power law scaling law, which we express in terms of the model's parameters and sparsity ratio.
To derive the scaling law for sparsely activated large language models, we conduct a series of scaling experiments. We train language models of various sizes, ranging from 300 million to 7 billion parameters, with Q-Sparse on the RedPajama dataset, preprocessing the data with the SentencePiece tokenizer from LLaMA. Alongside Q-Sparse, we also train dense baseline models under the same conditions. The observed losses for both the sparsely activated models and the dense baselines are presented in our figures. From these results, we find that the performance of a sparsely activated model scales with both the model size and the sparsity ratio. For a fixed sparsity ratio, performance follows a power-law scaling law with respect to the model size; for a fixed number of parameters, performance follows an exponential scaling law with respect to the sparsity ratio. As the number of parameters increases, the performance gap between the sparsely activated models and the dense baselines narrows.
Summary:
In this section, we discuss our approach to training sparsely activated models using the straight-through estimator (STE) to mitigate the vanishing gradient problem caused by traditional backpropagation methods. We also explore the scaling laws of these models, demonstrating that their performance scales with both model size and sparsity ratio, following a combination of power law and exponential law relationships.
Exponential Law and the Sparsity Ratio
In this section, we explore the exponential law related to the sparsity ratio in our models. Our findings indicate that the performance of models that use sparse activation adheres to an exponential scaling law based on the sparsity ratio, which we denote as S. Consequently, we can infer that the scaling factor, which we refer to as a(S), should also follow an exponential pattern. Additionally, for any given model size, represented as N, we observe that the scaling function increases as the sparsity ratio S increases. This leads us to conclude that a(S) is a non-decreasing function.
We express the scaling factor a(S) in terms of other parameters: B is the scaling factor for extremely sparse models, C is the scaling factor for dense models, and β is the scaling exponent of a(S) with respect to the sparsity ratio S. To fit the parameters of this scaling law, we analyze the observed losses of our sparsely activated models. We use the L-BFGS algorithm to minimize the Huber loss between the predicted and the observed log loss, set the Huber loss parameter δ to a small value of 10^-3, and choose the best fit from a range of initial conditions around possible local optima (a fitting sketch follows the estimates below). Our estimates for the parameters α, B, C, and β are approximately:
- α: 1.86
- B: 0.01
- C: 1.89
- β: 0.10
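The fitting procedure described above can be sketched as follows. The functional form inside `predicted_log_loss` is an illustrative assumption that merely mirrors the stated behavior (a power law in the model size N with an S-dependent factor); the paper's exact parameterization may differ, and all names here are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r: np.ndarray, delta: float = 1e-3) -> np.ndarray:
    """Huber loss on residuals r with threshold delta."""
    quad = np.minimum(np.abs(r), delta)
    return 0.5 * quad**2 + delta * (np.abs(r) - quad)

def predicted_log_loss(params, N, S):
    """Assumed illustrative form: L = E + a(S) / N**alpha with
    a(S) = B + C * exp(beta / (1 - S)).  This is not the paper's exact
    formula, only a stand-in with a power law in N and an exponential in S."""
    E, alpha, B, C, beta = params
    a_S = B + C * np.exp(beta / (1.0 - S))
    return np.log(E + a_S / N**alpha)

def fit(N, S, observed_loss, inits):
    """Minimize the Huber loss between predicted and observed log loss with
    L-BFGS, keeping the best result over several initial conditions."""
    target = np.log(observed_loss)
    def objective(p):
        return huber(predicted_log_loss(p, N, S) - target).sum()
    fits = [minimize(objective, x0, method="L-BFGS-B") for x0 in inits]
    return min(fits, key=lambda r: r.fun).x
```

The Huber loss keeps the fit robust to outlier runs, and sweeping several initializations corresponds to choosing the best fit from a range of initial conditions as described above.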
Next, we examine the diminishing performance gap between our sparsely activated models and the dense baselines, both having the same model size N and sparsity ratio S. As we increase the model size N, we find that the performance gap narrows. We can express this performance gap mathematically, and since α is a positive constant, it confirms that as N grows, the performance of our sparsely activated models can eventually equal that of the dense baselines.
We also transform the scaling law to focus on the activated parameters, denoted as N_act, which reflects the effective computation during inference. Here, N_act equals N multiplied by (1 - S). Given that a(S) is an increasing function and (1 - S) raised to the power of α is a decreasing function, we identify a specific sparsity ratio, denoted as S*, that minimizes the loss for our sparsely activated models. We find that S* is approximately 45.58%. This indicates that a sparsely activated model with a sparsity ratio of 45.58% or 1.84 times N_act parameters can achieve optimal performance within the same inference budget.
We apply the same methodology to estimate the inference-optimal scaling law for models using 1.58-bit Q-Sparse, discovering that the optimal sparsity ratio here is 61.2% or 2.58 times N_act parameters. Our results illustrate the inference-optimal scaling curves for both full-precision and 1.58-bit weight models, demonstrating that our sparsely activated models can significantly reduce the number of activated parameters or floating point operations during inference while maintaining comparable performance. This scaling law provides a framework for optimizing the performance of our sparsely activated models by adjusting the sparsity ratio S, which can guide our training processes and enhance model performance during inference.
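The multipliers quoted above follow directly from the definition N_act = N(1 - S), as the short calculation below shows; it uses only the sparsity ratios reported in the text.

```latex
% Total vs. activated parameters at the inference-optimal sparsity S*
N_{\text{act}} = N(1 - S) \quad\Longrightarrow\quad N = \frac{N_{\text{act}}}{1 - S^{*}}
% Full precision:  1/(1 - 0.4558) \approx 1.84, so N \approx 1.84\,N_{\text{act}}
% 1.58-bit:        1/(1 - 0.612)  \approx 2.58, so N \approx 2.58\,N_{\text{act}}
```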
In our experiments, we assess the effectiveness of Q-Sparse across various settings, including training from scratch, continuing training of existing large language models, and fine-tuning. For training from scratch, we train a series of language models with Q-Sparse in both full precision and 1.58-bit precision on 50 billion tokens of the RedPajama dataset, and compare them against dense baselines trained on the same data under the same settings. The results in our figures show that Q-Sparse with a 40% sparsity ratio can match the performance of the dense baselines at the same model size and number of training tokens.
Summary:
In this section, we demonstrate that the performance of sparsely activated models adheres to an exponential scaling law relative to the sparsity ratio, indicating that as model size increases, the performance gap between these models and dense baselines diminishes. We also establish an inference-optimal scaling law, revealing that a sparsity ratio of approximately 45.58% maximizes performance efficiency during inference, and we validate our findings through experiments comparing Q-Sparse models with dense baselines.
BitNet b1.58 + Q-Sparse
In this section, we evaluate the effectiveness of Q-Sparse on one-bit large language models (LLMs) by training a series of BitNet b1.58 models with Q-Sparse at different scales. We plot the training loss curves for both Q-Sparse and the BitNet b1.58 baseline. Our findings indicate that sparsely activated BitNet b1.58 models outperform the dense baselines given the same inference compute budget. This shows that Q-Sparse works well with one-bit LLMs and that, together, they can further improve model efficiency at inference time.
Next, we conduct an ablation study to assess the impact of the top-K sparsity function. We compare models using the top-K sparsity function against models using a ReLU sparsity function, and we investigate the effect of the straight-through estimator (STE) by comparing models trained with and without it. The results show that removing the STE, or replacing the top-K function with ReLU, significantly reduces performance. We also observe that the sparsity ratio of the ReLU-based models decreases over the course of training, while the sparsity ratio of the top-K models remains stable. A breakdown reveals that the decline in sparsity comes mainly from the QKV projection, the gating projection, and the up projection of the feed-forward layers, confirming the advantage of the top-K function over ReLU in our setting.
We continue training the Mistral 7B model on the FineWeb-Edu dataset for 40 billion tokens. We preprocess the data with the SentencePiece tokenizer from Mistral and use a batch size of 4 million tokens and a learning rate of 5 * 10^-5, with the Adam optimizer and a weight decay of 0.01. Further training details can be found in the appendix. To ensure a fair comparison, we also continue training the Mistral 7B model with the same recipe as the dense baseline. We compare Q-Sparse with the ReLUfication and dReLU Sparsification methods, which modify the activation function to induce sparsity. Following the original papers, we adopt a two-stage training strategy that first replaces the non-ReLU activations with ReLU for ReLUfication and then adds further ReLU functions for dReLU Sparsification, adhering to the original implementations.
We evaluate these models on a range of language tasks, including ARC-Challenge, HellaSwag, WinoGrande, MMLU, and TruthfulQA. The results show that Q-Sparse performs comparably to the dense baseline while being significantly more efficient at inference time. Furthermore, Q-Sparse outperforms both ReLUfication and dReLU Sparsification in terms of performance and sparsity ratio. Breaking down the sparsity of each model component, we find that Q-Sparse achieves a higher sparsity ratio than the other methods: the sparsity ratios of the query, key, value, output, up, and down tensors exceed 40%, while the sparsity ratio of the gate tensor exceeds 60%. This demonstrates that Q-Sparse can achieve full sparsity of activations in LLMs.
In another setting, we fine-tune the base models of Mistral 7B and Qwen1.5 7B on the OpenOrca dataset. For both the dense baselines and Q-Sparse, we set the batch size to 128 and select learning rates from 3 * 10^-6, 5 * 10^-6, and 7 * 10^-6. All models are trained for one epoch to ensure a fair comparison, with hyperparameters detailed in the appendix. We evaluate these models on the same range of language tasks described above. The results indicate that Q-Sparse with 3.6 billion activated parameters significantly outperforms the Qwen1.5 4B dense model, and that Q-Sparse with around 4 billion activated parameters achieves performance comparable to both the Mistral 7B and Qwen1.5 7B models. This shows that Q-Sparse can fine-tune a dense pre-trained model into a much more efficient sparse model with minimal loss in accuracy.
In our discussion of future work, we highlight the promising results of combining BitNet b1.58 with Q-Sparse and plan to scale up training in terms of both model size and number of training tokens. We also aim to incorporate YOCO to address the KV-cache issue for LLM inference. The integration of BitNet b1.58, Q-Sparse, and YOCO offers a comprehensive strategy for optimizing all data types in LLM inference and deployment, covering model weights, activations, and the KV cache. We also note that the mixture-of-experts (MoE) method has been widely used to achieve sparse activation in LLMs; since Q-Sparse is orthogonal to MoE, it can be seamlessly integrated with it.
Lastly, we acknowledge that the current implementation of Q-Sparse is not well-suited for batch training and inference, and we are working on making it compatible with batch mode through innovations in both modeling and implementation.
Summary:
In this section, we demonstrate the effectiveness of Q-Sparse combined with one-bit models and validate its performance through various experiments. We also discuss the superiority of the top-K sparsity function over the ReLU function and highlight the success of Q-Sparse in different language tasks. We plan future work to scale up training and incorporate additional methods to optimize LLM inference and deployment.
Keywords
Keywords: Q-Sparse, Large Language Models (LLMs), sparse activations, inference efficiency, top-K sparsity function, straight-through estimator (STE), exponential scaling law, BitNet, fine-tuning, sparsity ratio, transformer architecture, quantization, scaling laws, mixture of experts (MoE).
FAQ
What is Q-Sparse?
Q-Sparse is a method designed to enable full sparsity of activations in large language models (LLMs), reducing their computational costs and memory usage during inference.
How does sparsity enhance efficiency in LLMs?
Sparsity enhances efficiency by minimizing the computational load during matrix multiplication and reducing the I/O operations required to transfer parameters between memory and computation units.
What are the two main types of sparsity discussed?
The two main types of sparsity are weight sparsity, which involves pruning model weights, and activation sparsity, which decreases the number of active elements in the activation tensors.