close

4 Popular Model Compression Techniques Explained

Object Detection

4 Popular Model Compression Techniques Explained

Sabina Pokhrel / January 19, 2022

Model compression reduces the size of a neural network (NN) without compromising accuracy. This size reduction is important because bigger NNs are difficult to deploy on resource-constrained devices. In this article, we will explore the benefits and drawbacks of 4 popular model compression techniques.

A brief history of model compression

In recent years, machine learning and deep learning have shown remarkable improvements in computer vision. 

Let’s look at the evolution of deep neural networks in the last decade.

The ImageNet ILSVRC challenge results (top-5 error (_)) from 2010 to 2015
The ImageNet ILSVRC challenge results (top-5 error (%)) from 2010 to 2015.

The chart above shows the top-5 error rate and number of layers of the winning models for the ImageNet challenge from 2010 to 2015.

In 2012, AlexNet won the ImageNet challenge with a very large margin improvement. 

That was the turning point in computer vision history because it showed that deep neural networks (DNNs) could perform tasks that were considered very difficult for computers, not to mention that DNNs could do this with an unprecedented level of accuracy. 

Each year, the accuracy of the winning models continued to increase, and so did their size.

The need for model compression

Many real-world applications demand real-time, on-device processing capabilities. For example, the AI on your home security camera must be able to process and notify you if there is someone unknown attempting to enter your house.

The main challenge with using state-of-the-art AI today is that the edge devices are resource-constrained. As a result, they have limited memory and limited processing capabilities. 

The deep learning models that perform well are big in size. However, the bigger the model, the more storage space it needs, making it difficult to deploy on resource-constrained devices. Moreover, a bigger model means a higher inference time and more energy consumption during inference. While these models have achieved great results in the lab, they aren’t usable in many real-world applications.

That leaves us with one option: reducing the size of the model.

Making a smaller model that can run under the constraints of edge devices is a key challenge. It’s also important to do this without compromising accuracy.

It is just not enough to have a small model that can run on resource-constrained devices. It should perform well, both in terms of accuracy and inference speed.

That’s where model compression techniques come in.

As the name suggests, model compression helps reduce the size of the neural network without compromising too much on accuracy.

The resulting models are both memory and energy efficient.

Many model compression techniques can be used to reduce the model size. This article will focus on four popular compression techniques:

  1. Pruning
  2. Quantization
  3. Knowledge distillation
  4. Low-rank factorization

1. The pruning technique 

Pruning is a powerful technique to reduce the number of deep neural networks parameters. In DNNs, many parameters are redundant because they do not contribute much during training. So, after training, such parameters can be removed from the network with little effect on accuracy.

Before and after pruning
Before and after pruning

Pruning results in compressed models that run faster. It reduces the computational cost involved in network training and reduces the overall model size. Importantly, it also saves computation time and energy. 

A model can be pruned during or after the training. Various techniques exist for this pruning process; some of these are:

  • Weight Pruning

Weight connections that are below some predefined thresholds are pruned (zeroed out).

Weight pruning
Weight pruning
  • Neuron Pruning

 Instead of removing the weights one by one, which can be time-consuming, we prune the neurons.

Neuron pruning
Neuron pruning
  • Filter Pruning

Filters are ranked according to their importance, and the least important filters are removed from the network. The importance of the filters can be calculated by L1/L2 norm.

Filter pruning
Filter pruning
  • Layer Pruning

Layers can also be pruned.

Layer pruning
Layer pruning

The graphs below show the size and speed vs accuracy trade-offs for different pruning methods and families of architectures. 

Pruning - speed and size tradeoffs
Pruning – speed and size trade-offs

By looking at these graphs, we can observe that pruned models sometimes perform better than the original architecture, but they rarely outperform the better architecture.

To summarize, pruning can be applied during or after training. It can be applied to both convolutional and fully connected layers. Both the original and pruned model have the same architecture. Pruned models sometimes outperform the original architecture but rarely outperform a better architecture.

2. The quantization technique 

In DNNs, weights are stored as 32-bit floating point numbers.

Quantization
Quantization

Quantization compresses the original network by reducing the number of bits required to represent each weight. For example, the weights can be quantized to 16-bit, 8-bit, 4-bit and even 1-bit. By reducing the number of bits used, the size of the DNN can be significantly reduced.

There are two forms of quantization. These are post-training quantization and quantization aware training.

Comparison of quantized and non-quantized models on ImageNet
Comparison of quantized and non-quantized models on ImageNet

Quantization can be applied both during and after training. It can be applied to both convolutional and fully connected layers. However, quantized weights make neural networks harder to converge and make back-propagation infeasible.

The table below shows the comparison of different DNN compression model compression techniques on the ImageNet dataset and standard pre-trained models.

Table Comparison of Different DNN compression techniques on the ImageNet dataset and standard pre-trained models
Table Comparison of Different DNN compression techniques on the ImageNet dataset and standard pre-trained models

By using pruning on AlexNet, the resulting network was 9 times smaller and 3 times faster than the original network without any reduction in accuracy. When further quantization was applied, the resulting network was 35 times smaller and 3 times faster than the original network.

Similarly, VGG16 was reduced 13 times by applying pruning, which sped up the inference by 5 times without significantly reducing accuracy. When further quantization was applied, the size was reduced 49 times that of the original model. 

3. The knowledge distillation technique 

In knowledge distillation, a large, complex model is trained on a large dataset in knowledge distillation. When this model can generalize and perform well on unseen data, the knowledge is transferred to a smaller network. The larger model is known as the teacher model, and the smaller network is known as the student network.

Knowledge distillation
Knowledge distillation

In knowledge distillation, knowledge types, distillation strategies and the teacher-student architectures play a crucial role in the student learning.

Knowledge distillation is different from transfer learning, where we use the same model architecture and learned weights, and only replace some of the fully connected layers with new layers as per the requirements of the applications.

Currently, the knowledge distillation technique is limited to classification-based applications only. 

It is challenging to train student networks from the teacher for tasks such as object detection and segmentation. 

The activations, neurons or features of intermediate layers also can be used as the knowledge to guide the learning of the student model. Different forms of knowledge can be transferred. These are response-based knowledge, feature-based knowledge, and relation-based knowledge.

Different knowledge distillation strategies are:

  • Offline distillation
  • Online distillation
  • Self-distillation
Knowledge distillation - strategies
Knowledge distillation – strategies

4. The low-rank factorization technique 

Low-rank factorization identifies redundant parameters of deep neural networks by employing the matrix and tensor decomposition. When reducing the model size is necessary, a low-rank factorization technique helps by decomposing a large matrix into smaller matrices.

A weight matrix A with m x n dimension and having a rank r can be decomposed into smaller matrices.

Low-rank factorization
Low-rank factorization

The low-rank factorization of the dense layer matrices mainly improves the storage requirements and makes the model storage-friendly, while the factorization of convolutional layers makes the inference process faster.

The accuracy and model performance depends on proper factorization and rank selection. Here, the main challenge is that the decomposition process results in harder implementations and is computationally intensive.

To summarize, low-rank factorization can be applied during or after training. Furthermore, it can be applied to both convolutional and fully connected layers. When applied during training, it can reduce training time. In addition, the factorization of the dense layer matrices reduces model size and improves speed up to 30-50% compared to the full-rank matrix representation.

Model compression and its challenges

The model compression and acceleration methods of DNNs have achieved significant momentum over the last couple of years. However, a major challenge that still exists with DNNs is finding the right balance between varying resource availability and system performance for resource-constrained devices. 

Model compression techniques are complementary to each other. These techniques can be applied to pre-trained models as a post-processing step to reduce model size and increase inference speed. They can be applied during training time as well. 

Pruning and Quantization have now been baked into machine learning frameworks such as Tensorflow and PyTorch.

There are many more compression approaches beyond the four common ones covered in this article, such as weight sharing-based model compression, structural matrix, transferred filters, and compact filters.

Spread the love

About the Author

    Sabina is an AI Specialist and Machine Learning Engineer. She is a Writer and a former Editorial Associate at Towards Data Science.

Trusted by

Xailient’s commercial partners

Press releases

March 14, 2022

Xailient specializes in extremely efficient low-power computer vision. Intel's OpenVINO specializes in maximizing the performance and speed of computer vision AI workloads. OpenVINO improved Xailient FPS 9.5x on Intel hardware to 448 FPS. Together, Xailient-Intel outperforms the comparable MobileNet_SSD by 80x. Even after Intel worked the OpenVINO magic on MobileNet_SSD, Xailient-OpenVINO is 14x faster.

November 29, 2021

Xailient’s Face Recognition enables high-speed edge AI processing with low-power consumption using Sony’s IMX500 – a chip so small it can fit on the tip of your finger.

OnEdge Newsletter

A weekly newsletter with the best hand-picked resources about Edge AI and Computer Vision

OnEdge is a free weekly newsletter that keeps you ahead of the curve on low-powered Edge devices and computer vision AI.

 

You’ll get insights and resources into:

  • Edge computing use cases.
  • Market trends in Edge computing.
  • Computer Vision AI at the Edge.
  • Machine learning at the Edge.
Cookie Policy