Demystifying the Lottery Ticket Hypothesis in Deep Learning

Why lottery tickets are the next big thing in training neural networks


Published in Towards Data Science · 4 min read · Mar 3, 2022

Training neural networks is expensive. OpenAI’s GPT-3 has an estimated training cost of $4.6M, even using the lowest-cost cloud GPUs on the market. It’s no wonder that Frankle and Carbin’s 2019 Lottery Ticket Hypothesis started a gold rush in research, drawing attention from top academic minds and tech giants like Facebook and Microsoft. In the paper, they present empirical evidence for the existence of winning (lottery) tickets: subnetworks of a neural network that can be trained to perform as well as the original network at a fraction of its size. In this post, I’ll cover how this works, why it is revolutionary, and the state of the research.

Traditional wisdom says that neural networks are best pruned after training, not at the start. By pruning weights, neurons, or other components, the resulting neural network is smaller, faster, and consumes fewer resources during inference. When done right, accuracy is unaffected while the network can shrink many-fold.
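As a concrete reference point, here is a minimal sketch of one-shot magnitude pruning in NumPy. This is my own illustration, not code from any of the papers discussed: after training, the lowest-magnitude weights are masked out.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, fraction: float) -> np.ndarray:
    """Return a binary mask that removes the `fraction` smallest-magnitude weights."""
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)                  # number of weights to prune
    if k == 0:
        return np.ones_like(weights)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return (np.abs(weights) > threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))          # stand-in for a trained weight matrix
mask = magnitude_prune(w, fraction=0.5)
pruned = w * mask                    # survivors keep their trained values
```

In a real deployment the mask is applied layer by layer (or globally across all layers) to a trained model, and the sparse network is then fine-tuned briefly to recover any lost accuracy.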

Flipping traditional wisdom on its head, we can ask: could we have pruned the network before training and achieved the same result? In other words, was the information in the pruned components necessary for the network to learn, even if not to represent what it learned?

The Lottery Ticket Hypothesis focuses on pruning weights and offers empirical evidence that certain pruned subnetworks can be trained from the start to achieve performance similar to the entire network. How? Iterative Magnitude Pruning (IMP).
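IMP alternates training, pruning, and rewinding. The loop below is a simplified, framework-free sketch of that procedure; the `train_fn` stand-in and the parameter names are my own, whereas the real method trains a full network with SGD at each round.

```python
import numpy as np

def iterative_magnitude_pruning(init_weights, train_fn, rounds=3, prune_rate=0.2):
    """Train, prune the smallest surviving weights, rewind the survivors
    to their original initialization, and repeat."""
    mask = np.ones_like(init_weights)
    weights = init_weights.copy()
    for _ in range(rounds):
        trained = train_fn(weights * mask)           # train the masked network
        survivor_mags = np.abs(trained)[mask == 1]
        k = int(prune_rate * survivor_mags.size)     # prune 20% of survivors
        if k > 0:
            threshold = np.partition(survivor_mags, k - 1)[k - 1]
            mask[(np.abs(trained) <= threshold) & (mask == 1)] = 0
        weights = init_weights.copy()                # rewind to initialization
    return mask, weights * mask                      # candidate winning ticket

# toy usage: "training" is a stand-in that just scales the weights
rng = np.random.default_rng(1)
w0 = rng.normal(size=(10, 10))
ticket_mask, ticket = iterative_magnitude_pruning(w0, train_fn=lambda w: 1.1 * w)
```

The rewind on the last line of the loop is the crucial step: the surviving weights go back to their exact initial values, not to a fresh random draw.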

When this was tried historically, the pruned network’s weights were reinitialized randomly, and performance dropped off quickly.

The key difference with IMP is that the surviving weights are returned to their original initialization values. When trained, these subnetworks matched the original network’s performance in the same training time, even at high levels of pruning.


This suggests that lottery tickets exist as an intersection of a specific subnetwork and its initial weights. They are “winning the lottery,” so to speak: the match of that architecture and those weights performs as well as the entire network. Does this hold for bigger models?

For bigger models, it does not hold with the same approach. To probe sensitivity to noise, Frankle and Carbin duplicated the pruned networks and trained the copies with differently ordered data. IMP succeeds where linear mode connectivity exists: a rare phenomenon in which the copies converge to the same local minimum. For small networks, this happens naturally. For large networks, it does not. So what to do?

Starting with a smaller learning rate and increasing it over time (warmup) makes IMP work for large models, since sensitivity to the initial noise from the data is lessened. The other finding is that rewinding the pruned network’s weights to their values at an early training iteration, rather than at initialization, works as well; for example, to the weights at the 10th iteration of a 1,000-iteration training run.
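Rewinding only requires saving one extra checkpoint early in training. A hypothetical sketch (the function name and the `step_fn` stand-in for an optimizer step are my own):

```python
import numpy as np

def train_with_rewind(w0, step_fn, total_steps=1000, rewind_step=10):
    """Run training while saving a checkpoint at an early step; pruned
    survivors are later rewound to that checkpoint instead of to step 0."""
    w, checkpoint = w0.copy(), None
    for step in range(1, total_steps + 1):
        w = step_fn(w)                     # one optimization step (stand-in)
        if step == rewind_step:
            checkpoint = w.copy()          # weights to rewind survivors to
    return w, checkpoint

# toy usage with a dummy update rule
final, ckpt = train_with_rewind(np.zeros((2, 2)), step_fn=lambda w: w + 0.01,
                                total_steps=100, rewind_step=10)
```

After pruning based on `final`, the surviving weights would be reset to `ckpt` before retraining, rather than to their step-0 values.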

These results have held steady across architectures as different as transformers, LSTMs, CNNs, and reinforcement learning architectures.

While the paper demonstrated that these lottery tickets exist, it does not provide a way to identify them without first training the full network. Hence the gold rush in characterizing their properties and determining whether they can be identified before training. They are also inspiring work on heuristics for pruning early, since current heuristics focus on pruning after training.

One Ticket to Win Them All (2019) shows that lottery tickets encode information that is invariant to dataset and optimizer. The authors successfully transfer lottery tickets between networks trained on different datasets (e.g., from CIFAR-10 to ImageNet).

A key factor was the relative size of the training datasets: a ticket generated on a larger dataset than the destination network’s transferred better; otherwise it performed similarly or worse.


Drawing Early-Bird Tickets (2019): This paper shows that lottery tickets can be found early in training. At each training epoch, a pruning mask is computed; when the Hamming distance between consecutive masks falls below a threshold, training stops and the network is pruned.
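The stopping criterion can be sketched in a few lines. This is a simplified illustration of the idea (the actual paper compares each new mask against a FIFO queue of recent masks, not just the previous one):

```python
import numpy as np

def mask_distance(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Fraction of positions where two binary pruning masks disagree
    (normalized Hamming distance)."""
    return float(np.mean(mask_a != mask_b))

# toy usage: stop once consecutive masks have stabilized
prev_mask = np.array([1, 1, 0, 0, 1, 1, 0, 1])
curr_mask = np.array([1, 1, 0, 0, 1, 0, 0, 1])
stop = mask_distance(prev_mask, curr_mask) < 0.2   # masks differ in 1/8 positions
```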

Pruning Neural Networks Without Any Data by Iteratively Conserving Synaptic Flow (2020): This paper computes pruning at initialization with no data, and it outperforms existing state-of-the-art pruning-at-initialization algorithms. The technique aims to maximize critical compression: the maximum pruning that can occur without collapsing performance, which requires preventing entire layers from being pruned away. To do so, the method scores every weight by its “synaptic flow” and reevaluates the scores after each pruning round.
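To make the score concrete, here is a closed-form sketch of the synaptic-flow saliency for a toy two-layer linear network. This is my own hand-derived illustration under that restricted setting; the actual SynFlow algorithm uses automatic differentiation to score arbitrary architectures.

```python
import numpy as np

def synflow_scores(w1: np.ndarray, w2: np.ndarray):
    """Synaptic-flow saliency for a two-layer linear network.

    SynFlow feeds an all-ones input through the absolute-valued weights,
    R = 1^T |W2| |W1| 1, and scores each weight by |w * dR/dw|.  For this
    tiny case both gradients can be written out by hand."""
    a1, a2 = np.abs(w1), np.abs(w2)                      # |W1|: (h, d), |W2|: (o, h)
    g1 = np.outer(a2.sum(axis=0), np.ones(a1.shape[1]))  # dR/d|W1|
    g2 = np.outer(np.ones(a2.shape[0]), a1.sum(axis=1))  # dR/d|W2|
    return a1 * g1, a2 * g2                              # per-weight scores

s1, s2 = synflow_scores(np.array([[-1.0, 2.0], [3.0, -4.0]]),
                        np.array([[1.0, 1.0]]))
# the lowest-scoring weights are pruned, scores are recomputed, and the loop
# repeats -- the rescoring is what keeps any single layer from being emptied
```

Note that a weight’s score depends on the magnitudes of the weights in the other layer it connects through, which is why pruning one round at a time and rescoring avoids emptying a layer.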

The existence of small subnetworks in neural architectures that can be trained to perform as well as the entire neural network is opening a world of possibilities for efficient training. In the process, researchers are learning a lot about how neural networks learn and what is necessary for learning. And who knows? One day soon we may be able to prune our networks before training, saving time, compute, and energy.


