
Sharding Models: Slicing Giants into Dancing Fairies

4 min read · Apr 28, 2025

In the sacred scrolls of Computer Science, there is a golden proverb:

“If a task is too heavy for one, divide it among many.”

And so, when today’s Large Language Models — veritable Titans — grew so large they could no longer fit into the humble GPU memory of mere mortals, the ancient art of Sharding was reborn.

Pull up a chair, sip some masala chai, and let us embark upon the enchanting journey of How to use Sharding in Models. 🚀

🌟 What is Sharding?

Imagine you baked a cake so gigantic that no single plate could hold it.
Would you weep? Nay! You would cut it into slices and pass it around.

Sharding is exactly this:

Splitting a large model across multiple devices (or nodes or GPUs) so that each holds a small, manageable piece.

Each shard holds a part of the model’s parameters, and together they whisper secrets (activations and gradients) to each other across the network to behave like one single mighty model.

🛕 Why Shard a Model?

  • Memory Constraints: Your model is thicc (scientific term), and your GPU can’t hold it alone.
  • Faster Training: Many hands make light work.
  • Scalability: Tomorrow’s models will have trillions of parameters. You either shard, or you get left behind, weeping in 8GB VRAM.

🛡️ Ways to Shard Models

Now, in the royal courts of machine learning, there are different schools of sharding:

1. Tensor Sharding

Cut the tensors themselves. Slice that giant weight matrix row-wise or column-wise, like a ninja slicing watermelons.

Example:

  • GPU 1 holds the first half of a weight matrix.
  • GPU 2 holds the second half.

Used in DeepSpeed and Megatron-LM.

Poetry in code:

# Pseudo-code: partition a tensor into two row-wise shards
import torch

tensor = torch.randn(1000, 1000)
shard1, shard2 = tensor.chunk(2, dim=0)  # each shard is 500 x 1000
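
To see why slicing is harmless, here is a single-device sketch (purely illustrative, not tied to any framework): multiply the input by each column shard separately, then stitch the partial outputs back together, and you recover exactly what the unsharded matrix would have produced.

# Illustrative sketch on one device: column-wise shards of a weight matrix
import torch

x = torch.randn(4, 1000)                 # a small batch of inputs
weight = torch.randn(1000, 1000)         # the full weight matrix
w1, w2 = weight.chunk(2, dim=1)          # two column shards (one per GPU, in practice)

y = torch.cat([x @ w1, x @ w2], dim=1)   # stitch the partial outputs back together
assert torch.allclose(y, x @ weight, atol=1e-4)  # same answer as the unsharded matmul

In a real multi-GPU setup, each shard’s matmul runs on its own device, and the concatenation (or an all-gather) happens over the network.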

2. Layer Sharding (aka Pipeline Parallelism)

Ah, the assembly line of dreams!

Each device holds a different layer (or set of layers) of the model.

Like a relay race: GPU 1 computes Layer 1, hands the baton to GPU 2 for Layer 2, and so forth.

Poetry in code:

# Example with PyTorch's Pipe (pipeline parallelism); torch.distributed.rpc must be initialized first
import torch.nn as nn
from torch.distributed.pipeline.sync import Pipe

# Each stage must already live on its own GPU before wrapping
stage1 = nn.Sequential(layer1, layer2).to('cuda:0')
stage2 = nn.Sequential(layer3, layer4).to('cuda:1')
model = Pipe(nn.Sequential(stage1, stage2), chunks=8)  # chunks=8 splits each batch into micro-batches

3. Expert Sharding (Mixture of Experts)

This is like having a room full of wise old men, but only a few are consulted at a time.

Only parts of the model (“experts”) are active for each input.
Popular in models like Switch Transformer.

Efficiency: You shard everything, but you don’t wake everyone up unless needed. (Let the lazy geniuses nap.)
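
To make the idea concrete, here is a minimal toy router (a hypothetical sketch of my own, not the Switch Transformer implementation): a learned gate scores the experts for each token, and only the top-k chosen experts are asked to do any work.

# Hypothetical toy Mixture-of-Experts layer: a gate routes each token to its top-k experts
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)      # router: one score per expert, per token
        self.top_k = top_k

    def forward(self, x):                            # x: (tokens, dim)
        scores = self.gate(x).softmax(dim=-1)        # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                  # only the selected experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(8, 64))   # 8 tokens, each routed to its two favourite experts

In a real sharded deployment, each expert lives on a different device, and tokens are shipped across the network to whichever experts the gate selects.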

🛠️ How to Implement Sharding — A Ritual of Three Steps

Step 1: Pick a Framework

Several enchanted tools exist:

  • DeepSpeed (Microsoft’s battle-tested sharding spellbook)
  • FairScale (Meta’s gift to mankind)
  • FSDP (Fully Sharded Data Parallel, built into PyTorch itself)

Each has its own charm and cost.

Step 2: Wrap Your Model Carefully

Depending on the framework, you usually just “wrap” your model.

For example, in FSDP:

import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")      # a process group must exist before wrapping
model = MyLargeModel()               # placeholder for your own model class
sharded_model = FSDP(model)          # parameters are now sharded across ranks

It’s like tucking a giant into a perfectly tailored suit!

Step 3: Training and Saving Models

Training proceeds like normal — except now, under the hood, weights fly like carrier pigeons from GPU to GPU!

Saving needs special care too. Often, you’ll “gather” shards to save a single monolithic checkpoint or save shards individually.
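
For instance, with the FSDP wrapper from Step 2 (a hedged sketch that assumes the process group is already initialized and sharded_model is the wrapped model), gathering a full checkpoint onto rank 0 might look like this:

# Sketch: gather all shards into one full state dict on rank 0, then save it
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(sharded_model, StateDictType.FULL_STATE_DICT, cfg):
    full_state = sharded_model.state_dict()      # shards are gathered here

if dist.get_rank() == 0:                          # only one process writes the file
    torch.save(full_state, "model_checkpoint.pt")

Alternatively, most frameworks can also save each rank’s shard separately, which is faster for very large models.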

⚡ Quick Tips from the Wise

  • Network is the new bottleneck: Sharding means more communication!
    Invest in a good network stack — or your GPUs will sit around like bored cows.
  • Balance your shards: Uneven sharding leads to sad GPUs. Like uneven marriages — someone’s doing all the work.
  • Checkpoint wisely: Sharded checkpoints can become messy if not managed properly. Use your framework’s recommended methods.
  • Test with small models first: A cracked pot leaks wisdom before the whole dam bursts.

🧙 Final Words: Sharding — Ancient, Yet Modern

In a world where models grow fatter than feasting kings, sharding is the humble carpenter’s knife — carving giants into manageable heroes.

So go forth, young magus, and slice those models like an ancient master chef! 🍰
The AI future awaits those who dare to cut smart and stitch wisely.


Written by Aditya Mangal

Tech enthusiast weaving stories of code and life. Writing about innovation, reflection, and the timeless dance between mind and heart.
