Understand how your CNN processes images
Draw a digit (0-9) using your mouse or touch
Draw your own digit to see how the CNN classifies it in real-time. The model processes your drawing as a 28x28 grayscale image, just like MNIST training data.
These are the learned filters (kernels) of the selected Conv2D layer. Each filter detects specific patterns in the input. Blue/red colors show positive/negative weights.
Feature maps (activations) show how each filter responds to the input image. Bright areas indicate strong activations: places where the filter found its target pattern.
The Flatten layer converts the 2D feature maps into a 1D vector, which is required before passing data to Dense (fully connected) layers. Explicit spatial structure is discarded, but because the values are unrolled in a fixed (row-major) order, the following layers can still learn position-dependent weights.
Softmax converts raw network outputs (logits) into probabilities that sum to 1. Large positive values become high probabilities, while negative values become low probabilities. This is how the network expresses confidence in each class.
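The softmax computation can be sketched in a few lines of plain Python (a minimal illustration of the math, not the network's actual implementation):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The large positive logit dominates; the negative one gets little mass
probs = softmax([2.0, 1.0, -1.0])
```

The outputs always sum to 1, so they can be read directly as the network's confidence in each class.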
The saliency map highlights which parts of the input image most strongly influence the model's prediction. Warmer colors indicate higher importance.
Watch how a convolution filter slides over the input image, computing element-wise products at each position to produce the output feature map.
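The sliding-window computation can be written out explicitly. This is a minimal sketch with illustrative values (no padding, stride 1); the kernel shown is a simple vertical-edge detector, not one of the model's learned filters:

```python
def conv2d(image, kernel):
    # Slide the kernel over the image, summing element-wise
    # products at each position (no padding, stride 1)
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kH + 1):
        row = []
        for j in range(W - kW + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kH) for dj in range(kW))
            row.append(s)
        out.append(row)
    return out

# A tiny image with a dark-to-bright vertical edge
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[1, -1],
          [1, -1]]
feature_map = conv2d(image, kernel)
```

The output feature map responds only where the kernel straddles the edge; uniform regions produce zero.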
Attention-based architecture for sequence modeling
Learn how diffusion models generate images step by step
Diffusion models learn to reverse the process of adding noise to data. Imagine watching ink spread in water, then playing it backwards - the model learns to "gather" the ink back into its original shape.
Diffusion models learn to reverse the natural process of adding noise. Given a noisy image, they predict what noise to subtract.
The forward process gradually adds Gaussian noise to an image until it becomes pure noise. This is a fixed process - we don't learn it.
q(x_t | x_{t-1}) = N(√(1-β_t) x_{t-1}, β_t I)
The neural network learns to predict the noise that was added at each step. By subtracting this predicted noise, we gradually recover the original image.
p_θ(x_{t-1} | x_t) = N(μ_θ(x_t, t), Σ_θ(x_t, t))
Key: The network predicts ε (the noise), not the clean image directly.
The denoising network is a U-Net: an encoder-decoder with skip connections. It takes a noisy image and a timestep as input and outputs the predicted noise.
The noise schedule β_t controls how quickly noise is added. Different schedules affect generation quality.
Linear: Simple ramp from β_start to β_end. Cosine: Slower start, preserves more signal longer.
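The two schedules can be sketched directly. This is a minimal illustration; the β range shown is the common DDPM default, and the cosine form follows the usual ᾱ_t parameterization:

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    # Simple ramp from beta_start to beta_end
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cosine_alpha_bars(T, s=0.008):
    # Cumulative signal level ᾱ_t under the cosine schedule:
    # starts near 1 and decays slowly, preserving signal longer
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

betas = linear_betas(1000)
alpha_bars = cosine_alpha_bars(1000)
```

Plotting `alpha_bars` against the cumulative product implied by `betas` shows the difference: the cosine curve stays high for longer before falling to zero.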
Different algorithms for the reverse process trade off speed vs quality.
DDPM: Stochastic, higher quality. DDIM: Deterministic, faster, same seed = same output.
Guidance scale controls how strongly the model follows the class label. Higher = more recognizable but less diverse.
ε_guided = ε_uncond + s · (ε_cond - ε_uncond)
Low guidance (1-2): Diverse, fuzzy. High guidance (7-10): Sharp, less variety.
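The guidance formula is a simple extrapolation between two noise predictions. A minimal sketch, operating on plain lists standing in for the ε tensors:

```python
def guided_noise(eps_uncond, eps_cond, scale):
    # Push the prediction past the unconditional estimate,
    # in the direction the class label suggests
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# scale=1 reproduces the conditional prediction; larger scales extrapolate
eps = guided_noise([1.0, 0.0], [2.0, 1.0], scale=3.0)
```

At scale 1 the result equals the conditional prediction; higher scales exaggerate whatever makes the conditional prediction differ from the unconditional one, which is why samples become sharper but less diverse.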
Understand how transformers process language
Transformers are neural network architectures that revolutionized NLP. Unlike RNNs that process sequences step-by-step, transformers process all tokens in parallel using self-attention.
"Attention Is All You Need" (2017) showed that attention mechanisms alone, without recurrence, can achieve state-of-the-art results.
Type text below to see how it gets split into tokens. GPT-2 uses Byte Pair Encoding (BPE).
Tokenization breaks text into subword units. Common words become single tokens, while rare words are split into pieces. This allows handling any text with a fixed vocabulary.
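The effect of subword splitting can be illustrated with a greedy longest-match tokenizer over a toy vocabulary. This is a simplified stand-in: real GPT-2 BPE learns merge rules from byte pairs rather than matching against a fixed word list:

```python
def tokenize(text, vocab):
    # Greedily take the longest vocabulary entry at each position;
    # unknown characters fall back to single-character tokens
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"trans", "form", "er", "the", " "}
tokens = tokenize("the transformer", vocab)
```

"the" survives as one token while the rarer "transformer" splits into pieces, which is how a fixed vocabulary can cover arbitrary text.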
Embeddings convert discrete tokens into continuous vectors that the network can process. Position embeddings tell the model where each token appears in the sequence.
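In PyTorch this is two lookup tables whose outputs are summed. A minimal sketch using GPT-2 small's vocabulary size and embedding width (the short sequence length here is just for illustration):

```python
import torch
import torch.nn as nn

vocab_size, block_size, d_model = 50257, 8, 768  # GPT-2 small sizes
tok_emb = nn.Embedding(vocab_size, d_model)      # one vector per token id
pos_emb = nn.Embedding(block_size, d_model)      # one vector per position

ids = torch.randint(0, vocab_size, (1, block_size))   # token ids [B, T]
x = tok_emb(ids) + pos_emb(torch.arange(block_size))  # [B, T, d_model]
```

Without the position term, the model would see the same representation for "dog bites man" and "man bites dog".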
What am I looking for?
What do I contain?
Attention weights
What to retrieve?
Self-attention allows each token to "attend" to all other tokens. Query and Key dot products determine attention weights, which are used to aggregate Value vectors.
Instead of one attention mechanism, transformers use multiple "heads" in parallel, each learning different relationships.
Multiple attention heads allow the model to jointly attend to information from different representation subspaces. GPT-2 uses 12 heads per layer.
GPT generates text one token at a time, using all previous tokens as context.
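The generation loop can be sketched as follows, assuming `model` maps token ids `[B, T]` to logits `[B, T, vocab]`:

```python
import torch

@torch.no_grad()
def generate(model, ids, max_new_tokens):
    # Autoregressive loop: predict a distribution over the next token
    # from the full context, sample one token, append, repeat
    for _ in range(max_new_tokens):
        logits = model(ids)                           # [B, T, vocab]
        probs = torch.softmax(logits[:, -1], dim=-1)  # last position only
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids
```

Each iteration re-runs the model on the growing context, which is why generation cost grows with sequence length.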
Code snippets for each architecture
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim=2, hidden=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),  # [B,1,28,28] → [B,32,28,28]
            nn.ReLU(),
            nn.MaxPool2d(2),                 # → [B,32,14,14]
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # → [B,64,7,7]
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                    # → [B,3136]
            nn.Linear(64*7*7, 128),
            nn.ReLU(),
            nn.Linear(128, 10)               # 10 digit classes
        )

    def forward(self, x):
        return self.fc(self.conv(x))
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, head_dim)
        self.key = nn.Linear(embed_dim, head_dim)
        self.value = nn.Linear(embed_dim, head_dim)
        self.head_dim = head_dim

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)  # [B,T,head_dim]
        k = self.key(x)
        v = self.value(x)
        # Attention scores: Q @ K^T / sqrt(head_dim)
        att = (q @ k.transpose(-2, -1)) * self.head_dim**-0.5
        att = F.softmax(att, dim=-1)
        return att @ v  # weighted sum of values
class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x, adj):
        # x: node features [N, F]; adj: adjacency matrix [N, N]
        agg = adj @ x            # 1. aggregate neighbor features (message passing)
        out = self.linear(agg)   # 2. transform
        return F.relu(out)
# Schedules: α_t = 1 - β_t, ᾱ_t = ∏_{s≤t} α_s (precomputed)

# Forward: add noise in closed form
def q_sample(x0, t, noise):
    return alpha_bar[t]**0.5 * x0 + (1 - alpha_bar[t])**0.5 * noise

# Reverse: predict & remove noise (one DDPM step)
def p_sample(model, xt, t):
    pred_noise = model(xt, t)  # U-Net predicts the noise ε
    mean = (xt - beta[t] / (1 - alpha_bar[t])**0.5 * pred_noise) / alpha[t]**0.5
    return mean + sigma[t] * torch.randn_like(xt)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in dataloader:
        pred = model(x)
        loss = loss_fn(pred, y)
        optimizer.zero_grad()
        loss.backward()   # compute gradients
        optimizer.step()  # update weights
Quick-start presets for different learning scenarios
Classic non-linearly separable problem. Shows why hidden layers are needed - a single layer cannot solve XOR, but adding one hidden layer with a few neurons solves it easily.
A challenging dataset with two interleaved spirals. Requires deeper networks and more neurons. Try adding additional input features like x² or sin(x) to help!
Train a CNN to recognize handwritten digits (0-9). The default architecture achieves ~98% accuracy. Use the Explain button after training to visualize what the network learned.
Generate text with PicoGPT, a tiny transformer trained on Shakespeare. Watch the attention patterns as each new token is predicted based on context.
Watch a diffusion model progressively denoise random noise into a coherent image. Select different patterns and observe the step-by-step denoising process.