Neural Playground


Neural Network


CNN Architecture

Feature Maps

CNN Explainer

Understand how your CNN processes images

Input Image

Prediction

Layer Info


                            

Draw a Digit: The model was trained on MNIST handwritten digits. Try drawing numbers 0-9 in the center of the canvas with a thick stroke, similar to how you'd write with a marker.

Draw a digit (0-9) using your mouse or touch

Live Prediction

Processed Input (28x28)

Draw your own digit to see how the CNN classifies it in real-time. The model processes your drawing as a 28x28 grayscale image, just like MNIST training data.

These are the learned filters (kernels) of the selected Conv2D layer. Each filter detects specific patterns in the input. Blue/red colors show positive/negative weights.

Feature maps (activations) show how each filter responds to the input image. Bright areas indicate strong activations - where the filter found its target pattern.

Before Flatten (2D Feature Maps): After convolution and pooling, data exists as multiple 2D feature maps (height × width × channels). Each "channel" represents one filter's output.

Flatten

After Flatten (1D Vector): The flatten operation "unrolls" the 2D feature maps into a single 1D vector. This allows the data to connect to dense (fully connected) layers for classification.

The Flatten layer converts 2D feature maps into a 1D vector. This is necessary before passing data to Dense (fully connected) layers. The spatial structure is preserved in the ordering of values.
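A minimal shape check of this step (a sketch assuming 64 channels of 7×7 feature maps, as in a typical MNIST CNN):

```python
import torch
import torch.nn as nn

# Hypothetical output of a conv/pool stack: batch of 2,
# 64 channels, each a 7×7 feature map.
x = torch.randn(2, 64, 7, 7)

flat = nn.Flatten()(x)  # unrolls channels × height × width per sample
print(flat.shape)       # torch.Size([2, 3136])  (64 * 7 * 7 = 3136)
```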

Raw Logits (Before Softmax): Logits are the raw output scores from the final dense layer, before normalization. They can be any real number - positive values indicate higher confidence for that class.

Softmax: σ(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

Probabilities (After Softmax): Softmax converts logits to probabilities that sum to 1.0 (100%). The exponential function amplifies differences - the highest logit becomes a high probability, while others become near zero.

Softmax converts raw network outputs (logits) into probabilities that sum to 1. Large positive values become high probabilities, while negative values become low probabilities. This is how the network expresses confidence in each class.
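Numerically (a minimal sketch with made-up logits):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability (doesn't change the result)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The large positive logit dominates; the negative logit shrinks toward zero.
probs = softmax([2.0, 1.0, -1.0])
print(probs)  # three probabilities summing to 1.0
```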

The saliency map highlights which parts of the input image most strongly influence the model's prediction. Warmer colors indicate higher importance.

Watch how a convolution filter slides over the input image, computing element-wise products at each position to produce the output feature map.
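The sliding computation can be sketched in plain Python as a naive valid-mode convolution (the tiny image and edge kernel below are made up for illustration):

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum element-wise
    products at each position (valid mode: no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw)
            )
    return out

# A vertical-edge kernel responds where the image jumps from dark to bright:
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = [[-1, 1],
        [-1, 1]]
print(conv2d(img, edge))  # strongest response at the dark/bright boundary
```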


Text Generation


Transformer Architecture


Message Passing

Layer 1

Node Embeddings (2D Projection)


Generated Digits

4×4 Grid

Transformers

Attention-based architecture for sequence modeling

Diffusion Model Explainer

Learn how diffusion models generate images step by step

What is a Diffusion Model?

Diffusion models learn to reverse the process of adding noise to data. Imagine watching ink spread in water, then playing it backwards - the model learns to "gather" the ink back into its original shape.

Key Insight

Diffusion models learn to reverse the natural process of adding noise. Given a noisy image, they predict what noise to subtract.

Applications

  • DALL-E 2 - Text to image
  • Stable Diffusion - Open source
  • Midjourney - Artistic images
  • Sora - Video generation

The Forward Process (Noising)

The forward process gradually adds Gaussian noise to an image until it becomes pure noise. This is a fixed process - we don't learn it.

q(x_t | x_{t-1}) = N(√(1-β_t) x_{t-1}, β_t I)

The Reverse Process (Denoising)

The neural network learns to predict the noise that was added at each step. By subtracting this predicted noise, we gradually recover the original image.

p_θ(x_{t-1} | x_t) = N(μ_θ(x_t, t), Σ_θ(x_t, t))

Key: The network predicts ε (the noise), not the clean image directly.

U-Net Architecture

The denoising network is a U-Net: an encoder-decoder with skip connections. It takes a noisy image and timestep as input, outputs the predicted noise.

Encoder (downsample) Decoder (upsample) Skip connections

Noise Schedules

The noise schedule β_t controls how quickly noise is added. Different schedules affect generation quality.

Linear: Simple ramp from β_start to β_end. Cosine: Slower start, preserves more signal longer.
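Both schedules can be written in a few lines (a sketch; the β range 1e-4 to 0.02 and the cosine offset s=0.008 are commonly used values, not requirements):

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    # simple ramp from beta_start to beta_end over T steps
    return [beta_start + (beta_end - beta_start) * t / (T - 1)
            for t in range(T)]

def cosine_alpha_bars(T, s=0.008):
    # cosine schedule expressed via the cumulative signal level
    # alpha_bar(t); noise is added slowly at first, preserving signal longer
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

betas = linear_betas(1000)
abars = cosine_alpha_bars(1000)
print(betas[0], betas[-1])   # ramps from 1e-4 up to 0.02
print(abars[0], abars[-1])   # signal level falls from 1 toward 0
```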

DDPM vs DDIM Sampling

Different algorithms for the reverse process trade off speed vs quality.

DDPM (50 steps)


DDIM (10 steps)


DDPM: Stochastic, higher quality. DDIM: Deterministic, faster, same seed = same output.

Classifier-Free Guidance

Guidance scale controls how strongly the model follows the class label. Higher = more recognizable but less diverse.

ε_guided = ε_uncond + s · (ε_cond - ε_uncond)

Low guidance (1-2): Diverse, fuzzy. High guidance (7-10): Sharp, less variety.
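The guidance formula is a one-liner in code (a sketch; `eps_uncond` and `eps_cond` stand for the model's unconditional and conditional noise predictions):

```python
import torch

def guided_noise(eps_uncond, eps_cond, scale):
    # eps_guided = eps_uncond + s * (eps_cond - eps_uncond)
    # scale = 1 recovers the conditional prediction;
    # scale > 1 extrapolates further toward the condition.
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = torch.zeros(4)  # toy stand-in predictions
eps_c = torch.ones(4)
print(guided_noise(eps_u, eps_c, 1.0))  # equals eps_c
print(guided_noise(eps_u, eps_c, 3.0))  # pushed past eps_c
```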

Transformer Explainer

Understand how transformers process language

What is a Transformer?

Transformers are neural network architectures that revolutionized NLP. Unlike RNNs that process sequences step-by-step, transformers process all tokens in parallel using self-attention.

Key Innovation: Attention

"Attention Is All You Need" (2017) showed that attention mechanisms alone, without recurrence, can achieve state-of-the-art results.

  • Parallel processing of sequences
  • Direct connections between any tokens
  • Learnable relationships

Applications

  • GPT - Text generation
  • BERT - Understanding
  • T5 - Text-to-text
  • Vision Transformers - Images

Interactive Tokenizer

Type text below to see how it gets split into tokens. GPT-2 uses Byte Pair Encoding (BPE).

Tokenization breaks text into subword units. Common words become single tokens, while rare words are split into pieces. This allows handling any text with a fixed vocabulary.
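Real BPE merges byte pairs learned from a corpus; as a rough illustration of the "common words whole, rare words in pieces" behavior, here is a toy greedy longest-match tokenizer with a hypothetical vocabulary (not GPT-2's actual algorithm):

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match subword split over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest vocabulary entry starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it as-is
            i += 1
    return tokens

vocab = {"trans", "form", "er", "s", " ", "rock"}
print(greedy_tokenize("transformers rock", vocab))
```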

Token Embeddings

Each token ID maps to a learned 768-dimensional vector that captures its meaning.

+

Position Embeddings

Since attention is position-agnostic, we add position information to each token.

=

Combined Embedding

The final embedding combines semantic meaning with positional information.

Embeddings convert discrete tokens into continuous vectors that the network can process. Position embeddings tell the model where each token appears in the sequence.
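A sketch of the token-plus-position sum in PyTorch (toy sizes here; GPT-2 itself uses a 50257-token vocabulary and 768 dimensions):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 32, 768  # toy vocab, GPT-2 width
tok_emb = nn.Embedding(vocab_size, d_model)   # one vector per token ID
pos_emb = nn.Embedding(max_len, d_model)      # one vector per position

token_ids = torch.tensor([[5, 9]])            # hypothetical IDs, shape [1, 2]
positions = torch.arange(token_ids.size(1))   # [0, 1]

# combined embedding: semantic meaning + where the token appears
x = tok_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 2, 768])
```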

Self-Attention Mechanism

Query (Q)

What am I looking for?

×

Key (K)

What do I contain?

Scores

Attention weights

×

Value (V)

What to retrieve?

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Self-attention allows each token to "attend" to all other tokens. Query and Key dot products determine attention weights, which are used to aggregate Value vectors.

Multi-Head Attention

Instead of one attention mechanism, transformers use multiple "heads" in parallel, each learning different relationships.

Head 1: Subject-verb relationships
Head 2: Adjective-noun pairs
Head 3: Coreference (pronouns)

Multiple attention heads allow the model to jointly attend to information from different representation subspaces. GPT-2 uses 12 heads per layer.
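In practice the heads share one tensor: the channel dimension is split into heads rather than running separate modules. A shape-only sketch (using this page's 4 heads × 16 dims):

```python
import torch

B, T, C = 2, 5, 64        # batch, sequence length, model dim
n_heads = 4
head_dim = C // n_heads   # 16 dims per head

x = torch.randn(B, T, C)  # stand-in for a projected Q (or K, or V)

# split channels into heads, then move heads ahead of the time axis:
# [B, T, C] -> [B, T, heads, head_dim] -> [B, heads, T, head_dim]
q = x.view(B, T, n_heads, head_dim).transpose(1, 2)
print(q.shape)  # torch.Size([2, 4, 5, 16])

# after attention, the heads are merged back into one vector per token:
merged = q.transpose(1, 2).contiguous().view(B, T, C)
print(merged.shape)  # torch.Size([2, 5, 64])
```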

Autoregressive Generation

GPT generates text one token at a time, using all previous tokens as context.

Sampling Strategies

Greedy: Always pick highest probability
Temperature: Scale logits before softmax
Top-k: Sample from top k tokens
Nucleus (Top-p): Sample from smallest set summing to p
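A sketch combining temperature and top-k (the function name and logits are illustrative):

```python
import torch

def sample_next(logits, temperature=1.0, top_k=None):
    # Temperature scales logits before softmax: <1 sharpens, >1 flattens.
    logits = logits / temperature
    if top_k is not None:
        # keep only the k highest logits; mask the rest to -inf
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])
token = sample_next(logits, temperature=0.8, top_k=2)
print(token)  # always 0 or 1: only the top-2 logits survive the mask
```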

PicoGPT Architecture

Embeddings: Token + Position
Transformer Block: Multi-Head Self-Attention (4 heads × 16 dims), Feed-Forward Network (64 → 256 → 64)
Output: Top-5 next-token predictions
Model: PicoGPT (2 layers, 4 heads, 64 dim)

PyTorch Code Reference

Code snippets for each architecture

Neural Net (MLP)

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim=2, hidden=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

CNN (Convolutional)

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),  # [B,1,28,28] → [B,32,28,28]
            nn.ReLU(),
            nn.MaxPool2d(2),                  # → [B,32,14,14]
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                  # → [B,64,7,7]
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                     # → [B,3136]
            nn.Linear(64*7*7, 128),
            nn.ReLU(),
            nn.Linear(128, 10)               # 10 digit classes
        )

    def forward(self, x):
        return self.fc(self.conv(x))

Transformer (Attention)

import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, head_dim)
        self.key   = nn.Linear(embed_dim, head_dim)
        self.value = nn.Linear(embed_dim, head_dim)

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)  # [B,T,head_dim]
        k = self.key(x)
        v = self.value(x)

        # Attention scores: Q @ K^T / sqrt(head_dim)
        att = (q @ k.transpose(-2,-1)) * q.size(-1)**-0.5
        att = F.softmax(att, dim=-1)

        return att @ v  # Weighted sum of values

GNN (Message Passing)

class GCNLayer(nn.Module):
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.linear = nn.Linear(in_feats, out_feats)

    def forward(self, x, adj):
        # x: node features [N, F]
        # adj: adjacency matrix [N, N]

        # 1. Aggregate neighbor features
        agg = adj @ x  # message passing

        # 2. Transform
        out = self.linear(agg)

        return F.relu(out)

Diffusion (Denoising)

# Forward: add noise (closed form for any t);
# alpha_bar[t] is the cumulative product of (1 - beta) up to step t
def q_sample(x0, t, noise):
    return sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * noise

# Reverse: predict & remove noise (one DDPM step)
def p_sample(model, xt, t):
    pred_noise = model(xt, t)  # UNet predicts noise
    mean = (xt - beta[t] / sqrt(1 - alpha_bar[t]) * pred_noise) / sqrt(alpha[t])
    return mean + sigma[t] * torch.randn_like(xt)

Training Loop

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in dataloader:
        pred = model(x)
        loss = loss_fn(pred, y)

        optimizer.zero_grad()
        loss.backward()      # compute gradients
        optimizer.step()     # update weights

Example Configurations

Quick-start presets for different learning scenarios

XOR Problem

Neural Net

Classic non-linearly separable problem. Shows why hidden layers are needed - a single layer cannot solve XOR, but adding one hidden layer with a few neurons solves it easily.

2 inputs → 4 neurons → 1 output | Activation: tanh
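A minimal training sketch of this preset (hyperparameters here are illustrative, not the playground's exact settings):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# XOR: not linearly separable, so a hidden layer is required
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(
    nn.Linear(2, 4), nn.Tanh(),   # the hidden layer makes XOR solvable
    nn.Linear(4, 1), nn.Sigmoid()
)
opt = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

for _ in range(500):
    loss = loss_fn(model(X), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

pred = (model(X) > 0.5).float()  # predicted classes after training
print(loss.item(), pred.flatten())
```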

Spiral Classification

Neural Net

A challenging dataset with two interleaved spirals. Requires deeper networks and more neurons. Try adding additional input features like x² or sin(x) to help!

2+ inputs → 8→8 neurons → 1 output | Use feature engineering

MNIST Digit Recognition

CNN

Train a CNN to recognize handwritten digits (0-9). The default architecture achieves ~98% accuracy. Use the Explain button after training to visualize what the network learned.

28x28 input → Conv→Pool→Conv→Pool→Dense → 10 classes

Text Generation

Transformer

Generate text with PicoGPT, a tiny transformer trained on Shakespeare. Watch the attention patterns as each new token is predicted based on context.

Vocab: 65 chars | 2 layers | 4 attention heads | 64 dimensions

Image Generation

Diffusion

Watch a diffusion model progressively denoise random noise into a coherent image. Select different patterns and observe the step-by-step denoising process.

8x8 grid | 20 denoising steps | UNet-style architecture