Understand how your CNN processes images
Draw a digit (0-9) using your mouse or touch
Draw your own digit to see how the CNN classifies it in real-time. The model processes your drawing as a 28x28 grayscale image, just like MNIST training data.
These are the learned filters (kernels) of the selected Conv2D layer. Each filter detects specific patterns in the input. Blue/red colors show positive/negative weights.
Feature maps (activations) show how each filter responds to the input image. Bright areas indicate strong activations: places where the filter found its target pattern.
The Flatten layer converts the 2D feature maps into a 1D vector, which is required before passing data to Dense (fully connected) layers. Explicit spatial structure is discarded, but because the values are unrolled in a fixed (row-major) order, the following layers can still learn position-dependent weights.
Softmax converts raw network outputs (logits) into probabilities that sum to 1. Large positive values become high probabilities, while negative values become low probabilities. This is how the network expresses confidence in each class.
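The softmax computation can be sketched in a few lines of plain Python (a minimal illustration of the math, not the network's actual implementation):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The large positive logit dominates; the negative one gets little mass
probs = softmax([2.0, 1.0, -1.0])
```

The outputs always sum to 1, so they can be read directly as the network's confidence in each class.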
The saliency map highlights which parts of the input image most strongly influence the model's prediction. Warmer colors indicate higher importance.
Watch how a convolution filter slides over the input image, computing element-wise products at each position to produce the output feature map.
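The sliding-window computation can be written out explicitly. This is a minimal sketch with illustrative values (no padding, stride 1); the kernel shown is a simple vertical-edge detector, not one of the model's learned filters:

```python
def conv2d(image, kernel):
    # Slide the kernel over the image, summing element-wise
    # products at each position (no padding, stride 1)
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kH + 1):
        row = []
        for j in range(W - kW + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kH) for dj in range(kW))
            row.append(s)
        out.append(row)
    return out

# A tiny image with a dark-to-bright vertical edge
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[1, -1],
          [1, -1]]
feature_map = conv2d(image, kernel)
```

The output feature map responds only where the kernel straddles the edge; uniform regions produce zero.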
Attention-based architecture for sequence modeling
Learn how diffusion models generate images step by step
Diffusion models learn to reverse the process of adding noise to data. Imagine watching ink spread in water, then playing it backwards - the model learns to "gather" the ink back into its original shape.
Diffusion models learn to reverse the natural process of adding noise. Given a noisy image, they predict what noise to subtract.
The forward process gradually adds Gaussian noise to an image until it becomes pure noise. This is a fixed process - we don't learn it.
q(x_t | x_{t-1}) = N(√(1-β_t) x_{t-1}, β_t I)
The neural network learns to predict the noise that was added at each step. By subtracting this predicted noise, we gradually recover the original image.
p_θ(x_{t-1} | x_t) = N(μ_θ(x_t, t), Σ_θ(x_t, t))
Key: The network predicts ε (the noise), not the clean image directly.
The denoising network is a U-Net: an encoder-decoder with skip connections. It takes a noisy image and a timestep as input and outputs the predicted noise.
The noise schedule β_t controls how quickly noise is added. Different schedules affect generation quality.
Linear: Simple ramp from β_start to β_end. Cosine: Slower start, preserves more signal longer.
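The two schedules can be sketched directly. This is a minimal illustration; the β range shown is the common DDPM default, and the cosine form follows the usual ᾱ_t parameterization:

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    # Simple ramp from beta_start to beta_end
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cosine_alpha_bars(T, s=0.008):
    # Cumulative signal level ᾱ_t under the cosine schedule:
    # starts near 1 and decays slowly, preserving signal longer
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

betas = linear_betas(1000)
alpha_bars = cosine_alpha_bars(1000)
```

Plotting `alpha_bars` against the cumulative product implied by `betas` shows the difference: the cosine curve stays high for longer before falling to zero.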
Different algorithms for the reverse process trade off speed vs quality.
DDPM: Stochastic, higher quality. DDIM: Deterministic, faster, same seed = same output.
Guidance scale controls how strongly the model follows the class label. Higher = more recognizable but less diverse.
ε_guided = ε_uncond + s · (ε_cond - ε_uncond)
Low guidance (1-2): Diverse, fuzzy. High guidance (7-10): Sharp, less variety.
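The guidance formula is a simple extrapolation between two noise predictions. A minimal sketch, operating on plain lists standing in for the ε tensors:

```python
def guided_noise(eps_uncond, eps_cond, scale):
    # Push the prediction past the unconditional estimate,
    # in the direction the class label suggests
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# scale=1 reproduces the conditional prediction; larger scales extrapolate
eps = guided_noise([1.0, 0.0], [2.0, 1.0], scale=3.0)
```

At scale 1 the result equals the conditional prediction; higher scales exaggerate whatever makes the conditional prediction differ from the unconditional one, which is why samples become sharper but less diverse.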
Understand how transformers process language
Transformers are neural network architectures that revolutionized NLP. Unlike RNNs that process sequences step-by-step, transformers process all tokens in parallel using self-attention.
"Attention Is All You Need" (2017) showed that attention mechanisms alone, without recurrence, can achieve state-of-the-art results.
Type text below to see how it gets split into tokens. GPT-2 uses Byte Pair Encoding (BPE).
Tokenization breaks text into subword units. Common words become single tokens, while rare words are split into pieces. This allows handling any text with a fixed vocabulary.
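The effect of subword splitting can be illustrated with a greedy longest-match tokenizer over a toy vocabulary. This is a simplified stand-in: real GPT-2 BPE learns merge rules from byte pairs rather than matching against a fixed word list:

```python
def tokenize(text, vocab):
    # Greedily take the longest vocabulary entry at each position;
    # unknown characters fall back to single-character tokens
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"trans", "form", "er", "the", " "}
tokens = tokenize("the transformer", vocab)
```

"the" survives as one token while the rarer "transformer" splits into pieces, which is how a fixed vocabulary can cover arbitrary text.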
Embeddings convert discrete tokens into continuous vectors that the network can process. Position embeddings tell the model where each token appears in the sequence.
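In PyTorch this is two lookup tables whose outputs are summed. A minimal sketch using GPT-2 small's vocabulary size and embedding width (the short sequence length here is just for illustration):

```python
import torch
import torch.nn as nn

vocab_size, block_size, d_model = 50257, 8, 768  # GPT-2 small sizes
tok_emb = nn.Embedding(vocab_size, d_model)      # one vector per token id
pos_emb = nn.Embedding(block_size, d_model)      # one vector per position

ids = torch.randint(0, vocab_size, (1, block_size))   # token ids [B, T]
x = tok_emb(ids) + pos_emb(torch.arange(block_size))  # [B, T, d_model]
```

Without the position term, the model would see the same representation for "dog bites man" and "man bites dog".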
What am I looking for?
What do I contain?
Attention weights
What to retrieve?
Self-attention allows each token to "attend" to all other tokens. Query and Key dot products determine attention weights, which are used to aggregate Value vectors.
Instead of one attention mechanism, transformers use multiple "heads" in parallel, each learning different relationships.
Multiple attention heads allow the model to jointly attend to information from different representation subspaces. GPT-2 uses 12 heads per layer.
GPT generates text one token at a time, using all previous tokens as context.
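The generation loop can be sketched as follows, assuming `model` maps token ids `[B, T]` to logits `[B, T, vocab]`:

```python
import torch

@torch.no_grad()
def generate(model, ids, max_new_tokens):
    # Autoregressive loop: predict a distribution over the next token
    # from the full context, sample one token, append, repeat
    for _ in range(max_new_tokens):
        logits = model(ids)                           # [B, T, vocab]
        probs = torch.softmax(logits[:, -1], dim=-1)  # last position only
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids
```

Each iteration re-runs the model on the growing context, which is why generation cost grows with sequence length.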
Code snippets for each architecture
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim=2, hidden=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),  # [B,1,28,28] → [B,32,28,28]
            nn.ReLU(),
            nn.MaxPool2d(2),                 # → [B,32,14,14]
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # → [B,64,7,7]
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                    # → [B,3136]
            nn.Linear(64*7*7, 128),
            nn.ReLU(),
            nn.Linear(128, 10)               # 10 digit classes
        )

    def forward(self, x):
        return self.fc(self.conv(x))
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, head_dim)
        self.key = nn.Linear(embed_dim, head_dim)
        self.value = nn.Linear(embed_dim, head_dim)
        self.head_dim = head_dim

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)  # [B,T,head_dim]
        k = self.key(x)
        v = self.value(x)
        # Attention scores: Q @ K^T / sqrt(head_dim)
        att = (q @ k.transpose(-2, -1)) * self.head_dim**-0.5
        att = F.softmax(att, dim=-1)
        return att @ v  # weighted sum of values
class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x, adj):
        # x: node features [N, F]; adj: adjacency matrix [N, N]
        agg = adj @ x            # 1. aggregate neighbor features (message passing)
        out = self.linear(agg)   # 2. transform
        return F.relu(out)
# Schedules: α_t = 1 - β_t, ᾱ_t = ∏_{s≤t} α_s (precomputed)

# Forward: add noise in closed form
def q_sample(x0, t, noise):
    return alpha_bar[t]**0.5 * x0 + (1 - alpha_bar[t])**0.5 * noise

# Reverse: predict & remove noise (one DDPM step)
def p_sample(model, xt, t):
    pred_noise = model(xt, t)  # U-Net predicts the noise ε
    mean = (xt - beta[t] / (1 - alpha_bar[t])**0.5 * pred_noise) / alpha[t]**0.5
    return mean + sigma[t] * torch.randn_like(xt)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in dataloader:
        pred = model(x)
        loss = loss_fn(pred, y)
        optimizer.zero_grad()
        loss.backward()   # compute gradients
        optimizer.step()  # update weights
Quick-start presets for different learning scenarios
Classic non-linearly separable problem. Shows why hidden layers are needed - a single layer cannot solve XOR, but adding one hidden layer with a few neurons solves it easily.
A challenging dataset with two interleaved spirals. Requires deeper networks and more neurons. Try adding additional input features like x² or sin(x) to help!
Train a CNN to recognize handwritten digits (0-9). The default architecture achieves ~98% accuracy. Use the Explain button after training to visualize what the network learned.
Generate text with PicoGPT, a tiny transformer trained on Shakespeare. Watch the attention patterns as each new token is predicted based on context.
Watch a diffusion model progressively denoise random noise into a coherent image. Select different patterns and observe the step-by-step denoising process.