It’s 11 PM and Your Terminal Just Hit 97% Accuracy
Your eyes are burning. Three cups of chai, cold now. And on the screen, blinking in that default terminal font, a number you weren’t expecting to see tonight: Test Accuracy: 0.9748.
You built this. Not copied from a tutorial video with some guy’s face in the corner. Not cloned from a GitHub repo with 14,000 stars. You wrote a neural network, from scratch, in Python, using TensorFlow — and it just recognized handwritten digits better than most humans can read their own doctor’s handwriting.
That feeling? That’s why people get obsessed with deep learning.
I’m going to walk you through exactly how to get there. By the end of this tutorial, you’ll have a working neural network trained on the MNIST dataset, and you’ll actually understand what every single line does. Not vaguely. Not “oh yeah, I think that’s the activation function.” Actually understand it.
Fair warning: once you ship your first model, you won’t stop. I didn’t.
What We’re Building (and Why MNIST Is Still Worth Your Time)
So here’s the plan. We’re going to build a feedforward neural network that looks at pictures of handwritten digits — the numbers 0 through 9 — and figures out which number each one is. Sounds simple, right?
It is. And that’s the point.
MNIST is a dataset of 70,000 grayscale images, each one 28 by 28 pixels. Sixty thousand for training, ten thousand for testing. Every image is a single handwritten digit. People have been using it as a machine learning benchmark since the late 1990s, and yeah, some folks on Stack Overflow will tell you it’s “too easy” or “solved.” They’re not wrong, exactly. But they’re missing the bigger picture.
MNIST is where you learn the workflow. Load data. Preprocess. Build. Compile. Train. Evaluate. Save. That pipeline stays the same whether you’re classifying digits or detecting tumors in MRI scans. The dataset changes; the steps don’t.
Tip: If someone tells you to skip MNIST and go straight to CIFAR-10 or ImageNet, ignore them. You don’t learn to drive in a Formula 1 car. Get the fundamentals down first. MNIST trains in under a minute on a CPU — that fast feedback loop is exactly what you need when you’re learning.
And honestly? There’s something satisfying about watching a model you built from nothing correctly identify a sloppy “7” that even you’d squint at. Probably just me, but I doubt it.
Getting Your Environment Ready
Before we write a single line of model code, we need TensorFlow installed. And we need it installed properly, which means using a virtual environment. Don’t skip this. I’ve seen people install TensorFlow globally, break their system Python, and then spend three hours on Stack Overflow trying to figure out why pip stopped working.
# Create a virtual environment
python -m venv tf_env
# Activate it (Linux/Mac)
source tf_env/bin/activate
# Activate it (Windows)
tf_env\Scripts\activate
# Install TensorFlow plus the array and plotting libraries
pip install tensorflow numpy matplotlib
Quick sidebar on versions: you want TensorFlow 2.16 or later for this tutorial. Older versions will probably work, but 2.16 is the release where tf.keras switched to Keras 3, so some imports and saving behavior differ in earlier releases, and debugging import errors is a terrible way to spend your evening. Just use the latest.
Once everything’s installed, run this to make sure things are working:
import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {len(tf.config.list_physical_devices('GPU')) > 0}")
If you see a version number, you’re good. The GPU line will probably say False unless you’ve gone through the whole CUDA/cuDNN setup process. Don’t worry about it. MNIST is small enough that CPU training takes maybe 30 to 60 seconds. GPU acceleration matters when you’re training ResNet on ImageNet for 12 hours, not here.
Common Mistake: The “Which TensorFlow” Problem
People new to machine learning often get confused by the ecosystem. There’s tensorflow, tensorflow-gpu (deprecated now — the main package handles both), tf-nightly, and then there’s the whole keras vs tf.keras split that existed for years before they unified it. Right now, in 2026, you just pip install tensorflow and you get everything. Keras is included. NumPy comes as a dependency. The only extra thing you need for this tutorial is matplotlib, for visualization.
| Package | What It Does | Do You Need It? |
|---|---|---|
| tensorflow | ML framework with Keras built in | Yes |
| numpy | Array operations (installed with TF) | Already there |
| matplotlib | Plotting and visualization | Install separately |
| tensorflow-gpu | Legacy GPU package | No (merged into main) |
| keras | Standalone Keras | No (use tf.keras) |
Loading the MNIST Dataset
Here’s where things start getting fun. TensorFlow bundles MNIST right inside the library, so loading it is literally one function call. No downloading ZIP files. No hunting for CSV links on Kaggle. One line.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Inspect the shapes
print(f"Training set: {x_train.shape}, Labels: {y_train.shape}")
print(f"Test set: {x_test.shape}, Labels: {y_test.shape}")
print(f"Pixel value range: {x_train.min()} to {x_train.max()}")
print(f"Label examples: {y_train[:10]}")
Running that gives you 60,000 training images and 10,000 test images. Each image comes as a 28×28 NumPy array where every pixel is an integer between 0 (black) and 255 (white). Labels are just the integers 0 through 9 — nothing fancy.
Let’s actually look at some of these digits, because staring at shape tuples only tells you so much:
# Visualize a few samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(x_train[i], cmap='gray')
    ax.set_title(f"Label: {y_train[i]}", fontsize=12)
    ax.axis('off')
plt.suptitle("Sample MNIST Images", fontsize=14)
plt.tight_layout()
plt.savefig("mnist_samples.png", dpi=100)
plt.show()
When you look at the output, you’ll notice something immediately: these aren’t clean, computer-generated digits. They’re messy. Some 4s look like 9s. Some 1s are basically vertical slashes. Some 7s have that little cross-stroke, some don’t. That messiness is exactly why this problem is interesting — a simple rule-based system can’t handle it, but a neural network can learn to.
Tip: Always visualize your data before building a model. I cannot stress this enough. I once spent two hours debugging a model that was performing terribly, only to realize the images had been loaded upside-down. Five seconds of matplotlib would’ve caught that instantly.
Preprocessing: Making the Data Neural-Network-Friendly
Raw pixel values range from 0 to 255. Neural networks don’t love that. They work much better when inputs are small — ideally between 0 and 1. Why? Because of how gradients flow during backpropagation. Large input values mean large activations, which mean large gradients, which mean your optimizer is basically stumbling around in the dark with a flashlight taped to a jackhammer. Normalization fixes this.
We divide every pixel by 255.0. Simple.
# Normalize pixel values to [0, 1]
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
# Verify normalization
print(f"After normalization: min={x_train.min()}, max={x_train.max()}")
# Split off a validation set from training data
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]
print(f"Training: {x_train.shape[0]} samples")
print(f"Validation: {x_val.shape[0]} samples")
print(f"Test: {x_test.shape[0]} samples")
Notice we’re also carving out a validation set here — 10,000 images pulled from the end of the training data. That leaves us with 50,000 for training, 10,000 for validation, and 10,000 for testing. Three separate pools.
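The slice above takes the last 10,000 images exactly as they come. MNIST isn’t sorted by label, so that works fine here, but for a dataset stored in some sorted order you’d want to shuffle before splitting. Here’s a rough sketch of that idea using small stand-in arrays. The helper name shuffled_split and the seed value are my own choices for illustration, not anything TensorFlow provides.

```python
import numpy as np

def shuffled_split(x, y, val_size, seed=42):
    """Shuffle x and y with the same permutation, then hold out val_size samples."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    return (x[:-val_size], y[:-val_size]), (x[-val_size:], y[-val_size:])

# Stand-in data: 100 fake "images" with unique labels (not real MNIST)
x_demo = np.arange(100 * 28 * 28, dtype="float32").reshape(100, 28, 28)
y_demo = np.arange(100)

(x_tr, y_tr), (x_va, y_va) = shuffled_split(x_demo, y_demo, val_size=20)
print(x_tr.shape, x_va.shape)
```

Because images and labels are indexed with the same permutation, each image keeps its own label, and no sample ends up in both pools.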
Wait, Why Three Splits?
Good question. And from what I’ve seen, this trips up beginners more than almost anything else.
- Training set — what the model learns from. Weights get updated based on these examples.
- Validation set — what you check during training to see if the model is overfitting. The model never trains on this data, but you do use it to make decisions (like when to stop training).
- Test set — the final exam. You touch this exactly once, at the very end. It gives you an unbiased estimate of how the model will perform on data it’s truly never seen.
If you’ve ever seen someone report 99.9% accuracy and then their model falls apart in production, it’s probably because they accidentally leaked test data into training. Or they tuned hyperparameters against the test set so many times that they effectively memorized it. Keep these boundaries clean.
Building the Neural Network Architecture
Alright. Here’s the part that everyone’s been waiting for. Let’s build the actual model.
We’re using TensorFlow’s Keras API, specifically the Sequential class. Think of it like stacking LEGO bricks — each layer goes on top of the previous one, data flows in at the bottom and out at the top. No branching, no skip connections, no weird graph topology. Just a straight pipeline.
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout
model = Sequential([
    # Input: 28x28 grayscale images (an explicit Input avoids the
    # input_shape warning that Keras 3 raises on layers)
    Input(shape=(28, 28)),
    # Flatten 28x28 images to 784-element vectors
    Flatten(),
    # First hidden layer: 128 neurons, ReLU activation
    Dense(128, activation='relu'),
    # Dropout for regularization (20% of neurons randomly deactivated)
    Dropout(0.2),
    # Second hidden layer: 64 neurons, ReLU activation
    Dense(64, activation='relu'),
    # Dropout again
    Dropout(0.2),
    # Output layer: 10 neurons (one per digit), softmax for probabilities
    Dense(10, activation='softmax')
])
# Print the model architecture
model.summary()
When you call model.summary(), you’ll see roughly 109,000 trainable parameters. That might sound like a lot, but it’s tiny by modern standards: GPT-4 is rumored to have something like 1.8 trillion (OpenAI hasn’t published an official figure). Our little model is a rounding error by comparison.
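If you want to check that parameter count by hand, the arithmetic is short: each Dense layer has one weight per input-to-neuron connection plus one bias per neuron, and Flatten and Dropout have no parameters at all. A small sketch (the helper name dense_params is mine):

```python
def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

# 784 flattened pixels, then the three Dense layers from the model above
layer_sizes = [784, 128, 64, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [100480, 8256, 650]
print(total)      # 109386
```

Notice that the first hidden layer alone accounts for more than 90% of the parameters. That’s the cost of connecting every pixel to every neuron.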
Let me break down each layer because these matter.
Flatten
Takes each 28×28 image and unrolls it into a single vector of 784 numbers. Like taking a sheet of graph paper and stretching it into a ribbon. The network needs a flat input — it can’t reason about 2D spatial structure (that’s what convolutional networks are for, and we’ll get there in a future post).
Dense (128 neurons, ReLU)
A “dense” layer means every input connects to every neuron. All 784 pixels wire into all 128 neurons. ReLU activation just means: if the output is negative, make it zero. If it’s positive, keep it. Sounds too simple to work, but it works anyway. ReLU went a long way toward taming the vanishing gradient problem that made deep networks so hard to train with sigmoid and tanh activations.
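To make that concrete, here’s ReLU and its slope written out in NumPy. This is only an illustration of the math, not how TensorFlow implements it internally, and the function names are mine.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: negatives become 0, positives pass through."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Slope of ReLU: 1 where the input is positive, 0 elsewhere."""
    return (z > 0).astype("float32")

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z).tolist())       # [0.0, 0.0, 0.0, 1.5, 3.0]
print(relu_grad(z).tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

The slope is exactly 1 for every positive input, so gradients passing through active neurons aren’t shrunk the way they are by a sigmoid’s flat tails.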
Dropout (0.2)
During training, randomly turns off 20% of neurons each forward pass. Seems counterproductive, right? But it forces the network to develop redundant pathways — no single neuron can become a “crutch.” It’s a regularization technique, and it’s shockingly effective at preventing overfitting. At test time, dropout turns off and all neurons participate.
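One detail worth seeing in code: Keras’s Dropout layer scales the surviving activations up by 1 / (1 - rate) during training, so the expected total stays the same and nothing needs adjusting at test time. Below is a minimal NumPy sketch of that “inverted dropout” idea. The function is my own illustration, not the library’s implementation.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of values, rescale the rest."""
    if not training:
        return activations  # inference: every neuron participates, no scaling
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10_000)  # stand-in activations, all equal to 1

train_out = dropout(a, rate=0.2, training=True, rng=rng)
test_out = dropout(a, rate=0.2, training=False, rng=rng)

print(f"fraction zeroed in training: {np.mean(train_out == 0):.3f}")
print(f"mean during training:        {train_out.mean():.3f}")
print(f"mean at inference:           {test_out.mean():.3f}")
```

Roughly 20% of the values get zeroed, yet the mean stays close to 1 in both modes, which is the whole point of the rescaling.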
Dense (64 neurons, ReLU)
Same idea as the first hidden layer, but narrower. We’re compressing the representation. From 784 dimensions to 128, now down to 64. Each layer extracts higher-level features from the previous layer’s output.
Dense (10 neurons, Softmax)
The output layer. Ten neurons for ten possible digits. Softmax squishes the outputs so they all add up to 1.0, turning them into a probability distribution. The model might say: “I’m 94% sure this is a 7, 3% it’s a 1, 2% it’s a 9, and everything else is near zero.” You take the highest probability as the prediction.
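Here’s what softmax does to ten raw output scores, sketched in NumPy. The scores are made up for illustration, and subtracting the maximum before exponentiating is a common trick to avoid overflow that doesn’t change the result.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # numerical stability; result unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up raw scores for digits 0 through 9
logits = np.array([1.0, 3.5, 0.2, 0.1, 0.0, -1.0, 0.3, 6.0, 0.5, 2.0])
probs = softmax(logits)

print(probs.round(3))
print(int(np.argmax(probs)))  # 7
```

The largest score (index 7) ends up with most of the probability mass, and np.argmax picks it as the predicted digit.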
Tip: Why 128 and 64 neurons specifically? Honestly, there’s no deep mathematical reason for those exact numbers. They’re common starting points. Powers of 2 are slightly more efficient on GPUs, and these sizes work well empirically for MNIST. You could try 256 and 128, or 64 and 32 — I think you’d see similar results, within a percent or two. Hyperparameter tuning is its own rabbit hole.
How Does This Compare to Alternatives?
| Architecture | MNIST Accuracy | Parameters | Complexity | Training Time |
|---|---|---|---|---|
| Our model (Dense + Dropout) | ~97.5% | ~109K | Beginner | ~30-60s CPU |
| Simple CNN (Conv2D + MaxPool) | ~99.0% | ~93K | Intermediate | ~2-3 min CPU |
| LeNet-5 (classic) | ~99.1% | ~60K | Intermediate | ~2-3 min CPU |
| ResNet (overkill for digits) | ~99.5% | ~11M | Advanced | ~15+ min GPU |
| Logistic Regression (no NN) | ~92% | ~7.8K | Beginner | Seconds |
Our approach sits in a sweet spot. Good accuracy, simple code, fast training. Perfect for learning. Once you’re comfortable with this, upgrading to a CNN for that extra ~1.5% accuracy is a natural next step.
Compiling the Model
Before training can happen, you need to tell TensorFlow three things: how to measure error, how to reduce it, and what metric to display while it works.
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
Three lines. But each one represents a mountain of research.
Adam optimizer. Short for “Adaptive Moment Estimation.” It adapts the learning rate for each parameter individually based on past gradients. Before Adam came along (published in 2014 — seems like forever ago now), people spent hours tuning learning rates manually. Adam mostly handles that for you. Not perfectly — sometimes SGD with a carefully tuned schedule outperforms it — but for 90% of use cases, Adam is where you start.
Sparse categorical crossentropy. Ugly name. Simple concept. It measures how far off the model’s predicted probabilities are from the true labels. “Sparse” means our labels are integers (like 3, 7, 0) rather than one-hot encoded vectors. If your labels were one-hot, you’d use categorical_crossentropy instead. Same math, different input format.
Accuracy metric. Just the percentage of predictions that are correct. During training, TensorFlow will show you this after each epoch so you can watch the model improve in real time. It’s weirdly addictive.
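The “same math, different input format” point about the two crossentropy losses is easy to verify by hand. Below is a simplified NumPy sketch that computes the per-sample loss both ways. It skips the numerical safeguards a real library applies, and the function names just mirror the Keras loss names for readability.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """labels: integer class ids; probs: (n, classes) predicted probabilities."""
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return -np.log(picked)

def categorical_crossentropy(one_hot, probs):
    """one_hot: (n, classes) rows with a single 1 at the true class."""
    return -np.sum(one_hot * np.log(probs), axis=1)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 2])      # integer labels
one_hot = np.eye(3)[labels]    # the same labels, one-hot encoded

print(sparse_categorical_crossentropy(labels, probs))
print(categorical_crossentropy(one_hot, probs))
```

Both lines print the same two numbers: the negative log of the probability the model assigned to the correct class.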
Training: Where the Magic Actually Happens
Ready? Deep breath. This is the part where the model actually learns.
# Train the model
history = model.fit(
    x_train, y_train,
    epochs=15,
    batch_size=32,
    validation_data=(x_val, y_val),
    verbose=1
)
An epoch means one complete pass through all 50,000 training images. We’re doing 15 of them. Batch size of 32 means the model looks at 32 images at a time, computes the average error across those 32, and updates its weights. Then the next 32. Then the next. That’s roughly 1,563 batches per epoch.
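The batch arithmetic is worth doing once yourself, since 50,000 doesn’t divide evenly by 32 and the last batch of each epoch is smaller than the rest:

```python
import math

train_samples = 50_000
batch_size = 32
epochs = 15

steps_per_epoch = math.ceil(train_samples / batch_size)
last_batch = train_samples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)           # 1563
print(last_batch)                # 16
print(steps_per_epoch * epochs)  # 23445
```

So over 15 epochs the weights get updated a little over 23,000 times.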
Watch your terminal as it runs. You’ll see something like:
Epoch 1/15 - loss: 0.4512 - accuracy: 0.8671 - val_loss: 0.1423 - val_accuracy: 0.9578
That first epoch already gets you to about 96% validation accuracy. By epoch 5 or 6, you’ll probably be above 97%. The later epochs squeeze out smaller and smaller improvements — diminishing returns. If the validation loss starts increasing while training loss keeps decreasing, that’s overfitting. Your model is memorizing training data instead of learning patterns. Our dropout layers help prevent this, but it’s worth watching for.
Let’s plot the training history so you can see the learning curve:
# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.plot(history.history['loss'], label='Training Loss')
ax1.plot(history.history['val_loss'], label='Validation Loss')
ax1.set_title('Model Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()
ax2.plot(history.history['accuracy'], label='Training Accuracy')
ax2.plot(history.history['val_accuracy'], label='Validation Accuracy')
ax2.set_title('Model Accuracy')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.legend()
plt.tight_layout()
plt.savefig("training_history.png", dpi=100)
plt.show()
A healthy training curve shows both lines (training and validation) improving together and then leveling off. If there’s a big gap between them, the model is overfitting. If both are flat from the start, the model isn’t learning at all (check your learning rate, or there might be a bug in your data pipeline).
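If you’d rather read numbers than squint at a plot, the same checks can be done on the history dictionary directly. Here’s a small sketch using made-up values shaped like history.history. The helper summarize_history is hypothetical, and how large a gap counts as “big” is a judgment call.

```python
def summarize_history(hist):
    """Return the best epoch (1-based, by val_loss) and the final accuracy gap."""
    val_loss = hist["val_loss"]
    best_epoch = val_loss.index(min(val_loss)) + 1
    gap = hist["accuracy"][-1] - hist["val_accuracy"][-1]
    return best_epoch, gap

# Made-up numbers for illustration, not real training output
fake_history = {
    "loss": [0.45, 0.20, 0.12, 0.08, 0.05],
    "val_loss": [0.15, 0.10, 0.09, 0.10, 0.12],
    "accuracy": [0.87, 0.94, 0.96, 0.975, 0.985],
    "val_accuracy": [0.955, 0.968, 0.972, 0.971, 0.969],
}

best_epoch, gap = summarize_history(fake_history)
print(best_epoch)     # 3
print(round(gap, 3))  # 0.016
```

In this fake run, validation loss bottoms out at epoch 3 and climbs afterward while training loss keeps falling, which is the overfitting pattern described above.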
Common Mistake: Training Too Long
Beginners sometimes crank epochs to 100 or 200, thinking more training = better model. Nope. After a certain point, you’re just overfitting. For MNIST with this architecture, 10-15 epochs is plenty. I’ve seen people train for 50 epochs and end up with worse test accuracy than stopping at 12. If you want to be rigorous about it, look into early stopping callbacks — TensorFlow can automatically halt training when validation loss stops improving.
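Here’s roughly what early stopping looks like with the built-in tf.keras.callbacks.EarlyStopping callback. To keep the sketch runnable on its own, it trains a tiny throwaway model on random stand-in data rather than real MNIST. The patience value of 3 and the demo model are assumptions for illustration, so tune them for your own runs.

```python
import numpy as np
import tensorflow as tf

# Stop once val_loss has not improved for 3 epochs; keep the best weights seen
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

# Tiny random stand-in data so the sketch runs by itself (not real MNIST)
rng = np.random.default_rng(0)
x_demo = rng.random((512, 28, 28)).astype("float32")
y_demo = rng.integers(0, 10, size=512)

demo_model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
demo_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history = demo_model.fit(
    x_demo, y_demo,
    epochs=50,              # upper bound; the callback can stop sooner
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=0,
)
print(f"Stopped after {len(history.history['val_loss'])} epochs")
```

In the tutorial’s own training call, the equivalent change would be passing callbacks=[early_stop] to model.fit alongside the existing validation_data argument.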
Evaluating the Model: The Final Exam
Training’s done. Now we bring out the test set — the 10,000 images the model has genuinely never encountered.
# Evaluate on the test set
test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
# Make predictions on individual images
predictions = model.predict(x_test[:5])
for i in range(5):
    predicted_label = np.argmax(predictions[i])
    actual_label = y_test[i]
    confidence = predictions[i][predicted_label] * 100
    print(f"Image {i}: Predicted={predicted_label}, "
          f"Actual={actual_label}, Confidence={confidence:.1f}%")
# Save the model for later use
model.save("mnist_model.keras")
print("Model saved to mnist_model.keras")
# Load and verify
loaded_model = tf.keras.models.load_model("mnist_model.keras")
loaded_loss, loaded_acc = loaded_model.evaluate(x_test, y_test, verbose=0)
print(f"Loaded model accuracy: {loaded_acc:.4f}")
You should see test accuracy above 97%. Maybe 97.3%, maybe 97.8%. It varies slightly between runs because of random weight initialization and dropout randomness. But consistently above 97%.
Look at those confidence scores. For clear, well-written digits, the model will output 99%+ confidence. Where it gets interesting is the edge cases — a 4 that could be a 9, an 8 that’s basically a blob. On those, you might see confidences in the 60-80% range. The model knows what it doesn’t know (sort of).
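To hunt for those edge cases yourself, you can filter predictions by their top probability. A minimal sketch, using a hand-written array in place of real model.predict output. The helper name and the 0.8 threshold are arbitrary choices of mine.

```python
import numpy as np

def low_confidence(predictions, threshold=0.8):
    """Indices of rows whose highest probability falls below the threshold."""
    top_prob = predictions.max(axis=1)
    return np.flatnonzero(top_prob < threshold)

# Hand-written stand-in for model.predict output (3 images, 10 classes)
fake_predictions = np.array([
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.99, 0.00, 0.01],
    [0.00, 0.00, 0.00, 0.00, 0.62, 0.00, 0.00, 0.00, 0.00, 0.38],
    [0.01, 0.00, 0.97, 0.01, 0.00, 0.00, 0.00, 0.00, 0.01, 0.00],
])

uncertain = low_confidence(fake_predictions)
print(uncertain.tolist())  # [1]
```

Only the second row gets flagged: the stand-in model is torn between a 4 and a 9. Plotting the flagged test images with matplotlib is a good way to see what the network finds ambiguous.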
And notice we’re saving the model to mnist_model.keras. That file contains everything: architecture, trained weights, optimizer state. You can load it tomorrow, next week, on a different machine, and make predictions without retraining. Ship it to a server, wrap it in a Flask API, deploy it as a microservice. Your model just became portable.
What Actually Happens Inside a Neural Network?
I glossed over the theory deliberately. You needed to see it work first — that builds motivation. But now that you have a trained model, let’s peek under the hood for a minute.
When a 28×28 image enters the network, the Flatten layer stretches it into 784 numbers. Each of the 128 neurons in the first dense layer computes a weighted sum of all 784 inputs, adds a bias term, and passes the result through ReLU. Mathematically: output = max(0, weights · inputs + bias), where weights · inputs is a dot product (multiply each input by its own weight, then add everything up). That’s it. Every neuron does that same operation.
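Written out in NumPy, the first hidden layer’s forward pass is a single matrix multiplication. The weights here are random stand-ins rather than trained values, so the outputs are meaningless. The point is the shapes and the operation.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(784)                    # one flattened stand-in image
W1 = rng.normal(0.0, 0.05, (784, 128)) # one column of weights per neuron
b1 = np.zeros(128)                     # one bias per neuron

# All 128 neurons at once: weighted sums, plus bias, through ReLU
hidden = np.maximum(0.0, x @ W1 + b1)

# The same thing for neuron 0 alone, spelled out
neuron_0 = max(0.0, float(np.dot(W1[:, 0], x) + b1[0]))

print(hidden.shape)  # (128,)
print(bool(np.isclose(hidden[0], neuron_0)))  # True
```

The layer-wide version and the single-neuron version agree, which is all a Dense layer is: the same small computation repeated for each neuron, batched into one matrix product.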
But here’s what’s wild. During training, backpropagation adjusts those weights — all 109,000 of them — so that the network’s output matches the correct labels. Nobody programs the network to look for curves in a “6” or straight lines in a “1.” It figures that out on its own. The first layer might learn to detect edges. The second layer combines edges into strokes and loops. By the output layer, it can distinguish digits.
At least, that’s the intuition. Whether that’s exactly what happens is… hard to say. Neural network interpretability is still an active area of research. We know they work. We have theories about why they work. But the full picture remains fuzzy, and anyone who tells you otherwise is probably oversimplifying.
Things That Can Go Wrong (and How to Fix Them)
Your first run might not be clean. Here are the problems I’ve hit and the ones I see most often when people post on Stack Overflow.
- Import errors with Keras: If you see ModuleNotFoundError: No module named 'keras', you’re probably importing the wrong way. Use from tensorflow.keras import ..., not from keras import .... The standalone keras package and tf.keras have diverged and converged multiple times over the years. Stick with the tf.keras path.
- Model accuracy stuck around 10%: Random chance for 10 classes is 10%, so if your model isn’t learning, something’s fundamentally wrong. Usually it’s a normalization issue (forgetting to divide by 255) or a shape mismatch. Check your data shapes.
- Out of memory: Unlikely with MNIST, but if you’re on an old laptop with 4 GB RAM, reduce the batch size to 16 or even 8.
- Validation accuracy much lower than training accuracy: Overfitting. Increase dropout, reduce model size, or add data augmentation.
- Model trains but model.save() fails: Make sure you’re using the .keras extension, not .h5. The old H5 format still works but the new native format is recommended these days.
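For the “stuck around 10%” case in particular, a quick sanity check on your arrays before training can save a lot of head-scratching. Here’s one way to sketch it. The function check_inputs and its messages are hypothetical, and it only catches the obvious mistakes mentioned above.

```python
import numpy as np

def check_inputs(x, y, num_classes=10):
    """Return a list of readable problems; an empty list means nothing obvious."""
    problems = []
    if x.ndim != 3 or x.shape[1:] != (28, 28):
        problems.append(f"expected shape (N, 28, 28), got {x.shape}")
    if x.max() > 1.0:
        problems.append(f"pixel max is {x.max()}; did you divide by 255?")
    if len(x) != len(y):
        problems.append(f"{len(x)} images but {len(y)} labels")
    if y.min() < 0 or y.max() >= num_classes:
        problems.append("labels fall outside 0..num_classes-1")
    return problems

# Stand-in arrays: four all-white "images" that were never normalized
raw = np.full((4, 28, 28), 255, dtype="uint8")
labels = np.array([3, 7, 0, 9])

print(check_inputs(raw, labels))
print(check_inputs(raw.astype("float32") / 255.0, labels))  # []
```

The first call reports the missing normalization, and the second comes back clean. Running something like this on x_train and y_train right before model.fit takes a fraction of a second.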
Where to Go After This: Your Neural Network Roadmap
You’ve got a working model. Congrats — seriously. Most people who say they want to learn deep learning never actually train a model. You did.
So what’s next?
Convolutional neural networks. CNNs. Instead of flattening the image and throwing away all spatial information, CNNs slide small filters across the image and detect patterns at different locations. That’s how you jump from 97.5% to 99%+ on MNIST. And it’s how nearly all modern image recognition works — from the face detection on your phone to Tesla’s self-driving cameras.
Beyond that, the deep learning world is moving fast right now. Transformers, originally built for natural language processing, have started showing up in computer vision (look up Vision Transformers, or ViT). Diffusion models are generating images that are basically indistinguishable from photographs. And these days, large language models can write code, answer questions, and (I think somewhat controversially) pass medical licensing exams.
But all of it — every GPT, every Stable Diffusion, every AlphaFold — builds on the same foundation you just learned: layers, activations, loss functions, backpropagation, optimization. Those concepts don’t go away. They just stack higher.
TensorFlow isn’t the only game in town, either. PyTorch has taken over a huge chunk of the research community, and if you want to read the latest papers, you’ll probably need to know it. JAX is gaining traction for high-performance numerical computing. But for production deployment, TensorFlow still has a strong ecosystem — TFLite for mobile, TF Serving for APIs, TensorFlow.js for the browser.
My suggestion? Pick one project that excites you. Maybe train a CNN on the Fashion-MNIST dataset (same format, but classifying clothing items instead of digits). Maybe build a sentiment classifier on movie reviews using an LSTM. Maybe try your hand at a GAN that generates fake digits. Whatever sounds fun. Because that excitement is what keeps you coding at 11 PM, staring at a terminal, watching accuracy tick upward one epoch at a time.
That feeling doesn’t get old. From what I’ve seen, it only gets better.