🧠 AI · Machine Learning
📅 March 2026 · ⏱ 12 min · 🟢 Beginner-friendly

How AI Thinks: Neural Networks, Training & Inference

ChatGPT, Midjourney, and AlphaFold all work on the same basic principle: feed data into a network of billions of numbers (parameters), adjust those numbers until the network produces useful outputs, then use the trained network to generate new text, images, or predictions. Here's what actually happens inside.

1. The Artificial Neuron

An artificial neuron takes several numbers as input, multiplies each by a weight (how important is this input?), adds them up, adds a bias term, and passes the result through an activation function (to add non-linearity).

output = activation( w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + bias )

Common activation functions:
  ReLU: max(0, x) (most popular: simple, fast)
  Sigmoid: 1/(1 + e⁻ˣ) (squeezes output to 0–1)
  Tanh: (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) (output −1 to 1)

A single neuron is just a weighted sum + nonlinearity. The magic comes from connecting millions of them together.
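The neuron formula can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the input values, weights, and bias are made up for the example.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical example values, chosen only for illustration.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]
bias = 0.5

# Weighted sum: 0.5*1 - 1.0*2 + 0.25*3 + 0.5 = -0.25
relu_out = neuron(inputs, weights, bias)                # ReLU clips the negative sum to 0.0
sigmoid_out = neuron(inputs, weights, bias, sigmoid)    # a value between 0 and 1
tanh_out = neuron(inputs, weights, bias, math.tanh)     # a value between -1 and 1
```

The same weighted sum gives three different outputs depending on the activation, which is the only non-linear part of the neuron.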

The weights and biases are the network's parameters: the numbers it learns during training. GPT-4 is rumored to have about 1.8 trillion parameters, though OpenAI has not published the figure. Each parameter is just a floating-point number, typically stored in 16 bits (2 bytes).
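To get a feel for the scale, here is a rough back-of-the-envelope calculation of how much storage the parameters alone need at 2 bytes each. The helper name `model_size_gb` is invented for this sketch, and the figures ignore everything except the raw weights.

```python
BYTES_PER_PARAM_FP16 = 2  # 16 bits per parameter, as stated above

def model_size_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Storage for the parameters alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

gpt3_gb = model_size_gb(175_000_000_000)          # GPT-3: 175 billion parameters
large_gb = model_size_gb(1_800_000_000_000)       # the 1.8 trillion figure quoted above

print(f"GPT-3 weights: {gpt3_gb:.0f} GB")         # GPT-3 weights: 350 GB
print(f"1.8T-parameter model: {large_gb:.0f} GB") # 1.8T-parameter model: 3600 GB
```

Numbers like these are why large models are split across many GPUs: no single accelerator has that much memory.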

2. Layers & Depth

Neurons are arranged in layers: an input layer, one or more hidden layers, and an output layer. A "deep" neural network has many hidden layers stacked on top of each other.

Why depth matters: A shallow network (1 hidden layer) can theoretically approximate any function, but it might need an astronomically wide layer. Deep networks are more efficient — they compose simple transformations into complex ones, reusing intermediate features. This is why "deep learning" works: depth enables compositionality.
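Stacking layers simply means feeding one layer's outputs into the next. The sketch below builds a tiny network with random, untrained weights; the layer sizes and input values are arbitrary choices for illustration.

```python
import random

def relu(x):
    return max(0.0, x)

def identity(x):
    return x

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for a layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
layer_sizes = [4, 8, 8, 2]  # hypothetical: 4 inputs, two hidden layers of 8, 2 outputs
layers = [random_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]

def forward(x, layers):
    """Pass the input through every layer in turn; each layer reuses the previous features."""
    for i, (weights, biases) in enumerate(layers):
        is_last = i == len(layers) - 1
        x = dense(x, weights, biases, identity if is_last else relu)
    return x

output = forward([0.2, -0.5, 1.0, 0.7], layers)
num_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
print(num_params)  # 130 weights and biases in total
```

Even this toy network has 130 parameters; real models follow the same pattern with far wider and far more numerous layers.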

3. Training: Learning from Data

Training is the process of finding good values for all the parameters (weights and biases). It works by:

  1. Forward pass: Feed a training example through the network. Get an output.
  2. Loss calculation: Compare the output to the correct answer. Compute a loss number (how wrong was it?). For text: how surprised was the model by the correct next word?
  3. Backpropagation: Calculate how much each parameter contributed to the error. This uses the chain rule of calculus to compute gradients, which measure how the loss changes as each parameter changes. The gradient points toward higher loss, so each parameter should move the opposite way.
  4. Update: Adjust each parameter slightly in the direction that reduces the loss (gradient descent). Learning rate controls how big each step is.
  5. Repeat: Millions of times across the entire dataset. One pass through the dataset = one epoch. GPT-3 saw ~300 billion tokens during training.
Gradient descent update rule:
  w_new = w_old − learning_rate × ∂Loss/∂w

Training GPT-3:
  Parameters: 175 billion
  Training data: 300 billion tokens (mostly from ~570 GB of filtered Common Crawl text)
  Compute: ~3,640 petaflop/s-days
  Cost: ~$4.6 million (third-party estimate)
  Hardware: a large cluster of NVIDIA V100 GPUs
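The training loop can be sketched on a problem small enough to follow by hand: a single neuron with no activation learning the line y = 2x + 1. The dataset, learning rate, and epoch count are hypothetical choices for this illustration, and the gradients are worked out manually instead of by an automatic backpropagation library.

```python
# Hypothetical toy dataset drawn from y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

w, b = 0.0, 0.0          # parameters start at arbitrary values
learning_rate = 0.05
loss_history = []

for epoch in range(200):                 # 5. repeat: one epoch = one pass over the data
    epoch_loss = 0.0
    for x, target in data:
        pred = w * x + b                 # 1. forward pass
        error = pred - target
        epoch_loss += error ** 2         # 2. loss: squared error
        grad_w = 2 * error * x           # 3. chain rule: dLoss/dw
        grad_b = 2 * error               #    chain rule: dLoss/db
        w -= learning_rate * grad_w      # 4. update: step against the gradient
        b -= learning_rate * grad_b
    loss_history.append(epoch_loss / len(data))

print(round(w, 3), round(b, 3))          # close to 2.0 and 1.0
```

Training a large model follows the same five steps; the difference is that the gradients for billions of parameters are computed automatically and the work is spread across many GPUs.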

4. Transformers & Attention

The Transformer architecture (Vaswani et al., 2017, "Attention Is All You Need") underlies most modern AI systems. GPT and BERT are Transformers outright, while Stable Diffusion and AlphaFold 2 build its attention mechanism into their own architectures.

The key innovation is self-attention: each element in the input (each word, each image patch) computes how much it should "attend to" every other element. This allows the model to capture long-range dependencies — a word at position 500 can directly reference a word at position 1.
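Here is a minimal sketch of self-attention in plain Python. It makes one big simplification, stated loudly: queries, keys, and values are all the raw input vectors, whereas a real Transformer first multiplies the input by three learned weight matrices. The three token vectors are invented for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with queries = keys = values = X (a simplification)."""
    d = len(X[0])
    outputs, all_weights = [], []
    for q in X:
        # How strongly does this element match every element (including itself)?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # New representation: attention-weighted average of all elements.
        out = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 4-dimensional vectors for three tokens.
tokens = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 2.0, 0.0, 2.0],
          [1.0, 1.0, 1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Every token computes a weight for every other token in one step, no matter how far apart they are, which is what lets position 500 look directly at position 1.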

5. Large Language Models

An LLM (like GPT-4, Claude, Gemini, Llama) is a Transformer trained to predict the next token (a word or word fragment). That is its core pretraining objective; chat assistants are then fine-tuned to follow instructions. Nearly everything else, including answering questions, writing code, and translating languages, emerges from this simple task performed at enormous scale.
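Next-token prediction can be illustrated with a toy example. The four-word vocabulary and the scores below are invented; a real model produces one score (logit) for each of tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["mat", "dog", "moon", "sofa"]   # hypothetical 4-token vocabulary
logits = [3.0, 0.5, -1.0, 2.0]           # hypothetical scores for "The cat sat on the ..."

probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]   # most likely next token

def next_token_loss(probs, vocab, correct_token):
    """Cross-entropy: the negative log of the probability given to the correct token."""
    return -math.log(probs[vocab.index(correct_token)])

loss_if_mat = next_token_loss(probs, vocab, "mat")    # small: the model expected this
loss_if_moon = next_token_loss(probs, vocab, "moon")  # large: the model is "surprised"
```

Training lowers this loss across a huge number of examples, which pushes probability toward the tokens that actually follow in real text.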

Does the AI "understand"? This is debated. LLMs demonstrably learn syntax, semantics, factual knowledge, and reasoning patterns, and some studies report that they pass theory-of-mind tests, all from predicting text. Whether this constitutes "understanding" in a philosophical sense is an open question. What's clear: they capture statistical patterns in language at a depth far beyond any previous technology.

6. Image Generation (Diffusion)

Diffusion models (Stable Diffusion, DALL-E 3, Midjourney) work by learning to reverse noise:

  1. Forward process: Take a real image and gradually add Gaussian noise over many steps until it becomes pure static.
  2. Training: Train a neural network to undo the noise at each step. Given a noisy image and the noise level, the network typically estimates the noise that was added, which can then be subtracted to get a cleaner image.
  3. Generation: Start with pure random noise. Apply the denoising network step by step. Each step removes a little noise, gradually revealing a coherent image. The text prompt guides the denoising direction (via cross-attention with a text encoder like CLIP).
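The forward (noising) process is simple enough to sketch directly. This example assumes a DDPM-style linear noise schedule with commonly used default values; the article does not say which schedule any particular product uses, and the 4-pixel "image" is invented. The denoising network itself is not shown.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal remaining after each step (assumed linear beta schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Jump straight to a given noise level: blend the image with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
          for x, n in zip(x0, noise)]
    return xt, noise

rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)

image = [0.8, -0.3, 0.5, 0.1]   # hypothetical 4-pixel "image"
slightly_noisy, _ = add_noise(image, alpha_bars[0], rng)          # almost unchanged
pure_static, target_noise = add_noise(image, alpha_bars[-1], rng) # almost entirely noise
```

During training, the network would be shown a noisy image such as `pure_static` together with its step number and asked to recover `target_noise`. Generation runs the process in reverse, starting from random noise.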

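The model does not store or look up training images. It learns the statistical distribution of images and generates new samples from that distribution, so a generated image is normally novel, composed from learned patterns (textures, shapes, compositions). The exception is memorization: researchers have extracted near-copies of images that appeared many times in the training data.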

7. Limitations & Misunderstandings