A Transformer-based Large Language Model normally has a single, frozen set of weights for its neural network. The following are systems that switch weights:
1. Mixture of Experts (MoE)
The Concept: Instead of one giant neural network where every neuron is used for every word, the model is broken into many smaller "expert" sub-networks.
How it works: A "gating network" looks at the input (e.g., the word "python") and decides which experts to activate. It might route that word to a "coding expert" set of weights and a "logic expert" set of weights, while ignoring the "creative writing" weights.
Why it's used: It allows models to have trillions of parameters (weights) but only use a small fraction of them for any single token. This makes them smarter but much faster and cheaper to run.
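Here is a minimal sketch of that routing step in PyTorch. The layer sizes, expert count, and top-2 routing below are illustrative assumptions, not the configuration of any real model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy Mixture-of-Experts layer: a gating network picks a few experts per token."""
    def __init__(self, d_model=64, d_hidden=128, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is a small feed-forward sub-network with its own weights.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The gating network scores every expert for every token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                        # x: (batch, seq, d_model)
        scores = self.gate(x)                    # (batch, seq, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (chosen[..., slot] == e)  # which tokens were routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 5, 64)        # 2 sequences of 5 tokens
print(ToyMoELayer()(tokens).shape)    # torch.Size([2, 5, 64])
```

Only the chosen experts run for each token, which is where the compute savings come from.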
2. LoRA and Adapters (Task-Specific Weight Swapping)
The Concept: Imagine you have a frozen base model. You can attach small, separate "adapter" modules (tiny sets of weights) that are trained for specific purposes.
How it works:
LoRA (Low-Rank Adaptation): You freeze the massive main network. If you want the model to write like Shakespeare, you load a tiny "Shakespeare" file (maybe 100MB) that sits on top of the main model.
Hot-Swapping: You can literally swap these adapters in and out instantly. In a single system, one user could be using the "Medical Diagnosis" weights while another user is using the "Fantasy RPG" weights, both sharing the same frozen base brain.
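A rough sketch of the LoRA idea in PyTorch. The rank, scaling, and the load_adapter helper are hypothetical simplifications (production code would typically use a library such as Hugging Face PEFT):

```python
import torch
import torch.nn as nn

class LinearWithLoRA(nn.Module):
    """A frozen linear layer plus a swappable low-rank adapter: y = Wx + scale * (B A) x."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # the big base weights stay frozen
        self.scale = alpha / rank
        # The adapter is just two tiny matrices; only these are trained per task.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

    def load_adapter(self, state):
        # "Hot-swapping": load a different task's adapter without touching the base weights.
        self.lora_A.data.copy_(state["lora_A"])
        self.lora_B.data.copy_(state["lora_B"])

layer = LinearWithLoRA(nn.Linear(64, 64))
shakespeare_adapter = {"lora_A": torch.randn(8, 64) * 0.01, "lora_B": torch.randn(64, 8) * 0.01}
layer.load_adapter(shakespeare_adapter)       # swap in the "Shakespeare" weights
print(layer(torch.randn(1, 64)).shape)        # torch.Size([1, 64])
```

Because the adapter is only two small matrices per layer, the file on disk stays tiny compared to the base model.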
3. Hypernetworks or HyperNets (The "Network that Writes Networks")
The Concept: You have two neural networks. Network A (the Hypernetwork) takes an input and outputs the weights for Network B.
How it works: Network B doesn't actually exist until Network A creates it. If you show Network A a picture of a cat, it might generate a set of weights for Network B that are perfectly tuned to detect cats. If you show it a dog, it rewrites Network B to detect dogs.
Current State: This is computationally expensive and tricky to train, so it's not yet standard in large LLMs, but it is used in image generation and smaller experimental models.
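A minimal sketch of the pattern, with all sizes and the cat-embedding input invented for illustration: Network A produces a flat vector that is reshaped into the weight matrix and bias of Network B, which is then applied to the data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperNetwork(nn.Module):
    """Network A: maps a conditioning input to the weights of a small Network B."""
    def __init__(self, cond_dim=16, b_in=32, b_out=10):
        super().__init__()
        self.b_in, self.b_out = b_in, b_out
        # Output enough numbers to fill Network B's weight matrix and bias.
        self.generator = nn.Sequential(
            nn.Linear(cond_dim, 128), nn.ReLU(),
            nn.Linear(128, b_out * b_in + b_out),
        )

    def forward(self, cond, x):
        flat = self.generator(cond)                          # generated parameters for Network B
        W = flat[: self.b_out * self.b_in].view(self.b_out, self.b_in)
        b = flat[self.b_out * self.b_in:]
        # Network B "exists" only now: its weights were just generated from `cond`.
        return F.linear(x, W, b)

hyper = HyperNetwork()
cat_embedding = torch.randn(16)               # conditioning input (e.g. an image embedding)
features = torch.randn(4, 32)                 # data for Network B to process
print(hyper(cat_embedding, features).shape)   # torch.Size([4, 10])
```

Change the conditioning input and Network B's weights change with it, which is the whole point of the technique.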
4. Fast Weights (Short-Term Memory)
The Concept: Standard weights represent "long-term memory" (what the model learned during training). "Fast weights" are temporary weights that change rapidly during a conversation to store "short-term memory."
The connection to Transformers: Modern Transformers (the architecture behind LLMs) actually use a mechanism called Attention that behaves mathematically very similarly to fast weights. When the model looks at a sentence, it dynamically calculates "attention scores" (temporary weights) that determine how much one word relates to another. In a sense, the model is re-wiring itself for every single sentence it reads.
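A toy single-head attention function makes the analogy concrete. The fixed projection matrices play the role of slow, long-term weights, while the softmaxed score matrix is recomputed for every input, like fast weights (the single-head, no-mask setup is a simplification):

```python
import torch
import torch.nn.functional as F

def single_head_attention(x, Wq, Wk, Wv):
    """Slow weights (Wq, Wk, Wv) are fixed; the attention matrix is re-derived per input."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # These scores are the "fast weights": a fresh (seq x seq) weighting for this sentence only.
    scores = F.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return scores @ V, scores

d = 32
x = torch.randn(6, d)                               # 6 tokens in one sentence
Wq, Wk, Wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
out, fast_weights = single_head_attention(x, Wq, Wk, Wv)
print(out.shape, fast_weights.shape)                # torch.Size([6, 32]) torch.Size([6, 6])
```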
Summarized:
We are moving away from "monolithic" frozen digital brains toward modular, dynamic systems.
MoE switches weights per token.
Adapters switch weights per task.
Hypernetworks generate weights per input.
This is why the current era is most often named after the first of these techniques: "The MoE Era."
Generated by Gemini 3 Pro
