Beginner's Guide
Welcome to the absolute master guide for understanding every single setting, slider, feature tag, and color-coded badge inside our AI Hardware Calculator.
Artificial Intelligence operates on massive numerical thresholds. To make our platform accessible, we compress complex parameters into beautiful badges. In this documentation, we peel back the curtain: we show you exactly what every badge looks like on the site, what it means, and the exact numbers that trigger it.
Hardware Setup
Video RAM (VRAM) vs System RAM
System RAM is the slow filing cabinet inside your computer (e.g., 32GB DDR5). VRAM is the ultra-fast, dedicated super-memory physically bolted onto your graphics card (e.g., 24GB on an RTX 4090). To run an AI model fast, its entire "brain" must fit squarely inside your VRAM. If it overflows into your System RAM, the model slows down to a crawl (known as CPU Offloading).
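This fit rule can be sketched as a one-line check. A minimal sketch, assuming a ~1GB runtime overhead; the function name is ours, not the calculator's:

```python
def fits_in_vram(model_gb: float, vram_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if the model's weights plus a small runtime overhead
    fit entirely inside dedicated GPU memory (no CPU offloading)."""
    return model_gb + overhead_gb <= vram_gb

print(fits_in_vram(19.0, 24.0))  # ~19GB model on a 24GB card -> True
print(fits_in_vram(40.0, 24.0))  # overflows into System RAM -> False
```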
Multi-GPU Arrays
If you own multiple graphics cards (e.g., 2× RTX 3090s), you effectively double your VRAM (24GB + 24GB = 48GB). However, slicing an AI's brain across two separate cards introduces a PCIe Bottleneck penalty. Because the two cards must constantly whisper over the motherboard, the overall Speed (Tokens/Second) is penalized by about 15% to 30%.
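The pooled-VRAM-with-penalty idea can be sketched as follows; the flat 20% penalty is an assumed midpoint of the 15-30% range cited above, and the function name is illustrative:

```python
def multi_gpu_estimate(vram_per_card_gb: float, cards: int,
                       single_card_tps: float, penalty: float = 0.20) -> tuple:
    """Pool VRAM across cards and apply a flat PCIe penalty to speed.
    penalty=0.20 is an assumed midpoint of the 15-30% range."""
    total_vram = vram_per_card_gb * cards
    tps = single_card_tps * (1 - penalty) if cards > 1 else single_card_tps
    return total_vram, tps

print(multi_gpu_estimate(24, 2, 30))  # 2x RTX 3090 -> (48, 24.0)
```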
Apple Silicon (Unified Memory)
Apple computers (M1, M2, M3, M4 Max/Ultra) do not use separate VRAM. Instead, they use Unified Memory, meaning their colossal 64GB or 128GB of System RAM acts entirely like super-fast VRAM. This is why Macs are the undisputed kings of running massive AI models on a laptop. If you select a Mac on our site, the "System RAM" slider is completely disabled to reflect this architectural superpower.
Models & Features
AI models are labeled with visual tags representing their architecture. Below are the badges you will see attached to models on the dashboard:
- Parameter Count (e.g., 8B): The size of the brain (8 Billion connections). 8B is excellent for laptops. 70B+ requires heavy servers.
- Reasoning: These models are explicitly trained to output a "chain of thought", thinking step-by-step before answering.
- Mixture of Experts (MoE): Instead of firing all neurons at once, an MoE model routes your question to a specific tiny "expert" sub-brain.
- Tool Use: The capability to hook securely into external APIs to browse the web or trigger functions.
- Coding: These models achieve elite scores on SWE-Bench and can autonomously write and debug software.
- Math: Highly specialized datasets allow the model to solve complex algebra, calculus, and logical proofs.
- Vision: Multimodal architecture that allows the AI to natively "see" and interpret uploaded images, charts, and video frames.
Quantization & Compression
Quantization is the act of surgically compressing the neural parameters so they physically fit inside your GPU.
- FP16: Uncompressed Perfection. Massive file size. Highest possible quality. Used almost exclusively in enterprise environments where massive data-center GPUs are available.
- Q8_0: Virtually Lossless. Cuts the file size roughly in half. Human reviewers cannot perceive a difference in reasoning quality compared to FP16. Highly recommended.
- Q6_K: The Sweet Spot. Trims the file to roughly 40% of its FP16 size. Arguably the best overall balance of VRAM usage, inference speed, and retained intelligence.
- Q4_K: Aggressive Squeeze. Shrinks the model by roughly 75%. Perfect for running huge 70B models on a single 24GB card. Math capability degrades slightly.
- Q3_K: Extreme Compression. Files are now ~25% of their original size. Substantial logic degradation begins to occur. Recommended only as a last resort for local hardware.
- Q2_K: Minimum Viable. The model is barely functional for complex reasoning but fits on almost any device. Use only for simple chat tasks where VRAM is critically scarce.
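The percentages above all follow from one rule of thumb: file size is roughly parameters times bits per weight. A minimal sketch (real quantized files carry a little extra metadata, so treat the result as a floor):

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough file size: parameters x bits per weight / 8 bits per byte.
    Real quantized formats add metadata, so this is a lower bound."""
    return params_billion * bits_per_weight / 8

print(weights_gb(70, 4))   # 70B at Q4  -> ~35 GB
print(weights_gb(8, 16))   # 8B at FP16 -> ~16 GB
```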
Performance & Speed
AI performance isn't just about fitting the model in memory; it's also about how memory usage grows during a conversation and how fast tokens are generated.
The KV Cache (Context)
The "Context Window" is the model's short-term memory. As your conversation grows, the KV Cache expands, consuming more VRAM every few thousand words.
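The linear growth can be made concrete. A sketch of the standard KV Cache size formula, using an assumed Llama-3-8B-style shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache); these shape numbers are illustrative, not pulled from the calculator:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV Cache grows linearly with context: two tensors (K and V)
    x layers x KV heads x head dimension x dtype bytes, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token / 1024**3

# Llama-3-8B-style shape at a full 32K context:
print(kv_cache_gb(32_768, 32, 8, 128))  # -> 4.0 (GB)
```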
Tokens per Second (TPS)
Writing speed is limited by your Memory Bandwidth (GB/s). The wider the lane, the faster the model can read its 30GB+ brain.
For example, an RTX 4090 hits ~33 tok/s on a 30GB Llama-3 model.
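That figure follows directly from the bandwidth rule: every generated token requires streaming the full weight file through memory once, so dividing bandwidth by model size gives a rough ceiling (1008 GB/s is the RTX 4090's published memory bandwidth):

```python
def max_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on tokens/second: each generated token
    reads the entire weight file from memory once."""
    return bandwidth_gb_s / weights_gb

print(max_tps(1008, 30))  # RTX 4090 on a 30GB model -> 33.6 tok/s ceiling
```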
Quality Scores & Tiers
Our engine parses industry tests (MMLU, SWE-Bench, Chatbot Arena ELO) into five strict graphical badges. Here are the *exact mathematical thresholds* for each tier inside the calculator:
- Excellent
- Great
- Good
- Fair
- Basic: Models that fall below the Fair threshold. Usually obsolete.
Result Categories & Badges
The algorithm buckets model cards into three distinct visual containers. These are the exact headers and error texts you will see dynamically rendered based on the GPU math:
The model's uncompressed weights plus the expected Context Window buffer fit strictly inside your physical GPU VRAM.
Under the Hood (The Math)
Trigger: Weights_GB + (KV_Cache_GB @ 32K) + 1GB_Overhead < Total_VRAM
When this condition is met, inference happens at Maximum Hardware Bandwidth. No slow system memory is used.
Your VRAM is full, forcing the runtime to spill model layers into your System RAM.
Under the Hood (The Math)
Trigger: VRAM_Saturation > 95% AND Weight_Offload < 50%
Tokens will generate at approximately 15-40% of native GPU speed depending on your PCIe lane bandwidth.
The model architecture exceeds your total physical memory capacity (VRAM + System RAM). Local execution is mathematically impossible.
Under the Hood (The Math)
Trigger: Weights_GB + KV_Cache_GB > Total_VRAM + Total_RAM
When this red outline appears, the calculator automatically triggers the Cloud Rental Scanner.
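Putting the three triggers together, a simplified sketch of the bucketing logic; the names are ours, and the middle bucket condenses the saturation/offload-share check into a plain capacity test:

```python
def classify(weights_gb: float, kv_gb: float, vram_gb: float,
             ram_gb: float, overhead_gb: float = 1.0) -> str:
    """Bucket a model into the three result categories (simplified)."""
    if weights_gb + kv_gb + overhead_gb < vram_gb:
        return "fits"             # full hardware bandwidth
    if weights_gb + kv_gb <= vram_gb + ram_gb:
        return "partial-offload"  # spills into System RAM, ~15-40% speed
    return "impossible"           # triggers the Cloud Rental Scanner

print(classify(18, 4, 24, 32))   # -> "fits"
print(classify(35, 4, 24, 32))   # -> "partial-offload"
print(classify(140, 8, 24, 32))  # -> "impossible"
```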
Cheapest Fit Algorithm
The calculator queries rental APIs for data-center GPUs to find the absolute lowest hourly price for a card with enough VRAM (A6000, A100, or H100) to host the model at full 32K context.
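The selection boils down to a filter-then-min over the live price list. A sketch with placeholder prices (the rates below are invented for illustration, not live market data):

```python
# Placeholder price list; real rates come from the rental-market APIs.
GPUS = [
    {"name": "A6000", "vram": 48, "usd_per_hr": 0.49},
    {"name": "A100",  "vram": 80, "usd_per_hr": 1.29},
    {"name": "H100",  "vram": 80, "usd_per_hr": 2.49},
]

def cheapest_fit(required_gb: float):
    """Lowest hourly price among GPUs with enough VRAM to host
    the model plus its full 32K context; None if nothing fits."""
    candidates = [g for g in GPUS if g["vram"] >= required_gb]
    return min(candidates, key=lambda g: g["usd_per_hr"], default=None)

print(cheapest_fit(44)["name"])  # -> "A6000" with this placeholder list
```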
Market Rate Estimate
The blue pill displays the live average market rate on platforms like RunPod or Lambda. This helps you decide if a 44¢ rental is better than a $2,000 upgrade.