
llmfit Guide: Find the Best Local LLMs for Your Hardware


If you’re running local LLMs and you’re frustrated by guessing whether a model will fit in your VRAM, this is the guide you’ve been looking for. I’ve tested countless models myself, and hitting out-of-memory errors after a 10GB download is the worst. llmfit solves this by detecting your exact hardware and scoring hundreds of models to tell you exactly what will run. It does the parameter-count, quantization, and offloading math for you so you don’t have to rely on trial and error.


🔍 What is llmfit?

At its core, llmfit is a powerful terminal-based tool that eliminates the guesswork of running local AI models. Instead of manually cross-referencing parameter counts, quantization levels, and your available system memory, llmfit does the heavy lifting math for you. It was designed from the ground up for developers and AI enthusiasts who need concrete answers on performance before committing time and bandwidth.

It detects your system’s RAM, CPU cores, and GPU VRAM (across NVIDIA, AMD, Apple Silicon, and Intel Arc architectures). Once your hardware is mapped, llmfit queries an embedded database of hundreds of popular open-weights models and compares your specs against them. It scores each model across four critical dimensions: Quality, Speed, Fit, and Context. The result is a highly tailored list of models that you can actually run. Beyond just reporting, you can use its beautiful interactive Terminal UI (TUI) to browse, filter, and even trigger downloads directly to local runtimes like Ollama or llama.cpp.


⚡ Why Use llmfit?

Before you spend hours downloading massive multi-billion parameter models that might crash your system or grind to a halt, here’s why llmfit is an indispensable utility:

  • Automatic Hardware Detection: Instantly maps out your GPU VRAM, system RAM, and CPU without any manual configuration required. It natively supports complex setups like multi-GPU rigs by aggregating the VRAM properly.
  • Dynamic Quantization Math: Figures out the highest quality quantization (ranging from the pristine Q8_0 down to the heavily compressed Q2_K) that will actually fit into your available memory pool.
  • MoE Architecture Aware: Accurately calculates memory for Mixture-of-Experts models (like Mixtral 8x7B or DeepSeek-V3). It understands that only a specific subset of active experts needs VRAM per token, dramatically altering what you thought you could run.
  • Direct Provider Integration: Pulls models straight into your existing Ollama, MLX, or llama.cpp environments directly from the TUI, saving you from copying and pasting tags in the terminal.
  • Hardware Planning & Future-Proofing: Inverts the analysis process—tell llmfit the exact model you want to run (e.g., Llama-3-70B at 8k context), and it will output the exact hardware specifications you need to buy or rent.
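The quantization math behind that second bullet can be sketched in a few lines. To be clear, this is an illustrative sketch, not llmfit’s internals: the bits-per-weight figures are approximate community values for llama.cpp GGUF quants, and the 1.2× runtime overhead factor is my own assumption.

```python
# Approximate bits-per-weight for common GGUF quant levels, ordered from
# highest quality (Q8_0) to most compressed (Q2_K). Values are rough
# community estimates, not llmfit's actual table.
QUANT_BPW = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "Q2_K": 3.35,
}

def best_quant(params_billion: float, budget_gb: float, overhead: float = 1.2):
    """Return the highest-quality quant whose weights fit the memory budget."""
    for name, bpw in QUANT_BPW.items():  # dict preserves insertion order
        weights_gb = params_billion * bpw / 8  # billions of weights * bits -> GB
        if weights_gb * overhead <= budget_gb:
            return name, round(weights_gb, 2)
    return None, None

# e.g. a 7B model against a 6 GB memory budget
print(best_quant(7, 6.0))
```

Running this against a 7B model with 6 GB free lands on Q5_K_M: Q8_0 and Q6_K blow the budget once overhead is counted, which is exactly the kind of trade-off llmfit resolves for you automatically.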

✅ Step 1 – Install llmfit

Installation is straightforward across all major operating systems. The tool is written in Rust, which means it compiles down to a single, lightning-fast binary with minimal external dependencies. Choose the method that best matches your setup below.

|          | Minimum    | Recommended                |
|----------|------------|----------------------------|
| GPU VRAM | 8 GB       | 16 GB+                     |
| RAM      | 16 GB      | 32 GB                      |
| Storage  | 1 GB       | 5 GB (for caching)         |
| OS       | Windows 10 | Windows 11 / macOS / Linux |

🍎 macOS / Linux (Homebrew)

If you use Homebrew, installation is just a single command. This is the recommended method for most macOS users as it handles path updates automatically and makes upgrading simple:

brew install llmfit

Homebrew will download the pre-compiled binary and place it in your path, ready to use immediately.

⚡ Quick Install Script (macOS / Linux)

Alternatively, you can pull the latest release binary directly from GitHub. This is perfect for headless Linux servers, CI/CD pipelines, or users who prefer not to use Homebrew:

curl -fsSL https://llmfit.axjns.dev/install.sh | sh

If you are on a restricted machine where you don’t have sudo access, you can safely install it to your local user directory:

curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local

🪟 Windows (Scoop)

Windows users can install via Scoop. Make sure you have Scoop installed and configured in your PowerShell first:

scoop install llmfit

✅ Step 2 – Launch and Navigate the TUI

Once installed, simply run the command in your terminal to get started. No complex configuration files are needed—llmfit probes your system on the fly:

llmfit

This command launches the interactive Terminal UI. At the top of the screen, you’ll see a clean readout of your detected hardware specs, including your CPU, total RAM, GPU name, and available VRAM. Below that header is a scrollable, data-dense table of models ranked by their composite score. This score blends Quality, Speed, Fit, and Context into a single, easy-to-understand metric.

🕹️ Key Controls to Know:

  • Arrow Keys or j / k: Navigate up and down the model list efficiently.
  • /: Enter search mode. Type to instantly filter by model name, provider, or use case (e.g., “llama 8b” or “coding”). This is the fastest way to find a specific model.
  • f: Cycle through fit filters. This allows you to view All models, only Runnable models, Perfect fits (fits entirely in VRAM), Good fits (some CPU offloading), or Marginal fits (barely runs on CPU).
  • s: Cycle the sort column. You can organize the view by Score, Parameter count, Memory percentage, Context length, Date, or Use Case.
  • t: Cycle through built-in color themes. Options include Dracula, Solarized, Nord, and Monokai. Your choice saves automatically for the next launch.
  • Enter: Toggle a detailed, expanded view of the currently selected model, revealing a deep dive into its quantization and memory breakdown.

✅ Step 3 – Download Models Directly (Ollama Integration)

One of the best, most frictionless features of llmfit is its seamless integration with runtime providers like Ollama. If you have Ollama running in the background (either via the desktop app or ollama serve), llmfit automatically detects it upon launch.

  1. Find a model: Use the search (/) to locate a model you want to run. For example, search for Qwen/Qwen2.5-Coder-7B-Instruct.
  2. Check the fit: Ensure the “Fit” column says “Perfect” or “Good” for your hardware configuration. You don’t want to download something that will be “Too Tight”.
  3. Download: Press d while the model is highlighted. If multiple providers are available on your machine (like llama.cpp and Ollama), a picker menu will appear. Select Ollama.
  4. Watch progress: The row in the TUI will highlight with a real-time progress indicator, showing you the exact download speed and completion percentage.

Once finished, you’ll see a green checkmark (✓) next to the model in the “Inst” column, meaning it is locally cached and ready to be spun up via your Ollama client.
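If you’re curious how this detection works in principle, the pattern is simple: honor the OLLAMA_HOST environment variable (the same one the troubleshooting section below uses), fall back to Ollama’s default local endpoint, and query its REST API. This is a minimal sketch of that pattern, not llmfit’s actual code:

```python
import json
import os
import urllib.request

def ollama_base_url() -> str:
    # Honor OLLAMA_HOST if set, otherwise fall back to the default
    # local endpoint Ollama listens on.
    return os.environ.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")

def installed_models(base_url: str) -> list[str]:
    # Ollama's REST API lists locally cached models at GET /api/tags.
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]

if __name__ == "__main__":
    try:
        print(installed_models(ollama_base_url()))
    except OSError:
        print("Ollama is not reachable -- is `ollama serve` running?")
```

If the request succeeds, Ollama is up and its model list is available; if it fails, a tool like llmfit simply treats the provider as absent.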


✅ Step 4 – Plan Hardware Upgrades (Plan Mode)

What if you want to run a massive 70B parameter model, but you know your current rig can’t handle it? Instead of buying hardware blindly, llmfit has a built-in “Plan Mode” to tell you exactly what you need.

  1. Select a heavy model in the TUI (e.g., Llama-3.1-70B) that you aspire to run.
  2. Press p to open Plan Mode.
  3. You can use your keyboard to edit fields like the desired Context length, specific Quantization (like Q4_K_M), and your Target TPS (tokens per second).

Plan Mode will dynamically output the estimated minimum and recommended VRAM, system RAM, and CPU cores required to hit your targets. It even breaks down the feasibility of running it on GPU only, CPU offload, or CPU only, giving you a precise shopping list or rental spec for your next cloud GPU instance.
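A back-of-envelope version of what Plan Mode estimates looks like this. The architecture numbers are Llama-3-70B-style values (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and the ~4.85 bits/weight figure for Q4_K_M is approximate, so treat this as a sketch of the idea rather than llmfit’s exact model:

```python
def plan_memory_gb(params_billion: float, bits_per_weight: float, context: int,
                   n_layers: int, n_kv_heads: int, head_dim: int,
                   kv_bytes: int = 2) -> tuple[float, float]:
    # Model weights: billions of params * bits per weight, converted to GB.
    weights = params_billion * 1e9 * bits_per_weight / 8
    # KV cache: keys and values (factor of 2) for every layer and KV head,
    # at kv_bytes per element (2 = fp16), for every token of context.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context
    return round(weights / 1e9, 1), round(kv_cache / 1e9, 1)

# Llama-3-70B-class model at Q4_K_M with an 8k context window
weights_gb, kv_gb = plan_memory_gb(70, 4.85, 8192, 80, 8, 128)
print(f"weights ~{weights_gb} GB + KV cache ~{kv_gb} GB at 8k context")
```

The weights alone land above 40 GB before you add the KV cache and runtime overhead, which is why Plan Mode will steer you toward multi-GPU rigs or heavy CPU offload for 70B-class models.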


🧠 How llmfit Scores and Evaluates Hardware

Understanding how llmfit generates its recommendations can help you make better decisions when navigating the CLI or TUI. The tool doesn’t just look at VRAM; it performs a complex multi-dimensional analysis on the fly.

Each model is scored across four specific dimensions (ranging from 0 to 100):

  • Quality: Evaluates the parameter count, the reputation of the model family, the applied quantization penalty, and task alignment.
  • Speed: Estimates tokens/sec based on your specific backend (CUDA, Metal, ROCm, etc.), parameter count, and quantization level.
  • Fit: Measures memory utilization efficiency. The “sweet spot” is typically 50–80% of your available memory, ensuring you have enough overhead for context windows and OS tasks.
  • Context: Compares the model’s context window capability against the target requirements for the specific use case.

These dimensions are then combined into a weighted composite score. The weights vary depending on the use-case category. For instance, Chat models weight “Speed” much higher, while Reasoning models (like DeepSeek-R1) weight “Quality” higher.
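The mechanism is just a weighted average. The weights below are hypothetical (the guide doesn’t publish llmfit’s real per-use-case values); they only illustrate how the same four dimension scores can rank differently by use case:

```python
# Hypothetical per-use-case weights -- illustrative only.
WEIGHTS = {
    "chat":      {"quality": 0.25, "speed": 0.40, "fit": 0.20, "context": 0.15},
    "reasoning": {"quality": 0.45, "speed": 0.15, "fit": 0.20, "context": 0.20},
}

def composite_score(dims: dict[str, float], use_case: str) -> float:
    # Weighted sum of the four 0-100 dimension scores.
    w = WEIGHTS[use_case]
    return round(sum(dims[k] * w[k] for k in w), 1)

dims = {"quality": 90, "speed": 60, "fit": 80, "context": 70}
print(composite_score(dims, "chat"), composite_score(dims, "reasoning"))
```

Note how the identical model scores higher as a reasoning pick than as a chat pick: its quality is strong but its speed is middling, and the weights decide which matters.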

When evaluating speed, llmfit knows that token generation is heavily memory-bandwidth-bound. It references a baked-in lookup table of over 80 GPUs. If it recognizes your card, it uses the actual memory bandwidth to estimate throughput: (bandwidth_GB_s / model_size_GB) × efficiency_factor. The efficiency factor accounts for kernel overhead and KV-cache reads.
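That formula is easy to sanity-check yourself. Here it is with an RTX 4090’s published memory bandwidth (~1008 GB/s) and an assumed 0.7 efficiency factor; llmfit’s own table values may differ:

```python
def estimate_tps(bandwidth_gb_s: float, model_size_gb: float,
                 efficiency: float = 0.7) -> float:
    # Token generation is memory-bandwidth-bound: every generated token
    # re-reads the full set of weights from memory.
    return bandwidth_gb_s / model_size_gb * efficiency

# 7B model at Q4_K_M (~4.2 GB of weights) on an RTX 4090 (~1008 GB/s)
print(round(estimate_tps(1008, 4.2)))
```

That pencils out to roughly 170 tokens/sec, which is in the right ballpark for a quantized 7B model on that class of card.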


🛠️ Troubleshooting Common Issues

Even with robust automatic detection, hardware quirks and driver problems can occasionally cause friction. Here are the most common problems users encounter and exactly how to fix them.

  • GPU not detected or VRAM is wrong
    Cause: Broken nvidia-smi installation, VM passthrough issues, or unsupported proprietary drivers masking hardware details.
    Fix: Use the memory override flag to manually declare VRAM: llmfit --memory=24G

  • Cannot connect to Ollama
    Cause: Ollama is running on a different machine on your local network, running in a Docker container, or bound to a custom port.
    Fix: Set the host environment variable before launching: OLLAMA_HOST="http://ip:port" llmfit

  • Out of Memory during generation
    Cause: The context length was set significantly higher during inference than estimated, pushing memory over the edge.
    Fix: Run llmfit with a strict context cap to get accurate estimates: llmfit --max-context 4096

  • Model download fails
    Cause: Network timeout, a changed provider API, or HuggingFace rate limits.
    Fix: Verify your network connection, or update llmfit to the latest version to ensure API endpoints are current.

💡 Tips & Best Practices

💡 Tip: If you only care about models that will run flawlessly on your GPU without offloading to slower system RAM, start your session with llmfit fit --perfect -n 10 to get a quick, parseable CLI table of the top 10 perfect fits.

💡 Tip: Running a remote headless server? You can start llmfit as a REST API aggregator using llmfit serve --host 0.0.0.0 --port 8787. This is incredibly useful for Kubernetes cluster schedulers that need to ping a node to see what models it can safely run.

💡 Tip: MoE (Mixture of Experts) models look absolutely massive by parameter count, but require significantly less VRAM during inference. Don’t be afraid to click on models like Mixtral 8x7B; llmfit accurately calculates expert offloading to see if it fits, often surprising you with what your hardware can actually handle.

💡 Tip: If you use the OpenClaw AI assistant framework, you can install the llmfit-advisor skill. Your agent can then automatically configure your local model settings in your openclaw.json file based on llmfit’s hardware recommendations.

💡 Tip: If you prefer pure CLI output for bash scripting or automation pipelines, append --json to commands. For example, running llmfit recommend --use-case coding --json returns perfectly formatted, machine-readable JSON.


✅ Final Thoughts

Figuring out what local models can run on your hardware used to involve massive spreadsheets, deep Reddit searches, and a lot of frustrating trial and error. llmfit replaces all of that friction with a fast, mathematically accurate, and beautifully designed terminal tool. Whether you are rocking a single GPU gaming rig or planning a multi-GPU workstation build for production inference, this tool is an absolute must-have for your local AI stack. The best local LLM setup is the one you actually use—and now you know exactly which models fit flawlessly into your hardware constraints. Happy generating!


❓ FAQ

❓ Q: Does llmfit support multi-GPU setups?

A: Yes. It seamlessly aggregates VRAM across all detected GPUs using tools like nvidia-smi (for NVIDIA cards) or rocm-smi (for AMD). It then scores the models based on your total pooled VRAM, making it excellent for multi-card workstations.

❓ Q: Why does a massive 46B parameter model fit in just 7GB of VRAM?

A: If it’s a Mixture-of-Experts (MoE) model like Mixtral or DeepSeek, only a small subset of parameters (the “experts”) are active per token generated. llmfit inherently understands MoE architectures and calculates the effective VRAM requirement, which is drastically lower than the total footprint.
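The arithmetic checks out if you run it on Mixtral’s published figures: roughly 46.7B total parameters but only about 12.9B active per token (2 of 8 experts plus the shared attention layers). This sketch uses those approximate public numbers and simplifies the offload model considerably:

```python
def moe_active_gb(active_params_billion: float, bits_per_weight: float) -> float:
    # Memory for only the parameters that are hot per token, assuming the
    # inactive experts can be offloaded out of VRAM.
    return round(active_params_billion * bits_per_weight / 8, 1)

# Mixtral 8x7B's ~12.9B active parameters at a ~4.3 bits/weight quant
print(moe_active_gb(12.9, 4.3))
```

That comes out just under 7 GB of hot weights, which is how a “46B” model ends up with a single-GPU-sized effective footprint.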

❓ Q: How accurate are the Tokens Per Second (TPS) speed estimates?

A: Highly accurate. llmfit uses a deeply researched bandwidth lookup table covering over 80 specific GPUs. It factors in kernel overhead and memory controller effects, and the internal math has been validated against real-world llama.cpp benchmarks on both Apple Silicon and NVIDIA hardware.

❓ Q: Can I use it to find embedding or vision models?

A: Absolutely. The built-in database spans general, coding, reasoning, chat, multimodal (vision like Llama 3.2 Vision), and embedding categories. You can filter by these specific use cases directly in the TUI or via CLI flags.


📚 Additional Resources