FP8 vs BF16 in ComfyUI: Precision, Performance & Installation

November 6, 2025
10 min read

If you push ComfyUI hard — bigger batches, higher resolutions, more complex workflows — precision stops being an academic detail and starts feeling like a superpower. The right format lets you fit larger models in VRAM, run faster, and still keep image quality where you need it. In this guide, we’ll compare FP8 vs BF16 specifically for ComfyUI users, then walk through how to enable and install these formats in your workflows.

This won’t be a light explainer. I’ll keep it human, but we’re going deep enough that you can make a confident, informed choice for your exact GPU and workflow.


What you’ll learn

  • The “why” behind lower precision in diffusion workflows
  • FP32, BF16, and FP8 — what they are and how they differ
  • Why FP8/BF16 matter in ComfyUI (VRAM, speed, batch size, stability)
  • Step-by-step: using BF16 and FP8 models in ComfyUI
  • Mini workflow example you can replicate
  • Performance expectations and when to pick which
  • Common issues, symptoms, and quick fixes
  • References and further reading

Background: Floating-point formats (quick but solid)

FP32 (baseline)

Most deep learning frameworks historically defaulted to FP32 (32-bit floating point): 1 sign bit, 8-bit exponent, 23-bit mantissa. It’s accurate, stable, and expensive — in memory, bandwidth, and compute. For inference, FP32 is overkill on modern GPUs, which is why lower-precision formats have become standard.

  • Memory use: 4 bytes/value
  • Great numerical stability
  • Slower and larger versus FP16/BF16/FP8

Why lower precision at all?

  • Less memory per tensor → larger models, higher resolution or batch size
  • More values per cache line → better bandwidth utilization
  • Specialized tensor cores → much higher throughput on modern GPUs

The trade-off: as precision drops, you risk rounding error, underflow/overflow, and occasional instability unless the format preserves range or you use scaling/mixed-precision tricks.

BF16 (bfloat16)

BF16 keeps the FP32 exponent width but shrinks the mantissa. That’s the magic.

  • Format: 1 sign bit, 8-bit exponent, 7-bit mantissa
  • Key property: FP32-like dynamic range with lower precision mantissa
  • Practical effect: fewer NaNs/overflows than FP16, more stable training/inference while saving memory versus FP32

Typical use: BF16 is the “safe” lower-precision workhorse across training and inference on NVIDIA/AMD/TPU hardware. It often just works where FP16 might be finicky.
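
Here’s a tiny PyTorch sketch (not tied to ComfyUI) that makes both sides of the trade concrete: BF16 keeps values finite where FP16 overflows, but its coarse mantissa rounds away small differences.

```python
# Minimal sketch: BF16 range vs. precision (any recent PyTorch works on CPU)
import torch

x = torch.tensor(70000.0)
print(x.to(torch.float16))    # inf: 70000 is above FP16's max (~65504)
print(x.to(torch.bfloat16))   # stays finite, just rounded (FP32-like exponent range)

y = torch.tensor(1.0, dtype=torch.bfloat16)
print(y + 0.001)              # still 1.0: below BF16's ~0.0078 step size near 1.0
```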

Citations:

  • NVIDIA Developer: Floating-point formats and mixed-precision best practices (BF16 range advantages)
  • Wikipedia: bfloat16 format definition and bit layout

FP8 (E4M3, E5M2)

FP8 isn’t one format — it’s a family. The two that matter:

  • E4M3: 1 sign, 4-bit exponent, 3-bit mantissa (the common e4m3fn variant has no infinities, only finite values and NaN)
  • E5M2: 1 sign, 5-bit exponent, 2-bit mantissa

Why it’s emerging now:

  • Newer GPUs (NVIDIA Hopper/Blackwell, some accelerators) provide native FP8 tensor core throughput
  • Massive memory savings: 1 byte/value
  • Huge throughput wins on supported hardware

Trade-offs:

  • Much smaller mantissas → less precision; can impact fine detail or stability if not scaled well
  • May require per-tensor scaling, calibration, or mixed-precision paths
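
If you want to see the range-versus-precision trade numerically, torch.finfo spells it out. This assumes a PyTorch build recent enough (roughly 2.1+) to expose the FP8 dtypes:

```python
# Compare dynamic range (max) and precision near 1.0 (eps) across formats
import torch

for dtype in (torch.bfloat16, torch.float8_e5m2, torch.float8_e4m3fn):
    info = torch.finfo(dtype)
    print(f"{str(dtype):22s} max={info.max:<12g} eps={info.eps}")

# Expected pattern: bfloat16 max ~3.4e38, e5m2 max ~57344, e4m3fn max ~448;
# eps grows as the mantissa shrinks (bf16 ~0.0078, e4m3 0.125, e5m2 0.25).
```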

Citations:

  • NVIDIA Developer Blog: “Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training”
  • rohan-paul.com: overviews of FP8/low-precision trends and practical effects

BF16 vs FP8 at a glance

  • Dynamic range: BF16 ≈ FP32 range; FP8 range depends on E4M3 vs E5M2 (E5M2 has more range, less precision)
  • Mantissa precision: BF16 (7 bits) > FP8 (2–3 bits)
  • Memory: BF16 = 2 bytes, FP8 = 1 byte
  • Throughput: on FP8-native tensor cores, FP8 can be ~1.5–2× vs BF16 in best cases (hardware-dependent)
  • Stability: BF16 generally safer; FP8 needs good scaling and may degrade fidelity in edge cases

Reported numbers (illustrative; hardware/stack dependent): in some NVIDIA accelerator benchmarks, training throughput improved from roughly 415 TFLOPS with BF16 to roughly 570 TFLOPS with FP8. See NVIDIA Developer posts and arXiv studies discussing FP8 vs. BF16 training trade-offs for context.

Citations:

  • NVIDIA Developer Blog (FP8 intros and NeMo throughput posts)
  • arXiv: “Trade-offs of FP8 vs. BF16 Training in LLMs”

Why this matters in ComfyUI

Diffusion models (SD 1.5/2.1, SDXL, FLUX variants, etc.) are memory-hungry and bandwidth-bound. Lower precision helps you:

  • Fit bigger models in VRAM (especially SDXL/FLUX)
  • Push higher resolutions (4K upscales, large latent sizes)
  • Increase batch size for iteration speed
  • Reduce node runtimes and queue wait

But there are caveats:

  • Fidelity: community testing often ranks quality roughly as fp16 > bf16 > fp8_scaled > fp8_e4m3fn for certain workflows. Your mileage will vary by model and sampler.
  • Node compatibility: Some custom nodes assume specific dtypes; mixed precision across nodes can cause errors.
  • Model availability: BF16 weights are more common than true FP8 weights. FP8 sometimes requires on-the-fly quantization nodes or special checkpoints.

In practical ComfyUI workflows (think SDXL base + refiner, 4K upscales, ControlNet/Guidance, or image editing like Qwen-Image-Edit pipelines), precision choice can decide whether your render runs in 12 GB VRAM or crashes.
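
A quick back-of-envelope on weight memory shows why. The parameter counts below are rough, commonly cited figures, and weights are only part of the VRAM story (activations, latents, text encoders and the VAE add more), but the scaling is the point:

```python
# Rough weight-only memory for a few model sizes at different precisions
GIB = 1024 ** 3

models = {"SDXL UNet (~2.6B params)": 2.6e9, "FLUX.1 transformer (~12B params)": 12e9}
bytes_per_value = {"fp32": 4, "bf16": 2, "fp8": 1}

for name, params in models.items():
    sizes = {fmt: f"{params * b / GIB:.1f} GiB" for fmt, b in bytes_per_value.items()}
    print(name, sizes)
# e.g. the FLUX transformer drops from ~22 GiB in bf16 to ~11 GiB in fp8
```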

References/Community notes:

  • comfyanonymous.github.io docs and wiki threads discussing dtype trade-offs
  • Stable Diffusion tutorial blogs/videos demonstrating FP8/BF16 model variants in ComfyUI

Installation & setup in ComfyUI (step-by-step)

Here’s a pragmatic path that balances stability with speed.

1) Update ComfyUI

Keeping ComfyUI and its nodes up-to-date avoids most dtype headaches.

  • Portable build: replace the ComfyUI folder with the latest release or pull the latest changes
  • Git install: pull latest on main branch and update submodules/extensions

Optional but recommended: update your Python env, PyTorch, and CUDA toolkits to match the matrix recommended by ComfyUI maintainers for your GPU.
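
A quick sanity check of the environment ComfyUI actually runs in can save a lot of guesswork later; this minimal sketch only uses standard PyTorch calls:

```python
# Run inside the Python environment ComfyUI uses
import torch

print("PyTorch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0),
          "compute capability:", torch.cuda.get_device_capability(0))
print("bf16 supported:", torch.cuda.is_available() and torch.cuda.is_bf16_supported())
print("fp8 dtypes in this build:",
      hasattr(torch, "float8_e4m3fn"), hasattr(torch, "float8_e5m2"))
```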

2) Install FP8-related custom nodes (optional)

If you plan to use true FP8 checkpoints or on-the-fly conversion, install a custom node pack that adds FP8 quantization or loading support.

Follow each extension’s README. Typically this means cloning into ComfyUI/custom_nodes/ and restarting ComfyUI so the nodes are discovered.

3) Place model files in the right folder

Depending on your build, models may be expected in one of these:

  • ComfyUI/models/checkpoints/ (common default for SD/SDXL/FLUX)
  • or ComfyUI/models/diffusion_models/ (some guides use this path)

If you have multiple precision variants, keep file names explicit, for example:

  • sdxl_base_bf16.safetensors
  • sdxl_base_fp8_e4m3fn.safetensors
  • flux1-dev-fp8.safetensors

In ComfyUI, the “Load Checkpoint” (or model loader) node will show each file. Choose the precision variant you intend to benchmark.
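
If you’re not sure what a downloaded file really contains, you can inspect the on-disk dtypes with the safetensors library (already present in a typical ComfyUI environment). The path below is just an example; point it at your own checkpoint:

```python
# Count the dtypes stored in a checkpoint to confirm its precision on disk
from collections import Counter
from safetensors import safe_open

path = "ComfyUI/models/checkpoints/sdxl_base_bf16.safetensors"  # example path
dtypes = Counter()
with safe_open(path, framework="pt", device="cpu") as f:
    for key in f.keys():
        # get_slice avoids loading the full tensor; if your safetensors version
        # lacks get_dtype(), fall back to f.get_tensor(key).dtype (slower)
        dtypes[str(f.get_slice(key).get_dtype())] += 1
print(dtypes)
```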

4) Hardware prerequisites

  • FP8 acceleration requires GPUs with FP8-capable tensor cores (e.g., NVIDIA Hopper/Blackwell; check vendor docs). Without native FP8, you may still use FP8-quantized weights, but you won’t see full speedups.
  • BF16 support is widely available on modern NVIDIA GPUs (Ampere+), many AMD GPUs, and TPUs.
  • VRAM headroom: FP8 halves memory versus BF16 for tensors stored in that format, but graphs may still use mixed dtypes.
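
To check headroom before loading a larger checkpoint, torch.cuda.mem_get_info reports free and total VRAM as the driver currently sees it:

```python
# One-liner VRAM headroom check (NVIDIA GPU with CUDA-enabled PyTorch)
import torch

free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1024**3:.1f} GiB / total: {total / 1024**3:.1f} GiB")
```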

5) Optional: On-the-fly conversion (if you only have FP16/BF16 weights)

Some quantizer nodes can convert from FP16/BF16 to FP8 at load-time or per-tensor. Expect results to depend on model and scaling strategy (per-channel/per-tensor scaling, calibration). Start with conservative settings (scaled FP8, E5M2 for range) and test.
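
To get a feel for what “scaled FP8” means, here is a minimal sketch of per-tensor scaling in plain PyTorch. It is not ComfyUI’s internal code path, just the idea: map the tensor’s largest magnitude onto the FP8 format’s maximum, store at one byte per value, and dequantize for comparison. It assumes a PyTorch build with the float8 dtypes:

```python
import torch

def quant_dequant_fp8(w: torch.Tensor, fp8_dtype=torch.float8_e4m3fn):
    # Scale so the tensor's largest magnitude lands on the FP8 format's max value
    amax = w.abs().max().clamp(min=1e-12)
    scale = torch.finfo(fp8_dtype).max / amax
    w_fp8 = (w * scale).to(fp8_dtype)      # stored at 1 byte/value
    return w_fp8.to(w.dtype) / scale       # dequantize to measure the error

w = torch.randn(4096, 4096)
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    err = (quant_dequant_fp8(w, dtype) - w).abs().mean()
    print(dtype, "mean abs error:", err.item())
# Typically e4m3 shows lower error on weight-like data (more mantissa),
# while e5m2 tolerates a wider spread of magnitudes.
```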

6) Mini workflow example (replicable)

Here’s a small SDXL example you can recreate in the graph editor:

  1. Load Checkpoint → select sdxl_base_bf16.safetensors (or your FP8 variant)
  2. CLIP Text Encode (positive) → prompt: “portrait, soft studio lighting, 85mm, ultra-detailed skin”
  3. CLIP Text Encode (negative) → prompt: “overexposed, extra fingers, low-res, watermark”
  4. Empty Latent Image → 1024×1024
  5. KSampler → Euler a / DPM++ 2M Karras (steps 20–30 for quick tests)
  6. VAE Decode → Save Image

Swap step 1 between BF16 and FP8 variants and compare:

  • Elapsed time per render (ComfyUI console)
  • VRAM usage (monitor with vendor tools)
  • Visual fidelity on hair, skin texture, fine patterns
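
If you’d rather automate the A/B runs, you can drive a local ComfyUI instance over its HTTP API. This sketch assumes the default server at 127.0.0.1:8188, a workflow exported via “Save (API Format)” (enable dev mode in settings), and a standard “Load Checkpoint” node (class CheckpointLoaderSimple) in the graph:

```python
import json
import urllib.request

def queue_prompt(workflow: dict, server="http://127.0.0.1:8188"):
    # POST the API-format workflow; ComfyUI returns a prompt_id you can track
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"{server}/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())

with open("workflow_api.json") as f:      # your exported API-format graph
    workflow = json.load(f)

for ckpt in ("sdxl_base_bf16.safetensors", "sdxl_base_fp8_e4m3fn.safetensors"):
    for node in workflow.values():
        if node.get("class_type") == "CheckpointLoaderSimple":
            node["inputs"]["ckpt_name"] = ckpt
    print(ckpt, "->", queue_prompt(workflow))
```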

For FLUX variants, link text encoders as required by your model (see our FLUX guide: /blog/flux-comfyui-guide).


Performance comparison & recommendations

What you can expect in practice (ballpark, hardware/graph dependent):

  • Speed/Throughput: FP8 can be ~1.5–2× faster than BF16 on GPUs with native FP8 tensor cores, especially on larger batch sizes or high-resolution generations. Sources like NVIDIA Developer posts and practitioner write-ups (e.g., rohan-paul.com) report substantial gains in training/inference throughput.
  • Memory/VRAM: FP8 cuts tensor memory in half relative to BF16. Depending on node mix and mixed-precision paths, real VRAM savings may be a bit less than 2× but are still meaningful.
  • Fidelity/Stability: Community testing often shows a quality gradient: fp16 (or full precision) ≥ bf16 ≥ fp8_scaled ≥ fp8_e4m3fn. On some prompts/models you won’t notice; on texture-rich or high-detail scenes you may.

When to pick which:

  • Choose BF16 if you want:

    • Maximum stability with strong range (fewer corner-case NaNs/overflows)
    • Broad compatibility across custom nodes
    • High-fidelity outputs (portraits, fabric, micro-texture) where subtle detail matters
    • Compatibility with an older or mid-range GPU that lacks FP8 acceleration
  • Choose FP8 if you want:

    • Maximum speed/VRAM savings on FP8-capable GPUs
    • Bigger batches or higher base resolution without OOM
    • Rapid iteration (concepting, style exploration) where slight fidelity loss is acceptable
    • Room to experiment with scaling/calibration settings or FP8-specific checkpoints

Practical tip: Benchmark both on your exact graph. 5–10 runs per setting, same seed/prompts, same sampler, report median time and peak VRAM. Save crops of hair/eyes/fabric for A/B comparison.
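
A rough way to capture those numbers: poll nvidia-smi for peak memory while a render runs, then take the median of the per-run times you read from the ComfyUI console. This assumes an NVIDIA GPU with nvidia-smi on PATH; the run times in the list are placeholder examples:

```python
import statistics
import subprocess
import time

def sample_vram_mib():
    # Query current memory use of GPU 0 in MiB
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return int(out.stdout.splitlines()[0])

peak = 0
for _ in range(60):                 # poll roughly once per second during a render
    peak = max(peak, sample_vram_mib())
    time.sleep(1)
print("peak VRAM (MiB):", peak)

run_times = [14.2, 13.9, 14.4, 14.1, 14.0]   # example values from the console
print("median seconds/render:", statistics.median(run_times))
```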


Common issues & troubleshooting

  • “Model not found” or empty dropdown

    • Verify the path: ComfyUI/models/checkpoints/ (or models/diffusion_models/ per your setup)
    • Confirm the file extension and that it’s not quarantined by OS (Windows SmartScreen)
  • “Dtype mismatch” or tensor cast errors across nodes

    • Keep a consistent precision along major branches (model, VAE, CLIP)
    • Update custom nodes; some older nodes assume fp16 tensors only
  • Image artifacts (banding, posterization, loss of micro-detail) in FP8

    • Try FP8 with scaling (e.g., e4m3 + per-tensor scaling) or switch to E5M2
    • Reduce guidance scale slightly, or increase steps by 10–20%
    • If still visible, move back to BF16 for that model/task
  • Random instability, NaNs, or divergence in special nodes (e.g., ControlNets, image editing)

    • Use BF16 for sensitive subgraphs (keep FP8 for the rest) if your nodes allow it
    • Lower resolution for testing; increase gradually
    • Ensure GPU drivers and CUDA are current
  • Can’t open a community workflow due to missing nodes

    • Read the sidebar error; install the listed custom node packs
    • Search the node name in your custom_nodes/ directory; restart ComfyUI after install
    • Reddit and Discord threads often link the exact repo to clone

Related guides

  • FLUX in ComfyUI: /blog/flux-comfyui-guide
  • SDXL Best Practices: /blog/sdxl-best-practices-guide
  • ComfyUI Portable vs Desktop: /blog/comfyui-portable-vs-desktop-guide
  • TeaCache with ComfyUI (caching tricks for speed): /blog/teacache-with-comfyui

References and further reading

The primary sources for this article are linked inline in the “Citations” notes above: NVIDIA Developer posts on FP8 and mixed-precision formats, the bfloat16 format definition, arXiv work on FP8 vs. BF16 training trade-offs, and ComfyUI community documentation.

Conclusion

  • BF16 is the safe default: big dynamic range, stable, broadly compatible. If you care about consistent fidelity across tricky prompts, start here.
  • FP8 is the performance lever: huge VRAM and speed savings on the right GPUs. Amazing for exploration, large batches, and high-res renders — just validate quality for your subject.
  • In ComfyUI, the choice is yours per-workflow. Keep both variants handy, benchmark with your own graph, and let your GPU and eyes be the judge.

If you run benchmarks on your setup, I’d love to hear the results — model version, GPU, time/VRAM, and which precision you ended up sticking with.