
AI Toolkit vs OneTrainer: Z-Image LoRA Training 2026


If you’ve trained Z-Image Base LoRAs in both AI Toolkit and OneTrainer, you’ve probably hit a wall that feels designed to drive you insane: one tool nails the body but softens the face, the other gets the face exactly right but mangles the arms. I’ve dug into both tools across dozens of character training runs, and this isn’t random β€” there are specific, fixable reasons why the two diverge. This guide breaks down the key differences in settings, speed, and training behavior, then gives you working configs for both.


πŸ” What Is Z-Image Base LoRA Training?

Z-Image Base is the full, non-distilled Z-Image checkpoint from Tongyi-MAI. It runs at 30–50 sampling steps with full CFG and negative prompt support β€” unlike Z-Image Turbo, which is an 8-step distilled variant. Because Base is the source model, it’s the cleanest target for LoRA fine-tuning: any LoRA you train here has access to the full weight space without fighting distillation artifacts.

Both AI Toolkit (by Ostris) and OneTrainer are open-source LoRA trainers that support Z-Image Base. They both work. The question is which gives you more predictable results β€” and why they behave so differently even with nearly identical settings.


⚑ Why the Tool Choice Actually Matters

The biggest surprise for most people is that identical hyperparameters produce meaningfully different outputs in the two tools. Here’s the practical summary before the details:

  • βœ… Speed: OneTrainer runs roughly 1.4–2Γ— faster than AI Toolkit on the same hardware. The main driver is torch.compile and int8 quantized training (w8a8), which OneTrainer enables by default and AI Toolkit currently lacks.
  • βœ… Character bodies: AI Toolkit tends to produce more anatomically consistent bodies β€” especially at 3,000+ steps.
  • βœ… Character faces: OneTrainer frequently produces sharper, more accurate facial likeness, particularly with the Prodigy_Adv optimizer and stochastic rounding enabled.
  • βœ… Workflow: AI Toolkit has a cleaner, simpler UI and supports job queuing for multi-run iteration.
  • βœ… Options: OneTrainer exposes more optimizer choices and ships fine-grained presets per VRAM tier.
  • ❌ The catch? Getting both face and body right in a single tool requires deliberate tuning β€” but it’s fully achievable in either one.

πŸ“Š Quick Comparison Table

| Feature | AI Toolkit | OneTrainer |
| --- | --- | --- |
| UI Simplicity | βœ… Very clean | ❌ Dense but powerful |
| Speed | ❌ ~1.4–2Γ— slower | βœ… Faster via torch.compile + int8 |
| Job Queue | βœ… Yes | ❌ No |
| Z-Image Base Support | βœ… Native | βœ… Native |
| FLUX.2 Klein 9B Support | βœ… Yes | βœ… Via fork (PR not yet merged) |
| Prodigy_Adv Optimizer | ❌ Not available | βœ… Yes |
| Stochastic Rounding | ❌ Not exposed in UI | βœ… Available |
| VRAM Presets | ❌ Manual config | βœ… Presets per tier |
| Windows Support | βœ… Yes | βœ… Yes |

πŸ₯‡ AI Toolkit by Ostris

AI Toolkit is the natural starting point for most people β€” Ostris ships solid defaults, the job queue makes iterating across multiple runs painless, and the cloud UI on RunComfy removes local setup entirely. Here are the full recommended settings for Z-Image Base character LoRAs.

πŸ’‘ Key Features

  • Job queue for multi-dataset iteration without manual restarts
  • FlowMatch sampler with CFG support for accurate in-training preview images
  • float8 quantization for Transformer and Text Encoder
  • BF16 checkpoint output
  • Cloud deployment via RunComfy (H100/H200)
βš™οΈ Recommended Settings

| Setting | Value |
| --- | --- |
| Model Architecture | Z-Image |
| Model Path | Tongyi-MAI/Z-Image |
| Quantization (Transformer) | float8 |
| Quantization (Text Encoder) | float8 |
| Target Type | LoRA |
| Linear Rank | 32 |
| Save Data Type | BF16 |
| Batch Size | 1 |
| Steps | 3,000–7,000 |
| Optimizer | AdamW8Bit |
| Learning Rate | 0.0001 |
| Weight Decay | 0.0001 |
| Timestep Type | Weighted |
| Timestep Bias | Balanced |
| EMA | OFF |
| Resolutions | 768 + 1024 |
| Sample Guidance Scale | 4 |
| Sample Steps | 30–50 |

Steps per image: aim for ~100 steps per image in your dataset. 64 images β†’ 6,400 steps. 30 images β†’ 3,000 steps. Don’t undershoot β€” if your face is coming out slightly wide or soft at 3,000 steps with a 60-image dataset, extending to 5,000–6,500 and checking intermediate checkpoints is the right move, not switching tools.
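The steps-per-image arithmetic is easy to script. A minimal helper (hypothetical, not part of either tool):

```python
def recommended_steps(image_count: int, steps_per_image: int = 100) -> int:
    """Community baseline: ~100 training steps per dataset image at batch size 1."""
    return image_count * steps_per_image

print(recommended_steps(64))  # 6400
print(recommended_steps(30))  # 3000
```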

Sample settings matter. The most common Z-Image Base misconfiguration in AI Toolkit is using Turbo-style sampling β€” 8 steps, little or no CFG β€” for preview images. Set Guidance Scale to 3–5 and Sample Steps to at least 30. Turbo settings here make previews look undercooked, which causes people to stop training too early or assume the LoRA isn’t working when it just needs more steps.
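A quick guard for catching Turbo-style preview misconfiguration before a run starts. This is illustrative only; the parameter names are assumptions, not AI Toolkit's actual config schema:

```python
def validate_base_sampling(guidance_scale: float, sample_steps: int) -> list[str]:
    """Flag Turbo-style preview settings that make Z-Image Base previews look undercooked."""
    warnings = []
    if not 3 <= guidance_scale <= 5:
        warnings.append("guidance_scale should be 3-5 for Z-Image Base previews")
    if sample_steps < 30:
        warnings.append("sample_steps should be >= 30; 8-step Turbo sampling misleads")
    return warnings

print(validate_base_sampling(1.0, 8))   # flags both settings
print(validate_base_sampling(4.0, 30))  # no warnings
```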

❌ Known Limitations

  • No Prodigy_Adv optimizer β€” only standard Prodigy and AdamW variants
  • Stochastic rounding is not exposed in the UI
  • ~1.4–2Γ— slower than OneTrainer at equivalent settings
  • FLUX.2 Klein 9B training sample images are notoriously misleading β€” test checkpoints manually at a fixed seed
  • Speed gap versus OneTrainer widens above batch size 2

πŸ’° Cloud Cost

Running on RunComfy H100: approximately $2–4 per full character training run depending on step count and resolution.
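That range is easy to sanity-check with a back-of-envelope calculator. Both default values below are assumptions (throughput varies with resolution; check RunComfy's current pricing):

```python
def estimate_run_cost(steps: int, steps_per_hour: int = 4000,
                      hourly_rate_usd: float = 2.5) -> float:
    """Rough cost estimate: training hours times GPU hourly rate.
    steps_per_hour and hourly_rate_usd are assumed values, not measured."""
    hours = steps / steps_per_hour
    return round(hours * hourly_rate_usd, 2)

print(estimate_run_cost(3000))  # short run, low end of the range
print(estimate_run_cost(6400))  # 64-image run at 100 steps/image
```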


πŸ₯ˆ OneTrainer

OneTrainer trades UI simplicity for raw flexibility. The settings page is dense, but that density is the point β€” it exposes optimizer options and quantization controls that AI Toolkit either doesn’t have or doesn’t surface. For Z-Image Base character training specifically, the Prodigy_Adv optimizer with stochastic rounding directly improves facial likeness in ways that default AI Toolkit settings don’t match.

πŸ’‘ Key Features

  • torch.compile for significantly faster training on all supported GPUs
  • Prodigy_Adv optimizer with stochastic rounding
  • Per-VRAM-tier presets (8 GB, 12 GB, 16 GB, 24 GB+)
  • DoRA support for faster convergence on fine character details
  • Image pair training for edit-style LoRAs
βš™οΈ Recommended Settings

| Setting | Value |
| --- | --- |
| Base Model | Z-Image Base |
| LoRA Rank | 16–32 |
| Alpha | Equal to rank |
| Optimizer | Prodigy_Adv (or AdamW + stochastic rounding) |
| Learning Rate | 1.0 (Prodigy) / 0.0001 (AdamW) |
| Epochs | ~100–120 (at batch 1, steps = epochs Γ— image count) |
| Stochastic Rounding | ON |
| LoRA Weight Data Type | BF16 |
| EMA | OFF |
| Resolution | 768 |
| Gradient Checkpointing | ON if VRAM is tight |
| Differential Guidance | OFF |

Alpha and rank: keep alpha equal to rank. Setting alpha = 1 with a high rank effectively decouples the update scale from the learning rate in a way that makes tuning harder to reason about. Equal alpha and rank keeps the math straightforward and the LR more predictable.
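The reasoning is visible in the standard LoRA forward pass, where the low-rank update is multiplied by alpha/rank. A toy illustration of that scaling factor, not either trainer's internals:

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Effective multiplier applied to the LoRA update: alpha / rank."""
    return alpha / rank

print(lora_scale(32, 32))  # 1.0 -> the LR behaves exactly as tuned
print(lora_scale(1, 32))   # 0.03125 -> update silently ~32x weaker
```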

Stochastic rounding: in OneTrainer’s optimizer settings, enable stochastic rounding whenever you’re using BF16 as the LoRA weight data type. This is the single setting most people miss when first using OneTrainer, and it meaningfully affects convergence quality on fine facial detail.

Epoch-to-steps math: 100 epochs at batch size 1 with 64 images = 6,400 steps. Train to 120 epochs (7,680 steps) and manually select the best checkpoint. The sweet spot is often around epochs 115–118 β€” not the final one.
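The epoch-to-steps conversion as a small helper (hypothetical, not part of OneTrainer):

```python
import math

def epochs_to_steps(epochs: int, image_count: int, batch_size: int = 1) -> int:
    """Optimizer steps for a run: epochs * ceil(images / batch)."""
    return epochs * math.ceil(image_count / batch_size)

print(epochs_to_steps(100, 64))  # 6400
print(epochs_to_steps(120, 64))  # 7680
```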

❌ Known Limitations

  • FLUX.2 Klein 9B support requires a fork (PR #1301, not yet merged to main)
  • Dense UI β€” configuring from scratch takes significantly longer than AI Toolkit
  • No job queue for automated multi-run training
  • Sample images during training can be misleading on some models β€” use a fixed seed for manual checkpoint evaluation

πŸ’Ύ Recommended Settings by VRAM Tier

These apply to both tools unless otherwise noted.

| Setting | 12–16 GB | 24 GB | 48 GB+ |
| --- | --- | --- | --- |
| Quantization | float8 (Transformer + TE) | float8 | Optional |
| Rank | 16 | 32 | 32–48 |
| Resolutions | 512 + 768 | 768 + 1024 | 1024 + 1280 + 1536 |
| Sample Steps | 30 | 30–40 | 40–50 |
| Steps (~60-image character set) | 4,000–6,000 | 5,000–7,000 | Same, faster iteration |
| EMA | OFF | OFF | OFF |
| BF16 LoRA Weights | YES | YES | YES |
| Stochastic Rounding (OneTrainer) | ON | ON | ON |

🧩 Fixing the Character LoRA Face vs Body Problem

This is the most common frustration when training Z-Image Base character LoRAs, and it has a real explanation rather than being random tool behavior.

Why AI Toolkit gets the body right but softens the face:

At 3,000 steps with a 64-image dataset, a Z-Image Base LoRA trained in AI Toolkit has learned enough for body poses, composition, and general likeness β€” but hasn’t converged on high-frequency facial detail. The weighted timestep approach AI Toolkit uses by default distributes training broadly, which favors coarser features (body shape, clothing, pose) before finer ones (facial geometry, eye spacing). The fix is more steps, not a different tool.
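AI Toolkit's actual weighting function isn't reproduced here, but the effect can be illustrated with a logit-normal timestep draw (a common weighting scheme in flow-matching trainers), which concentrates training away from the extreme timesteps where high-frequency detail is refined:

```python
import math
import random

def weighted_timestep(rng: random.Random) -> float:
    """Logit-normal draw: concentrates samples in the mid timestep range.
    Stand-in for a weighted scheme, not AI Toolkit's exact function."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(0.0, 1.0)))

def uniform_timestep(rng: random.Random) -> float:
    """Flat draw over [0, 1) for comparison."""
    return rng.random()

rng = random.Random(0)
weighted = [weighted_timestep(rng) for _ in range(10_000)]
mid_fraction = sum(1 for t in weighted if 0.25 < t < 0.75) / len(weighted)
print(f"weighted draws in mid range: {mid_fraction:.2f}")  # ~0.73 vs 0.50 for uniform
```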

Why OneTrainer nails the face but fails the body:

Body failures in OneTrainer β€” slimmer proportions, deformed arms and hands β€” are almost always a dataset imbalance issue, not an optimizer issue. If your dataset has 40 portrait crops and 24 full-body shots, the LoRA overweights facial features relative to body structure. Prodigy_Adv and stochastic rounding accelerate convergence, which means the imbalance shows up faster and more severely than it does in AI Toolkit.

It sounds obvious, but check your dataset ratio before changing your optimizer. Add more full-body images with clear arm and hand visibility. Physically crop out other people in group shots rather than captioning them out; caption exclusion is inconsistent.
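Counting the split mechanically takes a few lines. A crude sketch that classifies by caption keywords (the keyword list and filenames are assumptions; adapt to your captioning style):

```python
def dataset_balance(captions: dict[str, str]) -> dict[str, int]:
    """Crude portrait vs. full-body split based on caption keywords."""
    full_body_terms = ("full body", "full-length", "standing", "wide shot")
    counts = {"portrait": 0, "full_body": 0}
    for filename, caption in captions.items():
        text = caption.lower()
        if any(term in text for term in full_body_terms):
            counts["full_body"] += 1
        else:
            counts["portrait"] += 1
    return counts

# Hypothetical example dataset
example = {
    "img01.png": "a woman, close-up portrait, soft light",
    "img02.png": "a woman standing in a park, full body",
    "img03.png": "headshot, studio background",
}
print(dataset_balance(example))  # {'portrait': 2, 'full_body': 1}
```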

A two-LoRA approach that works:

The most reliable high-likeness method currently used by the community: train a full character LoRA in AI Toolkit at 6,000+ steps for body and pose fidelity, then train a face-only LoRA in OneTrainer using a tightly cropped face-only dataset. Apply both in ComfyUI β€” pass the face LoRA through a FaceDetailer node using FLUX.2 Klein 4B as the refinement model. This setup consistently produces 95–100% likeness across poses and lighting conditions.


πŸ› οΈ Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| Face slightly wide or nose larger (AI Toolkit) | Too few steps β€” underfitting | Increase to 6,000+ steps; test every 250-step checkpoint |
| Arms deformed or body too slim (OneTrainer) | Dataset imbalance β€” too many portrait crops | Add full-body images; crop out bystanders physically |
| "LoRA does nothing" at inference | Turbo sampling settings used on Base | Set guidance scale 3–5 and sample steps 30–50 at inference |
| OOM at 1024 resolution | Rank too high or quantization off | Enable float8 for the Transformer; drop to 768; reduce rank to 16 |
| Loss curve never drops (OneTrainer) | Differential guidance accidentally enabled | Turn OFF differential guidance in advanced settings |
| LoRA looks great on Base, poor on Turbo | Base–Turbo weight deviation | Train on Turbo for Turbo deployment, or use a 4-step distilled LoRA merge |
| FLUX.2 Klein 9B OOM on 16 GB | Model too large for VRAM at default settings | Enable layer offloading; use 7-bit quant; drop rank to 8–16 |
| Training samples look terrible during Klein runs | Known issue β€” base-model mid-training samples | Ignore samples; manually test checkpoints at a fixed seed |

πŸ’‘ Tips & Best Practices

πŸ’‘ Tip: Don’t deploy a Z-Image Base LoRA directly on Z-Image Turbo and expect matching results. Turbo is a distilled finetune with diverged weights β€” LoRA strength, facial sharpness, and style transfer all behave differently. You’ll need to push strength to 1.3–1.5 at minimum to compensate, and results still won’t match a Turbo-trained LoRA. Train and deploy on the same base for consistent quality.

πŸ’‘ Tip: In OneTrainer, always verify that stochastic rounding is enabled in the optimizer settings when using BF16 as the LoRA weight data type. It’s not automatic for all optimizers and is the most commonly missed setting. Without it, precision loss from BF16 can subtly degrade convergence on fine detail β€” exactly the kind of problem that shows up as soft or inaccurate faces.

πŸ’‘ Tip: Caption dropout at 0.05 prevents the trigger word from becoming too tightly bound to a single lighting condition or background. AI Toolkit sets this by default; OneTrainer requires you to set it manually. Skipping it makes your LoRA less flexible at inference time β€” prompts for different backgrounds or lighting will partially bleed into the trigger response.
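The mechanism itself is trivial: with probability p, the caption is replaced by an empty string for that training step. A sketch of the idea, not either trainer's implementation:

```python
import random

def maybe_drop_caption(caption: str, rng: random.Random, p: float = 0.05) -> str:
    """With probability p, train on an empty caption so the concept
    isn't bound exclusively to one phrasing, background, or lighting."""
    return "" if rng.random() < p else caption

rng = random.Random(42)
dropped = sum(1 for _ in range(10_000)
              if maybe_drop_caption("zimg_char, portrait", rng) == "")
print(dropped)  # roughly 500 of 10,000 at p=0.05
```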

πŸ’‘ Tip: For FLUX.2 Klein 9B in AI Toolkit, set timestep_type to linear instead of the default weighted. Community members consistently report better character likeness at 1,800–2,000 steps with this single change, using lr=0.0001 and rank=32. Also plan to run the LoRA at 1.3–1.5 strength when using the distilled 9B model at inference β€” base-trained LoRAs don’t fully transfer at strength 1.0.

πŸ’‘ Tip: The 100-steps-per-image rule is a floor, not a ceiling. For high-fidelity character training, running to 120 steps per image and manually selecting the best intermediate checkpoint typically beats stopping at the minimum. Evaluate each checkpoint at a fixed prompt and seed β€” the best result is usually around 115–118 epochs of training, not at the end of the run.
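The checkpoint sweep is easy to plan up front. A hypothetical helper, not part of either tool:

```python
def checkpoint_schedule(image_count: int, max_steps_per_image: int = 120,
                        save_every: int = 250) -> list[int]:
    """Steps at which to save and evaluate checkpoints, running to
    120 steps per image rather than stopping at the 100-step floor."""
    total = image_count * max_steps_per_image
    return list(range(save_every, total + 1, save_every))

plan = checkpoint_schedule(64)
print(plan[-1])   # 7500 -- last 250-step multiple below 7680
print(len(plan))  # 30 checkpoints to compare at a fixed prompt and seed
```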

πŸ’‘ Tip: If FLUX.2 Klein 9B hits VRAM limits on 16 GB, the 4B variant uses Apache 2.0 and trains without forced quantization compromises on consumer hardware. The 9B produces marginally sharper fine detail, but for most character and style use cases the 4B results are competitive and far easier to work with locally.


βœ… Final Thoughts

There’s no single best tool for Z-Image Base LoRA training β€” there are clear trade-offs. AI Toolkit gives you a simpler workflow, a job queue, and body-consistent character training. OneTrainer runs faster, exposes better optimizer options, and produces stronger facial likeness when configured correctly. The face-versus-body problem is almost always a dataset balance or step-count issue, not a fundamental limitation of either tool.

Fix your dataset ratio, run more steps than you think you need, enable stochastic rounding in OneTrainer, and evaluate intermediate checkpoints at a fixed seed rather than trusting training previews. That combination resolves the overwhelming majority of Z-Image Base character training frustrations β€” in both tools.

Go train something.


❓ FAQ

❓ Q: Why is AI Toolkit so much slower than OneTrainer for LoRA training?

OneTrainer is 1.4–2Γ— faster primarily because it implements torch.compile and int8 quantized training (w8a8) by default β€” optimizations AI Toolkit currently lacks. The gap widens at higher batch sizes, where OneTrainer’s VRAM scaling is more efficient. For speed-critical workflows, OneTrainer is the better choice; for workflow simplicity and job queuing, AI Toolkit wins.

❓ Q: Can I use a Z-Image Base LoRA on Z-Image Turbo?

You can, but results vary and typically require increasing LoRA strength to 1.3–1.5 to compensate for the weight deviation between Base and Turbo. For best consistency, train on the same model you’ll deploy to. If Turbo deployment is the goal, either retrain on Turbo or combine a Base-trained character LoRA with a separately available 4-step distilled LoRA for Base inference.

❓ Q: How many training steps does a Z-Image Base character LoRA actually need?

The community baseline is 100–120 steps per image at batch size 1: 64 images β†’ 6,400–7,680 steps. Training too short is the most common cause of soft faces and missed likeness in AI Toolkit. Save checkpoints every 250 steps and manually evaluate at a fixed seed β€” the best result is often around 115–118 epochs, not the final checkpoint.

❓ Q: Does Prodigy_Adv actually improve results over AdamW for Z-Image LoRA training?

The quality improvement comes primarily from two things, not the optimizer itself: Prodigy_Adv dynamically adjusts the learning rate (removing the need to tune it manually), and it’s typically paired with stochastic rounding enabled in OneTrainer. If you enable stochastic rounding alongside a well-tuned AdamW learning rate, the quality gap between the two closes significantly. Prodigy_Adv is mainly a convenience, not a secret ingredient.

❓ Q: Why do Z-Image Base LoRA training samples look bad during training?

Training samples are generated mid-run at reduced quality to avoid slowing down the run. For Z-Image Base, they’re produced using in-progress weights at 30 steps, which doesn’t reflect final LoRA quality. Don’t make decisions based on training samples alone. After the run completes, manually test each checkpoint at a consistent prompt and seed to find the actual best point.

