You’ve seen Z-Image Turbo generate photorealistic images in 8 steps. Now you want to inject your own characters, styles, or concepts into it. That’s what this guide is for.
Here’s the catch most tutorials skip: Z-Image Turbo is a step-distilled model. If you train a LoRA on it the usual way — without the training adapter — you’ll gradually undo the distillation. Your LoRA will “work,” but suddenly you’ll need 20–30 steps and non-zero CFG to get clean output. That’s Turbo drift, and it defeats the entire point.
I’ve gone through the official AI Toolkit docs, the RunComfy training guide, and community reports to write the only guide that explains why each setting exists — not just what to copy-paste.
⚡ Turbo vs De-Turbo — Which Base Should You Train On?
AI Toolkit gives you two architecture choices for Z-Image LoRA training. Picking the wrong one means your LoRA won’t behave the way you expect at inference.
| | Z-Image Turbo + Training Adapter | Z-Image De-Turbo |
|---|---|---|
| Best for | Most LoRAs: characters, styles, products | Adapter-free training, longer fine-tunes |
| Inference steps | 8 steps, CFG = 0 | 20–30 steps, CFG 2–3 |
| Requires adapter | ✅ Yes | ❌ No |
| Risk of Turbo drift | Low (adapter prevents it) | Not applicable |
| Model ID | Tongyi-MAI/Z-Image-Turbo | ostris/Z-Image-De-Turbo |
Use Turbo + adapter if you want your LoRA to keep Z-Image’s 8-step speed after training. This is the right choice for 90% of use cases.
Use De-Turbo if you want adapter-free training or plan very long training runs where the adapter would eventually drift anyway.
The rest of this guide focuses on the Turbo + adapter path because that’s what most people need. De-Turbo differences are called out where they matter.
⚙️ Requirements
You need an NVIDIA GPU — AMD and Apple Silicon are not supported. On 12 GB VRAM you can train at 768×768; on 16–24 GB you get full 1024×1024 resolution. RAM should be at least 16 GB, and keep 50+ GB free on an SSD for checkpoints.
On the software side: Python 3.10, CUDA 11.8 or newer, and Git. Linux is the smoothest experience; Windows works well through WSL2 or natively.
🔽 Setup — Install AI Toolkit
Clone the repo and install dependencies:
```bash
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
git submodule update --init --recursive
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```

Then launch the GUI:
```bash
python flux_train_ui.py
```

The web UI opens at http://localhost:7860. All the panels described below are in this interface. If you prefer cloud training with zero setup, RunComfy runs the exact same AI Toolkit UI in a browser — no CUDA installs needed.
No GPU? No problem.
Rent an RTX 4090 on RunPod and train your Z-Image LoRA in under an hour. No setup required.
✅ Step 1 — Prepare Your Dataset
How many images do you need?
Z-Image learns fast. A small, diverse dataset generalizes better than a large repetitive one — this isn’t FLUX or SDXL where more images usually helps. Aim for 12–25 images for a character LoRA, mixing angles, expressions, lighting conditions, and backgrounds. For a style LoRA, 15–40 images across varied subjects works well: people, interiors, objects, different environments. Going above 50 images usually hits diminishing returns unless your concept has very wide visual range.
If you only have 5 images, expect overfitting to show up around step 1500–2000. At that point use an earlier checkpoint rather than letting it run to 3000.
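Before starting a run, it is worth confirming that every image actually has a caption file next to it. A minimal sketch of such a check (the folder layout and extensions follow the dataset structure described below; the helper name is my own, not part of AI Toolkit):

```python
from pathlib import Path

def check_dataset(folder):
    """Count images in a dataset folder and report any missing .txt captions."""
    folder = Path(folder)
    exts = {".png", ".jpg", ".jpeg", ".webp"}
    # Sort so the report order is deterministic
    images = sorted(p for p in folder.iterdir() if p.suffix.lower() in exts)
    missing = [p.name for p in images if not p.with_suffix(".txt").exists()]
    return len(images), missing
```

Run it once before queuing the job; images without a caption will silently fall back to the Default Caption, which is rarely what you want for a character LoRA.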
Resolution
On 12 GB VRAM, enable the 512 and 768 resolution buckets only — skip 1024 to avoid OOM errors. On 16–24 GB you can enable all three (512 / 768 / 1024), which is Z-Image’s native resolution and gives the best results. Always enable Cache Latents in the Datasets panel regardless of which buckets you use.
Captions
Each image needs a matching .txt file with the same base name. If no .txt exists, AI Toolkit falls back to the Default Caption you set in the Datasets panel.
```
dataset/
  photo_01.png
  photo_01.txt
  photo_02.png
  photo_02.txt
```

For a character LoRA, keep captions literal and consistent: [trigger] woman with red hair, close-up portrait, natural lighting. For a style LoRA, describe the scene normally without over-describing the style itself — you want the model to learn "render anything in this style," not "only activate the style on a specific keyword."
Set a short, unique Trigger Word in the Job panel — something like zchar_redhair or zpaint_ink. Non-dictionary tokens that don’t exist as real words work best, as they won’t activate accidentally from other prompts.
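If you prefer to bake the trigger word directly into the caption files rather than rely on the UI's substitution, that is easy to script. A throwaway sketch (the function names are mine; `zchar_redhair` is the example trigger from above):

```python
from pathlib import Path

def prepend_trigger(caption: str, trigger: str) -> str:
    """Prepend the trigger word unless the caption already contains it."""
    if trigger in caption:
        return caption
    return f"{trigger}, {caption}"

def tag_captions(folder: str, trigger: str) -> int:
    """Rewrite every .txt caption in the folder to start with the trigger word."""
    n = 0
    for txt in Path(folder).glob("*.txt"):
        txt.write_text(prepend_trigger(txt.read_text().strip(), trigger))
        n += 1
    return n
```

The idempotence check matters: rerunning the script should not stack the trigger word multiple times.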
✅ Step 2 — Understand the Training Adapter (v1 vs v2)
This is the part most guides skip entirely. It’s also the most important thing to get right.
Why the training adapter exists
Z-Image Turbo is step-distilled: a slow multi-step diffusion process was compressed into 8 fast steps during the original model training. The problem is that if you apply gradient updates directly to a distilled model, those updates gradually undo the distillation. The model starts forgetting how to run cleanly at 8 steps and slowly drifts toward needing 20–30 steps for clean output. This is Turbo drift, and once it happens you’ve lost the main reason to use Z-Image Turbo in the first place.
The training adapter is the fix. It temporarily “de-distills” the model during the forward pass, so your LoRA’s gradient updates land on a model that behaves like a normal diffusion model during training. Your LoRA learns your concept cleanly. At inference time, you drop the adapter entirely and run your LoRA directly on the real Turbo base — 8 steps, CFG at 0, full speed.
v1 vs v2 — which adapter to use?
There are two adapter versions on Hugging Face under ostris/zimage_turbo_training_adapter/:
- v1 (...v1.safetensors) is the stable baseline. If you’re just starting out or your previous run had instability issues, this is your safe fallback.
- v2 (...v2.safetensors) is the newer version, usually the default in recent AI Toolkit builds. It can produce slightly different training dynamics — sometimes better, sometimes less predictable.
The practical approach: start with your UI’s default (usually v2). If you see noisy outputs, Turbo drift, or weird artifacts, rerun the same job with v1 and compare samples at the same checkpoint steps.
✅ Step 3 — Configure AI Toolkit Panel by Panel
Open the AI Toolkit UI, click New Job, and work through the panels top to bottom. Here’s what matters and why.
JOB and MODEL panels
Give your job a descriptive name like zimage_char_redhair_v1 so you can tell checkpoints apart later. Set your Trigger Word here if you’re using one.
For the model, select Z-Image Turbo (w/ Training Adapter) as the architecture — the Name or Path field will auto-fill to Tongyi-MAI/Z-Image-Turbo. The Training Adapter Path should point to the v1 or v2 file discussed above. If you’re going the De-Turbo route instead, select Z-Image De-Turbo (De-Distilled) and the path fills to ostris/Z-Image-De-Turbo — no adapter needed.
QUANTIZATION panel
On 24 GB or more, leave both Transformer and Text Encoder at BF16 — no quantization needed, you’ll get the cleanest gradients. On 16 GB, set the Transformer to float8 and leave the Text Encoder alone. On 12 GB, quantize both to float8 to fit in memory. Avoid going lower than float8 if you can help it; it starts affecting gradient quality noticeably.
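The VRAM rules above reduce to a simple lookup. A hypothetical helper (the function and key names are mine, not AI Toolkit's; thresholds mirror the paragraph above):

```python
def quantization_plan(vram_gb: float) -> dict:
    """Map available VRAM to the quantization settings described above.

    >= 24 GB: no quantization (cleanest gradients)
    16-23 GB: float8 transformer, text encoder untouched
    below 16: float8 for both, to fit in memory
    """
    if vram_gb >= 24:
        return {"transformer": "bf16", "text_encoder": "bf16"}
    if vram_gb >= 16:
        return {"transformer": "float8", "text_encoder": "bf16"}
    return {"transformer": "float8", "text_encoder": "float8"}
```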
TARGET and SAVE panels
Set Target Type to LoRA and Linear Rank to 16 as your starting point. Rank 8 works for smaller or more subtle LoRAs; rank 32 makes sense if you’re training a complex style with lots of texture detail and have the VRAM to handle it.
For saves, use BF16 as the data type, save every 250 steps, and keep 4–12 checkpoint saves. The frequent saves are important — you’ll often want to use a checkpoint from step 1500 or 2000 rather than the final one.
TRAINING panel
This is the most important panel. The table below covers every setting worth touching:
| Setting | Value | Why |
|---|---|---|
| Batch Size | 1 | Never increase this for small datasets — it destabilizes identity |
| Optimizer | AdamW8Bit | Same results as AdamW at a fraction of the VRAM cost |
| Learning Rate | 0.0001 | Drop to 0.00005 if samples look noisy or burned at step 250 |
| Weight Decay | 0.0001 | |
| Steps | 2500–3000 | Use 1500–2200 for fewer than 10 images |
| Timestep Type | Weighted | |
| Timestep Bias | Balanced | Shift to High Noise for stronger global style; Low Noise for identity/detail |
| Cache Text Embeddings | ON (static captions) | Set Caption Dropout to 0 when this is on |
| DOP | OFF for first run | Add later for trigger-only production LoRAs |
Keep EMA off and Unload TE off for standard training runs.
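Under the hood, the GUI serializes these panels into a YAML job config. Roughly how the TRAINING settings above might look in that file — treat this as an illustrative fragment only, and check the YAML your own UI build generates rather than copying field names from here:

```yaml
# Illustrative fragment, mirroring the table above (not a verified schema)
train:
  batch_size: 1
  steps: 2500
  lr: 1e-4
  optimizer: adamw8bit
  weight_decay: 1e-4
  cache_text_embeddings: true
  caption_dropout_rate: 0.0   # must be 0 when embeddings are cached
```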
SAMPLE panel
This is where most people silently break their training without realizing it. Your sample settings must match the base model you’re training on — if they don’t, your preview images will look terrible at every checkpoint and you’ll think the training is failing when it isn’t.
For Turbo training: set steps to 8, guidance scale to 0, resolution to 1024×1024, and sample every 250 steps. For De-Turbo training: use 20–30 steps and guidance scale 2–3 instead.
Write 5–10 sample prompts that reflect real inference use. Always include one or two prompts without your trigger word — this lets you catch style leakage early, where the LoRA starts affecting outputs even when you don’t call it.
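Building that prompt list can be scripted so the trigger-free probes are never forgotten. A small sketch (names are my own; the trigger is the example from the dataset section):

```python
def build_sample_prompts(trigger: str, scenes: list[str],
                         n_trigger_free: int = 2) -> list[str]:
    """Prefix the trigger onto all scenes except the last n_trigger_free,
    which stay trigger-free so style leakage shows up in previews."""
    with_trigger = [f"{trigger}, {s}" for s in scenes[:-n_trigger_free]]
    without = scenes[-n_trigger_free:]
    return with_trigger + without
```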
🚀 Step 4 — Start Training and Read Your Samples
Hit Create Job and watch the training queue. Your first preview images appear at step 250.
Good progress follows a predictable arc: at steps 250–500 you’ll see loose resemblance with correct general colors and rough composition. By steps 750–1000 the identity or style should be clearly coming through. From step 1500 onward you want sharp, consistent results that match your dataset.
A few warning signs to watch for:
- Noisy or burned images at step 250 means your learning rate is too high. Stop immediately, drop LR to 0.00005, and restart.
- LoRA works but needs 20+ steps at inference is Turbo drift — you either trained without the adapter or pushed LR too high for too long. Retrain with the adapter enabled and LR at or below 1e-4.
- Perfect at step 1500, terrible at step 3000 is overfitting. Your dataset is too small or too repetitive. Use the step 1500 checkpoint.
- Style bleeds into trigger-free prompts means the LoRA is too aggressive globally. Enable DOP in the next run, or lower the LoRA weight to 0.7–0.8 at inference.
Training is fast on Z-Image compared to FLUX. On an RTX 3080 at 768×768 you’re looking at roughly 3 seconds per iteration — a 2500-step run takes about 2 hours. An RTX 4090 at 1024×1024 does the same in around 35 minutes.
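The arithmetic behind those estimates is just steps times seconds per iteration. A one-liner for planning your own runs (the per-iteration timings above are the only inputs; measure yours from the first few hundred steps):

```python
def eta_hours(steps: int, sec_per_iter: float) -> float:
    """Rough wall-clock estimate for a training run, in hours."""
    return steps * sec_per_iter / 3600

# Article's RTX 3080 example: 2500 steps at ~3 s/it -> about 2.1 hours
```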
🛠️ Troubleshooting
| Error / Issue | Cause | Fix |
|---|---|---|
| CUDA Out of Memory | Resolution too high for VRAM | Disable 1024 bucket; use 512/768 only |
| Turbo drift (LoRA needs 20+ steps) | Trained without adapter, or LR too high | Use Turbo + adapter architecture; keep LR ≤ 1e-4 |
| Deep-fried / burned images | LR too high | Drop to 0.00005; use earlier checkpoint |
| Overfit faces / repeated backgrounds | Too few images + too many steps | Stop earlier; add more varied images to dataset |
| No speed improvement vs base model | Trained De-Turbo instead of Turbo | Check Model Architecture selection |
| Style leaks into all prompts | No DOP, LoRA weight too high | Enable DOP next run; lower LoRA weight to 0.7–0.8 |
| AttributeError on launch | Outdated AI Toolkit | git pull and reinstall requirements |
📚 Related Guides
- Z-Image Turbo in ComfyUI: Setup & Workflow Guide
- Ostris AI Toolkit: LoRA Training Guide
- Multi-LoRA Workflows in ComfyUI
- FP8 vs BF16 in ComfyUI: Which Format to Use
- Best GPU Cloud Providers for AI
🏁 Final Thoughts
The training adapter is the whole game with Z-Image Turbo. Without it, you get a LoRA that technically works but loses the 8-step speed that makes the model worth using. With it, you can train a solid character or style LoRA in under two hours on a mid-range GPU.
Start with rank 16, LR 1e-4, 2500 steps, v2 adapter. Check your samples at step 500 and 1000. If they look stable, let it run. If not, the troubleshooting section above covers every common failure mode.
Go train something.