You’ve seen Z-Image Turbo generate photorealistic images in 8 steps. Now you want to inject your own characters, styles, or concepts into it. That’s what this guide is for.
Here’s the catch most tutorials skip: Z-Image Turbo is a step-distilled model. If you train a LoRA on it the usual way — without the training adapter — you’ll gradually undo the distillation. Your LoRA will “work,” but suddenly you’ll need 20–30 steps and non-zero CFG to get clean output. That’s Turbo drift, and it defeats the entire point.
I’ve gone through the official AI Toolkit docs, the RunComfy training guide, and community reports to write the only guide that explains why each setting exists — not just what to copy-paste.
⚡ Turbo vs De-Turbo — Which Base Should You Train On?
AI Toolkit gives you two architecture choices for Z-Image LoRA training. Picking the wrong one means your LoRA won’t behave the way you expect at inference.
| | Z-Image Turbo + Training Adapter | Z-Image De-Turbo |
|---|---|---|
| Best for | Most LoRAs: characters, styles, products | Adapter-free training, longer fine-tunes |
| Inference steps | 8 steps, CFG = 0 | 20–30 steps, CFG 2–3 |
| Requires adapter | ✅ Yes | ❌ No |
| Risk of Turbo drift | Low (adapter prevents it) | Not applicable |
| Model ID | Tongyi-MAI/Z-Image-Turbo | ostris/Z-Image-De-Turbo |
Use Turbo + adapter if you want your LoRA to keep Z-Image’s 8-step speed after training. This is the right choice for 90% of use cases.
Use De-Turbo if you want adapter-free training or plan very long training runs where the adapter would eventually drift anyway.
The rest of this guide focuses on the Turbo + adapter path because that’s what most people need. De-Turbo differences are called out where they matter.
⚙️ Requirements
You need an NVIDIA GPU — AMD and Apple Silicon are not supported. On 12 GB VRAM you can train at 768×768; on 16–24 GB you get full 1024×1024 resolution. RAM should be at least 16 GB, and keep 50+ GB free on an SSD for checkpoints.
On the software side: Python 3.10, CUDA 11.8 or newer, and Git. Linux is the smoothest experience; Windows works well through WSL2 or natively.
🔽 Setup — Install AI Toolkit
Clone the repo and install dependencies:
```bash
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
git submodule update --init --recursive
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```

Then launch the GUI:
```bash
python flux_train_ui.py
```

The web UI opens at http://localhost:7860. All the panels described below are in this interface. If you prefer cloud training with zero setup, RunComfy runs the exact same AI Toolkit UI in a browser — no CUDA installs needed.
No GPU? No problem.
Rent an RTX 4090 on RunPod and train your Z-Image LoRA in under an hour. No setup required.
✅ Step 1 — Prepare Your Dataset
How many images do you need?
Z-Image learns fast. A small, diverse dataset generalizes better than a large repetitive one — this isn’t FLUX or SDXL where more images usually helps. Aim for 12–25 images for a character LoRA, mixing angles, expressions, lighting conditions, and backgrounds. For a style LoRA, 15–40 images across varied subjects works well: people, interiors, objects, different environments. Going above 50 images usually hits diminishing returns unless your concept has very wide visual range.
If you only have 5 images, expect overfitting to show up around step 1500–2000. At that point use an earlier checkpoint rather than letting it run to 3000.
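Before starting a run, it is worth confirming that every image actually has a caption file next to it. A minimal sketch of such a check (the folder layout and extensions follow the dataset structure described below; the helper name is my own, not part of AI Toolkit):

```python
from pathlib import Path

def check_dataset(folder):
    """Count images in a dataset folder and report any missing .txt captions."""
    folder = Path(folder)
    exts = {".png", ".jpg", ".jpeg", ".webp"}
    # Sort so the report order is deterministic
    images = sorted(p for p in folder.iterdir() if p.suffix.lower() in exts)
    missing = [p.name for p in images if not p.with_suffix(".txt").exists()]
    return len(images), missing
```

Run it once before queuing the job; images without a caption will silently fall back to the Default Caption, which is rarely what you want for a character LoRA.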
Resolution
On 12 GB VRAM, enable the 512 and 768 resolution buckets only — skip 1024 to avoid OOM errors. On 16–24 GB you can enable all three (512 / 768 / 1024), which is Z-Image’s native resolution and gives the best results. Always enable Cache Latents in the Datasets panel regardless of which buckets you use.
Captions
Each image needs a matching .txt file with the same base name. If no .txt exists, AI Toolkit falls back to the Default Caption you set in the Datasets panel.
```
dataset/
  photo_01.png
  photo_01.txt
  photo_02.png
  photo_02.txt
```

For a character LoRA, keep captions literal and consistent: [trigger] woman with red hair, close-up portrait, natural lighting. For a style LoRA, describe the scene normally without over-describing the style itself — you want the model to learn "render anything in this style," not "only activate the style on a specific keyword."
Set a short, unique Trigger Word in the Job panel — something like zchar_redhair or zpaint_ink. Non-dictionary tokens that don’t exist as real words work best, as they won’t activate accidentally from other prompts.
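If you prefer to bake the trigger word directly into the caption files rather than rely on the UI's substitution, that is easy to script. A throwaway sketch (the function names are mine; `zchar_redhair` is the example trigger from above):

```python
from pathlib import Path

def prepend_trigger(caption: str, trigger: str) -> str:
    """Prepend the trigger word unless the caption already contains it."""
    if trigger in caption:
        return caption
    return f"{trigger}, {caption}"

def tag_captions(folder: str, trigger: str) -> int:
    """Rewrite every .txt caption in the folder to start with the trigger word."""
    n = 0
    for txt in Path(folder).glob("*.txt"):
        txt.write_text(prepend_trigger(txt.read_text().strip(), trigger))
        n += 1
    return n
```

The idempotence check matters: rerunning the script should not stack the trigger word multiple times.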
✅ Step 2 — Understand the Training Adapter (v1 vs v2)
This is the part most guides skip entirely. It’s also the most important thing to get right.
Why the training adapter exists
Z-Image Turbo is step-distilled: a slow multi-step diffusion process was compressed into 8 fast steps during the original model training. The problem is that if you apply gradient updates directly to a distilled model, those updates gradually undo the distillation. The model starts forgetting how to run cleanly at 8 steps and slowly drifts toward needing 20–30 steps for clean output. This is Turbo drift, and once it happens you’ve lost the main reason to use Z-Image Turbo in the first place.
The training adapter is the fix. It temporarily “de-distills” the model during the forward pass, so your LoRA’s gradient updates land on a model that behaves like a normal diffusion model during training. Your LoRA learns your concept cleanly. At inference time, you drop the adapter entirely and run your LoRA directly on the real Turbo base — 8 steps, CFG at 0, full speed.
v1 vs v2 — which adapter to use?
There are two adapter versions on Hugging Face under ostris/zimage_turbo_training_adapter/:
- v1 (...v1.safetensors) is the stable baseline. If you’re just starting out or your previous run had instability issues, this is your safe fallback.
- v2 (...v2.safetensors) is the newer version, usually the default in recent AI Toolkit builds. It can produce slightly different training dynamics — sometimes better, sometimes less predictable.
The practical approach: start with your UI’s default (usually v2). If you see noisy outputs, Turbo drift, or weird artifacts, rerun the same job with v1 and compare samples at the same checkpoint steps.
✅ Step 3 — Configure AI Toolkit Panel by Panel
Open the AI Toolkit UI, click New Job, and work through the panels top to bottom. Here’s what matters and why.
JOB and MODEL panels
Give your job a descriptive name like zimage_char_redhair_v1 so you can tell checkpoints apart later. Set your Trigger Word here if you’re using one.
For the model, select Z-Image Turbo (w/ Training Adapter) as the architecture — the Name or Path field will auto-fill to Tongyi-MAI/Z-Image-Turbo. The Training Adapter Path should point to the v1 or v2 file discussed above. If you’re going the De-Turbo route instead, select Z-Image De-Turbo (De-Distilled) and the path fills to ostris/Z-Image-De-Turbo — no adapter needed.
QUANTIZATION panel
On 24 GB or more, leave both Transformer and Text Encoder at BF16 — no quantization needed, you’ll get the cleanest gradients. On 16 GB, set the Transformer to float8 and leave the Text Encoder alone. On 12 GB, quantize both to float8 to fit in memory. Avoid going lower than float8 if you can help it; it starts affecting gradient quality noticeably.
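The VRAM rules above reduce to a simple lookup. A hypothetical helper (the function and key names are mine, not AI Toolkit's; thresholds mirror the paragraph above):

```python
def quantization_plan(vram_gb: float) -> dict:
    """Map available VRAM to the quantization settings described above.

    >= 24 GB: no quantization (cleanest gradients)
    16-23 GB: float8 transformer, text encoder untouched
    below 16: float8 for both, to fit in memory
    """
    if vram_gb >= 24:
        return {"transformer": "bf16", "text_encoder": "bf16"}
    if vram_gb >= 16:
        return {"transformer": "float8", "text_encoder": "bf16"}
    return {"transformer": "float8", "text_encoder": "float8"}
```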
TARGET and SAVE panels
Set Target Type to LoRA and Linear Rank to 16 as your starting point. Rank 8 works for smaller or more subtle LoRAs; rank 32 makes sense if you’re training a complex style with lots of texture detail and have the VRAM to handle it.
For saves, use BF16 as the data type, save every 250 steps, and keep 4–12 checkpoint saves. The frequent saves are important — you’ll often want to use a checkpoint from step 1500 or 2000 rather than the final one.
TRAINING panel
This is the most important panel. The table below covers every setting worth touching:
| Setting | Value | Why |
|---|---|---|
| Batch Size | 1 | Never increase this for small datasets — it destabilizes identity |
| Optimizer | AdamW8Bit | Same results as AdamW at a fraction of the VRAM cost |
| Learning Rate | 0.0001 | Drop to 0.00005 if samples look noisy or burned at step 250 |
| Weight Decay | 0.0001 | |
| Steps | 2500–3000 | Use 1500–2200 for fewer than 10 images |
| Timestep Type | Weighted | |
| Timestep Bias | Balanced | Shift to High Noise for stronger global style; Low Noise for identity/detail |
| Cache Text Embeddings | ON (static captions) | Set Caption Dropout to 0 when this is on |
| DOP | OFF for first run | Add later for trigger-only production LoRAs |
Keep EMA off and Unload TE off for standard training runs.
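Under the hood, the GUI serializes these panels into a YAML job config. Roughly how the TRAINING settings above might look in that file — treat this as an illustrative fragment only, and check the YAML your own UI build generates rather than copying field names from here:

```yaml
# Illustrative fragment, mirroring the table above (not a verified schema)
train:
  batch_size: 1
  steps: 2500
  lr: 1e-4
  optimizer: adamw8bit
  weight_decay: 1e-4
  cache_text_embeddings: true
  caption_dropout_rate: 0.0   # must be 0 when embeddings are cached
```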
SAMPLE panel
This is where most people silently break their training without realizing it. Your sample settings must match the base model you’re training on — if they don’t, your preview images will look terrible at every checkpoint and you’ll think the training is failing when it isn’t.
For Turbo training: set steps to 8, guidance scale to 0, resolution to 1024×1024, and sample every 250 steps. For De-Turbo training: use 20–30 steps and guidance scale 2–3 instead.
Write 5–10 sample prompts that reflect real inference use. Always include one or two prompts without your trigger word — this lets you catch style leakage early, where the LoRA starts affecting outputs even when you don’t call it.
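Building that prompt list can be scripted so the trigger-free probes are never forgotten. A small sketch (names are my own; the trigger is the example from the dataset section):

```python
def build_sample_prompts(trigger: str, scenes: list[str],
                         n_trigger_free: int = 2) -> list[str]:
    """Prefix the trigger onto all scenes except the last n_trigger_free,
    which stay trigger-free so style leakage shows up in previews."""
    with_trigger = [f"{trigger}, {s}" for s in scenes[:-n_trigger_free]]
    without = scenes[-n_trigger_free:]
    return with_trigger + without
```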
🚀 Step 4 — Start Training and Read Your Samples
Hit Create Job and watch the training queue. Your first preview images appear at step 250.
Good progress follows a predictable arc: at steps 250–500 you’ll see loose resemblance with correct general colors and rough composition. By steps 750–1000 the identity or style should be clearly coming through. From step 1500 onward you want sharp, consistent results that match your dataset.
A few warning signs to watch for:
- Noisy or burned images at step 250 means your learning rate is too high. Stop immediately, drop LR to 0.00005, and restart.
- LoRA works but needs 20+ steps at inference is Turbo drift — you either trained without the adapter or pushed LR too high for too long. Retrain with the adapter enabled and LR at or below 1e-4.
- Perfect at step 1500, terrible at step 3000 is overfitting. Your dataset is too small or too repetitive. Use the step 1500 checkpoint.
- Style bleeds into trigger-free prompts means the LoRA is too aggressive globally. Enable DOP in the next run, or lower the LoRA weight to 0.7–0.8 at inference.
Training is fast on Z-Image compared to FLUX. On an RTX 3080 at 768×768 you’re looking at roughly 3 seconds per iteration — a 2500-step run takes about 2 hours. An RTX 4090 at 1024×1024 does the same in around 35 minutes.
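The arithmetic behind those estimates is just steps times seconds per iteration. A one-liner for planning your own runs (the per-iteration timings above are the only inputs; measure yours from the first few hundred steps):

```python
def eta_hours(steps: int, sec_per_iter: float) -> float:
    """Rough wall-clock estimate for a training run, in hours."""
    return steps * sec_per_iter / 3600

# Article's RTX 3080 example: 2500 steps at ~3 s/it -> about 2.1 hours
```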
🛠️ Troubleshooting
| Error / Issue | Cause | Fix |
|---|---|---|
| CUDA Out of Memory | Resolution too high for VRAM | Disable 1024 bucket; use 512/768 only |
| Turbo drift (LoRA needs 20+ steps) | Trained without adapter, or LR too high | Use Turbo + adapter architecture; keep LR ≤ 1e-4 |
| Deep-fried / burned images | LR too high | Drop to 0.00005; use earlier checkpoint |
| Overfit faces / repeated backgrounds | Too few images + too many steps | Stop earlier; add more varied images to dataset |
| No speed improvement vs base model | Trained De-Turbo instead of Turbo | Check Model Architecture selection |
| Style leaks into all prompts | No DOP, LoRA weight too high | Enable DOP next run; lower LoRA weight to 0.7–0.8 |
| AttributeError on launch | Outdated AI Toolkit | git pull and reinstall requirements |
📚 Related Guides
- Z-Image Turbo in ComfyUI: Setup & Workflow Guide
- Ostris AI Toolkit: LoRA Training Guide
- Multi-LoRA Workflows in ComfyUI
- FP8 vs BF16 in ComfyUI: Which Format to Use
- Best GPU Cloud Providers for AI
🏁 Final Thoughts
The training adapter is the whole game with Z-Image Turbo. Without it, you get a LoRA that technically works but loses the 8-step speed that makes the model worth using. With it, you can train a solid character or style LoRA in under two hours on a mid-range GPU.
Start with rank 16, LR 1e-4, 2500 steps, v2 adapter. Check your samples at step 500 and 1000. If they look stable, let it run. If not, the troubleshooting section above covers every common failure mode.
Go train something.