Krea 2
Krea 2 (K2) is a flow-matching text-to-image model built around a single-stream MMDiT with grouped-query attention. A Qwen3-VL text encoder provides the conditioning: instead of the last hidden state, hidden states from twelve decoder layers are tapped per token and fused inside the transformer by a small text-fusion stage. Images are decoded with the Qwen-Image VAE.
Two checkpoints are released, sharing the same architecture but with different recommended sampler settings:
- Base (midtrain) — use the full sampler with classifier-free guidance: num_inference_steps=28, guidance_scale=4.5.
- TDM (distilled) — distilled for few-step sampling, run with num_inference_steps=8 and guidance disabled (guidance_scale=0.0).
guidance_scale follows the Krea 2 convention: the velocity is computed as cond + guidance_scale * (cond - uncond), and guidance is enabled whenever guidance_scale > 0. This equals the usual CFG formulation with scale 1 + guidance_scale.
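The equivalence between the two conventions can be checked numerically. The following is a minimal sketch, not pipeline code: the helper names krea_guidance and standard_cfg are hypothetical, and plain Python lists stand in for the velocity tensors.

```python
import math


def krea_guidance(cond, uncond, guidance_scale):
    """Krea 2 convention: cond + guidance_scale * (cond - uncond)."""
    return [c + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def standard_cfg(cond, uncond, cfg_scale):
    """Usual CFG formulation: uncond + cfg_scale * (cond - uncond)."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy "velocities" standing in for the conditional and unconditional predictions.
cond = [0.2, -1.0, 3.5]
uncond = [0.1, 0.4, 3.0]

krea = krea_guidance(cond, uncond, guidance_scale=4.5)
usual = standard_cfg(cond, uncond, cfg_scale=1.0 + 4.5)

# Both conventions agree up to floating-point rounding.
conventions_match = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(krea, usual))
```

With guidance_scale=0.0 the Krea 2 expression reduces to the conditional prediction alone, which is why the distilled checkpoint can skip the unconditional pass.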
Text-to-image
import torch
from diffusers import Krea2Pipeline
# Load from a local directory produced by the Krea 2 conversion (no hub repo yet).
pipe = Krea2Pipeline.from_pretrained("path/to/krea2-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = "a fox in the snow"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.5,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("krea2.png")
Krea2Pipeline
class diffusers.Krea2Pipeline
( scheduler: FlowMatchEulerDiscreteScheduler, vae: AutoencoderKLQwenImage, text_encoder: Qwen3VLModel, tokenizer: AutoTokenizer, transformer: Krea2Transformer2DModel, text_encoder_select_layers: tuple[int, ...] | list[int] | None = None, is_distilled: bool = False, patch_size: int = 2 )
Parameters
- scheduler (FlowMatchEulerDiscreteScheduler) — Euler flow-matching scheduler. The Krea 2 sigma schedule is the resolution-aware exponential time shift, so the scheduler config is expected to set use_dynamic_shifting=True together with the Krea 2 shift parameters (base_shift=0.5, max_shift=1.15, base_image_seq_len=256, max_image_seq_len=6400).
- vae (AutoencoderKLQwenImage) — The Qwen-Image variational auto-encoder (f8, 16 latent channels) used to decode latents to images.
- text_encoder (PreTrainedModel) — A Qwen3-VL model (e.g. Qwen3VLModel of Qwen/Qwen3-VL-4B-Instruct). The pipeline consumes a stack of hidden states tapped from several decoder layers rather than the last hidden state.
- tokenizer (AutoTokenizer) — The tokenizer paired with the text encoder.
- transformer (Krea2Transformer2DModel) — The Krea 2 single-stream MMDiT that predicts the flow-matching velocity.
- text_encoder_select_layers (tuple[int, ...], optional) — Indices into the text encoder’s hidden_states tuple (0 is the embedding output) whose states are stacked per token as the transformer’s text conditioning. Must have transformer.config.num_text_layers entries.
- is_distilled (bool, optional, defaults to False) — Whether the transformer is the few-step distilled (TDM/turbo) checkpoint. When True, a fixed timestep shift mu=1.15 is used; otherwise mu is computed from the image resolution.
- patch_size (int, optional, defaults to 2) — Side length of the square patches the latents are packed into before entering the transformer. The effective pixel-to-token downsampling factor is vae_scale_factor * patch_size.
The Krea 2 pipeline for text-to-image generation.
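The interaction between the scheduler shift parameters and is_distilled can be illustrated with a short sketch. It assumes that mu is a linear interpolation in the image token count between base_shift and max_shift, as in other flow-matching pipelines in diffusers, and that the exponential shift has the form exp(mu) / (exp(mu) + (1 / sigma - 1)). The function names compute_mu and exponential_time_shift are hypothetical, and the actual pipeline may differ in details such as clamping.

```python
import math


def compute_mu(image_seq_len, base_seq_len=256, max_seq_len=6400,
               base_shift=0.5, max_shift=1.15):
    """Assumed form: linear interpolation of mu in the number of image tokens."""
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    return base_shift + slope * (image_seq_len - base_seq_len)


def exponential_time_shift(mu, sigma):
    """Shift a sigma in (0, 1]: exp(mu) / (exp(mu) + (1 / sigma - 1))."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / sigma - 1.0))


# 1024x1024 pixels with an f8 VAE and patch_size=2 gives a 64 x 64 token grid.
image_seq_len = (1024 // (8 * 2)) * (1024 // (8 * 2))

mu_base = compute_mu(image_seq_len)  # resolution-aware mu (is_distilled=False)
mu_distilled = 1.15                  # fixed mu stated for is_distilled=True

unshifted = (1.0, 0.75, 0.5, 0.25)
shifted_base = [exponential_time_shift(mu_base, s) for s in unshifted]
shifted_distilled = [exponential_time_shift(mu_distilled, s) for s in unshifted]
```

Under these assumptions a larger mu pushes every intermediate sigma closer to 1, so more of the sampling budget is spent at high noise levels; sigma=1.0 is left unchanged by the shift.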
__call__
( prompt: str | list[str] | None = None, negative_prompt: str | list[str] | None = None, height: int = 1024, width: int = 1024, num_inference_steps: int = 28, sigmas: list[float] | None = None, guidance_scale: float = 4.5, num_images_per_prompt: int = 1, generator: torch._C.Generator | list[torch._C.Generator] | None = None, latents: torch.Tensor | None = None, prompt_embeds: torch.Tensor | None = None, prompt_embeds_mask: torch.Tensor | None = None, negative_prompt_embeds: torch.Tensor | None = None, negative_prompt_embeds_mask: torch.Tensor | None = None, output_type: str | None = 'pil', return_dict: bool = True, callback_on_step_end: typing.Optional[typing.Callable[[int, int, dict], NoneType]] = None, callback_on_step_end_tensor_inputs: list = ['latents'], max_sequence_length: int = 512 ) → Krea2PipelineOutput or tuple
Parameters
- prompt (str or list[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds.
- negative_prompt (str or list[str], optional) — The prompt or prompts not to guide the image generation. Ignored when guidance_scale <= 0; defaults to an empty prompt when guidance is enabled.
- height (int, defaults to 1024) — The height in pixels of the generated image. Rounded up to a multiple of 16 if needed.
- width (int, defaults to 1024) — The width in pixels of the generated image. Rounded up to a multiple of 16 if needed.
- num_inference_steps (int, defaults to 28) — The number of denoising steps. Use 28 for the base (midtrain) checkpoint and 8 for the few-step distilled (TDM) checkpoint.
- sigmas (list[float], optional) — Custom sigmas for the scheduler. If not defined, the default linspace(1.0, 1/num_inference_steps, num_inference_steps) grid is used (the resolution-aware shift is applied inside the scheduler).
- guidance_scale (float, defaults to 4.5) — Classifier-free guidance scale, following the Krea 2 convention: the velocity is computed as cond + guidance_scale * (cond - uncond) and guidance is enabled whenever guidance_scale > 0 (this equals the usual CFG formulation with scale 1 + guidance_scale). Set to 0.0 to disable (e.g. for the TDM checkpoint).
- num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
- generator (torch.Generator or list[torch.Generator], optional) — One or more torch generator(s) to make generation deterministic.
- latents (torch.Tensor, optional) — Pre-generated noisy latents in packed form (batch_size, image_seq_len, in_channels), sampled from a Gaussian distribution, to be used as inputs for image generation.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings of shape (batch_size, text_seq_len, num_text_layers, text_hidden_dim). If not provided, embeddings are generated from prompt.
- prompt_embeds_mask (torch.Tensor, optional) — Boolean mask for prompt_embeds; required when prompt_embeds is passed.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings; same layout as prompt_embeds.
- negative_prompt_embeds_mask (torch.Tensor, optional) — Boolean mask for negative_prompt_embeds; required when negative_prompt_embeds is passed.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil", "np", "pt" or "latent".
- return_dict (bool, optional, defaults to True) — Whether or not to return a Krea2PipelineOutput instead of a plain tuple.
- callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step with callback_on_step_end(self, step, timestep, callback_kwargs).
- callback_on_step_end_tensor_inputs (list[str], optional, defaults to ["latents"]) — The list of tensor inputs for the callback_on_step_end function. Must be a subset of ._callback_tensor_inputs.
- max_sequence_length (int, defaults to 512) — Fixed text sequence length consumed by the transformer; prompts are padded or truncated to it.
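The rounding of height and width and the default sigma grid described above can be worked through in plain Python. This is a sketch under stated assumptions: vae_scale_factor=8 is inferred from the f8 VAE, and the helper names round_up_to_multiple, image_token_grid and default_sigmas are hypothetical rather than pipeline methods.

```python
import math


def round_up_to_multiple(value, multiple=16):
    """Round value up to the nearest multiple (16 = vae_scale_factor 8 * patch_size 2)."""
    return math.ceil(value / multiple) * multiple


def image_token_grid(height, width, vae_scale_factor=8, patch_size=2):
    """Return (grid_height, grid_width) of image tokens for a pixel resolution."""
    factor = vae_scale_factor * patch_size
    height = round_up_to_multiple(height, factor)
    width = round_up_to_multiple(width, factor)
    return height // factor, width // factor


def default_sigmas(num_inference_steps):
    """Pure-Python equivalent of linspace(1.0, 1 / n, n)."""
    n = num_inference_steps
    if n == 1:
        return [1.0]
    step = (1.0 / n - 1.0) / (n - 1)
    return [1.0 + i * step for i in range(n)]


# A 1000-pixel height is rounded up to 1008, giving a 63 x 64 token grid.
grid_h, grid_w = image_token_grid(1000, 1024)
image_seq_len = grid_h * grid_w

# Unshifted grid for the 8-step distilled setting: 1.0, 0.875, ..., 0.125.
sigmas = default_sigmas(8)
```

The image_seq_len computed this way is the middle dimension of the packed latents and, with dynamic shifting enabled, the quantity the resolution-aware shift depends on.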
Returns
Krea2PipelineOutput or tuple
Krea2PipelineOutput if return_dict is True, otherwise a tuple whose first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
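A callback_on_step_end function might look like the sketch below. It relies on an assumption carried over from other diffusers pipelines: the callback returns the callback_kwargs dict, which the pipeline then reads back. The names make_step_logger and on_step_end are hypothetical.

```python
def make_step_logger():
    """Build a callback that records (step, timestep) pairs after each denoising step."""
    history = []

    def on_step_end(pipe, step, timestep, callback_kwargs):
        # callback_kwargs holds the tensors named in callback_on_step_end_tensor_inputs,
        # which defaults to ["latents"].
        history.append((step, float(timestep)))
        # Assumption: the pipeline reads the returned dict back, so it is
        # returned unchanged when nothing is modified.
        return callback_kwargs

    return on_step_end, history


on_step_end, history = make_step_logger()

# Hypothetical usage with a loaded pipeline:
# image = pipe(prompt, callback_on_step_end=on_step_end).images[0]
```

Keeping the recorded history outside the callback makes it easy to inspect after generation finishes without touching the pipeline itself.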
Examples:
>>> import torch
>>> from diffusers import Krea2Pipeline
>>> # Load from a local directory produced by the Krea 2 conversion (no hub repo yet).
>>> pipe = Krea2Pipeline.from_pretrained("path/to/krea2-diffusers", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "a fox in the snow"
>>> # Base (midtrain) checkpoint defaults. For the few-step distilled (TDM) checkpoint use
>>> # `num_inference_steps=8, guidance_scale=0.0` instead.
>>> image = pipe(prompt, num_inference_steps=28, guidance_scale=4.5).images[0]
>>> image.save("krea2.png")
encode_prompt
( prompt: str | list[str], device: torch.device | None = None, num_images_per_prompt: int = 1, prompt_embeds: torch.Tensor | None = None, prompt_embeds_mask: torch.Tensor | None = None, max_sequence_length: int = 512 )
Parameters
- prompt (str or list[str], optional) — prompt to be encoded
- device (torch.device) — torch device
- num_images_per_prompt (int) — number of images that should be generated per prompt
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings of shape (batch_size, text_seq_len, num_text_layers, text_hidden_dim). Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- prompt_embeds_mask (torch.Tensor, optional) — Pre-generated boolean mask marking valid text tokens, of shape (batch_size, text_seq_len). Required when prompt_embeds is passed.
- max_sequence_length (int, defaults to 512) — Fixed text sequence length consumed by the transformer; prompts are padded or truncated to it.
( prompt: str | list[str], max_sequence_length: int = 512, device: torch.device | None = None )
Tokenize prompt into the fixed-length Krea 2 layout and tap the selected encoder hidden states.
Returns a (hidden_states, attention_mask) tuple of shapes (batch_size, text_seq_len, num_text_layers, text_hidden_dim) and (batch_size, text_seq_len) (bool).
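The per-token stacking of selected hidden states can be sketched with nested lists standing in for tensors. The helper stack_selected_layers and the layer indices (1, 3) are made up for illustration; with real tensors the same layout would correspond to something like torch.stack over the selected layers along dim=2.

```python
def stack_selected_layers(hidden_states, select_layers):
    """Stack per-layer states into a [batch][token][layer][dim] nested list.

    hidden_states is a sequence of per-layer outputs, each indexed
    [batch][token][dim]; index 0 is the embedding output.
    """
    first = hidden_states[select_layers[0]]
    return [
        [
            [hidden_states[layer][b][t] for layer in select_layers]
            for t in range(len(first[b]))
        ]
        for b in range(len(first))
    ]


# Toy encoder output: 4 "layers", batch of 1, 2 tokens, hidden size 3.
# Every value in layer k is float(k), so the source layer stays visible.
hidden_states = [
    [[[float(layer)] * 3 for _token in range(2)]]
    for layer in range(4)
]

stacked = stack_selected_layers(hidden_states, select_layers=(1, 3))
shape = (len(stacked), len(stacked[0]), len(stacked[0][0]), len(stacked[0][0][0]))
```

Here shape corresponds to (batch_size, text_seq_len, num_text_layers, text_hidden_dim), with num_text_layers equal to the number of selected indices.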
prepare_position_ids
( text_seq_len: int, grid_height: int, grid_width: int, device: device )
Build the (text_seq_len + grid_height * grid_width, 3) rotary coordinates for the combined sequence:
text tokens sit at the origin, image tokens carry their (0, h, w) latent-grid coordinates.
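The coordinate layout can be made concrete with a small sketch that uses tuples instead of a tensor. It assumes that text rows come first and that image tokens are enumerated in row-major order over the latent grid; neither ordering is confirmed here, and build_position_ids is a hypothetical stand-in for the real method.

```python
def build_position_ids(text_seq_len, grid_height, grid_width):
    """Return a list of (t, h, w) rows: text rows at the origin, then image rows."""
    text_ids = [(0, 0, 0)] * text_seq_len
    # Row-major traversal of the latent grid is an assumption of this sketch.
    image_ids = [(0, h, w) for h in range(grid_height) for w in range(grid_width)]
    return text_ids + image_ids


# 4 text tokens followed by a 2 x 3 latent grid: 4 + 2 * 3 = 10 rows of 3 coordinates.
position_ids = build_position_ids(text_seq_len=4, grid_height=2, grid_width=3)
```

Because every text token shares the origin, only the image tokens receive distinct rotary coordinates in this layout.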
Krea2PipelineOutput
class diffusers.pipelines.krea2.Krea2PipelineOutput
( images: list[PIL.Image.Image] | numpy.ndarray )
Output class for the Krea 2 pipeline.