Diffusers documentation

Krea 2

You are viewing the main version, which requires installation from source. For a regular pip install, check out the latest stable version (v0.38.0).

Krea 2 (K2) is a flow-matching text-to-image model built around a single-stream MMDiT with grouped-query attention. A Qwen3-VL text encoder provides the conditioning: instead of the last hidden state, hidden states from twelve decoder layers are tapped per token and fused inside the transformer by a small text-fusion stage. Images are decoded with the Qwen-Image VAE.

Two checkpoints are released, sharing the same architecture but with different recommended sampler settings:

  • Base (midtrain) — use the full sampler with classifier-free guidance: num_inference_steps=28, guidance_scale=4.5.
  • TDM (distilled) — few-step sampling with guidance disabled: num_inference_steps=8, guidance_scale=0.0.

guidance_scale follows the Krea 2 convention: the velocity is computed as cond + guidance_scale * (cond - uncond) and guidance is enabled whenever guidance_scale > 0 (this equals the usual CFG formulation with scale 1 + guidance_scale).
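
The equivalence stated above can be checked with a minimal sketch in plain Python. The function names are illustrative and not part of Diffusers. Scalars stand in for velocity tensors, since the same arithmetic applies elementwise. Returning the conditional prediction alone when guidance is disabled is an assumption based on the description above.

```python
def krea2_guided_velocity(cond, uncond, guidance_scale):
    """Combine velocities with the convention described above (illustrative helper)."""
    if guidance_scale <= 0:
        # Assumption: with guidance disabled, only the conditional prediction is used.
        return cond
    return cond + guidance_scale * (cond - uncond)


def usual_cfg(cond, uncond, scale):
    """The usual CFG formulation: uncond + scale * (cond - uncond)."""
    return uncond + scale * (cond - uncond)


cond, uncond, guidance_scale = 0.8, 0.2, 4.5
krea2_value = krea2_guided_velocity(cond, uncond, guidance_scale)
usual_value = usual_cfg(cond, uncond, 1 + guidance_scale)

# Both conventions give the same guided velocity (up to floating-point rounding).
assert abs(krea2_value - usual_value) < 1e-9
```

In practice this means a guidance_scale of 4.5 here corresponds to a scale of 5.5 in pipelines that use the usual formulation.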

Text-to-image

import torch
from diffusers import Krea2Pipeline

# Load from a local directory produced by the Krea 2 conversion (no hub repo yet).
pipe = Krea2Pipeline.from_pretrained("path/to/krea2-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "a fox in the snow"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.5,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("krea2.png")
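
Since the two checkpoints differ only in their recommended sampler settings, one way to keep call sites uniform is a small helper that returns the settings listed earlier on this page. This is a sketch: recommended_sampler_kwargs is a hypothetical name, not a Diffusers function.

```python
def recommended_sampler_kwargs(is_distilled: bool) -> dict:
    """Return the sampler settings recommended on this page (hypothetical helper)."""
    if is_distilled:
        # TDM (distilled): few-step sampling with guidance disabled.
        return {"num_inference_steps": 8, "guidance_scale": 0.0}
    # Base (midtrain): full sampler with classifier-free guidance.
    return {"num_inference_steps": 28, "guidance_scale": 4.5}


# Usage with an already loaded pipeline (not executed here):
# image = pipe(prompt, **recommended_sampler_kwargs(is_distilled=True)).images[0]
```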

Krea2Pipeline

class diffusers.Krea2Pipeline

( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLQwenImage text_encoder: Qwen3VLModel tokenizer: AutoTokenizer transformer: Krea2Transformer2DModel text_encoder_select_layers: tuple[int, ...] | list[int] | None = None is_distilled: bool = False patch_size: int = 2 )

Parameters

  • scheduler (FlowMatchEulerDiscreteScheduler) — Euler flow-matching scheduler. The Krea 2 sigma schedule is the resolution-aware exponential time shift, so the scheduler config is expected to set use_dynamic_shifting=True together with the Krea 2 shift parameters (base_shift=0.5, max_shift=1.15, base_image_seq_len=256, max_image_seq_len=6400).
  • vae (AutoencoderKLQwenImage) — The Qwen-Image variational auto-encoder (f8, 16 latent channels) used to decode latents to images.
  • text_encoder (PreTrainedModel) — A Qwen3-VL model (e.g. the Qwen3VLModel from Qwen/Qwen3-VL-4B-Instruct). The pipeline consumes a stack of hidden states tapped from several decoder layers rather than the last hidden state.
  • tokenizer (AutoTokenizer) — The tokenizer paired with the text encoder.
  • transformer (Krea2Transformer2DModel) — The Krea 2 single-stream MMDiT that predicts the flow-matching velocity.
  • text_encoder_select_layers (tuple[int, ...] or list[int], optional) — Indices into the text encoder’s hidden_states tuple (index 0 is the embedding output). The selected states are stacked per token to form the transformer’s text conditioning. Must contain exactly transformer.config.num_text_layers entries.
  • is_distilled (bool, optional, defaults to False) — Whether the transformer is the few-step distilled (TDM/turbo) checkpoint. When True a fixed timestep shift mu=1.15 is used; otherwise mu is computed from the image resolution.
  • patch_size (int, optional, defaults to 2) — Side length of the square patches the latents are packed into before entering the transformer. The effective pixel-to-token downsampling factor is vae_scale_factor * patch_size.
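
To make the relationship between resolution, token count and timestep shift concrete, here is a rough sketch. The helper names are hypothetical. The f8 VAE factor, the patch size and the shift parameters come from the parameter list above. The linear interpolation of mu between base_shift and max_shift is an assumption, borrowed from how other flow-matching pipelines in Diffusers compute a dynamic shift; the actual Krea 2 implementation may differ.

```python
import math

VAE_SCALE_FACTOR = 8  # the "f8" VAE from the parameter list
PATCH_SIZE = 2        # pipeline default


def image_seq_len(height, width, vae_scale_factor=VAE_SCALE_FACTOR, patch_size=PATCH_SIZE):
    """Number of image tokens after VAE downsampling and patch packing."""
    factor = vae_scale_factor * patch_size  # 16 pixels per token side by default
    grid_height = math.ceil(height / factor)  # sizes are rounded up to a multiple of 16
    grid_width = math.ceil(width / factor)
    return grid_height * grid_width


def resolution_aware_mu(seq_len, base_shift=0.5, max_shift=1.15,
                        base_image_seq_len=256, max_image_seq_len=6400):
    """ASSUMPTION: mu is interpolated linearly in the image sequence length."""
    slope = (max_shift - base_shift) / (max_image_seq_len - base_image_seq_len)
    return base_shift + slope * (seq_len - base_image_seq_len)


def pick_mu(height, width, is_distilled=False):
    """Fixed mu=1.15 for the distilled checkpoint, resolution-aware mu otherwise."""
    if is_distilled:
        return 1.15
    return resolution_aware_mu(image_seq_len(height, width))


tokens = image_seq_len(1024, 1024)  # 64 x 64 grid of patches
mu = pick_mu(1024, 1024)
```

Under these assumptions a 1024x1024 image maps to 4096 image tokens, and the fixed distilled value of 1.15 coincides with the shift at max_image_seq_len.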

The Krea 2 pipeline for text-to-image generation.

__call__

( prompt: str | list[str] | None = None negative_prompt: str | list[str] | None = None height: int = 1024 width: int = 1024 num_inference_steps: int = 28 sigmas: list[float] | None = None guidance_scale: float = 4.5 num_images_per_prompt: int = 1 generator: torch.Generator | list[torch.Generator] | None = None latents: torch.Tensor | None = None prompt_embeds: torch.Tensor | None = None prompt_embeds_mask: torch.Tensor | None = None negative_prompt_embeds: torch.Tensor | None = None negative_prompt_embeds_mask: torch.Tensor | None = None output_type: str | None = 'pil' return_dict: bool = True callback_on_step_end: Callable[[int, int, dict], None] | None = None callback_on_step_end_tensor_inputs: list[str] = ['latents'] max_sequence_length: int = 512 ) → Krea2PipelineOutput or tuple

Parameters

  • prompt (str or list[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds.
  • negative_prompt (str or list[str], optional) — The prompt or prompts describing what the image generation should be steered away from. Ignored when guidance_scale <= 0; defaults to an empty prompt when guidance is enabled.
  • height (int, defaults to 1024) — The height in pixels of the generated image. Rounded up to a multiple of 16 if needed.
  • width (int, defaults to 1024) — The width in pixels of the generated image. Rounded up to a multiple of 16 if needed.
  • num_inference_steps (int, defaults to 28) — The number of denoising steps. Use 28 for the base (midtrain) checkpoint and 8 for the few-step distilled (TDM) checkpoint.
  • sigmas (list[float], optional) — Custom sigmas for the scheduler. If not defined, the default linspace(1.0, 1/num_inference_steps, num_inference_steps) grid is used (the resolution-aware shift is applied inside the scheduler).
  • guidance_scale (float, defaults to 4.5) — Classifier-free guidance scale, following the Krea 2 convention: the velocity is computed as cond + guidance_scale * (cond - uncond) and guidance is enabled whenever guidance_scale > 0 (this equals the usual CFG formulation with scale 1 + guidance_scale). Set to 0.0 to disable (e.g. for the TDM checkpoint).
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or list[torch.Generator], optional) — One or more torch generator(s) to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents in packed form (batch_size, image_seq_len, in_channels), sampled from a Gaussian distribution, to be used as inputs for image generation.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings of shape (batch_size, text_seq_len, num_text_layers, text_hidden_dim). If not provided, embeddings are generated from prompt.
  • prompt_embeds_mask (torch.Tensor, optional) — Boolean mask for prompt_embeds; required when prompt_embeds is passed.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings; same layout as prompt_embeds.
  • negative_prompt_embeds_mask (torch.Tensor, optional) — Boolean mask for negative_prompt_embeds; required when negative_prompt_embeds is passed.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil", "np", "pt" or "latent".
  • return_dict (bool, optional, defaults to True) — Whether or not to return a Krea2PipelineOutput instead of a plain tuple.
  • callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step with callback_on_step_end(self, step, timestep, callback_kwargs).
  • callback_on_step_end_tensor_inputs (list[str], optional, defaults to ["latents"]) — The tensor inputs passed to callback_on_step_end through callback_kwargs. Must be a subset of the pipeline's _callback_tensor_inputs attribute.
  • max_sequence_length (int, defaults to 512) — Fixed text sequence length consumed by the transformer; prompts are padded or truncated to it.
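
The default sigma grid and the effect of the timestep shift can be illustrated without the scheduler. The sketch below reproduces linspace(1.0, 1/num_inference_steps, num_inference_steps) in plain Python. The shift formula is an assumption: it is the exponential form exp(mu) / (exp(mu) + (1/sigma - 1)) commonly used for dynamic shifting in flow-matching schedulers, and the function names are illustrative.

```python
import math


def default_sigmas(num_inference_steps):
    """Plain-Python equivalent of linspace(1.0, 1/n, n): evenly spaced with step 1/n."""
    n = num_inference_steps
    return [1.0 - i / n for i in range(n)]


def exponential_time_shift(sigma, mu):
    """ASSUMPTION: exponential shift; mu=0 leaves sigma unchanged, mu>0 raises it."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / sigma - 1.0))


sigmas = default_sigmas(8)  # grid for the few-step (TDM) setting
shifted = [exponential_time_shift(s, mu=1.15) for s in sigmas]
```

A positive mu pushes every intermediate sigma toward 1.0, so more of the sampling budget is spent at high noise levels; sigma=1.0 itself is left unchanged.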

Returns

Krea2PipelineOutput or tuple

Krea2PipelineOutput if return_dict is True, otherwise a tuple, whose first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import Krea2Pipeline

>>> # Load from a local directory produced by the Krea 2 conversion (no hub repo yet).
>>> pipe = Krea2Pipeline.from_pretrained("path/to/krea2-diffusers", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "a fox in the snow"
>>> # Base (midtrain) checkpoint defaults. For the few-step distilled (TDM) checkpoint use
>>> # `num_inference_steps=8, guidance_scale=0.0` instead.
>>> image = pipe(prompt, num_inference_steps=28, guidance_scale=4.5).images[0]
>>> image.save("krea2.png")
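
The callback_on_step_end hook from the parameter list can be exercised with a small recorder. This is a sketch that assumes Krea2Pipeline follows the callback convention of other Diffusers pipelines, where the callback receives the pipeline, the step index, the timestep and a dict of tensors, and returns that dict. The names record_step and seen_steps are illustrative.

```python
seen_steps = []


def record_step(pipe, step, timestep, callback_kwargs):
    """Record (step, timestep) pairs and hand the tensors back unchanged."""
    # callback_kwargs holds the tensors named in callback_on_step_end_tensor_inputs,
    # for example callback_kwargs["latents"].
    seen_steps.append((step, float(timestep)))
    # Assumption: as in other Diffusers pipelines, the callback returns callback_kwargs.
    return callback_kwargs


# Usage with an already loaded pipeline (not executed here):
# image = pipe(
#     prompt,
#     callback_on_step_end=record_step,
#     callback_on_step_end_tensor_inputs=["latents"],
# ).images[0]
```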

encode_prompt

( prompt: str | list[str] device: torch.device | None = None num_images_per_prompt: int = 1 prompt_embeds: torch.Tensor | None = None prompt_embeds_mask: torch.Tensor | None = None max_sequence_length: int = 512 )

Parameters

  • prompt (str or list[str], optional) — The prompt or prompts to encode.
  • device (torch.device, optional) — The torch device to use.
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings of shape (batch_size, text_seq_len, num_text_layers, text_hidden_dim). Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_embeds_mask (torch.Tensor, optional) — Pre-generated boolean mask marking valid text tokens, of shape (batch_size, text_seq_len). Required when prompt_embeds is passed.
  • max_sequence_length (int, defaults to 512) — Fixed text sequence length consumed by the transformer; prompts are padded or truncated to it.

get_text_hidden_states

( prompt: str | list[str] max_sequence_length: int = 512 device: torch.device | None = None )

Tokenize prompt into the fixed-length Krea 2 layout and tap the selected encoder hidden states.

Returns a (hidden_states, attention_mask) tuple of shapes (batch_size, text_seq_len, num_text_layers, text_hidden_dim) and (batch_size, text_seq_len) (bool).
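
The per-token stacking of tapped layers can be pictured with a toy example. The sketch below uses nested Python lists in place of tensors and handles a single prompt, so the batch dimension is omitted. It only illustrates the resulting (text_seq_len, num_text_layers, text_hidden_dim) layout; stack_selected_layers and the toy data are hypothetical, and the real method runs the Qwen3-VL encoder.

```python
def stack_selected_layers(hidden_states, select_layers):
    """hidden_states[layer][token] is a vector; returns result[token][k] for the k-th selected layer."""
    text_seq_len = len(hidden_states[0])
    return [
        [hidden_states[layer][token] for layer in select_layers]
        for token in range(text_seq_len)
    ]


# Toy encoder output: 5 entries (embedding output + 4 decoder layers),
# 3 tokens, hidden size 2. Each vector is [layer_index, token_index].
toy_hidden_states = [
    [[float(layer), float(token)] for token in range(3)]
    for layer in range(5)
]

# Tap layers 1 and 3; index 0 (the embedding output) is skipped here.
stacked = stack_selected_layers(toy_hidden_states, select_layers=(1, 3))
```

Here stacked has 3 tokens, each carrying 2 layer vectors of size 2, which mirrors the documented layout with num_text_layers=2.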

prepare_position_ids

( text_seq_len: int grid_height: int grid_width: int device: torch.device )

Build the (text_seq_len + grid_height * grid_width, 3) rotary coordinates for the combined sequence: text tokens sit at the origin, image tokens carry their (0, h, w) latent-grid coordinates.
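
The coordinate layout described above can be sketched in plain Python. Tuples stand in for rows of the position tensor, and the function name is illustrative. Row-major ordering of the image tokens (height first, then width) is an assumption; the text only states that each image token carries its (0, h, w) coordinates.

```python
def sketch_position_ids(text_seq_len, grid_height, grid_width):
    """Return text_seq_len + grid_height * grid_width coordinate triples."""
    # Text tokens all sit at the origin.
    text_ids = [(0, 0, 0)] * text_seq_len
    # Image tokens carry their latent-grid coordinates (assumed row-major order).
    image_ids = [(0, h, w) for h in range(grid_height) for w in range(grid_width)]
    return text_ids + image_ids


position_ids = sketch_position_ids(text_seq_len=4, grid_height=2, grid_width=3)
```

With these toy sizes the result has 4 + 2 * 3 = 10 rows, and the image token in row 1, column 2 of the grid carries the coordinates (0, 1, 2).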

Krea2PipelineOutput

class diffusers.pipelines.krea2.Krea2PipelineOutput

( images: list[PIL.Image.Image] | numpy.ndarray )

Parameters

  • images (list[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or numpy array of shape (batch_size, height, width, num_channels).

Output class for the Krea 2 pipeline.
