FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing

Paper | GitHub | Project Page

Overview

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.

Model Architecture

The model consists of three main components:

  1. Text Encoder: UMT5-XXL encoder for text feature extraction
  2. Latent Diffusion Model: Transformer-based diffusion model operating in latent space
  3. VAE Decoder: 1D convolutional VAE for decoding latent features to motion sequences

Technical Specifications:

  • Input: Natural language text
  • Output: Motion sequences in two formats:
    • 263-dimensional HumanML3D features (default)
    • 22×3 joint coordinates (optional, with EMA smoothing support)
  • Latent dimension: 4
  • Upsampling factor: 4× (VAE decoder)
  • Frame rate: 20 FPS
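
The upsampling factor and frame rate above determine how many latent tokens a clip of a given duration needs. The following is a minimal sketch of that arithmetic; the helper names are hypothetical and are not part of the model's API, and the real frame count is only approximately length × 4.

```python
import math

UPSAMPLING_FACTOR = 4  # VAE decoder upsampling, from the specifications above
FRAME_RATE_FPS = 20    # output frame rate, from the specifications above


def frames_for_length(length: int) -> int:
    """Approximate number of output frames for `length` latent tokens."""
    return length * UPSAMPLING_FACTOR


def seconds_for_length(length: int) -> float:
    """Approximate clip duration in seconds for `length` latent tokens."""
    return frames_for_length(length) / FRAME_RATE_FPS


def length_for_seconds(seconds: float) -> int:
    """Smallest number of latent tokens covering at least `seconds` of motion."""
    return math.ceil(seconds * FRAME_RATE_FPS / UPSAMPLING_FACTOR)


print(frames_for_length(60))   # 240
print(seconds_for_length(60))  # 12.0
print(length_for_seconds(12))  # 60
```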

Installation

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU with 16GB+ VRAM (recommended)
  • 16GB+ system RAM

Dependencies

Step 1: Install basic dependencies

pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy

Step 2: Install Flash Attention (Required)

Flash Attention requires CUDA and may need to be compiled during installation:

pip install flash-attn --no-build-isolation

Note: Flash attention is required for this model. If installation fails, please refer to the official flash-attention installation guide.

Quick Start

Basic Usage

from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords) with EMA smoothing (alpha in 0.0-1.0)
motion_joints = model("a person walking forward", length=60, output_joints=True, smoothing_alpha=0.5)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)

Batch Generation

# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly", 
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")

Multi-Text Motion Transitions

# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
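
Because text_end is given in latent tokens, it can help to translate the transition points into output frame indices, for example when labelling a rendered clip. The sketch below assumes each prompt is active from the previous endpoint (or 0) up to its own endpoint and that frames map exactly to tokens × 4; both are assumptions for illustration, and the helper name is hypothetical. Since the model blends actions smoothly, the real boundaries are approximate.

```python
from typing import List, Tuple

UPSAMPLING_FACTOR = 4  # VAE decoder upsampling, as listed in the specifications


def prompt_frame_ranges(text_end: List[int]) -> List[Tuple[int, int]]:
    """Map latent-token endpoints to approximate (start, end) frame ranges.

    Assumes prompt i covers tokens [text_end[i-1], text_end[i]), with the
    first prompt starting at token 0.
    """
    ranges = []
    start = 0
    for end in text_end:
        ranges.append((start * UPSAMPLING_FACTOR, end * UPSAMPLING_FACTOR))
        start = end
    return ranges


print(prompt_frame_ranges([40, 80, 120]))
# [(0, 160), (160, 320), (320, 480)]
```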

API Reference

model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False, smoothing_alpha=1.0)

Generate motion sequences from text descriptions.

Parameters:

  • text (str, List[str], or List[List[str]]): Text description(s)

    • Single string: Generate one motion
    • List of strings: Batch generation
    • Nested list: Multiple text prompts per motion (for transitions)
  • length (int or List[int], default=60): Number of latent tokens to generate

    • Output frames ≈ length × 4 (due to VAE upsampling)
    • Example: length=60 → 240 frames (12 seconds at 20 FPS)
  • text_end (List[int] or List[List[int]], optional): Latent token positions for text transitions

    • Only used when text is a nested list
    • Specifies when to switch between different text descriptions
    • IMPORTANT: Must have the same length as the corresponding text list
      • Example: text=[["walk", "turn", "sit"]] requires text_end=[[20, 40, 60]] (3 endpoints for 3 texts)
    • Must be in ascending order
  • num_denoise_steps (int, optional): Number of denoising iterations

    • Higher values give better quality at the cost of slower generation
    • Recommended range: 10-50
  • output_joints (bool, default=False): Output format selector

    • False: Returns 263-dimensional HumanML3D features
    • True: Returns 22×3 joint coordinates for direct visualization
  • smoothing_alpha (float, default=1.0): EMA smoothing factor for joint positions (only used when output_joints=True)

    • 1.0: No smoothing (default)
    • 0.5: Medium smoothing (recommended for smoother animations)
    • 0.0: Maximum smoothing
    • Range: 0.0 to 1.0
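
To illustrate what the smoothing_alpha values mean, here is a minimal sketch of an exponential moving average over joint positions. It assumes the common convention y[t] = alpha * x[t] + (1 - alpha) * y[t-1], which is consistent with 1.0 meaning no smoothing and 0.0 meaning maximum smoothing. The function name is hypothetical, and the model's internal implementation may differ.

```python
import numpy as np


def ema_smooth(joints: np.ndarray, alpha: float) -> np.ndarray:
    """Apply an EMA along the frame axis of a (frames, 22, 3) array.

    Assumed convention: y[t] = alpha * x[t] + (1 - alpha) * y[t-1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0.0, 1.0]")
    if len(joints) == 0:
        return joints.astype(np.float64)

    smoothed = np.empty_like(joints, dtype=np.float64)
    smoothed[0] = joints[0]  # the first frame has no history to blend with
    for t in range(1, len(joints)):
        smoothed[t] = alpha * joints[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed
```

Under this convention, alpha=1.0 returns the input unchanged, while alpha=0.0 holds every frame at the first pose, which is why intermediate values such as 0.5 are the practical choice.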

Returns:

  • Single motion:
    • output_joints=False: numpy.ndarray of shape (frames, 263)
    • output_joints=True: numpy.ndarray of shape (frames, 22, 3)
  • Batch: List[numpy.ndarray] with shapes as above

Example:

# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns array of shape (~240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns array of shape (~240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (~240, 263)
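
The two text_end constraints listed under Parameters (same length as the text list, ascending order) can be checked before calling the model. The sketch below uses a hypothetical helper that is not part of the model's API, and it interprets "ascending" as strictly increasing, which is an assumption.

```python
from typing import List


def check_text_end(texts: List[str], text_end: List[int]) -> None:
    """Raise ValueError if text_end violates the documented constraints."""
    # Rule 1: one endpoint per text prompt.
    if len(text_end) != len(texts):
        raise ValueError(
            f"text_end has {len(text_end)} endpoints for {len(texts)} texts"
        )
    # Rule 2: endpoints in ascending order (assumed strictly increasing).
    for previous, current in zip(text_end, text_end[1:]):
        if current <= previous:
            raise ValueError(f"text_end is not ascending: {text_end}")


# Passes silently: 3 endpoints for 3 texts, in ascending order.
check_text_end(["walk", "turn", "sit"], [20, 40, 60])
```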

Update History

  • 2025/12/8: Added EMA smoothing option for joint positions during rendering

Citation

If you use this model in your research, please cite:

@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}

Troubleshooting

Common Issues

ImportError when trust_remote_code is missing:

# Solution: Add trust_remote_code=True
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True  # Required!
)

Out of Memory:

# Solution: Generate shorter sequences
motion = model("walk", length=30)  # Shorter = less memory

Slow first load: The first load downloads ~14GB of model files and may take 5-30 minutes depending on connection speed. Subsequent loads reuse the cached files and skip the download.
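
As a rough guide to where the 5-30 minute range comes from, the sketch below estimates download time from connection speed. It uses decimal units and ignores protocol overhead and server-side limits, so treat the results as lower bounds; the helper name is hypothetical.

```python
def download_minutes(size_gb: float, speed_mbps: float) -> float:
    """Estimate download time in minutes, ignoring protocol overhead."""
    if speed_mbps <= 0:
        raise ValueError("speed_mbps must be positive")
    megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return megabits / speed_mbps / 60


for speed in (62, 100, 400):
    print(f"{speed} Mbps: {download_minutes(14, speed):.1f} min")
# 62 Mbps: 30.1 min
# 100 Mbps: 18.7 min
# 400 Mbps: 4.7 min
```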

Module import errors: Ensure all dependencies are installed:

pip install lightning diffusers omegaconf ftfy numpy