One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing

Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by separating the two into distinct architectures, then bridging them post-hoc.

ByteDance's research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.

https://arxiv.org/pdf/2605.18678

What Lance Can Do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing — including multi-turn consistency editing across both modalities.

This breadth is what sets Lance apart. Standard unified architectures typically stop at basic image understanding and text-to-image generation; Lance is among the few that natively cover both images and videos across understanding and generation tasks.

How the Architecture Works

The architecture is based on two principles: unified context modeling and decoupled capability pathways.

For unified context, Lance converts all inputs — text, images, and videos — into a single shared interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All these heterogeneous token types — text, semantic visual, and latent visual — live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
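A rough sketch of two ideas from the paragraph above: how large the latent grid is after 16× spatial and 4× temporal downsampling, and how a mask can mix causal text attention with bidirectional visual attention. The function names, the "first frame encoded alone" rule for the causal VAE, and the rule that every token sees all earlier segments are assumptions made for illustration, not details confirmed by the source.

```python
def latent_grid(num_frames, height, width, spatial=16, temporal=4):
    """Latent (frames, rows, cols) for a causal video VAE (assumed convention)."""
    # Assumption: the first frame is encoded on its own and every further
    # group of `temporal` frames collapses into one latent frame.
    latent_frames = 1 + (num_frames - 1) // temporal
    return latent_frames, height // spatial, width // spatial


def build_attention_mask(segments):
    """segments: list of (kind, length), kind is "text" or "visual".

    Returns mask[i][j] == True when query token i may attend to key token j.
    """
    seg_of, kind_of = [], []
    for seg_index, (kind, length) in enumerate(segments):
        seg_of.extend([seg_index] * length)
        kind_of.extend([kind] * length)

    n = len(seg_of)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if seg_of[j] < seg_of[i]:
                # Assumed: earlier segments are always visible.
                mask[i][j] = True
            elif seg_of[j] == seg_of[i]:
                # Bidirectional inside a visual segment, causal inside text.
                mask[i][j] = kind_of[i] == "visual" or j <= i
    return mask


# A 121-frame 480x848 clip, the size used in the inference example later on.
print(latent_grid(121, 480, 848))  # (31, 30, 53)
```

With segments such as `[("text", 2), ("visual", 3), ("text", 2)]`, the three visual tokens see each other in both directions, while each text token sees only earlier tokens.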

For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLM_UND) handles text and semantic visual tokens, producing outputs for multimodal reasoning and text generation. The generation expert (LLM_GEN) handles VAE latent tokens for visual synthesis and editing. Crucially, both experts operate over the same shared interleaved sequence: they share context but don’t compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.
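To make the two objectives concrete, here is a minimal pure-Python sketch of a next-token cross-entropy term, a flow matching term on a straight interpolation path, and their weighted sum. The path convention (t = 0 is clean data, t = 1 is noise), the function names, and the default weights are assumptions for illustration; the source only states that the two losses are combined with configurable weights.

```python
import math


def cross_entropy(logits, target_index):
    """Next-token loss at one position: -log softmax(logits)[target_index]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def flow_matching_pair(clean, noise, t):
    """Return (x_t, velocity target) on a straight path between data and noise.

    Assumed convention: t = 0 gives the clean latent, t = 1 gives pure noise.
    """
    x_t = [(1 - t) * c + t * n for c, n in zip(clean, noise)]
    velocity = [n - c for c, n in zip(clean, noise)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def combined_loss(ntp_loss, fm_loss, w_und=1.0, w_gen=1.0):
    """Weighted sum of the understanding and generation losses."""
    return w_und * ntp_loss + w_gen * fm_loss


x_t, target_velocity = flow_matching_pair([1.0, 2.0], [3.0, 0.0], t=0.5)
fm = mse([0.0, 0.0], target_velocity)    # a model that predicts zero velocity
ntp = cross_entropy([0.0, 0.0], 0)       # uniform logits over two tokens
print(combined_loss(ntp, fm, w_und=1.0, w_gen=0.5))
```

In a real training loop the understanding expert would supply the logits and the generation expert the velocity prediction; only the loss arithmetic is sketched here.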

Modality-Aware Rotary Positional Encoding (MaPE)

Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone — it has no way to tell these token groups apart. When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment.

Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates stay unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate the token groups in the global positional space without disrupting temporal ordering within any individual video.
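The idea can be sketched as position-ID bookkeeping: each visual token group keeps its own (t, h, w) grid, and only the temporal coordinate is shifted by an amount that depends on the group's index. The offset value of 1000 and the helper names are hypothetical; the source does not give the actual offset or implementation.

```python
def grid_positions(frames, rows, cols):
    """Plain 3D positions (t, h, w) for one visual token group."""
    return [(t, h, w)
            for t in range(frames)
            for h in range(rows)
            for w in range(cols)]


def mape_positions(groups, offset=1000):
    """groups: list of (frames, rows, cols), in sequence order.

    Each group's temporal coordinate is shifted by group_index * offset.
    The offset value is a hypothetical placeholder.
    """
    shifted = []
    for group_index, (frames, rows, cols) in enumerate(groups):
        shift = group_index * offset
        shifted.append([(t + shift, h, w)
                        for (t, h, w) in grid_positions(frames, rows, cols)])
    return shifted


# One condition image (1 latent frame) followed by a 2-frame target video.
condition, target = mape_positions([(1, 2, 2), (2, 2, 2)])
print(condition[0], target[0], target[-1])
```

Without the shift, both groups would start at (0, 0, 0) and be indistinguishable by position; with it, spatial coordinates and the frame order inside each group are unchanged.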

Removing MaPE drops GenEval from 80.94 to 80.56 (reported here on a 0-100 scale), GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95, a consistent degradation across image generation, image editing, and video generation.

Training: Four Stages, One Unified Framework

Lance is trained through four sequential stages, each building on the last.

Pre-Training (PT) lays the foundation using approximately 1B image-text and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.

Continual Training (CT) expands the task space by introducing interleaved multi-task data — editing samples, subject-driven generation samples, and multimodal understanding data — across approximately 300B tokens. A progressive data-mixture schedule gradually increases the proportion of harder tasks like editing as training proceeds.
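One simple way to picture a progressive data-mixture schedule is a linear ramp between a starting and an ending set of sampling weights. The task names, the weights, and the linear shape below are all hypothetical; the source does not describe the actual schedule.

```python
# Hypothetical sampling weights at the start and end of the stage.
START = {"generation": 0.6, "understanding": 0.3, "editing": 0.1}
END = {"generation": 0.4, "understanding": 0.3, "editing": 0.3}


def mixture_at(progress, start, end):
    """Linearly interpolate task sampling weights for progress in [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)  # clamp out-of-range values
    raw = {task: (1 - progress) * start[task] + progress * end[task]
           for task in start}
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}


for step_fraction in (0.0, 0.5, 1.0):
    print(step_fraction, mixture_at(step_fraction, START, END))
```

The share of editing samples rises as training proceeds while the weights keep summing to one, which is the behavior the paragraph describes.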

Supervised Fine-Tuning (SFT) tightens instruction following, editing accuracy, and identity consistency using curated high-quality data across 72B tokens.

Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and image-text alignment.
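The core of GRPO is that each sample's reward is normalized against the other samples generated for the same prompt. The sketch below shows that normalization, plus a stand-in reward that compares recognized text with the requested text. The reward function is hypothetical: it does not call PaddleOCR, and the source does not specify how the OCR output is scored.

```python
import difflib
import statistics


def text_match_reward(recognized, target):
    """Hypothetical reward: string similarity in [0, 1] between OCR output
    and the text the prompt asked for."""
    return difflib.SequenceMatcher(None, recognized, target).ratio()


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples.

    Uses the population standard deviation; implementations differ on this.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four images sampled for one prompt that asks for the word "OPEN".
ocr_outputs = ["OPEN", "OPEM", "OPEN", "0PFN"]
rewards = [text_match_reward(text, "OPEN") for text in ocr_outputs]
print(group_relative_advantages(rewards))
```

Samples that render the text better than their group's average get a positive advantage and the rest a negative one, so no separate value model is needed.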

Everything fits within a maximum training budget of 128 GPUs.

Results

Image Generation. On GenEval, Lance scores 0.90 overall, matching TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling — though TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval; Show-o2 (7B) scores 0.76. Lance matches the top unified model score at 3B activated parameters.

Video Generation. On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also outscores dedicated generation-only models including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).

Image Editing. On GEdit-Bench, Lance scores 7.30 Avg/G_O, the highest among unified models. It leads in background change, material modification, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. Text modification is flagged as a remaining weakness.

Video Understanding. On MVBench, Lance achieves a 62.0 overall score, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters — notable given that it is simultaneously trained for generation and editing.

Marktechpost’s Visual Explainer

Step 01 — Prerequisites

Check Your Environment First

Before cloning the repository, confirm your system meets the minimum software and hardware requirements. Lance requires CUDA-capable hardware with significant VRAM.

Python: 3.10 or higher (required)
CUDA: 12.4 or higher (required)
GPU VRAM: 40 GB minimum (for inference)
License: Apache 2.0 (open-source)

Note: A GPU with at least 40 GB VRAM is required for running inference. CUDA 12.4+ is mandatory — lower versions are not officially supported.
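A small pre-flight check along these lines can catch a mismatch before anything is cloned or downloaded. The function names are hypothetical, and the CUDA version and VRAM figure are passed in by hand (for example, read from nvidia-smi) rather than detected automatically.

```python
import sys


def parse_version(text):
    """Turn "12.4" or "3.10.14" into a comparable (major, minor) tuple."""
    return tuple(int(part) for part in text.split(".")[:2])


def check_requirements(python_version, cuda_version, vram_gb):
    """Return a list of unmet requirements; an empty list means all clear."""
    problems = []
    if parse_version(python_version) < (3, 10):
        problems.append(f"Python {python_version} is older than 3.10")
    if parse_version(cuda_version) < (12, 4):
        problems.append(f"CUDA {cuda_version} is older than 12.4")
    if vram_gb < 40:
        problems.append(f"{vram_gb} GB VRAM is below the 40 GB minimum")
    return problems


current_python = f"{sys.version_info.major}.{sys.version_info.minor}"
# The CUDA version and VRAM size here are example values, not detected ones.
print(check_requirements(current_python, cuda_version="12.4", vram_gb=80))
```

Comparing versions as integer tuples avoids the string-comparison trap in which "3.9" sorts after "3.10".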

Step 02 — Clone the Repository

Clone from GitHub

Clone the official Lance repository from ByteDance on GitHub. The repository includes the inference scripts, Gradio interface, benchmark scripts, and model configuration files.

git clone https://github.com/bytedance/Lance
cd Lance

The repository structure you will see after cloning:

inference_lance.py: Main inference script for all tasks
inference_lance.sh: Shell wrapper with configurable parameters
lance_gradio_t2v_v2t.py: Gradio UI for T2V and V2T tasks
config/examples/: JSON example configs per task type

Step 03 — Install Dependencies

Install Required Packages

Install all Python dependencies from the provided requirements.txt file. It is strongly recommended to use a dedicated virtual environment or conda environment before installing.

# Create and activate a conda environment (recommended)
conda create -n lance-env python=3.10 -y
conda activate lance-env

# Install all dependencies
pip install -r requirements.txt

Tip: Using a clean conda environment prevents dependency conflicts with other projects on the same machine.

Step 04 — Download Model Weights

Download Lance-3B Checkpoints

Download all necessary model checkpoints from the official Hugging Face repository at bytedance-research/Lance. After downloading, place all files in the downloads/ directory inside your cloned repo.

# Install the Hugging Face CLI if not already installed
pip install huggingface_hub

# Download the model weights
huggingface-cli download bytedance-research/Lance \
  --local-dir downloads/

Your directory should look like this after downloading:

Lance/
└── downloads/
    └── Lance_3B_Video/    ◄ model weights go here

Note: Model weights are large files. Ensure you have sufficient disk space and a stable connection before downloading.
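If you prefer to script this step, the same download can be done from Python with `snapshot_download` from huggingface_hub, followed by a quick check that the expected folder is in place. The repo ID and folder name are taken from the text above; the helper names are hypothetical, and the check only confirms that the directory exists and is non-empty, not that the files are complete.

```python
from pathlib import Path


def download_weights(local_dir="downloads"):
    """Python equivalent of the CLI download (requires huggingface_hub)."""
    # Imported lazily so the check below works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id="bytedance-research/Lance",
                             local_dir=local_dir)


def weights_ready(repo_root, model_dir="Lance_3B_Video"):
    """True if <repo_root>/downloads/<model_dir> exists and is not empty."""
    target = Path(repo_root) / "downloads" / model_dir
    return target.is_dir() and any(target.iterdir())


# Run from inside the cloned Lance/ directory.
if not weights_ready("."):
    print("Weights not found under downloads/Lance_3B_Video; "
          "run download_weights() or the CLI command first.")
```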

Step 05 — Run Inference

Run Tasks via the CLI

Lance provides a unified command-line interface for all tasks via inference_lance.sh. Configure parameters at the top of the shell script before running. Supported tasks are listed below.

t2i: Text-to-image generation
t2v: Text-to-video generation
image_edit: Image editing from instruction
video_edit: Video editing from instruction
x2t_image: Image understanding / VQA
x2t_video: Video understanding / captioning

Example command for text-to-video generation at 480p:

bash inference_lance.sh \
  --TASK_NAME t2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 121 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/t2v
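For batch runs it can be handy to assemble this command in Python instead of editing it by hand. The sketch below only builds the argument list, using the flag names from the example command above. Whether inference_lance.sh accepts them as command-line flags or expects them to be set inside the script should be checked against the repository; the helper name is hypothetical.

```python
import shlex


def build_t2v_command(model_path, save_path,
                      num_frames=121, height=480, width=848,
                      resolution="video_480p"):
    """Assemble the text-to-video command as an argument list."""
    return [
        "bash", "inference_lance.sh",
        "--TASK_NAME", "t2v",
        "--MODEL_PATH", model_path,
        "--RESOLUTION", resolution,
        "--NUM_FRAMES", str(num_frames),
        "--VIDEO_HEIGHT", str(height),
        "--VIDEO_WIDTH", str(width),
        "--SAVE_PATH_GEN", save_path,
    ]


command = build_t2v_command("downloads/Lance_3B_Video", "results/t2v")
# Print a copy-pasteable version of the command for inspection.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` from inside the cloned repository would launch the job; keeping the arguments as a list avoids shell-quoting problems with paths that contain spaces.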

Step 06 — Gradio UI & Tips

Launch the Gradio Interface (Optional)

For a visual interface covering text-to-video and video-to-text tasks, Lance includes a ready-to-run Gradio app.

python lance_gradio_t2v_v2t.py

Prompt Tips

For all tasks, follow the prompt format used in the provided example configs under config/examples/. Using the recommended format typically leads to better generation quality.

x2t_image_example.json: Examples for image understanding and VQA
x2t_video_example.json: Examples for video understanding and captioning

Customize: You can modify TASK_DEFAULT_CONFIGS in inference_lance.py to set your own default data samples for each task type.

Key Takeaways

Lance is a natively unified multimodal model with 3B activated parameters that handles image and video understanding, generation, and editing within a single jointly trained framework.
A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE) decouples the understanding and generation pathways while keeping them in a shared interleaved multimodal context.
Lance scores 0.90 on GenEval and 85.11 on VBench (the highest Total Score among unified models), and was trained within a maximum budget of 128 GPUs.
On MVBench, Lance scores 62.0, the highest among unified models and ahead of Show-o2 (7B) at 55.7, while also supporting generation and editing.
Lance is open-source under Apache 2.0, with weights available on Hugging Face.
