Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by separating the two into distinct architectures, then bridging them post-hoc.
The ByteDance research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.
https://arxiv.org/pdf/2605.18678
What Lance Can Do
Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing — including multi-turn consistency editing across both modalities.
Few unified models cover this range. Standard unified architectures typically stop at basic image understanding and text-to-image generation, whereas Lance is among the few to natively span both images and videos across understanding and generation tasks.
How the Architecture Works
The architecture is based on two principles: unified context modeling and decoupled capability pathways.
For unified context, Lance converts all inputs — text, images, and videos — into a single shared interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All these heterogeneous token types — text, semantic visual, and latent visual — live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
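To make the mixed attention pattern concrete, here is a minimal sketch of one way such a mask could be built. It assumes each token carries a segment id and a flag marking it as visual: every token attends causally to earlier tokens, and visual tokens additionally attend to later tokens within their own segment. The function name, the segment layout, and the exact masking rule are illustrative assumptions, not taken from the Lance code.

```python
def build_mixed_attention_mask(segment_ids, is_visual):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed rule (illustrative, not from the Lance code):
      * every token may attend to itself and to all earlier tokens (causal);
      * two visual tokens in the same segment may attend to each other
        in both directions (bidirectional inside an image or video).
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_visual_block = (
                is_visual[q] and is_visual[k] and segment_ids[q] == segment_ids[k]
            )
            mask[q][k] = causal or same_visual_block
    return mask


# Toy sequence: 2 text tokens, a 3-token image, then 1 more text token.
segment_ids = [0, 0, 1, 1, 1, 2]
is_visual = [False, False, True, True, True, False]
mask = build_mixed_attention_mask(segment_ids, is_visual)

for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In the printed grid, the image tokens form a fully filled square block, while the text rows stay lower-triangular.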
For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLM_UND) handles text and semantic visual tokens, producing outputs for multimodal reasoning and text generation. The generation expert (LLM_GEN) handles VAE latent tokens for visual synthesis and editing. Crucially, both experts operate over the same shared interleaved sequence: they share context but don’t compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.
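The two training objectives can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the flow matching loss uses the common rectified-flow form (a straight path from noise to data with velocity target `clean - noise`), which the article does not confirm, and the function names and the weights `w_und` and `w_gen` are hypothetical.

```python
import math
import random


def next_token_loss(logits, targets):
    """Mean cross-entropy; logits is a list of per-position score lists."""
    total = 0.0
    for scores, target in zip(logits, targets):
        max_score = max(scores)  # subtract the max for numerical stability
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[target]
    return total / len(targets)


def flow_matching_loss(predict_velocity, clean_latent, noise, t):
    """MSE between predicted and target velocity.

    Assumes x_t = (1 - t) * noise + t * clean, so the target velocity is
    (clean - noise). The article only says "flow matching".
    """
    x_t = [(1 - t) * n + t * c for n, c in zip(noise, clean_latent)]
    target = [c - n for n, c in zip(noise, clean_latent)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def combined_loss(loss_und, loss_gen, w_und=1.0, w_gen=1.0):
    """Weighted sum of the understanding and generation losses."""
    return w_und * loss_und + w_gen * loss_gen


# Toy data: 2 text positions over a 3-word vocabulary, and a 3-dim latent.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
targets = [0, 2]
rng = random.Random(0)
clean = [0.5, -0.2, 1.0]
noise = [rng.gauss(0.0, 1.0) for _ in clean]


def zero_model(x_t, t):
    """Stand-in for the generation expert; always predicts zero velocity."""
    return [0.0 for _ in x_t]


loss_und = next_token_loss(logits, targets)
loss_gen = flow_matching_loss(zero_model, clean, noise, t=0.3)
print(combined_loss(loss_und, loss_gen, w_und=1.0, w_gen=0.5))
```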
Modality-Aware Rotary Positional Encoding (MaPE)
Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone — it has no way to tell these token groups apart. When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment.
Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates stay unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate the token groups in the global positional space without disrupting temporal ordering within any individual video.
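A minimal sketch of the idea, assuming each visual group gets `(t, h, w)` position ids and that group `g` is shifted along the temporal axis by `g * temporal_offset`. The offset value of 1000 and the function name are hypothetical; the article does not state the actual offset or how it is applied in the rotary embedding.

```python
def mape_position_ids(groups, temporal_offset=1000):
    """Build (t, h, w) position ids for each visual group in sequence order.

    groups: list of (num_frames, height, width) tuples, one per modality group.
    Only the temporal coordinate is shifted; h and w are left untouched.
    """
    position_ids = []
    for group_index, (frames, height, width) in enumerate(groups):
        shift = group_index * temporal_offset
        ids = [
            (t + shift, h, w)
            for t in range(frames)
            for h in range(height)
            for w in range(width)
        ]
        position_ids.append(ids)
    return position_ids


# Three groups sharing one sequence: ViT semantic tokens, clean VAE condition
# tokens, and noisy VAE target tokens (sizes are toy values).
groups = [(1, 2, 2), (1, 2, 2), (2, 2, 2)]

plain = mape_position_ids(groups, temporal_offset=0)  # plain 3D positions
shifted = mape_position_ids(groups)                   # with the MaPE-style shift

print("overlap without offset:", len(set(plain[0]) & set(plain[1])))
print("overlap with offset:   ", len(set(shifted[0]) & set(shifted[1])))
```

Without the offset, the first two groups receive identical position ids; with it, the groups occupy disjoint temporal ranges while the frame order inside each group is unchanged.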
Removing MaPE drops GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95, a consistent degradation across image generation, image editing, and video generation.
Training: Four Stages, One Unified Framework
Lance is trained through four sequential stages, each building on the last.
Pre-Training (PT) lays the foundation using approximately 1B image-text and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.
Continual Training (CT) expands the task space by introducing interleaved multi-task data — editing samples, subject-driven generation samples, and multimodal understanding data — across approximately 300B tokens. A progressive data-mixture schedule gradually increases the proportion of harder tasks like editing as training proceeds.
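One simple way to picture a progressive data-mixture schedule is linear interpolation between a starting and an ending set of sampling weights. The sketch below is only an illustration: the task names, the weight values, and the linear shape are all assumptions, since the article does not give the actual schedule.

```python
def mixture_weights(progress, start, end):
    """Interpolate sampling weights between two mixtures and renormalize.

    progress: fraction of the stage completed, clamped to [0, 1].
    start, end: dicts mapping task name -> sampling weight.
    """
    progress = min(max(progress, 0.0), 1.0)
    raw = {
        task: (1 - progress) * start[task] + progress * end[task]
        for task in start
    }
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}


# Hypothetical mixtures: editing starts rare and becomes common.
START = {"generation": 0.5, "understanding": 0.4, "editing": 0.1}
END = {"generation": 0.3, "understanding": 0.3, "editing": 0.4}

for progress in (0.0, 0.5, 1.0):
    weights = mixture_weights(progress, START, END)
    print(progress, {task: round(w, 3) for task, w in weights.items()})
```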
Supervised Fine-Tuning (SFT) tightens instruction following, editing accuracy, and identity consistency using curated high-quality data across 72B tokens.
Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and image-text alignment.
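The core of GRPO is that each sample's advantage is computed relative to the other samples generated for the same prompt. The sketch below shows that normalization step with a stand-in text reward. The reward function here compares strings with `difflib` and takes the recognized text as plain input; it is a hypothetical placeholder and does not call PaddleOCR or reproduce Lance's actual reward.

```python
import statistics
from difflib import SequenceMatcher


def text_match_reward(target_text, recognized_text):
    """Hypothetical reward: string similarity in [0, 1] between the text the
    prompt asked for and the text an OCR system read from the image."""
    return SequenceMatcher(None, target_text, recognized_text).ratio()


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled images; the strings stand in for OCR output.
target = "OPEN 24 HOURS"
recognized = ["OPEN 24 HOURS", "OPEN 24 HOUR5", "OPFN Z4 HOURS", "0PEN 24 H0URS"]

rewards = [text_match_reward(target, text) for text in recognized]
advantages = group_relative_advantages(rewards)

for text, reward, advantage in zip(recognized, rewards, advantages):
    print(f"{text!r}: reward={reward:.3f} advantage={advantage:+.3f}")
```

Samples that render the text better than the group average receive positive advantages, and the rest receive negative ones, so no separate value network is needed.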
Everything fits within a maximum training budget of 128 GPUs.
Results
Image Generation. On GenEval, Lance scores 0.90 overall, matching TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling — though TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval; Show-o2 (7B) scores 0.76. Lance matches the top unified model score at 3B activated parameters.
Video Generation. On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also outscores dedicated generation-only models including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).
Image Editing. On GEdit-Bench, Lance scores 7.30 Avg/G_O, the highest among unified models. It leads in background change, material modification, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. Text modification is flagged as a remaining weakness.
Video Understanding. On MVBench, Lance achieves a 62.0 overall score, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters — notable given that it is simultaneously trained for generation and editing.
Marktechpost’s Visual Explainer
Step 01 — Prerequisites
Check Your Environment First
Before cloning the repository, confirm your system meets the minimum software and hardware requirements. Lance requires CUDA-capable hardware with significant VRAM.
- Python: 3.10 or higher (required)
- CUDA: 12.4 or higher (required)
- GPU VRAM: 40 GB minimum (for inference)
- License: Apache 2.0 (open source)
Note: A GPU with at least 40 GB VRAM is required for running inference. CUDA 12.4+ is mandatory — lower versions are not officially supported.
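The requirements above can be checked with a short script. This sketch assumes PyTorch is (or will be) installed as one of the project's dependencies, which the article does not state explicitly; if it is missing, only the Python version is checked. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, not the system driver version.

```python
import sys

MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 4)
MIN_VRAM_GB = 40


def parse_version(text):
    """Turn '12.4' or '12.4.1' into (12, 4); a bare '12' becomes (12, 0)."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def check_environment():
    """Return a dict of pass/fail flags for the listed requirements."""
    report = {"python_ok": sys.version_info[:2] >= MIN_PYTHON}
    try:
        import torch  # assumption: PyTorch is the framework in use
    except ImportError:
        report["torch_installed"] = False
        return report

    report["torch_installed"] = True
    cuda_version = torch.version.cuda  # None on CPU-only builds
    report["cuda_ok"] = (
        cuda_version is not None and parse_version(cuda_version) >= MIN_CUDA
    )
    if torch.cuda.is_available():
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        report["vram_gb"] = total_bytes / 1e9  # decimal gigabytes
        report["vram_ok"] = report["vram_gb"] >= MIN_VRAM_GB
    else:
        report["vram_ok"] = False
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```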
Step 02 — Clone the Repository
Clone from GitHub
Clone the official Lance repository from ByteDance on GitHub. The repository includes the inference scripts, Gradio interface, benchmark scripts, and model configuration files.
git clone https://github.com/bytedance/Lance
cd Lance
The repository structure you will see after cloning:
- inference_lance.py: Main inference script for all tasks
- inference_lance.sh: Shell wrapper with configurable parameters
- lance_gradio_t2v_v2t.py: Gradio UI for T2V and V2T tasks
- config/examples/: JSON example configs per task type
Step 03 — Install Dependencies
Install Required Packages
Install all Python dependencies from the provided requirements.txt file. Creating a dedicated virtual environment or conda environment before installing is strongly recommended.
# Create and activate a conda environment (recommended)
conda create -n lance-env python=3.10 -y
conda activate lance-env
# Install all dependencies
pip install -r requirements.txt
Tip: Using a clean conda environment prevents dependency conflicts with other projects on the same machine.
Step 04 — Download Model Weights
Download Lance-3B Checkpoints
Download all necessary model checkpoints from the official Hugging Face repository at bytedance-research/Lance. After downloading, place all files in the downloads/ directory inside your cloned repo.
# Install the Hugging Face CLI if not already installed
pip install huggingface_hub
# Download the model weights
huggingface-cli download bytedance-research/Lance \
  --local-dir downloads/
Your directory should look like this after downloading:
Lance/
└── downloads/
    └── Lance_3B_Video/   ◄ model weights go here
Note: Model weights are large files. Ensure you have sufficient disk space and a stable connection before downloading.
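The same download can also be scripted from Python with `snapshot_download` from the `huggingface_hub` package. The sketch below uses the repository id and target directory named in this guide; the helper function names are hypothetical, and the `Lance_3B_Video` subfolder is taken from the directory layout shown above rather than verified against the repository.

```python
from pathlib import Path

REPO_ID = "bytedance-research/Lance"  # repository named in this guide
LOCAL_DIR = Path("downloads")


def download_weights(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download every file in the model repository into local_dir."""
    # Imported lazily so the module loads even before the package is installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


def expected_checkpoint_dir(local_dir=LOCAL_DIR):
    """Path where this guide expects the checkpoint folder to end up."""
    return Path(local_dir) / "Lance_3B_Video"


def checkpoint_present(local_dir=LOCAL_DIR):
    """True if the expected checkpoint folder exists on disk."""
    return expected_checkpoint_dir(local_dir).is_dir()
```

Calling `download_weights()` starts the transfer, and `checkpoint_present()` offers a quick check afterwards that the folder landed where the inference command expects it.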
Step 05 — Run Inference
Run Tasks via the CLI
Lance provides a unified command-line interface for all tasks via inference_lance.sh. Configure parameters at the top of the shell script before running. Supported tasks are listed below.
- t2i: Text-to-image generation
- t2v: Text-to-video generation
- image_edit: Image editing from instruction
- video_edit: Video editing from instruction
- x2t_image: Image understanding / VQA
- x2t_video: Video understanding / captioning
Example command for text-to-video generation at 480p:
bash inference_lance.sh \
  --TASK_NAME t2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 121 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/t2v
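For a sense of scale, the size of the latent grid for this 121-frame, 480×848 request can be estimated from the 16× spatial and 4× temporal downsampling factors quoted earlier. The sketch assumes the usual causal-VAE convention in which the first frame is encoded on its own and the remaining frames are grouped in fours; that convention, and whether any further patchification follows, is not confirmed by the article.

```python
def latent_grid(num_frames, height, width, spatial_factor=16, temporal_factor=4):
    """Estimate the (frames, height, width) shape of the VAE latent grid.

    Assumes the first frame maps to one latent frame and every following
    group of `temporal_factor` frames maps to one more.
    """
    latent_frames = 1 + (num_frames - 1) // temporal_factor
    latent_height = height // spatial_factor
    latent_width = width // spatial_factor
    return latent_frames, latent_height, latent_width


frames, height, width = latent_grid(num_frames=121, height=480, width=848)
print(f"latent grid: {frames} x {height} x {width}")
print(f"latent positions: {frames * height * width}")
```

Under these assumptions the request maps to a 31 × 30 × 53 latent grid, which also suggests why frame counts of the form 4k + 1 (such as 121) are a natural fit.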
Step 06 — Gradio UI & Tips
Launch the Gradio Interface (Optional)
For a visual interface covering text-to-video and video-to-text tasks, Lance includes a ready-to-run Gradio app.
python lance_gradio_t2v_v2t.py
Prompt Tips
For all tasks, follow the prompt format used in the provided example configs under config/examples/. Using the recommended format typically leads to better generation quality.
- x2t_image_example.json: Examples for image understanding and VQA
- x2t_video_example.json: Examples for video understanding and captioning
Customize: You can modify TASK_DEFAULT_CONFIGS in inference_lance.py to set your own default data samples for each task type.
Key Takeaways
- Lance is a natively unified multimodal model with 3B activated parameters that handles image and video understanding, generation, and editing within a single jointly trained framework.
- A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE) decouples understanding and generation pathways while keeping them in a shared interleaved multimodal context.
- Lance scores 0.90 on GenEval, matching the best unified model, and reaches a VBench Total Score of 85.11, the highest among unified models, all trained within a maximum budget of 128 GPUs.
- On MVBench, Lance scores 62.0, the highest among unified models — outperforming Show-o2 (7B) at 55.7, while also supporting generation and editing.
- Lance is open-source under Apache 2.0, with weights available on Hugging Face.

