MagicQuill V2

Precise and Interactive Image Editing with Layered Visual Cues

Zichen Liu*¹,², Yue Yu*¹,², Hao Ouyang², Qiuyu Wang², Shuailei Ma²,³, Ka Leong Cheng², Wen Wang²,⁴, Qingyan Bai¹,², Yuxuan Zhang⁵, Yanhong Zeng², Yixuan Li²,⁵, Xing Zhu², Yujun Shen², Qifeng Chen¹

¹HKUST, ²Ant Group, ³NEU, ⁴ZJU, ⁵CUHK

Abstract

We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.

Layered Editing Examples

Content Layer

[Figure: four examples, each showing a layered composition and its editing result.]

Editing Prompt: A girl sits and embraces her dog, with her vintage car parked near by.

Editing Prompt: A man in cloak rides a white horse through a serene lake at dawn.

Editing Prompt: A group photo of 4 people in a sunny day.

Editing Prompt: A shot of a modern industrial living room with furnitures.

Control Layer

Structural Layer: Edge

[Figure: four examples, each showing a layered composition and its editing result.]

Editing Prompt: Add a leather collar with bell to the dog.

Editing Prompt: Make this cup extra large.

Editing Prompt: Add a flower in this hand.

Editing Prompt: Let him wear a golden dollar-shape glasses.

Color Layer: Color

[Figure: four examples, each showing a layered composition and its editing result.]

Editing Prompt: Turn her T-shirt's color to deep green.

Editing Prompt: Turn his T-shirt into vibrant, seamless multi-color horizontal gradient

Editing Prompt: Turn the flowers into blue and red.

Editing Prompt: Turn her crown into blue diamond.

Spatial Layer: Local Edit

[Figure: four examples, each showing a layered composition and its editing result.]

Editing Prompt: Turn the dog's head to face the camera.

Editing Prompt: Convert the man into black and white lineart style.

Editing Prompt: Add candles on the cake and a fork by the cake.

Editing Prompt: Change the text from "NanoBanana" to "MagicQuill", and the banana neon icon to a feather quill icon.

Spatial Layer: Removal

[Figure: four examples, each showing a layered composition and its editing result. All four use the same editing prompt.]

Editing Prompt: Remove the instance.

Tutorial

System Overview

The Toolbar (A) features a new Local Edit Brush for defining the target editing area, along with tools from MagicQuill V1.

The Visual Cue Manager (B) holds all content layer visual cues (foreground props) that users can drag onto the canvas to define what to generate.

Users can refine these cues in the Image Segmentation Panel (C), opened by clicking the segment icon. The panel supports precise object extraction from dots or bounding boxes and is powered by SAM.

Image Segmentation

Click the segment icon to enter the segmentation UI. Users can perform four operations:

Add positive dots to indicate areas to include.
Add negative dots to indicate areas to exclude.
Add a bounding box to bound the region of interest.
Use the eraser to refine the result and erase unwanted areas.

After segmentation, click the Save/Save as new prop button to add the foreground prop to the Visual Cue Manager, or fill the segmented region with any brush.
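The four segmentation operations listed above can be pictured as a small data structure. The sketch below is illustrative only and is not taken from the MagicQuill V2 code: the `SegmentationPrompt` class, the `to_sam_inputs` method, and the `erase` helper are hypothetical names. The output keys mirror the point-and-box prompt convention commonly used with SAM (label 1 for a positive dot, label 0 for a negative dot, box given as x_min, y_min, x_max, y_max); how the panel actually passes clicks to SAM is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[int, int]          # (x, y) in pixel coordinates
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


@dataclass
class SegmentationPrompt:
    """Hypothetical container for the clicks a user makes in the panel."""
    positive: List[Point] = field(default_factory=list)  # areas to include
    negative: List[Point] = field(default_factory=list)  # areas to exclude
    box: Optional[Box] = None                            # region of interest

    def to_sam_inputs(self) -> dict:
        """Flatten clicks into coordinate and label lists (1 = include, 0 = exclude)."""
        coords = [list(p) for p in self.positive + self.negative]
        labels = [1] * len(self.positive) + [0] * len(self.negative)
        return {
            "point_coords": coords or None,
            "point_labels": labels or None,
            "box": list(self.box) if self.box is not None else None,
        }


def erase(mask: List[List[int]], erased_pixels: List[Point]) -> List[List[int]]:
    """Return a copy of a binary mask (list of rows) with the given (x, y) pixels cleared."""
    cleaned = [row[:] for row in mask]
    for x, y in erased_pixels:
        cleaned[y][x] = 0
    return cleaned


# Example: one positive dot, one negative dot, and a bounding box.
prompt = SegmentationPrompt(
    positive=[(120, 80)],
    negative=[(30, 40)],
    box=(10, 10, 200, 160),
)
inputs = prompt.to_sam_inputs()

# Example: the eraser clears one pixel of a 2x2 mask predicted by the model.
refined = erase([[1, 1], [1, 1]], [(1, 0)])
```

Keeping the eraser as a separate post-processing step on the mask, rather than as another prompt, reflects that it refines the result directly instead of asking the model to re-segment.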

Layer Operations

1. Content Layer

Click a foreground prop in the Visual Cue Manager to add it to the canvas for puzzle-like editing, then use the Local Edit Brush to specify the edit location. The result respects the user-provided foreground props.
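As a rough illustration of the puzzle-like placement described above, the sketch below pastes a foreground prop onto a canvas and records which pixels it covers. It is a minimal sketch under stated assumptions, not the system's implementation: `place_prop`, the list-of-rows image representation, and the use of `None` for transparent pixels are all hypothetical. The returned mask covers only the prop's footprint; the area painted with the Local Edit Brush is a separate input that this sketch does not model.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]        # (R, G, B)
Image = List[List[Pixel]]           # rows of pixels
Prop = List[List[Optional[Pixel]]]  # None marks a transparent pixel


def place_prop(canvas: Image, prop: Prop, left: int, top: int):
    """Paste `prop` onto a copy of `canvas`; return the composite and a coverage mask."""
    height, width = len(canvas), len(canvas[0])
    composite = [row[:] for row in canvas]
    mask = [[0] * width for _ in range(height)]
    for dy, prop_row in enumerate(prop):
        for dx, pixel in enumerate(prop_row):
            x, y = left + dx, top + dy
            if pixel is None or not (0 <= x < width and 0 <= y < height):
                continue  # skip transparent or off-canvas pixels
            composite[y][x] = pixel
            mask[y][x] = 1
    return composite, mask


# Example: a 2x2 prop with one transparent pixel, dropped on a 4x3 black canvas.
canvas = [[(0, 0, 0)] * 4 for _ in range(3)]
prop = [
    [(255, 0, 0), None],
    [(255, 0, 0), (255, 0, 0)],
]
composite, mask = place_prop(canvas, prop, left=2, top=1)
```

The original canvas is left untouched, so a prop can be dragged to a new position by calling the function again with different offsets.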

Layered Composition