FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

Institute of Information Engineering, Chinese Academy of Sciences; Meituan; Beihang University

Image Editing Examples

Abstract

Introducing user-specified visual concepts in image editing is highly practical as these concepts convey the user's intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based image editing, which can accurately reproduce the visual concept from the reference image based on user-friendly language instructions. Our approach leverages the multi-modal instruction encoder to encode language instructions to guide the editing process. This implicit way of locating the editing area eliminates the need for manual editing masks. To enhance the reconstruction of reference details, we introduce the Decoupled Residual ReferAttention (DRRA) module. This module is designed to integrate fine-grained reference features extracted by a detail extractor into the image editing process in a residual way without interfering with the original self-attention. Given that existing datasets are unsuitable for reference-based image editing tasks, particularly due to the difficulty in constructing image triplets that include a reference image, we curate a high-quality dataset, FreeBench, using a newly developed twice-repainting scheme. FreeBench comprises the images before and after editing, detailed editing instructions, as well as a reference image that maintains the identity of the edited object, encompassing tasks such as object addition, replacement, and deletion. By conducting phased training on FreeBench followed by quality tuning, FreeEdit achieves high-quality zero-shot editing through convenient language instructions. We conduct extensive experiments to evaluate the effectiveness of FreeEdit across multiple task types, demonstrating its superiority over existing methods.

Pipeline


Pipeline: FreeEdit consists of three components: (a) Multi-modal instruction encoder. (b) Detail extractor. (c) Denoising U-Net. The text instruction and reference image are first fed into the multi-modal instruction encoder to generate the multi-modal instruction embedding. The reference image is additionally fed into the detail extractor to obtain fine-grained features. The original image latent is concatenated with the noise latent to introduce the original-image condition. The denoising U-Net accepts this 8-channel input and interacts with the multi-modal instruction embedding through cross-attention. The DRRA modules, which connect the detail extractor and the denoising U-Net, integrate fine-grained features from the detail extractor to promote ID consistency with the reference image. (d) Editing examples obtained with FreeEdit.
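To make the role of the DRRA modules concrete, below is a minimal PyTorch-style sketch of how decoupled residual reference attention could be wired into a U-Net attention block. The class name DRRABlock, the attribute ref_scale (the reference scale λ), and all shapes are illustrative assumptions, not the authors' released code; the point of the sketch is that the original self-attention path is left untouched and the reference features contribute only through an additive, scaled residual branch.

```python
import torch
import torch.nn as nn


class DRRABlock(nn.Module):
    """Illustrative sketch of a Decoupled Residual ReferAttention (DRRA) block.

    The original self-attention path is kept as-is; fine-grained reference
    features from the detail extractor enter through a separate attention
    branch whose output is added back as a residual, scaled by ``ref_scale``
    (the λ in the paper). Names and shapes are assumptions for illustration.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Decoupled branch: queries come from the U-Net hidden states,
        # keys/values come from the reference features.
        self.ref_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ref_scale = 1.0  # λ; set to 0.0 for reference-free editing

    def forward(self, hidden: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # Original self-attention over the U-Net hidden states (unchanged).
        out, _ = self.self_attn(hidden, hidden, hidden)
        # Residual reference path: attend from hidden states to reference features.
        ref_out, _ = self.ref_attn(hidden, ref_feats, ref_feats)
        return out + self.ref_scale * ref_out


# Toy usage with made-up shapes: batch of 2, 64 spatial tokens, 320 channels,
# and 77 reference tokens of the same channel width.
block = DRRABlock(dim=320)
hidden = torch.randn(2, 64, 320)
ref_feats = torch.randn(2, 77, 320)
print(block(hidden, ref_feats).shape)  # torch.Size([2, 64, 320])
```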

Dataset

To address the lack of suitable datasets, we developed FreeBench, a high-quality dataset designed to support reference-based, instruction-driven editing. Previous studies have struggled to construct image triplets comprising the original, edited, and reference images, because it is difficult to maintain ID consistency between the reference image and the edited image. To resolve this, we implement a twice-repainting construction scheme, built on an existing real-world segmentation dataset, that ensures identity consistency between the edited object and the reference.

Pipeline for dataset construction and examples of training samples. (a) Image triplet construction. We repaint the source image from an existing real-world segmentation dataset twice to form the image triplet. (b) Instruction construction. We use multiple powerful MLLMs to caption the generated image and combine the resulting local descriptions with instruction templates to form edit instructions. (c) Examples from the training dataset. Each item contains the images before and after editing and a multi-modal instruction.
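As a rough illustration of step (b), the snippet below sketches how MLLM-generated local descriptions might be merged with task-specific templates. The concrete template wordings and the function name build_instruction are assumptions for illustration; the page only states that local descriptions and instruction templates are combined.

```python
import random

# Assumed template wordings; the actual FreeBench templates are not listed on this page.
TEMPLATES = {
    "add": ["Add {desc} to the image.", "Put {desc} into the scene."],
    "replace": ["Replace the {target} with {desc}.", "Swap the {target} for {desc}."],
    "remove": ["Remove the {target} from the image.", "Delete the {target}."],
}


def build_instruction(task: str, target: str = "", desc: str = "") -> str:
    """Combine an MLLM local description with a task-specific template."""
    template = random.choice(TEMPLATES[task])
    return template.format(target=target, desc=desc)


print(build_instruction("replace", target="dog", desc="a fluffy white cat"))
# e.g. "Replace the dog with a fluffy white cat."
```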

Statistics for the FreeBench dataset. The four largest parent classes in FreeBench are animals, food, kitchenware, and vehicles. FreeBench covers the vast majority of categories encountered in daily life, allowing us to train a generalizable zero-shot reference-based image editing model.

Reference-based Image Editing

Qualitative comparisons of FreeEdit with previous methods, including the mask-based methods PaintByExample, AnyDoor, and MimicBrush, and the mask-free methods InstructPix2Pix and Kosmos-G. Mask-based methods require the user to manually provide the mask of the editing area. AnyDoor additionally needs a foreground mask of the reference image, and the editing mask fed to AnyDoor is processed as a box because its training is box-based. Mask-free methods are language-based and do not require additional mask input; we take InstructPix2Pix with detailed instructions as a baseline for comparison. The inputs required by each method are marked below each row of images. S* denotes the specific visual concept in the reference image, and O* denotes the original image to be edited.

Object Removal

Previous reference-based inpainting methods use image features as a control condition for cross-attention, which leaves the trained model unable to support other tasks. In contrast, thanks to its flexible reference attention and multi-modal language instructions, FreeEdit can support the reference-free object removal task by simply setting the reference scale λ in DRRA to 0.
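Continuing the illustrative DRRABlock sketch from the pipeline section, switching into this reference-free mode could look like the following; the helper name set_reference_scale and the unet variable are hypothetical, not part of a released API.

```python
import torch.nn as nn


def set_reference_scale(model: nn.Module, scale: float) -> None:
    """Set the DRRA residual scale λ on every DRRA block inside ``model``.

    ``DRRABlock`` refers to the illustrative sketch above, not released code.
    """
    for module in model.modules():
        if isinstance(module, DRRABlock):
            module.ref_scale = scale


# set_reference_scale(unet, 0.0)  # λ = 0: ignore the reference, plain text-driven removal
# set_reference_scale(unet, 1.0)  # restore reference-based editing
```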

A qualitative comparison of FreeEdit with SD-Inpainting and InstructPix2Pix on the object removal task. SD-Inpainting requires the mask of the object to be deleted, whereas FreeEdit and InstructPix2Pix perform object removal from language instructions alone, without masks. The input mask for SD-Inpainting is highlighted in green in the 2nd column.

Plain-text instruction-driven Editing

As with the object removal task, FreeEdit also supports a broader range of plain-text instruction-driven editing that is not limited to a specific edit type, since our multi-modal instructions are closely related to text instructions.

Visual comparison of FreeEdit with InstructPix2Pix and MagicBrush on the plain-text instruction-driven editing task. The text instruction used for each edit case is shown below the corresponding images.

Mask-free Virtual Try-on

As a versatile reference-based image editing model, FreeEdit can be extended to the virtual try-on task. Unlike previous methods, FreeEdit simplifies the virtual try-on inference pipeline: users can perform the task with a concise multi-modal instruction such as "replace her top with S*", which matches natural usage habits and requires no manual mask.

BibTeX

@misc{he2024freeedit,
      title={FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction}, 
      author={Runze He and Kai Ma and Linjiang Huang and Shaofei Huang and Jialin Gao and Xiaoming Wei and Jiao Dai and Jizhong Han and Si Liu},
      year={2024},
      eprint={2409.18071},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.18071}, 
}