Image Background Removal Research Summary

This article provides an overview of the research behind modern background removal. Understanding how image matting has evolved can be helpful when working with background removal in various applications.

When we talk about background removal, two techniques often come up: image segmentation and image matting. While they seem similar at a glance, they solve different problems.

Segmentation vs. Matting

  • Segmentation assigns a hard label (foreground or background) to each pixel.
  • Matting, on the other hand, estimates an alpha value (0 to 1) per pixel, representing how much a pixel belongs to the foreground. This is crucial for semi-transparent or soft edges like hair, smoke, or glass.
  Segmentation            | Image Matting
  Hard edges, binary mask | Soft transitions, alpha matte

This difference makes matting the go-to solution when quality matters, especially in photography, film, and advanced background removal workflows.
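
To make the difference concrete, an alpha matte composites a foreground over a background as I = alpha * F + (1 - alpha) * B. Below is a minimal NumPy sketch of that compositing step (array shapes and value ranges are assumptions, not tied to any particular model):

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend a foreground over a new background using an alpha matte.

    foreground, background: float arrays of shape (H, W, 3) in [0, 1]
    alpha: float array of shape (H, W) in [0, 1], where 1 = fully foreground
    """
    alpha = alpha[..., None]                      # broadcast over RGB channels
    return alpha * foreground + (1.0 - alpha) * background

# A binary segmentation mask is just the special case where alpha is 0 or 1:
# hard_alpha = (alpha > 0.5).astype(np.float32)
```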

Human-in-the-Loop vs. Fully Automatic Approaches

Many matting methods require guidance from the user, commonly in the form of a trimap, which defines:

  • Definite foreground (white)
  • Definite background (black)
  • Unknown area (gray), where the model must estimate the alpha matte
(Figure: image and trimap comparison)
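
In practice, a rough trimap is often derived from a binary segmentation mask by shrinking it for definite foreground and growing it for the outer boundary, with the band in between marked unknown. A minimal sketch using OpenCV morphology (the kernel size is an assumption and depends on image resolution):

```python
import cv2
import numpy as np

def mask_to_trimap(mask, kernel_size=15):
    """Derive a rough trimap from a binary mask (values 0 or 255).

    Eroding the mask gives definite foreground (255); dilating it gives the
    outer limit of the object; the band in between is marked unknown (128).
    """
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    fg = cv2.erode(mask, kernel, iterations=1)      # definite foreground
    dilated = cv2.dilate(mask, kernel, iterations=1)
    trimap = np.zeros_like(mask)                    # definite background
    trimap[dilated > 0] = 128                       # unknown band
    trimap[fg > 0] = 255                            # definite foreground
    return trimap
```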

While trimap-based models like DIM or IndexNet still lead in quality, they rely on user input—not ideal for large-scale or real-time scenarios.

Recent advances aim for fully automated matting, where no user annotation is needed. These "trimap-free" models (e.g., GFM, MODNet) are pushing the boundaries of real-time background removal.

Deep Learning Takes the Lead

Over the past few years, deep learning-based methods have surpassed traditional approaches in nearly every benchmark:

  • Higher accuracy on edges and transparent regions
  • Better generalization across diverse images
  • Real-time performance (even on mobile hardware)

Here is a timeline of the key advances in image background removal:

Deep Image Matting

Summary:

A deep learning approach for accurate alpha matte estimation, addressing challenges in image matting (e.g., overlapping foreground/background colors, complex textures). The method uses a two-stage neural network and introduces a large-scale synthetic dataset (Composition-1k) to improve generalization.

Architecture:

1. Encoder-Decoder Network (VGG-16 backbone) for initial alpha prediction; 2. Refinement Network (4-layer CNN) to enhance details

Input:

RGB image + trimap (user-defined or dilated from ground truth)

Requires Human in Loop:

Yes (trimap creation for inference)

Modality:

Image

Key Innovations:

  • Two-stage architecture: Encoder-decoder network + refinement network for sharp edges
  • Combined loss functions: an alpha-prediction loss plus a compositional loss that re-composites the image with the predicted alpha and penalizes the difference from the input composite.
  • Training images are synthetic composites (Composition-1k); the paper reports that the model nonetheless generalizes to real images.
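
As a rough illustration of how those two loss terms could look, here is a NumPy sketch of an alpha-prediction loss and a compositional loss that re-composites the image with the predicted alpha, both restricted to the unknown trimap region. The epsilon value and the equal weighting are assumptions, not the paper's exact settings:

```python
import numpy as np

EPS = 1e-6  # small smoothing constant (value assumed)

def alpha_loss(alpha_pred, alpha_gt, unknown_mask):
    """Charbonnier-style difference between predicted and ground-truth alpha,
    averaged over the unknown (gray) trimap region."""
    diff = np.sqrt((alpha_pred - alpha_gt) ** 2 + EPS)
    return (diff * unknown_mask).sum() / (unknown_mask.sum() + EPS)

def compositional_loss(alpha_pred, fg, bg, image, unknown_mask):
    """Re-composite the image with the predicted alpha and compare it to the
    original composite, again restricted to the unknown region."""
    recomposed = alpha_pred[..., None] * fg + (1.0 - alpha_pred[..., None]) * bg
    diff = np.sqrt(((recomposed - image) ** 2).sum(axis=-1) + EPS)
    return (diff * unknown_mask).sum() / (unknown_mask.sum() + EPS)

# total = alpha_loss(...) + compositional_loss(...)  # equal weighting assumed
```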

F, B, Alpha Matting

Summary:

A deep learning method for jointly predicting the alpha matte, foreground (F), and background (B) from an input image and trimap. The approach achieves state-of-the-art performance by introducing a low-cost modification to alpha matting networks, optimizing training regimes, and exploring novel loss functions for joint prediction.

Architecture:

U-Net-style encoder-decoder with ResNet-50 encoder, extended to output 7 channels (1 for α, 3 for F, 3 for B).

Input:

RGB image + trimap (user-defined or generated from ground truth)

Requires Human in Loop:

Yes (trimap creation for inference)

Modality:

Image

Key Innovations:

  • Single encoder-decoder architecture for simultaneous prediction of α, F, and B.
  • Trimap encoded as 9 channels (Gaussian blurs at three scales) for better spatial guidance.
  • Group Normalization with batch size 1 for improved accuracy.
  • Exclusion loss to prevent overlapping gradients in F and B predictions.
  • Fusion mechanism to refine predictions post-inference using Bayesian updates.
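
One plausible reading of the 9-channel trimap encoding above is that each of the three trimap regions is blurred with a Gaussian at three scales. A hedged SciPy sketch (the sigma values are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def encode_trimap(trimap, sigmas=(2, 4, 8)):
    """Encode a trimap (0 = background, 128 = unknown, 255 = foreground) as
    9 channels: each region mask blurred with a Gaussian at three scales."""
    masks = [(trimap == v).astype(np.float32) for v in (0, 128, 255)]
    channels = [gaussian_filter(m, sigma=s) for m in masks for s in sigmas]
    return np.stack(channels, axis=-1)   # shape (H, W, 9)
```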

Bridging Composite and Real: Towards End-to-end Deep Image Matting

Summary:

Proposes an end-to-end trimap-free matting model (GFM) that decomposes the task into semantic segmentation (Glance Decoder) and transition-area detail matting (Focus Decoder). Introduces the RSSN composition route to reduce domain gaps between synthetic and real-world data, and releases two high-quality datasets (AM-2k for animals, PM-10k for portraits) with manually labeled alpha mattes. Achieves SOTA results by explicitly modeling collaboration between decoders and addressing resolution/sharpness/noise discrepancies in composites.

Architecture:

Shared encoder (DenseNet-121/ResNet-34/101) with two decoders: 1) Glance Decoder (semantic segmentation via Pyramid Pooling Module); 2) Focus Decoder (transition-area matting via Bridge Block with dilated convolutions). Predictions merged via Collaborative Matting (CM).

Input:

Single RGB image (no trimap required)

Requires Human in Loop:

No

Modality:

Image

Key Innovations:

  • GFM: Shared encoder + dual decoders (Glance for semantics, Focus for details) trained collaboratively.
  • RSSN composition route: Blurs backgrounds, denoises inputs, adds uniform noise, and uses BG-20k (20k high-res clean backgrounds) to bridge domain gaps.
  • AM-2k and PM-10k datasets: 2,000 animal and 10,000 portrait images with high-quality alpha mattes.
  • Reduces generalization error by 60% compared to traditional composite training.
  • Trimap-free: Requires only an RGB image as input.
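
As a sketch of how a Collaborative Matting style merge could combine the two decoder outputs, assume the Glance Decoder predicts a three-class map (background, transition, foreground) and the Focus Decoder predicts alpha only for the transition band; the class ordering here is an assumption:

```python
import numpy as np

def collaborative_matting(glance_seg, focus_alpha):
    """Merge the two decoder outputs into a final alpha matte.

    glance_seg:  (H, W, 3) softmax over {background, transition, foreground}
    focus_alpha: (H, W) alpha predicted for the transition area only

    Outside the transition area we trust the Glance Decoder's hard labels;
    inside it we use the Focus Decoder's detailed alpha.
    """
    labels = glance_seg.argmax(axis=-1)            # 0 = bg, 1 = transition, 2 = fg
    alpha = (labels == 2).astype(np.float32)       # definite foreground -> 1
    transition = labels == 1
    alpha[transition] = focus_alpha[transition]    # detailed alpha in the band
    return alpha
```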

MODNet: Real-Time Trimap-Free Portrait Matting via Objective Decomposition

Summary:

A lightweight neural network for real-time, trimap-free portrait matting via objective decomposition. MODNet splits the task into three sub-objectives (semantic estimation, detail prediction, fusion) optimized simultaneously. Introduces e-ASPP for efficient multi-scale feature fusion and a self-supervised SOC strategy to handle domain shifts. Achieves 67 FPS on a 1080Ti GPU and outperforms prior trimap-free methods on Adobe Matting and PPM-100 benchmarks.

Architecture:

1. Semantic Branch (MobileNetV2 backbone + e-ASPP for coarse semantics); 2. Detail Branch (12-layer CNN for boundary refinement); 3. Fusion Branch (concatenates upsampled semantics and details to predict final alpha).

Input:

Single RGB image (no trimap required)

Requires Human in Loop:

No

Modality:

Image

Key Innovations:

  • Three-branch architecture: Semantic (MobileNetV2 + e-ASPP), Detail (high-res boundary refinement), and Fusion branches trained end-to-end.
  • Efficient ASPP (e-ASPP): Reduces ASPP parameters/FLOPs to 1% while retaining performance via depth-wise convolutions and channel compression.
  • Self-supervised SOC strategy: Enforces consistency between sub-objective predictions to adapt to real-world data without labels.
  • Achieves real-time performance (67 FPS).
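
To illustrate the SOC idea, the sketch below enforces agreement between the coarse semantic prediction and the downsampled final alpha on unlabeled real-world images. The actual strategy also constrains the detail branch, so this is a single simplified term:

```python
import torch
import torch.nn.functional as F

def soc_consistency_loss(coarse_semantics, alpha_pred):
    """Simplified sub-objectives consistency (SOC) term for unlabeled images.

    coarse_semantics: (N, 1, h, w) low-resolution semantic prediction
    alpha_pred:       (N, 1, H, W) final alpha from the fusion branch

    The final alpha is downsampled to the semantic resolution and the two
    predictions are pushed towards agreement.
    """
    target = F.interpolate(alpha_pred, size=coarse_semantics.shape[-2:],
                           mode="bilinear", align_corners=False)
    return F.l1_loss(coarse_semantics, target.detach())
```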

MatteFormer: Transformer-Based Image Matting via Prior-Tokens

Authors:

A transformer-based model for image matting that leverages prior-tokens to incorporate global context from trimaps. The model introduces a Prior-Attentive Swin Transformer (PAST) block, which uses prior-tokens to enhance the self-attention mechanism, allowing the model to attend to both local and global information. MatteFormer achieves state-of-the-art performance on standard image matting datasets.

Architecture:

1. Encoder: PAST blocks (modified Swin Transformer blocks with prior-tokens and prior-memory); 2. Decoder: Simple CNN-based structure with upsampling layers.

Input:

RGB image + trimap (user-defined or generated from ground truth alpha matte)

Requires Human in Loop:

Yes (trimap creation for inference)

Modality:

Image

Key Innovations:

  • Introduction of prior-tokens: Global representations of trimap regions (foreground, background, unknown) that participate in the self-attention mechanism.
  • Prior-Attentive Swin Transformer (PAST) block: Combines local spatial-tokens with global prior-tokens in the self-attention layer.
  • Prior-memory: Accumulates prior-tokens from previous blocks to refine global information across layers.
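
A hedged sketch of how prior-tokens could be computed by pooling spatial tokens over each trimap region; mean pooling and the label ordering are assumptions:

```python
import torch

def compute_prior_tokens(tokens, trimap_labels):
    """Pool spatial tokens into one prior-token per trimap region.

    tokens:        (N, L, C) flattened spatial tokens
    trimap_labels: (N, L) integer labels: 0 = background, 1 = unknown, 2 = foreground

    Each prior-token is the mean of the tokens belonging to one region; the
    three tokens can then be appended to the spatial tokens before self-attention.
    """
    priors = []
    for region in (0, 1, 2):
        mask = (trimap_labels == region).float().unsqueeze(-1)       # (N, L, 1)
        pooled = (tokens * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        priors.append(pooled)
    return torch.stack(priors, dim=1)                                 # (N, 3, C)
```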

Referring Image Matting

Summary:

Introduces Referring Image Matting (RIM), a new task that extracts meticulous alpha mattes of specific objects using natural language descriptions. Proposes CLIPMat, a vision-language baseline model with context-aware prompting and multi-level detail extraction. Releases RefMatte - the first large-scale dataset (47.5k synthetic images + 100 real-world images) with 474k language expressions and high-quality alpha mattes. Achieves 50-75% error reduction over segmentation methods.

Architecture:

CLIP text/image encoders (ViT-B/16 or ViT-L/14) → Context-embedded Prompt (CP) → TSP module (text-visual cross-attention) → Dual decoders: 1) Semantic decoder (trimap prediction), 2) Detail decoder (alpha matte) with MDE (shallow features + original image fusion)

Input:

RGB image + text description (keyword or expression)

Requires Human in Loop:

No

Modality:

Image, Text

Key Innovations:

  • Language-guided matting: Enables text-based control (keywords/expressions) instead of trimaps/scribbles
  • RefMatte dataset: 230 categories with spatial relationships (left/behind/etc.) and attribute-based expressions
  • CLIPMat architecture: Combines CLIP's vision-language backbone with 1) Context-embedded prompts (CP), 2) Text-driven semantic pop-up (TSP via cross-attention), 3) Multi-level details extractor (MDE)
  • Optional matting refiner improves edge details (85.83 SAD on real-world data)
  • First work to bridge vision-language tasks with high-precision matting
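
For intuition, text-driven semantic pop-up can be sketched as cross-attention from visual tokens (queries) to text tokens (keys and values). The block below is a generic PyTorch cross-attention layer in that spirit, not CLIPMat's exact TSP module; dimensions and layer choices are assumptions:

```python
import torch
import torch.nn as nn

class TextVisualCrossAttention(nn.Module):
    """Minimal text-to-visual cross-attention block, in the spirit of the
    TSP module described above."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (N, L_v, C) patch features from the image encoder
        # text_tokens:   (N, L_t, C) token features from the text encoder
        attended, _ = self.attn(query=visual_tokens, key=text_tokens,
                                value=text_tokens)
        return self.norm(visual_tokens + attended)   # residual + norm
```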

Try Our Background Removal Tool

Experience our state-of-the-art background removal technology based on the research above.