The Evolution of Image Background Removal Tech

Discover how AI learned to separate subjects from backgrounds - from early manual tools to modern AI that understands language!

Deep Image Matting

Summary:

A deep learning approach for accurate alpha matte estimation, addressing challenges in image matting (e.g., overlapping foreground/background colors, complex textures). The method uses a two-stage neural network and introduces a large-scale synthetic dataset (Composition-1k) to improve generalization.

Architecture:

1. Encoder-Decoder Network (VGG-16 backbone) for initial alpha prediction; 2. Refinement Network (4-layer CNN) to enhance details

Input:

RGB image + trimap (user-defined or dilated from ground truth)

Requires Human in the Loop:

Yes (trimap creation for inference)

Modality:

Image

Key Innovations:

  • Two-stage architecture: Encoder-decoder network + refinement network for sharp edges
  • Combined loss functions: an alpha-prediction loss plus a compositional loss that re-composites the image from the predicted alpha and the ground-truth foreground/background and penalizes the difference from the real composite (see the sketch after this list).
  • Training images are composited from foreground objects with known alpha mattes onto varied backgrounds; the paper reports that the resulting model still generalizes to real photographs.
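To make the combined loss concrete, here is a minimal PyTorch-style sketch of the two terms. It is a simplification: the paper weights the losses and restricts them to the unknown region of the trimap, which the uniform averaging below ignores.

```python
import torch

def alpha_prediction_loss(alpha_pred, alpha_gt, eps=1e-6):
    # Charbonnier-style difference between predicted and ground-truth alpha.
    return torch.sqrt((alpha_pred - alpha_gt) ** 2 + eps).mean()

def compositional_loss(alpha_pred, fg, bg, image, eps=1e-6):
    # Re-composite with the predicted alpha (C = alpha * F + (1 - alpha) * B)
    # and compare against the observed composite image.
    comp = alpha_pred * fg + (1.0 - alpha_pred) * bg
    return torch.sqrt((comp - image) ** 2 + eps).mean()
```

Because the compositional term supervises alpha only through the compositing equation, it pushes the network toward alphas that actually reproduce the observed image rather than just matching the ground-truth matte pixel by pixel.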

F, B, Alpha Matting

Summary:

A deep learning method for jointly predicting the alpha matte, foreground (F), and background (B) from an input image and trimap. The approach achieves state-of-the-art performance by introducing a low-cost modification to alpha matting networks, optimizing training regimes, and exploring novel loss functions for joint prediction.

Architecture:

U-Net-style encoder-decoder with ResNet-50 encoder, extended to output 7 channels (1 for α, 3 for F, 3 for B).

Input:

RGB image + trimap (user-defined or generated from ground truth)

Requires Human in the Loop:

Yes (trimap creation for inference)

Modality:

Image

Key Innovations:

  • Single encoder-decoder architecture for simultaneous prediction of α, F, and B.
  • Trimap encoded as 9 channels (Gaussian blurs at three scales) for better spatial guidance; a rough encoding sketch follows this list.
  • Group Normalization with batch size 1 for improved accuracy.
  • Exclusion loss to prevent overlapping gradients in F and B predictions.
  • Fusion mechanism to refine predictions post-inference using Bayesian updates.
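As a rough illustration of the 9-channel trimap encoding mentioned above, the sketch below blurs one-hot foreground/background/unknown masks at three scales with a separable Gaussian; the sigma values and kernel size are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def encode_trimap(trimap, sigmas=(2, 4, 8)):
    # trimap: (B, 1, H, W) with values {0: background, 0.5: unknown, 1: foreground}.
    fg = (trimap == 1.0).float()
    bg = (trimap == 0.0).float()
    unk = 1.0 - fg - bg
    channels = []
    for sigma in sigmas:
        k = 4 * sigma + 1                                  # odd kernel wide enough for the blur
        x = torch.arange(k) - k // 2
        g = torch.exp(-x.float() ** 2 / (2 * sigma ** 2))
        g = (g / g.sum()).view(1, 1, 1, k)
        for mask in (fg, bg, unk):
            blurred = F.conv2d(mask, g, padding=(0, k // 2))                     # horizontal pass
            blurred = F.conv2d(blurred, g.transpose(2, 3), padding=(k // 2, 0))  # vertical pass
            channels.append(blurred)
    return torch.cat(channels, dim=1)                      # (B, 9, H, W), concatenated with RGB downstream
```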

Bridging Composite and Real: Towards End-to-end Deep Image Matting

Summary:

Proposes an end-to-end trimap-free matting model (GFM) that decomposes the task into semantic segmentation (Glance Decoder) and transition-area detail matting (Focus Decoder). Introduces the RSSN composition route to reduce domain gaps between synthetic and real-world data, and releases two high-quality datasets (AM-2k for animals, PM-10k for portraits) with manually labeled alpha mattes. Achieves SOTA results by explicitly modeling collaboration between decoders and addressing resolution/sharpness/noise discrepancies in composites.

Architecture:

Shared encoder (DenseNet-121/ResNet-34/101) with two decoders: 1) Glance Decoder (semantic segmentation via Pyramid Pooling Module); 2) Focus Decoder (transition-area matting via Bridge Block with dilated convolutions). Predictions merged via Collaborative Matting (CM).

Input:

Single RGB image (no trimap required)

Requires Human in the Loop:

No

Modality:

Image

Key Innovations:

  • GFM: Shared encoder + dual decoders (Glance for semantics, Focus for details) trained collaboratively; their outputs are merged by Collaborative Matting, as sketched after this list.
  • RSSN composition route: Blurs backgrounds, denoises inputs, adds uniform noise, and uses BG-20k (20k high-res clean backgrounds) to bridge domain gaps.
  • AM-2k and PM-10k datasets: 2,000 animal and 10,000 portrait images with high-quality alpha mattes.
  • Reduces generalization error by 60% compared to traditional composite training.
  • Trimap-free: Requires only an RGB image as input.
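A minimal sketch of the Collaborative Matting merge, assuming the glance decoder emits a 3-class map (background / transition / foreground) and the focus decoder emits alpha only for the transition area; the class ordering and hard argmax are illustrative assumptions.

```python
import torch

def collaborative_matting(glance_logits, focus_alpha):
    # glance_logits: (B, 3, H, W) scores for {0: background, 1: transition, 2: foreground}
    # focus_alpha:   (B, 1, H, W) alpha predicted for the transition region
    seg = glance_logits.argmax(dim=1, keepdim=True)
    alpha = torch.zeros_like(focus_alpha)
    alpha[seg == 2] = 1.0                      # definite foreground comes from the glance decoder
    alpha[seg == 1] = focus_alpha[seg == 1]    # fine detail comes from the focus decoder
    return alpha                               # definite background stays at 0
```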

MODNet: Real-Time Trimap-Free Portrait Matting via Objective Decomposition

Summary:

A lightweight neural network for real-time, trimap-free portrait matting via objective decomposition. MODNet splits the task into three sub-objectives (semantic estimation, detail prediction, fusion) optimized simultaneously. Introduces e-ASPP for efficient multi-scale feature fusion and a self-supervised SOC strategy to handle domain shifts. Achieves 67 FPS on a 1080Ti GPU and outperforms prior trimap-free methods on Adobe Matting and PPM-100 benchmarks.

Architecture:

1. Semantic Branch (MobileNetV2 backbone + e-ASPP for coarse semantics); 2. Detail Branch (12-layer CNN for boundary refinement); 3. Fusion Branch (concatenates upsampled semantics and details to predict final alpha).

Input:

Single RGB image (no trimap required)

Requires Human in the Loop:

No

Modality:

Image

Key Innovations:

  • Three-branch architecture: Semantic (MobileNetV2 + e-ASPP), Detail (high-res boundary refinement), and Fusion branches trained end-to-end.
  • Efficient ASPP (e-ASPP): Reduces ASPP parameters/FLOPs to 1% while retaining performance via depth-wise convolutions and channel compression.
  • Self-supervised SOC strategy: Enforces consistency between sub-objective predictions to adapt to real-world data without labels (a simplified sketch follows this list).
  • Achieves real-time performance (67 FPS on a 1080Ti GPU).
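The sketch below gives one simplified reading of the SOC idea on unlabeled real images: the coarse semantic output should agree with a downscaled version of the fused alpha, and the detail output should agree with the alpha near object boundaries. The exact loss terms, scales, and boundary definition used by MODNet differ, so treat this as an assumption-laden illustration.

```python
import torch
import torch.nn.functional as F

def soc_consistency_loss(semantic_pred, detail_pred, alpha_pred, boundary_mask):
    # semantic_pred: (B, 1, h, w) coarse low-resolution semantics
    # detail_pred:   (B, 1, H, W) boundary detail prediction
    # alpha_pred:    (B, 1, H, W) fused alpha prediction
    # boundary_mask: (B, 1, H, W) 1 near the predicted foreground boundary, else 0
    alpha_low = F.interpolate(alpha_pred, size=semantic_pred.shape[-2:],
                              mode="bilinear", align_corners=False)
    semantic_term = F.l1_loss(alpha_low, semantic_pred)                      # semantics vs. downscaled alpha
    detail_term = (boundary_mask * (detail_pred - alpha_pred).abs()).mean()  # detail vs. alpha near edges
    return semantic_term + detail_term
```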

MatteFormer: Transformer-Based Image Matting via Prior-Tokens

Summary:

A transformer-based model for image matting that leverages prior-tokens to incorporate global context from trimaps. The model introduces a Prior-Attentive Swin Transformer (PAST) block, which uses prior-tokens to enhance the self-attention mechanism, allowing the model to attend to both local and global information. MatteFormer achieves state-of-the-art performance on standard image matting datasets.

Architecture:

1. Encoder: PAST blocks (modified Swin Transformer blocks with prior-tokens and prior-memory); 2. Decoder: Simple CNN-based structure with upsampling layers.

Input:

RGB image + trimap (user-defined or generated from ground truth alpha matte)

Requires Human in the Loop:

Yes (trimap creation for inference)

Modality:

Image

Key Innovations:

  • Introduction of prior-tokens: Global representations of trimap regions (foreground, background, unknown) that participate in the self-attention mechanism (see the pooling sketch after this list).
  • Prior-Attentive Swin Transformer (PAST) block: Combines local spatial-tokens with global prior-tokens in the self-attention layer.
  • Prior-memory: Accumulates prior-tokens from previous blocks to refine global information across layers.
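A minimal sketch of how one prior-token per trimap region could be formed by masked average pooling of the spatial tokens; the pooling choice and the trimap encoding are assumptions rather than the paper's exact formulation.

```python
import torch

def compute_prior_tokens(features, trimap):
    # features: (B, C, H, W) spatial tokens from a transformer stage
    # trimap:   (B, 1, H, W) with values {0: background, 0.5: unknown, 1: foreground}
    tokens = []
    for value in (1.0, 0.0, 0.5):                            # foreground, background, unknown
        mask = (trimap == value).float()
        pooled = (features * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1.0)
        tokens.append(pooled)                                # (B, C) global token for this region
    return torch.stack(tokens, dim=1)                        # (B, 3, C), used alongside spatial tokens in self-attention
```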

Referring Image Matting

Summary:

Introduces Referring Image Matting (RIM), a new task that extracts meticulous alpha mattes of specific objects using natural language descriptions. Proposes CLIPMat, a vision-language baseline model with context-aware prompting and multi-level detail extraction. Releases RefMatte - the first large-scale dataset (47.5k synthetic images + 100 real-world images) with 474k language expressions and high-quality alpha mattes. Achieves 50-75% error reduction over segmentation methods.

Architecture:

CLIP text/image encoders (ViT-B/16 or ViT-L/14) → Context-embedded Prompt (CP) → TSP module (text-visual cross-attention) → Dual decoders: 1) Semantic decoder (trimap prediction), 2) Detail decoder (alpha matte) with MDE (shallow features + original image fusion)

Input:

RGB image + text description (keyword or expression)

Requires Human in the Loop:

No

Modality:

Image, Text

Key Innovations:

  • Language-guided matting: Enables text-based control (keywords/expressions) instead of trimaps/scribbles
  • RefMatte dataset: 230 categories with spatial relationships (left/behind/etc.) and attribute-based expressions
  • CLIPMat architecture: Combines CLIP's vision-language backbone with 1) Context-embedded prompts (CP), 2) Text-driven semantic pop-up (TSP via cross-attention, sketched after this list), 3) Multi-level details extractor (MDE)
  • Optional matting refiner improves edge details (85.83 SAD on real-world data)
  • First work to bridge vision-language tasks with high-precision matting
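As a rough illustration of the text-visual cross-attention behind the TSP module, the sketch below lets visual patch tokens query the text tokens so language can highlight the referred object; the dimensions, single attention block, and residual layout are assumptions.

```python
import torch
import torch.nn as nn

class TextVisualCrossAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N, C) patch embeddings; text_tokens: (B, T, C) text embeddings.
        attended, _ = self.attn(query=visual_tokens, key=text_tokens, value=text_tokens)
        return self.norm(visual_tokens + attended)   # residual keeps the original visual content

# Usage with random stand-ins for CLIP features:
tsp = TextVisualCrossAttention()
fused = tsp(torch.randn(1, 196, 512), torch.randn(1, 77, 512))  # -> (1, 196, 512)
```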

Try Our Background Removal Tool

Experience our state-of-the-art background removal technology based on the research above.