How Automatic Image Background Removal Works

Author: Imran Kocabiyik

Image without background

An illustration of an encoder-decoder neural network for image matting

In today’s digital landscape, background removal is a critical process for film production, image editing, augmented reality, virtual reality, and media entertainment. It involves removing the original background from an image or video, and doing it well requires understanding the techniques available for the task. This blog post explains how background removal works and briefly surveys the main solution techniques and their limitations. Let’s dive in!

Defining the Problem

The technical term for image background removal is image matting or alpha matting. It is the process of accurately extracting the foreground from an image.

To extract the foreground, we need to have a mask called alpha matte. Here is an illustration of an image and a corresponding alpha matte:

An illustration of the image and alpha matte pair


More formally, the image can be modeled as a combination of foreground $F$ and background image $B$ by using the alpha matte $\alpha$:

$$I_{i}=\alpha_{i} F_{i}+\left(1-\alpha_{i}\right) B_{i}, \quad \alpha_{i} \in[0,1]$$
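The compositing equation above translates directly into a few lines of array code. This sketch (using numpy, with all array shapes assumed for illustration) composites a foreground over a background with a given alpha matte:

```python
import numpy as np

def composite(F, B, alpha):
    """Per-pixel compositing: I = alpha * F + (1 - alpha) * B.
    F, B: H x W x 3 float arrays; alpha: H x W array in [0, 1]."""
    a = alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return a * F + (1.0 - a) * B
```

With `alpha = 0.25` everywhere, every output pixel is a quarter foreground and three quarters background, exactly as the equation states.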

Why Background Removal is Hard

An alpha matte is a matrix of values between 0 and 1 (inclusive).

  • 0 indicates that the pixel belongs to the background
  • 1 indicates that the pixel belongs to the foreground
  • The intermediate values between 0 and 1 correspond to different levels of partial opacity. They represent the pixels in the transition region.

Predicting an accurate alpha matte can be very challenging due to the complexity and variability of real-world images. This separation requires precise identification of the edges of the foreground object, as well as accurate estimation of the color and texture of both the foreground and background.

Nevertheless, with a good approximation of the alpha matte, the final composite is often visually indistinguishable from the ground truth.
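Once an alpha matte has been estimated, producing the actual cutout is simple: attach the matte to the image as an alpha channel. A minimal numpy sketch (the function name and scaling convention are illustrative, not from any particular library):

```python
import numpy as np

def apply_matte(img, alpha):
    """Attach an estimated alpha matte to an RGB image as an alpha
    channel, yielding an RGBA cutout with a transparent background.
    img: H x W x 3 uint8; alpha: H x W floats in [0, 1]."""
    a = np.clip(alpha, 0.0, 1.0)
    # Scale alpha to the 0-255 range used by 8-bit RGBA images.
    return np.dstack([img, (a * 255).astype(np.uint8)])
```

Saving the resulting RGBA array as a PNG gives the familiar "image without background" result.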

Relieving the Burden

To make the problem easier, many matting methods require additional user input. The most common annotation is a trimap: a mask that marks the absolute background, the absolute foreground, and the transition region.

The trimap allows the matting algorithm to focus its efforts on the unknown pixels, which are typically more difficult to classify than the foreground or background pixels. By providing this initial segmentation of the image, the trimap can help improve the accuracy and efficiency of the matting algorithm.

Below you can see an example of an image and its corresponding trimap:

An illustration of trimap which serves as an additional input in background removal applications

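In practice, a trimap is often generated automatically from a rough binary segmentation by eroding it (certain foreground) and dilating it (certain background), with everything in between marked unknown. A sketch of that idea using scipy's morphology operations; the band width and the 0/128/255 encoding are common conventions, not a fixed standard:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def make_trimap(mask, band=3):
    """Build a trimap from a boolean foreground mask.
    Returns uint8: 0 = background, 128 = unknown, 255 = foreground."""
    fg = binary_erosion(mask, iterations=band)          # shrink: certain FG
    unknown = binary_dilation(mask, iterations=band) & ~fg  # the transition band
    trimap = np.zeros(mask.shape, dtype=np.uint8)
    trimap[unknown] = 128
    trimap[fg] = 255
    return trimap
```

The `band` parameter controls how wide the unknown region is; a wider band gives the matting algorithm more room but also more pixels to resolve.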

Comparing the Solutions

In terms of implementation, background removal techniques can be grouped under two main categories: traditional methods and deep learning methods.

Traditional Methods

There are several traditional methods for image matting, each of which has its own strengths and limitations. Some of the most common methods include:

Chroma keying

Chroma keying uses color information to separate the foreground from the background: a specific color (often green or blue), known as the "key color," is selected, and every pixel of that color is made transparent or replaced with the corresponding pixels from another image or video stream. This allows two images to be composited together seamlessly, creating the illusion that the objects in both belong to the same scene.

The main limitation of chroma keying is that the key color must not appear in the foreground objects. If it does, parts of the foreground are keyed out as well, producing an unrealistic or unnatural composite. This is especially challenging for images with a lot of color variation, or when lighting conditions are not consistent across the two images being composited. Chroma keying is also sensitive to variations in lighting and color, so achieving good results requires careful planning and setup.

Green and blue are the most common key colors because they rarely appear in human skin tones and are easy to distinguish from other colors under most lighting conditions. The choice between them depends on factors such as the colors present in the foreground objects, the lighting conditions, and the equipment being used; other colors can also serve as key colors, depending on the specific needs of the application.
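A naive chroma keyer can be written in a few lines: measure each pixel's distance to the key color and map small distances to transparency. This is a deliberately simplified sketch (the tolerance values and the linear ramp are illustrative choices; production keyers work in chroma-separated color spaces and handle spill):

```python
import numpy as np

def chroma_key(img, key=(0, 255, 0), tol=90.0):
    """img: H x W x 3 uint8. Returns a soft alpha matte in [0, 1]:
    pixels close to the key color become transparent (alpha 0)."""
    diff = img.astype(np.float32) - np.asarray(key, np.float32)
    dist = np.linalg.norm(diff, axis=-1)
    # Fully transparent within tol/2 of the key color, fully opaque
    # beyond tol, with a linear ramp in between for soft edges.
    return np.clip((dist - tol / 2) / (tol / 2), 0.0, 1.0)
```

The soft ramp is what distinguishes matting from hard segmentation: pixels near the key color get intermediate alpha instead of a binary decision.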


Closed-form Matting

Closed-form matting estimates the alpha matte by solving a system of linear equations based on the colors of the pixels in the image. This approach has the advantage of being fast and accurate, but it can be sensitive to noise and may not always produce good results. It is commonly used in applications such as image compositing, where the alpha matte is used to combine multiple images together seamlessly.
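The core structure of such methods is a constrained sparse linear solve: an affinity Laplacian ties each pixel's alpha to its neighbors, while trimap pixels are pinned to 0 or 1. The sketch below shows that structure only; building the actual matting Laplacian from image colors (as in Levin et al.'s closed-form matting) is omitted, and the toy Laplacian in the usage example is a stand-in:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def solve_alpha(affinity_L, trimap, lam=100.0):
    """Solve (L + lam*D) a = lam*D*s, where D selects pixels the
    trimap marks as known and s holds their values (0 or 1).
    affinity_L: sparse n x n Laplacian over flattened pixels;
    trimap: flat array with 0 (bg), 0.5 (unknown), 1 (fg)."""
    known = (trimap == 0) | (trimap == 1)
    D = sp.diags(known.astype(np.float64))
    s = np.where(trimap == 1, 1.0, 0.0)
    a = spsolve((affinity_L + lam * D).tocsc(), lam * (known * s))
    return np.clip(a, 0.0, 1.0)
```

On a 5-pixel "chain" Laplacian with the endpoints pinned to background and foreground, the solution interpolates smoothly between 0 and 1, which is exactly the behavior wanted in the unknown band of a trimap.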

Looking at the Neighbors: Bayesian Matting

This method uses a statistical model to estimate the transparency of each pixel, based on the colors of neighboring pixels and the known foreground and background colors.

To predict the opacity of an unknown pixel, neighboring pixels serve as a reference because, statistically, they tend to have similar colors. Samples collected from the neighboring pixels are therefore used to predict the alpha matte.
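Once foreground and background colors have been estimated from neighboring samples, a common way to recover alpha for a pixel is to project its color onto the line between those two estimates. This follows directly from the compositing equation $I = \alpha F + (1-\alpha)B$; the function below is a minimal sketch of that projection step, not the full Bayesian matting pipeline:

```python
import numpy as np

def estimate_alpha(c, F, B):
    """Given a pixel color c and estimated foreground/background
    colors F and B, solve I = a*F + (1-a)*B for a in the least-
    squares sense by projecting (c - B) onto (F - B)."""
    c, F, B = (np.asarray(x, dtype=np.float64) for x in (c, F, B))
    d = F - B
    alpha = np.dot(c - B, d) / max(np.dot(d, d), 1e-8)
    return float(np.clip(alpha, 0.0, 1.0))
```

For a mid-gray pixel between a white foreground estimate and a black background estimate, this returns an alpha of about 0.5, matching intuition.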

Edge-based matting

This method uses edge detection algorithms to identify the boundaries between the foreground and background and then uses this information to estimate the transparency of each pixel. This method can be effective for images with well-defined edges, but it can be sensitive to noise and may not work well for images with more complex or blurred edges.

Introducing Deep Learning

Deep learning is considered one of the most accurate approaches to background removal because it does not rely heavily on a specific feature such as color or texture; instead, it considers all aspects of an image or video frame simultaneously when deciding what should be classified as foreground and what as background.

However, training such a model is difficult: it requires a large and diverse dataset, supervision, and substantial computational power.

The deep learning model is usually an encoder-decoder neural network. The encoder analyzes the image and compresses it into a latent space; the decoder takes the latent-space representation and constructs the alpha matte.

The solution presented in the diagram below produces good results, but it has one limitation: the network requires a trimap as an additional input, which typically has to be provided by a human. As a result, it is unsuitable for applications that need to run automatically or in real time.
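The overall shape of such a trimap-based network can be sketched in a few lines of PyTorch. This is a toy model for illustrating the encoder-decoder structure and the 4-channel (RGB + trimap) input, not any published architecture; real matting networks are far deeper and use skip connections:

```python
import torch
import torch.nn as nn

class TinyMattingNet(nn.Module):
    """Toy encoder-decoder sketch: the encoder downsamples the
    RGB+trimap input into a latent representation, and the decoder
    upsamples it back into a 1-channel alpha matte in [0, 1]."""
    def __init__(self, in_ch=4):  # 3 RGB channels + 1 trimap channel
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

The final sigmoid keeps every output pixel in the $[0, 1]$ range required of an alpha matte.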

An illustration of a model which takes an RGB image and a trimap as inputs


Getting Rid of the Trimap Requirement

Because trimaps and scribbles are a significant limitation, new methods have been developed. Below are some solutions that do not require a trimap as an additional input.

Training Two Neural Networks

When the trimap is dropped, the neural network may fail to capture the fine details that matter most in image matting. The network therefore still needs some additional guidance: a segmentation mask, a coarse alpha map, or anything else that highlights the foreground region.

An image matting pipeline with two neural networks


The Google Pixel phone uses such a solution to blur the background in portrait mode. For more details, see Accurate Alpha Matting for Portrait Mode Selfies on Pixel 6.

Neural Network with Multiple Objectives

Another solution is a multi-task neural network. In this setting, the network can learn both high-level semantics and low-level details using a single encoder. The following illustration is a simplified version of the Glance and Focus network presented in the paper Bridging Composite and Real: Towards End-to-end Deep Image Matting.

In this setting, one shared encoder feeds two decoders that work collaboratively: one decoder returns the semantic map, while the other captures fine details. Finally, the two outputs are combined to produce the alpha matte.
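The fusion step at the end can be sketched very simply: trust the detail decoder inside the transition region and the semantic decoder everywhere else. This is a simplified stand-in for the paper's collaborative matting step; the thresholds defining the transition region are illustrative assumptions:

```python
import numpy as np

def fuse(semantic, detail):
    """Combine a coarse semantic map with a fine detail map:
    use the detail prediction where the semantic map is uncertain
    (the transition region), and snap the semantic prediction to
    0/1 where it is confident."""
    transition = (semantic > 0.1) & (semantic < 0.9)  # hypothetical thresholds
    return np.where(transition, detail, np.round(semantic))
```

This division of labor is why the two-decoder design works: neither decoder has to solve the whole problem alone.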

An image matting pipeline with one encoder and two decoders with different tasks



Summary

Image matting is the process of extracting foreground elements from the background in an image. This is an important task in image processing, as it allows for the isolation of specific elements in an image, which can be useful for a variety of applications, such as creating composites, removing backgrounds, or creating digital masks for video editing.

Solving the image matting problem is difficult for several reasons. One of the main challenges is that it is often difficult to accurately separate the foreground from the background, particularly when the foreground and background colors are similar. Additionally, complex textures and patterns can also make the matting process more difficult.

Traditionally, image matting has been solved using optimization-based approaches, which can be time-consuming and require extensive manual intervention. In recent years, however, deep learning-based methods have been developed that can automatically learn to extract foreground elements from images. These methods have shown promising results and can provide more accurate and efficient solutions to the image matting problem. Overall, image matting is an important task in image processing, and advances in deep learning have opened up new possibilities for solving this challenging problem.

References

    Accurate Alpha Matting for Portrait Mode Selfies on Pixel 6 [link]
    Sergio Orts Escolano and Jana Ehman, Software Engineers, Google Research (January 24, 2022)
    Bridging Composite and Real: Towards End-to-end Deep Image Matting [link]
    Li, J., Zhang, J., Maybank, S. J., & Tao, D. (2020)