[Auto]Stitching Photo Mosaics

UC Berkeley
CS 180 Fall 2024 Project 4

Project Overview

In Part A of the project, I implemented image warping and mosaicing using perspective transformations (a.k.a. homographies). Specifically, by specifying corresponding image coordinates between two images, I can compute the 8-DOF homography matrix between them, and then use inverse warping and 2-band “Laplacian stack” blending to produce an image mosaic. For Part B, I followed the paper “Multi-Image Matching using Multi-Scale Oriented Patches” by Brown et al. (with some simplifications) and used automatic feature matching and RANSAC to compute the homography matrix between two images.

Pictures

I took three pairs of pictures (the view from my balcony, the view of Shattuck Avenue from my apartment's rooftop, and my apartment's lounge) such that the transforms between them are projective: I fixed the zoom and tried my best to rotate only about the center of projection. Then, I used an online labeling tool to manually select a dozen corresponding points in each pair of pictures. The pictures and the selected points are shown below.

Balcony View and Keypoints
Shattuck View and Keypoints
Lounge View and Keypoints

Computing Homography and Warping the Images

To recover the homography, we aim to find a matrix \(H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3}\) such that, for each of the \(n\) pairs of corresponding points \((x^{(i)}, {x^{(i)}}')_{i=1}^n\) written in homogeneous coordinates, we have \({x^{(i)}}' \sim H x^{(i)}\) (\(\cdot \sim \cdot\) here means equal up to scale). We can formulate this as a system of linear equations with two rows per correspondence: \[ \begin{bmatrix} & & & \vdots & & & \\ x_1^{(i)} & x_2^{(i)} & 1 & 0 & 0 & 0 & -x_1^{(i)} {x_1^{(i)}}' & -x_2^{(i)} {x_1^{(i)}}' \\ 0 & 0 & 0 & x_1^{(i)} & x_2^{(i)} & 1 & -x_1^{(i)} {x_2^{(i)}}' & -x_2^{(i)} {x_2^{(i)}}' \\ & & & \vdots & & & \end{bmatrix} \begin{bmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \end{bmatrix} = \begin{bmatrix} \vdots \\ {x_1^{(i)}}' \\ {x_2^{(i)}}' \\ \vdots \end{bmatrix} \] Unless we are using exactly 4 pairs of corresponding points (e.g., to rectify an image), the system is overconstrained, so we take the least-squares solution as our estimate of the homography matrix (for numerical stability, I use the min-norm solution computed via the Moore-Penrose pseudoinverse).

To warp one image onto another, I use inverse warping with nearest-neighbor interpolation (since linear and cubic interpolation produce blurred results). Since a perspective transform preserves lines, I can trace where the corners of the source image land in the target image, and thus determine the boundary/mask of the warped image in the target image.
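Below is a minimal sketch of this least-squares solve, assuming correspondences are passed as \(n \times 2\) arrays; the function name and exact conventions are illustrative rather than my actual implementation.

```python
import numpy as np

def compute_homography(src, dst):
    """Estimate H such that dst ~ H @ src (homogeneous coordinates), with h33 fixed to 1.

    src, dst: (n, 2) arrays of corresponding (x1, x2) image coordinates.
    Uses the min-norm least-squares solution via the Moore-Penrose pseudoinverse.
    """
    n = src.shape[0]
    A = np.zeros((2 * n, 8))
    b = np.zeros(2 * n)
    for i, ((x1, x2), (y1, y2)) in enumerate(zip(src, dst)):
        # Two rows per correspondence, matching the linear system above.
        A[2 * i]     = [x1, x2, 1, 0, 0, 0, -x1 * y1, -x2 * y1]
        A[2 * i + 1] = [0, 0, 0, x1, x2, 1, -x1 * y2, -x2 * y2]
        b[2 * i], b[2 * i + 1] = y1, y2
    h = np.linalg.pinv(A) @ b            # min-norm least-squares solution
    return np.append(h, 1.0).reshape(3, 3)
```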

Image Rectification

With these pieces, we are ready to rectify squares/rectangles that appear distorted because of the angle at which I took the pictures. Below are two examples of rectification: one rectifies the cover of the book Foundations of Computer Vision, and the other rectifies the US Letter sized paper taped to the entrance of BAIR ... or should I say the Berkeley Physics and Chemistry Lab. A sketch of the warp used here follows the figures.

Rectification of Book Cover, left: original, right: rectified
Rectification of Meme, left: original, right: rectified
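The rectifications above use the same inverse warp as the mosaics. Here is a sketch of it, assuming the `compute_homography` helper above and an \((x, y)\) = (column, row) coordinate convention; the output-canvas handling is simplified for illustration.

```python
import numpy as np

def inverse_warp(img, H, out_shape):
    """Warp img onto a canvas of shape out_shape using homography H.

    H maps source (x, y) to destination (x', y'); we invert it so that every
    destination pixel pulls its color from the source (inverse warping),
    with nearest-neighbor sampling.
    """
    H_inv = np.linalg.inv(H)
    h_out, w_out = out_shape
    xs, ys = np.meshgrid(np.arange(w_out), np.arange(h_out))
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])  # 3 x N homogeneous
    src = H_inv @ coords
    src = src[:2] / src[2]                                         # back to Cartesian
    sx = np.round(src[0]).astype(int)                              # nearest neighbor
    sy = np.round(src[1]).astype(int)
    valid = (sx >= 0) & (sx < img.shape[1]) & (sy >= 0) & (sy < img.shape[0])
    out = np.zeros((h_out, w_out) + img.shape[2:], dtype=img.dtype)
    out[ys.ravel()[valid], xs.ravel()[valid]] = img[sy[valid], sx[valid]]
    return out, valid.reshape(h_out, w_out)                        # warped image + mask
```

For rectification, the source points are the four corners of the distorted rectangle and the destination points are a hand-chosen axis-aligned rectangle.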

Blending Images into Mosaic

For each pair of images, I first compute the homography matrix between them and then warp one image onto the other. To blend the two images, I use 2-band “Laplacian stack” blending. Specifically, I extract the low- and high-frequency components of the two images using a Gaussian filter. I then create a continuous mask for each image based on the exact Euclidean distance transform of its binary mask (i.e., each pixel's distance to the nearest edge pixel of the mask). I normalize across the two masks so that they sum to 1 at each pixel (or 0 where both masks are 0). I blend the low-frequency components using the normalized masks, and blend the high-frequency components by choosing, at each pixel, the high-frequency component of the image with the higher mask value. Below are the blended mosaics of the three pairs of images I took, along with the (unnormalized) continuous masks for the last pair (Lounge View); a sketch of the blending step follows the figures.

Mosaic for Balcony View
Mosaic for Shattuck View
Mosaic for Lounge View
Unnormalized Continuous Masks for Lounge View, Obtained from Exact Euclidean Distance Transform
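A sketch of the blending step described above, assuming both warped images and their binary masks already live on a common canvas and the images are float arrays in \([0, 1]\) (the function name and the Gaussian \(\sigma\) are illustrative):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, gaussian_filter

def two_band_blend(im1, im2, mask1, mask2, sigma=5.0):
    """2-band blend of two aligned images (floats in [0, 1]) on a shared canvas."""
    # Continuous weights: distance of each pixel to the nearest edge of its mask.
    d1 = distance_transform_edt(mask1)
    d2 = distance_transform_edt(mask2)
    total = np.maximum(d1 + d2, 1e-12)
    w1 = np.where(d1 + d2 > 0, d1 / total, 0.0)        # normalized so w1 + w2 = 1 (or 0)
    w2 = np.where(d1 + d2 > 0, d2 / total, 0.0)

    def split_bands(im):
        # Low frequencies from a Gaussian filter; high frequencies are the residual.
        s = (sigma, sigma, 0) if im.ndim == 3 else sigma
        low = gaussian_filter(im, sigma=s)
        return low, im - low

    low1, high1 = split_bands(im1.astype(float))
    low2, high2 = split_bands(im2.astype(float))
    w1c = w1[..., None] if im1.ndim == 3 else w1
    w2c = w2[..., None] if im2.ndim == 3 else w2
    low = w1c * low1 + w2c * low2                       # feathered low frequencies
    high = np.where(w1c >= w2c, high1, high2)           # winner-take-all high frequencies
    return np.clip(low + high, 0, 1)
```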

Harris Corner Detector

The Harris corner detector uses the second moment matrix (the smoothed outer product of image derivatives, obtained by convolving the image with derivative-of-Gaussian filters) to find corners in the image. Specifically, we use the determinant of the second moment matrix divided by its trace as the cornerness measure. We threshold the measure at \(10\) and discard corners within \(20\) pixels of the image boundaries. In the leftmost plots below, the Harris corners are overlaid on the original images; there are far too many corners (on the order of 10k). A sketch of the response computation follows the figures.

Corners for Balcony View. Left: Harris Corners, Middle: Maximum Thresholding to 500 corners, Right: Adaptive Non-Maximal Suppression to 500 corners
Corners for Shattuck View. Left: Harris Corners, Middle: Maximum Thresholding to 500 corners, Right: Adaptive Non-Maximal Suppression to 500 corners
Corners for Lounge View. Left: Harris Corners, Middle: Maximum Thresholding to 500 corners, Right: Adaptive Non-Maximal Suppression to 500 corners
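A sketch of how the Harris response can be computed with derivative-of-Gaussian filtering and the det/trace measure. In the project I used the provided starter code, so the normalization, and hence the exact meaning of the threshold of 10, may differ here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(im, sigma_d=1.0, sigma_i=2.0):
    """Cornerness h = det(M) / trace(M) from the second moment matrix M (im: grayscale float)."""
    ix = gaussian_filter(im, sigma_d, order=(0, 1))   # derivative of Gaussian in x
    iy = gaussian_filter(im, sigma_d, order=(1, 0))   # derivative of Gaussian in y
    # Smoothed outer products of the gradients: the entries of the 2x2 matrix M.
    ixx = gaussian_filter(ix * ix, sigma_i)
    iyy = gaussian_filter(iy * iy, sigma_i)
    ixy = gaussian_filter(ix * iy, sigma_i)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det / (trace + 1e-12)                      # avoid division by zero in flat regions

def get_corners(h, threshold=10.0, border=20):
    """Return (row, col) of pixels whose response exceeds the threshold, away from the border."""
    ys, xs = np.nonzero(h > threshold)
    keep = (ys >= border) & (ys < h.shape[0] - border) & \
           (xs >= border) & (xs < h.shape[1] - border)
    return np.stack([ys[keep], xs[keep]], axis=1)
```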

Adaptive Non-Maximal Suppression

From the middle plots above, we can see that simply keeping the 500 corners with the highest Harris cornerness does not give a well-spread distribution: the 500 corners are packed together in clusters. To obtain a more uniform distribution of corners, we implement adaptive non-maximal suppression (ANMS): for each corner \(i\) we assign a suppression radius \(r_i = \min_{j} \lVert x_i - x_j \rVert\) over all corners \(j\) with \(f(x_i) < 0.9 \, f(x_j)\), where \(x_i\) is the coordinate of corner \(i\) and \(f(x_i)\) is its cornerness. We then keep the corners with the 500 largest \(r_i\) values. Intuitively, we are finding the local maxima of cornerness. The rightmost plots above show the results of this process: the corners are much more evenly distributed across the image.
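A vectorized sketch of ANMS under the definition above; the \(O(n^2)\) pairwise distance matrix is acceptable for a few thousand candidate corners, and the names are illustrative.

```python
import numpy as np

def anms(coords, f, n_keep=500, c_robust=0.9):
    """Adaptive non-maximal suppression.

    coords: (n, 2) corner coordinates; f: (n,) cornerness values at those corners.
    For each corner i, r_i is the distance to the nearest corner j that is
    sufficiently stronger (f_i < c_robust * f_j); keep the n_keep largest r_i.
    """
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # n x n
    stronger = f[:, None] < c_robust * f[None, :]       # stronger[i, j]: j dominates i
    dists = np.where(stronger, dists, np.inf)           # only count dominating neighbors
    r = dists.min(axis=1)                                # suppression radius per corner
    keep = np.argsort(-r)[:n_keep]
    return coords[keep]
```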

Feature Descriptor Extraction

For each point of interest (the corners that survive adaptive non-maximal suppression), we extract the \(40 \times 40\) patch centered at the corner. We then pass each patch through a Gaussian filter (with cutoff frequency corresponding to the Nyquist frequency of the resampled patch), downsample it to \(8 \times 8\), and normalize it to zero mean and unit variance. Below are some example feature descriptors (shown unnormalized for visualization purposes); a sketch of the extraction follows the figures.

9 Example Feature Descriptors from Balcony View
9 Example Feature Descriptors from Shattuck View
9 Example Feature Descriptors from Lounge View
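A sketch of the descriptor extraction, assuming grayscale input and corners given as (row, col) pairs; the blur \(\sigma\) of step/2 is my stand-in for the cutoff at the Nyquist frequency of the resampled patch.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_descriptors(im, corners, patch=40, out=8):
    """Extract axis-aligned MOPS-style descriptors from a grayscale image."""
    half = patch // 2
    step = patch // out                            # 40 / 8 = 5: keep every 5th pixel
    # Blur once for anti-aliasing before the 5x downsample (sigma ~ step / 2 assumed).
    blurred = gaussian_filter(im.astype(float), sigma=step / 2)
    descriptors = []
    for (r, c) in corners:
        window = blurred[r - half:r + half, c - half:c + half]
        if window.shape != (patch, patch):         # skip corners too close to the border
            continue
        d = window[::step, ::step]                 # subsample to out x out
        d = (d - d.mean()) / (d.std() + 1e-12)     # bias/gain normalization
        descriptors.append(d.ravel())
    return np.array(descriptors)
```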

Feature Matching

After extracting the feature descriptors, we match them with a nearest-neighbor approach. For each feature descriptor in the first image, we find its two nearest neighbors in the second image under Euclidean distance. We then compute the ratio between the distances to the first and second nearest neighbors (Lowe's ratio test) and keep only matches with a ratio below \(0.5\), which filters out poor matches. The figures below show the matches between the three pairs of images; even with this filtering, one false match remains for the Shattuck view and one for the Lounge view. A sketch of the matching step follows the figures.

59 Matches for Balcony View
29 Matches for Shattuck View
115 Matches for Lounge View
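A sketch of the matching step with the ratio test; the descriptor arrays are assumed to be \((n, 64)\) and the names are illustrative.

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.5):
    """Nearest-neighbor matching with the ratio test.

    desc1: (n1, d), desc2: (n2, d). For each descriptor in image 1, find its two
    nearest neighbors in image 2 and keep the match if 1-NN dist / 2-NN dist < ratio.
    Returns a (k, 2) array of index pairs (i in image 1, j in image 2).
    """
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)  # n1 x n2 distances
    nn = np.argsort(d, axis=1)[:, :2]                  # indices of the two nearest neighbors
    rows = np.arange(len(desc1))
    d1 = d[rows, nn[:, 0]]
    d2 = d[rows, nn[:, 1]]
    keep = d1 / (d2 + 1e-12) < ratio                   # Lowe's ratio test
    return np.stack([np.nonzero(keep)[0], nn[keep, 0]], axis=1)
```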

Robust Homography with RANSAC

To compute a robust homography matrix, we use 4-point RANSAC (RANdom SAmple Consensus). Specifically, we repeatedly sample 4 matches at random and compute the exact homography for each sample. For each candidate homography, we count the number of inliers (matches whose reprojection error is within 10 pixels). Finally, we take the largest inlier set and recompute the final homography using only those inliers. Below are the automatically generated mosaics produced with RANSAC; for both the Shattuck and Lounge views, RANSAC discards the mismatched points. A sketch of the RANSAC loop follows the figures.

Auto Mosaic for Balcony View
Auto Mosaic for Shattuck View
Auto Mosaic for Lounge View
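A sketch of the 4-point RANSAC loop, reusing the `compute_homography` helper sketched in Part A; the iteration count and random seed are illustrative.

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iters=5000, tol=10.0):
    """4-point RANSAC for a robust homography.

    pts1, pts2: (n, 2) matched coordinates (pts1 in image 1, pts2 in image 2).
    Returns the homography refit on the largest inlier set and the inlier mask.
    """
    n = pts1.shape[0]
    best_inliers = np.zeros(n, dtype=bool)
    rng = np.random.default_rng(0)
    ones = np.ones((n, 1))
    for _ in range(n_iters):
        idx = rng.choice(n, size=4, replace=False)       # minimal 4-point sample
        H = compute_homography(pts1[idx], pts2[idx])
        # Project all points from image 1 and measure reprojection error in image 2.
        proj = H @ np.hstack([pts1, ones]).T
        proj = (proj[:2] / proj[2]).T
        inliers = np.linalg.norm(proj - pts2, axis=1) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the largest consensus set for the final estimate.
    H_final = compute_homography(pts1[best_inliers], pts2[best_inliers])
    return H_final, best_inliers
```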

Comparison between Manual and Auto Stitching

We can see below that auto stitching produces results that are very similar to those obtained from manually annotated correspondences.

Comparison between Mosaics for Balcony View. Left: Manual, Right: Auto
Comparison between Mosaics for Shattuck View. Left: Manual, Right: Auto
Comparison between Mosaics for Lounge View. Left: Manual, Right: Auto

What have you learned?

The pipeline we implemented (detecting and extracting features, matching them, and using RANSAC to filter outliers) is widely used (e.g., in SLAM), and implementing it from scratch helped me understand how it works. I think this will better equip me for the final project, where I plan to do SLAM/3D reconstruction.