Structure from Motion with Deferred Feature Matching and Subset Bundle Adjustment

Authors

Andreas Ley, Ronny Hänsch, Olaf Hellwich

Abstract

Structure from motion techniques allow the digital 3D reconstruction of an object from a set of 2D images. With the advent of cheap but high-quality digital cameras, online image sharing platforms, and increases in readily available computational power, this field has attracted growing interest in recent years. For the initial sparse reconstruction that estimates the camera parameters, this paper proposes new approaches for the feature matching and bundle adjustment steps. The intended use cases for these optimizations are reconstructions of individual objects with a large amount of overlap between the input images. The new approach is evaluated against two other well-known implementations and shown to outperform these state-of-the-art techniques in both speed and quality, achieving up to a threefold speed increase while simultaneously halving the camera position errors on public benchmark datasets.

Code

The implemented structure from motion pipeline uses Deferred Feature Matching and Subset Bundle Adjustment to increase the speed and precision of the reconstruction. It performs a sparse reconstruction to estimate the camera parameters and outputs these for use by subsequent dense reconstruction algorithms. The intended use case is the reconstruction of a single object of interest with a large amount of overlap in the source images.

Requirements

The code base is written in C++11 and CUDA 5.0 and requires an AVX-capable CPU (Intel Ivy Bridge architecture or newer) and a CUDA device with a compute capability of at least 3.0 (Kepler architecture or newer) for execution. The entire framework was developed under Linux but should be easy to port to other OS ecosystems such as Windows. Compilation was tested with GCC 4.8.3.
In addition, the following libraries are needed:
  • Doxygen (to build the documentation)
  • CMake (build system)
  • CUDA
  • ImageMagick
  • Boost
  • TinyXML
  • xxd (a command-line tool for converting files to hex)
  • GLEW (for some of the GUI tools, not for the main program)
  • X11 (for some of the GUI tools, not for the main program)

How to Build the Framework

To build the documentation (you'll need Doxygen), go to the "doc/" directory and execute the "buildDoc.sh" shell script, e.g.:
$ cd doc
$ ./buildDoc.sh

You will then find the main documentation in "doc/html/index.html" with instructions on how to build and use the SfM pipeline.
Building the actual code is based on CMake; see the Doxygen-generated documentation for more details.
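For reference, a typical out-of-source CMake build could look as follows (a sketch only; the exact options and targets are described in the generated documentation):
$ mkdir build
$ cd build
$ cmake ..
$ make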

Structure of the Codebase

The codebase is structured into several libraries and projects. The most important part is the backend, which contains the entire implementation of the actual structure from motion pipeline. A frontend in the form of a command line program is supplied as well; it combines the structure from motion implementation with input and output handling into a fully functional application. For details, see the documentation.

Downloads

  • Version 1.0: Original version of the paper (<1MB)
  • Version 1.1: CMake build system, minor bug fixes, enhanced radial distortion support, all data (CUDA kernels etc.) compiled into the executable to prevent working directory issues (<1MB)
  • Version 1.2: Fixed sparse viewer version check; Added a slower SSE4.2 code path for older machines (<1MB)

Datasets

The "Fountain-P11", "Herz-Jesu-P25", and "Castle-P30" datasets by C. Strecha can be found on his project page. The raw data for the volcano and volcanic bomb shown in the "Additional Results" section below can be found here.
In addition, we provide the images of the turntable dataset here.

Citation

If you use this code or the dataset for your research, please cite:
@something{ Ley:2015:somewhere,
    author = {Ley, Andreas and Hänsch, Ronny and Hellwich, Olaf},
    title = {Structure from Motion with Deferred Feature Matching and Subset Bundle Adjustment},
    ...
}

License

The implementation is distributed under GPLv3. If this license doesn't fit your needs, you can get in contact with us.

Additional Results

In the following, we provide additional results which were omitted from the main paper for reasons of brevity.

Bundle Adjustment Verification on Synthetic Dataset

We built a custom bundle adjustment (BA) implementation which allows the matches to be weighted according to their sizes in the source images. To verify the implementation and the hypothesis that weighting improves the reconstruction, we constructed a synthetic dataset. The following figure plots the remaining (true) camera position errors over the computation time for our bundle adjustment implementation with and without weighting. For reference, the convergence of PBA is also shown.

Bundle adjustment performance on a synthetic data set with and without weighting the feature point matches by their image space sizes. For reference, the performance of PBA on the same data is also shown.

It can be observed that weighting allows the bundle adjustment to converge to a significantly lower remaining error.
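To illustrate the weighting idea (a minimal sketch with our own names, not the actual implementation from the codebase), the per-observation reprojection residual can simply be scaled by a weight derived from the feature's image-space size; since a least-squares solver minimizes the sum of squared residuals, scaling the residual by \(w\) weights the squared error by \(w^2\):

// One feature point observation in one image (hypothetical types/names).
struct Observation {
    double measuredX, measuredY; // detected feature position
    double weight;               // e.g. derived from the feature's image-space size
};

// Weighted reprojection residual for use in a Gauss-Newton /
// Levenberg-Marquardt style solver.
void weightedResidual(const Observation &obs,
                      double projectedX, double projectedY,
                      double residual[2])
{
    residual[0] = obs.weight * (projectedX - obs.measuredX);
    residual[1] = obs.weight * (projectedY - obs.measuredY);
}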
Note that PBA operates faster due to its multi-threaded implementation. We considered writing a multi-threaded implementation as well but decided against it: with the reduction of the problem size in Subset Bundle Adjustment, the benefits of multi-threading are less significant, as the following test on a smaller synthetic dataset shows.

Bundle adjustment performance on a small synthetic data set, such as could arise in a Subset Bundle Adjustment scenario. PBA cannot exploit its multicore abilities, and the custom bundle adjustment implementation with weighting pulls ahead in both speed and quality.

Still, for larger (> 200 image) datasets, a multi-threaded BA implementation like PBA might be beneficial even with Subset Bundle Adjustment.

Residual Radial Distortion

In the paper, we showed reconstructions of three benchmark datasets by Christoph Strecha which were assumed to be free of radial distortion. Of course, no non-synthetic set of images can ever be completely free of distortion, but in light of the results shown in the main paper, especially for the "Castle-P30" dataset, the question remains how valid the "no radial distortion" assumption truly is.
To verify the presence or absence of radial distortion and to quantify it, we augmented our bundle adjustment implementation with an option that extends the internal calibration parameters by a simple polynomial model for radial distortion. The model consists of three parameters which, unlike the model used for example in Visual SFM, control the mapping from the undistorted Euclidean image space positions to the distorted ones:
$$
\begin{align}
r &= \left| \vec{O}_\text{eucl} \right|^2 \cdot \kappa_2 + \left| \vec{O}_\text{eucl} \right|^3 \cdot \kappa_3 + \left| \vec{O}_\text{eucl} \right|^4 \cdot \kappa_4 \\
\vec{O}_\text{eucl,dist} &= \vec{O}_\text{eucl} \cdot (1 + r)
\end{align}
$$
Apart from the reversal of the direction (mapping from undistorted to distorted instead of vice versa) and the addition of terms of higher order than \(\kappa_2\), this model is similar to the one employed by Visual SFM. It is important to note that, as in Visual SFM, the image center is assumed to be the center of distortion, even though the internal calibration in our implementation is more sophisticated and actually estimates the principal point. However, since the radial distortion mode was intended more as an analytical tool, these limitations were considered acceptable.
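As a minimal C++ sketch of this model (the names are ours and do not appear in the codebase):

#include <cmath>

struct Vec2 { double x, y; };

// Maps an undistorted position in Euclidean image space to its distorted
// position using the three-parameter polynomial model above. The origin
// (image center) is assumed to be the center of distortion.
Vec2 distort(const Vec2 &p, double kappa2, double kappa3, double kappa4)
{
    const double n = std::sqrt(p.x * p.x + p.y * p.y);                // |O_eucl|
    const double r = n*n * kappa2 + n*n*n * kappa3 + n*n*n*n * kappa4; // radial term
    return { p.x * (1.0 + r), p.y * (1.0 + r) };                      // O * (1 + r)
}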
The radial distortion was estimated for the three benchmark datasets "Fountain-P11", "Herz-Jesu-P25", and "Castle-P30". The following figure depicts the estimated residual distortions.

Estimated residual radial distortion in the benchmark datasets by Strecha et al.

The graph shows the estimated movement of the image content due to the distortion, which depends on the distance to the image center. Since the images are all 3072 by 2048 pixels, a distance of 1024 pixels corresponds to the top and bottom edges of the image, a distance of 1536 pixels to the left and right edges, and a distance of roughly 1846 pixels (\(\sqrt{1536^2 + 1024^2} \approx 1846\)) to the image corners. The estimated distortions are very small, well below two pixels in magnitude. For reference, typical distortions in raw images from a camera are in the range of tens of pixels or more, even for standard non-fisheye lenses. Interestingly, the three estimations, which were carried out independently of each other on different datasets, all result in similar values, and the spread of the three curves is smaller than the estimated values themselves. This indicates that the radial distortion model employed by Strecha to undistort the images was probably off by about one pixel towards the borders.
Note that being off by one pixel at the borders is actually a fairly small error, but it seems to be enough to introduce severe quality problems. The reason why the "Castle-P30" dataset is more sensitive to this probably lies in the nature of its camera poses. The first two datasets have a lot of overlap, so much of the camera connectivity can be established by matching feature points from the image centers, where the radial distortion is small. The lower amount of overlap in "Castle-P30" and the way in which the cameras look at the inner facade put more emphasis on matches of feature points from the left and right borders of the images, which suffer from the residual radial distortion. This explains why the new approach with radial distortion estimation activated performs significantly better on "Castle-P30", but also why Visual SFM's preemptive matching mode seems to improve the precision there. Preemptive matching removes image pairs that are unlikely to yield matches from the list of pairs that undergo the initial pairwise feature point matching. Presumably, images that share little content because they only overlap at the borders are likely to be excluded from the pairwise matching, which coincidentally reduces the number of matches of feature points from image borders.

Additional Datasets

To examine how well the processing speed of SfM with Deferred Feature Matching and Subset Bundle Adjustment scales to bigger datasets, we performed additional reconstructions for larger, real-world datasets.
Volcano
The Volcano dataset was used by James and Robson (2012) to investigate the usefulness of SfM reconstructions for geoscience applications (the images were taken by Ben van Wyk de Vries, Univ. Blaise Pascal, Clermont-Ferrand). It consists of 133 images of the summit craters of the Piton de la Fournaise volcano. The images have a resolution of 3072 by 2048 pixels and were taken with a Canon EOS D60 with a 20mm lens from a small airplane, in multiple circles of varying radii.

The images seem to be the unaltered images from the camera, which means that they still contain strong radial distortions. With no access to the camera or the lens, the radial distortion had to be estimated from the images themselves in order to undistort them. This was achieved by running our SfM pipeline with radial distortion estimation enabled on the distorted dataset, which performs surprisingly well given that it was not designed for this. The parameters obtained from that reconstruction were then used to undistort the images, and all tests, both those with Visual SFM and those with the new approach, were performed on the undistorted images. Due to the large number of images, Bundler was excluded from this test, and Visual SFM was only run in preemptive matching mode, as with naive matching the matching step alone takes it well past the 30 minute mark.
The following figure shows the amount of time the individual frameworks and modes need for the reconstruction.

Speed comparison on the Volcano dataset.

Visual SFM with preemptive matching, shared internal calibrations, and radial distortion estimation again did not produce a reasonable solution. The new approach is almost twice as fast as Visual SFM, which shows that Deferred Feature Matching and Subset Bundle Adjustment can scale to bigger datasets.
Volcanic Bomb
The Volcanic Bomb dataset is also used by James and Robson (2012), in the same publication as the Volcano dataset, and features many high-resolution images of a volcanic bomb: a mass of molten rock that is ejected from a volcano during an eruption and then cools and solidifies in flight. This particular sample is from the Soufrière Hills volcano on Montserrat and shows the cracked outer surface and deep crevices of a "bread-crust" bomb. The dataset consists of 210 images and is the largest one shown here. The images have a resolution of 4272 by 2848 pixels and were acquired with a Canon EOS 450D and a 50mm lens. Radial distortion was handled as with the Volcano dataset.

The sample was placed on a turntable and multiple rings of images were taken, in addition to some hand-held shots from the top. To provide the reconstruction with images from the bottom of the sample, the sample was turned around and the acquisition was repeated. This creates an interesting scenario for our SfM pipeline. The best place to start the reconstruction is somewhere "in the middle", which in this case is the intersection point where the rings of both inclinations meet. However, the user has to pick the initial image pair, and while "the middle" was fairly obvious in the previous datasets, this is not the case here. For this reason, the following tests include the results for the new approach twice: once for a suboptimal starting point, denoted "+badStart", and once for a good initial image pair, denoted "+goodStart".
The reconstruction speeds can be seen in the following figure.

Speed comparison on the Volcanic Bomb dataset.

Visual SFM with preemptive matching, shared internal calibrations, and radial distortion estimation again failed to produce a reasonable reconstruction. Visual SFM with individual internal calibrations performed extremely well on this dataset in terms of speed, pulling equal with the new approach. However, the previous datasets already revealed that individual internal calibrations can severely degrade the reconstruction quality. As fast as Visual SFM with individual internal calibrations is on this dataset, its shared internal calibrations mode is extremely slow, taking about twice as long in total. When started from a bad initial image pair, the new approach performs about as well as Visual SFM with individual internal calibrations; from a good initial image pair, it is slightly faster, but the difference is almost negligible. This indicates that while the new approach can scale to over 200 images and still compete with Visual SFM, some additional optimizations might be necessary to ensure scalability beyond that number of images.