A Very Basic Overview of Neural Radiance Fields (NeRF)

Can they one day replace photos?

Tim Cheng
Towards Data Science


Figure 1. NeRF Pipeline. Given a large set of images, NeRF learns to implicitly represent the 3D shape, such that new views can later on be synthesised. Image retrieved from the original NeRF paper by Mildenhall et al.

The deep learning era began with the advances it brought to traditional 2D image-recognition tasks such as classification, detection, and instance segmentation. As these techniques matured, research in deep-learning-based computer vision shifted towards fundamental 3D problems, one of the most notable being to synthesise new views of an object and reconstruct its 3D shape from images. Many approaches treat this as a conventional machine learning problem, where the goal is to train a system that “inflates” 3D geometry out of images after a finite number of training iterations. Recently, however, a completely new direction has been introduced: Neural Radiance Fields (NeRF). This article covers the basic concepts of the originally proposed NeRF as well as several of its extensions in recent years.

Representing the Geometry Implicitly

The biggest difference between a NeRF model and traditional neural networks for 3D reconstruction is that NeRF is an instance-specific implicit representation of an object: each network is trained to represent one particular object, rather than to generalise across many.

Put simply, given a set of images capturing the same object from multiple angles, along with their corresponding camera poses, the network learns to represent the 3D object such that new views can be synthesised consistently with the training views.

Starting with a Basic MLP

Figure 2. NeRF Training Overview. Image retrieved from the original NeRF paper by Mildenhall et al.

While learning such an implicit representation seems difficult, Mildenhall et al. showed in the original NeRF paper that a simple Multilayer Perceptron (MLP) has enough capacity to perform such a complex task.

Specifically, the input of this fully connected network is a single 5D coordinate (3 values for location and 2 for viewing direction), and the output is the density and colour at the given location. In practice, density depends only on the location and not on the viewing direction, so only the location is used to predict the density, while the viewing direction is combined with the…
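The input/output split described above can be sketched in a few lines of NumPy. This is only a toy illustration under several assumptions: the weights are random rather than trained, the layer sizes and the names `toy_radiance_field` and `angles_to_unit_vector` are made up for this example, and the viewing direction is joined to the location features by simple concatenation. It is not the architecture from the paper, which is considerably deeper.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 32  # hypothetical hidden width, chosen only for illustration

# Random weights stand in for trained parameters (assumption).
W_pos = rng.normal(scale=0.5, size=(3, HIDDEN))        # location -> features
W_sigma = rng.normal(scale=0.5, size=(HIDDEN, 1))      # features -> density
W_rgb = rng.normal(scale=0.5, size=(HIDDEN + 3, 3))    # features + view -> colour


def angles_to_unit_vector(theta, phi):
    """Convert the two viewing angles into a 3D unit direction vector."""
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])


def toy_radiance_field(coord5d):
    """Map a 5D coordinate (x, y, z, theta, phi) to (density, RGB colour)."""
    x, y, z, theta, phi = coord5d
    position = np.array([x, y, z])

    # Location-only branch: features and density never see the view direction.
    features = np.maximum(position @ W_pos, 0.0)              # ReLU
    density = float(np.maximum(features @ W_sigma, 0.0)[0])   # non-negative

    # Colour branch: location features concatenated with the view direction.
    view = angles_to_unit_vector(theta, phi)
    logits = np.concatenate([features, view]) @ W_rgb
    colour = 1.0 / (1.0 + np.exp(-logits))                    # sigmoid -> [0, 1]
    return density, colour


# Same location, two different viewing directions.
density_a, colour_a = toy_radiance_field((0.1, 0.2, 0.3, 0.3, 1.0))
density_b, colour_b = toy_radiance_field((0.1, 0.2, 0.3, 1.2, 2.0))
print(density_a, density_b)
print(colour_a, colour_b)
```

Because the density branch only ever receives the location, querying the same point from two viewing directions returns the same density, while the colour is free to change with the view.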



Oxford CS | Top Writer in AI | Posting on Deep Learning and Vision
