NeRF

Implementation of Vanilla NeRF

Francesco Crivelli¹

¹University of California, Berkeley


Model Architecture

Here's a detailed description of the model's architecture and hyperparameters:

Neural Field Model Architecture

Input:

2D coordinate (x, y) ∈ [0,1]²

Positional Encoding:

Takes a coordinate p and maps it to a higher dimensional space using:

\[\gamma(p) = (\sin(2^0\pi p), \cos(2^0\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p))\]

Each coordinate expands from 1 to 2L dimensions (one sine and one cosine per frequency)
Total encoding dimension = 2 + 2·2·L = 2 + 4L (the two original coordinates plus 2L encoded values for each)
L is the number of frequency levels; the highest frequency used is 2^(L-1)·π
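As a quick check on the dimension count, here is a minimal NumPy sketch of the encoding. The function name `positional_encoding` and the argument `num_freqs` (standing in for L) are my own illustrative names, not taken from the project code, and the original coordinates are simply concatenated in front of the sine/cosine terms.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Encode points p of shape (N, D) into shape (N, D + 2 * D * num_freqs).

    Hypothetical helper: keeps the raw coordinates and appends, for each
    coordinate, sin(2^k * pi * p) and cos(2^k * pi * p) for k = 0..num_freqs-1.
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (L,)
    angles = p[..., None] * freqs                        # (N, D, L)
    # Interleave sin and cos per frequency, matching the order in gamma(p).
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # (N, D, L, 2)
    enc = enc.reshape(p.shape[0], -1)                    # (N, 2 * D * L)
    return np.concatenate([p, enc], axis=-1)

coords = np.random.default_rng(0).uniform(0.0, 1.0, size=(4, 2))
print(positional_encoding(coords, 10).shape)  # 2 + 4 * 10 columns
```

With 2D inputs, L = 5, 10, and 15 give 22, 42, and 62 columns, which matches the input sizes listed for the experiments.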

MLP Architecture:

Input (2 + 2*2*L dims) → Linear(256) → ReLU →
Linear(256) → ReLU →
Linear(256) → ReLU →
Linear(256) → ReLU →
Linear(3) → Sigmoid → Output (RGB)
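To make the layer shapes concrete, here is a rough forward pass of this stack in plain NumPy. It is only a sketch: the helper names `init_mlp` and `mlp_forward`, the He-style random initialization, and the seed are assumptions for illustration, and an actual training run would use an autograd framework rather than this hand-rolled version.

```python
import numpy as np

def init_mlp(in_dim, hidden=256, out_dim=3, seed=0):
    """Random weights for 4 hidden Linear(256) layers plus a Linear(3) head.

    The initialization scheme is an assumption, not the project's.
    """
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, hidden, hidden, out_dim]
    params = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        params.append((w, np.zeros(fan_out)))
    return params

def mlp_forward(params, x):
    """Linear -> ReLU for every hidden layer, then Linear -> Sigmoid."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)       # ReLU
    w, b = params[-1]
    logits = h @ w + b
    return 1.0 / (1.0 + np.exp(-logits))     # Sigmoid keeps RGB in [0, 1]

params = init_mlp(in_dim=42)                 # 42 = 2 + 4 * 10 for L = 10
x = np.random.default_rng(1).uniform(-1.0, 1.0, size=(5, 42))
print(mlp_forward(params, x).shape)          # one RGB triple per input row
```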

Output:

3-dimensional RGB color values ∈ [0,1]³
Sigmoid activation ensures valid color range

Hyperparameters:

Base Configuration (Fox, Mama, Papa photos):

Frequency encoding parameter L = 10 (resulting in 42-dimensional encoded input)
Learning rate = 0.01
Batch size = 10,000
Number of epochs = 300
Optimizer: Adam
Loss function: Mean Squared Error (MSE)

Francesco Photo Experiments:

Configuration 1:

L = 5 (22-dimensional encoded input)
Learning rate = 0.01
Number of epochs = 100

Configuration 2:

L = 10 (42-dimensional encoded input)
Learning rate = 0.005
Number of epochs = 100

Configuration 3:

L = 10 (42-dimensional encoded input)
Learning rate = 0.05
Number of epochs = 100

Configuration 4:

L = 15 (62-dimensional encoded input)
Learning rate = 0.01
Number of epochs = 100

PSNR Metric:

Peak Signal-to-Noise Ratio, for pixel values normalized to [0,1] (peak signal = 1):
\[\mathrm{PSNR} = 10 \cdot \log_{10}(1/\mathrm{MSE})\]
Used to evaluate reconstruction quality (higher is better, reported in dB)
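The metric is a one-liner in practice. The sketch below assumes both images are arrays with values already scaled to [0, 1], so the peak signal is 1; the function name `psnr` is just an illustrative choice.

```python
import numpy as np

def psnr(pred, target):
    """PSNR in dB for images with values in [0, 1] (peak signal assumed to be 1)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(1.0 / mse)

# A uniform error of 0.1 per channel gives MSE = 0.01, i.e. about 20 dB.
target = np.zeros((8, 8, 3))
pred = np.full((8, 8, 3), 0.1)
print(round(psnr(pred, target), 6))
```

Note that a perfect reconstruction has MSE = 0, so this version returns infinity in that case.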

The model learns a continuous function mapping from 2D coordinates to RGB colors, effectively representing the image as a neural field. The positional encoding helps the network learn high-frequency details, while the MLP architecture provides the capacity to learn complex spatial patterns.

Figure: Training images of the fox model (panels: Original Image, Variable Speed Training, PSNR Plot)

Figure: Training images of mama model (panels: Original Image, Variable Speed Training, PSNR Plot)

Figure: Training images of papa model (panels: Original Image, Variable Speed Training, PSNR Plot)

Figure: Training images of Francesco model with different hyperparameters (panels: Original Image; L=5, lr=0.01; L=10, lr=0.005; L=10, lr=0.05; L=15, lr=0.01; a PSNR plot for each configuration; Comparison of Extremes, L=5 vs L=15)

NeRF: From 2D Images to 3D Neural Fields

Let's explore how NeRF transforms a collection of 2D images into a continuous 3D neural representation:

Ray Generation: Creating Our 3D Vision

Just like how our eyes perceive the world through light rays, NeRF starts by simulating this process:

Camera Space to World Space:

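\[X_c = [R|t] X_w \quad\Rightarrow\quad X_w = R^{-1}(X_c - t)\]

Here [R|t] is the world-to-camera extrinsic matrix (rotation R and translation t, applied to X_w in homogeneous coordinates). Inverting it takes points from the camera's view back into world coordinates, which is also why the camera center sits at -R^{-1}t in the world frame.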
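Here is a small sketch of that transform. It assumes [R|t] is the world-to-camera extrinsic, i.e. X_c = R X_w + t, which is the convention under which the ray origin formula -R^{-1}t further down holds; the function name `camera_to_world` is hypothetical.

```python
import numpy as np

def camera_to_world(x_c, R, t):
    """Map camera-space points (N, 3) to world space.

    Assumes the world-to-camera convention x_c = R @ x_w + t,
    so x_w = R^{-1} (x_c - t).
    """
    x_c = np.asarray(x_c, dtype=np.float64)
    return np.linalg.solve(R, (x_c - t).T).T

# Round trip with a 90-degree rotation about z and an arbitrary translation.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])
x_w = np.array([[1.0, 0.0, 0.0], [0.5, -2.0, 4.0]])
x_c = x_w @ R.T + t                      # world -> camera
print(np.allclose(camera_to_world(x_c, R, t), x_w))
```

Using `np.linalg.solve` avoids forming the inverse explicitly; for a pure rotation, multiplying by the transpose of R would work just as well.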

From Pixels to Camera Rays:

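\[x_c = s \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\]

Each pixel (u, v) is back-projected to a point at depth s along its ray in the camera's view, where K is the intrinsic matrix capturing the camera's internal properties (focal length and principal point).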
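A minimal version of this back-projection might look like the following. The intrinsics used in the example (focal length 100, principal point at (50, 50)) are made-up values, and `pixel_to_camera` is an illustrative name.

```python
import numpy as np

def pixel_to_camera(K, uv, s):
    """Back-project pixels uv (N, 2) to camera-space points at depth s.

    Implements x_c = s * K^{-1} [u, v, 1]^T for each pixel.
    """
    uv = np.asarray(uv, dtype=np.float64)
    uv1 = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=1)  # (N, 3)
    dirs = np.linalg.solve(K, uv1.T).T                             # K^{-1} [u, v, 1]
    return dirs * np.asarray(s, dtype=np.float64).reshape(-1, 1)

# Hypothetical intrinsics, for illustration only.
K = np.array([[100.0,   0.0, 50.0],
              [  0.0, 100.0, 50.0],
              [  0.0,   0.0,  1.0]])
uv = np.array([[50.0, 50.0], [150.0, 50.0]])
print(pixel_to_camera(K, uv, 2.0))  # principal point lands on the optical axis
```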

Creating View Rays:

Ray Origin: \[r_o = -R^{-1}t\]
Ray Direction: \[r_d = \frac{X_w - r_o}{\lVert X_w - r_o \rVert}\]
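Putting the origin and direction formulas together, one way to go from pixels to rays is sketched below. It assumes R and t form the world-to-camera extrinsic, back-projects each pixel at depth s = 1 to get a world point X_w, and uses the hypothetical name `pixel_to_ray`; the example intrinsics are made up.

```python
import numpy as np

def pixel_to_ray(K, R, t, uv):
    """Return ray origins and unit directions in world space for pixels uv (N, 2).

    Assumes x_c = R @ x_w + t (world-to-camera), so the camera center is -R^{-1} t.
    """
    uv = np.asarray(uv, dtype=np.float64)
    uv1 = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=1)
    x_c = np.linalg.solve(K, uv1.T).T              # camera-space point at depth s = 1
    x_w = np.linalg.solve(R, (x_c - t).T).T        # same point in world space
    r_o = -np.linalg.solve(R, t)                   # camera center in world space
    d = x_w - r_o
    r_d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return np.broadcast_to(r_o, r_d.shape).copy(), r_d

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, -4.0])                     # camera center ends up at z = 4
r_o, r_d = pixel_to_ray(K, R, t, np.array([[50.0, 50.0], [20.0, 80.0]]))
print(r_o[0], r_d[0])
```

All rays from one image share the same origin, so only the directions vary per pixel.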

Sampling: Exploring the 3D Space

With our rays defined, we now sample points along them to explore the 3D space:

Strategic Ray Selection:

We collect rays from all training images
Each ray carries information about its origin, direction, and the true color it should see
During training, we randomly sample these rays to learn different viewpoints
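The random ray selection can be sketched as a simple index draw over the pooled rays. This assumes the rays from all images have already been flattened into arrays of shape (N, 3); the function name `sample_ray_batch` and the choice to sample without replacement are assumptions.

```python
import numpy as np

def sample_ray_batch(rays_o, rays_d, colors, batch_size, rng):
    """Pick a random batch of rays together with their ground-truth colors."""
    n = rays_o.shape[0]
    # Sample without replacement unless the batch is larger than the pool.
    idx = rng.choice(n, size=batch_size, replace=batch_size > n)
    return rays_o[idx], rays_d[idx], colors[idx]

rng = np.random.default_rng(0)
rays_o = rng.normal(size=(1000, 3))
rays_d = rng.normal(size=(1000, 3))
colors = rng.uniform(size=(1000, 3))
o, d, c = sample_ray_batch(rays_o, rays_d, colors, 128, rng)
print(o.shape, d.shape, c.shape)
```

Indexing all three arrays with the same `idx` keeps each ray paired with the pixel color it should reproduce.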

Point Sampling Strategy:

Along each ray, we sample points to understand what the ray encounters:

\[x = r_o + r_d \cdot t\]

where the sample distance t (a scalar along the ray, not the translation vector used earlier) runs from near = 2.0 to far = 6.0, covering the depth range rendered along each ray
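A minimal sketch of the point sampling, using evenly spaced distances between near and far. The number of samples per ray (64 here) is my assumption, since the write-up does not state it, and `sample_along_rays` is an illustrative name.

```python
import numpy as np

def sample_along_rays(r_o, r_d, near=2.0, far=6.0, n_samples=64):
    """Sample points x = r_o + r_d * t at evenly spaced t in [near, far].

    r_o, r_d: (N, 3). Returns points (N, n_samples, 3) and t values (n_samples,).
    n_samples = 64 is an assumed default, not a value from the write-up.
    """
    t_vals = np.linspace(near, far, n_samples)
    points = r_o[:, None, :] + r_d[:, None, :] * t_vals[None, :, None]
    return points, t_vals

r_o = np.zeros((2, 3))
r_d = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
points, t_vals = sample_along_rays(r_o, r_d)
print(points.shape, t_vals[0], t_vals[-1])
```

A common refinement is to jitter the t values randomly during training so the network sees more than a fixed set of depths; that is left out here to keep the sketch short.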

Neural Network: Learning the Scene

The heart of NeRF is its neural network that learns to represent the 3D scene:

Input Processing:

Location in 3D space (x,y,z)
Viewing angle (θ,φ)
Enhanced through positional encoding:

\[\gamma(p) = (\sin(2^0\pi p), \cos(2^0\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p))\]

Network Architecture:

Deep network with 8 layers processing spatial information
Skip connections to maintain fine spatial details
View direction influences final color prediction
Produces color (RGB) and density (σ) for each point

Volume Rendering: Creating the Final Image

Finally, we combine all points along each ray to produce the final color:

\[C(r) = \sum_{i=1}^N T_i(1 - \exp(-\sigma_i\delta_i))c_i\]

where:

\[T_i = \exp(-\sum_{j=1}^{i-1} \sigma_j\delta_j)\] is the transmittance, i.e. how much light survives the points in front of sample i
σᵢ is the density, which tells us how solid each point is
δᵢ is the distance between adjacent samples along the ray
cᵢ is the color at each point
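The sum above maps directly onto a cumulative sum. The sketch below is a NumPy version under the assumption that densities, colors, and sample spacings (the distance δ between adjacent samples on a ray) are already available as arrays; `volume_render` is a hypothetical name.

```python
import numpy as np

def volume_render(sigmas, rgbs, deltas):
    """Composite colors along rays.

    sigmas: (R, S) densities, rgbs: (R, S, 3) colors,
    deltas: scalar or (R, S) spacing between adjacent samples.
    Returns colors (R, 3) and per-sample weights (R, S).
    """
    sigma_delta = np.asarray(sigmas, dtype=np.float64) * deltas
    alpha = 1.0 - np.exp(-sigma_delta)                   # opacity of each segment
    accum = np.cumsum(sigma_delta, axis=1)
    # Exclusive cumulative sum: T_i only counts samples j < i.
    accum = np.concatenate([np.zeros_like(accum[:, :1]), accum[:, :-1]], axis=1)
    transmittance = np.exp(-accum)                       # T_i
    weights = transmittance * alpha
    colors = np.sum(weights[..., None] * rgbs, axis=1)
    return colors, weights

# Two samples with unit density and spacing, all-white colors:
# C = (1 - e^-1) + e^-1 (1 - e^-1) = 1 - e^-2.
colors, weights = volume_render(np.ones((1, 2)), np.ones((1, 2, 3)), 1.0)
print(colors, 1.0 - np.exp(-2.0))
```

The same weights can be reused with sample depths in place of colors to get an expected depth per ray.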

Training Process:

Process 10,000 rays in each training batch
Use Adam optimizer with 5e-4 learning rate
Reach about 23 dB PSNR after 1000 steps

Through this process, NeRF learns to transform a set of 2D images into a rich 3D representation that can generate novel views of the scene from any angle.

Figure: Ray visualizations (view 1, view 2) and renders (Camera View 1, Camera View 2)

Network Optimization Progress

Figure: NeRF reconstructions at 300, 1000, and 3000 training steps

PSNR Plots

Figure: PSNR plots for the 300-, 1000-, and 3000-step runs

Bells and Whistles: Depth Reconstruction

Figure: Depth maps (grayscale and RGB)
