NeRF

Implementation of Vanilla NeRF

University of California, Berkeley
NeRF Rotation
Model Architecture

Here's a detailed description of the model's architecture and hyperparameters:

Neural Field Model Architecture

Input:

2D coordinate (x, y) ∈ [0,1]²

Positional Encoding:

Takes a coordinate p and maps it to a higher dimensional space using:

\[\gamma(p) = (\sin(2^0\pi p), \cos(2^0\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p))\]

  • Input dimension for each coordinate increases from 1 to 2L
  • Total encoding dimension = 2 + 2·2·L (the 2 original coordinates plus 2L encoded dimensions for each of the 2 coordinates)
  • L is the number of frequency levels; the highest frequency used is 2^{L-1}
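The encoding above takes only a few lines of NumPy; a minimal sketch (the function name and shapes are illustrative):

```python
import numpy as np

def positional_encoding(p, L=10):
    """Map coordinates p of shape (N, d) to shape (N, d + 2*L*d):
    the raw coordinates followed by sin/cos pairs at frequencies
    2^0 * pi ... 2^(L-1) * pi."""
    feats = [p]
    for i in range(L):
        feats.append(np.sin(2**i * np.pi * p))
        feats.append(np.cos(2**i * np.pi * p))
    return np.concatenate(feats, axis=-1)

xy = np.random.rand(4, 2)            # 2D pixel coordinates in [0, 1]^2
enc = positional_encoding(xy, L=10)
print(enc.shape)                     # (4, 42) = 2 + 2*2*10
```

With L = 10 this reproduces the 42-dimensional input used below; L = 5 and L = 15 give 22 and 62 dimensions respectively.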

MLP Architecture:

Input (2 + 2*2*L dims) → Linear(256) → ReLU →
Linear(256) → ReLU →
Linear(256) → ReLU →
Linear(256) → ReLU →
Linear(3) → Sigmoid → Output (RGB)

Output:

3-dimensional RGB color values ∈ [0,1]³
Sigmoid activation ensures valid color range
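A minimal NumPy sketch of this forward pass, with randomly initialized weights purely for illustration (the real model is trained; only the layer shapes and activations follow the diagram above):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer widths matching the diagram above (L = 10 -> 42-dim encoded input).
dims = [42, 256, 256, 256, 256, 3]
weights = [rng.normal(0.0, 0.05, (m, n)) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

def mlp_forward(x):
    """Four ReLU hidden layers, then a sigmoid head into [0, 1]^3 (RGB)."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return sigmoid(x @ weights[-1] + biases[-1])

rgb = mlp_forward(rng.random((5, 42)))
print(rgb.shape)                     # (5, 3), every entry in (0, 1)
```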

Hyperparameters:

Base Configuration (Fox, Mama, Papa photos):

  • Frequency encoding parameter L = 10 (resulting in 42-dimensional encoded input)
  • Learning rate = 0.01
  • Batch size = 10,000
  • Number of epochs = 300
  • Optimizer: Adam
  • Loss function: Mean Squared Error (MSE)

Francesco Photo Experiments:

Configuration 1:
  • L = 5 (22-dimensional encoded input)
  • Learning rate = 0.01
  • Number of epochs = 100
Configuration 2:
  • L = 10 (42-dimensional encoded input)
  • Learning rate = 0.005
  • Number of epochs = 100
Configuration 3:
  • L = 10 (42-dimensional encoded input)
  • Learning rate = 0.05
  • Number of epochs = 100
Configuration 4:
  • L = 15 (62-dimensional encoded input)
  • Learning rate = 0.01
  • Number of epochs = 100

PSNR Metric:

Peak Signal-to-Noise Ratio calculation:
\[PSNR = 10 \cdot \log_{10}(1/MSE)\]
Used to evaluate reconstruction quality; pixel values are normalized to [0, 1], so the peak signal is 1.
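For images normalized to [0, 1] the metric is a one-liner; a small sketch:

```python
import numpy as np

def psnr(pred, target):
    """PSNR in dB for images with pixel values in [0, 1] (peak signal 1)."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(1.0 / mse)

target = np.full((8, 8, 3), 0.5)
noisy = target + 0.1                  # uniform error -> MSE = 0.01
print(round(psnr(noisy, target), 2))  # 20.0
```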

The model learns a continuous function mapping from 2D coordinates to RGB colors, effectively representing the image as a neural field. The positional encoding helps the network learn high-frequency details, while the MLP architecture provides the capacity to learn complex spatial patterns.

[Figures: original image, variable-speed training progression, and PSNR plot]

Training images of the fox model

[Figures: original image, variable-speed training progression, and PSNR plot]

Training images of the mama model

[Figures: original image, variable-speed training progression, and PSNR plot]

Training images of the papa model

[Figures: original image; reconstructions and PSNR plots for L=5 (lr=0.01), L=10 (lr=0.005), L=10 (lr=0.05), and L=15 (lr=0.01); side-by-side comparison of the extremes (L=5 vs L=15)]

Training images of the Francesco model with different hyperparameters

NeRF: From 2D Images to 3D Neural Fields

Let's explore how NeRF transforms a collection of 2D images into a continuous 3D neural representation:

Ray Generation: Creating Our 3D Vision

Just as our eyes perceive the world through light rays, NeRF starts by simulating this process:

Camera Space to World Space:

\[X_c = [R \mid t]\, X_w \quad \Leftrightarrow \quad X_w = [R \mid t]^{-1} X_c\]

Here [R|t] is the world-to-camera extrinsic matrix, encoding how the camera is positioned and oriented in the world; inverting it transforms points from camera coordinates back into world coordinates. This convention also gives the camera center r_o = -R⁻¹t used below.

From Pixels to Camera Rays:

\[x_c = s \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\]

Each pixel (u, v) is back-projected into the camera's frame, where K is the intrinsic matrix capturing the camera's internal properties (focal lengths and principal point) and s is the depth along the optical axis.

Creating View Rays:

  • Ray Origin: \[r_o = -R^{-1}t\]
  • Ray Direction: \[r_d = normalize(X_w - r_o)\]
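Putting the steps above together, a NumPy sketch of per-pixel ray generation, assuming the world-to-camera convention X_c = R X_w + t (which is what makes the camera center -R⁻¹t); the toy intrinsics are purely illustrative:

```python
import numpy as np

def pixel_to_ray(u, v, K, R, t, s=1.0):
    """Ray origin and direction in world space for pixel (u, v).

    Assumes the world-to-camera convention X_c = R @ X_w + t,
    so the camera center (the ray origin) is -R^-1 @ t."""
    R_inv = np.linalg.inv(R)
    r_o = -R_inv @ t                                      # camera center in world
    x_c = s * (np.linalg.inv(K) @ np.array([u, v, 1.0]))  # pixel -> camera
    x_w = R_inv @ (x_c - t)                               # camera -> world
    r_d = (x_w - r_o) / np.linalg.norm(x_w - r_o)         # unit direction
    return r_o, r_d

# Toy intrinsics: focal length 100, principal point at (50, 50).
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
r_o, r_d = pixel_to_ray(50, 50, K, np.eye(3), np.zeros(3))
print(r_o, r_d)   # camera at the origin, ray straight through the principal point
```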

Sampling: Exploring the 3D Space

With our rays defined, we now sample points along them to explore the 3D space:

Strategic Ray Selection:

  • We collect rays from all training images
  • Each ray carries information about its origin, direction, and the true color it should see
  • During training, we randomly sample these rays to learn different viewpoints

Point Sampling Strategy:

Along each ray, we sample points to understand what the ray encounters:

\[x = r_o + r_d \cdot t\]

where t spans from the near bound (t = 2.0) to the far bound (t = 6.0), giving us a comprehensive view of the space the ray traverses.
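This sampling step is a few lines of NumPy; the stratified perturbation flag below is an assumption about the implementation, included because jittering samples within their bins is the common way to avoid only ever querying a fixed set of depths:

```python
import numpy as np

def sample_points(r_o, r_d, near=2.0, far=6.0, n_samples=64, perturb=False):
    """Sample 3D points x = r_o + r_d * t for t in [near, far].

    With perturb=True each sample gets a small random offset within its
    bin (stratified sampling)."""
    t = np.linspace(near, far, n_samples)
    if perturb:
        t = t + np.random.rand(n_samples) * (far - near) / n_samples
    points = r_o[None, :] + t[:, None] * r_d[None, :]   # (n_samples, 3)
    return points, t

pts, t = sample_points(np.zeros(3), np.array([0.0, 0.0, 1.0]))
print(pts[0], pts[-1])   # first sample at depth 2.0, last at depth 6.0
```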

Neural Network: Learning the Scene

The heart of NeRF is its neural network that learns to represent the 3D scene:

Input Processing:

  • Location in 3D space (x, y, z)
  • Viewing angle (θ, φ)
  • Both enhanced through positional encoding:

\[\gamma(p) = (\sin(2^0\pi p), \cos(2^0\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p))\]

Network Architecture:

  • Deep network with 8 layers processing spatial information
  • Skip connections to maintain fine spatial details
  • View direction influences final color prediction
  • Produces color (RGB) and density (σ) for each point
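A NumPy sketch of such a network, with the skip connection re-injecting the encoded position midway. The layer width and encoding dimensions (63 for position with L = 10, 27 for direction with L = 4, raw coordinates included) are common NeRF settings and are assumptions here, as are the random, untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def dense(m, n):
    """Random weight matrix and zero bias; illustrative, not trained."""
    return rng.normal(0.0, 0.05, (m, n)), np.zeros(n)

D_POS, D_DIR, W = 63, 27, 256   # encoded position / direction dims, layer width
layers = ([dense(D_POS, W)] + [dense(W, W) for _ in range(3)]
          + [dense(W + D_POS, W)] + [dense(W, W) for _ in range(3)])
sigma_head = dense(W, 1)        # density depends on position only
feat_head = dense(W, W)
rgb_head = dense(W + D_DIR, 3)  # color also sees the view direction

def nerf_forward(x_enc, d_enc):
    h = x_enc
    for i, (Wm, b) in enumerate(layers):
        if i == 4:                                   # skip connection:
            h = np.concatenate([h, x_enc], axis=-1)  # re-inject the input
        h = relu(h @ Wm + b)
    sigma = relu(h @ sigma_head[0] + sigma_head[1])  # non-negative density
    feat = h @ feat_head[0] + feat_head[1]
    rgb = sigmoid(np.concatenate([feat, d_enc], axis=-1) @ rgb_head[0]
                  + rgb_head[1])                     # RGB in (0, 1)
    return rgb, sigma

rgb, sigma = nerf_forward(rng.random((2, D_POS)), rng.random((2, D_DIR)))
print(rgb.shape, sigma.shape)   # (2, 3) (2, 1)
```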

Volume Rendering: Creating the Final Image

Finally, we combine all points along each ray to produce the final color:

\[C(r) = \sum_{i=1}^N T_i\,(1 - \exp(-\sigma_i\delta_i))\,c_i\]

where:

  • \[T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j\delta_j\right)\] is the accumulated transmittance, i.e., how much the previous points along the ray block the view
  • σᵢ tells us how solid each point is
  • cᵢ is the color at each point, and δᵢ is the distance between adjacent samples
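The rendering equation maps directly to code; a NumPy sketch for a single ray, with δᵢ the spacing between adjacent samples:

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """Composite one ray: sigmas (N,), colors (N, 3), deltas (N,) are the
    densities, colors, and sample spacings of the N points along the ray."""
    alpha = 1.0 - np.exp(-sigmas * deltas)              # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): accumulated transmittance.
    T = np.exp(-np.cumsum(np.concatenate([[0.0], sigmas * deltas]))[:-1])
    weights = T * alpha
    return (weights[:, None] * colors).sum(axis=0)

# A single nearly opaque red sample should dominate the rendered color.
sigmas = np.array([0.0, 50.0, 0.0])
colors = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
deltas = np.ones(3)
print(volume_render(sigmas, colors, deltas))   # ~[1, 0, 0]
```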

Training Process:

  • Process 10,000 rays in each training batch
  • Use Adam optimizer with 5e-4 learning rate
  • Achieve high-quality results (~23 PSNR) after 1000 steps

Through this process, NeRF learns to transform a set of 2D images into a rich 3D representation that can generate novel views of the scene from any angle.

Ray visualization - view 1
Ray visualization - view 2
Camera View 1 Render
Camera View 2 Render
Novel View 1 Render
Novel View 2 Render

Network Optimization Progress

300 Training Steps
NeRF Optimization 300 Steps
NeRF Reconstruction at 300 Steps
NeRF Reconstruction 300 Steps
1000 Training Steps
NeRF Optimization 1000 Steps
NeRF Reconstruction at 1000 Steps
NeRF Reconstruction 1000 Steps
3000 Training Steps
NeRF Optimization 3000 Steps
NeRF Reconstruction at 3000 Steps
NeRF Reconstruction 3000 Steps

PSNR Plots

300 Steps
PSNR Plot 300 Steps
1000 Steps
PSNR Plot 1000 Steps
3000 Steps
PSNR Plot 3000 Steps

Bells and Whistles: Depth Reconstruction

Depth Map (Grayscale)
NeRF Depth Reconstruction Grayscale
Depth Map (RGB)
NeRF Depth Reconstruction RGB

A NeRF reconstruction at a high level of fidelity