Here's a detailed description of the model's architecture and hyperparameters:
Input: 2D coordinate (x, y) ∈ [0,1]²
Positional encoding: takes each coordinate p and maps it to a higher-dimensional space using:
\[\gamma(p) = \left(\sin(2^0\pi p), \cos(2^0\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\right)\]
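The encoding above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the function name `positional_encoding`, the parameter `num_freqs` (standing in for L), and the choice of L = 10 in the example are assumptions. The raw coordinates are kept in front of the sine and cosine features, which is consistent with the 2 + 2*2*L input width of the network.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Map coordinates p of shape (N, D) to shape (N, D + 2 * D * num_freqs)."""
    p = np.asarray(p, dtype=np.float64)
    features = [p]  # keep the raw coordinates alongside the encoded ones
    for k in range(num_freqs):
        angle = (2.0 ** k) * np.pi * p  # 2^k * pi * p, applied per coordinate
        features.append(np.sin(angle))
        features.append(np.cos(angle))
    return np.concatenate(features, axis=-1)

# Two sample 2D coordinates; L = 10 is an assumed value for illustration.
coords = np.array([[0.0, 0.0], [0.5, 0.25]])
encoded = positional_encoding(coords, num_freqs=10)
print(encoded.shape)  # (2, 42), i.e. 2 + 2*2*10 features per coordinate
```

Features are grouped per frequency here (all sines, then all cosines). The ordering does not matter to the MLP as long as it is the same at training and inference time.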
Input (2 + 2*2*L dims) → Linear(256) → ReLU → Linear(256) → ReLU → Linear(256) → ReLU → Linear(256) → ReLU → Linear(3) → Sigmoid → Output (RGB)
Output: 3-dimensional RGB color values ∈ [0,1]³. The sigmoid activation ensures a valid color range.
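To make the layer stack concrete, here is a hedged forward-pass sketch in plain NumPy with randomly initialized weights. The helper names `init_mlp` and `mlp_forward`, the He-style initialization, and L = 10 are assumptions made for illustration. A real training setup would normally use a deep learning framework with automatic differentiation.

```python
import numpy as np

def init_mlp(in_dim, hidden=256, out_dim=3, num_hidden=4, seed=0):
    """Create (weight, bias) pairs for 4 hidden layers of width 256 plus an RGB head."""
    rng = np.random.default_rng(seed)
    sizes = [in_dim] + [hidden] * num_hidden + [out_dim]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        params.append((w, np.zeros(fan_out)))
    return params

def mlp_forward(params, x):
    """Linear -> ReLU for every hidden layer, then Linear -> Sigmoid."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    logits = h @ w + b
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps each channel in [0, 1]

L = 10  # assumed number of encoding frequencies
in_dim = 2 + 2 * 2 * L
params = init_mlp(in_dim)
x = np.random.default_rng(1).uniform(size=(5, in_dim))  # stand-in for encoded coordinates
rgb = mlp_forward(params, x)
print(rgb.shape)  # (5, 3)
```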
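Peak Signal-to-Noise Ratio (PSNR) is used to evaluate reconstruction quality: \[\mathrm{PSNR} = 10 \cdot \log_{10}(1/\mathrm{MSE})\] Here MSE is the mean squared error between predicted and target pixel values. The numerator is 1 because colors are normalized to [0, 1], so the peak signal value is 1.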
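The metric is short enough to write out directly. The sketch below assumes both images are arrays with values in [0, 1]. The function name `psnr` is illustrative.

```python
import numpy as np

def psnr(pred, target):
    """PSNR in dB for images with values in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    # Note: identical images give mse == 0, for which PSNR is infinite.
    return 10.0 * np.log10(1.0 / mse)

# A uniform error of 0.1 gives MSE = 0.01, so PSNR is about 20 dB.
pred = np.full((4, 4, 3), 0.5)
target = np.full((4, 4, 3), 0.6)
print(psnr(pred, target))
```

Because the scale is logarithmic, every tenfold reduction in MSE adds 10 dB.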
The model learns a continuous function mapping from 2D coordinates to RGB colors, effectively representing the image as a neural field. The positional encoding helps the network learn high-frequency details, while the MLP architecture provides the capacity to learn complex spatial patterns.
Let's explore how NeRF transforms a collection of 2D images into a continuous 3D neural representation:
Just like how our eyes perceive the world through light rays, NeRF starts by simulating this process:
\[X_w = [R|t] X_c\]
We first transform the camera's view into world coordinates, where [R|t] represents how the camera is positioned and oriented in the world.
\[x_c = K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} s\]
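Each pixel (u, v) is converted into a ray in the camera's frame, where K is the intrinsic matrix capturing the camera's internal properties (focal length and principal point) and s is the depth of the point along the camera's optical axis.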
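The two transforms can be combined into a small ray-generation routine. This is a sketch under stated assumptions: [R|t] is treated as a camera-to-world matrix (so the camera center in world space is t), depth s is set to 1 to obtain a point on each ray, and no half-pixel offset or axis-convention flip is applied. The function name `pixel_to_ray` is hypothetical.

```python
import numpy as np

def pixel_to_ray(K, c2w, uv):
    """Return world-space ray origins and unit directions for pixels uv of shape (N, 2)."""
    uv = np.asarray(uv, dtype=np.float64)
    pix_h = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=-1)  # (N, 3) homogeneous
    x_c = (np.linalg.inv(K) @ pix_h.T).T       # x_c = K^{-1} [u, v, 1]^T with s = 1
    R, t = c2w[:3, :3], c2w[:3, 3]
    x_w = x_c @ R.T + t                        # X_w = [R|t] X_c
    r_o = np.broadcast_to(t, x_w.shape).copy() # camera center, shared by all rays
    r_d = x_w - r_o
    r_d = r_d / np.linalg.norm(r_d, axis=-1, keepdims=True)
    return r_o, r_d

# Example intrinsics: focal length 100 px, principal point at (50, 50).
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
c2w = np.eye(4)  # camera placed at the world origin, no rotation
r_o, r_d = pixel_to_ray(K, c2w, [[50.0, 50.0], [150.0, 50.0]])
print(r_d)  # the first ray points straight along +z through the principal point
```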
With our rays defined, we sample points along each one to find out what it encounters in 3D space:
\[x = r_o + r_d \cdot t\]
where r_o is the ray origin, r_d is the ray direction, and t ranges from the near bound (2.0) to the far bound (6.0), so the samples cover the full depth range of the scene.
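A minimal sketch of the sampling step, using evenly spaced depths between the near and far bounds. The function name `sample_along_rays` and the sample count of 64 are assumptions. Training code often also jitters the depths randomly within each interval, which is omitted here for clarity.

```python
import numpy as np

def sample_along_rays(r_o, r_d, near=2.0, far=6.0, n_samples=64):
    """Return points of shape (N, S, 3) and depths t of shape (N, S)."""
    t = np.linspace(near, far, n_samples)                      # (S,) evenly spaced depths
    t = np.broadcast_to(t, (r_o.shape[0], n_samples)).copy()   # same depths for every ray
    points = r_o[:, None, :] + r_d[:, None, :] * t[..., None]  # x = r_o + r_d * t
    return points, t

# One ray from the origin along +z, sampled at 5 depths: t = 2, 3, 4, 5, 6.
r_o = np.zeros((1, 3))
r_d = np.array([[0.0, 0.0, 1.0]])
points, t = sample_along_rays(r_o, r_d, n_samples=5)
print(points.shape)  # (1, 5, 3)
```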
The heart of NeRF is its neural network that learns to represent the 3D scene:
Finally, we combine all points along each ray to produce the final color:
\[C(r) = \sum_{i=1}^N T_i \left(1 - \exp(-\sigma_i\delta_i)\right) c_i\]
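where, following the standard NeRF formulation, \(T_i = \exp\left(-\sum_{j<i} \sigma_j \delta_j\right)\) is the accumulated transmittance (the fraction of light that reaches sample i without being absorbed earlier along the ray), \(\sigma_i\) is the density predicted at sample i, \(\delta_i\) is the distance between adjacent samples, and \(c_i\) is the color predicted at sample i.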
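The rendering sum can be sketched with NumPy as follows. It assumes per-sample densities, colors, and segment lengths are already available as arrays, and it computes T_i as an exclusive cumulative sum of σ_j δ_j along each ray. The function name `volume_render` is illustrative.

```python
import numpy as np

def volume_render(sigmas, rgbs, deltas):
    """sigmas: (N, S), rgbs: (N, S, 3), deltas: (N, S) -> ray colors (N, 3)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)          # opacity of each segment
    accum = np.cumsum(sigmas * deltas, axis=-1)
    # Shift right by one so that sample i only sees the samples in front of it (j < i).
    accum = np.concatenate([np.zeros_like(accum[:, :1]), accum[:, :-1]], axis=-1)
    transmittance = np.exp(-accum)                   # T_i
    weights = transmittance * alphas                 # contribution of each sample
    return np.sum(weights[..., None] * rgbs, axis=1)

# One ray, two samples: empty space (red) in front of an opaque surface (green).
sigmas = np.array([[0.0, 1e3]])
deltas = np.ones((1, 2))
rgbs = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
print(volume_render(sigmas, rgbs, deltas))  # approximately [[0, 1, 0]]: only green is visible
```

The per-sample weights sum to at most 1 along a ray. Whatever is left over corresponds to light that passes through every sample, which is why rays through empty space render as black here.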
Through this process, NeRF learns to transform a set of 2D images into a rich 3D representation that can generate novel views of the scene from any angle.