

Occupancy Networks and Neural Fields

Representing 3D geometry and appearance as continuous functions parameterized by neural networks: NeRF, occupancy networks, DeepSDF, volume rendering, and the connection to Gaussian splatting.


Why This Matters

Traditional 3D representations (meshes, voxel grids, point clouds) are discrete and fixed-resolution. Neural fields represent 3D geometry and appearance as continuous functions parameterized by neural networks. This allows querying the scene at arbitrary resolution and learning 3D structure directly from 2D images.

NeRF (Neural Radiance Fields) demonstrated that a simple MLP can represent complex scenes with photorealistic quality, trained only from posed photographs. This opened new directions in 3D reconstruction, view synthesis, and scene understanding.

Mental Model

A neural field is a function $f_\theta: \mathbb{R}^n \to \mathbb{R}^m$ where the input is a coordinate (position in space, or position plus viewing direction) and the output is a property at that coordinate (color, density, occupancy, signed distance). The network parameters $\theta$ encode the entire scene. Querying the function at a new coordinate gives you the scene property at that point.

Neural Radiance Fields (NeRF)

Definition

Neural Radiance Field

A NeRF represents a scene as a continuous function:

$$F_\theta: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$$

where $\mathbf{x} = (x, y, z)$ is 3D position, $\mathbf{d} = (\theta, \phi)$ is viewing direction, $\mathbf{c} = (r, g, b)$ is emitted color, and $\sigma \geq 0$ is volume density. The density $\sigma$ depends only on position (geometry is view-independent), while color depends on both position and direction (capturing view-dependent effects like specular highlights).

The network architecture is a simple MLP with positional encoding. The input coordinates $\mathbf{x}$ are mapped through sinusoidal functions at multiple frequencies before being fed to the network:

$$\gamma(p) = (\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p))$$

This positional encoding lets the MLP represent high-frequency spatial detail that it would otherwise smooth over (due to the spectral bias of MLPs toward low-frequency functions).
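The encoding is straightforward to implement. Below is a minimal NumPy sketch (the function name and default $L = 10$ are illustrative choices, not prescribed by the text):

```python
import numpy as np

def positional_encoding(p, num_freqs=10):
    """Map coordinates to sinusoids at frequencies 2^0 * pi ... 2^(L-1) * pi.

    p: array of coordinates (any shape); returns shape p.shape + (2 * num_freqs,).
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi   # 2^k * pi for k = 0..L-1
    angles = p[..., None] * freqs                   # broadcast over frequencies
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# A single coordinate becomes a 2L-dimensional feature vector.
features = positional_encoding(0.5, num_freqs=10)
print(features.shape)  # (20,)
```

Each input dimension is expanded into $2L$ features, so a 3D position with $L = 10$ becomes a 60-dimensional input to the MLP.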

Volume Rendering

Proposition

Volume Rendering for Neural Radiance Fields

Statement

The expected color $C(\mathbf{r})$ of a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \, \sigma(\mathbf{r}(t)) \, \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt$$

where $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds\right)$ is the accumulated transmittance from the near plane $t_n$ to point $t$. The product $T(t) \, \sigma(\mathbf{r}(t))$ gives the probability density that the ray terminates at $t$.

Intuition

A ray travels through space, accumulating color from each point weighted by two factors: how dense the material is at that point ($\sigma$) and how much light has already been blocked before reaching that point ($T$). Dense regions contribute more color. Regions behind opaque surfaces contribute nothing because $T$ is near zero.

Proof Sketch

Model light transport as a 1D absorption-emission process along the ray. The transmittance $T(t)$ satisfies $dT/dt = -\sigma(t)\,T(t)$, giving the exponential form. The color integral follows from summing the emitted radiance at each point, weighted by the probability of the ray reaching that point and being absorbed there.

Why It Matters

This equation is differentiable with respect to $\sigma$ and $\mathbf{c}$, which are outputs of the neural network. By comparing the rendered pixel color $C(\mathbf{r})$ to the observed pixel color in a training image, you can backpropagate through the volume rendering integral to train the NeRF. The only supervision needed is posed 2D images.

Failure Mode

The integral is approximated by quadrature (summing over discrete samples along the ray). Too few samples produce aliasing and miss thin structures. Too many samples are computationally expensive. Hierarchical sampling (coarse then fine) mitigates this but does not eliminate it. Training also requires accurate camera poses; errors in pose estimation produce blurry reconstructions.
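The coarse-to-fine step mentioned above is ordinary importance sampling: the coarse pass yields a weight per bin along the ray, and fine samples are drawn from the resulting piecewise-constant PDF by inverse-transform sampling. A minimal NumPy sketch (the function name and the toy bin weights are illustrative):

```python
import numpy as np

def importance_sample(bin_edges, weights, n_samples, rng):
    """Draw fine samples from the piecewise-constant PDF defined by coarse
    per-bin weights, via inverse-transform sampling over the CDF."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.random(n_samples)
    idx = np.searchsorted(cdf, u, side="right") - 1   # bin containing each u
    idx = np.clip(idx, 0, len(weights) - 1)
    # Place each sample uniformly within its selected bin
    frac = (u - cdf[idx]) / np.maximum(pdf[idx], 1e-10)
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 5)            # 4 coarse bins along the ray
weights = np.array([0.05, 0.05, 0.8, 0.1])  # coarse pass found a surface in bin 3
t_fine = importance_sample(edges, weights, 128, rng)
print(t_fine.min(), t_fine.max())
```

Most of the 128 fine samples land in the high-weight bin, concentrating network evaluations near the surface.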

In practice, the integral is approximated as:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \, (1 - \exp(-\sigma_i \delta_i)) \, \mathbf{c}_i$$

where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples and $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$.
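This quadrature sum is a few lines of code. A minimal NumPy sketch for a single ray (the function name and toy inputs are illustrative):

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discrete volume rendering:
    C_hat = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i.

    sigmas: (N,) densities, colors: (N, 3) RGB, deltas: (N,) sample spacings.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)   # per-segment opacity
    # T_i = exp(-sum_{j<i} sigma_j delta_j): transmittance before sample i
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

# One dense red sample behind empty space: the ray color is nearly pure red.
sigmas = np.array([0.0, 50.0])
colors = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
deltas = np.array([0.1, 0.1])
color = render_ray(sigmas, colors, deltas)
print(color)
```

The `weights` vector is the same quantity used for hierarchical sampling: it tells you where along the ray the scene content actually is.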

Occupancy Networks

Definition

Occupancy Network

An occupancy network represents a 3D surface as the decision boundary of a classifier:

$$f_\theta: \mathbb{R}^3 \to [0, 1]$$

where $f_\theta(\mathbf{x})$ is the probability that point $\mathbf{x}$ is inside the object. The surface is the level set $\{\mathbf{x} : f_\theta(\mathbf{x}) = 0.5\}$.

The surface can be extracted at any resolution using marching cubes on a grid of query points. Unlike voxel grids, the resolution is limited only by the density of the query grid, not by the representation itself.
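The extraction pipeline is: query the field on a grid of chosen resolution, then run marching cubes at the 0.5 level set. A minimal NumPy sketch, using a hypothetical analytic occupancy function (a soft sphere) in place of a trained network:

```python
import numpy as np

def occupancy_sphere(points, radius=0.5):
    """Toy stand-in for a trained occupancy network f_theta: probability
    that a point lies inside a sphere of the given radius."""
    d = np.linalg.norm(points, axis=-1)
    return 1.0 / (1.0 + np.exp(20.0 * (d - radius)))  # sigmoid across the surface

# Query the field on a dense grid; the resolution is chosen at extraction
# time, independent of the representation itself.
n = 32
axis = np.linspace(-1.0, 1.0, n)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
occ = occupancy_sphere(grid.reshape(-1, 3)).reshape(n, n, n)

# Marching cubes at level 0.5 (e.g. skimage.measure.marching_cubes) would
# extract a triangle mesh from this occupancy grid.
inside = occ > 0.5
print(inside.sum())  # number of grid points classified as inside
```

Doubling `n` refines the extracted mesh without retraining anything; only the query cost grows.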

DeepSDF: Signed Distance Functions

Definition

DeepSDF

A neural signed distance function maps points to their signed distance from the surface:

$$f_\theta: \mathbb{R}^3 \to \mathbb{R}$$

where $f_\theta(\mathbf{x}) > 0$ outside the object, $f_\theta(\mathbf{x}) < 0$ inside, and $f_\theta(\mathbf{x}) = 0$ on the surface. The gradient $\nabla f_\theta$ gives the surface normal at any point.

DeepSDF has a geometric advantage over occupancy networks: the SDF value gives the distance to the nearest surface point, enabling efficient sphere tracing for rendering and providing a natural regularizer ($\|\nabla f\| = 1$ almost everywhere for a true SDF).
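Sphere tracing exploits exactly this property: the SDF value is a safe step size, because no surface can be closer than the signed distance. A minimal sketch, with a hypothetical analytic sphere SDF standing in for a trained network:

```python
import numpy as np

def sdf_sphere(p, radius=0.5):
    """Analytic SDF of a sphere at the origin, standing in for f_theta."""
    return np.linalg.norm(p) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4):
    """March along the ray, stepping by the SDF value each iteration."""
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:
            return t   # converged onto the surface
        t += d         # safe step: surface is at least d away
    return None        # ray missed the object

direction = np.array([0.0, 0.0, 1.0])
t_hit = sphere_trace(np.array([0.0, 0.0, -2.0]), direction, sdf_sphere)
print(t_hit)  # 1.5: distance from z = -2 to the sphere surface at z = -0.5
```

With an occupancy network the same march would need many small fixed-size steps or a bisection search, since a probability in $[0, 1]$ carries no distance information.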

Gaussian Splatting

3D Gaussian Splatting (2023) represents scenes as a collection of 3D Gaussian primitives, each with position, covariance, color, and opacity. Rendering projects these Gaussians onto the image plane and alpha-composites them.

This is an explicit representation (a finite set of primitives with explicit parameters) rather than an implicit one (a function evaluated at query points). The key advantages:

  1. Rendering speed: Rasterization of Gaussians is much faster than ray marching through a neural field. Real-time rendering at high resolution is possible.
  2. Optimization: Each Gaussian's parameters are optimized directly via gradient descent on the rendering loss. Adaptive densification adds Gaussians where the reconstruction error is high.

The tradeoff: Gaussian splatting requires storing millions of Gaussian parameters (memory-intensive), while NeRF compresses the scene into a compact MLP. NeRF generalizes better to unseen viewpoints; Gaussian splatting can have artifacts at extreme novel views.
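At the level of a single pixel, the compositing rule for depth-sorted splats is the same front-to-back alpha blending used in the discrete NeRF quadrature. A minimal sketch (the function name and toy splat values are illustrative; real splats get their per-pixel alpha from the projected 2D Gaussian):

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted splats at one pixel:
    C = sum_i alpha_i * c_i * prod_{j<i} (1 - alpha_j)."""
    out = np.zeros(3)
    remaining = 1.0                  # transmittance accumulated so far
    for c, a in zip(colors, alphas):
        out += remaining * a * np.asarray(c, dtype=np.float64)
        remaining *= 1.0 - a
        if remaining < 1e-4:         # early termination once effectively opaque
            break
    return out

# A mostly opaque red splat in front of a blue one: red dominates the pixel.
pixel = composite([[1, 0, 0], [0, 0, 1]], [0.9, 0.9])
print(pixel)
```

The speed advantage comes from evaluating this loop over a short sorted list of rasterized primitives per pixel, rather than running an MLP at every ray sample.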

Common Confusions

Watch Out

Neural fields are not neural networks that output meshes

A neural field is a function from coordinates to properties, evaluated pointwise. It does not output a mesh or point cloud directly. Extracting a mesh requires querying the field on a dense grid and running marching cubes (for occupancy/SDF) or rendering many views (for NeRF). The representation is continuous and implicit; the mesh is a derived output.

Watch Out

NeRF requires posed images, not just any photo collection

NeRF needs accurate camera intrinsics and extrinsics (position and orientation) for each training image. These are typically obtained from structure-from-motion (SfM) tools like COLMAP. Without accurate poses, NeRF cannot learn a consistent 3D scene. Recent work (Nerfacto, BARF) jointly optimizes poses and the neural field, but this remains harder than the fixed-pose setting.

Watch Out

Gaussian splatting is not a neural network

3D Gaussian Splatting uses gradient-based optimization but the scene representation is a set of Gaussians with explicit parameters, not a neural network. There are no learned weights, hidden layers, or activation functions. It is a differentiable rendering framework, not a neural field.

Key Takeaways

  • Neural fields represent 3D scenes as continuous functions parameterized by neural networks
  • NeRF maps (position, direction) to (color, density) and renders via volume integration
  • Volume rendering is differentiable, enabling training from 2D images alone
  • Occupancy networks use a binary classifier; DeepSDF uses signed distance
  • Gaussian splatting trades implicit compactness for explicit rendering speed
  • Positional encoding is critical for representing high-frequency detail in MLPs

Exercises

ExerciseCore

Problem

A NeRF samples 64 points along each ray, and the image is 800x800 pixels. How many forward passes through the MLP are needed to render one image? If each forward pass takes 10 microseconds, how long does rendering take?

ExerciseAdvanced

Problem

Explain why a standard MLP without positional encoding struggles to represent a scene with sharp edges and fine texture. What does the positional encoding $\gamma(p)$ specifically enable?

References

Canonical:

  • Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" (ECCV 2020)
  • Mescheder et al., "Occupancy Networks: Learning 3D Reconstruction in Function Space" (CVPR 2019)

Current:

  • Kerbl et al., "3D Gaussian Splatting for Real-Time Radiance Field Rendering" (SIGGRAPH 2023)
  • Park et al., "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation" (CVPR 2019)
  • Zhang et al., Dive into Deep Learning (2023), Chapters 14-17

Last reviewed: April 2026
