

Occupancy Networks and Neural Fields

Representing 3D geometry and appearance as continuous functions parameterized by neural networks: NeRF, occupancy networks, DeepSDF, volume rendering, and the connection to Gaussian splatting.


Why This Matters

Traditional 3D representations (meshes, voxel grids, point clouds) are discrete and fixed-resolution. Neural fields represent 3D geometry and appearance as continuous functions parameterized by neural networks. This allows querying the scene at arbitrary resolution and learning 3D structure directly from 2D images.

NeRF (Neural Radiance Fields) demonstrated that a simple MLP can represent complex scenes with photorealistic quality, trained only from posed photographs. This opened new directions in 3D reconstruction, view synthesis, and scene understanding.

Mental Model

A neural field is a function $f_\theta: \mathbb{R}^n \to \mathbb{R}^m$ where the input is a coordinate (position in space, or position plus viewing direction) and the output is a property at that coordinate (color, density, occupancy, signed distance). The network parameters $\theta$ encode the entire scene. Querying the function at a new coordinate gives you the scene property at that point.

Neural Radiance Fields (NeRF)

Definition

Neural Radiance Field

A NeRF represents a scene as a continuous function:

$$F_\theta: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$$

where $\mathbf{x} = (x, y, z)$ is 3D position, $\mathbf{d} = (\theta, \phi)$ is viewing direction, $\mathbf{c} = (r, g, b)$ is emitted color, and $\sigma \geq 0$ is volume density. The density $\sigma$ depends only on position (geometry is view-independent), while color depends on both position and direction (capturing view-dependent effects like specular highlights).

The network architecture is a simple MLP with positional encoding. The input coordinates $\mathbf{x}$ are mapped through sinusoidal functions at multiple frequencies before being fed to the network:

$$\gamma(p) = (\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p))$$

This positional encoding lets the MLP represent high-frequency spatial detail that it would otherwise smooth over (due to the spectral bias of MLPs toward low-frequency functions).
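The encoding is straightforward to implement. Below is a minimal NumPy sketch (the function name and default $L = 10$ are illustrative choices, not prescribed by the text):

```python
import numpy as np

def positional_encoding(p, num_freqs=10):
    """Map coordinates to sinusoids at frequencies 2^0 * pi ... 2^(L-1) * pi.

    p: array of coordinates (any shape); returns shape p.shape + (2 * num_freqs,).
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi   # 2^k * pi for k = 0..L-1
    angles = p[..., None] * freqs                   # broadcast over frequencies
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# A single coordinate becomes a 2L-dimensional feature vector.
features = positional_encoding(0.5, num_freqs=10)
print(features.shape)  # (20,)
```

Each input dimension is expanded into $2L$ features, so a 3D position with $L = 10$ becomes a 60-dimensional input to the MLP.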

Volume Rendering

Proposition

Volume Rendering for Neural Radiance Fields

Statement

The expected color $C(\mathbf{r})$ of a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \, \sigma(\mathbf{r}(t)) \, \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt$$

where $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds\right)$ is the accumulated transmittance from the near plane $t_n$ to point $t$. The product $T(t) \, \sigma(\mathbf{r}(t))$ gives the probability density that the ray terminates at $t$.

Intuition

A ray travels through space, accumulating color from each point weighted by two factors: how dense the material is at that point ($\sigma$) and how much light has already been blocked before reaching that point ($T$). Dense regions contribute more color. Regions behind opaque surfaces contribute nothing because $T$ is near zero.

Proof Sketch

Model light transport as a 1D absorption-emission process along the ray. The transmittance $T(t)$ satisfies $dT/dt = -\sigma(t)\,T(t)$, giving the exponential form. The color integral follows from summing the emitted radiance at each point, weighted by the probability of the ray reaching that point and being absorbed there.

Why It Matters

This equation is differentiable with respect to $\sigma$ and $\mathbf{c}$, which are outputs of the neural network. By comparing the rendered pixel color $C(\mathbf{r})$ to the observed pixel color in a training image, you can backpropagate through the volume rendering integral to train the NeRF. The only supervision needed is posed 2D images.

Failure Mode

The integral is approximated by quadrature (summing over discrete samples along the ray). Too few samples produce aliasing and miss thin structures. Too many samples are computationally expensive. Hierarchical sampling (coarse then fine) mitigates this but does not eliminate it. Training also requires accurate camera poses; errors in pose estimation produce blurry reconstructions.
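The coarse-to-fine step mentioned above is ordinary importance sampling: the coarse pass yields a weight per bin along the ray, and fine samples are drawn from the resulting piecewise-constant PDF by inverse-transform sampling. A minimal NumPy sketch (the function name and the toy bin weights are illustrative):

```python
import numpy as np

def importance_sample(bin_edges, weights, n_samples, rng):
    """Draw fine samples from the piecewise-constant PDF defined by coarse
    per-bin weights, via inverse-transform sampling over the CDF."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.random(n_samples)
    idx = np.searchsorted(cdf, u, side="right") - 1   # bin containing each u
    idx = np.clip(idx, 0, len(weights) - 1)
    # Place each sample uniformly within its selected bin
    frac = (u - cdf[idx]) / np.maximum(pdf[idx], 1e-10)
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 5)            # 4 coarse bins along the ray
weights = np.array([0.05, 0.05, 0.8, 0.1])  # coarse pass found a surface in bin 3
t_fine = importance_sample(edges, weights, 128, rng)
print(t_fine.min(), t_fine.max())
```

Most of the 128 fine samples land in the high-weight bin, concentrating network evaluations near the surface.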

In practice, the integral is approximated as:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \, (1 - \exp(-\sigma_i \delta_i)) \, \mathbf{c}_i$$

where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples and $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$.
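This quadrature sum is a few lines of code. A minimal NumPy sketch for a single ray (the function name and toy inputs are illustrative):

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discrete volume rendering:
    C_hat = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i.

    sigmas: (N,) densities, colors: (N, 3) RGB, deltas: (N,) sample spacings.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)   # per-segment opacity
    # T_i = exp(-sum_{j<i} sigma_j delta_j): transmittance before sample i
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

# One dense red sample behind empty space: the ray color is nearly pure red.
sigmas = np.array([0.0, 50.0])
colors = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
deltas = np.array([0.1, 0.1])
color = render_ray(sigmas, colors, deltas)
print(color)
```

The `weights` vector is the same quantity used for hierarchical sampling: it tells you where along the ray the scene content actually is.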

Occupancy Networks

Definition

Occupancy Network

An occupancy network represents a 3D surface as the decision boundary of a classifier:

$$f_\theta: \mathbb{R}^3 \to [0, 1]$$

where $f_\theta(\mathbf{x})$ is the probability that point $\mathbf{x}$ is inside the object. The surface is the level set $\{\mathbf{x} : f_\theta(\mathbf{x}) = 0.5\}$.

The surface can be extracted at any resolution using marching cubes on a grid of query points. Unlike voxel grids, the resolution is limited only by the density of the query grid, not by the representation itself.
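The extraction pipeline is: query the field on a grid of chosen resolution, then run marching cubes at the 0.5 level set. A minimal NumPy sketch, using a hypothetical analytic occupancy function (a soft sphere) in place of a trained network:

```python
import numpy as np

def occupancy_sphere(points, radius=0.5):
    """Toy stand-in for a trained occupancy network f_theta: probability
    that a point lies inside a sphere of the given radius."""
    d = np.linalg.norm(points, axis=-1)
    return 1.0 / (1.0 + np.exp(20.0 * (d - radius)))  # sigmoid across the surface

# Query the field on a dense grid; the resolution is chosen at extraction
# time, independent of the representation itself.
n = 32
axis = np.linspace(-1.0, 1.0, n)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
occ = occupancy_sphere(grid.reshape(-1, 3)).reshape(n, n, n)

# Marching cubes at level 0.5 (e.g. skimage.measure.marching_cubes) would
# extract a triangle mesh from this occupancy grid.
inside = occ > 0.5
print(inside.sum())  # number of grid points classified as inside
```

Doubling `n` refines the extracted mesh without retraining anything; only the query cost grows.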

DeepSDF: Signed Distance Functions

Definition

DeepSDF

A neural signed distance function maps points to their signed distance from the surface:

$$f_\theta: \mathbb{R}^3 \to \mathbb{R}$$

where $f_\theta(\mathbf{x}) > 0$ outside the object, $f_\theta(\mathbf{x}) < 0$ inside, and $f_\theta(\mathbf{x}) = 0$ on the surface. The gradient $\nabla f_\theta$ gives the surface normal at any point.

DeepSDF has a geometric advantage over occupancy networks: the SDF value gives the distance to the nearest surface point, enabling efficient sphere tracing for rendering and providing a natural regularizer ($\|\nabla f\| = 1$ almost everywhere for a true SDF).
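Sphere tracing exploits exactly this property: the SDF value is a safe step size, because no surface can be closer than the signed distance. A minimal sketch, with a hypothetical analytic sphere SDF standing in for a trained network:

```python
import numpy as np

def sdf_sphere(p, radius=0.5):
    """Analytic SDF of a sphere at the origin, standing in for f_theta."""
    return np.linalg.norm(p) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4):
    """March along the ray, stepping by the SDF value each iteration."""
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:
            return t   # converged onto the surface
        t += d         # safe step: surface is at least d away
    return None        # ray missed the object

direction = np.array([0.0, 0.0, 1.0])
t_hit = sphere_trace(np.array([0.0, 0.0, -2.0]), direction, sdf_sphere)
print(t_hit)  # 1.5: distance from z = -2 to the sphere surface at z = -0.5
```

With an occupancy network the same march would need many small fixed-size steps or a bisection search, since a probability in $[0, 1]$ carries no distance information.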

Gaussian Splatting

3D Gaussian Splatting (2023) represents scenes as a collection of 3D Gaussian primitives, each with position, covariance, color, and opacity. Rendering projects these Gaussians onto the image plane and alpha-composites them.

This is an explicit representation (a finite set of primitives with explicit parameters) rather than an implicit one (a function evaluated at query points). The key advantages:

  1. Rendering speed: Rasterization of Gaussians is much faster than ray marching through a neural field. Real-time rendering at high resolution is possible.
  2. Optimization: Each Gaussian's parameters are optimized directly via gradient descent on the rendering loss. Adaptive densification adds Gaussians where the reconstruction error is high.

The tradeoff: Gaussian splatting requires storing millions of Gaussian parameters (memory-intensive), while NeRF compresses the scene into a compact MLP. NeRF generalizes better to unseen viewpoints; Gaussian splatting can have artifacts at extreme novel views.
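At the level of a single pixel, the compositing rule for depth-sorted splats is the same front-to-back alpha blending used in the discrete NeRF quadrature. A minimal sketch (the function name and toy splat values are illustrative; real splats get their per-pixel alpha from the projected 2D Gaussian):

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted splats at one pixel:
    C = sum_i alpha_i * c_i * prod_{j<i} (1 - alpha_j)."""
    out = np.zeros(3)
    remaining = 1.0                  # transmittance accumulated so far
    for c, a in zip(colors, alphas):
        out += remaining * a * np.asarray(c, dtype=np.float64)
        remaining *= 1.0 - a
        if remaining < 1e-4:         # early termination once effectively opaque
            break
    return out

# A mostly opaque red splat in front of a blue one: red dominates the pixel.
pixel = composite([[1, 0, 0], [0, 0, 1]], [0.9, 0.9])
print(pixel)
```

The speed advantage comes from evaluating this loop over a short sorted list of rasterized primitives per pixel, rather than running an MLP at every ray sample.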

Common Confusions

Watch Out

Neural fields are not neural networks that output meshes

A neural field is a function from coordinates to properties, evaluated pointwise. It does not output a mesh or point cloud directly. Extracting a mesh requires querying the field on a dense grid and running marching cubes (for occupancy/SDF) or rendering many views (for NeRF). The representation is continuous and implicit; the mesh is a derived output.

Watch Out

NeRF requires posed images, not just any photo collection

NeRF needs accurate camera intrinsics and extrinsics (position and orientation) for each training image. These are typically obtained from structure-from-motion (SfM) tools like COLMAP. Without accurate poses, NeRF cannot learn a consistent 3D scene. Recent work (Nerfacto, BARF) jointly optimizes poses and the neural field, but this remains harder than the fixed-pose setting.

Watch Out

Gaussian splatting is not a neural network

3D Gaussian Splatting uses gradient-based optimization but the scene representation is a set of Gaussians with explicit parameters, not a neural network. There are no learned weights, hidden layers, or activation functions. It is a differentiable rendering framework, not a neural field.

Key Takeaways

  • Neural fields represent 3D scenes as continuous functions parameterized by neural networks
  • NeRF maps (position, direction) to (color, density) and renders via volume integration
  • Volume rendering is differentiable, enabling training from 2D images alone
  • Occupancy networks use a binary classifier; DeepSDF uses signed distance
  • Gaussian splatting trades implicit compactness for explicit rendering speed
  • Positional encoding is critical for representing high-frequency detail in MLPs

Exercises

ExerciseCore

Problem

A NeRF samples 64 points along each ray, and the image is 800x800 pixels. How many forward passes through the MLP are needed to render one image? If each forward pass takes 10 microseconds, how long does rendering take?

ExerciseAdvanced

Problem

Explain why a standard MLP without positional encoding struggles to represent a scene with sharp edges and fine texture. What does the positional encoding $\gamma(p)$ specifically enable?

References

Canonical:

  • Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" (ECCV 2020)
  • Mescheder et al., "Occupancy Networks: Learning 3D Reconstruction in Function Space" (CVPR 2019)

Current:

  • Kerbl et al., "3D Gaussian Splatting for Real-Time Radiance Field Rendering" (SIGGRAPH 2023)
  • Park et al., "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation" (CVPR 2019)
  • Zhang et al., Dive into Deep Learning (2023), Chapters 14-17

Last reviewed: April 2026
