Part 1

Fitting a Neural Field to a 2D Image

For this, we train a simple MLP to learn the RGB value of a given pixel. First, we implement positional encoding to take the pixel coordinates in the image into a higher-dimensional representation, by taking sin(2^i * pi * x) and cos(2^i * pi * x) for i = 0, ..., L-1, where we choose L = 10. We also normalize the coordinates to lie in [0, 1] before feeding them into the positional encoding. The main difference between my implementation and the suggested model is that I used skip layers to allow for deeper models, as well as a SiLU activation. The first test only used 2 hidden layers (so 1 input layer, 2 hidden layers, and 1 output layer).
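A minimal sketch of the encoding and the MLP in PyTorch; the hidden width of 256 and the exact placement of the skip connection are illustrative assumptions, not necessarily what I used:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def positional_encoding(x, L=10):
    # x: (N, D) coordinates normalized to [0, 1].
    # Returns sin(2^i * pi * x) and cos(2^i * pi * x) for i = 0, ..., L-1,
    # concatenated along the last dimension -> (N, 2*D*L).
    out = []
    for i in range(L):
        freq = (2.0 ** i) * math.pi
        out.append(torch.sin(freq * x))
        out.append(torch.cos(freq * x))
    return torch.cat(out, dim=-1)

class ImageField(nn.Module):
    # MLP from encoded pixel coordinates to RGB, with SiLU activations and a
    # skip connection that re-injects the encoded input partway through.
    def __init__(self, L=10, hidden=256, n_hidden=2, skip_at=1):
        super().__init__()
        self.L = L
        self.skip_at = skip_at
        in_dim = 2 * 2 * L                      # 2 coords, sin + cos, L frequencies
        self.layers = nn.ModuleList()
        for i in range(n_hidden):
            d_in = in_dim if i == 0 else hidden
            if i == skip_at and i > 0:
                d_in += in_dim                  # skip connection re-adds the encoding
            self.layers.append(nn.Linear(d_in, hidden))
        self.out = nn.Linear(hidden, 3)

    def forward(self, coords):
        enc = positional_encoding(coords, L=self.L)
        h = enc
        for i, layer in enumerate(self.layers):
            if i == self.skip_at and i > 0:
                h = torch.cat([h, enc], dim=-1)
            h = F.silu(layer(h))
        return torch.sigmoid(self.out(h))       # RGB in [0, 1]
```

The same class covers the deeper variants below by raising n_hidden or L.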

The training process (including the loss/PSNR graphs) for the baseline. For this and the following examples, the images shown are at epochs 0, 1, 10, 20, 50, and 100.

From there, I tested adding more hidden layers, as well as increasing the number of encoding frequencies L from 10 to 20.

The training process with 4 hidden layers instead of 2; it performs slightly better.
The training process with L=20 instead of L=10; it performs a little better, even though it drops off at the end.

I also did this for one of my own images (an owl). For this, I combined increasing L with increasing the number of hidden layers, since both seemed to improve performance.

The training process (including the loss/PSNR graphs) on the owl.

Part 2

Sampling Rays

Before training a model to fit the neural radiance field, we have to set up the process to handle the images. This involves transforming between camera space and world space, and we also need to handle the intrinsic matrix to convert between camera coordinates and pixel coordinates.
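A sketch of these conversions in PyTorch, assuming K maps camera coordinates directly to pixel coordinates (no half-pixel offset) and c2w is the 4x4 camera-to-world matrix; the exact conventions may differ from my code:

```python
import torch

def pixel_to_camera(K, uv, depth):
    # Lift pixel coordinates (u, v) to 3D camera-space points by inverting the
    # intrinsic matrix K and scaling by depth.
    # K: (3, 3), uv: (N, 2) float pixel coordinates, depth: (N, 1).
    uv_h = torch.cat([uv, torch.ones_like(uv[:, :1])], dim=-1)   # homogeneous (u, v, 1)
    return (torch.linalg.inv(K) @ uv_h.T).T * depth              # (N, 3) camera-space points

def camera_to_world(c2w, x_c):
    # Apply the 4x4 camera-to-world transform to camera-space points.
    # c2w: (4, 4), x_c: (N, 3).
    x_h = torch.cat([x_c, torch.ones_like(x_c[:, :1])], dim=-1)  # homogeneous points
    return (c2w @ x_h.T).T[:, :3]                                # (N, 3) world-space points
```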

From there, we want to get the rays (an origin and a direction), which we can calculate from the camera intrinsics, the camera-to-world transformation matrix, and the pixel coordinates. We also set up a dataset to sample these rays at random. We also implement sampling points along the rays (roughly) evenly: each point is origin + direction * t, where t takes 64 values spaced between a near and a far parameter. We add a little bit of noise on top of the t values in order to prevent overfitting. This gives us the following results when we visualize the setup using viser (a sketch of the ray generation and point sampling appears after the visualizations below).

The first shows what a random sample of rays looks like (100 rays). The second shows rays sampled at random from a single image (also 100 rays).
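A sketch of the ray generation and point sampling, reusing the helpers above; the near/far defaults of 2.0/6.0 are example values, not necessarily the ones I used:

```python
import torch

def pixel_to_ray(K, c2w, uv):
    # Ray origin is the camera center (the translation part of c2w); the
    # direction points from the origin toward the pixel unprojected at depth 1,
    # normalized to unit length.
    N = uv.shape[0]
    origin = c2w[:3, 3].expand(N, 3)
    x_w = camera_to_world(c2w, pixel_to_camera(K, uv, torch.ones(N, 1)))
    direction = x_w - origin
    return origin, direction / torch.linalg.norm(direction, dim=-1, keepdim=True)

def sample_points_along_rays(origin, direction, near=2.0, far=6.0,
                             n_samples=64, perturb=True):
    # 64 (roughly) evenly spaced t values per ray between near and far, with a
    # small random offset added during training to prevent overfitting.
    t = torch.linspace(near, far, n_samples).expand(origin.shape[0], n_samples)
    if perturb:
        t = t + torch.rand_like(t) * (far - near) / n_samples
    points = origin[:, None, :] + direction[:, None, :] * t[..., None]   # (N, S, 3)
    return points, t
```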

From there, we can start training an MLP to fit the neural radiance field. For our model, we use 12 hidden layers and, similar to Part 1, SiLU activations and skip connections to allow for this deeper model. We also positionally encode the ray direction and the point coordinates. The model returns a density (sigma) value and an RGB value. We then use these, together with a discretized approximation of the volume rendering equation, to produce a pixel color, which we compare against the ground truth with an MSE loss to train the model. We train for 50 epochs of 125 minibatches of size 8000 (so around 5x longer than the staff solution). This took around 30 minutes to train.
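A sketch of the discretized volume rendering step, assuming the model outputs per-sample rgbs and sigmas and the ts come from the sampling sketch above; the large final delta and the small epsilon are standard numerical choices and may differ from my exact code:

```python
import torch

def volume_render(rgbs, sigmas, ts):
    # Discretized volume rendering: sample i contributes its color with weight
    # w_i = T_i * (1 - exp(-sigma_i * delta_i)), where T_i is the transmittance,
    # i.e. the probability the ray has not terminated before sample i.
    # rgbs: (N, S, 3), sigmas: (N, S), ts: (N, S) sample depths along each ray.
    deltas = ts[:, 1:] - ts[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)
    alphas = 1.0 - torch.exp(-sigmas * deltas)                               # (N, S)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alphas * trans                                                 # (N, S)
    return (weights[..., None] * rgbs).sum(dim=1), weights                   # (N, 3) colors, weights
```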

To reconstruct a full image given a camera-to-world transformation matrix, we sample rays over all (u, v) pixel coordinates of the image, compute each pixel's color, and assemble the full image from these colors, giving us the example renders below.
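A sketch of this full-image render, reusing the sketches above; the model signature (points, directions) -> (rgbs, sigmas) and the chunk size are assumptions for illustration:

```python
import torch

@torch.no_grad()
def render_image(model, K, c2w, H, W, chunk=8000):
    # Render a full H x W image: one ray per pixel, processed in chunks so the
    # per-ray samples fit in memory.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    uv = torch.stack([u.reshape(-1), v.reshape(-1)], dim=-1).float()      # (H*W, 2)
    colors = []
    for i in range(0, uv.shape[0], chunk):
        origin, direction = pixel_to_ray(K, c2w, uv[i:i + chunk])
        points, ts = sample_points_along_rays(origin, direction, perturb=False)
        rgbs, sigmas = model(points, direction)
        color, _ = volume_render(rgbs, sigmas, ts)
        colors.append(color)
    return torch.cat(colors, dim=0).reshape(H, W, 3)
```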

I also implemented background rendering. This is done by adding a final background color multiplied by the probability of the ray not terminating (which means we can see the background). I simply used white, as it stands out a lot. (B&W)
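A small sketch of the background compositing, using the weights returned by the volume-rendering sketch above (the function name is mine):

```python
import torch

def composite_background(color, weights, bg_color):
    # color: (N, 3) volume-rendered colors; weights: (N, S) per-sample weights;
    # bg_color: (3,) background color, e.g. torch.ones(3) for white.
    # 1 - sum(weights) is the probability the ray never terminated, i.e. the
    # background is visible, so the background color is added with that weight.
    return color + (1.0 - weights.sum(dim=-1, keepdim=True)) * bg_color
```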

The reconstructed images at epochs 0, 1, 2, 5, 10, 20, and 50.
The MSE loss and PSNR graphs on the validation data. By training for longer, we get a PSNR of around 26.
A generated video using the transformation matrices from the test examples we were given. The first is at 10 epochs (5 min), the second is at 50 epochs (30 min), and the third uses a different background.