Sunday, 23 November 2014

Geometry Memory Efficiency and Quantisation

I ended up spending a fair amount of time on the Katana integration in the end, at least in terms of pulling in geometry attributes and static transforms and exposing materials, so that I could pull production scene geometry into Imagine and see how well it coped against the latest versions of PRMan and Arnold. Initially, before I started work on memory efficiency back in June, it was embarrassingly bad, with Imagine using around 9-11 times as much memory, and Imagine only being able to fit around 52 million unique triangles in 24 GB of RAM. In this original state, Imagine was using gathered vertex attributes, so there was a lot of duplication of data.

The first thing I did was to add the ability to use indexed vertex attributes, and this way the source attributes (points, normals, UVs, etc) could be shared among any triangle / vertex in the mesh, with "just" indices being used to index into them. I put "just" in quotes, because vertex indices can also require a lot of memory, especially if the indices for each attribute are different (i.e. points, normals and UVs all have different indices per vertex).

This change reduced memory usage by around 3 times, but it was still not great. I added logic to detect if certain attributes used the same indices (i.e. normals and UVs), and this allowed me to re-use/share indices for multiple attributes if the indices allowed that. The next thing I experimented with was de-duplication of vertex attributes. In a normal closed mesh consisting of quads, the majority of points, normals and UVs are shared by multiple faces, and annoyingly, DCC tools like Maya don't seem to put much effort into de-duplicating these attributes (other than points) on export (possibly due to the fact that some renderers like PRMan don't support specifying indexed attributes explicitly at ingestion stage), so UVs and smooth normals values are generally specified up to 2/3 times. Doing this de-duplication (of normals and UVs) added to the startup/build time cost (as well as the peak memory usage), but did reduce overall memory usage quite significantly - sometimes by as much as 40%. However, it came at an additional cost, in that the indices for each vertex attribute would now be totally different and couldn't be shared, so while the memory usage for the raw vertex attributes themselves was now less, a bigger percentage of the memory usage was now being used by the indices themselves.

Changing the infrastructure to allow indices to be specified as uchar (1 byte), ushort (2 bytes) or uint (4 bytes) depending on the number of attributes of a particular type there were as well as allowing sharing of indices brought more flexibility and efficiency for smaller meshes (but at the unfortunate cost of some rather nasty template type logic in the code), however memory usage was still around 3.5-4.5x more than Arnold and PRMan for the same geometry, and peak memory usage was even higher during build time for de-duplicating normals and UVs.

I had been using 48-byte triangles (on top of the vertex data), based on the fast Shevtsov, Soupikov, Kapustin intersection algorithm which caches several things like base points, edge lengths and overall normal for each triangle, without having to do a lookup into the mesh vertex attributes themselves to do the intersection test - however this was using far too much memory (especially for deformation motion blur), so I added the ability to use several different possible triangle types dynamically per-mesh at build time, with the minimal Moeller-Trumbore algorithm being a new type which only needed 2-4 bytes (depending on number of indices items) of additional storage per triangle for the index into the indices for the triangle (I eventually got this down to 0 bytes extra by passing this index through from the acceleration structure).

This left Imagine using around 1.8-2.5x more memory than the latest versions of the other renderers, which while a significant improvement compared to the starting point, still left room for more. Due to the fact that there seemed to be a balance of memory used for raw geometry attributes vs the memory used for the indices into the attributes, with de-duplicating the attributes requiring a fair bit more memory for the indices themselves, the memory for both needed to be reduced at the same time.

I first investigated bit-packing the indices, but while doing this is fairly trivial when pre-computing / processing them up front (using delta encoding or high-watermark encoding), when completely random access is needed this becomes a lot more difficult. I experimented with hiword/loword encoding of delta differences between the indices for progressive triangles, and storing the indices for two triangles in one set of indices, but due to the variable nature of indices between adjacent values, it was difficult to provide random access without using some sort of keyframe-style encoding which got rather complicated. I realised that when vertex attributes like normals and UVs are fully specified - i.e. duplicates are given for each face, assuming that the attributes are listed in the order the polygon vertices are specified (in other words they aren't indexed), then it's possible to only store the base index for that triangle - the other two indices will either be the next sequential numbers (for triangles and the first triangle of a quad), or sequential after a gap of two shifted from the base vertex, with the final vertex being the base one for the second triangle of quads. I stole a bit from the item index to store the triangle type, and this allowed these indices to be worked out by only storing one value for all three indices for a particular vertex attribute. A further optimisation was realising that in certain cases (the mesh consisting fully of triangles or quads, or mostly quads with a very limited number of triangles) you don't even need to store this base index fully - you can store the offset value after dividing the triangle index by a constant (which needs to be worked out before hand based on the type of mesh), meaning an 8-bit signed char integer can in certain (limited, but fairly common in general use) cases be used to store indices into the millions due to the indices of a polygon being consecutive due to the fact that shared attributes between faces weren't being de-duplicated, so this base index is explicitly calculable on-the-fly.
These optimisations brought down indices memory usage significantly in the majority of situations, with no noticeable runtime penalty, but they relied on the fact that vertex attributes weren't being de-duplicated.

So I then turned to trying to quantise non-point vertex attributes (point attributes need to be stored at full float precision due to FP precision requirements for the vast majority of general rendering situations, at least at VFX scale). It turns out there's been a lot of research in this area, especially for normals, but a lot of it has been done with game engines / real time in mind. I first tried a naive compact representation of normals using spherical coordinates stored at half precision - using a total of 4 bytes instead of 12 bytes. While this brought down memory usage significantly, the accuracy was pretty bad and lossy, with axis-aligned directions not being fully reconstructable (for example 0.0,1.0,0.0 ends up being reconstructed as 0.000484,0.999999,0.000484) which is enough to cause shading issues.

Reading through the research on this subject (starting with Deering's work in 1995 at Sun) did show that using full float (96 bits) representations was wasteful for normals: that's accurate enough to shoot rays from the Moon to Mars with centimetre-accuracy on the surface of Mars (I'm assuming the relative positions of the planets makes a huge difference to this comparison!). In general (according to Meyer, Submuth, Subner, et el. in 2010), only 51 bits of floating point accuracy are required.
I tried a few non FP implementations of bit-packing at both 16-bit and 32-bit (because normals are unit length, you can just store two of the values and reconstruct the third): 16-bit is too lossy, but does allow storing axis-aligned directions losslessly, so might be useful for low LOD type situations where smooth normals were still required for some reason. 32-bit seems to work well, although you have to be careful about the distribution of the directions. It is obviously lossy, so comparing a normals AOV between full 96-bit precision and 32-bit packed does show differences, but in fairly comprehensive comparisons of beauty and other light AOV side-by-side renders, and worst-case test scenarios (extremely heavily subdivided sphere with heavy specular highlights, and the same sphere being perfectly reflective and reflecting a high-res checkerboard environment map light rendered at 4k square) and I've only spotted barely-perceptible minor differences in this latter test case. There's a slight (just under 1%) overhead to rendering with 32-bit packed normals, due to three multiplications, one divide and a square-root being required to convert back to a full-precision normal. When using 16-bit packed normals an 8192 item LUT table can be used for the lookup, with no overhead at all. Using a LUT with 32-bit packed normals is unfeasible, as the LUT would need to contain 536M items, which is obviously ridiculous. For the moment I'm happy with this normal encoding, but there are more advanced and variable (in terms of size and precision) encoding methods with greater accuracy I could look into if I find accuracy isn't good enough in the future. This change reduced normal memory storage down to a third.

Compressing UVs is more difficult, depending on what UV values you want to accept: using half format for each U,V value is acceptable in the range -2.0f - 2.0f, but outside of that, for any texture atlas usage, it's too lossy, so for standard UDIM ranges (0.0f - 10.0f) just doesn't work well with obvious stretching and differences, as the half precision just isn't good enough at the bigger ranges.
In the end, I settled for a slightly hacky, but still generally very usable solution of compressing U,V values into 16 bits each - with a supported range of -10.0f - 10.0f - by just storing a scaled integer value of each value. I could double this accuracy by not allowing a full mirroring of the values below 0.0, but given as MPC's v values are 1.0f - v, to render production assets I need this ability at least for the v value so I'd need to offset them, but that's easy enough, and I so far have only noticed very minor artefacts from using this compression method when comparing against full-float UV representation with hero assets with high res (8k tiles, 40+ UDIMs) textures rendered at 4k, so I'm happy with this for the moment. The error losses for each value are currently approximately 0.0002. However, it might be worth investigating a possible modification to ILM's half format which would have less range (say, -32.0 - 32.0) but more accuracy which might work better as a more generic solution for UVs, as long as the accuracy is there in the core -10.0 - 10.0 range. So this change reduced UV storage by half.
The infrastructure I've implemented for the quantisation allows great flexibility, so per mesh I can decide whether to store at full precision or any quantised combination supported by different attribute types.

I also made further memory reduction changes to Imagine's acceleration structures - instead of storing arrays of pointers to the objects/triangles in the acceleration structure, I'm now storing the actual objects themselves, which saves quite a bit of memory - 8 bytes per triangle / object, which brought total acceleration structure memory usage down by about 30%.

Together, all of these changes have now allowed Imagine to generally use less geometry memory than Arnold 4.2 and PRMan 19 RIS depending on how the geometry is specified. If it is specified as pre-indexed attributes, Image doesn't do as well, but testing many large-scale production scenes with from 100-300M unique triangles, Imagine is in a much better place memory-wise than it was six months ago, and is definitely very competitive geometry memory-wise and speed-wise against Arnold 4.2 and PRMan 19: Imagine is now able to comfortably render over 500M unique triangles in 24GB of RAM with UVs and per-vertex normals.

I think there's still more work to be done though on memory usage in general, as Arnold's claimed acceleration structure sizes in its stats are about 25% lower than Imagine's (meaning Imagine doesn't always win at peak memory usage), and PRMan 19's claimed acceleration structure memory usage is even more impressive: sometimes a fifth of Imagine's. Assuming these numbers are correct (I trust PRMan's a lot more than Arnold's as Arnold's unaccounted memory usage is around 40% of total usage on average, so it's possible its acceleration structure usage just isn't being tracked fully), I've still got some work to do on this front: quantising the bboxes for each node and storing references to objects and other leaves more efficiently with relative indices instead of absolute indices allowing each index value to be stored in less bytes.