Updated Schedule
- Nov 15 - Nov 21: (Completed) Get started on the CPU path tracer codebase. Start to implement the megakernel CUDA version.
- Nov 21 - Nov 27: (Completed) Complete the megakernel version. Start implementing the wavefront version.
- Nov 28 - Dec 3 (Ongoing, milestone due): Complete the wavefront version. Explore different pipelines, optimizations, and design choices.
- Dec 4 - Dec 7: Add wavefront design (pipelining) to BVH traversal, switch to "wider" BVH, and parallel ray-triangle intersection test.
- Dec 8 - Dec 11: Add wavefront design to ray generation and material/shading evaluation.
- Dec 12 - Dec 15: Test and profile different implementations on different scenes. Analyze the workload and performance. Prepare the final report and poster session.
Work Done So Far
We have completed a CUDA port of an existing CPU path tracer using the naive megakernel approach, i.e., one CUDA thread is responsible for the whole procedure of ray generation, intersection testing, shading evaluation, ray extension, and image updating. All scene data (geometry, acceleration structures, lights, materials, cameras) is stored in GPU global memory.
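As a rough illustration of the megakernel structure, the sketch below shows one thread tracing a full path for one pixel. This is a simplified sketch, not our exact code: `generateCameraRay`, `intersectScene`, `shadeLights`, `sampleBSDF`, and the `Scene`/`Ray`/`Hit`/`RNG` types are illustrative placeholders.

```cuda
// Naive megakernel: each thread owns one pixel and runs the entire path
// tracing loop. All helper functions/types are illustrative placeholders.
__global__ void megakernelPathTrace(Scene scene, float4 *film,
                                    int width, int height, unsigned int seed) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    RNG rng(seed, y * width + x);
    float3 radiance   = make_float3(0.f, 0.f, 0.f);
    float3 throughput = make_float3(1.f, 1.f, 1.f);

    Ray ray = generateCameraRay(scene.camera, x, y, rng);   // ray generation
    for (int bounce = 0; bounce < MAX_BOUNCES; ++bounce) {
        Hit hit;
        if (!intersectScene(scene.bvh, ray, hit))           // BVH traversal
            break;                                          // ray escaped
        radiance += throughput * shadeLights(scene, hit, ray, rng); // shading
        if (!sampleBSDF(scene, hit, ray, throughput, rng))  // ray extension
            break;                                          // path terminated
    }
    film[y * width + x] += make_float4(radiance, 1.f);      // image update
}
```

Because every thread runs this entire loop, threads in a warp whose paths terminate early or take different material branches stall the others, which is exactly the divergence the wavefront design targets.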
We also tried directly reusing the CPU version of the BVH (i.e., a binary tree with large tree height and small leaf nodes) in CUDA for ray-instance intersection, with a sequential ray-triangle intersection test in the leaf nodes. We measured preliminary performance and compared it to the CPU version and the OptiX hardware-accelerated version.
Goals and Deliverables
We are slightly behind our previous schedule, but the next steps are clear and overall progress is smooth. Hence, we remain confident that we can achieve the following goals and present the listed deliverables.
As for the "nice to have" goals in case we have extra time, we might not be able to try a SIMD implementation on CPU or implement a heterogeneous version using both CPU and GPU, but we may still try to accelerate BVH construction on the GPU or implement BVH refitting for animated scenes. We may also try to incorporate the hardware-accelerated ray-triangle intersection test into our pipeline.
Preliminary Results
We ran our current megakernel version of the path tracer on several different scenes and compared its performance to the CPU version and to an implementation using OptiX (and thus NVIDIA's RT cores) for hardware-accelerated intersection tests. Note that the OptiX version uses the same megakernel design.
[Table: average rendering time per 1 spp for the CPU, CUDA megakernel, and OptiX implementations on each test scene]
We use scenes of different geometry complexity and shading complexity (variety of materials and lighting). The dragon scene contains a single object with a simple material, whereas the landscape contains 23,241 plant models and 3.1B triangles. The other two scenes sit in between and feature different sets of materials.
As these are only preliminary results meant to identify performance bottlenecks and optimization directions, we simply time the total rendering time for 1 sample per pixel (spp) of a 1280x720 image, averaged over 64 samples.
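For reference, this kind of per-sample timing can be done with CUDA events around each launch. The sketch below is illustrative: the kernel name and its arguments are placeholders for our actual implementation.

```cuda
// Time one 1-spp pass per iteration and average over 64 samples.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

float totalMs = 0.f;
for (int s = 0; s < 64; ++s) {
    cudaEventRecord(start);
    megakernelPathTrace<<<grid, block>>>(scene, film, 1280, 720, /*seed=*/s);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait for the kernel to finish
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    totalMs += ms;
}
printf("avg render time per 1 spp: %.3f ms\n", totalMs / 64.f);
```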
From the table, we can see that even with the naive megakernel design, the CUDA implementation still achieves a 2x-5x speedup over the CPU implementation. However, considering that path tracing is massively parallelizable, this speedup is clearly unsatisfactory, and the main reason is the extremely high divergence incurred during BVH traversal, ray extension, and shading evaluation.
Comparing the performance of the CUDA and OptiX implementations, we found that the performance of the ray-scene intersection test is crucial to overall rendering performance, especially in scenes with highly complex geometry. Therefore, the first optimization we plan to implement for our wavefront version is to make BVH traversal a separate pipeline stage. We will also make the BVH "wider" by increasing the fan-out and decreasing the tree height, which reduces costly pointer chasing and allows us to parallelize the ray-box/ray-triangle intersection tests over all threads within a group. We do not expect this to match the speedup of the OptiX version, but it should be a significant first optimization.
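To make the wider-BVH idea concrete, the sketch below shows one possible 8-wide node layout and a per-lane child-box test. This is an assumption about our eventual design, not the current implementation; the struct layout and field names are illustrative.

```cuda
// Sketch of an 8-wide BVH node in SoA layout so a group of 8 threads can
// load the child bounds with coalesced accesses and test them in parallel.
struct BVHNode8 {
    float minX[8], minY[8], minZ[8];   // child AABB minima, axis-by-axis
    float maxX[8], maxY[8], maxZ[8];   // child AABB maxima
    int   child[8];   // >= 0: internal node index; < 0: ~child = first primitive
    int   count[8];   // primitive count if leaf, 0 otherwise
};

// Each of the 8 lanes tests one child slab, replacing the sequential
// two-way descent of the binary BVH with one parallel 8-way test.
__device__ bool intersectChild(const BVHNode8 &node, const Ray &ray,
                               int lane, float &tNear) {
    float t0x = (node.minX[lane] - ray.ox) * ray.invDx;
    float t1x = (node.maxX[lane] - ray.ox) * ray.invDx;
    float t0y = (node.minY[lane] - ray.oy) * ray.invDy;
    float t1y = (node.maxY[lane] - ray.oy) * ray.invDy;
    float t0z = (node.minZ[lane] - ray.oz) * ray.invDz;
    float t1z = (node.maxZ[lane] - ray.oz) * ray.invDz;
    float tmin = fmaxf(fmaxf(fminf(t0x, t1x), fminf(t0y, t1y)), fminf(t0z, t1z));
    float tmax = fminf(fminf(fmaxf(t0x, t1x), fmaxf(t0y, t1y)), fmaxf(t0z, t1z));
    tNear = tmin;
    return tmax >= fmaxf(tmin, 0.f) && tmin < ray.tMax;
}
```

The surviving children can then be sorted by `tNear` (e.g., with warp shuffles) to preserve front-to-back traversal order.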
Summary
We are going to implement a parallel path tracer using a specific optimization technique named wavefront path tracing in CUDA. We will compare its performance to simpler parallel implementation using megakernels on CPU and GPU, and analyze how ray path and shader divergence affect parallelization performance.
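Conceptually, wavefront path tracing as in [1] replaces the single big kernel with small per-stage kernels that communicate through global work queues, so that each launch performs uniform work. A minimal host-side sketch of one bounce iteration (kernel names, queue types, and the `compactQueue` helper are illustrative placeholders, not our actual API):

```cuda
// One bounce of the wavefront loop (illustrative sketch).
// Paths live in global queues between stages instead of thread-local state.
while (numActivePaths > 0) {
    // Stage 1: all threads do BVH traversal only.
    intersectKernel<<<blocks(numActivePaths), 256>>>(scene.bvh, rayQueue, hitQueue);
    // Stage 2: all threads do material/light evaluation only,
    // pushing extension rays for paths that survive.
    shadeKernel<<<blocks(numActivePaths), 256>>>(scene, hitQueue, extendQueue, film);
    // Compact surviving paths: extendQueue becomes the next rayQueue.
    numActivePaths = compactQueue(extendQueue, rayQueue);
}
```

Terminated paths are dropped at the compaction step, which keeps every launch densely packed with active work and avoids warps idling on finished paths.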
Background
Challenges
Execution Divergence
Memory Access Locality
Space and Communication Overhead
Synchronization and Orchestration Overhead
Kernel and Pipeline Organization
Resources
Goals and Deliverables
Goals Plan to Achieve
- Implement two path tracers in CUDA using wavefront design and megakernel design, respectively.
- Compare their performance and analyze how our wavefront design alleviates execution divergence, and how that affects the final performance. We target a speedup of over 1.3x in simple scenes and over 2x in more complex scenes, based on the results in [1].
- Test under multiple scenes with different configurations to analyze when wavefront path tracing benefits the most.
- Experiment with some basic ray sorting and work queueing methods, and analyze their performance and tradeoffs.
Goals Hope to Achieve
- Explore more complex design choices like streaming, threadpool, etc.
- Also implement wavefront path tracing using SIMD on CPU and compare its performance to the megakernel version with and without SIMD.
- Accelerate BVH construction and/or refitting using GPU.
- Extend the path tracer to utilize heterogeneous architectures, e.g., offloading certain highly divergent rays to the CPU, or using the GPU's RT cores for hardware acceleration.
Deliverables
- We may present an interactive demo showing the speedup of our design and how it varies with the scene, if our final implementation achieves interactive framerates.
- Speedup plots and other profiling metrics (warp utilization, cache hit rate, memory usage, etc.) of our wavefront design compared to the megakernel design on GPU, under different scene configurations.
- Performance comparison and analysis of different design choices mentioned in the challenges section.
Platform Choice
Although the wavefront design also benefits SIMD execution on CPU, the degree of parallelism there is much lower than on GPU, so the divergence issue is less pronounced. Also, implementing ray casting and shading with SIMD intrinsics or ISPC might be more complex than implementing them in CUDA.
Overall, even though normal path tracing has high divergence and does not match the data-parallel programming model well, it is naturally parallelizable and can still exploit the massive parallelism of the GPU. In practice, a good wavefront implementation of path tracing on GPU is also much faster than a CPU version using OpenMP, so it is interesting to see how we can adapt the workload to better suit the CUDA programming model.
Schedule
- Nov 15 - Nov 21: Get started on the CPU path tracer codebase. Start to implement the megakernel CUDA version.
- Nov 21 - Nov 27: Complete the megakernel version. Start implementing the wavefront version.
- Nov 28 - Dec 3 (milestone due): Complete the wavefront version. Explore different pipelines, optimizations, and design choices.
- Dec 4 - Dec 10: Test and profile different implementations on different scenes. Analyze the workload and performance.
- Dec 11 - Dec 15: Prepare the final report and poster session.
References
[1] S. Laine, T. Karras, and T. Aila, “Megakernels considered harmful: Wavefront path tracing on
GPUs”, Proceedings of the 5th High-Performance Graphics Conference, pp. 137–143, 2013.
[2] A. Keller et al., “The iray light transport simulation and rendering system”, ACM SIGGRAPH 2017
Talks, pp. 1–2, 2017.
The website template was borrowed from Michaël Gharbi and Ref-NeRF.