The application of openMP and CUDA in path-tracing




We are going to implement an optimized Scotty3D rendering program with OMP and CUDA with NVIDIA GPUs. Our plan is to start with a render function that uses static grid and thread pool with a sequential hit function where we simply traverse all primitive. We will use openMP with BVH tree to speed up the tracing function and use CUDA to accelerate the entire camera film rendering process.


On a spectrum of rendering algorithms, rasterization is still the mainstream rendering strategy for computer graphics considering its computation cost and realistic visuals. Moreover, it has highly optimized graphics cards to support the specific rendering pipeline. However, rasterization is a highly-specialized solution for limited visual scenarios and doesn’t perform well in complex scenes involving a variety of optical effects. This is where ray tracing comes to play. Due to its natural logic that shooting the light rays around space, it has higher fidelity at the expense of a higher computation load. Even after Nvidia launched graphics cards with hardware ray tracing support in 2018, the performance penalty is still non-negligible for real-time rendering. Thus, to adopt this technique widely in modern rendering cases, developers have a long way to go and need to further optimize the performance based on current hardware, especially considering the growing scene complexity and more rigorous latency requirement from end-users. When more computation powers and specialized hardware become possible, a practical parallel acceleration implementation is imperative to scale the problem efficiently. Currently, lots of innovative research have been made in this area, and we may want to investigate some ideas to implement our version of parallel ray tracing.


1.Our first step is to use OMP to speed up the tracing function. Like what we did in Assignment 3, we cannot simply throw everything to OMP. Instead, we need to implement the Bvh tree and then integrate it with the OMP task.

2. Then we need to replace the static thread pool assignment with CUDA. The first challenge is that graphical data structures are complicated, and if we are dealing with complex scenes, it is hard to fit all graphical information into GPU memory. So we need to come up with an optimized way to load and utilize them without a lot of movement between RAM and GPU memory, which hurts the performance.

3. Also, because of the characteristic of CUDA program, each cuda thread will execute all lines in the code space. In other words, if there are a lot of branching (if statement), all threads will execute the command even if they are not performing the computation. The base version of the starter code has a huge amount of branchings, which is a source for overheads.


For machines, we will use the GHC machines containing NVIDIA GeForce RTX 2080 B GPU as resources to utilize GPUs. For starter code, we will use Scotty3D (https://github.com/CMU-Graphics/Scotty3D), a 3D modeling, rendering, and animation package that students complete as part of 15-462/662 Computer Graphics. This code base uses conventional thread pool to render the scene and naive iteration to perform ray tracing.


Plan to achieve:

Implement the BVH tree and the openMP version of trace function. We expect to get a 6x - 7.5x speed up over the sequential version with a 8-core machine. The ideal speed up would be 8x but we should take the tree-building process and other resource overheads into consideration.

Implement the CUDA version of path-tracer. We cannot predict a precise speed up because we cannot estimate the extent to which we could implement. However, our expectation is that this version should be faster than the openMP version.

Hope to achieve:

Come up ways to optimize the CUDA version by taking out branches in the code. his should further speed up the program..


A presentable demo of 3D graph to render during the poster session with speed up graphs comparing OpenMP and CUDA


For tools and framework, we use OpenMP and CUDA. OpenMP is good with its task management ability to deal with tree structure recursive like bvh tree. CUDA ,with its huge number of threads and warps resembling the characteristic of pixels, is designed to work with graphical problems in the first place.

For language, we choose C++ because it has existing libraries supporting the above parallel platforms.


  1. 11.7–11.13: Implement the sequential version of path-trace function
  2. 11.14-11.20: Implement the bvh tree and OpenMP version. Start to think about how to implement in CUDA
  3. 11.21-11.27: Implement the CUDA version in a very basic way.
  4. 11.28-12.4: Optimize the CUDA version by integrating the BVH while not hurting the performance with recursion. If we have time, optimize the CUDA version with memories and branching code.
  5. 12.5-12.9: Finalize the program and report. Be prepared for the demo and presentation.