Alright, so this image is from a project I did for a machine learning course. (The image is essentially just YOLOv11 output.) We only had a few weeks, so I opted to use existing models. Essentially, I was converting an image into a 3D point cloud for robotics. Say you have a robot arm and you want to pick something up: you need to know fairly precisely where the arm is in space, as well as where the object you're picking up is. So I used YOLOv11 for object segmentation and DepthAnythingV2 for depth estimation. From a regular image, YOLO gives you the object class detections as well as per-object masks. DepthAnythingV2 gives you the depth of the scene, and combining that depth with an object mask gives you that object's points in 3D. Take the min/max along each coordinate axis and you get a 3D axis-aligned bounding box (AABB).
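Here's a rough sketch of how the mask + depth combination turns into an AABB, assuming you already have a metric depth map (meters) and a boolean object mask as numpy arrays. The intrinsics fx, fy, cx, cy are placeholder values for illustration, not the ones I actually used:

    import numpy as np

    # Hypothetical inputs: "depth" is an HxW metric depth map (meters) from
    # DepthAnythingV2, "mask" is an HxW boolean object mask from YOLOv11 segmentation.
    # fx, fy, cx, cy are camera intrinsics (placeholder values).
    fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

    def mask_to_aabb(depth: np.ndarray, mask: np.ndarray):
        # Pixel coordinates covered by the object mask
        v, u = np.nonzero(mask)
        z = depth[v, u]
        # Pinhole back-projection: pixel (u, v) with depth z -> camera-space (x, y, z)
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        pts = np.stack([x, y, z], axis=1)          # N x 3 point cloud for the object
        return pts.min(axis=0), pts.max(axis=0)    # opposite corners of the 3D AABB

    # aabb_min, aabb_max = mask_to_aabb(depth, mask)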
When I started learning about machine learning, I always thought of it as taking input features and mapping them to a few output classes. I knew the input could be an image (essentially all the pixels), but I hadn't really seen image-like outputs before, which is essentially what R-CNN and later YOLO produce with their detections and segmentation masks. (YOLO being a faster, single-pass approach that allows real-time object detection.)
The default model outputs relative inverse depth, so converting it to a Z distance is roughly Z = 1 / depth, and only up to an unknown scale and shift. But DepthAnythingV2 also has a "metric" model that gives you Z = depth directly in meters, which is far more useful here. Note that the metric model has separate training weights for indoor vs. outdoor scenes. Also note that the generated depth isn't perfect and doesn't hold up to rotation very well.
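A minimal sketch of the difference (function names are made up):

    import numpy as np

    # The relative model's inverse depth is only defined up to an unknown scale
    # and shift, so 1/d gives you geometry up to that ambiguity; the metric model
    # gives meters directly.
    def z_from_relative(inv_depth: np.ndarray, scale: float = 1.0, eps: float = 1e-6) -> np.ndarray:
        return scale / np.clip(inv_depth, eps, None)   # scale is unknown without extra calibration

    def z_from_metric(depth_m: np.ndarray) -> np.ndarray:
        return depth_m                                 # already a Z distance in meters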
But while working on this project I had my mind blown twice, which is why I wanted to get this online and spread the knowledge a bit. First, by Gaussian Splatting, which is essentially fast, photorealistic point clouds. You go to a location, take some photogrammetry-style photos (spin around something and take photos from multiple angles, short distances apart), and generate Gaussian splats (check out something like PolyCam's webpage; they do photogrammetry and have an online Gaussian splat generation tool). Then you have a 3D representation of the real-world scene that renders at 100+ FPS and looks photorealistic. The biggest downside is the static nature of the scene and, I assume, difficulty animating it, similar to voxels. The geometry is composed of 3D ellipsoids (3D Gaussians) that also carry an alpha blending value. To render, for each pixel you blend the splats front to back along the view ray until the pixel's opacity saturates, kind of like ray tracing and volumetric rendering. -- I actually stumbled onto Nvidia's NeRF first, which was neat. Also, COLMAP, which is used to recover camera poses from multiple images (structure from motion), is pretty impressive. (NeRF and Gaussian Splatting both use COLMAP to generate their input data from images.)
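To make the "blend until opacity saturates" part concrete, here's a toy per-pixel compositing loop. This is not the real CUDA rasterizer; assume "splats" is a hypothetical front-to-back sorted list of (color, alpha) contributions already projected and evaluated at this pixel:

    # Toy sketch of per-pixel front-to-back alpha compositing, the core of how
    # Gaussian Splatting turns sorted splats into a pixel color.
    def composite_pixel(splats, stop_threshold=1e-3):
        color = [0.0, 0.0, 0.0]
        transmittance = 1.0
        for rgb, alpha in splats:                  # front to back
            w = alpha * transmittance              # this splat's contribution
            color = [c + w * s for c, s in zip(color, rgb)]
            transmittance *= (1.0 - alpha)
            if transmittance < stop_threshold:     # pixel is effectively opaque, stop early
                break
        return color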
But as I was looking into Gaussian Splatting, I was wondering how they took the 3D point clouds from COLMAP, turned them into Gaussians, and then "corrected" the Gaussians using backpropagation. That's when I stumbled into differentiable rendering.
Differentiable rendering / inverse rendering is essentially software rendering with the property that you can backpropagate through it, because the entire process is differentiable end to end. As a refresher: machine learning corrects the weights of a model by backpropagating the loss (usually mean squared error or cross-entropy / log loss) computed from the predicted and intended outputs, then stepping the weights in the negative gradient direction, which corrects them the most per step.
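In code, that loop looks something like this (a throwaway PyTorch example on random data, just to show the predict / loss / backward / step cycle):

    import torch

    x = torch.randn(32, 4)                     # random inputs, just to make it run
    y = torch.randn(32, 1)                     # random targets
    model = torch.nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(100):
        loss = torch.nn.functional.mse_loss(model(x), y)   # mean squared error
        opt.zero_grad()
        loss.backward()                        # autodiff computes d(loss)/d(weights)
        opt.step()                             # step weights in the negative gradient direction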
Differentiable rendering does the same thing, except the targets are images and the "weights" are the primitives. Say you have a bunch of random triangles and pictures of a car from multiple angles: you can correct the positions, colors, and orientations of those triangles until the render best matches the car from all angles. Not sure if that's the best explanation, but it's pretty amazing. If you think about rendering, the biggest obstacle to differentiability is the rasterization of triangles to pixels, since hard pixel coverage is discontinuous; once you soften that, automatic differentiation (autodiff) takes over. Autodiff keeps track of the derivative of each operation as your code runs the forward pass, and those derivatives are then chained together in the "backward" step during backpropagation. I don't want to get too far into the weeds, but essentially you can backpropagate from a 2D image all the way back to 3D primitives.
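Here's a toy 2D version of that idea in PyTorch, loosely in the spirit of the ellipse example linked below: render a few soft Gaussian blobs (soft, so rasterization stays differentiable), compare against a target image, and let autodiff nudge their positions, sizes, and intensities. Everything here (counts, sizes, the random target) is made up just so it runs:

    import torch

    H, W, N = 64, 64, 16
    target = torch.rand(H, W)                       # stand-in for a real target image

    pos   = torch.rand(N, 2, requires_grad=True)    # blob centers in [0,1]^2
    scale = torch.full((N,), 0.1, requires_grad=True)
    inten = torch.rand(N, requires_grad=True)       # per-blob intensity

    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1)            # H x W x 2 pixel coordinates

    opt = torch.optim.Adam([pos, scale, inten], lr=0.01)
    for _ in range(500):
        d2 = ((grid[None] - pos[:, None, None]) ** 2).sum(-1)   # N x H x W squared distances
        blobs = inten[:, None, None] * torch.exp(-d2 / (2 * scale[:, None, None] ** 2))
        image = blobs.sum(0)                        # "rendered" image, differentiable w.r.t. the params
        loss = torch.nn.functional.mse_loss(image, target)
        opt.zero_grad()
        loss.backward()                             # gradients flow from pixels back to the primitives
        opt.step()

The real Gaussian Splatting pipeline is the 3D, view-dependent, GPU-tiled version of this loop, but the principle is the same: the renderer is just another differentiable function between the primitives and the loss.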
Anyway, here are some links about all these things:
YOLO / DepthAnythingV2 Project Page
YOLO / DepthAnythingV2 Project Video
Original PjReddie YOLO page
Ultralytics YOLOv11 Page
DepthAnythingV2 Page
Gaussian Splatting Paper Github
Nvidia NeRF blog page
COLMAP
TinyDiffRast - Differentiable Rendering
Differentiable Rendering interactive example using ellipses
Gaussian Splatting Talk
Robot Arm using Isaac Sim
Nvidia GR00T page