3D scene graph inference and refinement for vision-as-inverse-graphics [article]

Lukasz Romaszko, Taku Komura, John Winn, Chris Williams, University of Edinburgh
The goal of scene understanding is to interpret images, so as to infer the objects present in a scene, their poses and fine-grained details. This thesis focuses on methods that can provide a much more detailed explanation of the scene than standard bounding boxes or pixel-level segmentation: we infer the underlying 3D scene given only its projection in the form of a single image. We employ the Vision-as-Inverse-Graphics (VIG) paradigm, which (a) infers the latent variables of a scene, such as the objects present and their properties as well as the lighting and the camera, and (b) renders these latent variables to reconstruct the input image. One highly attractive aspect of the VIG approach is that it produces a compact and interpretable representation of the 3D scene in terms of an arbitrary number of objects, called a 'scene graph'. This representation is of key importance, as it can be useful, e.g., if we wish to edit, refine or interpret the scene, or interact with it. First, we investigate how recognition models can be used to infer the scene graph given only a single RGB image. These models are trained using realistic synthetic images and corresponding ground-truth scene graphs, obtained from a rich stochastic scene generator. Once the objects have been detected, each object detection is further processed using neural networks to predict the object and global latent variables. This allows object poses and sizes to be computed in 3D scene coordinates, given the camera parameters. This inference of the latent variables in the form of a 3D scene graph acts like the encoder of an autoencoder, with graphics rendering as the decoder. One of the major challenges is the problem of placing the detected objects in 3D at a reasonable size and distance with respect to the single camera, the parameters of which are unknown. Previous VIG approaches for multiple objects usually only considered a fixed camera, while we allow for variable camera pose. To infer the camera parameters given the votes cast by the detected objects [...]
doi:10.7488/era/307 fatcat:wh4yazewtjcj7foogmja22f3ni