Picture perception reveals mental geometry of 3D scene inferences

Erin Koch, Famya Baig, Qasim Zaidi
2018 Proceedings of the National Academy of Sciences of the United States of America  
Pose estimation of objects in real scenes is critically important for biological and machine visual systems, but little is known of how humans infer 3D poses from 2D retinal images. We show unexpectedly remarkable agreement in the 3D poses different observers estimate from pictures. We further show that all observers apply the same inferential rule from all viewpoints, utilizing the geometrically derived back-transform from retinal images to actual 3D scenes. Pose estimations are altered by a
fronto-parallel bias, and by image distortions that appear to tilt the ground plane. We used pictures of single sticks or pairs of joined sticks taken from different camera angles. Observers viewed these from five directions, and matched the perceived pose of each stick by rotating an arrow on a horizontal touchscreen. The projection of each 3D stick to the 2D picture, and then onto the retina, is described by an invertible trigonometric expression. The inverted expression yields the back-projection for each object pose, camera elevation, and observer viewpoint. We show that a model that uses the back-projection, modulated by just two free parameters, explains 560 pose estimates per observer. By considering changes in retinal image orientations due to position and elevation of limbs, the model also explains perceived limb poses in a complex scene of two bodies lying on the ground. The inferential rules simply explain both perceptual invariance and dramatic distortions in poses of real and pictured objects, and show the benefits of incorporating the projective geometry of light into mental inferences about 3D scenes.

3D scene understanding | picture perception | mental geometry | pose estimation | projective geometry

The three panels in Fig. 1A show one pair of connected sticks lying on the ground, pictured from different camera positions. The angle between the perceived poses of the two sticks changes from obtuse (Fig. 1A, Left) to approximately orthogonal (Fig. 1A, Center) to acute (Fig. 1A, Right), illustrating striking variations in perception of a fixed 3D scene across viewpoints. Fig. 1B shows two dolls lying on the ground, pictured from different camera positions. The perceived angle between the two bodies changes from obtuse (Fig. 1B, Left) to approximately orthogonal (Fig. 1B, Right). The situation seems quite different when the picture of the 3D scene in Fig. 1A, Center is viewed from different angles in Fig. 1C.
The entire scene seemingly rotates with the viewpoint, so that perceived poses are almost invariant with regard to the observer; e.g., the brown stick points at the observer regardless of screen slant. Similarly, in Fig. 1D, the doll in front always points toward the observer, even when the viewpoint shifts by 120°. The tableau in Fig. 1D was based on a painting by Philip Pearlstein that appears to change markedly with viewpoint. It has the virtue of examining pose estimation of human-like limbs located at many positions in the scene, some flat on the ground, whereas others could be elevated on one side or float above the ground. As opposed to relative poses, relative sizes of body parts change more in oblique views of the 2D picture than of the 3D scene. Interestingly, extremely oblique views of the pictures appear as if the scene tilts toward the observer. We present quantification of these observations, and show that a single model explains both perceptual invariance (1-4) and dramatic distortions (5-9) of pose estimation in different views of 3D scenes and their pictures.

Results

Geometry of Pose Estimation in 3D Scenes. For a camera elevation of ϕ_C, a stick lying at the center of the ground plane with a pose angle of Ω_T uniquely projects to the orientation θ_S on the picture plane (SI Appendix, Fig. S1A; derivation in SI Appendix, Supplemental Methods):

    θ_S = atan(tan(Ω_T) sin(ϕ_C)).   [1]

Seen fronto-parallel to the picture plane (observer viewing angle ϕ_V = 0), the orientation on the retina θ_R = θ_S. As shown by the graph of Eq. 1 in SI Appendix, Fig. S1A, sticks pointing directly at or away from the observer along the line of sight (Ω_T = 90° or 270°) always project to vertical (θ_R = 90° or 270°) in the retinal plane, while sticks parallel to the observer project to horizontal in the retinal plane. For intermediate pose angles, there is a periodic modulation around the unit diagonal.
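As a minimal sketch, Eq. 1 and its inverse can be computed directly; the function names here are my own, and the inverse assumes, as the paper does, that the stick lies on the ground plane and the picture is viewed fronto-parallel (θ_R = θ_S):

```python
import math

def project_pose(omega_t_deg, phi_c_deg):
    """Eq. 1: project a 3D ground-plane pose omega_t (deg) to the
    picture-plane orientation theta_s (deg) for camera elevation phi_c.
    atan2 keeps the projection in the same half-plane as the pose."""
    omega = math.radians(omega_t_deg)
    phi = math.radians(phi_c_deg)
    theta = math.atan2(math.sin(omega) * math.sin(phi), math.cos(omega))
    return math.degrees(theta) % 360.0

def back_project(theta_r_deg, phi_c_deg):
    """Inverse of Eq. 1 (a hypothetical helper, not from the paper's code):
    recover the 3D pose from a retinal orientation, assuming the stick
    lies on the ground and the picture is viewed fronto-parallel."""
    theta = math.radians(theta_r_deg)
    phi = math.radians(phi_c_deg)
    omega = math.atan2(math.sin(theta), math.cos(theta) * math.sin(phi))
    return math.degrees(omega) % 360.0

# A pose along the line of sight projects to (approximately) vertical:
print(project_pose(90.0, 30.0))   # ~90.0, for any camera elevation
# The back-projection recovers intermediate poses exactly:
print(back_project(project_pose(45.0, 30.0), 30.0))  # ~45.0
```

Using atan2 rather than atan is a convenience: it resolves the quadrant automatically, so poses of 90° vs. 270° (toward vs. away from the observer) stay distinct through the round trip.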
If observers can assume that the imaged stick is lying on the ground in one piece (10), they can use the back-projection of Eq. 1 to estimate the physical 3D pose from the retinal orientation. Fig. 2A, center column (View 0 Deg), shows the back-projection curve for physical 3D poses against their 2D retinal orientations. To study how humans estimate 3D poses, we used sticks lying on the ground at 16 equally spaced poses, either alone or joined to another stick at an angle of 45°, 90°, or 135°. Observers viewed these from five directions.

Significance

We show that, in both 3D scene understanding and picture perception, observers mentally apply projective geometry to retinal images. Reliance on the same geometrical function is revealed by the surprisingly close agreement between observers in making judgments of 3D object poses. These judgments are in accordance with those predicted by a back-projection from retinal orientations to 3D poses, but are distorted by a bias to see poses as closer to fronto-parallel. Reliance on retinal images explains distortions in perceptions of real scenes, and invariance in pictures, including the classical conundrum of why certain image features always point at the observer regardless of viewpoint. These results have implications for investigating 3D scene inferences in biological systems, and for designing machine vision systems.
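The pointing-at-the-observer invariance can be checked numerically: a vertical retinal orientation back-projects to a pose along the line of sight for every camera elevation, so the same image feature points at the observer however the picture is approached. A minimal sketch, with the back_project helper being my own naming for the inverse of Eq. 1 under the ground-plane assumption:

```python
import math

def back_project(theta_r_deg, phi_c_deg):
    """Inverse of Eq. 1 (hypothetical helper): map a retinal orientation
    (deg) back to a 3D ground-plane pose (deg) for camera elevation phi_c."""
    t = math.radians(theta_r_deg)
    p = math.radians(phi_c_deg)
    return math.degrees(math.atan2(math.sin(t), math.cos(t) * math.sin(p))) % 360.0

# Whatever elevation the observer assumes, a vertical retinal image
# (theta_R = 90 deg) always yields a pose pointing straight at them:
for phi_c in (15.0, 30.0, 60.0, 85.0):
    assert abs(back_project(90.0, phi_c) - 90.0) < 1e-9
```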
doi:10.1073/pnas.1804873115 pmid:29987008