VR / AR Fundamentals — 2) Audiovisual Spatiality & Immersion

Michael Naimark
17 min read · Feb 9, 2018


Fooling two eyes and two ears, moving around, and with an unframed image.

[March 2019: This article, and the entire VR/AR Series, is now available bilingually in Chinese and English courtesy of NYU Shanghai.]

Welcome to the third of six weekly posts in sync with my “VR / AR Fundamentals” class at NYU Shanghai, where I’m currently visiting faculty. The class was billed as “partially technical but in an understandable way for general liberal arts students.”

Hope you find these helpful. Your feedback is welcome!

Lots to cover today. Let’s get started.

Perhaps the biggest surprise after Facebook acquired Oculus in 2014 was how VR video stole the show. Choose your metric: press, festivals and art shows, or the popularity of non-interactive mobile VR hardware like Google Cardboard and non-interactive cinematic apps like Jaunt and Within. By any of them, most of the general public’s first VR experience was a VR documentary, a DIY artwork, or a corporate-sponsored short. Gamers were infuriated. Oculus founder Palmer Luckey railed to me about a particular VR camera company “going into a dark tunnel with no end” and was public in saying that “stereoscopic video” is not “3D video.”

In a sense, Palmer is right. We’ll learn why, and why not, and what the tradeoffs are, and where the future lies, in today’s session.

Stereoscopic & Stereophonic Displays

Fooling two static eyes and two static ears looking at a framed image

Last week we overviewed audiovisual resolution and fidelity from a single framed source to a single stationary eye and ear. That’s far from how we normally experience image and sound. Now, let’s overview what it takes to fool both eyes and both ears with a stationary head and a framed representation.

Parallax & Disparity

The first thing to understand about stereoscopic displays is that we have two eyes, so we see objects in front of us from slightly but significantly different angles. This difference is called parallax. Disparity, particularly binocular disparity, without splitting hairs in this context, means the same thing.

As they relate to VR and AR, parallax and disparity vary with distance: when we look at things close up, parallax and disparity are more pronounced, and when we look at things far away, they fade to zero. Indeed, when we look out of an airplane from 35,000 feet at the ground, both eyes see essentially the same thing.

So if you’re making a VR travelogue consisting of far-off mountains, skylines, and helicopter shots, stereoscopy is unnecessary. For practically everything else, stereoscopy is essential, especially for anything “within arm’s reach” (one reason we have two eyes) or for anything intimate, like people close up.
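To put rough numbers on how parallax fades with distance, here is a minimal Python sketch. It assumes a point straight ahead and a 63 mm eye separation (an illustrative value, not a measurement), and the function name is made up for this example.

```python
import math

def parallax_angle_deg(distance_m, ipd_m=0.063):
    """Angle, in degrees, between the two eyes' lines of sight to a point straight ahead.

    ipd_m is an assumed eye separation used only for illustration.
    """
    return math.degrees(2.0 * math.atan((ipd_m / 2.0) / distance_m))

# 10668 m is roughly 35,000 feet.
for d in (0.5, 2.0, 10.0, 100.0, 10668.0):
    print(f"{d:>8.1f} m -> {parallax_angle_deg(d):.4f} degrees")
```

At arm's reach the two eyes' views differ by several degrees; from airplane altitude the difference is far below anything a display could show.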

The most under-appreciated feature of VR and AR headsets is how awesomely well they represent close-up imagery, much better than theater-based stereoscopic (3D) movies. Hollywood has long known that it’s best to keep 3D imagery at or behind the screen and to save imagery in front of the screen for “gotchas” like blood, bats, and broomsticks. The individual displays for each eye in headsets overcome this “big screen close-up 3D” problem. The VR games community knows this, and consequently much of VR game imagery is literally in your face.

Generating stereoscopic images from 3D computer models is a no-brainer. But, as we’ll soon learn, it’s far from trivial for camera-based VR imagery.

Interpupillary Distance

Interpupillary distance, or IPD, is simply the distance between our two eyes. We see IPD on our eyeglasses prescription so the lenses can be properly centered on each eye. IPD varies widely in the general population, from around 50 to 75 mm, with a slight gender difference (62 mm for women and 64 mm for men, on average).

High-end VR systems like the HTC Vive and Oculus Rift, and AR systems like Microsoft HoloLens, have IPD adjustments on their headsets. One reason is to maximize the amount of light going into our eyes. But many in the VR and AR tech communities insist that custom matching IPD is essential for perceptual reasons. Imagine having a VR experience with eyes narrower or wider than yours!

Does it matter? If so, are the lower-end VR and AR headsets with no IPD adjustment, like Google Daydream and Samsung Gear VR, doing more harm than good? Or are most people flexible and adapt if things are a little off?

Here’s an interesting data point.

Children’s exhibit by Edwin Schlossberg for Macomber Farm, 1981

“I’m Seeing Like a Sheep” is a non-electronic exhibit simply consisting of two short periscopes turned sideways. Looking through the exhibit changes your IPD to that of a sheep, much wider than ours. I’ve done it, and the effect is intense and mesmerizing, and difficult to describe: everything looks “more 3D” or “closer but not bigger.” On the one hand, it’s pretty weird. On the other hand, most people’s eyes adapt and enjoy the experience.

I’m partially of the “let ’em adapt” school for many VR and AR applications, especially to encourage extreme and artful experimentation. At the same time, I acknowledge and respect the work in the labs quantifying such variations. Both are important at this early stage.


Convergence & Accommodation

Convergence is the natural movement of our eyes inward to “fuse” near objects. With VR and AR displays, left and right images must be properly aligned for proper convergence.

Most people can feel when their eyes converge on near targets

Convergence and accommodation (the focusing of our eyes at a given distance) are often referred to together as the pillars of stereoscopic imaging.

Remember how, in the Prologue, I asserted that the biggest question today regarding AR headsets is whether see-through opacity is doable? Here’s an alternative scenario.

You’re wearing an AR headset in your office or living room and a close-up AR-generated image of an apple appears. The headset presents the proper left-right binocular disparity for each eye, and both images are left/right offset for proper convergence. And in the case of Magic Leap, the image of the apple appears focused at the appropriate distance, perhaps a half meter away.

So, how much will you care if the apple is a little transparent rather than opaque? The apple appears singular (properly converged) while the background appears double. The apple is in focus and the background is not. Maybe you won’t care. Maybe it’s an attention thing.

Stereo & Binaural Sound

Most everyone has heard stereophonic sound, from speakers and earphones, and we easily understand how stereo can make a sound source appear spatial if it sounds louder in one ear than the other.

It turns out that when a sound source comes from the right, not only does the right ear hear it louder, the right ear also hears it earlier. Our audio processing and brain are extraordinarily sensitive to these microsecond differences. This timing, or “phase,” difference is the defining feature of binaural sound. Hearing A-B comparisons between stereo-only and binaural sound is mind-blowing, particularly, as you may guess, for close-up or “intimate” audio, like someone whispering in your ear.

It also turns out that, because small timing differences are so perceptible, inner reflections around the shape of our ears affect where we hear a sound coming from. Like the interpupillary distances of our eyes, our ear shapes differ, as do our head sizes, shapes, and densities. The sum total of these differences is called the head-related transfer function (HRTF) and is, like IPD, measurable over large populations, albeit more complicated.
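For a sense of scale on those arrival-time differences, here is a rough Python sketch using Woodworth's classic spherical-head approximation. It deliberately ignores ear shape and everything else that goes into a real HRTF, and the head radius and speed of sound are assumed textbook values, not measurements of anyone in particular.

```python
import math

def interaural_time_difference(azimuth_deg, head_radius_m=0.0875, speed_of_sound_mps=343.0):
    """Woodworth's spherical-head estimate of the arrival-time gap, in seconds.

    azimuth_deg: 0 is straight ahead, 90 is directly to one side.
    head_radius_m and speed_of_sound_mps are assumed typical values.
    """
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("this sketch only covers 0 to 90 degrees")
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound_mps) * (theta + math.sin(theta))

for az in (0, 15, 45, 90):
    itd_us = interaural_time_difference(az) * 1e6
    print(f"{az:>2} degrees -> {itd_us:.0f} microseconds")
```

Even for a sound directly to one side, the gap stays well under a millisecond, which is why the timing sensitivity of our hearing matters so much.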

Neumann KU 100 Binaural Dummy Head

Also like IPD, many insist that HRTF is essential to get right for each person, otherwise the sound will sound “misplaced.” Binaural sound can be synthesized or it can be digitally added to pre-recorded sound. The most common way to achieve binaural sound is to record with binaural mics, little microphones worn in the recordist’s ears. Another way to record binaural sound is with a specially made binaural dummy head, whose physical characteristics match an average head. Professional binaural dummy heads are not cheap, and some experts claim they can hear differences between them.

Multiscopic & Multiphonic Displays

Fooling moving eyes and ears looking at a framed image

So now we understand how to fool both eyes and ears into believing a representation is indistinguishable from the real scene, but without moving the head and only looking at a non-panoramic, framed image. Freeing the head to move around, even a bit, is even more complicated.


Holograms

I polled my class: “How many of you have seen a hologram?” Almost everyone raised their hand. Then I described what a real hologram is (a recording of the interference patterns between coherent “reference” light and the same light reflected off of an object) and what it can and cannot do, and asked the class again. The number dropped to two.

Concert size holograms are not holograms.

Humans recorded by many cameras pointing at them to make 3D computer models are not holograms.

“Projected” holograms, where the image appears “in thin air” not in the line-of-sight with the hologram itself, are not holograms.

Immersive displays that use “holographic lenses” (Microsoft HoloLens) or “holographic reflectors” (Intel Vaunt AR glasses) to channel otherwise 2D images are not holograms.

My intention is not to be the hologram police; bigger battles exist. There are, however, two defining features of actual holographic displays, and if they’re missing, it’s fair to call it out.

One is per-pixel focus, meaning, like in the real world, when our eyes fixate on any part of any thing, they accommodate, or focus, on it. When we look at a face close up, the nose is a different distance than the ears, and our eyes will accommodate accordingly.

The other feature is true multiscopic viewing where all viewpoints of the image can be present and visible at the same time: no need for glasses. Like looking through a window, when we bob our head up and down or left and right, the view changes accordingly. If we mask off all but a tiny spot on the window, we can still see all viewpoints by moving our head. That requires a lot of images for each tiny spot, on the order of thousands of times more than for a single pixel. While stereoscopic displays simply require twice as much data as monoscopic displays, holographic displays require thousands of times more data to display all viewpoints.

Light Field Lab recently announced a holographic monitor with per pixel accommodation and where all viewpoints are visible. The current prototype is only 4 by 6 inches. The source imagery for this powerful little window can be generated from 3D computer graphics or it can come from a camera, but a pretty badass one like we’ll see below.

Light Fields

A light field, like a hologram, has a specific technical definition — a 4D set of rays hitting every point of the surface (u,v) arriving from every direction (theta, phi) — which can allow seeing every viewpoint inside or in between the set. And like holography, “light field” is being used pretty fast and loose. One reason may be that it’s easy to think of it generically, “fields of light,” unaware that it’s actually a science. And to the credit of several of my colleagues, I’m hearing “light fields-like” to acknowledge that there are things like light fields that are not light fields. (Oddly, we almost never hear “hologram-like.”)

Light fields can be computationally derived from 3D computer models or they can be recorded with cameras, lots of cameras.

Paul Debevec, from Stanford SCIEN Workshop on Light Field Imaging, 2015

A light field with a camera array like this can provide every viewpoint within the sphere. The more cameras, the more resolution, but at a cost of data size.
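As a back-of-the-envelope sketch of why the data grows so quickly, the following Python counts ray samples in a discretized 4D light field and compares the total to a single 2D view. The grid sizes are assumptions picked only for illustration (a 1920 by 1080 surface with 64 by 64 directions per point), not figures from any real rig.

```python
def light_field_samples(n_u, n_v, n_theta, n_phi):
    """Number of ray samples in a light field discretized over (u, v, theta, phi)."""
    return n_u * n_v * n_theta * n_phi

# Assumed, illustrative grid sizes.
single_view = 1920 * 1080
full_field = light_field_samples(1920, 1080, 64, 64)

ratio = full_field // single_view
print(f"light field holds {ratio} times the samples of one 2D view")
```

With these assumed numbers the ratio is 4096, in line with the "thousands of times more data" estimate for showing all viewpoints.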

Volumetric Video

Volumetric video can best be imagined as pixels in 3D space rather than in 2D space. These 3D pixels are sometimes called Voxels or Point Clouds.

This is a good moment to introduce Depth Maps, which add a “depth” channel, sometimes called “Z,” to the conventional red, green, and blue image channels, so that each pixel has an RGBZ value. This Z channel can be actively measured, for example with an infrared emitter and sensor as in the Microsoft Kinect, or it can be passively computed by matching features and flows using multiple cameras.

Depth maps, where white is close and black is far
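For the passive, multi-camera route, the standard relationship for two parallel, rectified cameras is depth = focal length × baseline / disparity. Here is a minimal Python sketch of that one formula; the function name and example numbers are hypothetical, and real systems spend most of their effort on the feature matching that produces the disparity in the first place.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in meters of a feature matched between two parallel, rectified cameras.

    disparity_px: horizontal shift of the feature between the two images, in pixels.
    focal_length_px: focal length expressed in pixels.
    baseline_m: distance between the two camera centers, in meters.
    """
    if disparity_px <= 0:
        return float("inf")  # no measurable shift: treat as infinitely far away
    return focal_length_px * baseline_m / disparity_px

# Assumed example: 1000 px focal length, cameras 10 cm apart.
for shift in (100, 10, 1):
    print(f"{shift:>3} px shift -> {depth_from_disparity(shift, 1000, 0.1):.1f} m")
```

Note how the same one-pixel matching error matters far more for distant objects, where the disparity is already tiny.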

Depth maps allow RGB pixels to be “splatted” out in the Z dimension to make a 3D “point cloud” model, allowing 3D navigation around it. But there’s a little problem with only one depth map: when moving around, sometimes “nothing,” or occlusions, may be seen.

Early example of a 3D point cloud made via a depth map derived from a stereo pair of images
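The “splatting” of RGB pixels out in Z can be sketched in a few lines of Python using a simple pinhole camera model. This is a minimal illustration under stated assumptions: depth values are metric distances (not the white-is-close grayscale shown in pictures), the intrinsics fx, fy, cx, cy are hypothetical parameters, and missing depth is marked with None.

```python
def depth_map_to_points(depth, rgb, fx, fy, cx, cy):
    """Turn an RGBZ image into a list of (x, y, z, color) points.

    depth: rows of metric depth values, None where depth is unknown.
    rgb:   rows of color tuples, same shape as depth.
    fx, fy, cx, cy: assumed pinhole intrinsics, in pixels.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None or z <= 0:
                continue  # no depth here: this pixel leaves a hole in the cloud
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, rgb[v][u]))
    return points

# Tiny 2x2 example with one missing depth value.
depth = [[1.0, 2.0], [None, 4.0]]
rgb = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (9, 9, 9)]]
cloud = depth_map_to_points(depth, rgb, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(len(cloud), "points")
```

Viewing such a cloud from anywhere other than the original camera position is exactly where the holes show up.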

Occlusions can be filled in either by interpolation or by additional data, like from other viewpoints. Both are lively areas in volumetric video today, roughly aligned with Hollywood taking the lots-of-cameras approach and Silicon Valley (and, notably, Seattle) taking a computational approach.

Spatial Sound

In the context of VR and AR, spatial sound is when sounds seemingly emanate from particular directions or locations. Using head tracking, these sounds can appear to “stay put” as we move our head around. Thus, spatial sound in headsets is significantly different from home stereo and 5.1 sound systems.

“Ambisonic sound” is a common spatial sound format for VR and AR, adding “full sphere” directionality. Controlling amplitude or loudness of sounds at different angles in the sphere may approximate closeness or proximity of the sound.

Better is when the ambisonic or spatial sound is also binaural, where sound locations can be additionally represented by phase or delay from one ear to the other. Several binaural ambisonic sound formats are emerging.
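To make “full sphere” directionality and “staying put” a little more concrete, here is a minimal Python sketch of first-order ambisonic encoding plus a yaw rotation for head tracking. It assumes the traditional B-format convention (W scaled by 1/√2, azimuth measured counter-clockwise from straight ahead, so +Y is to the left); other conventions exist, and the function names are made up for this example.

```python
import math

def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into first-order B-format channels (W, X, Y, Z)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                 # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)    # front-back
    y = sample * math.sin(az) * math.cos(el)    # left-right
    z = sample * math.sin(el)                   # up-down
    return w, x, y, z

def rotate_for_head_yaw(w, x, y, z, head_yaw_deg):
    """Counter-rotate the sound field so sources stay put when the head turns.

    head_yaw_deg is positive when the listener turns to the left.
    """
    psi = math.radians(head_yaw_deg)
    x_rot = x * math.cos(psi) + y * math.sin(psi)
    y_rot = -x * math.sin(psi) + y * math.cos(psi)
    return w, x_rot, y_rot, z

# A source straight ahead; the listener then turns 90 degrees to the left.
front = encode_first_order(1.0, 0.0, 0.0)
print(rotate_for_head_yaw(*front, head_yaw_deg=90.0))
```

After the turn, the source's energy sits on the negative Y axis, meaning it is now heard from the listener's right, which is what "staying put" in the room requires.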

A noteworthy alternative for VR and AR spatial sound is the use of many loudspeakers, for example, one for each sound source, in place of earphones and head tracking (which will happen “automatically” when one turns one’s head). Many loudspeakers may be a good solution for “VR theaters” or location-based venues, where each audience member wears a VR or AR headset but can keep their ears uncovered.

Panoramic Displays

Fooling moving eyes and ears looking all around

So now we understand how to “look through a virtual window” pretty well. We know about the resolution and fidelity for image and sound, and we know how to fool both eyes and ears even when the head can freely move around. How can we now expand from framed to unframed, and fool the eyes and ears in an immersive panorama?

The word “panorama” was coined by British painter Robert Barker in 1792 to describe his novel way to paint cylindrical landscapes, and by the mid-1800s, large-scale panoramas, housed in dedicated architecture, were a popular form of “being” somewhere else.

Giant painting, elevated platform, and soft overhead light

Panoramic photography followed shortly after the birth of photography, and panoramic cinema followed shortly after the birth of cinema. Here we’ll touch upon relevant elements for VR and AR today.

Monoscopic Panoramas & No-Parallax Points

The simplest way to make VR or AR panoramas is with a monoscopic, or 2D, representation. There will be no stereoscopic 3D since both eyes see the same thing. Most VR video shot today is with monoscopic panoramic cameras. The single 2D representation can be treated like conventional video in the editing and post-production pipeline, albeit where each video frame is a 360 by 180 degree “equirectangular” image.
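The equirectangular layout is simple enough to sketch in Python: horizontal position is proportional to yaw across 360 degrees, and vertical position is proportional to pitch across 180 degrees. The conventions here are assumptions for illustration (yaw 0 at the image center and increasing to the right, pitch positive upward), and the function name is hypothetical.

```python
def direction_to_equirect(yaw_deg, pitch_deg, width_px, height_px):
    """Map a viewing direction to fractional pixel coordinates in an equirectangular frame.

    yaw_deg:   -180 to 180, with 0 at the horizontal center of the image (assumed).
    pitch_deg: -90 (straight down) to 90 (straight up).
    """
    u = ((yaw_deg + 180.0) / 360.0 * width_px) % width_px  # wrap around the seam
    v = (90.0 - pitch_deg) / 180.0 * height_px
    return u, v

# Looking straight ahead in a 4096 x 2048 frame lands in the middle of the image.
print(direction_to_equirect(0.0, 0.0, 4096, 2048))
```

One consequence worth noticing: a 4096-pixel-wide frame spreads those pixels over 360 degrees, so only about 11 pixels cover each degree of view.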

To cover the full panoramic sphere, more than one camera is required: two 180 degree cameras, three 120 degree cameras, up to rigs sometimes containing ten or more cameras. The attraction to many cameras, and big ones, is resolution and fidelity (which is why we spent the entire first session on this).

But many and big cameras present a problem: in order to have a singular viewpoint, all of the cameras must shoot from the same viewpoint, otherwise there will be discontinuities in the image. This point may be called the no-parallax point (often erroneously called “nodal point”) and the further these points are offset from each other, the more the discontinuities may show, particularly for close-up objects. This is what’s referred to as “stitching error.” The sad truth is that 2D panoramic cameras with parallax offset can result in fundamentally unfixable error, particularly for near-field imagery.

Some error may be fixable by interpolation and of course everything is fixable painstakingly by hand. Either way, a lucrative new VR sub-industry has developed to fix VR video made with big, offset cameras used for their high resolution and fidelity.
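To get a feel for how an offset between no-parallax points turns into stitching error, here is a rough Python estimate. It assumes a simplified geometry (the object sits perpendicular to the line between two cameras, and the stitch was aligned for a single chosen distance), so treat the numbers as an order-of-magnitude sketch rather than a model of any real rig.

```python
import math

def stitch_error_px(offset_m, object_m, stitch_distance_m, pano_width_px):
    """Approximate seam mismatch, in panorama pixels, for an object at object_m.

    offset_m:          distance between the two cameras' no-parallax points.
    stitch_distance_m: the distance at which the stitch was aligned (assumed).
    pano_width_px:     width of the 360-degree equirectangular frame.
    """
    angle_object = math.atan2(offset_m, object_m)
    angle_aligned = math.atan2(offset_m, stitch_distance_m)
    mismatch_deg = abs(math.degrees(angle_object - angle_aligned))
    return mismatch_deg * pano_width_px / 360.0

# Assumed rig: 6 cm offset, stitch aligned at 10 m, 4096-pixel-wide panorama.
for distance in (0.5, 1.0, 3.0, 10.0):
    err = stitch_error_px(0.06, distance, 10.0, 4096)
    print(f"object at {distance:>4.1f} m -> about {err:.1f} px of mismatch")
```

The mismatch vanishes at the alignment distance and grows quickly as objects come closer, which matches the experience that near-field imagery is where seams fall apart.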

The VR headset makers are frustrated because the imagery isn’t 3D or even stereoscopic, a key selling point for VR in the first place. Many VR producers shoot monoscopic panoramas because they’ve heard that stereoscopic panoramas are even more problematic. It’s a chicken-egg situation.

Stereoscopic Panoramas & Head Rotation

Stereo panoramas are a stereo pair of 2D panoramic images. Many consider them a sweet spot between flat, 2D monoscopic panoramic images and the very large data needed for multiscopic panoramas. They provide stereoscopic 3D and only require twice as much data as monoscopic panoramas. The tech giants know this, as does a flurry of startups.

Stereo-panoramic video cameras. All are less than 5 years old.

Stereo-panoramic video is indeed stereo and panoramic, and to many the resulting imagery looks very much “like being there.” But looking around in VR or AR with stereo-panoramic video is not exactly the same as how our heads move and rotate.

Stereo-panoramic cameras fall neatly into two categories: paired and unpaired, referring to lens arrangement. Paired cameras, whether one pair (VR 180), two pairs (two Ricoh Thetas side by side), or three and on up, can be viewed raw and easily assembled, but with all of the artifacts concentrated on the vertical seams, which can be hand-fixed to some extent. Unpaired cameras require computation to view anything stereo-panoramic. A good rule of thumb is that if the camera was made by someone without a PhD in Computer Science or a lab behind them, it’s most likely a paired camera rig.

Our eyes are in front of the axis of rotation of our neck. This is a nice evolutionary feature which allows us to gain additional motion parallax (or have parallax with only one eye). Stereo-panoramic video, by virtue of being only two viewpoints, does not add any additional parallax.
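How much do the eyes actually travel when the head merely rotates? A quick Python sketch, assuming the eyes sit about 10 cm in front of the neck's rotation axis (an illustrative guess, not an anatomical measurement):

```python
import math

def eye_shift_m(yaw_deg, eye_to_axis_m=0.10):
    """Straight-line distance an eye moves when the head yaws about the neck axis.

    eye_to_axis_m is an assumed distance from the rotation axis to the eyes.
    """
    half_angle = math.radians(abs(yaw_deg)) / 2.0
    return 2.0 * eye_to_axis_m * math.sin(half_angle)

for yaw in (10, 30, 90):
    print(f"{yaw:>2} degree turn -> eye moves {eye_shift_m(yaw) * 100:.1f} cm")
```

Even a modest turn moves each eye by a few centimeters, comparable to the separation between the eyes themselves, and none of that motion is represented in a fixed stereo pair.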

This is a good moment to bring up “3DOF” and “6DOF” in VR and AR systems. “DOF” stands for “degrees of freedom” and 3DOF means only having the 3 “rotational” degrees of freedom: pan, tilt, and roll. 6DOF means having the additional “positional” degrees of freedom: moving left and right, up and down, and in and out (simply: x, y, and z).

Stereo-panoramas, like mono-panoramas, are fundamentally 3DOF imagery. When the user looks left and right or bobs up and down, there is no additional imagery that can accommodate. Computer-generated 3D models are 6DOF by nature — the virtual camera can be anywhere.

To complicate matters further, high-end VR and AR headsets have the technology to offer 6DOF “position” tracking, while mobile VR and AR headsets do not and only offer 3DOF tracking.
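In code terms, the difference between the two kinds of tracking is just which fields of a head pose ever change. Here is a minimal Python sketch with hypothetical names; it is not any headset's actual API.

```python
from dataclasses import dataclass

@dataclass
class HeadPose:
    # Rotational degrees of freedom, in degrees.
    pan: float
    tilt: float
    roll: float
    # Positional degrees of freedom, in meters.
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0

def as_3dof(pose):
    """What a rotation-only (3DOF) system effectively sees: position is discarded."""
    return HeadPose(pose.pan, pose.tilt, pose.roll)

# The user turns 10 degrees and leans 20 cm to the side.
tracked = HeadPose(pan=10.0, tilt=5.0, roll=0.0, x=0.2, y=0.0, z=-0.1)
print(as_3dof(tracked))
```

With stereo-panoramic video, even a 6DOF headset ends up in the second situation: the lean is tracked, but there is no imagery to respond to it.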

So when Palmer Luckey rails that “stereoscopic video” is not “3D video,” he’s referring to the limitations of stereo (and mono) 3DOF VR: it’s not like real head movement.

Multiscopic Panoramas & Bandwidth

Holograms, light fields, and volumetric video can be made panoramic in theory. But as by now you may guess, it’s not cheap.

Last year Intel demonstrated true multiscopic panoramic VR video. The camera rig, made by Hype VR, consisted of 14 high-end video camera systems (the camera bodies alone totaled 70 pounds) spread out over a diameter of around 2 feet.

The demo is dramatic. People in VR headsets can sway back and forth and up and down, all the while seeing the image update accordingly. The limit of the sway is the 2 foot diameter of the camera rig. Unlike “room-scale VR” using 3D computer models, you can’t walk around beyond what the camera shot. The cost?

360 degree 6DOF volumetric video with a 2 foot range is currently 3 GB per frame

That’s 1,000 times the size of professional-grade 4K video, and only to sway around a bit. It may sound excessive, but if you were Intel, wouldn’t this be your future?
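The arithmetic is worth spelling out. In this small Python sketch, the 3 GB per frame and the 1,000 times ratio come from the figures above; the 30 frames per second is my assumption for illustration, since no frame rate is given.

```python
GB = 10**9
TB = 10**12

def bytes_per_second(bytes_per_frame, frames_per_second):
    """Raw data rate for uncompressed playback at a given frame rate."""
    return bytes_per_frame * frames_per_second

volumetric_frame = 3 * GB                   # figure quoted for the demo
implied_4k_frame = volumetric_frame // 1000 # what the 1,000x ratio implies
fps = 30                                    # assumed frame rate

rate = bytes_per_second(volumetric_frame, fps)
print(f"implied 4K frame: {implied_4k_frame / 10**6:.0f} MB")
print(f"volumetric rate:  {rate / GB:.0f} GB per second")
print(f"one minute:       {rate * 60 / TB:.1f} TB")
```

Under that assumed frame rate, a single minute of swaying around runs to several terabytes before any compression.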

There’s a noteworthy irony here. Remember all of those monoscopic panoramic camera configurations with parallax offset resulting in expensive stitching fixes? It’s precisely that offset that may make these configurations ideal for multiscopic panoramic capture.

Putting it all Together

The Holy Grail for audiovisual spatiality and immersion is to forget about video playback, whether it’s 2D, stereoscopic, volumetric, pixels, voxels, frames, or “holograms” and put everything into a 3D computer graphics model. If all of the video can be moved into such a 3D model, looking around and moving around becomes entirely unconstrained.

Well, it’s not like I’m the first guy with such an enlightened thought.

Camera-based and image-based modeling has a long rich history, and every year things progress further. The problems are big, and they are beyond “cartoonishness.”

First, look around wherever you are right now and imagine making a 3D computer model of your environment magically using a camera. You can “sweep” the space to capture all the visual data (be sure to sweep under the table). Systems like this exist, as first popularized by Google’s Project Tango, and are at the core of emerging mobile AR developer tools. But what about motion? You can “flash” instead of sweep, using many cameras covering moving objects from many viewpoints. Systems like this also exist, going back to the camera rig used for the “bullet scene” in The Matrix. But for capturing motion in 3D in the real world, or even wherever you’re reading this right now, you’d need a lot of cameras if you want to minimize occlusions.

The bigger challenge is turning these “models” into models. Importing camera-based imagery into navigable 3D computer models differs from building 3D computer models from scratch in that no semantic knowledge exists. When you build a chair or a tree or a character from scratch in a computer model, “it” (the computer) “knows” what it is. This semantic knowledge enables us to move the chair or sway the tree or fight the character. This information simply doesn’t exist in the pixels or voxels or rays of video.

Making audiovisual material spatial and immersive is a first step. We’re getting pretty good at this.

Moving spatial and immersive audiovisual material into 3D computer models is a multi-faceted challenge happening now. One facet is how Hollywood and cinema have different approaches and needs than Silicon Valley and games.

Adding semantic understanding about the spatial and immersive audiovisual material in 3D computer models is perhaps today’s greatest challenge in this arena. The solution lies in crowdsourcing, big data, and deep learning.

Expect surprises some time soon.

Thanks to Jiayi Wang for the module sketches and to Greg Downing for clarifying the differences between “no-parallax points” (technical term is “entrance pupil”) and “nodal points”.




Michael Naimark has worked in immersive and interactive media for over four decades.