VR / AR Fundamentals — 1) Audiovisual Resolution and Fidelity
Fooling a single static eye and a single static ear into believing that a framed representation is real.
[March 2019: This article, and the entire VR/AR Series, is now available bilingually in Chinese and English courtesy of NYU Shanghai.]
Welcome to the second of six weekly posts in sync with my “VR / AR Fundamentals” class, running right now at NYU Shanghai, and thank you for the encouragement after the Prologue (“Science with Attitude”) last week.
Happy to hear that many of you got the “Fog of VR” reference (and its reference). Most of my students didn’t.
Most importantly, I trust that you all got the connection between understanding VR/AR fundamentals and making better VR/AR businesses, products, and art.
So OK, let’s get started.
We’ll be spending the entire session today on what might seem trivial, especially compared to the big hot issues in VR/AR today like 3D, 360, interactivity, content, and social VR. But stop and consider: if I showed you a peephole in a wall and told you that there’s either a real apple in a little framed box behind it, or an image of an apple in a little frame behind it, and asked you to look through the peephole and tell me which, could you? Or put another way, could I make some kind of image that would convince you it’s not an image but real?
My class, polled, was split on this.
Now, what if behind the peephole were a person’s face, perhaps a meter away, or an image of a person’s face? Could you guess correctly which? At least two additional factors come into play: human faces have more detail and nuance than apples, and human faces move. Most of my class thought no, they couldn’t be fooled.
The science behind what it takes to fool one eye and one ear under these circumstances is mostly known. That’s one reason we’re starting here. Another reason is that these factors are the basis for the 99.9% of the audiovisual industry that’s not VR or AR. When you go to buy a camera, or a television, or a good sound system, resolution and fidelity are the major concerns. This session is a cursory overview.
Pixels & Color
We most often associate image resolution with pixel count. The numbers are simple. For example, American televisions went from 640 x 480 (SD) to 1,920 x 1,080 (HD) to 3,840 x 2,160 (4K) and are now heading toward 7,680 x 4,320 (8K). Before pixels, image resolution was associated with film gauge, ranging from “Super 8”mm for home movies to 35mm for theater films and beyond, most notably to “70mm 15 perf” IMAX. Though direct comparisons are problematic (it’s an eternal debate in the motion picture industry), 4K was designed as the digital proxy for 35mm film.
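For a feel of what those numbers mean, here is a minimal Python sketch that simply multiplies them out. The format labels and the helper name pixel_count are mine, for illustration only; the resolutions are the ones listed above.

```python
# Pixel dimensions of the television formats mentioned in the text.
FORMATS = {
    "SD": (640, 480),
    "HD": (1920, 1080),
    "4K": (3840, 2160),
    "8K": (7680, 4320),
}

def pixel_count(name):
    """Total number of pixels in one frame of the named format."""
    width, height = FORMATS[name]
    return width * height

if __name__ == "__main__":
    for name in FORMATS:
        ratio = pixel_count(name) / pixel_count("HD")
        print(f"{name}: {pixel_count(name):,} pixels ({ratio:.2f}x HD)")
```

Each step from HD to 4K to 8K doubles both width and height, so the pixel count quadruples each time.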
It’s even more complicated. The nature of the film grain or the image sensor, the shape and distribution of the grains or pixels, characteristics of “noise,” and how the image moves through the pipeline all affect how the image appears. Professional cinematographer Steve Yedlin, whose recent credits include Star Wars: The Last Jedi, has conducted the most comprehensive research in this area, carefully comparing highest-end video and film cameras, blowing images up to extreme close-up, and showing us the results. A Clear Look at the Issue of Resolution, published last year by the American Society of Cinematographers, includes his demos.
Color fidelity is also both simple and complex. For example, you’ve probably seen a vast array of mind-bending color illusions. The simplest aspect of color fidelity is that a large part of the visible spectrum can be represented by adding different ratios of the primary colors red, green, and blue. The color space of human perception is often depicted as a graph showing its full extent, with an inner triangle showing the range of colors that can be reproduced from a specific set of primary colors.
Note that the older sRGB standard covers a smaller range of colors than Rec. 2020, developed for ultra-high-definition television such as 4K and 8K. Also note how much of the color that we actually see is missing. Personally, I miss vibrant purples, and every time I see this magical color in a flower or rich fabric, I remind myself that I’ve never seen it in any image, be it photograph, movie, or computer display, ever.
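The triangle idea can be sketched in a few lines of Python. The primaries below are the CIE 1931 xy chromaticities published for sRGB and Rec. 2020; the function names and the sample green are my own, chosen for illustration. Keep in mind that area on the xy diagram is only a rough proxy, since the diagram is not perceptually uniform.

```python
# CIE 1931 xy chromaticities of the red, green, and blue primaries.
SRGB = [(0.640, 0.330), (0.300, 0.600), (0.150, 0.060)]
REC2020 = [(0.708, 0.292), (0.170, 0.797), (0.131, 0.046)]

def triangle_area(tri):
    """Area of the gamut triangle on the xy chromaticity diagram."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)) / 2

def in_gamut(point, tri):
    """True if an xy chromaticity lies inside (or on) the triangle of primaries."""
    def cross(a, b, p):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    a, b, c = tri
    signs = [cross(a, b, point), cross(b, c, point), cross(c, a, point)]
    # Inside means the point is on the same side of all three edges.
    return all(s >= 0 for s in signs) or all(s <= 0 for s in signs)

if __name__ == "__main__":
    saturated_green = (0.20, 0.70)  # hypothetical sample color
    print("sRGB area:    ", round(triangle_area(SRGB), 4))
    print("Rec.2020 area:", round(triangle_area(REC2020), 4))
    print("green in sRGB?    ", in_gamut(saturated_green, SRGB))
    print("green in Rec.2020?", in_gamut(saturated_green, REC2020))
```

Under these assumptions the Rec. 2020 triangle is a bit less than twice the area of the sRGB one, and the sample green falls inside Rec. 2020 but outside sRGB.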
Dynamic Range & Brightness
Dynamic range, the range from the darkest blacks to the whitest whites, is a bit more straightforward, as long as you understand what an f-stop (or “stop”) is. An f-stop is a unit of measurement of the amount of light hitting a target, such as a camera or your eye. Each f-stop corresponds to a doubling or halving of the amount of light. Though each adjacent stop may look only a tad lighter or darker, it’s actually twice or half the amount of light. (Why the f-stops on your camera correspond to weird numbers like f/1, f/1.4, f/2, f/2.8, f/4, f/5.6, f/8, f/11, f/16, f/22, etc., is beyond the scope of this session. OK, it’s simple math: they’re the powers of the square root of 2, rounded.)
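For the curious, here is a minimal Python sketch of that math. The light passing through an aperture scales with its area, which goes as one over the f-number squared, so multiplying the f-number by the square root of 2 halves the light. The function names are mine, for illustration.

```python
import math

def f_number(stop_index):
    """Exact f-number that many full stops down from f/1."""
    return math.sqrt(2) ** stop_index

def light_ratio(stops):
    """How many times more (or less) light a difference of that many stops is."""
    return 2 ** stops

if __name__ == "__main__":
    # Camera markings are these values rounded (5.66 becomes 5.6, 11.31 becomes 11, ...).
    print([round(f_number(i), 2) for i in range(10)])
    print(light_ratio(1), light_ratio(10))
```

The exact values drift slightly from the engraved ones (f/5.6, f/11, f/22) only because the markings are rounded by convention.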
Given that sensors in cameras have a narrower dynamic range than our eyes, in classical photography the image is compromised. For example, one must choose between exposing for wispy clouds or for shadow detail. Professional photographers and cinematographers know how to “trick” by artfully using “key” light and “fill” light to, for example, boost shadows.
Technology continues to produce better and better sensors, with, among many other improvements, better dynamic range. But there’s a simple technique, known almost since the birth of photography, for increasing dynamic range, though it comes at a cost.
The same image is recorded at multiple exposures.
Then magically combined using an increasing variety of clever methods.
The cost, of course, is that multiple exposures take some amount of time, during which the camera shouldn’t move. (It can also be computationally corrected, but sometimes with other costs.) Today, “HDR” is a popular setting in DSLRs and smartphone cameras, and the multiple exposures can be sufficiently fast.
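One of the simplest “clever methods” can be sketched as follows. This is a toy, not what any particular camera does: it assumes a perfectly linear sensor, pixel values normalized to 0 to 1, known exposure times, and perfectly aligned frames. The function name and the “hat” weighting are my choices for illustration.

```python
def merge_exposures(exposures):
    """Merge bracketed exposures into one estimate of scene brightness.

    exposures: list of (pixels, exposure_time) pairs, where pixels is a list
    of floats in [0, 1] and every list has the same length.
    """
    n = len(exposures[0][0])
    merged = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for pixels, t in exposures:
            v = pixels[i]
            # Hat weight: trust mid-tones, distrust values near black or clipped white.
            w = 1.0 - abs(2.0 * v - 1.0)
            num += w * (v / t)  # v / t estimates scene brightness from this exposure
            den += w
        if den > 0:
            merged.append(num / den)
        else:
            # Every exposure is clipped at this pixel; fall back to the shortest one.
            pixels, t = min(exposures, key=lambda e: e[1])
            merged.append(pixels[i] / t)
    return merged

if __name__ == "__main__":
    long_exposure = ([0.05, 0.5, 1.0], 1.0)          # highlight clipped to white
    short_exposure = ([0.00625, 0.0625, 0.5], 0.125)  # three stops darker
    print(merge_exposures([long_exposure, short_exposure]))
```

In this example the third pixel is blown out in the long exposure, so its value comes entirely from the short one, which is exactly the point of bracketing.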
Now let’s talk about brightness.
How dark is a movie theater? You’re watching a movie in a state-of-the-art theater and there’s an image of a bright outdoor scene. But remember — you’re sitting in a dark room. So, how dark is a movie theater?
I know. I once measured a bright outdoor scene in the best movie theater in New York at the time, during a matinee on a sunny afternoon, then went outside and took another measurement. Any guesses?
Ten. Ten f-stops. That’s 2 to the 10th power, or roughly a 1,000-fold (1,024, to be exact) difference in the actual amount of light hitting my eyes in the theater versus outdoors. Remember, you’re in a dark room and your pupils are wide open, and outdoors on a bright day, your pupils become tiny dots.
Non-projection displays, like your laptop and VR displays, are admittedly brighter than movie theater projection, but still nowhere near as bright as many bright conditions in the real world.
Photo-Realism, Abstraction, & Cartoonishness
Flashpoint alert: the topic of what looks real and what looks “cartoonish,” particularly when the intention is realness, has long been emotionally heated and sometimes with economic consequences. It divides Hollywood from Silicon Valley, real-world storytellers from fantasy world storytellers, and may have elements of both age and gender biases.
Let’s start with how to make a 3D computer model versus how to take a photograph. Computer models must be built from the ground up. Like canvas, they begin blank. Cameras, by default, capture the real world as photographs, which is, of course, where the phrase “photo-realism” comes from.
The computer modeling industry has strived for photo-realism since its birth. Siggraph, the annual computer graphics conference, contains a lively collection of technical papers down to minutiae. From last year’s juried papers: liquid-hair interactions, self-illuminating explosions, “bounce maps,” tentacle simulation, and fur reflectance (cats of course). Every year it gets better.
There’s a difference among a photographed human, a ground-up created human where the intention is not photo-realistic (like De Kooning paintings or everyone in the Pixar universe), and a ground-up created human where the intention is photo-realistic. Computer graphics people, for example in the games industry, are prone to say “it looks really real” while camera-based movie people say “no, it looks like a cartoon.” That’s the flash point. (I think it’s also where the phrase “really real” comes from.)
A well-known, though not entirely well-understood, element of this flash point is the “uncanny valley.” As ground-up created images approach photo-realism, they look less credible and even disturbing, particularly for humans and human faces.
So then, how do we deal with this, from three months ago?
This is a still from a video “here in virtual reality” that Mark Zuckerberg and Rachel Franklin streamed live on Facebook to promote the annual Facebook / Oculus VR developer’s conference.
The press was not kind, and shortly afterwards, Zuckerberg apologized. Ironically, the video was also to announce Facebook’s partnerships with NetHope and American Red Cross, to help rebuild areas hit by Hurricane Maria.
Vanity Fair called the incident a completely avoidable public-relations disaster. Incidents like these cause “embryonic damage” that can stunt growth, possibly with long-term effects. Think Google Glass.
And special to VR / AR
Accommodation is the technical term for focus in our eyes, when the muscles around them squeeze the miraculously deformable lenses to bring near, mid, or far objects into clarity, like focussing the lens on a camera. I asked my students to hold a finger up in front of one eye, with something visible farther behind, like a window. Close the other eye, then focus on the finger, then on the window, then on the finger again. No further explanation is needed. Most people feel what’s going on.
When we sit in front of a movie screen 20 feet away, our eyes accommodate to 20 feet, regardless of whether the projected image is of a landscape, storefront, or close-up. The same is true with our laptop one foot away: our eyes accommodate to one foot. When we use VR or AR headsets, simple optics change the accommodation but only to another fixed distance. The displays inside may be one inch from our eyes, but these simple optics refocus them to appear at perhaps ten feet, something considered comfortable for most imagery.
As mentioned last week, Magic Leap has demonstrated a new, patented method of dynamically changing accommodation in their AR headset. Reports are that it’s amazing. A tiny AR elephant will appear to our eyes focussed just in front of our actual hands, seen through the AR headset.
One might question how important variable accommodation is, since VR and AR today work pretty well with fixed accommodation. Microsoft is working on a more advanced version (“per pixel focus control”), and next week we’ll discuss interactions between accommodation and other factors such as convergence. We’ll know more when Magic Leap releases developer’s units, announced to be early this year.
Oh, in a conversation with a seasoned, well-respected VR/AR heavyweight about such things, he smirked and said “Wanna know a secret?” Then he leaned in close and said “none of this matters to anyone over 50.”
If there’s one takeaway from today’s session, it might be to thoroughly understand what orthoscopy means, in the context of VR and AR.
Simply put, an image is orthoscopically correct when the viewing field of view equals the captured or rendered field of view. Literally, this means “not too big” and “not too small” but properly scaled. Think back to the peephole.
Orthoscopy matters so much for VR and AR because the illusion depends on it. If the VR headset lenses have 100 degree FOVs, they need to be fed with 100 degree FOV images. Similarly, if you look left 90 degrees, the image needs to pan left 90 degrees. If this isn’t immediately obvious for VR, consider how critical it is for AR, where the imagery must directly match the real world. (Wanna see something relevant and cool? Check out this one minute silent video.)
Equally essential to appreciate, orthoscopy almost never exists in our everyday image viewing outside VR and AR. For one thing, it requires that we view images at fixed and proper distances, which we never do or care about.
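To make “fixed and proper distances” concrete, here is a minimal Python sketch for a flat image viewed head-on. It assumes a simple pinhole-style geometry and horizontal FOV only; the function names and the one-meter print are hypothetical.

```python
import math

def displayed_fov_deg(image_width, viewing_distance):
    """Horizontal angle a flat image subtends at the eye (same units for both inputs)."""
    return math.degrees(2 * math.atan((image_width / 2) / viewing_distance))

def orthoscopic_distance(image_width, captured_fov_deg):
    """Viewing distance at which the image subtends exactly its captured FOV."""
    return (image_width / 2) / math.tan(math.radians(captured_fov_deg) / 2)

if __name__ == "__main__":
    # A 1 m wide print of a photo shot with a 90 degree horizontal FOV.
    d = orthoscopic_distance(1.0, 90.0)
    print(f"stand {d:.2f} m away; the print then fills {displayed_fov_deg(1.0, d):.0f} degrees")
```

Stand farther back and the scene is “too small”; lean in closer and it is “too big.” Only at that one distance is the print orthoscopically correct.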
Picasso replied “She’s beautiful but so tiny.”
That’s the punchline to a(nother) Picasso legend. In the height of his early fame, a gentleman approached Picasso on the streets of Paris and accused him of distorting reality. Seeming to change the subject, Picasso asked the gentleman if he had a wife or girlfriend. “Oui!” the gentleman replied and took out a small photo of the woman from his wallet.
This is a good moment to point out that understanding VR and AR fundamentals and respecting them are two different things. A cut in a movie is a “violation” of real world fundamentals, yet montage is one of the very bases of cinema as an art form.
Once, during the first VR wave, the brilliant and colorful Marvin Minsky told me he’d just seen “the coolest thing ever” in VR. The VR code was hacked such that when you looked up and down, everything was orthoscopically fine — look up 90 degrees, the graphic model tilted up 90 degrees accordingly — but when you looked left and right, the code “2X”ed the response: you look left 90 degrees and the graphic model panned left 180 degrees. “It’s like rubber-necking!” he exclaimed.
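I never saw the actual code behind that hack, but the idea can be sketched in a couple of lines of Python. The function name, the parameter names, and the sign conventions are all my own assumptions.

```python
def view_rotation(head_yaw_deg, head_pitch_deg, yaw_gain=1.0, pitch_gain=1.0):
    """Map tracked head rotation to rendered view rotation.

    A gain of 1.0 is orthoscopic; any other gain deliberately breaks orthoscopy.
    """
    return (head_yaw_deg * yaw_gain, head_pitch_deg * pitch_gain)

if __name__ == "__main__":
    print(view_rotation(90, 90))                # orthoscopic
    print(view_rotation(90, 90, yaw_gain=2.0))  # the "rubber-necking" hack
```

With yaw_gain set to 2.0, looking left 90 degrees pans the model 180 degrees while looking up 90 degrees still tilts it 90 degrees.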
OK, so now we understand what it takes to make a static, framed representation, say of an apple, appear indistinguishable from a real apple from a single fixed viewpoint. Now let’s level up to dynamic elements. (News flash: this includes sound, which is dynamic by nature.)
Motion & Frame Rate
Temporal, or motion, resolution, like spatial resolution above, is well-known in terms of human perception, albeit with interesting ambiguities. Movies, generally shot at 24 frames per second, can be shot at speeds as high as 120 frames per second with noticeable enhancement to realness. In 2016, award-winning director Ang Lee’s “Billy Lynn’s Long Halftime Walk” was filmed and projected at 120 frames per second, plus it was 4K, plus it was stereoscopic (next session). It was met with mixed reviews: “hyper-reality that is at once galvanizing and disquieting,” “so overpowering, the ‘you are there-ness of it all’ so pronounced,” and “it doesn’t allow for artifice.” What? I thought the goal was realness, at least so would claim the VR and AR communities.
In addition to “Billy Lynn” other high frame rate formats exist.
You may have seen The Hobbit in “HFR.” It elicited similar excitement and unease as “Billy Lynn.” Part of the unease, many claim, is called the “Soap Opera Effect,” a theory that the unease is due to our association of high frame rate with (60 fps) video over (24 fps) film. More on this in a moment.
Both Showscan and Magi were invented by Douglas Trumbull, the world’s most prominent special effects wizard (2001, Close Encounters, Blade Runner, Tree of Life). And inventor in the areas of camera motion control and simulator ride platforms. And director of Silent Running and Brainstorm (which was to switch from 24 fps to 60 fps for the “brainstorm” sequences, and was derailed by the tragic death of its star Natalie Wood). Trumbull has more experience with what it takes to make a large screen look like a real-world window than anyone, and his latest Magi system incorporates a unique approach of alternating frames to each eye for an apparently higher frame rate.
But while filmmakers in Hollywood are building and using high frame rate cameras, computer scientists in Silicon Valley and around the world are making ways to interpolate frames.
So what’s the deal? Is it necessary to literally “crank up” the speed of motion picture capture, or can all those in-between frames be made computationally? As you can guess, it depends on who you talk with.
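The crudest possible way to make in-between frames computationally is a straight cross-fade between neighbors, sketched below in Python. To be clear, this is not what good interpolators do; they estimate motion and move pixels along it, which is far harder. Frames here are just flat lists of brightness values, and the function names are mine.

```python
def blend_frames(frame_a, frame_b, t):
    """Linear cross-fade: t = 0 gives frame_a, t = 1 gives frame_b."""
    return [(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)]

def upsample(frames, factor):
    """Insert factor - 1 blended frames between each pair of original frames."""
    out = []
    for a, b in zip(frames, frames[1:]):
        for k in range(factor):
            out.append(blend_frames(a, b, k / factor))
    out.append(list(frames[-1]))
    return out

if __name__ == "__main__":
    # Going from 24 fps to 120 fps means a factor of 5.
    print(upsample([[0.0], [1.0]], 5))
```

A cross-fade like this produces ghosting on anything that moves, which is one reason the “it depends who you talk with” debate exists at all.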
There’s an interesting side note. As mentioned, many folks dislike high frame rate because of its association with low production value television, “soap operas.” Many televisions with motion interpolation have it engaged as the default setting. In 2015, a change.org petition drive called “Please STOP making “smooth motion” the DEFAULT setting on all HDTVs,” declared “it’s actually very distracting watching a classic like ‘Five Easy Pieces’ and having it look like a sitcom shot on video.” Over 12,000 people have signed so far.
As frame rate relates to VR and AR, there’s an additional factor: head motion. When we move our heads during a VR or AR experience, the headset displays must update fast enough to keep the imagery orthoscopically correct. One well-known problem is latency. Another one is what to do when the head motion is faster than the frame rate. The word for this is judder, and there are clever ways such as “asynchronous timewarp (ATW)” to interpolate frames to keep up. Incidentally I think it’s possible to use ATW to make VR imagery look like “Five Easy Pieces” and it would be a very weird effect.
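Some back-of-the-envelope numbers help here. The Python sketch below uses made-up but plausible values (the head speed, latency, and pixels per degree are assumptions, not measurements), and the last function is only a rotation-only caricature of the reprojection idea behind timewarp, not how any headset actually implements it.

```python
def frame_time_ms(fps):
    """How long each frame stays on screen."""
    return 1000.0 / fps

def angular_error_deg(head_speed_deg_per_s, latency_ms):
    """How far the head has turned by the time a delayed image is shown."""
    return head_speed_deg_per_s * latency_ms / 1000.0

def reprojection_shift_px(yaw_at_render_deg, yaw_at_display_deg, pixels_per_degree):
    """Toy timewarp: shift the already-rendered image to match the latest head yaw."""
    return (yaw_at_display_deg - yaw_at_render_deg) * pixels_per_degree

if __name__ == "__main__":
    for fps in (24, 60, 90, 120):
        print(f"{fps} fps: {frame_time_ms(fps):.1f} ms per frame")
    print(angular_error_deg(100, 20), "degrees off at 100 deg/s with 20 ms latency")
    print(reprojection_shift_px(30.0, 32.0, 10.0), "pixel shift to catch up")
```

Even a modest head turn opens up a couple of degrees of mismatch within a single slow frame, which is why headsets correct at the last moment rather than waiting for the next full render.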
Anyone who’s experienced excellent quality, artfully produced, true spatial audio in VR appreciates that it really is half the experience. It’s also easy to put sound in the back seat. I tell my partners that if they notice me slacking in this direction, please help.
That said, this section is relatively short since this week’s session is about “single source” sound piped into one static ear. Next week we’ll address spatial sound, which is richer and more complicated.
Single source sound to a single static ear is not complicated. Single source sound is a one-dimensional signal, like what travels down a wire from a microphone to a speaker or earphone. Both input and output technologies are pretty good, and almost everyone has had the experience of mistaking a voice over a good speaker for a live person in the room.
We’ll briefly address the basics. I also want to emphasize the analogies between sound and image, as in the outline at the top.
The normal range of human hearing is 20–20,000 Hz. Decent speakers (as well as decent microphones) can cover most of this range.
Dynamic Range & Loudness
Sound levels, like image dynamic range and brightness addressed above, may be compressed or attenuated to fit within the range of the technology.
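Here is a minimal Python sketch of what “compressed” can mean, in the spirit of a basic audio compressor working on levels in decibels. The threshold and ratio are arbitrary illustrative values, and the function names are mine.

```python
import math

def amplitude_to_db(amplitude):
    """Level in decibels relative to full scale (amplitude 1.0 is 0 dB)."""
    return 20 * math.log10(amplitude)

def compress_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Leave quiet levels alone; above the threshold, shrink the excess by the ratio."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

if __name__ == "__main__":
    for level in (-40.0, -20.0, -10.0, 0.0):
        print(f"{level:6.1f} dB in -> {compress_db(level):6.1f} dB out")
```

With these settings, the loudest 20 dB of the signal gets squeezed into 5 dB, much as a bright outdoor scene gets squeezed to fit a movie screen.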
Audio Realism, Abstraction, & Cartoonishness
Like images, sound can be synthesized or recorded. Synthesized sound rarely sounds like real-world sound. Up until recently, it’s been impossible to synthesize a human voice that sounds indistinguishable from a real one, and most synthesized voices still have their own “uncanny valley” quality. Traditionally, synthesized sound has been used for abstraction rather than real-world representation, and it sounds abstract, by design.
Unlike images though, sound can be easily recorded, particularly as single sources, and seamlessly mixed into larger audiovisual environments such as movies or video games. The motion picture industry has long relied on foley artists, who use tabletop props and their own voice to make sounds of rustling leaves, creaking doors, horse trots, and police sirens.
Animated features, including high-end ones like Pixar movies, always use real people, often celebrities, for their characters’ voices, and any “cartoonishness” is stylistically created by the performer.
Even the video game industry, which engages armies of modelers to create its highest-end “AAA” 3D graphics environments, uses recorded rather than synthesized sound. Game character voices are recorded human voices (often with the same hired voice repeating the same line with a zillion different variations), and sound effects come from vast libraries of thousands of pre-recorded sounds.
How images and sounds differ in synthesis and capture is astonishing to ponder. Imagine vast libraries of millions of real-world 3D visual components, like sound libraries, that can all be seamlessly mixed and matched into singular integrated 3D worlds!
But we’re getting ahead of ourselves.
See you next week.
1/26/18: VR / AR Fundamentals — Prologue
2/2/18: VR / AR Fundamentals — 1) Audiovisual Resolution and Fidelity
2/9/18: VR / AR Fundamentals — 2) Audiovisual Spatiality & Immersion
2/16/18: VR / AR Fundamentals — 3) Other Senses (Haptic, Smell, Taste, Mind)
[2/23/18: Chinese New Year school holiday]
3/2/18: VR / AR Fundamentals — 4) Input & Interactivity
3/9/18: VR / AR Fundamentals — 5) Live & Social (+ Epilogue)
Thanks to Jiayi Wang for the module sketches and to Adrian Hodge and the NYU Shanghai Research and Instructional Technology Services.