VR / AR Fundamentals — 4) Input & Interactivity

Michael Naimark
14 min read · Mar 3, 2018


Using our effectors and intentions as inputs and how they shape interactive experiences.

[March 2019: This article, and the entire VR/AR Series, is now available bilingually in Chinese and English courtesy of NYU Shanghai.]

Welcome to #5 of 6 weekly posts in sync with my “VR / AR Fundamentals” class at NYU Shanghai, where I’m currently visiting faculty.

While the class is putting up with my weekly presentations, they’re also busy experiencing several dozen VR titles we’ve accumulated so far on all five major VR platforms (Sony PS4, Samsung GearVR, Google Daydream, FB Oculus Rift, and HTC Vive) and curating an additional several dozen more VR titles to add to our library. Then, production.

OK, here we go.


I/O — Sensors & Effectors

So far, we’ve gained a cursory understanding of what it takes to fool all of our senses — sight, sound, touch, smell, taste, and even mind-as-input.

We used the “Mixing Board Analogy:” if we know what all of the sliders are for full sensory immersion, and if they’re all turned up to “10,” then, by definition, the representation will be indistinguishable from its subject.

But we’d be ghosts: able to take it all in but have no effect on anything around us.

Human sensors sense signals from a system output. Human effectors are sensed by a system input.

This session we’ll cover the spectrum of input mechanisms and interactive systems. Now, it may seem a bit odd to have held off on input and interactivity until now, especially since so many VR and AR experiences are interactive by nature (even if only by turning one’s head). The uniqueness of this approach is to walk you through VR and AR as a whole system. It’s that simple.

It’s also important to state up front that the system may not only receive explicit input from our effectors but may infer our intentions. This is increasingly important as AI enters the arena.


The big news with input & interactivity is that sensors, cameras, and microphones are quickly becoming smart sensors, smart cameras, and smart microphones. The technologies for computation, and wireless connectivity for even more computation, are becoming small, cheap, and local. This is a game changer.


I’m separating sensors here into everything other than cameras and microphones, which we’ll address in a moment.

SparkFun Sensor Kit $129.99

The range of things that can be sensed keeps getting broader, and the sensors cheaper. The DIY hobbyist catalog SparkFun sells kits like the one above, as well as cheap sensors for muscles ($37.95), air quality ($19.95), motion ($4.95), pulse ($24.95), ambient light ($3.95), liquid level ($24.95), carbon monoxide ($7.25), tilt ($1.95), vibration ($2.95), dust ($11.95), and alcohol gas ($4.95). These raw sensors, along with the full range of tiny mechanical buttons and switches, are integrated into our everyday home electronics.

“9DOF” IMU senses gyroscopic, acceleration, and magnetic forces (Xsens).

Another category of sensors measures position and movement, and generally falls under the category of Inertial Measurement Units. IMUs combine three sensors (a gyroscope, an accelerometer, and a magnetometer), each measuring in three dimensions (hence “9DOF”). The actual sensors are tiny, like pinhead size, and cheap, like a few dollars, and are integrated into virtually every smartphone today. Integrating the three sensors does some things very well and other things not so well. They are excellent at determining angular changes, like when you rotate your head with a VR or AR headset, and at measuring acceleration, for example for footstep counting. IMUs are not very good at absolute position sensing, like “left six inches” or “down an inch,” and easily drift when sensing relative position.
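Why the drift? A rough way to see it is to pretend the device is sitting perfectly still while each sensor reports a tiny constant error. The sketch below is only an illustration under assumed numbers (the bias values, the 100 Hz sample rate, and the function names are all hypothetical, and real IMU fusion is far more sophisticated). It shows that an angle, which needs one integration of the gyroscope, picks up error slowly, while a position, which needs two integrations of the accelerometer, picks up error that grows with the square of time.

```python
DT = 0.01          # assumed sample period in seconds (100 Hz)
ACCEL_BIAS = 0.05  # assumed constant accelerometer error, in m/s^2
GYRO_BIAS = 0.01   # assumed constant gyroscope error, in degrees/s


def integrate_angle(gyro_samples, dt):
    """Integrate angular rate (deg/s) once to get an angle (deg)."""
    angle = 0.0
    for rate in gyro_samples:
        angle += rate * dt
    return angle


def integrate_position(accel_samples, dt):
    """Integrate acceleration (m/s^2) twice to get a position (m)."""
    velocity = 0.0
    position = 0.0
    for accel in accel_samples:
        velocity += accel * dt      # first integration: velocity
        position += velocity * dt   # second integration: position
    return position


def drift_after(seconds):
    """Return (position_error_m, angle_error_deg) for a motionless device."""
    n = int(round(seconds / DT))
    position_error = integrate_position([ACCEL_BIAS] * n, DT)
    angle_error = integrate_angle([GYRO_BIAS] * n, DT)
    return position_error, angle_error


if __name__ == "__main__":
    for t in (10, 20):
        pos, ang = drift_after(t)
        print(f"after {t:2d} s: position off by {pos:.2f} m, angle off by {ang:.2f} deg")
```

With these made-up biases, ten seconds of standing still leaves the angle off by about a tenth of a degree but the position off by about two and a half meters, and doubling the time roughly quadruples the position error.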

Heartmath Inner Balance measures heart rate variability (HRV). | Empatica E4 wristband measures HRV, acceleration, temperature, and skin resistance.

Things get more interesting with computation, for example, in the realtime biometric arena. Heartmath makes a device which clips on the ear to measure heart rate variability. HRV is more revealing than simply pulse, and with a little processing, emotional states such as frustration and appreciation can be detected and potentially controlled. Empatica makes a wristband which additionally senses acceleration (for example, if someone falls), skin resistance (stress), and temperature (exertion, fever). It grew out of research to help people on the autism spectrum. Empatica’s newest wristband, Embrace, can detect convulsive seizures and send realtime alerts, and was approved last month by the FDA as a medical device.
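To see why HRV says more than pulse, here is a minimal sketch using RMSSD, one common HRV statistic. It is not a description of how Heartmath or Empatica process their data, and the beat intervals below are invented for illustration. Two recordings with the identical average heart rate can have completely different variability.

```python
import math


def mean_heart_rate(ibi_ms):
    """Average heart rate in beats per minute from inter-beat intervals (ms)."""
    return 60000.0 / (sum(ibi_ms) / len(ibi_ms))


def rmssd(ibi_ms):
    """Root mean square of successive differences between beats (ms)."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Invented example data: milliseconds between consecutive heartbeats.
steady = [800, 800, 800, 800, 800, 800]
varied = [700, 900, 700, 900, 700, 900]

if __name__ == "__main__":
    for name, beats in (("steady", steady), ("varied", varied)):
        print(name, round(mean_heart_rate(beats)), "bpm,", round(rmssd(beats)), "ms RMSSD")
```

Both recordings average 75 beats per minute, so a pulse sensor alone could not tell them apart. The variability measure separates them immediately.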

How would you measure breath?

Of all of the biometrics that may be measured, breath is the most sensitive and nuanced that is semi-consciously controlled. Think meditation. Consider how useful breath sensing could be for VR and AR if ever-so-slight changes can be sensed in real time.

I asked the class: how would you measure breath? “A mask.” But wouldn’t it be intrusive? “A sensor around the chest.” Can’t it be fooled? “An airtight air volume measuring device?” Gnarly. “Microphone near the nose?” Could it discriminate inhale from exhale?

Vision & Cameras as Input

Body tracking, or motion capture (“mocap”), with computer vision and cameras falls into several distinct categories. The original popular method uses many “outside pointing in” camera/light source pairs and markers worn by the performers at strategic places. The cameras and light sources are infrared and invisible to the eye and to conventional cameras. The markers are little dots or balls made of retroreflective material which bounces virtually all of the light back in the direction of the light source. The good news is that, though markers look heavy, they weigh almost nothing and don’t impede performance. The bad news, of course, is the need for them at all.

Retroreflective markers (Vicon)

Newer systems, made to track many people simultaneously and in real time for location-based VR, use active LED markers rather than the passive retroreflective ones.

Active LED markers (Optitrack)

Motion capture can be wireless, without the need for outside-pointing-in camera/light source pairs, but requires wearing electronics, admittedly getting smaller and smaller but still larger and heavier than markers.

Wireless motion capture (Xsens)

At the low cost end, tools designed to be accessible for the artist community have been hacked together using consumer equipment like the Microsoft Kinect and conventional digital cameras.

Mocap for artists (Depthkit)

At the high cost end, earlier this year Intel unveiled a 25,000 square foot studio in Los Angeles with a 10,000 square foot dome motion capture space. Little has been revealed about the exact technology, maybe cameras only, except that it uses 5 miles of fiber cables and an “army of Intel servers” running at 6 terabytes per minute.

Intel Studios’ 10,000 square foot motion capture dome (Variety)

If you were Intel, isn’t this what you would do?

Hand tracking is a little more manageable. The first popular device was made by Leap Motion and is a small unit with two infrared cameras and three infrared LEDs. That’s the simple part. Their software, which includes undisclosed “complex math,” can detect hand positions even when parts are occluded. For VR and AR, this technology can replace wearing gloves or using controllers.

Leap Motion

Eyes as input has several applications. One is computational efficiency. If the system knows exactly where the eye is looking, it can render only the inner, foveal, region with high resolution imagery, a process called foveated rendering. Fove makes such an eye-tracking VR headset, and here’s an impressive demo of how it can save computation and power consumption.
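The core idea of foveated rendering can be sketched in a few lines: measure how far each part of the image is from the gaze direction, in degrees, and pick a rendering resolution accordingly. This is only a toy illustration, not Fove’s method. The function names, the three-tier scheme, and the 5 and 20 degree thresholds are all assumptions made for the example.

```python
import math


def eccentricity_deg(gaze_dir, tile_dir):
    """Angle in degrees between the gaze direction and the direction to a tile.

    Both arguments are 3D vectors (x, y, z) and need not be normalized.
    """
    dot = sum(g * t for g, t in zip(gaze_dir, tile_dir))
    norm = math.sqrt(sum(g * g for g in gaze_dir)) * math.sqrt(sum(t * t for t in tile_dir))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def resolution_scale(ecc_deg, fovea_deg=5.0, mid_deg=20.0):
    """Pick a resolution multiplier for a tile (thresholds are hypothetical)."""
    if ecc_deg <= fovea_deg:
        return 1.0   # full resolution where the eye is looking
    if ecc_deg <= mid_deg:
        return 0.5   # half resolution just outside the fovea
    return 0.25      # quarter resolution in the periphery


if __name__ == "__main__":
    gaze = (0.0, 0.0, 1.0)  # looking straight ahead
    for tile in ((0.0, 0.0, 1.0), (0.2, 0.0, 1.0), (1.0, 0.0, 1.0)):
        ecc = eccentricity_deg(gaze, tile)
        print(f"tile {tile}: {ecc:.1f} deg from gaze, scale {resolution_scale(ecc)}")
```

Since most of a headset’s field of view falls outside the small foveal region, most tiles end up in the cheaper tiers, which is where the savings come from.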


In order to actually track gaze, both the position of the eye and the position of the head are required. Tobii makes a tracker that does both, but it is limited to looking at a screen.

How does (long range) eye contact work?

I asked the students. You’re on a busy sidewalk and 50 meters away you see an old friend you haven’t seen in years. Somehow, your eyes meet and you both recognize each other. How can that be? The numbers, pixels and angles and such, simply don’t add up. It’s even been claimed that our eyes emit something yet to be discovered.

Dave, my TA, crushed this one, or at least said something utterly convincing to us all. First, you see the long lost friend, but you’re not sure. What happens next is a micro-movement, something a little more visible than eye contact, perhaps raised eyebrows. The response may be another micro-movement, perhaps a smile. These may lead to larger movements like a hand wave, and so on. So what began as a low-probability hunch quickly became amplified and confirmed via a larger interactive system.

Sounds & Microphones as Input

First, and we couldn’t say this even a few years ago, voice control is everywhere. Voice-to-text is not quite solved, but almost (just ask someone who speaks with an accent or dialect), and it opens the doorway to the greater world of search.

Mars Translation Buds

The bigger news now is smart mics, or smart earbuds, with dozens of players entering the market. As mentioned at the beginning, it’s not the microphone, it’s the computational and connectivity hardware that can be squeezed into something small enough to fit in your ear. Applications range from realtime language translation (how cool is that?) to smart sound cancellation. For example, a young mother living in a noisy urban apartment can teach her earbuds to cancel everything except the sound of her baby, all while listening to music.

Mind as Input

Last week we learned that there’s virtually no hard, repeatable evidence that our minds can “read” on their own, even if most of us believe in ESP, extra-sensory perception. But what about the mind “writing,” that is, where our brain is an effector, like our body, hands, and eyes?

EEG and fMRI

We already know how to get signals out of the brain. We can either use an array of electrodes on the head’s surface to measure electroencephalography (EEG) signals or place the head in a functional magnetic resonance imaging (fMRI) scanner. EEGs may be wearable and lightweight but sense brain signals at the surface at very low resolution, while fMRIs are huge, expensive, and noisy but with x-ray level resolution in 3D and with motion.

Now there appears something new on the horizon.

Last year, Facebook announced work on a brain mouse and a goal of typing with your brain at a speed of 100 words per minute, non-invasively. The only clue given was “optical imaging.”

Shortly before the announcement, Mary Lou Jepsen, Facebook’s former Executive Director of Engineering, launched a startup called Openwater, whose goal is to “create a wearable to enable us to see the inner workings of the body and brain at high resolution,” much greater than current fMRIs. The technology she’s developing uses very tiny LCDs and detectors in very high quantities to reconstruct a holographic image of brain activity, all in the form-factor of a ski cap, or arm band, or chest band. Real holograms: Jepsen has a degree in Computational Holography from MIT.

She points to the 2011 research from Jack Gallant’s lab at UC Berkeley, which had student volunteers view hours and hours of movie clips inside an fMRI, then showed them new clips and compared the fMRI “fingerprints” with the previously viewed ones. She believes her technology is orders of magnitude higher resolution, and speculates about dream recording, memory injection, and ethical implications.

Presented clip above. Reconstructed clip from brain activity below. (Gallant Lab)


We’ll conclude with some system-level observations about interactivity, since its definitions and applications are not entirely consistent.

Not everyone means the same thing when using the “I” word. The most infuriating, to me, is hearing someone say “I like ‘Game of Thrones’ because it’s the most interactive show on TV.” I’ve heard such statements more than once, and suppose what they mean is more like “involving” or “empathetic.”

At the very least, interactivity has to involve a change to the system output based on a system input. Beyond that, anything is pretty much fair game: joysticks, mice, and game controllers; heartbeat, breath, and EEG; sunlight, snowfall, and glacial erosion. The system needs some kind of input, and the mechanism to react.


The phrase “User Interface” or UI has been around for a long time and is generally regarded as an academic discipline. The SIGCHI conference on “computer-human interaction” began in 1982 and regularly has 2,500 attendees.

“User Experience” or UX was popularized in the mid-1990s by UC San Diego Design Professor Don Norman, a longtime heavyweight in the SIGCHI community, to include affective factors as well as behavioral concerns. But by 2007, he noted that its meaning had become imprecise as a consequence of its widespread use.

“X” is the new “I”

Today, “UX” and “UX design” are fashionable and popular job descriptions, but it should be noted that not everyone, and not every company, means the same things using them.

Navigation versus Manipulation

One often unacknowledged distinction within interactivity is the difference between navigating around a database and manipulating, or changing, the database itself. This distinction is particularly germane around VR and AR. For example, cinematic VR allows the viewers to look around so some call it interactive. But the VR gamers counter that it’s not interactive because you can’t blow things up or even move a chair.

Here’s the breakdown as it relates to VR and AR.

Navigational interactivity allows moving around. Moving around can either be positional (left/right, in/out, up/down) or rotational (pan, tilt, roll). All 360 video allows rotational navigation. Positional navigation for 360 video requires some kind of volumetric video, plus 6DOF head tracking. Getting positional movement in VR video, even a little shoulder sway, is a white-hot topic at the moment (discussed in session #2). On the other side, all computer model-based VR such as games allow both positional and rotational interactivity. It’s simply a matter of moving the virtual camera.
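The positional versus rotational split maps directly onto the six degrees of freedom of a viewing pose. As a minimal sketch (the `Pose` class and `navigate` function are hypothetical names for illustration, not any engine’s API), a conventional 360 video player can honor only the three rotational values, while a model-based scene can honor all six:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    # Positional degrees of freedom (meters)
    x: float = 0.0   # left/right
    y: float = 0.0   # up/down
    z: float = 0.0   # in/out
    # Rotational degrees of freedom (degrees)
    pan: float = 0.0
    tilt: float = 0.0
    roll: float = 0.0


def navigate(view, head_delta, supports_position):
    """Apply a head movement to the current view.

    If the content cannot support positional navigation (for example,
    conventional 360 video), the positional part of the movement is dropped.
    """
    return Pose(
        x=view.x + (head_delta.x if supports_position else 0.0),
        y=view.y + (head_delta.y if supports_position else 0.0),
        z=view.z + (head_delta.z if supports_position else 0.0),
        pan=(view.pan + head_delta.pan) % 360.0,
        tilt=view.tilt + head_delta.tilt,
        roll=view.roll + head_delta.roll,
    )


if __name__ == "__main__":
    # The viewer leans 10 cm to the side while turning 30 degrees.
    movement = Pose(x=0.1, pan=30.0)
    print("360 video:  ", navigate(Pose(), movement, supports_position=False))
    print("model-based:", navigate(Pose(), movement, supports_position=True))
```

In the 360 video case the lean is simply lost, which is exactly the missing shoulder sway described above.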

All computer model-based VR allows manipulation, so long as it’s a “true” semantic model rather than a non-semantic point cloud. The system knows that these polygons are a chair and those polygons are a gun because they were built from the ground up and labelled, and the system can manipulate them accordingly.

There’s a noteworthy other kind of manipulation: branching. Rather than moving the chair, two different pre-recorded or pre-rendered scenes can give the user a choice of content beyond passive navigation. Branching-based “interactive movies” have been around since Expo ’67 and are often scoffed at because of their limited choice. But hold that thought. More in a sec.

I wrote about these distinctions here in 2016.


“Symmetry,” in the sense of media and interactive systems, can be thought of as being when the input channels and output channels are the same size. Person to person conversation, or perfect teleconferencing, or maybe perfect person-to-AI conversation, can be thought of as symmetrical media.

Symmetrical media on left. Asymmetrical media on right.

While we can’t expect most video games to be symmetrical, it has been argued that they are more symmetrical than point-and-choose television viewing, and that the more user control, the richer the experience.

The topic of symmetrical media has its roots in the work of a Croatian-Austrian philosopher and was a driving force for the birth of the personal computer. The philosopher, Ivan Illich, wrote a book in 1973 called “Tools for Conviviality” where he argued that the more symmetry between the means of production and the means of consumption, the healthier the society, and he called this “conviviality.” The ratio of televisions to television cameras, for example, went from hundreds-of-thousands-to-one at its birth to thousands-to-one with the arrival of prosumer video camcorders to one-to-one (at least) today with smartphones: television has become more convivial. Conviviality was a battle cry during the birth of the PC and arguably one of the underlying principles of DIY and maker culture today.

I also wrote about this here, though long ago.


I asked the class:

When do you think an interactive anything is not working?

The most common answers were when there’s time lag or latency, or when there’s bad design. Bad design may be equated with “dishonest” design, where the UI/UX implies that the system will deliver more than it actually can and verifiably fails.


The more that AI enters the arena of interactive systems, the more we’ll see inference over explicit user input. It doesn’t take mind reading for “predictive text” to finish our sentences.
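To make the point concrete, here is about the simplest possible form of inference from past input: count which word most often followed each word, then guess that one. This is a toy sketch with an invented training phrase and hypothetical function names. Real predictive text systems are far more elaborate, but the principle of guessing intent from prior behavior is the same.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words have followed it."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Invented "typing history" for illustration.
    history = "see you soon see you later see you soon"
    model = train_bigrams(history)
    print(predict_next(model, "you"))
```

No explicit command was ever given to favor one word over another. The system inferred it from what the user had done before.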

Control versus Illusion of Control

What if you, the user, did everything that I, the producer, expected? I give you an apparent choice to turn left or right but know you’d choose left, so I only produce the left option at the branching point. Would you know? Would you care?

Harry Houdini, the great American magician, thrived on “second guessing” his audience. He was allegedly so brazen that he’d claim onstage, in front of an audience, that he would transform his lovely assistant into a bag of sand. Drums rolled. Tension. Then, suddenly a shout of surprise from the back of the theater, turns out by a shill, and while the audience redirected their attention behind, the lovely assistant would jump out of Houdini’s arms and a stage hand would replace her with the bag of sand. Trumpets from the front and everyone turns back around. Ta-da!

The “world’s first interactive movie” was made for the Czech Pavilion at Expo ’67 in Montréal. Every seat had a red button and a green button, and the seats were all numbered. The screen in front was surrounded by numbered red/green lights, so every audience member could be assured that their vote was registered.

“Kino-Automat”, Raduz Cincera, Czech pavilion, Expo ’67 Montreal

The film, “Kino-Automat” by Raduz Cincera, begins with the protagonist in front of a burnt-down apartment building saying “It wasn’t my fault,” then launches into the film as a flashback (a clue). After each scene, a hostess would come onstage and ask the audience to vote for one of the next two choices. Instantly the projected film (yes, film) played the voted-on scene. There were ten scenes. How did he do it?

He wrote each option to end with the same next option. Protagonist meets girl. Ask her out? If yes, they agree to meet at a coffee shop at 5. If no, he mopes throughout the day and by chance, runs into her at the coffee shop at 5. And so on.
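The trick is easy to express as a graph: every vote offers two scenes, and both scenes lead to the same next vote. The sketch below is an illustration only. The number of votes (five), the scene names, and the function names are made up for the example and are not taken from the film. It shows that the audience can trace many distinct paths while the filmmaker shoots only two scenes per vote, and every path arrives at the same ending.

```python
def converging_story(num_votes):
    """Build a story graph where both options of each vote rejoin."""
    graph = {}
    for i in range(num_votes):
        next_node = f"vote{i + 1}" if i + 1 < num_votes else "ending"
        graph[f"vote{i}"] = [f"scene{i}A", f"scene{i}B"]
        graph[f"scene{i}A"] = [next_node]   # option A leads to the next vote
        graph[f"scene{i}B"] = [next_node]   # option B leads there too
    graph["ending"] = []
    return graph


def count_paths(graph, node):
    """Number of distinct routes from `node` to any terminal node."""
    if not graph[node]:
        return 1
    return sum(count_paths(graph, nxt) for nxt in graph[node])


def reachable_endings(graph, node):
    """Set of terminal nodes that can be reached from `node`."""
    if not graph[node]:
        return {node}
    endings = set()
    for nxt in graph[node]:
        endings |= reachable_endings(graph, nxt)
    return endings


def scenes_to_film_converging(num_votes):
    return 2 * num_votes               # two scenes per vote


def scenes_to_film_full_tree(num_votes):
    return 2 ** (num_votes + 1) - 2    # every branch gets its own scenes


if __name__ == "__main__":
    votes = 5  # hypothetical number of votes
    story = converging_story(votes)
    print("distinct paths through the story:", count_paths(story, "vote0"))
    print("endings reachable:", reachable_endings(story, "vote0"))
    print("scenes to film if paths rejoin:", scenes_to_film_converging(votes))
    print("scenes to film for a true tree:", scenes_to_film_full_tree(votes))
```

With five votes there are 32 possible paths but only 10 scenes to film, versus 62 for a genuinely branching tree, and only one ending either way.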

Kino-Automat was intended to be a black comedy and commentary on the futility of democracy. It was, after all, from Czechoslovakia in 1967.

At Expo ’67, visitors waited in long lines to get into the popular pavilions. The progenitor to Imax and a projection mapped theater were also there. The odds of someone seeing Kino-Automat more than once were about as slim as someone not turning around after the shout in Houdini’s audience.

Many years later, around 1996, Czech television offered to broadcast Kino-Automat by simulcasting on two channels, where the audience would choose by switching channels. Cincera said no, that it would be a disaster. But they pleaded with him and he finally gave in and they aired it. People called in and complained. “They felt cheated,” he said. “I was right. It was a complete disaster.”

Maybe context is everything.

Thanks to Jiayi Wang for the module sketches and Golan Levin for the Mixing Board sketch.



Michael Naimark

Michael Naimark has worked in immersive and interactive media for over four decades.