Magic Mirror: I and My Avatar - A Versatile Augmented Reality

Installation Controlled by Hand Gestures

Alexander K. Seewald

and Alexander Pfeiffer

Seewald Solutions, L

archenstraße 1, 4616 Weißkirchen a.d. Traun, Austria

Danube University Krems, Dr. Karl Dorrek Straße 30, 3500 Krems, Austria

Keywords:

Augmented Reality, Depth Cameras, Hand Gesture Recognition, People Counting, Shopping Window Rating,

Gaze Direction Heat Map, Person Height Measurement.

Abstract:

Since 2012 we have been building the augmented reality system Magic Mirror based on Kinect V1 and V2’s

native API. It relies on the magic mirror metaphor, where a large screen shows a mirrored camera view with

overlaid graphical elements. In our case, it shows a different face mesh over the person’s face which reliably

tracks face poses in real time while leaving the eyes and mouth of the person visible for interaction and

to improve immersion; replaces the background with images that may be changed, smoothly zoomed and

dragged; and allows to take screenshots which are automatically printed out on photo cards with an unique QR

code linking to its digital twin. Control of the system is primarily via easily learned hand gestures very similar

to multitouch screen gestures known from mobile phones and tablets. We have demonstrated the system to

the public as well as in private over a wide variety of settings, faces and backgrounds. Here, we explain the

challenges inherent in creating high-quality face meshes and textures from 2D images, and how we solved

them; describe the different versions of the system, how they differ and their limitations; and demonstrate

the usefulness of our system in several applications from people counting and tracking to obtaining height

measurements without storing or processing personal data.

1 INTRODUCTION

In 2012 – when Microsoft’s Kinect for Windows,

which enabled using their XBOX depth camera on

PCs, ﬁnally became available in Europe – my co-

author Alex P. dropped a unit on my desk and chal-

lenged me to build a hand gesture control for his pre-

sentations using its – at that time – unprecedented

body pose tracking. The Kinect V1 used an ac-

tive sensor and was one of the ﬁrst commercially

available low-cost depth cameras. It also included

a state-of-the-art face tracker as well as a pretrained

random-forest-based classiﬁer to automatically detect

3D body positions and poses (Shotton et al., 2011)

– including a reasonably accurate position of both

hands – so initial tests indicated that it should be fea-

sible.

We implemented a ﬁrst prototype in about two

weeks. We were then invited by Microsoft to the Pio-

neers Festival 2012 in Vienna and presented a keynote

on our gesture control app – of course using the same

app. There were many newspapers and magazine re-

ports, and even an interview with a local TV station,

W24.

We built a downloadable version, Universal

Remote, in another few weeks and it sold reasonably

well. However the customer base of professional pre-

senters was quite small, so we also wanted to build

something for everyone to use and to show other uses

of our gesture module and the Kinect itself. We addi-

tionally believed that the potential of this technology

was far beyond our simple ﬁrst application.

Our main inspiration for Magic Mirror came from

gaming theory. (Caillois, 1958) extended the theories

by Johan Huizinga with four universal game types.

One of these is Mimicry. Mimicry means becom-

ing someone else and participating in an illusionary

world. This can occur between the free game type

(paida) and the rule-based game type (ludus). Here,

the gamer is inside a virtual character that is an Avatar

- a concept mentioned for example in Neal Stephen-

son’s Science Fiction novel Snow Crash (Stephenson,

1992), where it has been introduced to the public in

the context of computer gaming. However already in

the 80ies the computer game series Ultima introduced

See https://w24.k4w.at

300

Seewald, A. and Pfeiffer, A.

Magic Mirror: I and My Avatar - A Versatile Augmented Reality Installation Controlled by Hand Gestures.

DOI: 10.5220/0010817500003116

In Proceedings of the 14th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2022) - Volume 3, pages 300-307

ISBN: 978-989-758-547-0; ISSN: 2184-433X

the concept of an Avatar - especially in part IV Quest

for the Avatar where the gamer takes over the role of

an Avatar as virtual image. (Gee, 2003) considers the

different identities. For him there is the identity of

the gamer, the identity of the avatar in the game, and

the projected identity. By this he describes transfer

effects between gamers and their Avatars.

This naturally led to a face changer, which al-

lowed to replace the background as well as wearing

a ﬂexible mask which could show facial expressions

– basically an avatar – all rendered in real-time and in-

stantly responsive to changes in face and body pose.

And so we called the system Magic Mirror - I and my

Avatar, and deﬁned it as an Autonomous Augmented

Reality Art installation where AI and machine learn-

ing techniques for gesture control, body segmentation

and face tracking (such as Random Forest, Active Ap-

pearance Models, Support Vector Machines and Dy-

namic Programming) are utilized to allow users to

wear and intuitively change – by hand gestures – a

dynamic virtual carnival mask which tracks detailed

face expressions, and also replace their background

with other scenes, real and imagined.

Within Magic Mirror - I and my Avatar we ini-

tially provided a mixture of characters from the well-

known Massively Multiplayer Online Role-Playing

Game World of Warcraft

and contemporary Aus-

trian politicans and presented it at a gaming confer-

ence (FROG 2012). It was an instant hit.

Afterwards, to make the installation more accessi-

ble to the general public, we successively extended

the mask set. At present we have many different

masks based on seasonal variations (Easter Bunny,

Santa Claus and Christkind, Halloween Characters),

European politicians and other persons of interest

(such as the Pope and Edward Snowden). Our instal-

lation allows up to six users to each take over the role

of a virtual ﬁgure or a politician in parallel and en-

ables each one to change his face, reposition and re-

size the background, and make a snapshot, by simple

hand gestures.

One disturbing variant of the installation is to give

everyone the same unchanging face. Do you still feel

like yourself?

2 RELATED RESEARCH

(Osokin, 2018) describes a system for body pose esti-

mation from RGB cameras that works bottom-up and

therefore scales to high numbers of persons – con-

trary to the Kinect with its top-down approach that

restricts the number of tracked persons to at most six.

It reconstructs roughly the same number and types of

body parts as the Kinect. It works reasonably well

according to real-life tests

and needs low computa-

tional resources, comparable to the Kinect V1. How-

ever no 3D positions of body parts are obtained, there-

fore our current hand gesture system cannot be ap-

plied directly to its output.

(Castro-Vargas et al., 2019) describe a system to

directly learn four hand gestures (down, up, left, right)

via 3D convolutional neural networks trained directly

on depth camera images. While an interesting idea,

their quoted accuracy of 73% is not high enough for a

practical system.

(Ferrari et al., 2019) describe a system for 3D

face reconstruction from combined color and depth

camera (RGB-D) data. While the quality of the 3D

face construction is very good and comparable to

(Smolyanskiy et al., 2014), it has the disadvantage of

needing a sequence of RGB-D frames to work with –

rather than a single frame – and has not been tested

with a single frame at all. So it is likely an applica-

tion would not track face expressions sufﬁciently fast

to be considered real-time – which is however a pri-

mary constraint of our system.

3 HISTORY

In this section we give an overview of the different re-

leases as well as important components of the Magic

Mirror. Our claim that it is a single system is sup-

ported by the fact that all described variants can be

created with different preprocessor deﬁnes from a sin-

gle C/C++ source code project. Magic Mirror was

publicly demonstrated at many different locations.

We also presented it at each annual lecture Future Me-

dia by Alex S. at the Danube University Krems and at

various other non-public events.

3.1 Creating Face Textures

The creation of high-quality face textures and meshes

proved to be a major challenge. We initially thought

that the Kinect V1 would only accept real-life faces,

which would have made it hard to get e.g. Angela

We tested it ourselves at ICAART 2019, together with

all the other participants of the session.

Future and Reality of Gaming (FROG) conferences

(2012,2013,2014,2017), Vienna City Hall, Austria; Sub-

otron Shop (2014), Museumsquartier Vienna, Austria; In-

ternational Broadcasting Conference (IBC 2015), Amster-

dam RAI, Netherlands; Mastercard Ad Campaign, Mu-

seumsquartier Vienna, Austria (2016); Oberbank Wels,

Austria (2016,2017); Danube University Krems (twice

2016,2017); Welios Wels, Austria (2016,2017)

Magic Mirror: I and My Avatar - A Versatile Augmented Reality Installation Controlled by Hand Gestures

301

Figure 1: Face tracking with projection and high-quality face texture. Original image (left), Face texture w/ mesh (right).

Merkel and then-US-president Obama’s face textures.

However, by conducting extensive tests we found that

Kinect V1’s Face tracker relies mostly on the color in-

formation and only uses depth information to ensure

the correct size of the face w.r.t. distance. As far as we

know there is no publication that describes the Kinect

V1 face tracking algorithm but from our observation

we can safely assume it was a 2D/3D Active Appear-

ance Model similar to the one described in (Cootes

et al., 1998) with minor ﬁltering by depth informa-

tion, where the initial face position is taken from the

head position of the estimated body pose.

So by printing out a 2D face image in approxi-

mately real size (face centered and scaled to an A4

page) and vertically positioning within in a sufﬁ-

ciently body-similar setting (we ﬁxed the face image

on a calibration rig with chessboard pattern and put it

on a ofﬁce chair – see Fig. 1) it was possible to make

the Kinect V1 detect it as a face by slowly turning

the ofﬁce chair left and right. In hard cases we would

stand next to the ofﬁce chair until our body and face

was detected and then quickly move behind the chair,

which in most cases was sufﬁcient for our face mesh

to be transferred to the printed face image.

The best solution for face mesh creation proved to

be to project the 2D face triangles from the detected

face onto the original face image by reverse projec-

tion. This was done by measuring the corners of the

printed face image page at subpixel accuracy – yield-

ing an irregular quadrangle – and projecting each 2D

triangle corner point within the quadrangle back to the

high-resolution face image. This yielded satisfactory

results. Fig. 1 shows such a sample. In this case the

resolution of the face texture may be arbitrarily high.

The limitation is mainly the face tracker, which only

outputs integer point values (i.e. does not track at sub-

pixel resolution).

Figure 2: Initial rabbit face that could not be tracked (left),

changed rabbit face that could be tracked (right).

In some cases – e.g. the colorful Easter Bunny

faces – synthetic faces were not sufﬁciently face-like

to be detected. Minor modiﬁcation were usually able

to recover meshes from these faces as well, and the

meshes could in most cases be applied to the original

image with only minor modiﬁcations. See Fig. 2 for

an example.

Unfortunately, the Kinect V2 used a different face

tracker described in (Smolyanskiy et al., 2014) which

could not be deceived in this way. In extensive exper-

iments we could never make it recognize and track a

ﬂat 2D face print. We did however manage to feed the

color image data of Kinect V2 into the Kinect V1 face

tracker and thus had slightly better and higher resolu-

tion tracking. This needed a PC with both frameworks

installed as well as both Kinect V1 and V2 physically

connected, so it was unfortunately quite cumbersome

to use. Also, the new face tracker had about ten times

the tracked points of the old face tracker, and these

did not exactly correspond to points of the old face

tracker. We could still achieve satisfactory but not

perfect results by mapping each point of the old face

image mesh to a well-chosen point on the new face

tracker mesh.

Both the Kinect V1 and V2 face tracker only track

the face itself and not the head, which means that parts

ICAART 2022 - 14th International Conference on Agents and Artiﬁcial Intelligence

302

Figure 3: Two face samples with face mesh (green lines)

and extended mesh (blue lines) that show masks with see-

through face (white=background, transparent). Right mask

is a 3D model mask from an earlier AR project, which was

rendered to 2D.

Figure 4: Sample image for Xmas Theme installation.

outside the face – forehead, hair, ears, and beards –

cannot be properly displayed on the rendered face.

To also render these parts, we extended the outer-

most 3D face triangles by vector arithmetic, gener-

ating both new points and new triangles from the face

texture mesh and the real-time face mesh, and ap-

plied the projection to these extended meshes. This

method worked surprisingly well, and also enabled

the use of extended masks which are transparent for

most of the face area. Fig. 3 shows two examples.

The right example is actually a 3D mask projected

to 2D which was taken from an earlier unpublished

Augmented Reality project. These were used for the

X-Mas Theme version, see Fig. 4.

3.2 Gesture Control

The gesture control was taken from our earlier unpub-

lished project Universal Remote and contained three

types of gestures: left/right (one-handed slide),

zoom in/out (two-handed), and one-handed drag.

We added another gesture, two-handed-drag, in

2016 exclusively for Magic Mirror. Table 1 shows

all supported gestures. These could be dynamically

assigned, e.g. zoom in/out could be assigned to mak-

ing a screenshot or to resizing the background image

Table 1: Supported Hand Gestures.

Name Image Description

move Slide with your hand to the left or the right to

activate left or right move gesture. Either hand

will work, only the direction of movement is

important.

zoom Make a diagonal movement with both hands.

Push both hands out to the front and move

them away from each other (zoom in) or to-

wards each other (zoom out), then retract them

again.

one-

handed-

drag

Raise either hand to press the left mouse but-

ton. Afterwards push the other hand out to

the front. This controls the mouse cursor. By

moving the other (non-raised) hand you can

drag the background image around just like

normal drag-and-drop.

two-

handed-

drag

Push both hands out to the front. This con-

trols the mouse cursor. By moving either or

both hands, you can drag the background im-

age around just like drag-and-drop. The center

point between both hands is used for dragging,

so using both hands will make the drag faster.

smoothly and ﬂuidly. The basis for this was the body

pose tracking by Kinect V1 and V2 – thankfully very

similar – described in (Shotton et al., 2011).

3.3 Kinect V1 (V1.0)

The ﬁrst version for Kinect V1 only supported one

face, but it was already possible to change back-

grounds and faces by left/right hand move gestures.

A snapshot could be taken using zoom in or out ges-

ture. Fig. 5 shows a sample image. We demonstrated

it at the FROG 2012 conference.

The next version V1.1 in 2013 – demonstrated at

the FROG 2013 conference – supported the maximum

of two faces that was possible for Kinect V1 using

the original Microsoft API. We also replaced the lo-

cal politicans with international politicans, the pope,

Edward Snowden, and a Guy Fawkes mask. These

masks already had see-through eyes and were fully

optimized with all four steps mentioned earlier.

3.3.1 Improving Image Quality (V1.5)

Unfortunately the image quality of Kinect V1 was

quite bad and image resolution was low, so we com-

bined it with a Hero GoPro 3, using its Mini-HDMI

output port and a HDMI frame grabber card to gen-

erate a combined real-time video from GoPro image

Magic Mirror: I and My Avatar - A Versatile Augmented Reality Installation Controlled by Hand Gestures

303

Figure 5: Kinect V1 – Sample Image. We used a z-ﬁlter

with adjustable distance here where all objects beyond a

certain distance were replaced by background. This ac-

counts for the presence of the chair and parts of the table

in this image. Later we implemented z-Skeleton ﬁltering

which replaced all background except for recognized bod-

ies (skeletons in Kinect terminology).

data and Kinect depth data. We treated the Kinect

V1 and the Hero GoPro 3 as two cameras of a stereo-

camera setting and computed intrinsic calibration ma-

trices (K) for both using calibration rigs with a 5x8

chessboard pattern with size A2 and 61.5mm squares,

followed by stereo calibration.

Stereo calibration returned the translation matrix

stereo

and the rotation matrix R

stereo

, which trans-

form the world coordinate system of the Kinect into

the one of the second camera. We generated a point

cloud from Kinect V1’s depth data using native API

functions, transformed it by R

stereo

to the sec-

ond camera’s view frame, and then projected it onto

the second camera’s focal plane using its intrinsic cal-

ibration matrices. This reproduces the almost perfect

alignment between Kinect V1 color and depth im-

ages, so depth images can be ﬁltered and combined

with the color images very easily. We then optimized

this mapping – including the initial native point cloud

generating function – to a single matrix multiplica-

tion which proved sufﬁcient for real-time rendering

and yielding almost the same results. The 3D body

part positions were transformed using the same ma-

trix.

In 2014 we installed the ﬁnal system for three

weeks behind a shopping window with Easter Bunny

face textures and meshes and showed that we could

count in/outgoing people as well as compute a win-

dow rating for shopping window attractiveness and

additionally a heatmap of gaze direction. More de-

tails can be found in Section 4.1.

This system was also demonstrated at the FROG

2014 conference using the international politicians

face set.

All algorithms were from OpenCV V1.0.

3.4 Kinect V2 (V2.0)

In 2015 we ported the system to Kinect V2 as the

Kinect V1 was no longer available. The API had

completely changed, but by extensive use of C/C++

macros it was possible to keep changes to a minimum

and still generate all V1 and V2 systems from the

same source code project. This was important since

we still needed the V1’s facetracker to scan new faces.

The Kinect V2 allowed up to six faces and had a

newer, much more accurate facetracker, but it also had

much higher hardware requirements since all color,

depth, face and body pose streams were always sent

at the highest possible resolution.

We demonstrated

the system at our stand at the International Broadcast-

ing Conference 2015 (IBC 2015). Fig 6 shows a sam-

ple image of the improved system taken at that con-

ference.

In August 2016 we provided a Magic Mirror in-

stallation for an ad campaign by MasterCard Austria

in the Museumsquartier.

For this we extended the

system to use the zoom in and zoom out gestures –

up to this time considered binary – to continuously

zoom within a larger image, and also introduced the

two-handed-drag gesture to position the zoomed im-

age with less physical effort by removing the need to

keep one hand raised. We also extended the system

with an option to automatically print photo cards af-

ter each screenshot. This extended version was called

V2.1. Running the system outdoors proved a chal-

lenge since the Kinect does not work in bright sun-

shine, so we used a pavillon to create shadow but had

to reposition it every other hour. Getting a large mon-

itor which was visible in bright sunshine was on the

other hand no problem at all.

In October 2016 we demonstrated a Halloween

version of the Magic Mirror (using Halloween faces

and background) at Oberbank Wels, and later a

Christmas version at Welios Wels and Danube Uni-

versity Krems (both in Austria). We added an unique

QR code on each printout which linked to the digi-

tal snapshot, thus integrating the visual printout and

its digital twin. Before this change, people had of-

ten used their phones to make screen snapshots and

share them. Now, it was possible to just scan the QR

code on the photo card and get to a digital version

of the same photo for easy sharing via social media

as well as having a physical photo card to remember

the event by. We also added rendering improvements

such as pseudo alpha-blending (improves the appear-

Approximately 5GBit/s via USB 3.

The Museumsquartier is a large plaza in Vienna, Aus-

tria with three large museums and several smaller spaces for

modern art open to the public.

ICAART 2022 - 14th International Conference on Agents and Artiﬁcial Intelligence

304

Figure 6: Kinect V2 – Sample Image.

ance of the border between body and background) and

z-skeleton ﬁltering (prevents the need to set up the

z-distance from which the background image is ren-

dered, by simply rendering just the recognized body

shapes – skeletons in Kinect terminology – on the

background). This new extended version was called

V2.2.

In April 2017 we demonstrated an Easter Bunny

version with zoomable background at Welios Wels in

Austria, and printed photo cards, again with an unique

QR-code that linked back to the image for easy shar-

ing.

4 USES OF THE SYSTEM

Apart from its entertainment value, Magic Mirror can

be used in several different ways either for analysis or

to add new types of user interaction. Here, we note

several ways the system was used in practice.

4.1 Tracking People

As we shortly mentioned before, in 2014 we put the

V1.5 system behind a shopping window for 21 days

with Easter Bunny faces and background at Museum-

squartier Vienna (at the Subotron shop). The Muse-

umsquartier is a wide open plaza in Vienna, Austria

with three large museums and several smaller spaces

for modern art. In front of this space there is a narrow

very long building which most people cross to get into

or out of the plaza

and this is where we placed the

system, perpendicular to this long building. For this

installation we added a magnetic polarization ﬁlter on

the GoPro camera to reduce shopping window reﬂec-

tions. We used a 16:9 computer monitor with one me-

ter / 40 inch diagonal. While the system was mainly

intended to provide entertainment, we were also able

It is also possible to enter and exit by the sides or at

other positions of the narrow building, so we saw only a

subset of the people crossing.

to track movements of people, determine how often

they looked directly at the monitor, as well as com-

pute aggregate approximate gaze direction into a heat

map, without storing any kind of personal data.

4.1.1 Counting People Passing

We used the system to count people passing from left

to right (Enter) and from right to left (Exit – both w.r.t

the Museumsquartier Vienna) by tracking each body

Id separately and computing Pearson’s correlation co-

efﬁcient between time and the x position of body cen-

ter for all body Ids with at least three tracked posi-

tions. For a correlation corr with abs(corr) > 0.9,

we took the sign of the correlation to indicate ei-

ther left or right directed movement. People where

abs(corr) <= 0.9 were labeled Neither, since in this

case no clear linear movement in one direction could

be detected (see Eq. 1,2).

bId

∑

t ∗x

bId,t

−

∑

bId,t

∑

−(

∑

−

∑

bId,t

−(

∑

bId,t

)

(1)

bId







Neither if |c

bId

| ≤ 0.9

Left if |c

bId

| > 0.9 and c

bId

< 0

Right if |c

bId

| > 0.9 and c

bId

≥ 0

(2)

Fig. 7 shows the number of people per hour who en-

tered or exited within the given hour of the day. Com-

puted values were averaged over the whole 21 days of

the Easter 2014 installation and show a distinct daily

pattern. Colored bands show additive values, e.g. for

hour 15 there are on average 11.7 people exiting, 24.2

people entering, and 12.2 people neither entering nor

exiting, yielding a total of 48.1 people counted at this

hour (averaged over the whole time period). A to-

tal of 13,212 people were counted during this period.

Note that due to signal attenuation by the shopping

window, likely only a subset of people was counted –

those who moved close enough to the shopping win-

dow (around 2.5m)

4.1.2 Shopping Window Rating

We also deﬁned a shopping window rating to deter-

mine the attractiveness of the shopping window at dif-

ferent times. Here, the system tracked how long the

people watched the screen. For this we simply com-

puted the proportion of tracked faces versus tracked

bodies, since the API cannot return any tracked face

without a corresponding body pose (see Eq. 3).

Magic Mirror: I and My Avatar - A Versatile Augmented Reality Installation Controlled by Hand Gestures

305

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

People/hour

Hour of day

Exit

Enter

Neither

Figure 7: People going into / out of Museumsquartier Vi-

enna by time of day, averaged over total timespan.

Rating

hour

|hour|

∑

t∈hour

trackedFaces

trackedBodies

(3)

Since the Kinect V1 face tracker reacts very

strongly to front faces, very weakly to turned faces

(up to 30

◦

), and not at all to side faces, this was suf-

ﬁcient to detect people looking directly at the screen.

Once a face is tracked it can be turned ±30

◦

but only

slowly, so these values might be slightly overesti-

mated. On the other hand such slow head movements

are rarely observed in public.

Of all 13,212 people counted, 1,677 looked at the

screen – on average for 15.5 seconds – which gives

an overall rating of 12.69%. The system was watched

for a total of 12,305 seconds (3.42 hours)

463 people (3.50%) watched for at least 15s, 205

people (1.55%) for at least 30 seconds, and 69 people

(0.52%) for at least 60 seconds.

Depending on the hour of day – i.e. at peak hours

– high ratings could be obtained, e.g. for hour 15

the rating was 22.51%. This means that of all peo-

ple standing in front of the screen during this time

period, almost a quarter looked at the screen. Rat-

ing estimates per hour are not smooth, partially due

to small sampling size for some hours but also since

people counts are not smooth. Fig. 8 shows the full

results per hour, again averaged over all 21 days.

4.1.3 Gaze Direction Heatmap

We were also able to utilize the integrated face tracker

to output rough estimates for head position and there-

fore approximate gaze direction, assuming the person

looks straight to the front. To some extent this is quite

likely since the face tracker initially only recognizes

front faces and later fast movements to the side will

make the face instantly lose tracking. We used this

data to build a heatmap which was overlaid on an im-

age of the actual shopping window. This shows where

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

Rating (%)

Hour of day

Rating

Figure 8: Window rating for shopping window: Proportion

of people looking at the screen (Facetracking success) di-

vided by total number of people detected, grouped by time

of day, averaged over total timespan.

Figure 9: Heatmap for Museumsquartier Vienna Installa-

tion (approximate gaze points aggregated over total times-

pan). Brighter = more accumulated gazes at this point.

people generally look at when viewing the shopping

window and enhances the shopping window rating

with very speciﬁc information on salient objects that

is extremely hard to obtain using any other means.

Fig. 9 shows the logarithmically scaled heatmap for

the whole 21 days. As expected, our installation was

the most salient object.

4.2 Estimating Body Height

In 2016, we evaluated the use of Kinect V2 to deter-

mine people’s height. For this, we recorded several

hours of body position data, yielding 129 people in-

cluding 64 children. Body height was estimated as

the distance from the ﬂoor plane in meters at the head

position (corresponding to the center of the head plus

ICAART 2022 - 14th International Conference on Agents and Artiﬁcial Intelligence

306

the difference between head and shoulder vertical po-

sition, see Eq. 4-6)

plane

f loor

= a ∗x + b ∗y + c ∗z + d (4)

dist

part

|a ∗ part

+ b ∗ part

+ c ∗ part

+ d|

√

+ b

+ c

(5)

height = dist

head

+ (dist

head

−dist

shoulderCenter

) (6)

As we did not have ground truth data except for one

person, we computed the standard deviation over all

body positions with more than 25 consecutive frames,

which was 3%. For the one person where ground truth

height was available, the estimate from averaging all

frames underestimated the true height by 4.3%. How-

ever this may be a systematic error that is easy to cor-

rect. Even if not, the true height is within the 1.3fold

conﬁdence interval of the average and thus not signif-

icantly different at 95% conﬁdence level, so a running

average of body height should be sufﬁcient to get rea-

sonable height estimates within 1-2s (15-30 samples).

4.3 Ars Electronica 2019

We submitted a short video – see https://ars.k4w.at –

demonstrating our system for Ars Electronica 2019.

We did however not win any prizes.

5 CONCLUSION

It was very interesting to work on this project, how-

ever we really do not want to continue with Windows

as a programming platform, nor do we want to rely on

Microsoft hardware in the future.

So one of our plans is to port Magic Mirror to a

small platform such as Raspberry Pi 4 or – should it

not be sufﬁciently powerful – small embedded PCs

running Linux – and make it work with other depth

cameras or perhaps even – one day – with normal

2D color cameras, using appropriate Deep Learning

tracking systems at the same functionality level.

We would also like to enable high-quality track-

ing through window glass (without cutting a hole in

it) and outside (without lugging a pavillon around and

repositioning it every hour as we did for the Master-

card event). For this specially developed hardware is

likely needed.

Every time we put up Magic Mirror, we could

count on smiling faces and happy people. So, since

the ten-year anniversary of this project is in 2022,

we also plan to take all our remaining Kinect sensors

and build – as far as we can – all previous installa-

tions of Magic Mirror into a single room (just like

a gallery), include free printout of photo cards, and

make all of this available for at least several months

of that year. If you are interested in visiting, check

https://mm.k4w.at for updates or send an email to us.

ACKNOWLEDGEMENTS

This project was partially funded by the Austrian Re-

search Promotion Agency (FFG) and by the Austrian

Federal Ministry for Transport, Innovation and Tech-

nology (BMVIT) within the Talente internship re-

search program 2013-2017, and by the Vienna Cen-

ter for Innovation and Technology (ZIT) 2015. We

would like to thank Max W., who edited all initial

masks, wrote the Mesh-Manipulator single-handedly

and contributed to this project in more ways we can

count; character designer Emanuel A., who provided

half of the Halloween masks as well as the Easter

Rabbit masks; Jogi N., Brigitte R., Sonja S., Michael

H., Julian F., Sonja F., Peter F., Mozart D. and all

other people and students who contributed to the

project over the years in a variety of ways.

REFERENCES

Caillois, R. (1958). Th

eorie des jeux. Revue de

etaphysique et de Morale, 63(1):83–102.

Castro-Vargas, J. A., Zapata-Impata, B. S., Gil, P., Garcia-

Rodriguez, J., Torres, F., et al. (2019). 3DCNN per-

formance in hand gesture recognition applied to robot

arm interaction.

Cootes, T. F., Edwards, G. J., and Taylor, C. J. (1998). Ac-

tive appearance models. In European conference on

computer vision, pages 484–498. Springer.

Ferrari, C., Berretti, S., Pala, P., and Del Bimbo, A.

(2019). 3D Face Reconstruction from RGB-D Data

by Morphable Model to Point Cloud Dense Fitting. In

ICPRAM, pages 728–735.

Gee, J. P. (2003). What video games have to teach us about

learning and literacy. Computers in Entertainment

(CIE), 1(1):20–20.

Osokin, D. (2018). Real-time 2d multi-person pose estima-

tion on CPU: Lightweight OpenPose. arXiv preprint

arXiv:1811.12004.

Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio,

M., Moore, R., Kipman, A., and Blake, A. (2011).

Real-time human pose recognition in parts from single

depth images. In CVPR 2011, pages 1297–1304. Ieee.

Smolyanskiy, N., Huitema, C., Liang, L., and Anderson,

S. E. (2014). Real-time 3D face tracking based on

active appearance model constrained by depth data.

Image and Vision Computing, 32(11):860–869.

Stephenson, N. (1992). Snow Crash: A Novel. Bantam

Books.

Magic Mirror: I and My Avatar - A Versatile Augmented Reality Installation Controlled by Hand Gestures

307