Exploitation of Noisy Automatic Data Annotation
and Its Application to Hand Posture Classification
Georgios Lydakis (1,2), Iason Oikonomidis (2), Dimitrios Kosmopoulos (3) and Antonis A. Argyros (1,2)
(1) Foundation for Research and Technology - Hellas (FORTH), Greece
(2) Computer Science Department, University of Crete, Greece
(3) University of Patras, Greece
Keywords:
Automatic Data Annotation, Noisy Annotation, Hand Posture Classification.
Abstract:
The success of deep learning in recent years relies on the availability of large amounts of accurately annotated training data. In this work, we investigate a technique for utilizing automatically annotated data in classification problems. Using a small number of manually annotated samples, and a large set of data that feature automatically created, noisy labels, our approach trains a Convolutional Neural Network (CNN) in an iterative manner. The automatic annotations are combined with the predictions of the network in order to gradually expand the training set. In order to evaluate the performance of the proposed approach, we apply it to the problem of hand posture recognition from RGB images. We compare the results of training a CNN classifier with and without the use of our technique. Our method yields a significant increase in average classification accuracy, and also decreases the deviation in class accuracies, thus indicating the validity and the usefulness of the proposed approach.
1 INTRODUCTION
In the past few years, deep learning methods have
revolutionized the field of Artificial Intelligence (AI),
achieving previously unattainable performance on a
plethora of challenging tasks. Examples include im-
age recognition (He et al., 2015), natural language
processing and machine translation (Cho et al., 2014)
and speech recognition (Graves et al., 2013).
In the case of supervised learning, annotating
large amounts of data can quickly become very costly
in terms of human effort, especially so if the annota-
tion procedure is itself difficult, for example, creating
pixel-level segmentation masks. For this reason, be-
sides semi-supervised and unsupervised learning, re-
search is also highly active in the field of reducing
the annotation effort required for training deep mod-
els in a supervised manner. In this work, we propose a technique for utilizing a large number of samples that have been automatically annotated with labels for a classification task. Given that the annotation is automatic, it is possible that the extracted labels may be noisy. Reliable automatic annotation systems are generally very hard, if not impossible, to design and build, depending on the targeted problem. Most current Computer Vision approaches can yield somewhat reliable results under specific, controlled scenarios, but failure cases are, to a varying degree, unavoidable.

Figure 1: A classifier trained only on a small dataset may fail to recognize that the hand posture depicted in these two images is the same. We propose a method to exploit automatically annotated, noisy data in order to train a better hand posture classifier.
The method we develop is generic in nature, and
can be tailored to address any classification problem.
It assumes the existence of a small number of manu-
ally annotated samples, as well as a large set of au-
tomatically annotated ones, whose labels might be
noisy. We begin by training a Convolutional Neural
Network (CNN) on the manually annotated data, re-
sulting in a classifier that might not generalize well,
given the small number of samples.
Then, we compare the predictions of the network for all automatically annotated samples with the noisy ground truth labels. We incorporate into the training set a subset of the samples for which the prediction of the classifier agrees with the noisy label. The intuition is that the agreement of the two predictors (classifier and noisy automatic annotation) is probably not coincidental.
The network is then trained again on the new dataset,
and the procedure is iteratively repeated until there is
no improvement in validation accuracy. We exper-
iment with two variations of the method, based on
how conservatively/aggressively we expand the train-
ing dataset with automatically annotated data.
In order to evaluate our technique, we apply it to
the problem of hand posture recognition from RGB
images (see Figure 1). The posture and motion of
hands play an important role in conveying informa-
tion in sign languages, when combined with other
non-manual features. Motivated by the domain of
Sign Language Recognition, we formulate a clas-
sification problem for hand postures used in Greek
Sign Language. More specifically, we aim to de-
velop a lightweight, yet robust hand posture classi-
fier. We present a method that processes unlabeled
videos of subjects signing, and automatically assigns
posture labels to the frames. This is based on us-
ing a 3D hand pose estimation tool, Google’s Me-
diaPipe (Zhang et al., 2020), for estimating the pos-
ture in each frame, and comparing the 3D configura-
tions of joints to those corresponding to the problem’s
classes. Precisely because the annotation produced
in this manner is contaminated with noise, this is a
suitable problem on which to apply the techniques we
present for handling noisy ground truth data.
2 RELATED WORK
This section briefly summarizes methods for the re-
duction of annotation effort, including machine learn-
ing approaches, and related ideas. A general tech-
nique which is commonly used both for annotation effort reduction and improved generalization is data
augmentation (Shorten and Khoshgoftaar, 2019). A
domain-specific technique is presented by Voigtlaen-
der et al. (Voigtlaender et al., 2021), for the prob-
lem of semi-supervised video object segmentation.
The authors propose a network to extract pixel-level
pseudo-labels given bounding boxes.
Reducing annotation effort is also desirable in active learning, the branch of machine learning where an algorithm repeatedly queries a user, known
as the oracle, for labeling new data. Sun et al. (Sun
and Loparo, 2020) argue that even though several au-
tomatic or semi-automatic annotation methods can re-
duce the number of instances that need to be labeled,
the queries to the oracle which offer the most to the
learning algorithm remain the most difficult cases to
label. The authors attempt to alleviate this issue by
leveraging available metadata that can give the ora-
cle “hints” for the labeling process, by clustering data
points with similar metadata attributes. In contrast to
active learning approaches, our work does not assume
perfect on-demand annotation of samples, but rather
estimates the most probable class label of samples,
closer to the semi-supervised learning paradigm.
Semi-supervision refers to a family of machine learning techniques which operate on a small set of manually labeled data combined with a larger set of unlabeled data, and is therefore also relevant to the problem of reducing annotation effort.
Honari et al. (Honari et al., 2018) develop two tech-
niques for landmark localization based on partially
annotated datasets. The authors leverage the lim-
ited samples with landmark annotation, as well as a
more abundant set of samples for which only a more
general, high-level label is available. This label can
be either for a classification or regression task, and
serves as an auxiliary guide towards localization of
landmarks on the unlabeled data. Wan et al. (Wan
et al., 2017) present a semi-supervised approach for 3-
D hand posture estimation from single depth images.
The approach creates two generative models which
share a feature space, such that any point in this space
can be mapped to a unique depth image, and a unique
3-D hand pose.
Another approach related to semi-supervision and
active learning is that of label propagation (Zhu and
Ghahramani, 2002). Label propagation starts with a
small set of labeled samples, and a larger, unlabeled
set. The key idea is to progressively label samples
from the unlabeled set, based initially only on the
knowledge of the labeled samples, but gradually la-
beling more unlabeled ones, hence “propagating” the
known labels. Our approach is similar to label propa-
gation in the sense that it also begins by only using a
small set of "trusted" samples. In contrast, however,
we assume that all of the other samples are labeled as
well, with labels that feature noise.
Figure 2: Overview of the proposed approach. We assume the availability of a small, manually labeled and therefore reliable dataset D_manual (top left), and a larger unlabeled one D (bottom left). Furthermore, an automatic method is assumed to provide noisy labels for D, yielding D_auto (bottom middle). We start by training a classifier on the reliable dataset D_manual, and proceed to iteratively expand the training set by adding samples from D where the prediction of the trained classifier agrees with the automatic annotation in D_auto.
An old paradigm in semi-supervised learning is
self-learning, or self-training (Chapelle et al., 2009),
which is based on the repeated use of a supervised
method. Initially, the supervised algorithm is trained
on labeled data only, and on each step some of the un-
labeled data is labeled using the trained model. The
procedure is repeated, and the training set gradually
expands to feature data labeled by the algorithm it-
self. Also relevant to the topic of reduced annotation
effort is the concept of learning from data where the
ground truth is noisy. Automatically created ground
truth has a greater chance of featuring noise. There-
fore, techniques that tackle this issue also facilitate
the use of automatic annotation methods. Jiang et
al. (Jiang et al., 2018) develop a strategy for training
on noisy ground truth based on Curriculum Learn-
ing. Curriculum Learning, developed by Bengio et
al. (Bengio et al., 2009), is a technique for guid-
ing the optimization of a neural network by present-
ing the training examples in an order which encour-
ages it to progressively learn more complex features.
Jiang et al. (Jiang et al., 2018) apply this concept
by simultaneously training two networks, one which
learns the actual task featuring the noisy ground truth,
and another which learns how to guide the first by
presenting samples that are deemed correct. While
the second network undergoes a training process, the
curriculum adapts to the data at hand. Hacohen et
al. (Hacohen and Weinshall, 2019) investigate Cur-
riculum Learning with two strategies. The first one
involves a “teacher” network which transfers knowl-
edge it has accumulated from some other dataset. The
second is a type of bootstrapping, where the network
is first trained on the target dataset without any cur-
riculum. Han et al. (Han et al., 2018) also tackle the
problem by proposing a method termed “Co-teaching”. It in-
volves training two networks simultaneously, where
each one teaches the other about what data it con-
siders correctly labeled. The intuition is that a net-
work tends to learn correct samples in the first stages
of optimization, and memorize the wrong ones at a
later point (Arpit et al., 2017). Mirzasoleiman et
al. (Mirzasoleiman et al., 2020) address the same is-
sues by developing a method that can select subsets of
the data that are likely to be free of noise. Their selec-
tion is based on inspecting the Jacobian matrix of the
loss function being optimized, and choosing medoid
data points in the gradient space. By choosing the
samples in this fashion, they avoid overfitting on cor-
rupted ground truth samples.
Li et al. propose DivideMix (Li et al., 2020),
an approach to learn from noisy labels by leverag-
ing semi-supervised techniques. Specifically, a mix-
ture model on the loss is used to divide the train-
ing data into a labeled set with clean samples and
an unlabeled set. Northcutt et al. propose Confi-
dent Learning (Northcutt et al., 2021), an approach
to estimate the confidence/uncertainty of dataset la-
bels. We choose this approach for experimental comparison against our work, since it is a recent, state-of-the-art method with an easy-to-use and actively developed code base. Overall, on the topic of learning from
noisy labels, Song et al. present a comprehensive
overview (Song et al., 2020).
We propose an approach for working with noisy
ground truth data that have been automatically an-
notated. Our approach exhibits similarities to the
approaches presented here, especially to Curriculum
Learning (CL) and self-training approaches. This is
because it attempts to select appropriate subsets of the
automatically annotated training data. Furthermore,
this is done in an iterative manner, exploiting the pre-
dictions of the trained classifier itself. Overall, the
most suitable learning paradigm fitting our approach
is that of semi-supervised learning, and more specifi-
cally self-training. In contrast to self-training, our ap-
proach explicitly focuses on cases with large amounts
of noisy data. To the best of our knowledge, no sim-
ilar approaches on self-training with noisy labels or
on related areas have been proposed in the relevant
literature.
3 METHODOLOGY
In this Section we describe the proposed approach in
detail, and also the method we followed to prepare
data for the target task of hand posture recognition.
3.1 Exploiting Automatic Ground Truth
We assume a classification problem featuring K
classes, and a method which is capable of automat-
ically classifying a sample in one of these classes.
However, it is assumed that this method is imperfect,
yielding some erroneous annotations. It is expected
that training a CNN model on a dataset with misla-
beled samples will affect its ability to discriminate
among the classes of the problem. It is therefore use-
ful to investigate techniques that can take advantage
of this noisy automatic labeling.
Among all candidate techniques, the simplest way
to utilize automatically annotated data is to consol-
idate large numbers of such samples, potentially re-
ducing the impact of failure cases on the performance
of the classifier. One could also treat those sam-
ples as completely unlabeled, combining them with
a small number of manually annotated samples and
thus adopt a semi-supervised approach. However, it
is our intuition that the utilization of a noisy ground
truth label can be beneficial, and therefore a more
sophisticated approach can be devised. Specifically, we assume the availability of a large dataset D_auto, for which automatically produced ground truth labels are available. We also have a small set of samples, D_manual, which we have annotated manually, and thus their labels are expected to be significantly less noisy. Training a model only on D_manual is likely to yield a
classifier that does not generalize effectively. How-
ever, an important observation is that it can still pro-
vide a noisy estimate of the likelihood of a sample be-
longing to a particular class. For a novel sample, the network outputs used with the commonly employed “one-hot encoding” for classification tasks are continuous values in the range [0, 1] (one per class), which can serve as estimates of such likelihoods.
Therefore, we can train a model on D_manual, compute its predictions on all samples of D_auto, and select only those samples where the prediction agrees with the labeling produced by automatic annotation. This yields D_auto,0 ⊂ D_auto, a set of samples which are more likely to be correctly labeled, since we are conservatively selecting only the samples for which two noisy predictors agree. We expect a model trained on D_manual ∪ D_auto,0 to generalize better than one trained on D_manual only, since it has a larger number of samples from which to extract useful classification features. Furthermore, the increased number of samples is more likely to prevent overfitting.
By continuing this iterative process, increasingly
robust classifiers are formed, and we expect a better
utilization of D_auto, since the combination of the two
predictors will filter out some of the noise of the au-
tomatic annotation, as shown in Figure 2. Assuming
a validation set, we can continue this iterative process
until the validation accuracy of the classifier trained
on the selected data no longer increases, or starts to
decrease. The latter could potentially occur if the au-
tomatic labeling results in many similar samples with
the same incorrect label. Even if only a few of those
samples are added in the training set, the capacity of
the model to memorize could result in more wrongly
annotated samples “contaminating” the training set
later on. We call this iterative scheme “Greedy Itera-
tive Dataset Expansion Algorithm”, or “G-IDEA”.
Furthermore, we expect the classifier’s predictions to become more trustworthy as the iterations progress, since it has more data available for training. On the other hand, as previously discussed, if many similar incorrect samples exist in D_auto, a model trained on a few of them can end up memorizing them. Subsequently, it is more likely to introduce more similarly incorrect samples into its dataset as the iterations progress.
Motivated by these observations, we can modify
G-IDEA: On each iteration, only a portion of the data
where the predictions agree with the automatic la-
beling is included in the training set, as illustrated
in Figure 2. To select this portion, we assume that
the model outputs K numbers representing the likeli-
hoods of a sample belonging in each of the K classes.
For each of the classes we can then order the pre-
dicted samples by decreasing likelihood. Intuitively,
we are ordering the samples by a measure of how cer-
tain the model is of its predictions. We can then select
only a conservative percentage of each class’ data,
and then gradually increase this percentage as the it-
erations progress. We call this modification “Con-
servative Iterative Dataset Expansion Algorithm”, or
“C-IDEA”. Algorithm 1 outlines a simplified version
of C-IDEA, with the addition of new samples per-
formed over the whole dataset, for clarity of the pre-
sentation. This approach, C-IDEA, is the proposed
method to exploit noisy, automatically annotated la-
bels. Apart from motivating the development of C-
IDEA, G-IDEA serves as a baseline in the quantita-
tive evaluation.
3.2 Hand Posture Recognition
As already mentioned, the algorithms presented
above are applicable in any classification problem.
Algorithm 1: Simplified version of C-IDEA. In practice the addition of samples is performed per class, to avoid class imbalance in later iterations.

Input:
  D_manual, a set of manually annotated samples
  D_auto, a set of automatically annotated samples
  D_validation, validation set, also manually annotated
  sel_r, initial training data selection ratio (in (0, 1])
  inc_factor, ratio increase factor per iteration
Output:
  m, a CNN model trained on selected input data
  D_selected, the selected training data

m ← train a model on D_manual
D_selected ← D_manual
while accuracy on D_validation increases do
  new_data ← all samples of D_auto \ D_selected where the predicted class output of m agrees with the label given in D_auto
  l ← per-sample likelihoods of all samples in new_data, as estimated by m
  new_data ← sort new_data by decreasing likelihood l
  sel_data ← first ⌊sel_r · size(new_data)⌋ elements of new_data
  D_selected ← D_selected ∪ sel_data
  m ← train a model on D_selected
  sel_r ← min(1, sel_r · inc_factor)
end while
m ← m_best (the best performing model among all trained models according to validation accuracy)
D_selected ← D_best (similarly, the training set of m_best)
return m, D_selected
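To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch of C-IDEA without the per-class balancing. The helpers train_model, predict_probs and val_accuracy stand in for the CNN training and inference code of Section 4.2; their names and signatures are ours, not part of the original method.

```python
import numpy as np

def c_idea(manual_idx, auto_idx, auto_labels, val_set,
           train_model, predict_probs, val_accuracy,
           sel_r=0.1, inc_factor=1.5):
    """Sketch of C-IDEA (Algorithm 1). auto_labels[i] is the noisy automatic
    label of sample i; the three callables are assumed placeholders."""
    selected = list(manual_idx)                    # D_selected, initially D_manual
    model = train_model(selected)
    best = (val_accuracy(model, val_set), model, list(selected))

    while True:
        selected_set = set(selected)
        remaining = [i for i in auto_idx if i not in selected_set]
        probs = predict_probs(model, remaining)    # (N, K) class likelihoods
        preds = probs.argmax(axis=1)
        # keep the samples where the classifier agrees with the noisy label
        agree = [(probs[n, preds[n]], remaining[n])
                 for n in range(len(remaining))
                 if preds[n] == auto_labels[remaining[n]]]
        agree.sort(reverse=True)                   # decreasing likelihood
        take = int(np.floor(sel_r * len(agree)))
        selected += [idx for _, idx in agree[:take]]
        model = train_model(selected)
        acc = val_accuracy(model, val_set)
        if acc <= best[0]:                         # stop once validation accuracy stops improving
            break
        best = (acc, model, list(selected))
        sel_r = min(1.0, sel_r * inc_factor)       # grow the selection ratio (inc_factor > 1 here)

    return best[1], best[2]                        # best model and its training set
```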
For the current work, we choose to apply and evalu-
ate them on the problem of hand posture recognition
from RGB images. Given a single RGB image of a
human hand, the task here is to output a label indicat-
ing which of the K postures appears on the image.
Experimental evaluation of the proposed tech-
niques requires us to specify a set of hand postures
which we are interested in recognizing. Motivated by
the general problem of Sign Language Recognition,
we apply our methods to a set of hand postures that
convey semantic information in Greek Sign Language
(GSL). Although recognition and translation of GSL
cannot be performed with hand posture information
only, the configuration of each hand can serve as a
useful feature in the general task.
3.3 Automatic Hand Posture Extraction
As already outlined, the proposed approach assumes
an automatic way to label a large part of the dataset,
D_auto. Therefore, in the following we outline the
approach we used to automatically label images of
sign language. As already mentioned, this approach
does not need to estimate perfect labels; in fact, the annotation is assumed to be noisy. Our approach is
based on extracting the 3-D keypoint structure of the
hands. Then, each frame is assigned the label of the
posture that best matches this structure, among a pre-
defined set of postures we are interested in recog-
nizing. In order to represent 3-D hand postures, we
adopt a hand model commonly used in relevant litera-
ture (Panteleris et al., 2018; Simon et al., 2017). This model consists
of 21 keypoints, the joints of the palm and fingers.
The first step in assigning one of K labels to any
subset of the frames is extracting this 3-D structure
for all classes and all frames. To achieve this, we used
MediaPipe Hands (Zhang et al., 2020), a software de-
veloped by Google, capable of extracting 2.5-D hand
landmarks from RGB images.
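As an illustration, the snippet below shows one way to obtain the 21 hand landmarks with MediaPipe Hands; the parameter values and the input file name are illustrative, and this is a sketch of the extraction step rather than the exact pipeline used for our experiments.

```python
import cv2
import mediapipe as mp

# Extract 21 hand landmarks from a single RGB image with MediaPipe Hands.
hands = mp.solutions.hands.Hands(static_image_mode=True,
                                 max_num_hands=1,
                                 min_detection_confidence=0.5)

image_bgr = cv2.imread("frame.png")                      # hypothetical input frame
results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    lm = results.multi_hand_landmarks[0].landmark        # 21 landmarks
    # x, y are normalized to [0, 1]; z is the depth relative to the wrist
    posture = [(p.x, p.y, p.z) for p in lm]
```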
Given an input image of dimensions W × H, the
scheme (Zhang et al., 2020) utilized by Mediapipe
represents each hand posture as a 21-tuple of 3-tuples,
P = ((x_1, y_1, z_1), (x_2, y_2, z_2), ..., (x_{21}, y_{21}, z_{21})). All x_i, y_i are in the range [0, 1], and x_i·W, y_i·H equal the horizontal and vertical pixel coordinates of landmark i in the image, respectively. Furthermore, z_i represents the relative depth of landmark i: the wrist joint is positioned at depth 0 by convention, and the remaining depths are appropriately assigned. In order to compare hand postures via their corresponding 3-D keypoints, we first have to transform all poses to a common coordinate system.^1
For translation and rotation normalization we as-
sume that the wrist joint and the base joints of all
fingers and thumb can be considered approximately
rigid in the human hand. Let P = (j_0, j_1, ..., j_{21}) be a hand posture in an image of dimensions W × H as described previously, with j_i = (x_i, y_i, z_i) corresponding to the i-th landmark as described above. Then, P′ = (j_0 − j_0, j_1 − j_0, ..., j_{21} − j_0) is the same posture translated so that the wrist lies at the origin. We continue by rotating all the points of P′ so that the vector from the wrist to the base of the index finger lies on the x axis, and the palm lies on a predefined plane. It is now partially meaningful to define the distance of two hand postures, P_1 = (j_{1,1}, j_{1,2}, ..., j_{1,21}) and P_2 = (j_{2,1}, j_{2,2}, ..., j_{2,21}), as the sum of the Euclidean distances of all corresponding joints, as in Equation 1. However, the characteristics of each individual hand (e.g. finger lengths) can adversely affect this sum: two different hands performing the same posture may end up having a distance significantly larger than zero if their anatomical structures are different.

^1 For this work, we make the simplifying assumption that no two classes can differ from one another solely by a common rotation of all their landmarks.
Therefore, it is necessary to also normalize the in-
dividual hand characteristics. To this end, we chose
a predefined hand model, H_0 = (j_{0,1}, j_{0,2}, ..., j_{0,21}), and apply the following transformation on all postures, in order to match their structure with H_0. We first replace all the finger base joints in the normalized pose by their respective ones in H_0, effectively forcing an identical palm size. Then, for each “bone” (that is, a segment of consecutive joints in the kinematic chain), we change its length to match the respective one of H_0, while preserving its direction in space.
We can now proceed to compare hand poses. To
this end, as already mentioned, we employ the sum of
Euclidean distances between corresponding hand keypoints, after they have been normalized as above:

d(P_1, P_2) = Σ_{k=1}^{21} || j_{1,k} − j_{2,k} ||.   (1)
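The normalization and distance computation, as well as the nearest-reference labeling described next, can be sketched as follows. This is a simplified illustration: it assumes the MediaPipe landmark ordering (wrist at index 0, index-finger base at index 5, pinky base at index 17), omits the bone-length normalization against H_0, and all helper names are ours.

```python
import numpy as np

WRIST, INDEX_BASE, PINKY_BASE = 0, 5, 17   # assumed MediaPipe landmark indices

def normalize_pose(P):
    """Translate the wrist to the origin and rotate the hand into a canonical
    frame (wrist-to-index-base vector on the x axis, palm on the z = 0 plane).
    P is a (21, 3) array of landmarks; bone-length normalization is omitted."""
    P = np.asarray(P, dtype=float)
    P = P - P[WRIST]                                # translation normalization
    x = P[INDEX_BASE] / np.linalg.norm(P[INDEX_BASE])
    z = np.cross(x, P[PINKY_BASE])
    z = z / np.linalg.norm(z)                       # approximate palm normal
    y = np.cross(z, x)
    R = np.stack([x, y, z])                         # rows of the canonical frame
    return P @ R.T                                  # rotation normalization

def posture_distance(P1, P2):
    """Equation 1: sum of Euclidean distances of corresponding joints."""
    return float(np.sum(np.linalg.norm(normalize_pose(P1) - normalize_pose(P2), axis=1)))

def auto_label(frame_pose, reference_poses):
    """Assign a frame the class of the closest reference posture."""
    return int(np.argmin([posture_distance(frame_pose, ref) for ref in reference_poses]))
```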
We now have a way of comparing every hand pos-
ture of every frame with the postures that correspond
to the classes of our problem. More precisely, we are
given K input images, each corresponding to a partic-
ular hand posture (that is, to a class of the problem)
and a set of videos, each featuring one of the subjects performing these hand postures. The end goal is to automatically label a subset^2 of the frames that can then be used to train the classifier we are designing.
Towards this end, the K reference hand postures
are compared with each of the frames to be automat-
ically labeled, according to the metric of Equation 1.
Each frame is then assigned the label of the reference
pose with the lowest distance. The resulting dataset is
termed D_auto. Furthermore, we proceed to manually or semi-automatically (with the aid of automatically extracted labels as above) annotate a small part of the available images into the datasets D_manual, D_validation, and D_test, paying attention to include images of different signers in each of them, and also excluding the signers that appear in D_auto.
^2 In practice many of the frames of a video recording must be discarded because the signer is in the so-called “neutral pose”. We resort to simple heuristics, such as thresholds on hand motion, to detect and reject such frames.

4 EXPERIMENTAL EVALUATION

We present an evaluation of our proposed methods applied to the problem of hand posture recognition from RGB images. Firstly, we present the datasets that are
used for model selection (validation set) and perfor-
mance estimation (test set). We then present the de-
tails of the training procedure, with respect to the net-
work’s hyperparameters and data augmentation pro-
cess. Finally, we compare the performance of CNNs
trained with and without the use of the algorithms out-
lined in Section 3.1.
4.1 Employed Datasets
In order to evaluate the proposed approach in real-
world data we employ the HealthSign dataset, which
is detailed in (Kosmopoulos et al., 2020). This is
a dataset focused on communication of deaf patients
with health professionals. From an analysis of the
450 most common glosses in the HealthSign dataset
we found 38 postures being used at least once. From
these 38 postures, we selected 19, for which the au-
tomatic annotation process had yielded some initial
label, aiding the manual annotation process.
Apart from the images in the HealthSign dataset, additional data were gathered from YouTube videos of signing subjects, and further images were captured by us. Seven male and three female volunteers per-
formed the identified hand postures under our instruc-
tions, contributing to the available images. Overall,
the HealthSign dataset features 8 signers, and from
YouTube videos and our recordings we added another
7 and 10 signers respectively to the pool.
Among these, images from the HealthSign dataset
were used for manual annotation, essentially exclu-
sively populating the D_manual dataset. The remainder of the data were predominantly used for automatic labeling (populating D_auto), whereas some of the signers were held out, populating either the D_validation or the D_test parts of the dataset. Figure 3
shows three example images for posture “Y” from the
test set, each from a different signer.
4.2 Classifier Architecture and Training
For all the experiments presented below, we choose
the MobileNet v2 network (Sandler et al., 2019) ar-
chitecture as the base for the classifiers we train. We
choose it because it is a recent lightweight architec-
ture that can be used in real-time conditions on smart-
phones, while also achieving high accuracy (e.g. on
the ImageNet dataset (Deng et al., 2009)). In partic-
ular, we always start with a MobileNet v2 model that
has already been trained on the ImageNet dataset. In
order to adapt the model to our classification problem,
we add a fully connected layer after the convolutional
part of the MobileNet architecture. Each of its neurons has as inputs all outputs of the final layer of the MobileNet architecture, and there are as many neurons as the K classes of the problem. Finally, the K outputs (x_1, x_2, ..., x_K) are fed through a log-softmax layer, since we are aiming for classification among K classes. During training, the objective function is the negative log likelihood loss, which is commonly used for classification problems.

Figure 3: Three examples of images included in the test set for the posture “Y”, from three different signers.
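A minimal PyTorch sketch of this classifier is shown below; the torchvision attribute names and the pretrained flag correspond to common torchvision versions, K = 19 reflects the postures selected in Section 4.1, and the snippet is an illustration rather than the authors' code.

```python
import torch
import torch.nn as nn
from torchvision import models

K = 19  # number of hand posture classes (Section 4.1)

# MobileNet v2 pretrained on ImageNet; its classifier is replaced by a single
# fully connected layer with K outputs, followed by a log-softmax layer.
model = models.mobilenet_v2(pretrained=True)
model.classifier = nn.Sequential(
    nn.Linear(model.last_channel, K),
    nn.LogSoftmax(dim=1),
)

criterion = nn.NLLLoss()                                   # negative log likelihood loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # PyTorch default Adam learning rate
```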
We tried both the AdaDelta (Zeiler, 2012) and
Adam (Kingma and Ba, 2017) optimizers, with
the PyTorch (Paszke et al., 2019) default learning
rates of 1 and 10^-3, respectively. Preliminary exper-
iments determined that the Adam optimizer gener-
alized marginally better. The experiments detailed
below were all performed with the Adam optimizer
with this parameterization. During training we used
a batch size of 64 samples. In every experiment pre-
sented below, the corresponding CNN was trained for
120 epochs. At the end of each epoch, the accuracy
of the classifier on the validation set was measured,
and the final model for that experiment was chosen to
be the one which achieved maximum validation accu-
racy.
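The epoch loop with validation-based model selection can be sketched roughly as follows; the data loaders, device handling and helper structure are our assumptions and not taken from the original code.

```python
import copy
import torch

def fit(model, train_loader, val_loader, criterion, optimizer,
        epochs=120, device="cuda"):
    """Sketch of the training procedure: 120 epochs, keeping the model
    with the highest validation accuracy (loaders are assumed)."""
    model.to(device)
    best_acc, best_state = 0.0, None
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:          # batches of 64 samples
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if acc > best_acc:                            # keep the best model so far
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)
    return model, best_acc
```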
The fact that the classes are assumed rotationally
invariant allows us to use a random rotation between
0° and 360° as augmentation. Since the images fed
to the network are already tightly cropped around the
hand, we avoid using random cropping as augmenta-
tion, since it could result in obscuring parts of the im-
age that contain useful information for classification.
We do, however, use a random translation augmenta-
tion of up to 10% of the image’s dimensions. After
applying all transforms, each image that is fed to the
network is always scaled to be 224 × 224 pixels in
size, and also normalized using a per-channel mean of µ = (0.485, 0.456, 0.406) and standard deviation of σ = (0.229, 0.224, 0.225) (values computed on the ImageNet dataset, and recommended for use with a pretrained MobileNet model).
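The augmentation pipeline described in this subsection roughly corresponds to the following torchvision transforms; any detail not stated in the text (e.g. the use of RandomAffine for both rotation and translation) is our assumption.

```python
from torchvision import transforms

# Rotation in the full [0, 360) degree range, translation of up to 10% of the
# image size, resize to 224x224, and ImageNet normalization for MobileNet v2.
train_transform = transforms.Compose([
    transforms.RandomAffine(degrees=360, translate=(0.1, 0.1)),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])
```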
For the case of training images that originate from
the HealthSign dataset, we also apply one custom data
augmentation step before applying any of the others
described above. Specifically, because the Health-
Sign videos were recorded in a room with a mostly
green background, we can determine an approximate
pixel value (r_bg, g_bg, b_bg) for the background color in an offline preprocessing step. Then, during training, we can perform a crude segmentation of the background based on these color values. Having selected the background pixels, we then replace them with the corresponding pixels from an image randomly sampled from the Stanford backgrounds dataset (Gould et al., 2009). Figure 4 illustrates applying this transformation to an example image. We apply this transformation with a probability of 0.7 on any HealthSign training image that is fed to the network. This serves to provide greater variety to the HealthSign data, reducing the probability of overfitting on irrelevant features, such as learning to expect a green background around some or all hand postures.

Figure 4: Transforming the background of HealthSign images. On the left, one of the original images. On the right, the same image augmented with a random background.
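A rough sketch of this background replacement, assuming a precomputed background color, a pool of background images already resized to the input size, and an illustrative color threshold:

```python
import random
import numpy as np

def replace_background(image, bg_color, backgrounds, threshold=30.0, p=0.7):
    """Crude green-screen style augmentation: pixels close to the estimated
    background color (r_bg, g_bg, b_bg) are replaced by pixels from a randomly
    sampled background image. The threshold and resizing policy are assumptions."""
    if random.random() > p:
        return image
    image = np.asarray(image, dtype=np.float32)
    bg = random.choice(backgrounds).astype(np.float32)
    assert bg.shape == image.shape, "backgrounds assumed pre-resized to the image size"
    mask = np.linalg.norm(image - np.asarray(bg_color, dtype=np.float32), axis=-1) < threshold
    out = image.copy()
    out[mask] = bg[mask]                     # paste the random background onto green pixels
    return out.astype(np.uint8)
```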
4.3 Experimental Evaluation
As a first, baseline experiment, we trained a classi-
fier as outlined above, only on the manually anno-
tated dataset D_manual. In this case, the trained model
achieved an average accuracy of 46.5% on the valida-
tion set, and 41.5% on the test set.
Next, we experimented with the use of automati-
cally annotated data as a complement for our small set
of manually annotated images, naively adding them
to the training pool. This is a large pool of samples
from 17 signers. We progressively added more au-
tomatically annotated samples ordered by lowest dis-
tance (see Equation 1), while maintaining the balance of
samples per class and signer. We experimented with
several values for the number of samples per class
for every signer, beginning with 15 images per signer
class, similarly to the manually annotated samples.
By the term signer class we refer to all the images
belonging in a particular class which feature the same
signer. Specifically, we experimented with the val-
ues of 30, 60, 120 and 500 images per signer class.
Table 1: Validation accuracy achieved when using a variable number of images per signer class.

Images per signer class:  15      30      60      120     500
Validation accuracy:      69.06%  71.33%  70.4%   72.3%   73.38%
Figure 5: Validation accuracy achieved on each iteration
when using G-IDEA.
Table 1 summarizes the highest validation accuracies
recorded with each configuration.
We observe that, even with 15 images per signer
class, there is a significant improvement (25%-30%)
over using manually annotated data only. Further-
more, increasing the number of images per signer
class is, in general, beneficial to validation accuracy.
We record the highest validation accuracy of 73.38%
when using 500 images per signer class. However,
we also observe that the increase in accuracy is rather
small in proportion to the increase in training set size.
Considering that a large number of new, correctly
labeled samples are introduced, one might expect a
larger increase in accuracy. This hints that a signif-
icant number of mislabeled samples have gradually been introduced into the dataset as well. We stopped
experiments at 500 images per signer class because
the increase in performance had almost plateaued at
500, and the additional computational cost did not jus-
tify experimenting with larger values.
Next, we experimented with G-IDEA, serving as
a baseline for C-IDEA. In this case, the training sam-
ples were iteratively selected by the proposed ap-
proach, as outlined in Section 3.1. For this reason,
we anticipated the method to result in a less noisy
dataset. Consequently, we also expected improve-
ments on validation and test accuracy. Indeed, this
was observed, as shown in Figure 5, depicting the
evolution of validation accuracy on each iteration of
the algorithm. Peak accuracy was reached on iteration
2, and the algorithm terminated on iteration 3, since
the accuracy no longer increased. We ran the algo-
rithm for two more iterations, and validation accuracy
continued to decrease very slowly.
Figure 6: Validation accuracy achieved on each iteration
when using C-IDEA.
Average accuracy was increased over the naive ap-
proach by approximately 5% both on the validation
and test set. Furthermore, we observed in the re-
spective confusion matrices that cases of confusion
between specific classes decreased compared to the
naive approach. Also, without the use of this algo-
rithm the lowest class accuracy recorded on both val-
idation and test sets was 25%, even though the average accuracy was on the order of 70%. When using G-
IDEA, the lowest class accuracy was 44%. Filtering
noisy ground truth samples therefore apparently con-
tributes to the performance of a trained classifier, al-
though there is still potential for further improvement.
Finally, we experimentally assessed the proposed approach, C-IDEA, which also selects new training samples iteratively, but makes its selections in a more conservative manner. We examined whether this more conservative strategy could yield improvements. In particular, we used the aforementioned algorithm with parameters sel_r = 0.1 and inc_factor = 0.15.
The evolution of validation accuracy as the iterations
progress is shown in Figure 6. Compared to G-IDEA,
accuracy increased at a slower rate, which is to be ex-
pected, since the increase of training data is by design
slower. However, from iteration 5 onward, C-IDEA
achieved accuracies higher than the ones recorded in
the previous experiment, culminating in a validation
accuracy of 83.81%.
We recorded an accuracy of 83.81% on the vali-
dation set, and 85.72% on the test set, thus improving
by approximately 5% over G-IDEA. The lowest class
accuracy recorded on both validation and test set was
now 65%, and we observed once again a decreased
number of cases where one class was significantly
confused with another. Therefore, C-IDEA appears
capable of selecting mostly correct samples.
As a comparison with the state of the art, we
performed an additional experiment using the Confi-
dent Learning method by Northcutt (Northcutt et al.,
2021). In that experiment, the whole training dataset,
Table 2: Summary of validation and test accuracies achieved by each method.

Technique/data                               Validation accuracy   Test accuracy
Manual data                                  46.45%                41.52%
Manual & automatic data                      73.38%                74.19%
G-IDEA (proposed)                            78.83%                80.04%
C-IDEA (proposed)                            83.81%                85.73%
Confident Learning (Northcutt et al., 2021)  73.58%                74.84%
D_manual ∪ D_auto, was jointly provided to that method^3,
which then decided which samples were to be trusted.
The resulting selection was then used to train a clas-
sifier. The validation accuracy of this model was
73.58%, marginally better than the naive approach. Thus, while Confident Learning slightly improves over naive training, our proposed approach is evidently better suited for the task at hand. One reason for
this may be the fact that our approach treats differ-
ently the parts D_manual and D_auto, trusting the first to exploit the second.
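For reference, the Confident Learning selection step can be reproduced along the following lines with the cleanlab library; exact function and argument names may differ between versions, so this is a hedged sketch rather than the authors' script.

```python
import numpy as np
from cleanlab.filter import find_label_issues

def select_trusted(noisy_labels, pred_probs):
    """noisy_labels: (N,) labels of D_manual and D_auto combined (noisy for the
    automatic part); pred_probs: (N, K) out-of-sample predicted probabilities,
    e.g. from cross-validation. Returns indices of samples deemed trustworthy."""
    issue_mask = find_label_issues(labels=noisy_labels, pred_probs=pred_probs,
                                   return_indices_ranked_by=None)
    return np.flatnonzero(~issue_mask)   # a classifier is then retrained on these samples
```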
The experimental analysis presented in this sec-
tion demonstrates that automatically annotated data can be
highly beneficial in our problem. The mere inclusion
of automatically labeled samples contributes signifi-
cantly to the generalization of the network, increas-
ing average accuracy on unseen data from the range
of 40%-45% to 73%-74%. Furthermore, the cost in
human effort for gathering the data is rather small.
Additionally, the use of C-IDEA further increases ac-
curacy to 83%-85%. These results are summarized
in Table 2, along with the performance of Confident
Learning (Northcutt et al., 2021) for comparison.
5 DISCUSSION
5.1 Summary
We presented a method for utilizing automatically an-
notated data in training CNNs on classification prob-
lems. The method is based on training the network on
a small subset of manually annotated data, and then
iteratively adding samples that are likely correctly la-
beled. Iteratively, the network is retrained on the new
training set, gradually becoming more accurate. We
applied these techniques on the problem of hand pos-
ture recognition from RGB images.
^3 We used the implementation provided by the authors.
5.2 Limitations
A limitation that stood out during the experimental evaluation was the sensitivity of the proposed approach to inherently ambiguous classes.
Specifically for our test case, postures of different
classes that nevertheless were similar, for example
differing by the pose of a single finger, were harder to classify correctly. This may happen because
an incorrectly labeled sample may lead to a cascad-
ing effect: after the network is trained with it, similar,
incorrectly labeled samples enter the training set, per-
petuating the initial classification error. Nevertheless,
the proposed approach still outperformed the naive
approach, probably because the selected correctly la-
beled samples outnumbered the incorrect ones in the
later iterations. More generally, spurious entries in
the early steps are problematic because they may lead
to this cascading effect. A potential mitigating strat-
egy would be to reevaluate the selection, and remove
some of the least confident samples in each iteration.
5.3 Future Work
Among the numerous future directions of this work, a
few stand out: Better heuristics to determine the most
appropriate samples to add in each iteration can po-
tentially yield further improvements. Also, the prob-
lem of hand posture recognition may benefit from dif-
ferent 3D hand pose estimation techniques, other than
MediaPipe (Zhang et al., 2020), or even in conjunc-
tion with it. Another interesting direction to investi-
gate is the possibility to train and use multiple classi-
fiers, in a boosting fashion. This would allow for more
reliable class predictions and possibly faster conver-
gence of C-IDEA. Finally, problems other than hand
posture classification are worth investigating.
ACKNOWLEDGEMENTS
This work is partially supported by the Greek Sec-
retariat for Research and Technology, and the EU,
Project HealthSign: Analysis of Sign Language on
mobile devices with focus on health services T1EK-
01299 within the framework of “Competitiveness,
Entrepreneurship and Innovation” (EPAnEK) Opera-
tional Programme 2014-2020.
REFERENCES
Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio,
E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville,
A., Bengio, Y., et al. (2017). A closer look at mem-
orization in deep networks. In International Confer-
ence on Machine Learning, pages 233–242. PMLR.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J.
(2009). Curriculum learning. In Proceedings of
the 26th annual international conference on machine
learning, pages 41–48.
Chapelle, O., Scholkopf, B., and Zien, A. (2009).
Semi-supervised learning (chapelle, o. et al., eds.;
2006)[book reviews]. IEEE Transactions on Neural
Networks, 20(3):542–542.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D.,
Bougares, F., Schwenk, H., and Bengio, Y. (2014).
Learning phrase representations using rnn encoder-
decoder for statistical machine translation.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
Ieee.
Gould, S., Fulton, R., and Koller, D. (2009). Decomposing
a scene into geometric and semantically consistent re-
gions. In 2009 IEEE 12th International Conference
on Computer Vision, pages 1–8.
Graves, A., Mohamed, A.-r., and Hinton, G. (2013).
Speech recognition with deep recurrent neural net-
works.
Hacohen, G. and Weinshall, D. (2019). On the power of
curriculum learning in training deep networks. In In-
ternational Conference on Machine Learning, pages
2535–2544. PMLR.
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang,
I., and Sugiyama, M. (2018). Co-teaching: Robust
training of deep neural networks with extremely noisy
labels. arXiv preprint arXiv:1804.06872.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep resid-
ual learning for image recognition.
Honari, S., Molchanov, P., Tyree, S., Vincent, P., Pal, C.,
and Kautz, J. (2018). Improving landmark localiza-
tion with semi-supervised learning. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1546–1555.
Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L.
(2018). Mentornet: Learning data-driven curricu-
lum for very deep neural networks on corrupted la-
bels. In International Conference on Machine Learn-
ing, pages 2304–2313. PMLR.
Kingma, D. P. and Ba, J. (2017). Adam: A method for
stochastic optimization.
Kosmopoulos, D., Oikonomidis, I., Constantinopoulos, C.,
Arvanitis, N., Antzakas, K., Bifis, A., Lydakis, G.,
Roussos, A., and Argyros, A. (2020). Towards a vi-
sual sign language dataset for home care services. In
2020 15th IEEE International Conference on Auto-
matic Face and Gesture Recognition (FG 2020), pages
520–524. IEEE.
Li, J., Socher, R., and Hoi, S. C. (2020). Dividemix:
Learning with noisy labels as semi-supervised learn-
ing. arXiv preprint arXiv:2002.07394.
Mirzasoleiman, B., Cao, K., and Leskovec, J. (2020). Core-
sets for robust training of deep neural networks against
noisy labels. Advances in Neural Information Pro-
cessing Systems, 33.
Northcutt, C., Jiang, L., and Chuang, I. (2021). Confi-
dent learning: Estimating uncertainty in dataset labels.
Journal of Artificial Intelligence Research, 70:1373–
1411.
Panteleris, P., Oikonomidis, I., and Argyros, A. A. (2018).
Using a single rgb frame for real time 3d hand pose
estimation in the wild. In IEEE Winter Conference on
Applications of Computer Vision (WACV 2018), also
available at CoRR, arXiv, pages 436–445, Lake Tahoe,
NV, USA. IEEE.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N.,
Antiga, L., Desmaison, A., Kopf, A., Yang, E., De-
Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S.,
Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019).
Pytorch: An imperative style, high-performance deep
learning library. In Wallach, H., Larochelle, H.,
Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Gar-
nett, R., editors, Advances in Neural Information Pro-
cessing Systems 32, pages 8024–8035. Curran Asso-
ciates, Inc.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. (2019). Mobilenetv2: Inverted residuals
and linear bottlenecks.
Shorten, C. and Khoshgoftaar, T. M. (2019). A survey on
image data augmentation for deep learning. Journal
of Big Data, 6(1):1–48.
Simon, T., Joo, H., Matthews, I., and Sheikh, Y. (2017).
Hand keypoint detection in single images using mul-
tiview bootstrapping.
Song, H., Kim, M., Park, D., Shin, Y., and Lee, J.-G. (2020).
Learning from noisy labels with deep neural networks:
A survey. arXiv preprint arXiv:2007.08199.
Sun, Y. and Loparo, K. (2020). Context aware im-
age annotation in active learning. arXiv preprint
arXiv:2002.02775.
Voigtlaender, P., Luo, L., Yuan, C., Jiang, Y., and Leibe,
B. (2021). Reducing the annotation effort for video
object segmentation datasets. In Proceedings of
the IEEE/CVF Winter Conference on Applications of
Computer Vision, pages 3060–3069.
Wan, C., Probst, T., Van Gool, L., and Yao, A. (2017).
Crossing nets: Dual generative models with a shared
latent space for hand pose estimation. In Confer-
ence on Computer Vision and Pattern Recognition,
volume 7.
Zeiler, M. D. (2012). Adadelta: An adaptive learning rate
method.
Zhang, F., Bazarevsky, V., Vakunov, A., Tkachenka, A.,
Sung, G., Chang, C.-L., and Grundmann, M. (2020).
Mediapipe hands: On-device real-time hand tracking.
Zhu, X. and Ghahramani, Z. (2002). Learning from labeled
and unlabeled data with label propagation.