Interpretable High-level Features for Human Activity Recognition
Yale Hartmann (https://orcid.org/0000-0002-1003-0188), Hui Liu (https://orcid.org/0000-0002-6850-9570), Steffen Lahrberg (https://orcid.org/0000-0002-1714-2529) and Tanja Schultz (https://orcid.org/0000-0002-9809-7028)
Cognitive Systems Lab, University of Bremen, Bibliothekstraße 1, 28359 Bremen, Germany
Keywords:
High-level Feature, Human Activity Recognition, Hidden Markov Model, Motion Unit, Few-shot Learning,
Wearable Sensors.
Abstract:
This paper introduces and evaluates a novel way of processing human activities based on unique combinations of interpretable categorical high-level features, with applications to classification, few-shot learning, and cross-dataset and cross-sensor comparison, combination, and analysis. Feature extraction is treated as a classification problem and solved with Hidden Markov Models, making the feature space easily extensible. The feature extraction is evaluated person-independently on the CSL-SHARE and UniMiB SHAR datasets and achieves balanced accuracies upwards of 96.1% on CSL-SHARE and up to 91.1% on UniMiB SHAR. Furthermore, classification experiments on the separate and combined datasets achieve 85% (CSL-SHARE), 64% (UniMiB SHAR), and 74% (combined) accuracy. The few-shot learning experiments show potential, with low feature extraction errors, but require further work for good activity classification. Remarkably, errors can easily be attributed and optimization areas indicated. These experiments demonstrate the potential and possibilities of the proposed method and of the high-level, extensible, and interpretable feature space.
1 INTRODUCTION
Human Activity Recognition (HAR) is an important research topic for facilitating modern life. Researchers entering the field usually apply mature Machine Learning (ML) algorithms to related tasks, such as automatic segmentation, feature extraction and selection, and activity modeling. In such work, only the ML aspects are considered, while many steps important to HAR, such as equipment, signal acquisition, Digital Signal Processing (DSP), system evaluation, customization, and practical applications, are omitted. The research goals therefore often reduce to two questions: Which ML approaches are applicable to HAR? And how can model topologies and parameters be adjusted to improve the recognition rate on certain tasks?
Much of the literature addresses the first question. For example, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Hidden Markov Models (HMMs), and other well-known ML algorithms have shown their capability of solving problems in segmentation, feature study, activity modeling, and other related aspects
(Oniga and Sütő, 2014) (Ordóñez and Roggen, 2016)
(Yang et al., 2015) (Singh et al., 2017) (Ronao and Cho, 2014). Upgraded architectures for HAR based on these fundamental algorithms, such as Residual Neural Networks (ResNets) (Tuncer et al., 2020) and Hierarchical Hidden Markov Models (HHMMs) (Youngblood and Cook, 2007), have also been emerging. Building on the above-listed methods and aiming to improve results via model improvement, parameter adjustment, and experimental iteration, further works address the second question (Ronao and Cho, 2016) (Arifoglu and Bouchachia, 2017) (Uddin et al., 2011) (Rebelo et al., 2013) (Amma et al., 2010).
However, comparing the rich outcomes of HAR research so far with other ML-based recognition fields reveals an aspect of HAR that needs further exploration. In Automatic Speech Recognition (ASR), human vocalization is widely studied and applied to modeling (Schultz, 2002). For example, researchers apply the three-state HMM-based Bakis model, constructing phonemes by imitating phonetics when segmenting the pronunciation (Bakis, 1976). Each state (begin/middle/end), also called a sub-phoneme, models part of a phoneme, not only enhancing training efficiency and recognition performance but also endowing the model with phonetic and biological significance. Another well-known example is the field of image and video recognition, where many
approaches draw inspiration from the physiological model of how human eyes handle stereo vision, i.e., from human vision to computer vision (Dickinson et al., 2009) (Scheirer et al., 2014). In contrast, HAR research so far has largely remained within Artificial Intelligence (AI) itself, and very little literature extends the study to somatology. In fact, the concept of "human activity" is inherently practical and closely linked to physiology and sports science. Integrating related knowledge into the research field of computer science requires bridging gaps between the disciplines. Once these boundaries are crossed, HAR research will have a more solid foundation for theoretical modeling and experimental optimization than the mere discussion of topologies and parameters in ML models. One such step has been made with the introduction of Motion Units, which partition human activities into distinct phases and states/sub-phases based on gait analysis and sports science knowledge (Liu et al., 2021b).
In this paper, we introduce a method to further and more easily incorporate knowledge from other fields into Human Activity Recognition by enabling non-machine-learning experts to develop a recognition system. We propose a setup in which researchers define high-level properties of activities with their respective possible values, such as Backwards/Neutral/Forwards, on which the activity classification is performed. These features are initially extracted with out-of-the-box classifiers and can later be optimized or transformed into feature functions by machine learning experts. One requirement of said properties is that each activity differs from every other activity in at least one of them, making the final classification straightforward, possible errors attributable to the erroneous feature, and the extracted feature space highly interpretable.
2 FEATURE SPACE
The proposed method transforms activities into a high-level feature space with discrete, categorical features, where each combination of features is unique across activities, thereby breaking the activity classification problem into several sub-problems and a combination task. Instead of classifying a sequence as "Walk" directly, it is recognized as a forward movement (without Left/Right or Up/Down components) at low speed that includes a cyclic knee movement, and is therefore classified as "Walk".
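To make the uniqueness requirement concrete, the following minimal sketch (in Python) encodes each activity as a tuple of categorical feature values and checks that no two activities coincide; the concrete values are illustrative placeholders, not the exact assignments of Figure 1:

    # Activities as tuples of categorical high-level features
    # (Class, Left/Right, Up/Down, Back/Front); values are illustrative.
    ACTIVITIES = {
        "Walk":           ("walk", "Neutral", "Neutral", "Front"),
        "Walk90Left":     ("walk", "Left",    "Neutral", "Front"),
        "Walk90Right":    ("walk", "Right",   "Neutral", "Front"),
        "FallingForward": ("fall", "Neutral", "Down",    "Front"),
    }

    def distance(a, b):
        # Activity distance: the number of differing feature values.
        return sum(x != y for x, y in zip(a, b))

    # Uniqueness requirement: every pair differs in at least one feature.
    names = list(ACTIVITIES)
    assert all(distance(ACTIVITIES[a], ACTIVITIES[b]) >= 1
               for i, a in enumerate(names) for b in names[i + 1:])

The same count of differing feature values reappears as the activity distance in the confusion analyses of Section 4.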
The problem of choosing between, say, 22 different activities is thus reduced to multiple classification problems with only a handful of discrete feature values each, plus a final classification in this new feature space. Note that these sub-problems have more data available per target class but are not necessarily more straightforward to solve, as they include vastly different activity signals. Take "Stand" and "Falling Forward": while intuitively very different, both start in a standing position, involve little to no muscle activity, and show no movement to the left or right; they therefore share surprisingly many features and would be assigned the same target in several of them.
Setting these ideas into practice and evaluating them requires specific activities and data. In this case, two datasets are used, namely CSL-SHARE (Liu et al., 2021a) (called CSL19 in earlier literature) and UniMiB SHAR (Micucci et al., 2017). Both datasets mainly contain Activities of Daily Living (ADL) like "walk", "go upstairs", and "sit". In addition, the CSL-SHARE dataset includes several sport-related activities, such as "shuffle-Left/Right" and "V-cut", while the UniMiB SHAR dataset collects eight types of falls. The CSL-SHARE dataset applies four types of wearable sensors (two triaxial accelerometers, two gyroscopes, four EMG sensors, and one electrogoniometer) integrated into a knee bandage, while UniMiB SHAR uses the accelerometer of an Android smartphone to sense the signal. One particularly challenging aspect of the UniMiB SHAR data is that the participants wore the smartphone in different orientations, making it hard to distinguish directions and other movements.
The initial high-level features are chosen based on (Liu et al., 2021b) and are as follows: Class, Left/Right, Up/Down, and Back/Front. Class describes the base activity, e.g., "walk", "fall", and "lounge". The other three are based on the Six-Direction Nomenclature (6DN) (Liu et al., 2021b) and describe the translational movement of the entire body. In contrast to Motion Units, the assignment here is not on sub-sequences but on the whole-activity level; these properties are thus closer to phonetic features than to phonemes as used in Motion Units. Furthermore, in this paper the "anchored" aspect is replaced by adding a "neutral" option to each of the three directional axes, e.g., Left/Neutral/Right. This neutral option is vital, as all features need to be extractable for every activity to obtain a well-defined feature space.
However, with these four features alone, the feature combinations are not unique across activities, and the feature space is therefore extended. On the CSL-SHARE dataset, three features are added: starting Foot, muscle Force, and Knee angle. Foot is required to distinguish "V-Cut left with the left foot first" from "V-Cut left with the right foot first", while "Jump with one leg" requires more muscle force to lift the body's weight than "Jump with two legs". Lastly, "Sit" and "Stand" can be distinguished by comparing knee angles. On the UniMiB SHAR side, a Speed feature is added to distinguish "Running" from "Walking". Additionally, a Start position and an Impact feature are added to distinguish "Stand up from sitting" from "Stand up from laying" as well as "Hitting an obstacle" from "Syncope".
Figure 1: Assigned features per activity. Rows describe features with their respective categorical values per activity. Activities with the same value in a row share the same color. For example, the "FallingBack" and "FallingForward" activities have the same Start position marked in dark blue, but differ in the Back/Front feature (dark blue and white). No further meaning or order in the colors is intended, and color values are only consistent within each row, e.g., dark blue does not refer to the same value in the Start and Foot rows.
For each high-level feature, the activities are grouped based on their association and on the activity descriptions given by the data publishers. Figure 1 displays the assignments for both datasets. The full assignment is required only in the combination experiment (Section 6); the following extraction and classification experiments only use the features developed for each dataset, making the feature space more concise and the models simpler. For CSL-SHARE, the Class, Left/Right, Up/Down, Force, Foot, and Knee features are used. For UniMiB SHAR, the Activity, Back/Front, Left/Right, Up/Down, Start position, Speed, and Impact features are used.
The current choice of features can be improved and should be developed for a broader range of activities. As Figure 1 indicates, the 6DN features help distinguish activities across both datasets, and the Speed and Force features also help distinguish several activities. However, the Impact, Foot, and Start features were developed mainly for specific datasets and provide little information on the activities of the other dataset. Developing the features for a larger number of activities and datasets should prove helpful, especially when paired with a procedure that minimizes the number of features while maximizing activity distances. Clustering comes to mind, either with machine learning techniques or by manually and recursively splitting the activity pool. Furthermore, taking inspiration from other fields, such as the Bewegungslehre from sports science and education (Meinel and Schnabel, 1987) or the Labanotation from dance (Guest, 1977), could provide powerful features.
Note that the Start position feature differs from the other features in that it requires knowledge of the starting time of an activity. If the features were extracted not as a single vector but as a sequence, a plain Position feature (values Lay/Sit/Stand) would provide equivalent information, making Start a special case of the Position feature.
These choices already demonstrate the extensibility of the feature space and how it can be adjusted to new activities and datasets. However, they also show the difficulty and importance of good feature choices. A poor choice of features might leave activities indiscriminable in the feature space, either due to identical feature values or due to poor feature extraction. Furthermore, a set of features in which each feature distinguishes only a few activities will not scale, as each feature adds computational cost with little to no benefit for classification. The importance of feature choice also highlights the closeness between feature choice and activity definition: if no feature can distinguish two activities, are they indeed two different activities, and at which granularity should activities be distinguished? The proposed method makes modeling for human activity recognition more accessible to non-machine-learning experts and eases the incorporation of knowledge from other fields, emphasizing feature choice and design while abstracting away the machine learning aspects.
3 FEATURE EXTRACTION
High-level feature extraction here describes the process of extracting attributes like Left/Neutral/Right from a given motion data sequence. The attributes are categorical and lend themselves to treating the extraction as a classification problem. In the following, three extraction methods are explored and evaluated, based on Hidden Markov Models, Random Forests, and custom feature engineering. The main benefit of treating feature extraction as a classification problem is the easy extensibility: if a new feature is needed, only an annotation per activity is required to train an out-of-the-box classifier, like a Random Forest. However, Random Forests struggle with sequences of different lengths, and therefore the Hidden Markov Models are configured to work similarly to an out-of-the-box classifier.
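As a minimal sketch of this extraction-as-classification idea, one out-of-the-box classifier can be trained per high-level feature; the feature names, the fixed-length input representation, and the scikit-learn classifier below are illustrative assumptions, not the exact experimental setup:

    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical feature names; one classifier per high-level feature.
    FEATURES = ["LeftRight", "UpDown", "BackFront"]

    def train_extractors(X, annotations):
        # X: one fixed-length vector per sequence;
        # annotations: feature name -> list of categorical labels.
        return {f: RandomForestClassifier().fit(X, annotations[f])
                for f in FEATURES}

    def extract(extractors, x):
        # Project a single sequence into the high-level feature space.
        return {f: clf.predict([x])[0] for f, clf in extractors.items()}

Adding a new feature then amounts to adding one column of annotations and training one more classifier.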
The Random Forest is trained on the mean, sum, max, and min low-level features computed over each channel's first thousand samples. These low-level features follow the notion that trends like Left/Right or Up/Down accumulate over time. Limiting the input to the first thousand samples means that for UniMiB SHAR the whole sequence is used, while for CSL-SHARE the first second of an activity is used (the mean duration of most activities is around 1.7 seconds); this should be enough data to extract the chosen features and performed best in a cross-validation experiment.
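These summary statistics could be computed as follows (a sketch assuming a sequence array of shape (n_samples, n_channels); the function name is hypothetical):

    import numpy as np

    def rf_features(sequence, max_samples=1000):
        # Mean, sum, max, and min per channel over at most the
        # first thousand samples, as described above.
        window = sequence[:max_samples]
        return np.concatenate([window.mean(axis=0), window.sum(axis=0),
                               window.max(axis=0), window.min(axis=0)])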
The Hidden Markov Models can automatically learn to focus on the essential segments of a sequence, as well as on individual channels, through the choice of topology and Gaussian Mixture training, which helps given the chosen features. For instance, not all parts of walking or falling may be relevant to deciding whether the feature value should be Left, Neutral, or Right. In addition to the one-hot-encoded prediction of each feature, the HMM-based extraction also returns the HMM's confidence in the prediction as additional information for the combination task. This results in a feature space dimensionality of 39 on the CSL-SHARE dataset (33 one-hot feature values plus one confidence per feature, six in total) and 37 on the UniMiB SHAR dataset (30 one-hot values plus seven confidences). The setup and hyperparameter choices follow the findings of (Hartmann et al., 2021), including mean-based rotation removal for the UniMiB dataset. Accordingly, for both datasets, a sliding rectangular window is used, the mean, RMS, slope, and max features are calculated, and the whole sequence is normalized afterward. A grid search with a 10-fold person-independent evaluation was performed to determine the best window sizes, as the high-level features are less time-sensitive than activities, and longer trends, like a left curve, should be more pronounced with longer windows. Accordingly, 100 ms windows are used for the CSL-SHARE dataset (except for Left/Right, Up/Down, and Foot, which use 200 ms windows), while 200 ms windows are used for UniMiB SHAR (except for Left/Right and Back/Front, which use 400 ms windows).
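The windowing step could look as follows; the non-overlapping windows, the first-to-last slope estimate, and the z-normalization are simplifying assumptions here, with the exact setup following (Hartmann et al., 2021):

    import numpy as np

    def window_features(sequence, win):
        # sequence: (n_samples, n_channels); rectangular windows with
        # mean, RMS, slope, and max computed per channel.
        frames = []
        for start in range(0, len(sequence) - win + 1, win):
            w = sequence[start:start + win]
            rms = np.sqrt((w ** 2).mean(axis=0))
            slope = (w[-1] - w[0]) / win  # simple first-to-last slope estimate
            frames.append(np.concatenate([w.mean(axis=0), rms, slope,
                                          w.max(axis=0)]))
        frames = np.asarray(frames)
        # Normalize the whole sequence afterward, as described above.
        return (frames - frames.mean(axis=0)) / (frames.std(axis=0) + 1e-8)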
Each activity is modeled with one of three predetermined topologies: general-purpose, one step, and several steps. All three follow the same pattern, in which the first and last states are called "Random" and are shared across all activities, thus trimming the sequence without influencing the classification. The general-purpose topology contains three consecutive states, the one-step topology uses five, and several steps are modeled with three times five states, i.e., three steps. The inner three to fifteen states then learn the main distinguishing aspects of the feature. Technically, and for future experiments, the best-matching topology for a sequence could be chosen automatically.
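One way to realize such left-to-right topologies is through the transition matrix; the following numpy sketch illustrates the pattern, where the self-loop probability and the exact handling of the shared "Random" boundary states are assumptions:

    import numpy as np

    def left_to_right(n_inner, self_loop=0.5):
        # A shared "Random" state, n_inner emitting states (3 for
        # general-purpose, 5 for one step, 15 for several steps),
        # and a shared "Random" state at the end.
        n = n_inner + 2
        A = np.zeros((n, n))
        for i in range(n - 1):
            A[i, i] = self_loop          # stay in the current state
            A[i, i + 1] = 1 - self_loop  # or advance to the next state
        A[-1, -1] = 1.0                  # absorbing final state
        return A

    assert np.allclose(left_to_right(5).sum(axis=1), 1.0)  # rows are stochastic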
Figure 2: Balanced accuracies for feature extraction in a 10-fold person-independent cross-validation.
The feature extraction with HMMs and Random Forests is then evaluated in a 10-fold person-independent cross-validation. The balanced accuracy is chosen, as the feature targets can be highly imbalanced (see Figure 1). The results are depicted in Figure 2. The HMM-based extraction consistently outperforms the Random Forest one; accordingly, the following focuses on the HMM-based extraction. The features on the CSL-SHARE dataset are extracted with balanced accuracies in the high nineties, the lowest being the Left/Right feature at 96.1%. This performance is encouraging, especially when compared to the state-of-the-art activity classification accuracy of 93.7% in a leave-one-person-out cross-validation (Hartmann et al., 2021). Similar observations can be made on the UniMiB SHAR dataset: the balanced accuracies are around eighty percent, with Left/Right being an exception at 55.7%. While these accuracies are promising for the overall classification, it is crucial to note that errors on the feature level propagate and add up in the final classification. Specifically, the extraction of the Left/Right feature requires further attention.
A closer look into the Left/Right evaluation in Figure 3 reveals that while the feature is correctly extracted for almost all activities, the three activities "Walk", "Walk 90° left", and "Walk 90° right" cause problems. Note that "Walk 90°" refers to walking a curve over three steps, at the end of which a 90° turn is completed, rather than walking in a straight line. As Left/Right is the only feature distinguishing these activities in the high-level feature space, this error propagates to the activity-level classification (see Section 4).
Figure 3: Confusion matrix for the Left/Right feature, broken down into errors per activity.
The classifier-based feature extraction allows for effortless feature extension but adds computational cost. For this reason, it is worth investigating whether the features can also be extracted with hand-crafted functions; furthermore, investigating what the classifiers base their decisions on is of interest. As seen above, one of the main challenges and opportunities on the CSL-SHARE dataset is correctly identifying the direction in "Walk", "Walk 90° left", and "Walk 90° right". A closer look at the characteristics of these three activities shows that they have very distinct slopes in the readings of the lower gyroscope in the Left/Right direction, specifically during the swing phase, which ought to be exploited. Therefore, a cross-channel feature was developed that uses the goniometer to determine the swing phase, combined with extracting said slope from the lower gyroscope. This cross-channel feature, named AvgSlopeSwing in the following, is then classified by a random forest requiring only five estimators, and the whole setup is evaluated in a leave-one-person-out cross-validation scheme, where it performed on par with the HMM-based extraction at 72.6% accuracy in the three-class problem. While the behavior extracted with this feature fits most participants, it mixes directions with the neutral case for others, as Figure 4 illustrates. The feature demonstrates that the computational cost of classifier-based feature extraction can be reduced with feature engineering, which should be further pursued.
Figure 4: Average slope during swing phase feature, illustrated for Left/Neutral/Right walking with two selected subjects.
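A sketch of the AvgSlopeSwing idea is given below; the threshold-based swing-phase detection from the goniometer is an assumption for illustration, as the exact rule is not specified above:

    import numpy as np

    def avg_slope_swing(gonio, gyro_lr, swing_threshold=30.0):
        # Cross-channel feature: average slope of the lower gyroscope's
        # Left/Right axis during the swing phase, approximated here by
        # knee angles above a hypothetical threshold (in degrees).
        swing = gonio > swing_threshold
        if swing.sum() < 2:
            return 0.0  # no detectable swing phase
        return float(np.gradient(gyro_lr)[swing].mean())

The resulting scalar is then fed to the small random forest described above.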
4 CLASSIFICATION
The discriminatory power of the extracted high-level feature space, combined with the Hidden Markov Models' confidences, is investigated using a 10-fold person-independent cross-validation. The feature extractors are trained on the same training data as the combining classifier to avoid overestimating the method's actual performance. A Random Forest is used for classification in the high-level feature space.
Figure 5: CSL-SHARE confusion matrix and activity distances. Color indicates the activity distance in the high-level feature space, measured as the number of differing feature values. The number in each cell denotes the confusion in percent.
On the CSL-SHARE dataset, an overall accuracy of 85% is achieved, compared to the 93.6% reported previously (Hartmann et al., 2021). The main problems in classification occur where there are few distinguishing features or where the distinguishing features' extraction accuracy is lacking. Figure 5 shows that most confusions occur where the distance between activities is low. In that sense, the figure highlights areas of attention: high expected confusion due to low distance in red, and high distance with low expected confusion in blue. Confusion in low-distance areas increases further if the extraction of the distinguishing features struggles, as can be seen from Figures 1 and 2. For instance, the four "Spins" differ from their "V-Cut" counterparts only in the Force feature (and the Speed feature, which is omitted here). Another example are the "Walk" and "Walk 90°" activities, which differ in the Left/Right feature, which, as shown in Figure 3, has problems determining the direction of these three activities.
A similar result is found on the UniMiB SHAR dataset, with 64% accuracy compared to the 77.0% reported previously (Hartmann et al., 2021). The confusion matrix (Figure 6) reveals that the different falls are hard to discriminate due to only a few distinguishing features. "FallingLeft" and "FallingRight", for instance, differ mainly in the Left/Right feature, both from each other and from the other falls. However, as shown in Figure 2, the Left/Right feature is not extracted very well, possibly because there is only one example activity for each feature value in the UniMiB SHAR dataset. Similarly, the Start feature has some problems, which manifest when comparing "Stand up from sitting" to "Stand up from laying" and again when comparing "FallingBackSC" and "FallingBack". A closer look into the projected high-level feature space shows that the extraction errs 50% of the time in the Left/Right feature and 70% in the Start feature, leaving the final classifier little chance to predict the activity correctly.
These classification evaluations show that while this setup trails the reported state-of-the-art performances by roughly ten points on both datasets, the method has potential. Specifically, due to its high interpretability, errors can be attributed and taken into account when further developing the features and feature extraction.
Figure 6: UniMiB SHAR confusion matrix and activity distances. Color indicates the activity distance in the high-level feature space, measured as the number of differing feature values. The number in each cell denotes the confusion in percent.
5 IMBALANCED DATA AND
FEW-SHOT LEARNING
A significant advantage of the proposed method is that data from different activities is combined in the high-level feature extraction, resulting in a higher number of data points for each feature value. This data sharing allows features to be extracted for unseen activities and for activities with low occurrence in the dataset, while at the same time maintaining interpretability. Few-shot learning and imbalanced data are especially interesting in human activity recognition, as there is a vast number of different activities that could be picked up quickly with a few-shot learning algorithm. Furthermore, some activities are hard to record and have little training data available but require accurate recognition. Falls are a common example: in current datasets they are typically simulated, yet they are crucial to recognize accurately.
The following experiment evaluates the high-level classifier on the full data except for one activity, of which only a single sample is included in the training data, in a 5-fold person-independent cross-validation. This procedure is repeated for each activity and with varying numbers of given samples. The activity occurrences in the CSL-SHARE dataset are mostly equally distributed at roughly 20 recordings per activity and participant. Therefore, four sample sizes are evaluated: 1 sample (0.3% of the otherwise available samples in the 5-fold scheme), 8 (2.5%), 16 (5%), and 32 (10%). In the UniMiB SHAR dataset, the occurrences differ per activity, ranging from 153 samples ("Standing up from sitting") up to 1985 ("Running"), with a median of roughly 500 samples. Accordingly, the following numbers of samples were evaluated: 1 sample (0.25% on average), 10 (2.5%), 20 (5%), and 40 (10%). As the training data is highly imbalanced and includes only a few training examples, the high-level classifier is switched from a Random Forest to a Balanced Random Forest utilizing undersampling (Chen et al., 2004). All other parameters remain the same, including the feature extraction. Four metrics are evaluated: the overall accuracy, the F1 score of the low-sample activity, the percentage of samples with fully correctly extracted features, and the amount of error in the extracted features (calculated as the mean difference between the one-hot-encoded feature vectors, without confidences). The overall accuracy can be compared with the classification results in Section 4 and indicates the overall performance implication if one activity is low-resourced. The F1 score indicates how well the low-resourced activity is recognized after training. The percentage of samples with correctly extracted features indicates how well the feature extraction works. This percentage might differ from the F1 score, as the activity-level classifier might compensate for wrongly extracted features but might also ignore correctly extracted features due to a bias learned in training. The error metric complements this by indicating how far off the feature extraction was from the correct feature values.
Figure 7: Performance on CSL-SHARE and UniMiB SHAR with the four evaluated metrics, averaged across the 5-fold person-independent cross-validation and the rotating low-resourced activity.
Figure 7 shows the results of this evaluation. Both the F1 score and the proportion of correct extractions increase, and the mean extraction error decreases, with the number of shown samples, while the overall accuracy remains stable and close to the scores found in the classification experiments (see Section 4). The latter, combined with the low F1 score, indicates that all other activities perform well independently of the low-resourced activity. The low F1 score shows that this setup struggles with one-shot learning but can perform well with low-resourced activities beyond one-shot learning, coming close to full-data performance with only 10% of the original data.
Note that the mean error of the extracted features is below 20% for both datasets (CSL-SHARE one-shot is at 14.2% and UniMiB SHAR at 19.5%). Recall that the one-hot-encoded feature space has 33 (CSL-SHARE) and 30 (UniMiB SHAR) dimensions, meaning that on average fewer than three features are extracted incorrectly in the one-shot experiment: an error in one feature flips two entries of the one-hot encoding, so the number of wrong features is the dimensionality times the error divided by two, e.g., 33 × 0.142 / 2 ≈ 2.3 for CSL-SHARE. In conclusion, while the classification of low-resourced activities should be improved, the feature extraction errs in few, but activity-distinguishing, features, and improving these would improve few-shot learning.
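The error metric and the back-of-the-envelope conversion above can be made concrete; a sketch assuming one-hot matrices of shape (n_samples, n_dims):

    import numpy as np

    def extraction_error(pred_onehot, true_onehot):
        # Mean absolute difference between one-hot feature vectors
        # (confidences excluded). One wrong categorical feature flips
        # two one-hot entries, so wrong features ~= dims * error / 2.
        return np.abs(pred_onehot - true_onehot).mean()

    # Worked example for CSL-SHARE one-shot: 33 dimensions at 14.2% error
    # gives roughly 33 * 0.142 / 2, i.e., about 2.3 wrong features.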
6 DATASET COMBINATION
Another major advantage of the proposed method is the universality of the extracted features and their independence from the sensor setup, which opens up the opportunity to combine and compare datasets in this high-level feature space. Of course, there are limitations, and not every feature makes sense for every dataset; however, this can be accounted for when choosing the features. The following experiment is a proof of concept and combines the aforementioned datasets CSL-SHARE and UniMiB SHAR in a stratified person-dependent evaluation. This scheme was chosen to ensure that each activity occurs in both the training and test sets, which is challenging in a person-independent evaluation across datasets with different participants. The features listed in Section 2 for each of the datasets are combined as displayed in Figure 1. Therefore, more features are extracted than in the classification and few-shot learning experiments, though not necessarily more informative ones. The feature extraction is trained individually for the different datasets but, as in the previous experiments, within the evaluation scheme. The activity-level classification is then done in the resulting feature space, not unlike an interpretable equivalent of swapping the first layers of a Neural Network depending on the input while keeping the last layers fixed. The activity-level classification is therefore independent of the dataset, which means that "V-Cuts" (CSL-SHARE) now need to be distinguished from "Falls" (UniMiB SHAR). At the same time, both datasets share activities like "Walk", resulting in more samples than in each individual dataset. The resulting 74% accuracy lies between the 85% on the CSL-SHARE and the 64% on the UniMiB SHAR dataset (see Section 4). The confusion matrix is omitted here for brevity. The errors made during classification are the same as in the previous individual-dataset experiments, except for the "Walk" activity: "Walk" is now less confused with the "Walk 90°" activities and seems to have benefited from the additional samples or from the additional features not present in the previous experiments. The effects and possible benefits of the data combination will be investigated in future work. The dataset combination demonstrates that combining datasets with this method is possible, which creates a foundation for finding commonalities and suitable features for activities across different settings.
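A sketch of the combination pipeline, with one extractor per dataset feeding a single shared activity-level classifier; all names and the transform interface are hypothetical:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def project(extractors, sequences, dataset_ids):
        # Map each sequence into the shared high-level feature space using
        # the extractor trained for its dataset of origin.
        return np.array([extractors[d].transform(s)
                         for s, d in zip(sequences, dataset_ids)])

    def train_combined(extractors, sequences, dataset_ids, labels):
        # The activity-level classifier is dataset-independent, akin to
        # swapping a network's first layers per input while keeping the
        # last layers fixed.
        X = project(extractors, sequences, dataset_ids)
        return RandomForestClassifier().fit(X, labels)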
7 CONCLUSIONS
This paper introduces a novel way of classifying human activities based on unique combinations of interpretable high-level features for each activity. The extracted features are interpretable and easily extensible, and allow the comparison and combination of different datasets in the high-level feature space. Three approaches to feature extraction, using HMMs, Random Forests, and custom high-level feature functions, were proposed and evaluated, of which the Hidden Markov Models performed best across the two datasets CSL-SHARE and UniMiB SHAR. Classification experiments investigated how well the extraction, combined with a Random Forest for final classification, could perform. In a person-independent 10-fold cross-validation, they performed well, at 85% on CSL-SHARE compared to the 93.6% state of the art, and at 64% compared to the 77.0% state of the art on the UniMiB SHAR dataset (Hartmann et al., 2021). Furthermore, few-shot learning experiments were conducted, in which one-shot learning did not succeed, but the low extraction error rates and increasing F1 scores in few-shot learning are encouraging. Additionally, an experiment combining the two datasets showed the potential and promise of developing human activity recognition systems across data sources. Remarkably, the errors in both the classification and few-shot learning experiments are attributable, and the next steps for increased performance are clear: deepened development and choice of features along with their extraction methods.
The next two major steps are thus to further investigate and develop high-level features, and to extract them as sequences rather than as vectors to enable online recognition. High-level features will be developed by borrowing from previous HAR work, from sports knowledge, and even from decades of findings and criteria in dance. The main challenges for sequence extraction are creating the ground truth and addressing varying sampling rates across datasets. Ground truth creation likely requires manual data annotation and might only be possible for certain high-level features and where the dataset provides video examples. A sliding window approach could address the sequence problem when extracting features for a single dataset, but it does not scale easily across datasets due to different sampling rates. The slowly changing nature of high-level features might enable re-sampling and should be investigated in future work. These and other topics, including estimating a performance ceiling with Neural Networks and extending to further datasets and modalities, are future work for these high-level features.
REFERENCES
Amma, C., Gehrig, D., and Schultz, T. (2010). Airwrit-
ing recognition using wearable motion sensors. In
First Augmented Human International Conference,
page 10. ACM.
Arifoglu, D. and Bouchachia, A. (2017). Activity recog-
nition and abnormal behaviour detection with recur-
rent neural networks. Procedia Computer Science,
110:86–93.
Bakis, R. (1976). Continuous speech recognition via cen-
tisecond acoustic states. The Journal of the Acoustical
Society of America, 59(S1):S97–S97.
Chen, C., Liaw, A., and Breiman, L. (2004). Using Random
Forest to Learn Imbalanced Data. Technical report.
Dickinson, S. J., Leonardis, A., Schiele, B., and Tarr, M. J.
(2009). Object categorization: computer and human
vision perspectives. Cambridge University Press.
Guest, A. H. (1977). Labanotation: Or, Kinetography La-
ban : the System of Analyzing and Recording Move-
ment. Number 27. Taylor & Francis.
Hartmann, Y., Liu, H., and Schultz, T. (2021). Feature
space reduction for human activity recognition based
on multi-channel biosignals. In Proceedings of the
14th International Joint Conference on Biomedical
Engineering Systems and Technologies, pages 215–
222. INSTICC, SciTePress.
Liu, H., Hartmann, Y., and Schultz, T. (2021a). CSL-
SHARE: A multimodal wearable sensor-based human
activity dataset. Frontiers in Computer Science.
Liu, H., Hartmann, Y., and Schultz, T. (2021b). Motion
Units: Generalized sequence modeling of human ac-
tivities for sensor-based activity recognition. In EU-
SIPCO 2021 - 29th European Signal Processing Con-
ference. IEEE.
Meinel, K. and Schnabel, G. (1987). Bewegungslehre - Sportmotorik: Abriß einer Theorie der sportlichen Motorik unter pädagogischem Aspekt. Meyer & Meyer Verlag, Aachen, 12., ergänzte Auflage edition.
Micucci, D., Mobilio, M., and Napoletano, P. (2017).
UniMiB SHAR: A dataset for human activity recog-
nition using acceleration data from smartphones. Ap-
plied Sciences, 7(10):1101.
Oniga, S. and Sütő, J. (2014). Human activity recognition using neural networks. In Proceedings of the 15th International Carpathian Control Conference, pages 403–406. IEEE.
Ordóñez, F. J. and Roggen, D. (2016). Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors, 16(1):115.
Rebelo, D., Amma, C., Gamboa, H., and Schultz, T. (2013).
Human activity recognition for an intelligent knee or-
thosis. In BIOSIGNALS 2013 - 6th International Con-
ference on Bio-inspired Systems and Signal Process-
ing, pages 368–371.
Ronao, C. A. and Cho, S.-B. (2014). Human activity recognition using smartphone sensors with two-stage continuous hidden Markov models. In ICNC 2014 - 10th International Conference on Natural Computation, pages 681–686. IEEE.
Ronao, C. A. and Cho, S.-B. (2016). Human activity recog-
nition with smartphone sensors using deep learning
neural networks. Expert systems with applications,
59:235–244.
Scheirer, W. J., Anthony, S. E., Nakayama, K., and Cox,
D. D. (2014). Perceptual annotation: Measuring hu-
man vision to improve computer vision. IEEE trans-
actions on pattern analysis and machine intelligence,
36(8):1679–1686.
Schultz, T. (2002). GlobalPhone: A multilingual speech and text database developed at Karlsruhe University. In ICSLP 2002 - 7th International Conference on Spoken Language Processing.
Singh, D., Merdivan, E., Psychoula, I., Kropf, J., Hanke,
S., Geist, M., and Holzinger, A. (2017). Human ac-
tivity recognition using recurrent neural networks. In
CD-MAKE 2017 - International Cross-Domain Con-
ference for Machine Learning and Knowledge Extrac-
tion, pages 267–274. Springer.
Tuncer, T., Ertam, F., Dogan, S., Aydemir, E., and Pławiak,
P. (2020). Ensemble residual network-based gender
and activity recognition method with signals. The
Journal of Supercomputing, 76(3):2119–2138.
Uddin, M. Z., Thang, N. D., Kim, J. T., and Kim, T.-S. (2011). Human activity recognition using body joint-angle features and hidden Markov model. ETRI Journal, 33(4):569–579.
Yang, J., Nguyen, M. N., San, P. P., Li, X., and Krish-
naswamy, S. (2015). Deep convolutional neural net-
works on multichannel time series for human activity
recognition. In IJCAI, volume 15, pages 3995–4001.
Buenos Aires, Argentina.
Youngblood, G. M. and Cook, D. J. (2007). Data mining
for hierarchical model creation. IEEE Transactions on
Systems, Man, and Cybernetics, Part C (Applications
and Reviews), 37(4):561–572.