Approaches towards Resource-saving and Explainability/Transparency of Deep-learning-based Image Classification in Industrial Applications

Constantin Rieder, Markus Germann, Samuel Mezger and Klaus Peter Scherer

Institute for Automation and Applied Informatics, Karlsruhe Institute of Technology,

Hermann-von-Helmholtz-Platz 1, Eggenstein-Leopoldshafen, Germany

Keywords: Deep Learning, Information Systems, Intelligent Assistance, Transparency, Continuous Self Learning, Explainability, Neural Networks, Resource Saving Image Classification.

Abstract: This work presents a new approach to concept-neutral access to information, in particular
visual information. In contrast to language-neutral access, concept-neutral access does not require
the user to know the precise names or IDs of components. Language-neutral systems usually work with
language-neutral metadata, such as IDs (unique terms) for components. Concept-neutral access
therefore makes information considerably easier to reach for users who do not know these IDs. The AI
models responsible for recognition visualize their decisions transparently and rate each recognition
with quality criteria (confidence) that are still to be developed. To the authors' knowledge, this
has not yet been done in an industrial setting. The use of performant models in a mobile, low-energy
environment is likewise novel and not yet established in an industrial setting.

1 INTRODUCTION

Complex machines must run reliably and at high performance in daily operation. The technical
service of the respective manufacturer sustains this performance through worldwide networks.
Malfunctions must be resolved quickly, and components have to be replaced safely. For this purpose,
manufacturers provide comprehensive technical information systems that describe spare parts
management, diagnosis, maintenance and repair of machines and systems. Fast and unambiguous access
to such information is therefore of crucial importance.

1.1 Initial Situation

Previous research projects have developed basic methods to analyze existing information and to
"semantify" the content using ontologies. Often, a multimodal approach to semantic access is
implemented, in which voice assistants, audio recordings, and imaging are used to generate the query
to the system. Several case studies have shown that multimodal access to information generally
increases efficiency. The prerequisite, however, is that the user knows the exact term for the
defective component or the malfunction. These technical terms are often unknown, so service
technicians cannot find the appropriate documentation, which leads to incorrect repairs and/or
increased downtimes. Applying semantic technologies and observing a set of requirements can
drastically improve access to resources in information systems (Baumeister, 2016). As a further
enhancement, Deep Learning technologies can simplify and support information access for the user.
The methods presented in this paper address two important aspects of using Deep Learning for this
purpose: transparency and resource conservation.

1.2 Motivation and Databases

The present research project aims to develop concept-neutral access to information by means of
artificial intelligence (AI) image understanding methods and to exploit this technology industrially
in semantic information systems. Two central aspects emerge:

1. Need for resource-saving models: In the future, information systems are to be used "in the
field" on mobile devices. For this purpose, the learned recognition models have to be economical
with memory and processor resources in order to be universally usable. Current architectures of
image-processing AI models are typically expensive in memory and computation, which prevents
responsive operation on low-end devices.

2. Need for transparent recognition models that are open to critical feedback: Especially in
industrial use, incorrectly recognized concepts are very unsatisfactory for technicians. A
criticizable recognition model can intuitively visualize how it detected a concept (object class) in
the current video image and which alternative concepts would also be possible, with correspondingly
lower recognition quality (Kuwajima et al., 2019). Because of the black-box nature of a deep neural
network, it is otherwise difficult for the user to understand which input data influenced its
decisions (Xie et al., 2020). Such transparency increases user acceptance and can be utilized for
the continuous self-learning of the model. It also suggests further actions, such as increasing the
complexity of the neural network or collecting further training data to improve the existing network
(Kuwajima et al., 2019; Fazi, 2020).

Rieder, C., Germann, M., Mezger, S. and Scherer, K.
Approaches towards Resource-saving and Explainability/Transparency of Deep-learning-based Image Classification in Industrial Applications.
DOI: 10.5220/0010575901640169
In Proceedings of the 2nd International Conference on Deep Learning Theory and Applications (DeLTA 2021), pages 164-169
ISBN: 978-989-758-526-5

2 DATASETS AND ACQUISITION

For the development of the image recognition component of the information system, test objects
(concepts) were specified on which raw data are recorded and different recognition techniques can be
tested. During data acquisition for the training corpus, the components to be recognized were filmed
in short video clips from different angles and against different backgrounds. A bicycle was chosen
as the first test object, followed by a car as a more complex structure. Individual components such
as "saddle" or "rear wheel", and for the car "turn signal left" or "steering wheel", can be labeled
comprehensibly and then recognized by solving a classification problem with neural networks.

Figure 1: Defined bicycle concepts.

The labels for the components of the bicycle were defined beforehand in a fixed structure (see
Figure 1). A structure template with hierarchical refinement was also defined for the car (see
Figure 2). The recorded training data are associated with labels by sorting them into file folders
named after the corresponding label. The component structure of the car consists of about 100
concepts and 100 corresponding folders. For example, all clips of the handbrake are sorted into the
directory named "Handbrake". The directory structure thus helps to organise the clips, and the
directory names serve as class labels. A relatively large number of labeled frames can then be
extracted from the sorted clips. For the frames to be as varied as possible, different angles and
lighting conditions should be covered when filming.
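The folder-based labeling described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the list of video file extensions, and the fixed sampling step are hypothetical and not taken from the project; decoding the frames themselves (for example with OpenCV) is left out.

```python
from pathlib import Path

# Hypothetical set of clip file extensions; the paper does not name a format.
VIDEO_SUFFIXES = {".mp4", ".mov", ".avi"}


def collect_labeled_clips(root):
    """Return (label, clip_path) pairs, using each folder name as the class label."""
    pairs = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        for clip in sorted(class_dir.iterdir()):
            if clip.suffix.lower() in VIDEO_SUFFIXES:
                pairs.append((class_dir.name, clip))
    return pairs


def sample_frame_indices(frame_count, step):
    """Pick every `step`-th frame index so that one clip yields many labeled frames."""
    return list(range(0, frame_count, step))
```

Every frame taken from a clip inherits the label of the folder the clip was found in, so no per-frame annotation is needed.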

Figure 2: Defined vehicle concepts.

The concepts include various interior and exterior components of the vehicle. To refine the
structure, they are divided into categories such as control elements, body, engine compartment, and
electronics.

3 CONCEPT AND MODELING

This section describes the concept of transparency

and resource saving in more detail.

Deep Learning models are only as good as the data that were used for training. Bad or outlier data
can lead to unsuitable models. If, as in this case, a model identifies technical components
incorrectly or not at all, we know that the decision is wrong and the model must be corrected. To
find the reason for the wrong decision, the input data have to be analyzed. Because the training set
is too large to inspect in full, a manageable number of images is selected to represent it and is
presented to the user for review. Resource saving, in turn, refers to the self-sufficient use of the
image recognition models on the device itself. Certain requirements must be met so that inference
can be executed in this way.

3.1 Explainability/Transparency

Transparency can be assessed, among other things, on the basis of the following two criteria
(called transparency criteria in the following), which a user should be able to judge:

• The reliability of the system (or the model) in finding the solution, i.e. the confidence of the
recognition. Typical questions are "Are there other classes that come into question?" and "Which
are the next best classes in terms of recognition?". In addition, criteria for capping the list of
alternatives are necessary.

• An explanation of why the system inferred the chosen class.
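The first criterion can be illustrated with a small sketch that ranks the classes by confidence and caps the list. The softmax step and both capping parameters (`max_items`, `min_confidence`) are assumptions made for this example; the paper leaves the concrete capping criteria open.

```python
import math


def softmax(logits):
    """Convert raw class scores into confidences that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ranked_alternatives(logits, labels, max_items=3, min_confidence=0.05):
    """Return the best classes with their confidence, capped by count and threshold.

    max_items and min_confidence are hypothetical capping criteria.
    """
    scored = sorted(zip(softmax(logits), labels), reverse=True)
    kept = [(label, conf) for conf, label in scored[:max_items] if conf >= min_confidence]
    # Always keep the top class so that the user sees at least one result.
    return kept or [(scored[0][1], scored[0][0])]
```

A call such as `ranked_alternatives(scores, class_names)` then yields the recognized class followed by the next best candidates, which is the information a technician needs in order to question a result.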

Possible methods for selecting representative images include random sampling, metrics such as
difference measures, and clustering. The representatives can either be pre-selected statically
before delivery to the app or be generated dynamically at runtime on the device (possibly using a
currently captured image); heat maps and bounding boxes are further options for explaining a
decision. Offering alternative outcomes for selection could help to improve the prediction when, for
example, gradient noise or the high dimensionality of the data causes the model to overreact, so
that many small pixel effects accumulate and interfere with the model output (Samek et al., 2021).

The structure of the transparency component of the system is illustrated in Figure 3.

Figure 3: Architectural sketch for transparency unit.

The top component (Frame Extractor) extracts frames from the video material and makes them
available to the representative search. The representative search compares images with one another
and, for this purpose, relies on a separate component that provides comparison algorithms from four
categories: histogram comparison, image hashing, the structural similarity index, and the mean
square error. The following histogram algorithms from the OpenCV library (OpenCV, 2021a) are used:

• Correlation: Computes the correlation between the two histograms.

• Chi Square: Applies the Chi-Squared distance to the histograms.

• Intersection: Calculates the intersection between two histograms.

• Bhattacharyya: Applies the Bhattacharyya distance, which measures the "overlap" between the two
histograms.
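To make the four histogram measures concrete, the following sketch restates them in plain Python, following the formulas given in the OpenCV histogram comparison tutorial. It is meant as an illustration only; in practice the measures are available through `cv2.compareHist` with the methods `HISTCMP_CORREL`, `HISTCMP_CHISQR`, `HISTCMP_INTERSECT` and `HISTCMP_BHATTACHARYYA`. The function names below are chosen for this example.

```python
import math


def correlation(h1, h2):
    """Pearson correlation of two histograms (1.0 means a perfect match)."""
    m1, m2 = sum(h1) / len(h1), sum(h2) / len(h2)
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = math.sqrt(sum((a - m1) ** 2 for a in h1) * sum((b - m2) ** 2 for b in h2))
    return num / den if den else 1.0


def chi_square(h1, h2):
    """Chi-squared distance (0.0 means identical); bins empty in h1 are skipped."""
    return sum((a - b) ** 2 / a for a, b in zip(h1, h2) if a > 0)


def intersection(h1, h2):
    """Sum of bin-wise minima (larger means more overlap)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def bhattacharyya(h1, h2):
    """Bhattacharyya distance (0.0 means identical, 1.0 means no overlap)."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    if m1 == 0 or m2 == 0:
        return 1.0  # an empty histogram shares nothing with any other
    overlap = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    return math.sqrt(max(0.0, 1.0 - overlap / math.sqrt(m1 * m2 * n * n)))
```

Note that the measures point in different directions: for correlation and intersection a larger value means more similar images, whereas for the Chi-Squared and Bhattacharyya distances a smaller value does.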

Histograms represent the distribution of colors in an image. A histogram can be displayed as a
graph that gives a good impression of the intensity distribution (pixel values). Assuming an RGB
color space with 8-bit encoding, the pixel values of each channel range from 0 to 255. For other
color spaces, the value range differs accordingly. The following image hashing algorithms from
OpenCV (OpenCV, 2021b) and ImageHash (ImageHash, 2021) are used:

• Color Hash

• Average Hash

• Wavelet Hash

• Perceptual Hash

• Difference Hash

The color hash is an image hash based on color moments. It analyzes the color distribution and the
black and gray components without position information. The other image hash algorithms (Average,
Perceptual, Difference, Wavelet) analyze the luminance structure of the image without color
information (ImageHash, 2021).
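The idea behind the luminance-based hashes can be shown with a simplified sketch of the average hash and the difference hash. It assumes that the image has already been converted to grayscale and shrunk to a small grid (given here as a list of rows); it is not the code of the ImageHash or OpenCV libraries, and the function names are chosen for this example.

```python
def average_hash(gray):
    """Set a bit for every pixel that is brighter than the mean of the image."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def difference_hash(gray):
    """Set a bit wherever the right-hand neighbor is brighter than the pixel."""
    return [1 if row[i + 1] > row[i] else 0 for row in gray for i in range(len(row) - 1)]


def hamming_distance(hash_a, hash_b):
    """Count differing bits; a small distance indicates similar images."""
    return sum(a != b for a, b in zip(hash_a, hash_b))
```

Because both hashes only encode relative brightness, a uniformly brighter copy of an image produces the same bits, which is what makes such hashes useful for finding near-duplicate frames.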

The structural similarity index (SSIM) is another method for measuring the similarity between two
images. It is often used in television and cinema. It is a so-called perception-based model and is
calculated over different parts of the image (Brunet et al., 2012). Another measure of image quality
and an established method for comparing images is the mean square error (MSE) (Gonzalez and Woods,
2018). It averages the squared differences between corresponding pixel values.
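Both measures can be written down compactly. The sketch below works on two equally long flat lists of gray values and, as a simplification, evaluates SSIM once over all pixels, whereas SSIM is normally computed over local windows and then averaged. The constants 0.01 and 0.03 are the commonly used defaults; the function names are chosen for this example.

```python
def mse(x, y):
    """Mean of the squared pixel differences; 0.0 means identical images."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def global_ssim(x, y, dynamic_range=255.0):
    """SSIM evaluated once over all pixels instead of over local windows."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Commonly used stabilizing constants for the SSIM formula.
    c1, c2 = (0.01 * dynamic_range) ** 2, (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

MSE grows without bound as images diverge, while SSIM stays at or below 1 and reaches 1 only for identical inputs, so the two values need different thresholds when they are used to pick representatives.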

3.2 Resource Saving

In the future, information systems will be used "in the field" on mobile devices. For this, the
learned recognition models must be economical with memory and processor resources so that they can
be deployed widely. Current architectures of image-processing AI models are typically extremely
memory- and compute-intensive, which makes their use on microdevices impractical. The configuration
and realization of image classification in this work therefore aims at the following features:

• Usage-related Features

– Lowest possible memory consumption

– Battery-saving processor performance

– High performance (highest possible accuracy)

• Acquisition Technical Features

– Efficient training of the network with as little sample data as possible (short videos)

– Degradation studies of how performance declines with less sample data

A Raspberry Pi 4 was chosen as the hardware for the experiments, because it is a small,
credit-card-sized computer that meets our requirements. Unlike other mobile devices of this size, it
is capable of running a full desktop operating system. Running Linux, it proves to be extremely
flexible and reliable in use. It therefore forms a suitable experimental platform with regard to
mobility and resource conservation.

4 REALISATION

Neural networks are not meant to be trained or built on the Raspberry Pi. Instead of the complete
TensorFlow package, it is therefore sufficient to install only the significantly smaller
interpreter, so TensorFlow Lite (TensorFlow, 2021) was installed on the Raspberry Pi hardware
platform.
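A minimal sketch of interpreter-only inference might look as follows. It assumes that the `tflite_runtime` package is installed on the device and that the input batch has already been resized and converted to the shape and data type the model expects; the function names are chosen for this example and the paper does not show the project's own code.

```python
def load_interpreter(model_path):
    """Create a TensorFlow Lite interpreter; needs the tflite_runtime package."""
    from tflite_runtime.interpreter import Interpreter  # imported lazily

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    return interpreter


def classify(interpreter, batch):
    """Run one inference and return the index of the best class with all scores.

    `batch` must already match the shape and dtype of the model input.
    """
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    interpreter.set_tensor(input_index, batch)
    interpreter.invoke()
    scores = list(interpreter.get_tensor(output_index)[0])
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

On the device, `classify(load_interpreter("model.tflite"), batch)` would return the predicted class index together with all class scores; the file name is a placeholder.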

4.1 Experimental Setup

For the first experimental setup, MobileNetV2 was used and trained on 13 classes. It is a
convolutional neural network architecture that uses, among other things, depthwise separable
convolutions. The result is a streamlined and lightweight deep neural network that can perform well
on mobile devices (Sandler et al., 2019). This network served as the base reference, i.e. as the
unoptimized model. In further steps it was converted or quantized with TensorFlow Lite into
different formats in order to shrink it further, which resulted in the following smaller versions of
the base model:

• TF Lite Model

• Quantized Float16 Model

• Quantized INT8 Model

• Quantized Dataset
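The four variants could be produced with the TensorFlow Lite converter roughly as sketched below. The mapping of the variant names to converter settings is an assumption: the paper does not list the settings used, so "int8" is taken here to mean default post-training quantization of the weights and "dataset" to mean the same optimization calibrated with a representative data set. The function and variant names are hypothetical.

```python
VARIANTS = ("tflite", "float16", "int8", "dataset")


def convert(saved_model_dir, variant, representative_batches=None):
    """Convert a SavedModel into one TensorFlow Lite variant; returns the model bytes."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    if variant == "dataset" and representative_batches is None:
        raise ValueError("the dataset variant needs representative input batches")
    import tensorflow as tf  # imported lazily; only needed on the training machine

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    if variant != "tflite":
        # Any variant other than the plain conversion enables quantization.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    if variant == "float16":
        converter.target_spec.supported_types = [tf.float16]
    if variant == "dataset":
        # The converter calls this generator to observe typical value ranges.
        converter.representative_dataset = lambda: ([b] for b in representative_batches)
    return converter.convert()
```

The returned bytes can be written to a `.tflite` file and copied to the device, which keeps the full TensorFlow installation off the Raspberry Pi.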

4.2 Results

To compare the models (see Figure 4), dedicated software was developed that runs the entire test
data set through each network and evaluates the predictions.

Figure 4: Shrinking models.

Both the top-1 (see Figure 5) and top-5 (see Figure 6) hit rates were evaluated, and the time the
Raspberry Pi needs to perform the inference was measured (see Figure 7).
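The two quantities evaluated here, the top-k hit rate and the inference time, can be computed as in the following sketch. It is an illustration only and not the evaluation software of the project; `predict` stands for any callable that runs one inference, and all names are chosen for this example.

```python
import time


def top_k_hit_rate(score_rows, true_labels, k):
    """Fraction of samples whose true class is among the k highest scores."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)


def mean_latency_ms(predict, samples):
    """Average wall-clock time of one call to `predict`, in milliseconds."""
    start = time.perf_counter()
    for sample in samples:
        predict(sample)
    return (time.perf_counter() - start) * 1000.0 / len(samples)
```

With 13 classes, the top-5 hit rate is by construction at least as high as the top-1 hit rate, so the two figures are best read side by side.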

The model implemented with Keras is about 30 MB in size in SavedModel format and achieves a total
accuracy of 97%. However, this model cannot be loaded or executed with the TFLite interpreter on the
Raspberry Pi.

Figure 5: Top-1 accuracy for bike components.

Figure 6: Top-5 accuracy for bike components.

With its default settings, the TensorFlow Lite Converter converts the model to the FlatBuffer
format without changing the numerical values. The weights are stored in the FLOAT32 data type, i.e.
as 32-bit floating point numbers. Therefore, the accuracy hardly changes compared to the original
model. To recognize the component shown in an image with this model, the Raspberry Pi needs 472 ms
(see Figure 7).

Figure 7: Total inference time in seconds.

Figure 8: Total accuracy.

Quantization is an optimization method that achieves a smaller memory size in exchange for less
precise numerical values. Depending on the processor and data type, it can also shorten latency.
Quantizing to FLOAT16, i.e. 16-bit floating point numbers, can cut the model size in half without
seriously affecting the accuracy of the predictions. In addition, this quantization reduced the time
to compute the output by about 20 ms. If the precision of the weights is reduced to only 8 bits, so
that the model is only 3.5 MB in size, the accuracy rate decreases significantly (see Figure 8).
Evidently, information is lost that the artificial neural network needs to classify the images
correctly. To reduce the size of the model further without losing too much accuracy, a
representative data set can be used to distinguish more precisely between relevant and unimportant
information: the range of values that the variables assume when pixel information is passed through
the network is examined. With this highly effective method, the model could be reduced to only 4 MB
while keeping a hit accuracy that is no more than one percent lower than that of the original model.
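Why an observed value range helps can be seen from the arithmetic of 8-bit affine quantization, sketched below in plain Python. The sketch illustrates the principle only and is not the converter's implementation: the value range seen on sample inputs determines a scale and a zero point, and every float is then stored as an integer between -128 and 127. All names are chosen for this example.

```python
def calibrate(observed_values):
    """Derive scale and zero point of an 8-bit grid from values seen on sample inputs."""
    low, high = min(min(observed_values), 0.0), max(max(observed_values), 0.0)
    scale = (high - low) / 255.0 or 1.0
    zero_point = round(-128 - low / scale)
    return scale, zero_point


def quantize(value, scale, zero_point):
    """Map a float to a signed 8-bit integer, clamping values outside the range."""
    return max(-128, min(127, round(value / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Recover the approximate float that the 8-bit integer stands for."""
    return (q - zero_point) * scale
```

The narrower the calibrated range, the smaller the scale and hence the rounding error per value, which is one way to read the gain from examining the value ranges on representative data.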

5 CONCLUSION

In many application areas, such as technical service on site in the field, these resource-efficient
and explainable AI models have proven to be extremely useful and efficient in a first approach,
because

• The necessary expertise on the machine is provided locally.

• By virtue of its architecture, the system has a high degree of self-sufficiency.

• The data and the computational load remain local.

• The system copes with limited energy resources and is able to explain its results.


Neural network architectures have been designed, developed, and implemented in this work to be
resource-efficient and to assist the service technician with image understanding procedures on
self-sufficient, portable devices. The second important innovation is the criticizability and
transparency of the system: the user has the option to view further detection results and to adjust
them if necessary. This leads to improved user acceptance and creates the basis for continuous
improvements.

6 OUTLOOK

This methodology of image-based AI methods designed for resource conservation and transparency is
also relevant for the computer-aided development of new materials (Virtual Materials Design), where
a prototype is planned. Such tools support the developer in evaluating experimental results and in
identifying structure-property relationships.

ACKNOWLEDGEMENTS

The work presented in this article is supported and financed by the Zentrales Innovationsprogramm
Mittelstand (ZIM) of the German Federal Ministry of Economic Affairs and Energy. The authors would
like to thank the project management organisation AiF in Berlin for their cooperation, organisation
and budgeting.

REFERENCES

Baumeister, J. (2016). Requirements of affective information systems in the industrial domain. In Ezquerro, M. T. H., Nalepa, G. J., and Méndez, J. T. P., editors, AfCAI, volume 1794 of CEUR Workshop Proceedings. CEUR-WS.org.

Brunet, D., Vrscay, E. R., and Wang, Z. (2012). On the mathematical properties of the structural similarity index. IEEE Transactions on Image Processing, 21(4):1488–1499.

Fazi, M. B. (2020). Beyond human: Deep learning, explainability and representation. Theory, Culture & Society.

Gonzalez, R. C. and Woods, R. E. (2018). Digital image processing. Pearson, New York, NY, fourth edition.

ImageHash (2021). ImageHash 4.2.0: an image hashing library written in python. https://pypi.org/project/ImageHash/. Last checked on Mar 17, 2021.

Kuwajima, H., Tanaka, M., and Okutomi, M. (2019). Improving transparency of deep neural inference process. CoRR, abs/1903.05501.

OpenCV (2021a). OpenCV: histogram comparison. https://docs.opencv.org/3.4/d8/dc8/tutorial_histogram_comparison.html. Last checked on Mar 17, 2021.

OpenCV (2021b). OpenCV: module of implementations of different image hashing algorithms. https://docs.opencv.org/3.4/d4/d93/group__img__hash.html. Last checked on Mar 18, 2021.

Samek, W., Montavon, G., Lapuschkin, S., Anders, C. J., and Müller, K.-R. (2021). Explaining deep neural networks and beyond: A review of methods and applications. Proceedings of the IEEE, 109(3):247–278.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2019). MobileNetV2: Inverted residuals and linear bottlenecks.

TensorFlow (2021). TensorFlow Lite: deploy machine learning models on mobile and IoT devices. https://www.tensorflow.org/lite. Last checked on Mar 17, 2021.

Xie, N., Ras, G., van Gerven, M., and Doran, D. (2020). Explainable deep learning: A field guide for the uninitiated.
