Occopus will decide randomly between node definitions when provisioning new node instances, so we propose using the updated infrastructure description method instead. In this case, the user can explicitly define how many Kafka nodes should be created in the ELKH Cloud, in Azure, or in any other cloud provider for which a Kafka node definition exists.
The updated infrastructure description providing Kafka nodes in both the ELKH Cloud and Azure can be found in the Gitlab repository for the paper (Farkas, 2021), in the kafka.vm directory, in the file infra-kafka-hybrid.yaml. This infrastructure description shows the structure of the infrastructure: one Zookeeper node is used, and there are two Kafka node variants, one (as earlier) for the ELKH Cloud, and one for Azure. Both the ELKH Cloud and Azure variants are set to scale between at least 2 and at most 10 instances.
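Following the general Occopus infrastructure description format, such a hybrid setup could be sketched roughly as below. The user id, node names, and node definition type names are illustrative placeholders; the authoritative version is infra-kafka-hybrid.yaml in the paper's repository.

```yaml
user_id: someone@example.com      # placeholder
infra_name: kafka-hybrid
nodes:
  - &Z
    name: zookeeper
    type: zookeeper_node          # assumed node definition name
    scaling:
      min: 1
      max: 1
  - &KE
    name: kafka-elkh
    type: kafka_elkh_node         # ELKH Cloud-backed Kafka variant
    scaling:
      min: 2
      max: 10
  - &KA
    name: kafka-azure
    type: kafka_azure_node        # Azure-backed Kafka variant
    scaling:
      min: 2
      max: 10
dependencies:
  - [ *KE, *Z ]                   # Kafka nodes depend on Zookeeper
  - [ *KA, *Z ]
```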
If an ELKH Cloud user has already created an infrastructure based on the ELKH Cloud-only Apache Kafka reference architecture, then Occopus enables extending the available infrastructure with the new, additional Azure-based Kafka nodes. The user simply has to instruct Occopus to build an infrastructure based on the existing one, using the new infrastructure description. Occopus will check whether the nodes of the existing infrastructure are still alive (and will provision new ones if not), and will also create the Azure-based Kafka node instances.
The node definition part belonging to the updated, hybrid infrastructure depends on the target cloud used. In the following subsections we examine how the extension can be implemented in Azure using virtual machines and in Azure using container instances. Occopus, among other cloud types, supports both of these.
4.1 Azure VM-based Kafka Nodes
Azure-based virtual machines in Occopus are handled by the azure_vm resource handler plugin. Through this plugin, users can instantiate virtual machines in the different regions of Azure; cloud-init scripts are available for contextualization, just as in the case of the OpenStack-based ELKH Cloud. The Occopus documentation provides detailed instructions on how to prepare a node definition for the azure_vm plugin, so here we only summarize the necessary steps:
• set the authentication data to be used with Azure (using an Azure service principal),
• set the node properties, such as region, virtual machine size, and publisher,
• create the contextualization script for the service.
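Put together, a node definition for the azure_vm plugin could look roughly like the following sketch. All values, and some of the attribute names, are placeholders that should be checked against the Occopus azure_vm plugin documentation rather than taken as the definition used in the paper.

```yaml
'node_def:kafka_azure_node':
  - resource:
      type: azure_vm
      endpoint: https://management.azure.com    # Azure Resource Manager endpoint
      resource_group: my-kafka-rg               # placeholder resource group
      location: westeurope                      # target Azure region
      vm_size: Standard_DS1_v2                  # virtual machine size (placeholder)
      publisher: Canonical                      # image publisher (placeholder)
      offer: UbuntuServer                       # image offer (placeholder)
      sku: 18.04-LTS                            # image SKU (placeholder)
    contextualisation:
      type: cloudinit
      context_template: !yaml_import
        url: file://cloud_init_kafka_azure.yaml # the cloud-init script
```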
As already mentioned, the Occopus documentation covers most cases. Also, as Azure provides cloud-init-based contextualization, the contextualization script used for the ELKH Cloud-based nodes can very likely also be used for Azure.
However, a slight modification is necessary. As shown in Figure 3, the Kafka nodes communicate with each other. As the ELKH Cloud-based instances and the Azure-based instances run on completely different networks, each node in this hybrid scenario must have a public IP address assigned, and this public IP address should be used both as the advertised listener and as the listener for inter-broker communication. Most cloud providers offer a way to query, from inside the virtual machine, the public IP address associated with it, but the method differs between clouds. Thus, the contextualization script of the Azure-based instances differs from that of the ELKH Cloud-based ones in this respect: for querying the public IP address of the VM, the Azure Instance Metadata Service (IMDS, 2021) is used.
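For example, the public IP address can be retrieved from IMDS and wired into the broker configuration with a cloud-init fragment like the one below. The Kafka installation path and the exact api-version value are assumptions for illustration, not taken from the paper's repository.

```yaml
#cloud-config
runcmd:
  - |
    # Query the VM's public IP address from the Azure Instance Metadata
    # Service; the endpoint is reachable only from inside the VM and
    # requires the "Metadata: true" header.
    PUBLIC_IP=$(curl -s -H "Metadata: true" \
      "http://169.254.169.254/metadata/instance/network/interface/0/ipConfigurations/0/publicIpAddress?api-version=2021-02-01&format=text")
    # Advertise the public address so brokers in the other cloud can
    # reach this node (server.properties path is an assumption).
    sed -i "s|^advertised.listeners=.*|advertised.listeners=PLAINTEXT://${PUBLIC_IP}:9092|" \
      /opt/kafka/config/server.properties
```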
With the updated cloud-init script, we can start to extend the existing infrastructure in the ELKH Cloud with Kafka nodes from Azure, using the azure_vm plugin. The output of the Occopus command extending the infrastructure can be found in the Gitlab repository’s kafka.vm directory, in the file logs/03_extend_hybrid.log.
After the extension of the infrastructure, we will
have one Zookeeper node running in the ELKH
Cloud, 5 Kafka nodes running in the ELKH Cloud,
and 2 additional Kafka nodes running in Azure.
At this point, we can start to attach new producers and consumers to the extended cluster. They can connect either to the ELKH Cloud-based or to the Azure-based nodes, but it is important to note that, according to the Kafka documentation, new servers will not automatically be assigned any data partitions, so unless partitions are moved to them, they will not do any work until new topics are created.
4.2 Azure Container-based Kafka
Nodes
Containers offer a simple solution for running pre-packaged applications in versatile environments. As mentioned earlier, the ELKH Cloud does not provide a native service for running containers; instead, users are required to deploy some sort of container service (for example, Docker Engine or Kubernetes) onto virtual machines, and run containers on top of that service.
Azure includes a service called Azure Container
Instances (ACI), which enables Azure users to start