Deep-vacuity: A Proposal of a Machine Learning Platform based on
High-performance Computing Architecture for Insights on Government
of Brazil Official Gazettes
Leonardo R. De Carvalho¹, Felipe S. Lopes¹, Jefferson Chaves¹, Marcos C. Lima²,
Flávio E. Gomes De Deus³, Aletéia P. F. A. von Paungarthem¹ and Flavio De Barros Vidal¹
¹Department of Computer Science, University of Brasilia, Brasilia, DF, Brazil
²Department of Federal Police, Brasilia, DF, Brazil
³Department of Electrical Engineering, University of Brasilia, Brasilia, DF, Brazil
Keywords:
Machine Learning, High Performance Computing, Government of Brazil, Official Gazette.
Abstract:
Brazil publishes region information, public tenders for the hire of civil servants, and also government contracts
with companies in its Official Gazettes. All this volume of information can contain interesting relationships
that reveal unique characteristics of the government, such as the effectiveness of public policies and even the
existence of illegal schemes. Establishing these relationships is not a trivial task and requires great effort.
Therefore, this work proposes the Deep Vacuity platform, which, by using a High-Performance Computing
architecture along with Machine Learning techniques, can collect, cleanse, consolidate and analyze the data,
offering a friendly interface for decision-makers.
1 INTRODUCTION
Information is power (Freund, 1982). Even in the
1980s, a researcher had already noted this fact that
today seems so obvious. However, only if it is trans-
formed into knowledge can information be effectively
transformed into action (Connolly and Matarazzo,
1998). Brazil is one of the major countries in Latin
America. With a population of around 210 million
people, as described in (Instituto Brasileiro de Geografia e Estatística, 2020), spread across its 5,570
cities, this country has a governmental structure full
of information.
Considering that the government should be able
to guide the direction of a country, it is extremely im-
portant that the actions of this government are both
widely publicized and also understood at a deeper
level. In other words, it is necessary to understand
the repercussions of the acts, as well as the interrela-
tionships, be they occasional, accidental, intentional,
strategic or even fraudulent. Brazil presents an exciting challenge: extracting knowledge from the range of information that the Brazilian government publishes through its state press structure. Each municipal, state and federal entity has a different way of publicizing government acts. In this context, information technology can help make this information available to all citizens. Using several high-performance computer processing techniques, coupled with execution structures for models based on machine learning, this work proposes a platform architecture for the collection and processing of information published in the various Brazilian official gazettes, whether at the municipal, state or federal level. This processing uses artificial intelligence models designed to explore specific concepts according to the objectives of each usage profile, with the platform itself being agnostic in this respect.
The manuscript is organized as follows: In Sec-
tions 2 and 3 information about the official gazettes
in Brazil is presented, followed by a detailed com-
pilation of related works on fraudulent collusion in
public works contracts in official texts. Section 4 ex-
plains our proposed architecture. Section 5 describes
the preliminary results and a discussion about the pro-
posed architecture. Section 6 provides conclusion
points and further work.
2 BACKGROUND
Brazilian law defines the Official Press as the offi-
cial vehicle for public administration (Brasil, 1993) to
offer universal access, making the knowledge about
official acts available, and validating them. Official
Gazettes are very rich sources of governmental infor-
mation, as it is through them that official acts are not
only made public but are also considered valid (Mar-
tano, 2015). In other words, they are only applicable
from the moment they are published or, in cases spec-
ified in the acts themselves, within periods that fol-
low from those publications. Naturally, this type of publication has been impacted by the arrival of new information and communication technologies, so that most public agencies that publish Official Gazettes in Brazil nowadays offer them both in print and on-
line. However, the digital version of these documents
is usually offered in PDF format, which makes auto-
mated reuse of published information difficult.
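To give a concrete sense of this difficulty, the minimal sketch below re-extracts raw text from a gazette PDF before any structured use; the pypdf library and the file name are illustrative choices and are not part of the proposed platform.

```python
# Minimal sketch: raw text must first be re-extracted from the PDF and then
# segmented into individual acts before any structured reuse is possible.
# The pypdf dependency and the file name are illustrative assumptions.
from pypdf import PdfReader

reader = PdfReader("dou_2022-01-10.pdf")          # hypothetical gazette file
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(raw_text[:500])                             # still unsegmented plain text
```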
The Diário Oficial da União (DOU) is one of the
means of communication through which the National
Press must make public any and all matters of fed-
eral scope (Rocha, 2011). The DOU, for more than
148 years, has fulfilled its governmental role by serv-
ing the interests of all stakeholders in official deci-
sions. The mission of DOU is to publicize and carry
out knowledge management on information from offi-
cial acts for society and provide strategic graphic ser-
vices to the Federal Public Administration (BRASIL,
2020). DOU is published in PDF format, as shown in Figure 1, and is divided into three parts. Section 1 publishes regulating acts with national coverage, such as laws and norms; Section 2 publishes acts related to public workers, such as hiring, promotions and resignations; Section 3 publishes signed contracts between the government and companies, including the duration and the amount of money involved in each transaction. Another way to obtain DOU data from the National Press is shown in Figure 2: the National Press also delivers DOU data in XML format as open data.
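As a hedged illustration of why the XML delivery is friendlier to automation, the sketch below walks an XML bundle and emits one record per published act; the tag and attribute names (article, pubName, pubDate) are assumptions made for the example, not the actual National Press schema.

```python
# Sketch of consuming a DOU XML bundle with the standard library. The tag
# and attribute names below are assumed for illustration; the real schema
# published by the National Press may differ.
import xml.etree.ElementTree as ET

def parse_dou_xml(path: str) -> list[dict]:
    """Return one record per published act found in the XML file."""
    records = []
    for article in ET.parse(path).getroot().iter("article"):   # assumed tag
        records.append({
            "section": article.get("pubName"),                 # assumed attribute
            "date": article.get("pubDate"),                    # assumed attribute
            "text": "".join(article.itertext()).strip(),
        })
    return records

if __name__ == "__main__":
    for record in parse_dou_xml("dou_sample.xml"):
        print(record["section"], record["date"], record["text"][:80])
```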
Brazil is divided into 26 states plus the Federal District, where the capital, Brasilia, is located. Each state has its own way of organizing and publishing communications. The state of São Paulo, for example, has a platform called Imprensa Oficial. The publications are divided into sections and subsections.
Figure 1: DOU PDF Example.
Figure 2: DOU XML Example.
Unlike the DOU, the Official Gazette of São Paulo does not offer a structured data version, such as XML. In this case, in order to obtain the data it is necessary to open and parse each PDF file provided. Besides the
three most populous states, the other 23 states and the
Federal District have their own platforms for publish-
ing official gazettes. Each one is structured in a dif-
ferent format with several sections and subsections.
The municipal scenario is more complex than the
state ones. Brazil has 5,570 cities according to its
Geography and Statistics Institute (IBGE) (Instituto Brasileiro de Geografia e Estatística, 2020). Many
of these cities have their own web platform to pub-
lish Official Gazettes. Each of these platforms has
a different system for organizing the publications.
Besides this, some cities do not even have an elec-
tronic platform and publish their acts on paper, such as printed journals; in extreme cases there are publications that appear only on bulletin boards inside the public bodies. This reinforces the importance of the presumption of universal access: the Official Gazette is unquestionably the organ in which official acts are published, giving them legal validity, that is, existence in the real world. In the Official Gazette of São Paulo
city, about 800 articles are published per day, dis-
tributed in approximately 90 pages, and it is estimated
that, during its existence, approximately 30 million
pages have been published (Martano, 2015). When
projecting this for the other 5,569 cities, as well as
the 26 states and the Federal District, and also adding
DOU publications, there is a huge volume of infor-
mation related to the government being created daily.
This enormous volume of data contains precious in-
formation that has relationships in a wide scope.
Collecting, purifying, consolidating and extract-
ing these relationships is, in itself, a great challenge.
The collection step alone is demanding due to the high number of data sources and their different delivery formats. Once in possession of the data, the second step, data extraction, can be almost impossible because of the different methods used to publish the same information, since there is no standardization for these publications. As a result, the same company may be referred to in different ways across different journals, and even across different publications within the same journal. As-
suming that the challenges of collecting, purifying
and consolidating data have been overcome, then an
equally complex phase comes: finding relationships
between publications. For this activity, given the large
volume of data, a computational technique that aims
to extract intelligence from this mass of disconnected
data, Machine Learning, enters the scene.
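A toy example of the consolidation problem follows: the same supplier may appear under slightly different spellings, and even a naive string-similarity test (shown here only for illustration; the names and the 0.85 threshold are hypothetical) hints at why learned entity-resolution models are needed at scale.

```python
# Illustration of the consolidation problem: the same company written in
# different ways across gazettes. String similarity is used here only as a
# toy baseline; the platform itself would rely on trained models.
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    return " ".join(name.lower().replace(".", "").split())

def same_company(a: str, b: str, threshold: float = 0.85) -> bool:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(same_company("Construtora ABC Ltda.", "CONSTRUTORA ABC LTDA"))    # True
print(same_company("Construtora ABC Ltda.", "Pavimentadora XYZ S.A."))  # False
```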
2.1 High Performance Computing
The recent development of technology is associated with the growth of the problem sizes that have to be handled. Problem sizes are increasing exponentially, and individual physical machines cannot support a huge range of users on different operating environments simultaneously. That is one of the rea-
sons leading to the advent of cloud computing and
virtualization technology (Chung et al., 2016). The
complexity of application and system software con-
tinues to grow in several dimensions. Firstly, as
computational performance increases, scientists are
developing more complex applications that incorpo-
rate more sophisticated temporal and spatial scales,
more complex physics and (increasingly) data assim-
ilation (Geist and Reed, 2017).
A desirable feature of high-performance appli-
cations is performance portability, where using the
same application code can result in high performance
across a diversity of architectures (Balaprakash et al.,
2018). There are two models that allow virtualized
instances to be deployed. They are hypervisor-based
and container-based platforms. For virtual machines,
a hypervisor, known as a layer, deploys and allocates
the operation space of instances (Chung et al., 2016).
As a simple example, consider the special case of two
different job configurations. This could correspond to
either (a) two different node parallel levels or (b) use
of nodes with or without accelerators (e.g., GPUs).
In an energy- or cost-constrained environment, one or
the other might be preferred, based on the characteris-
tics of the jobs in the batch queue and the speedup as
a function of node type and number (Geist and Reed,
2017). Another relevant aspect related to HPC sys-
tems is resilience and fault tolerance. The amount of
data expected to be processed by the platform pro-
posed by this work can eventually reach the level
of petabytes. Given this data volume, only a high-
performance computing environment would be able
to process the data and generate the desired informa-
tion. Therefore, this proposal adopted an HPC ori-
ented approach in the elaboration of the architecture
for the solution. Concerns about scalability and elas-
ticity were taken into account when designing this
proposal.
3 RELATED WORKS
The Brazilian Federal Police is consistently aspiring
to improve its fraud detection mechanisms, there-
fore (Vallim, 2020) focused on paving works con-
tracts, which are one of the most budget consuming
services at state or city level and the focus of criminal
activities, to make a CBR model of paving services
in the Parana State. Procurement, enterprises, con-
tract, and georeferenced data were used, with the aim
of classifying collusion cases.
Another way to prove and identify procurement
collusion is by using statistics and probability. Those
methods were explored in several Federal Police
studies and were based on joint behavior analysis
of competitors who act to achieve bid-rigging. It
was successfully applied to oil-related contracts us-
ing Operação Lava Jato information (Signor et al.,
2020a) and for infrastructure projects (Signor et al.,
2020b) with capped first-price auctions. The Brazil-
ian General Controllership of the Union (CGU), a
national auditing public agency, also has several ini-
tiatives to reach a reliable classifier for procurement
fraud. (Ralha and Silva, 2012) elaborated an unsupervised evaluator that, using a priori rules, computed the likelihood of a certain group being selected in a given
tender. The article (Balaniuk et al., 2012) focused
on the evaluation of fraud risk in government agen-
cies using Naive Bayes Classifiers for audit planning,
using structured data and fraudulent activity patterns.
In (Sun and Sales, 2018) traditional neural networks
and deep neural networks (DNN) are used to elabo-
rate an early alarm system.
In (Carvalho and Carvalho, 2016) the author
achieved reasonable results using Bayesian Models
with structured data of penal sanction of federal civil
servants, civil servants’ roles and income, number of
accounts judged irregular and number of regularity
certificates or an agency unit and affiliated civil ser-
vants of each management unit.
The paper (Anysz et al., 2019) uses ANN and
structured data on Poland’s highway procurements.
They used the number of enterprises, price differ-
ences, contract orders in the same place, and set
of propositions to assess its fraud risk. Using TF-
IDF in procurement documentation, (Rabuzin and
Modrušan, 2019) tested Logistic Regression, SVM
and Naive Bayes on potential corruption. Their model
had no annotated data, so it was focused on find-
ing one bid tenders which “could be potentially sus-
picious”. Natural Language Processing is not often
used to classify procurement documents for risk or
fraud (Lima et al., 2020). The technology is used for
assessing fraud risk in health care claims (Popowich,
2005; Van Arkel et al., 2013), and financial re-
ports (Seemakurthi et al., 2015; Goel and Uzuner,
2016).
In the article (de Bessa Lins, 1976), the author
conducted research from August 1974 to March 1976
regarding the official gazettes of the Brazilian states.
Questionnaires were sent to the Official Press, public libraries and archives, assembly libraries, librarians, journalists, and teachers, seeking copies of each gazette, the legislation concerning them, and personal consultation with specialized technicians. The analysis of the results shows that, in a first stage, as much information as possible about each gazette was collected, without much concern for selection. Subsequently, the approach was refined and, through a comparative study, the urgent need for a renewal and revision of “official journalism” in the country was shown, as many people end up not reading or even knowing about this information; only those who have an obligation (for example, one imposed by their work) read the official gazettes.
In the paper (Luz de Araujo et al., 2020) the
authors use Universal Language Model Fine-Tuning
(ULMFiT) to leverage information contained in an unlabeled language model dataset consisting of 2,652
texts extracted from the Official Gazette of the Fed-
eral District of Brazil. They compare the perfor-
mance of ULMFiT with simple bag-of-words base-
lines and perform an ablation analysis to identify the
impact of gradual unfreezing, language model fine-
tuning and the use of the fine-tuned language model
as a text feature extractor. The results analysis shows
that the combination of language model fine-tuning
and gradual unfreezing is extremely beneficial. It also
suggests that language models, even after being fine-
tuned on domain data, are not good feature extractors
and should be trained also on classification data.
In the article (Xavier et al., 2015), the authors use
a hybrid architecture for indexing documents in the
Official Gazette of the Brazilian city of Cachoeiro
de Itapemirim located in the state of Espirito Santo,
Brazil. They use text mining techniques to identify
indexes that adequately represent the context of the
document. In addition, the architecture features com-
ponents of transactional systems for validating and
storing information, as well as elements of text min-
ing for transforming unstructured information into a
set of structured objects, capable of being maintained
in databases. For the indexing user, it is extremely
important that the proposed tool retrieves all the doc-
uments involved in the concept being sought, bringing
as few documents as possible that are not relevant to
the context.
In the article (de Sousa, 2014), the author exam-
ines the conditions in which the Official Gazette of
the State of Goias offers access to the information it
conveys and creates a proposal for information archi-
tecture requirements for official digital information
representation provided by the state. The author also
analyzes the usability of that gazette for what it pro-
poses, presenting requirements for organizing infor-
mation for the representation of its data, so that the
principle of publicity of official acts is effectively met.
4 THE DEEP VACUITY
ARCHITECTURE
Collecting, processing and inferring useful knowl-
edge about the whole mass of data generated by the
Brazilian government is a very audacious mission, and this paper proposes the Deep Vacuity platform to address it.
Figure 3 shows the general flow of Deep Vacuity.
It is possible to verify that the process begins with
the “Data Collection” phase in which entities called
“Crawlers” capture data in the various data sources
provided by the Brazilian government. Then artifi-
cial intelligence models are applied to these data in
the “Machine Learning” phase. Eventually, from the
data analysis, the platform will automatically reach an
“Insight”, that is, it will obtain new knowledge about
the data, which was not previously known. This new
knowledge will be presented through a frontend to a
human, whose knowledge in the area where the plat-
form is being used is extensive, and this expert will
make a conclusive study of the discovery.
Figure 3: Deep Vacuity Workflow.
Considering the magnitude of the challenge, Deep
Vacuity has an architecture that allows its construc-
tion to occur gradually in order to accommodate any
changes that occur in the data sources. As shown in
Figure 4, there are eleven components that make up
the platform’s architectural solution.
Orchestration Component: This component is re-
sponsible for controlling data collection, as well
as triggering the processing of artificial intelligence
models. It should be triggered periodically by operating-system task scheduling, such as crontab records. Its operation will use parameters defined in
the database. Area 1 in Figure 4 shows the Orchestra-
tion Component of Deep Vacuity Middleware.
Figure 4: Deep Vacuity Architecture.
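A minimal sketch of the orchestration routine is shown below. It assumes a hypothetical parameter table (orchestrator_params) and uses SQLite only to keep the example self-contained; in the platform, the parameters would come from the Database Component and the two placeholder triggers would start the collection and machine learning phases.

```python
# Sketch of the periodically scheduled orchestration routine. The table name,
# parameter keys and the two trigger functions are placeholders; SQLite is
# used only to keep the example self-contained.
import sqlite3

def run_collection(max_parallel: int) -> None:
    print(f"collection triggered with up to {max_parallel} parallel crawlers")

def run_models(pool_size: int) -> None:
    print(f"model processing triggered on a pool of {pool_size} instances")

def load_parameters(db_path: str = "deep_vacuity.db") -> dict:
    """Read scheduling parameters from the platform database (assumed schema)."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute("SELECT key, value FROM orchestrator_params").fetchall()
        return dict(rows)
    except sqlite3.OperationalError:   # parameters not provisioned yet
        return {}

def orchestrate() -> None:
    params = load_parameters()
    run_collection(max_parallel=int(params.get("crawler_concurrency", 4)))
    run_models(pool_size=int(params.get("ml_pool_size", 2)))

if __name__ == "__main__":
    orchestrate()                      # in production, fired by a crontab entry
```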
Data Collection Component: This component ab-
stracts and isolates the specifics of each data source.
Applications denominated Crawlers obtain data from
sources. This process can be done by just request-
ing and receiving files in formats such as XML and
JSON, or parsing the web pages of the data sources.
There is a Crawler for each source and some sources
can share the same Crawler just by filling in their pa-
rameters. Therefore, in the case of sources that share
the same data delivery format, it is not necessary to
create a Crawler specifically for each source. When
requesting a collection, Data Broker must iterate over
the pool of active Crawlers as shown in Area 2 of Fig-
ure 4 and perform the extractions in parallel (respect-
ing an established concurrence limit). Considering
the diversity of ways to obtain primary data, it is es-
sential that the data collection flow is flexible enough
to suit the specificities of the data sources. In order to
achieve this, each type of data source will have its as-
sociated Crawler. The Crawler is the application that
knows the format of the data delivery by the source
and knows how to capture and properly handle that
data. Areas 2 and 3 of Figure 4 show the symbiotic
relationship between primary data sources and their
Crawlers inside a Deep Vacuity context.
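The sketch below illustrates, under assumed names and URLs, how the Data Broker could iterate over the pool of active Crawlers and run extractions in parallel while respecting a concurrency limit.

```python
# Sketch of the Data Broker loop: one Crawler instance per source, executed
# in parallel under a bounded concurrency limit. Source URLs and the fetch
# logic are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import urllib.request

@dataclass
class Crawler:
    source_name: str
    url: str

    def fetch(self) -> bytes:
        """Download the raw publication; a real Crawler would also parse it."""
        with urllib.request.urlopen(self.url, timeout=30) as resp:
            return resp.read()

def run_crawlers(crawlers: list[Crawler], concurrency_limit: int = 4) -> dict:
    """Iterate over the active pool and extract in parallel, bounded by the limit."""
    results = {}
    with ThreadPoolExecutor(max_workers=concurrency_limit) as pool:
        futures = {pool.submit(c.fetch): c for c in crawlers}
        for future, crawler in futures.items():
            try:
                results[crawler.source_name] = future.result()
            except Exception as exc:   # one failed source must not stop the batch
                results[crawler.source_name] = exc
    return results
```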
Primary Data Sources Component: Each federa-
tive entity of the government of Brazil has its own
channel to publish its acts that need wide dissemina-
tion, such as calls for tenders, signing of contracts,
among other governmental acts. At national level,
these publications are made using the tool called
“Diário Oficial da União” (DOU), which generally
provides publications daily, except for weekends and
holidays. At state level, each state has a different plat-
form for making these publications available. Each
one has its own particular format, following the struc-
ture defined by the government of each state. As well
as the structure, the way in which data is made avail-
able also varies between states. At municipal level,
there are several different realities. Brazil currently has 5,570 cities. Some of those cities publish their gazettes electronically on their portals; however, there are still situations where notifications are made on paper or even displayed on boards at city halls.
Storage Component: This component aims to
store the data collected by Crawlers for processing by
the Machine Learning Engine. It consists of several
standalone storage instances (Area 4 of Figure 4) that
are selected by Orchestrator to receive data at the time
of collection. Each storage component has a Network
File System (NFS) directory where the data is stored,
and which is mounted on the Machine Learning En-
gine instances. The data can be classified as “perma-
nent” or “transient”. If the data is transient, it will be
removed when the Machine Learning Engine ends its
activities on that data.
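A small sketch of this storage contract follows; the mount point and the permanent/transient directory convention are assumptions used only to make the idea concrete.

```python
# Sketch of the storage contract: collected files land on an NFS-mounted
# directory and carry a retention class; transient items are removed once the
# Machine Learning Engine finishes with them. Paths are hypothetical.
from pathlib import Path

STORAGE_ROOT = Path("/mnt/deep_vacuity_storage")      # hypothetical NFS mount

def store(data: bytes, name: str, transient: bool) -> Path:
    subdir = "transient" if transient else "permanent"
    target = STORAGE_ROOT / subdir / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target

def release_transient(path: Path) -> None:
    """Called by the ML engine when it no longer needs a transient item."""
    if path.is_relative_to(STORAGE_ROOT / "transient"):
        path.unlink(missing_ok=True)
```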
Machine Learning Component: This component
applies artificial intelligence models to the collected
data and generates new data that are loaded into the
database. A model can load pre-existing models in
the database to use in its processing, as well as other
essential parameters for the model’s execution. Each
model is related to many data sources. When Or-
chestrator triggers this component and there is a new
collection awaiting processing, the models related to
the respective source are loaded. Each model will re-
ceive an instance of the pool to perform its work. If
there are no available instances or the limit of com-
peting instances is reached, the model will wait in a
queue for processing. The parameters of pool limits
are stored in the database and queried by Orchestrator
during its process. Once in possession of the instance
for execution, the model obtains its parameters from
the database, as well as the respective related models.
If related models exist, they will be created in Stor-
age. Then the model will be processed. At the end
of this processing, the model’s output will be stored
in the database. As shown in Area 5 of Figure 4 this
component is composed of several instances in which
the machine learning process occurs.
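The sketch below captures the queuing behaviour described above with a bounded worker pool; the model names, paths and the run() body are placeholders.

```python
# Sketch of the model-execution flow: jobs related to a new collection are
# queued and each waits for a free instance of a bounded pool. Model names
# and the run() body are placeholders.
from concurrent.futures import ThreadPoolExecutor

class ModelJob:
    def __init__(self, name: str, collection_path: str):
        self.name = name
        self.collection_path = collection_path

    def run(self) -> str:
        # A real job would load its parameters and related models from the
        # database, process the collection and persist its output.
        return f"{self.name}: processed {self.collection_path}"

def process_collection(jobs: list[ModelJob], pool_limit: int = 2) -> list[str]:
    """Execute queued model jobs without exceeding the configured pool limit."""
    with ThreadPoolExecutor(max_workers=pool_limit) as pool:
        futures = [pool.submit(job.run) for job in jobs]
        return [future.result() for future in futures]

print(process_collection([ModelJob("contract-fraud", "/data/dou/2022-01-10"),
                          ModelJob("price-quote", "/data/dou/2022-01-10")]))
```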
Database Component: Considering the expecta-
tion of data growth, this component is composed of
a cluster of several servers, so that its scalability is viable over time, as shown in Figure 4 (Area 6). As the
maximum load of the servers is reached, new servers
can be added to the cluster in order to expand its ca-
pacity. The selection of models for analysis and re-
porting can involve a large volume of data and in or-
der to make the processing of these requests more ag-
ile, this component has data distributed in the cluster.
This mechanism uses high parallelism in loading data
to process large volumes of data more quickly than
occurs in centralized database systems. On the other
hand, queries involving smaller data volumes will be
penalized due to the need to perform additional tasks
related to data distribution that would not be neces-
sary in centralized systems.
Presentation Component: This component is the
interface from which the end user can obtain informa-
tion in several formats such as: HTML pages, XML
and JSON files. HTML pages have dynamic con-
tent processed via JavaScript using data loaded from
a REST API. It is prepared for an eventual increase
in number of users, as it has a load balancer service
that distributes requests among the various applica-
tion servers. It can be noticed in Area 7 of Figure 4
that all requests are made to the load balancer acting
as a reverse proxy to the instances running the fron-
tend application. Considering the diversity of ways to
deliver the data produced by the platform, the backend
REST API is stateless. Figure 5 shows a prototype of
a configuration page for artificial intelligence model
listing and upload of Deep Vacuity.
Figure 5: Deep Vacuity Frontend Prototype.
Request Processing Component: This component
(Area 9 of Figure 4) is responsible for receiving, han-
dling and responding to requests from the frontend
through a REST API. Heavy processing requests can be sent to the Request Queue before they are effectively dealt with, in order to prevent eventual overload
events in this component. It is prepared for an even-
tual growth in number of users, as it has a load bal-
ancer service that distributes requests among the vari-
ous application servers. Figure 4 shows, in Area 8, the
pool of instances that composes the backend. Those
instances run the backend application and are state-
less. The backend application uses data queried in the
Database to process the requests. The instances are
reachable through a reverse proxy that is responsible
for balancing the load allocated to each instance.
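To make the stateless property concrete, a minimal backend endpoint is sketched below (assuming Flask; the route and payload are illustrative). Because no session state is kept in the instance, any replica behind the reverse proxy can answer any request.

```python
# Minimal sketch of a stateless backend endpoint. Flask, the route name and
# the payload are illustrative assumptions; the real backend would query the
# Database Component instead of returning a literal.
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/api/insights")
def list_insights():
    return jsonify([{"id": 1, "summary": "possible bid-rigging pattern"}])

if __name__ == "__main__":
    app.run(port=8000)   # each instance is interchangeable behind the proxy
```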
Request Queue Component: This component acts
as a solution processing scheduler. It queues requests
and triggers their execution at the appropriate time
following previously established metrics. This additional component takes part in processes whose processing time is usually long. This mechanism allows the user to request processing, follow its execution asynchronously, and be notified when its pro-
cessing is finished. Additionally, it prevents backend
downtime caused by overload coming from excessive
frontend requests.
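A self-contained sketch of this asynchronous pattern is given below: heavy jobs are enqueued, a background worker drains them, and the caller polls a ticket instead of blocking the backend. The job payloads and the in-memory status store are assumptions made to keep the sketch short.

```python
# Sketch of the asynchronous request queue: jobs are enqueued, a background
# worker processes them, and the frontend polls a ticket for completion.
import queue
import threading
import uuid

jobs: "queue.Queue[tuple[str, dict]]" = queue.Queue()
status: dict = {}

def submit(payload: dict) -> str:
    ticket = uuid.uuid4().hex
    status[ticket] = "queued"
    jobs.put((ticket, payload))
    return ticket                      # returned to the frontend for polling

def worker() -> None:
    while True:
        ticket, payload = jobs.get()
        status[ticket] = "running"
        # ... a long-running report or model selection would happen here ...
        status[ticket] = "done"
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
```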
Monitoring Component: Considering the number
of components that make up the solution, this el-
ement monitors the functioning of the components,
their integration and prevents possible disaster sce-
narios. It has acceptable thresholds for each moni-
tored aspect. Once a limit is exceeded, the alert system sends notifications to the party responsible for the service in order to call attention to that component. The
Area 10 of Figure 4 illustrates the composition of this
component. There are three instances: Metric col-
lector: service responsible for obtaining and storing
established metrics from each monitored component;
Monitoring Dashboard: graphical application to vi-
sualize and analyze monitored component behavior
over time; Alert manager: service in which thresh-
olds are inputted and used to automatically monitor
component metrics.
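The threshold logic of the alert manager can be summarized by the sketch below; the metric names, thresholds and the notification transport are assumptions for illustration.

```python
# Sketch of the alert-manager behaviour: each monitored metric has an
# acceptable threshold and crossing it triggers a notification to the party
# responsible for the component. Names and limits are illustrative.
THRESHOLDS = {"storage_usage_pct": 85.0, "queue_depth": 1000, "api_error_rate": 0.05}

def notify(component: str, metric: str, value: float) -> None:
    print(f"ALERT [{component}] {metric}={value} exceeded its threshold")

def check_metrics(component: str, metrics: dict) -> None:
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is not None and value > limit:
            notify(component, name, value)

check_metrics("storage-03", {"storage_usage_pct": 91.2, "queue_depth": 120})
```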
Authentication and Authorization Component:
Deep Vacuity platform offers user interfaces through
its main frontend and the monitoring system. In both
cases it is necessary to have a system of authentication
and authorization of users based on roles, since there
will be segregation of functions both in the presenta-
tion component and in the monitoring tool. For this,
the platform has a segregated infrastructure whose
specific function is to house an application for the
management of these authorizations, as well as meet-
ing authentication requests from the frontend and the
monitoring system. In Area 11 of Figure 4, it is possi-
ble to observe that in addition to the instance responsi-
ble for executing the application of this system, there
is also an instance in which the solution database is al-
located, which is completely separated from the main
platform database. In addition to authentication and
authorization, this system allows single sign-on, that is, once authenticated in one of the tools (frontend or monitoring system), the user will be able to access the other tool, within an established time window, without having to perform a new login process.
5 PRELIMINARY RESULTS
Deep Vacuity can be used for a number of purposes,
as once the data obtaining process is ready, it is sim-
pler for a work group to invest in the building of their
intelligence models, load them into the platform and
then obtain their “insights”. Some use cases that can
be performed using this tool will be described here.
For example, law enforcement officials responsible for combating the crime of corruption in public contracts could consolidate their knowledge of how a particular type of crime works into a model, upload it to Deep Vacuity for training on the data collected by the platform, and, from the insights obtained, build tools to conduct investigations into suspicious cases. Another use
case is the validation of the execution of public pol-
icy strategies. In this use case, a public manager, in
a broader scope, such as a federal one, for example,
could verify whether a particular public policy is be-
ing passed on to lower levels by applying smart mod-
els that verify the occurrence of local government acts
in that location. Still for public managers, another use
case would be an automated price quote. In this use
case, a finance manager from a public agency could
use Deep Vacuity to establish the value of a particular
item they wish to purchase or service they intend to
hire by applying an intelligent model on the data col-
lected by the platform, whether globally or regionally,
depending on interest.
Undue transfers of public officials aimed at receiving compensation money can also be detected through Deep Vacuity. For this, human resources managers
would need to implement intelligence models capa-
ble of isolating these cases in the universe of transfers
published across the country, including using the plat-
form itself as a training environment. These are just a
few examples of use cases that serve to illustrate the
potential that this architecture has.
6 CONCLUSIONS
This article presented the proposal for Deep Vacuity, a platform to collect data from the Brazilian government, process it and offer insights through a friendly graphical interface so that they can be analyzed by experts. Although the challenge this platform faces seems arduous, the description of the architecture, based on HPC, supports its viability. Likewise, the strategy of using generic artificial intelligence models for data processing and insight generation proved to be an appropriate approach to make the architecture applicable to a wide range of scenarios.
Deep Vacuity presents itself as a prominent tool
for analyzing government data, especially in the
Brazilian context. Beyond that, this tool can serve as
a starting point for a transformation in the Brazilian
management model, which has historically been reactive. With the adoption of Deep Vacuity, it could reach a level of event predictability that gives government entities the time needed to prepare countermeasures or mitigating actions. The flow and the architecture
of the platform are already consolidated, as well as the
prototype of the frontend. The next steps are to build
the data collection structures, the integration track for
processing the smart models and the application back-
end. From that point on, tests can be carried out with
real data, which will allow the performance of the
platform to be evaluated in a productive environment.
Once reaching the aforementioned point, several arti-
ficial intelligence models can be developed that meet
the different use cases to which the platform will be
applied. The role of these models may grow as the use of the platform creates demand for new intelligent ways of processing data.
ACKNOWLEDGEMENTS
The authors are grateful for the support of this research through the Applied Research of Technological Innovations in Federal Criminal Forensics project (DITEC/PF), Public Call UNB-DITEC 01-2021, and the University of Brasilia.
REFERENCES
Anysz, H., Foremny, A., and Kulejewski, J. (2019). Com-
parison of ann classifier to the neuro-fuzzy system for
collusion detection in the tender procedures of road
construction sector. In IOP Conference Series: Ma-
terials Science and Engineering, volume 471, page
112064. IOP Publishing.
Balaniuk, R., Bessiere, P., Mazer, E., and Cobbe, P. (2012).
Risk based Government Audit Planning using Naïve Bayes Classifiers. In Advances in Knowledge-Based
and Intelligent Information and Engineering Systems,
Spain.
Balaprakash, P., Dongarra, J., Gamblin, T., Hall, M.,
Hollingsworth, J. K., Norris, B., and Vuduc, R.
(2018). Autotuning in high-performance computing
applications. Proceedings of the IEEE, 106(11):2068–
2083.
Brasil (1993). Lei 8666. http://www.planalto.gov.br/ccivil_03/leis/l8666cons.htm, visited 2020-05-16.
BRASIL (2020). Diário oficial da união. http://www.in.gov.br, visited 2020-05-16.
Carvalho, R. S. and Carvalho, R. N. (2016). Bayesian mod-
els to assess risk of corruption of federal management
units. In BMA@ UAI, pages 28–35.
Chung, M. T., Quang-Hung, N., Nguyen, M., and Thoai, N.
(2016). Using docker in high performance computing
applications. In 2016 IEEE Sixth International Con-
ference on Communications and Electronics (ICCE),
pages 52–57.
Connolly, S. and Matarazzo, J. (1998). Knowledge and Spe-
cial Libraries. Routledge.
de Bessa Lins, M. I. (1976). Diários oficiais dos estados brasileiros. https://www.brapci.inf.br/ repositorio/2010/02/pdf d7a5189ebe 0008270.pdf, visited 2020-05-16.
de Sousa, S. R. (2014). Diário oficial do estado de Goiás: uma proposta de requisitos de arquitetura da informação para representação da informação oficial digital.
Freund, G. E. (1982). Impactos da tecnologia da informação.
Geist, A. and Reed, D. A. (2017). A survey of high-
performance computing scaling challenges. The In-
ternational Journal of High Performance Computing
Applications, 31(1):104–113.
Goel, S. and Uzuner, O. (2016). Do sentiments matter in
fraud detection? estimating semantic orientation of
annual reports. Intelligent Systems in Accounting, Fi-
nance and Management, 23(3):215–239.
Instituto Brasileiro de Geografia e Estatística (2020). Estimativas da população. https://www.ibge.gov.br/estatisticas/sociais/populacao/9103-estimativas-de-populacao.html?=&t=resultados, visited 2020-05-16.
Lima, M., Silva, R., Lopes de Souza Mendes, F., R. de
Carvalho, L., Araujo, A., and de Barros Vidal, F.
(2020). Inferring about fraudulent collusion risk on
Brazilian public works contracts in official texts us-
ing a Bi-LSTM approach. In Findings of the Associ-
ation for Computational Linguistics: EMNLP 2020,
pages 1580–1588, Online. Association for Computa-
tional Linguistics.
Luz de Araujo, P. H., de Campos, T. E., and Magalhães Silva de Sousa, M. (2020). Inferring the source of official texts: Can SVM beat ULMFiT? In Quaresma, P., Vieira, R., Aluísio, S., Moniz, H., Batista, F., and Gonçalves, T., editors, Computational Processing of the Portuguese Language, pages 76–86, Cham. Springer International Publishing.
Martano, A. M. R. (2015). Diário livre: co-criação de uma ferramenta para publicação de um diário oficial em formato aberto. Master's thesis, Escola de Artes, Ciências e Humanidades. https://teses.usp.br/teses/disponiveis/100/100131/tde-21122015-091757/publico/dissertacao.pdf, visited 2020-05-16.
Popowich, F. (2005). Using text mining and natural lan-
guage processing for health care claims processing.
ACM SIGKDD Explorations Newsletter, 7(1):59–66.
Rabuzin, K. and Modrušan, N. (2019). Prediction of public procurement corruption indices using machine learning methods.
Ralha, C. G. and Silva, C. V. S. (2012). A multi-agent data
mining system for cartel detection in brazilian govern-
ment procurement. Expert Systems with Applications,
39(14):11642 – 11656.
Rocha, J. P. L. (2011). Inteligência de fontes abertas: um estudo sobre descoberta de conhecimento no diário oficial da união. https://bdtd.ucb.br:8443/jspui/handle/123456789/1336, visited 2020-05-16.
Seemakurthi, P., Zhang, S., and Qi, Y. (2015). Detection
of fraudulent financial reports with machine learning
techniques. In 2015 Systems and Information Engi-
neering Design Symposium, pages 358–361. IEEE.
Signor, R., Love, P. E., Belarmino, A. T., and Al-
fred Olatunji, O. (2020a). Detection of collusive ten-
ders in infrastructure projects: Learning from opera-
tion car wash. Journal of Construction Engineering
and Management, 146(1):05019015.
Signor, R., Love, P. E., Oliveira Jr, A., Lopes, A. O.,
and Oliveira Jr, P. S. (2020b). Public infrastruc-
ture procurement: Detecting collusion in capped first-
priced auctions. Journal of Infrastructure Systems,
26(2):05020002.
Sun, T. and Sales, L. J. (2018). Predicting Public Pro-
curement Irregularity: An Application of Neural Net-
works. Journal of Emerging Technologies in Account-
ing, 15(1):141–154.
Vallim, J. J. d. C. B. (2020). Uso do Modelo de Raciocínio Baseado em Casos Para Monitoramento de Conluio em Licitações de Obras de Pavimentação Urbana. Master's thesis, Universidade Federal do Paraná, Curitiba.
Van Arkel, J. H., Wagner, J. J., Schweyen, C. L., Mahone,
S. M., Curtis, T. J., HAGINS, S., et al. (2013). Pre-
dictive modeling processes for healthcare fraud detec-
tion. US Patent App. 13/536,414.
Xavier, B. M., da Silva, A. D., and Gomes, G. R. R. (2015). Uma arquitetura híbrida para a indexação de documentos do diário oficial do município de Cachoeiro de Itapemirim.