Reinforcement Learning-based Real-time Fair Online Resource Matching

Pankaj Mishra$^{1,2}$ and Ahmed Moustafa

Department of International Collaborative Informatics, Nagoya Institute of Technology, Gokiso, Nagoya, Japan

School of Computing and Information Technology, University of Wollongong, Wollongong, Australia

Keywords:

Resource Matching, Reinforcement Learning, Fairness, Online Markets, Contest Success Function.

Abstract:

Designing a resource matching policy in an open market paradigm is a challenging and complex problem. The complexity is mainly due to the conflicting objectives of the independent resource providers and the dynamically arriving online buyers. Specifically, providers aim to maximise their revenue, whereas buyers aim to minimise their resource costs. Addressing this problem therefore requires a fair matching platform. On the one hand, the platform must optimise the pricing rule on behalf of the resource providers to maximise their revenue. On the other hand, the broker must fairly match the resource requests on behalf of the buyers. To this end, we propose a two-step, unbiased, broker-based resource matching mechanism in the auction paradigm. In the first step, the broker computes optimal trade prices on behalf of the providers using a novel reinforcement learning algorithm. In the second step, an appropriate provider is matched with the buyer's request based on a novel multi-criteria winner determination strategy. Finally, we compare our online resource matching approach with two existing state-of-the-art algorithms. The experimental results show that the proposed matching algorithm outperforms both baselines.

1 INTRODUCTION

Resource matching in online settings with multiple data providers and dynamically arriving buyers is a widely studied problem. In such market settings, two types of competition co-exist: competition among the providers and competition among the buyers. Specifically, providers compete with each other to maximise their total revenue by offering their available resources. Dynamically arriving buyers, in turn, compete to fulfil their resource demands at the minimum possible price and with the highest possible quality preferences. Auction paradigms (Krishna, 2009) are widely adopted (Samimi et al., 2016; Zaman and Grosu, 2013) in various market settings. These auction paradigms are run by an auctioneer, which intermediates each resource match. Specifically, it is the auctioneer's responsibility to design optimal allocation and pricing rules (Myerson, 1981) in the market. In addition, these rules should be fair and maintain stability in the market to create a trustworthy environment (Krishna, 2009) for strategic participants. Moreover, the model should effectively handle the conflicting objectives of the participants.

Therefore, there is a need for a fair and truthful auctioneer-based matching platform that addresses all the above constraints. Briefly, designing an optimal matching platform, i.e., an auctioneer, requires overcoming three major challenges: (1) designing an optimal pricing rule for the resource providers, (2) designing an efficient allocation rule for the online buyers, and (3) maintaining equilibrium in the auction paradigm. However, in the literature, these three challenges are addressed as independent problems. For instance, there are optimal pricing rules for static markets (Samimi et al., 2016; Zaman and Grosu, 2013) as well as for gradually changing markets (Kong et al., 2015; Li et al., 2016). However, these rules fail to adapt to dynamically changing and time-sensitive resource requests.

Learning-based approaches are more appropriate for adapting to the dynamics of such markets, and many have been proposed (Lee et al., 2013; Prasad et al., 2016; Kumar et al., 2019). Specifically, in uncertain open markets where multiple providers attempt to negotiate with multiple buyers in an online setting, a learning-based dynamic pricing algorithm is needed, because fixed pricing policies (Samimi et al., 2016; Zaman and Grosu, 2013) and statistical model-based dynamic pricing policies (Kong et al., 2015; Li et al., 2016) fail to perform in such real-time scenarios. Moreover, given the real-time setting and the uncertain changes in supply and demand, which rarely follow a repeating pattern, supervised learning, which learns from patterns, is not suitable. Therefore, reinforcement learning (Sutton et al., 1998) is the most appropriate choice for real-time pricing in the considered uncertain and time-critical open markets. Recent advances in reinforcement learning (RL) for computational economics (Charpentier et al., 2021) support its applicability in such real-time market scenarios. One example is the remarkable performance of actor-critic RL techniques in real-time bid optimisation for online display advertisements (Cai et al., 2017; Yuan et al., 2013).

Mishra, P. and Moustafa, A.
Reinforcement Learning-based Real-time Fair Online Resource Matching.
DOI: 10.5220/0010834600003116
In Proceedings of the 14th International Conference on Agents and Artificial Intelligence (ICAART 2022) - Volume 1, pages 34-41
ISBN: 978-989-758-547-0; ISSN: 2184-433X

However, these approaches do not provide a comprehensive mechanism that handles the conflicting criteria of each buyer when selecting suitable providers. Besides, achieving equilibrium in open cloud markets requires promoting three main characteristics: competitiveness (Toosi et al., 2016), truthfulness (Myerson, 1981) and fairness (Murillo et al., 2008). Specifically, the optimal pricing rule needs to maintain competitiveness by dynamically modelling the resource selling prices based on supply and demand in the market. Meanwhile, the allocation rule needs to observe fairness and truthfulness and give equal winning opportunities to all the potential providers in the market. However, to the best of our knowledge, none of the existing resource allocation approaches (Myerson, 1981; Samimi et al., 2016; Cai et al., 2017) addresses these three challenges simultaneously.

To summarise, existing resource matching ap-

proaches are either provider-oriented or buyer-

oriented, but not both. Owing to this, in this work, we

propose a fair broker based online resource matching

mechanism in a double auction paradigm. In this re-

gard, the contributions of this research are as follows:

• First, a novel reinforcement learning-based online

pricing rule, which optimises the selling price as

per the supply and demand in the market.

• Second, a multi-preference provider matching

rule (allocation rule) is proposed, to maximise the

utility of the online buyers.

The rest of this paper is organised as follows. Section 2 models the resource matching problem as a Markov decision process. Section 3 introduces a novel online pricing algorithm. Section 4 presents the provider matching mechanism. Section 5 presents the experimental results used to evaluate the proposed approach. Finally, Section 6 concludes the paper.

2 THE MODEL

In this section, we model the online market setting and define all the participants. As mentioned before, we aim to design an optimal broker that carries out a stable resource matching process in online markets. We consider a set $P$ of $n$ providers and a set $B_t$ of $m$ buyers at time-step $t$. Each provider $i \in P$ has a set of $l$ types of resources denoted as $A_{i,t} \equiv \{a^1_{i,t}, \ldots, a^l_{i,t}\}$, where $a^j_{i,t}$ represents the quantity of resource type $j \in [1, l]$ available with $i \in P$ at time-step $t$. Similarly, each of the $m$ buyers arriving at time-step $t$ reports its profile $\theta_{j,t}$, denoted as $\theta_{j,t} \equiv (R_j, bid_j)$, where $R_j \equiv \{r^1_j, \ldots, r^k_j\}$ is the bundle of requested $k$ types of resources, $s_j$ denotes the size of the request in each time-step, $w_j$ represents the maximum waiting time and $bid_j$ denotes the maximum budget of the buyer. Also, let $\theta_t$ be the set of all the profiles reported at time-step $t$, i.e., $\theta_t \equiv \{\theta_{1,t}, \theta_{2,t}, \ldots, \theta_{m,t}\}$.

In the above setting, resource requests from the set of buyers $B_t$ are matched dynamically to an appropriate provider from the set of available providers. Figure 1 represents the architecture of the proposed auctioneer, where $\alpha_1, \alpha_2, \ldots, \alpha_n$ denote the agents bidding on behalf of the providers.

As shown in the figure, the auctioneer comprises two modules: (1) an online price optimisation module and (2) a provider matching module. We call the whole resource matching mechanism real-time resource matching (RTRM); it employs a double-auction paradigm to match the resources in an open market setting. First, upon receiving the set of resource requests from the m online buyers at time-step t, the RTRM platform optimises the offering price for all the potential providers through autonomous agents. These agents optimise the offered price for their respective providers using the price optimisation module. Finally, using the provider matching module, an appropriate provider is matched with each buyer at an optimal price.

Further, to adopt reinforcement learning (Sutton et al., 1998) in the resource matching problem, we formulate it as a stochastic game, i.e., a Markov Decision Process (MDP) (Fink, 1964). In this context, n autonomous agents bid the offered price on behalf of the n potential providers. The MDP has a set of states S, which denotes the possible status of all the agents; a set of actions A, which represents the action space of the agents; and a set of rewards, which the agents receive on acting. We therefore need to define these three entities to formulate the resource matching problem as an MDP. Since this particular setting involves multiple agents, it is called a Multi-Agent MDP (MMDP).

Figure 1: The architecture of the proposed RTRM auctioneer.

In the proposed MMDP-based model, the joint state-space $S_t$ represents the status of all the agents bidding on behalf of providers at time-step $t$. We formulate this joint state-space $S_t$ by concatenating the status of all the agents for each resource request from buyers at time-step $t$. The state of all the agents is updated at every time-step $t$ and is reset at the end of the episode at $t_{max}$. However, the agents are not re-initialised from scratch at the end of each episode. The base prices (minimum selling prices) of the resources are fixed by the providers. The base price is denoted as $bp^r_i$, $\forall i \in P, r \in [1, l]$. These values are disclosed in a sealed-bid fashion, so the other providers are not aware of them. Upon receiving the reported base prices $bp^r_i$, the agents optimise them using exclusive adjustment multipliers for each resource request. These adjustment multipliers are the action values computed by the proposed RL-based algorithm. Specifically, at time-step $t$, for online buyer $j \in B_t$, an adjustment multiplier $act_{i,j}$ is computed $\forall i \in P$. Finally, these adjustment multipliers are used to compute the optimal offered price, denoted as $op^r_{i,j}$, in Equation (1):

$$op^r_{i,j} = bp^r_i \times (1 + act_{i,j}) \times \Gamma_{i,j} \tag{1}$$

where $\Gamma_{i,j}$ represents the utilisation of the provider for the next $T$ time-steps, based on the currently requested profile $\theta_t$ and all the past allocated requests. It is computed using Equation (2):

$$\Gamma_{i,j} = \sum_{a \in 1} \eta(j, \theta_t) \ast A_{i,t} \tag{2}$$

where $A_{i,t}$ represents the availability of the resources at time-step $t$, and $\eta(\cdot)$ is the contest success function (Skaperdas, 1996), computed using Equation (3):

$$\eta(j, \theta_t) = \frac{bid_j^{\sigma}}{\sum_{b \in \theta_t} bid_b^{\sigma}} \tag{3}$$

where $0 < \sigma \le 1$ represents the noise in the contest in an auction paradigm, i.e., it captures how the probability of winning changes as the budget values increase (Shen et al., 2019). Intuitively, the contest success function denotes the probability of a particular buyer being allocated based on its bid value compared with those of the other buyers at time-step $t$. The action space for the $n$ agents is represented as $Act \equiv \{Act_1, \ldots, Act_n\}$, where $Act_i$ represents the action space for agent $i$, $i \in P$. Further, based on agent $i$'s pricing policy $\pi_i : s_i \mapsto Act_i$, the action value $act_{i,j} \in Act_i$ is determined. After executing the set of chosen actions for all the agents (i.e., allocating the requested resources), the proposed MMDP model transitions to the next state $S_{t+1}$. This state change occurs based on the transition function $\tau : S_t \times Act_1 \times \cdots \times Act_n \mapsto \Omega(S_{t+1})$, where $\Omega(S_{t+1})$ represents the set of probability distributions.
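The contest success function of Equation (3) and the price update of Equation (1) can be illustrated with a minimal Python sketch. It assumes that the utilisation term Γ has already been computed and is passed in as a plain number; the function names, the example budgets and the example multiplier are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def contest_success(j: int, bids: Sequence[float], sigma: float = 0.5) -> float:
    """Equation (3): share of buyer j's bid**sigma among all bids at this time-step."""
    if not 0.0 < sigma <= 1.0:
        raise ValueError("sigma must lie in (0, 1]")
    powered = [b ** sigma for b in bids]
    return powered[j] / sum(powered)


def offered_price(base_price: float, act: float, utilisation: float) -> float:
    """Equation (1): base price scaled by the adjustment multiplier and by Gamma.

    `utilisation` stands in for Gamma and is assumed to be given.
    """
    return base_price * (1.0 + act) * utilisation


if __name__ == "__main__":
    bids = [100.0, 400.0, 900.0]  # hypothetical buyer budgets
    shares = [contest_success(j, bids) for j in range(len(bids))]
    print(shares)  # the shares of all buyers sum to 1
    print(offered_price(base_price=120.0, act=0.25, utilisation=0.5))
```

With σ = 0.5 the shares are proportional to the square roots of the budgets, so a buyer with nine times the budget of another receives only three times its share.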

To this end, agents compete with each other on behalf of providers to maximise their revenue. The revenue of a given agent is maximised by winning the highest possible number of auctions with higher values while simultaneously minimising its loss and non-participation rates, where non-participation means that the agent is not able to participate because of unavailability of resources. Therefore, the reward function represents the social welfare of the providers at the end of each allocation. The episodic reward $Rwd$ is computed based on the allocation rule, and the individual rewards $rwd_i$ are then computed based on their corresponding action values. Specifically, at the end of each episode, all the bidding agents receive rewards based on their chosen actions, such that $rwd_i : S \times Act_1 \times \cdots \times Act_n \mapsto Rwd$. Furthermore, to reduce complexity, the reward values are not updated after each auction; instead, they are updated only at the end of each episode.

3 ONLINE PRICE OPTIMISATION

The online price optimisation is performed by the RTRM auctioneer on behalf of the potential providers. Agents collect the base prices from the potential providers and then, on their behalf, communicate with the online pricing optimisation module. Using this module, RTRM optimises the base prices of all the potential providers, aiming to maximise their revenue. In this way, agents in the RTRM platform dynamically update the offered price of the requested resources, taking into account the limited volume of available resources. In this setting, the agent's job is to optimise the resource prices within the allocation deadline. Therefore, the proposed online price optimisation algorithm adopts a reinforcement learning scheme (Konda and Tsitsiklis, 2000) to handle the presumed real-time scenario. The primary objective of the proposed real-time pricing algorithm is to optimise the offered base price.

In this context, the initial joint state $S_0$ of all the agents represents the initial available resources. Each agent aims to select an adjustment multiplier (action) that maximises its total expected future revenue. These future revenues are discounted by the factor $\gamma$ at each time step. The future reward for agent $i$ is denoted as $Rwd_i = \sum_{t=0}^{t_{max}} rwd_i^t$, where $t_{max}$ is the time-step at which the bidding process ends, i.e., the episode length. Then, the Q function (Sutton et al., 1998) for agent $i$ is computed using Equation (4):

$$Q_i^{\pi}(S, a_t) = \mathbb{E}_{\pi, \tau}\left[\sum_{t=0}^{t_{max}} rwd_i^t \,\middle|\, S_0 = S, a_t\right] \tag{4}$$

where $\pi = \{\pi_1, \ldots, \pi_n\}$ is the set of joint policies of all the agents, and $a_t = \{act_{(1,j)}, \ldots, act_{(n,j)}\}$ denotes the list of bid multipliers (actions) for all the agents. Further, the next state $S'$ and the next action $a'$ are computed using the Bellman equation, as shown in Equation (5):

$$Q_i^{\pi}(S, a_t) = \mathbb{E}_{rwd, S'}\left[rwd(S, act) + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q_i^{\pi}(S', a')\right]\right] \tag{5}$$

On the other hand, the mapping function $\mu_i$ maps the shared state $S \equiv [A, \theta_t]$ of each agent $i$ to a selected action $act_{(i,j)}$, based on Equation (6). This mapping function $\mu$ represents the actor in the adopted actor-critic architecture (Konda and Tsitsiklis, 2000). Further, we derive Equation (7) from Equations (5) and (6) as follows:

$$act_{(i,j)} = \mu_i(S) = \mu_i([A, \theta_t]) \tag{6}$$

$$Q_i^{\mu}(S, act_{(1,j)}, \ldots, act_{(n,j)}) = \mathbb{E}_{rwd, S'}\left[rwd(S, act_{(1,j)}, \ldots, act_{(n,j)}) + \gamma\, Q_i^{\mu}(S', \mu_1(S'), \ldots, \mu_n(S'))\right] \tag{7}$$

As shown in Equation (7), $\mu = \{\mu_1, \ldots, \mu_n\}$ is the set of joint deterministic policies of all the agents. The goal of the proposed algorithm is therefore to learn an optimal policy for each agent so as to attain a Nash equilibrium (Hu et al., 1998). In such stochastic environments, each agent learns to behave optimally by learning an optimal policy $\mu_i$, which also depends on the optimal policies of the other co-existing providers. In the proposed algorithm, the equilibrium of the providers is achieved by gradually reducing the loss function $Loss(\vartheta_i)$ of the critic $Q_i^{\mu}$ with parameter $\vartheta_i$, as given in Equations (8) and (9). In these equations, $\mu' = \{\mu'_1(\vartheta'_1), \ldots, \mu'_n(\vartheta'_n)\}$ represents the set of learning policies of the target actors; each of these actors has a delayed parameter $\vartheta'_i$.

$$Loss(\vartheta_i) = \left(y - \gamma\, Q_i^{\mu}(S, act_{(1,j)}, \ldots, act_{(n,j)})\right)^2 \tag{8}$$

$$y = rwd_i + \gamma\, Q_i^{\mu'}(S', \mu'_1(S'), \ldots, \mu'_n(S')) \tag{9}$$

In this context, $Q_i^{\mu'}$ represents the learning policy of the target critic, which also has a corresponding set of delayed parameters $\vartheta'_i$ for each actor, and $(S, act_{(1,j)}, \ldots, act_{(n,j)}, rwd_i, S')$ represents a transition tuple that is pushed into a replay memory $\Delta$. Further, each agent's policy $\mu_i$, with parameters $\vartheta_i$, is learned based on Equation (10).

$$\nabla_{\vartheta_i} J \approx \sum_{j} \nabla_{\vartheta_i} \mu_i([A, j])\, \nabla_{act_{(i,j)}} Q_i^{\mu}(S, act_{(1,j)}, \ldots, act_{(n,j)}) \tag{10}$$

Algorithm 1 depicts the online price optimisation algorithm. It takes three inputs: (i) the buyers' profiles $\theta_t$, (ii) the concatenated resource state of the agents $A_i$, $\forall i \in P$, and (iii) the cumulative rewards from all the providers. The algorithm outputs the adjustment multipliers for each pair of agent $i$ and buyer $j \in \theta_t$.

Algorithm 1: Online Price Optimisation.

Input: $A$, $\theta_t$
Output: $a_t$ (bid multipliers)
Initialise: critic $Q_i(S, act_{(1,b)}, \ldots, act_{(n,b)} \mid \vartheta_i)$, replay memory $\Delta$
Initialise: actor $\mu_i$, target actor $\mu'_i$
Initialise: target network $Q'_i$ with $\theta' \leftarrow \theta$, $\vartheta'_i \leftarrow \vartheta_i$ $\forall i \in n$
for episode = 1 to $E$ do
    Initialise: $S_0$
    for $t$ = 1 to $T_{max}$ do
        for arriving $\theta_t$ within $T_{max}$, $j \in \theta_t$ do
            Get $act_{(i,b)}$ using Equation (6)
            Calculate $\Gamma$ using Equation (2)
            Get $rwd_i$, where $i \in n$ and $A_i : \forall i \in P$
        end for
        $rwd = \sum_{i=1}^{n} \sum_{t=1}^{T_{max}} rwd_i^t$ (rewards within $T_{max}$)
        Push $(S, act_{(i,b)}, rwd, S')$ into $\Delta$ ($S'$ denotes the next state)
        $S \leftarrow S'$
        for agent $i$ = 1 to $n$ do
            Sample mini batch $(S, act_{(1,b)}, \ldots, act_{(n,b)}, rwd, S')$ from $\Delta$
            Update critic using Equation (8)
            Update actor using Equation (10)
            Update target network: $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$
        end for
    end for
end for
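Three building blocks of Algorithm 1 can be sketched independently of any deep learning framework: the replay memory ∆, the target value y of Equation (9), and the soft update of the target network in the last step of the inner loop. The sketch below is a simplified illustration under stated assumptions: network parameters are plain lists of floats, the critic output for the next state is passed in as a number, and all class and function names are hypothetical rather than taken from the original implementation.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (S, joint actions, rwd, S'); states and actions are kept as plain lists here.
Transition = Tuple[list, list, float, list]


class ReplayMemory:
    """Fixed-capacity buffer of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._buffer: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, transition: Transition) -> None:
        self._buffer.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Sample without replacement, never asking for more than is stored.
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self) -> int:
        return len(self._buffer)


def td_target(rwd: float, gamma: float, q_target_next: float) -> float:
    """Equation (9): y = rwd + gamma * Q'(S', mu'_1(S'), ..., mu'_n(S'))."""
    return rwd + gamma * q_target_next


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Target update as written in Algorithm 1: theta' <- tau*theta + (1 - tau)*theta'."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


if __name__ == "__main__":
    memory = ReplayMemory(capacity=2)
    for step in range(3):
        memory.push(([step], [0.1], 1.0, [step + 1]))
    print(len(memory))  # capacity bounds the number of stored transitions
    print(td_target(rwd=1.0, gamma=0.9, q_target_next=10.0))
    print(soft_update(target=[0.0, 2.0], online=[1.0, 4.0], tau=0.5))
```

A small τ keeps the target network close to its previous parameters, which is the usual reason for maintaining delayed copies of the actor and the critic.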

The trained agents then submit the offered price for every resource request on behalf of their corresponding providers. Based on the obtained optimal offered bids, the provider matching approach matches the requested resources to an appropriate provider, as discussed in the following section.

4 PROVIDER MATCHING

In this section, we present the proposed fairness-based provider matching mechanism. In this mechanism, the auctioneer matches each buyer's request to a suitable provider from a pool of available providers. First, upon receiving the buyers' profiles at time-step $t$, the appropriate preference of each buyer is estimated from its bid $bid_b$, $\forall b \in \theta_t$, using a novel multi-preference factor. Specifically, the multi-preference factor considers different quality preferences of the buyers while matching their requests with an appropriate provider. The set of $q$ quality preference parameters of provider $p \in P$ is denoted as $K_p \equiv (k_{p,1}, \ldots, k_{p,q})$; for example, $k_{p,utilisation}$ represents the value of the quality preference parameter utilisation. Based on these values, we compute the preference factor $\upsilon(p, b)$ $\forall p \in P$ and $\forall b \in \theta_t$ using Equation (11).

$$\upsilon(p, b) = \sum_{q \in K}^{|K|} N(k_{p,q}) \sqrt{bid_b} \tag{11}$$

where $|K|$ denotes the total number of quality preference parameters and $N(k_{p,q})$ is the normalised value of the quality parameter $q$, which is computed using the simple additive weighting mechanism (Zeng et al., 2003) as follows:

$$N(p, q) = \begin{cases} \dfrac{(k_q^{ref} - k_{p,q}) \times \psi}{k_q^{max} - k_q^{min}}, & \text{if } k_q^{max} - k_q^{min} \neq 0; \\ 1, & \text{if } k_q^{max} - k_q^{min} = 0. \end{cases} \tag{12}$$

where $\psi = 1$ if a higher value of the quality parameter is favourable for matching, and $\psi = -1$ otherwise. For example, $\psi = 1$ for the utilisation of the provider, whereas $\psi = -1$ for the offered selling price. In this way, we compute the preference factor for each pair of buyers and providers at time-step $t$. Finally, the resource request from buyer $b$ is matched with the provider that has the highest preference factor, provided that its offer is within the buyer's budget. Specifically, the winning provider $w_b$ for buyer $b \in \theta_t$ is computed using Equation (13).

$$w_b \equiv \arg\max_{p \in P} \upsilon(p, b) \ast \mathbf{1}(p, b) \tag{13}$$

where $\mathbf{1}(\cdot)$ is the indicator function, such that $\mathbf{1}(p, b) = 1$ if the offered price of provider $p$ is within $b$'s budget, i.e., $\sum_{i \in [1,l]} r_b^i \ast op_{p,b}^i \le bid_b$. Finally, in order to observe truthfulness (Myerson, 1981), we adopt the general second-price rule (Krishna, 2009). The payment is therefore computed using Equation (14):

$$pay(w_b, b) \equiv \sum_{i \in [1,l]} r_b^i \ast op_{p',b}^i \tag{14}$$

where $p' \equiv \arg\max_{p \in P \setminus w_b} \upsilon(p, b) \ast \mathbf{1}(p, b)$. In this way, all the online buyers' requests in $\theta_t$ are matched with appropriate available providers. In the next section, we present the results of the extensive experiments conducted to evaluate the proposed online resource matching approach for open markets.

Table 1: Types of providers.

Type  Processing (MIPS)  Memory (MB)  Storage (MB)  Bandwidth
1     6950               12032        26000         8000
2     3450               6144         84000         2650
3     4700               7168         48000         1750
4     7500               3840         30000         4000
5     6100               10752        60000         4700
6     4900               8704         47000         3600
7     3700               6144         36000         3200
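The normalisation of Equation (12) and the winner determination and payment steps of Equations (13) and (14) can be sketched as follows. The sketch is an illustration under several assumptions that are not stated in the text: the preference factors υ(p, b) are supplied as precomputed positive numbers, each provider's total offered price for the request is supplied as a single number, and a buyer with only one affordable provider pays that provider's own offer. All function names and example values are hypothetical.

```python
from typing import Dict, Optional, Tuple


def normalise(k: float, k_ref: float, k_min: float, k_max: float, psi: int) -> float:
    """Equation (12): scaled distance to the reference value, or 1 if the range is empty."""
    if k_max - k_min == 0:
        return 1.0
    return (k_ref - k) * psi / (k_max - k_min)


def match_and_price(
    scores: Dict[str, float],
    offered_totals: Dict[str, float],
    bid: float,
) -> Optional[Tuple[str, float]]:
    """Equations (13)-(14): best affordable provider wins and pays the runner-up's offer.

    scores         -- preference factor of each provider for this buyer (assumed > 0)
    offered_totals -- total offered price of each provider for the requested bundle
    bid            -- the buyer's maximum budget
    """
    # The indicator function keeps only providers whose offer fits the budget.
    feasible = {p: s for p, s in scores.items() if offered_totals[p] <= bid}
    if not feasible:
        return None  # the buyer loses this auction
    ranked = sorted(feasible, key=feasible.get, reverse=True)
    winner = ranked[0]
    # Assumption: with a single feasible provider there is no runner-up,
    # so the winner's own offer is used as the payment.
    runner_up = ranked[1] if len(ranked) > 1 else winner
    return winner, offered_totals[runner_up]


if __name__ == "__main__":
    scores = {"d1": 0.90, "d2": 0.70, "d3": 0.95}
    offered = {"d1": 300.0, "d2": 280.0, "d3": 520.0}
    print(match_and_price(scores, offered, bid=400.0))
    print(normalise(k=30.0, k_ref=50.0, k_min=10.0, k_max=50.0, psi=1))
```

In the example, d3 has the highest preference factor but exceeds the budget of 400, so d1 wins and the payment is taken from d2, the second-ranked affordable provider.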

5 EXPERIMENTAL SETUP AND RESULTS

This section presents the experimental results and the evaluation performed using Google cluster trace data (Chen et al., 2010; Buchbinder et al., 2007) to investigate the performance of the proposed online resource matching approach. We set the hyper-parameters as follows: discount factor $\gamma = 0.9$, learning rate $= 3 \times 10^{-4}$ and $\sigma = 0.5$, as these values gave comparatively better results. We compare the RTRM algorithm with the following benchmarks:

• Simple Allocation: the combinatorial double auction resource allocation approach (CDARA) (Samimi et al., 2016). This approach implements a fixed pricing strategy, wherein the provider with the lowest bid is the winner.

• Dynamic Allocation: the combinatorial auction-based approach (ICAA) (Kong et al., 2015). This approach implements a demand-based dynamic pricing strategy, in which the provider with the lowest bid is again the winner.

5.1 Experimental Setup

In this experimental setting, we build an online resource allocation scenario among a pool of seven independent providers, whose configurations are listed in Table 1. We sample the resource requests of the online buyers from the task events tables of the Google Cluster Trace (Wilkes, 2011). The buyers' bid values corresponding to their resource requests are sampled from the uniform distribution over [150, 500]. Similarly, we preset the base price of each resource by sampling from a predefined uniform distribution. Specifically, we choose the base prices per unit of Processing, Memory Usage and Storage Usage from [120, 150]$, [10, 70]$ and [40, 100]$, respectively. The time step t at which a buyer's job is sampled from the dataset is the buyer's arrival time, whereas its execution length and deadline are set using two random generators that take values in [1, 24] time-steps and [1, 12] time-steps, respectively. With this experimental setting, we initially train the RTRM approach for 12000 episodes, each of length t_max = 2000 time-steps. All the approaches are implemented in Python 3 and the experiments are performed on a 6-core Intel Xeon 3.6 GHz processor with 32 GB RAM. We consider three quality preference parameters, namely price, resource utilisation and average waiting time. We also set the maximum number of online buyers at each time step to m = 24.
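The synthetic part of this setup, i.e., the bids, base prices, execution lengths and deadlines, can be reproduced in spirit with a short Python sketch. It only covers the quantities whose distributions are stated above; the resource requests themselves come from the trace and are not modelled here. The dictionary keys, the function names and the fixed seed are illustrative assumptions.

```python
import random
from typing import Dict, List

# Per-unit base price ranges quoted in the setup (the key names are illustrative).
BASE_PRICE_RANGES = {
    "processing": (120.0, 150.0),
    "memory": (10.0, 70.0),
    "storage": (40.0, 100.0),
}


def sample_base_prices(rng: random.Random, n_providers: int = 7) -> List[Dict[str, float]]:
    """Draw one base price per resource type for each provider, uniformly at random."""
    return [
        {res: rng.uniform(lo, hi) for res, (lo, hi) in BASE_PRICE_RANGES.items()}
        for _ in range(n_providers)
    ]


def sample_buyer(rng: random.Random, arrival_time: int) -> Dict[str, float]:
    """Draw the randomly generated attributes of one buyer."""
    return {
        "arrival_time": arrival_time,
        "bid": rng.uniform(150.0, 500.0),        # budget
        "execution_length": rng.randint(1, 24),  # in time-steps, inclusive bounds
        "deadline": rng.randint(1, 12),          # in time-steps, inclusive bounds
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that a run can be repeated
    providers = sample_base_prices(rng)
    buyers = [sample_buyer(rng, arrival_time=0) for _ in range(24)]
    print(len(providers), len(buyers))
```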

There are two primary objectives of this research: (1) to evaluate the performance of the providers and buyers in the market, and (2) to evaluate the bidder drop in the market. The performance of the providers and buyers is measured in terms of their utilities, which are computed using Equation (15) and Equation (16), as follows:

$$U(p) = \sum_{\forall b \in B} pay(p, b) - \sum_{\forall r} bp_p^r \tag{15}$$

$$U(b) = bid_b - pay(\cdot, b) \tag{16}$$

Further, to evaluate the bidder (buyer) drop in the market, similar to (Hassanzadeh et al., 2016), we compute the buyer drop based on the number of times a particular buyer loses in each episode. Specifically, we define a buyer as having dropped from the market if it loses in five consecutive auction episodes.
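The buyer utility of Equation (16) and the drop rule defined above are simple enough to state directly in code. The following is a minimal sketch; the function names and the representation of a buyer's history as a sequence of win/lose flags are assumptions made for illustration.

```python
from typing import Sequence


def buyer_utility(bid: float, payment: float) -> float:
    """Equation (16): the buyer's budget minus what it actually pays."""
    return bid - payment


def has_dropped(won: Sequence[bool], limit: int = 5) -> bool:
    """Return True if the history contains `limit` consecutive losses.

    won -- one flag per auction episode, True if the buyer was allocated.
    """
    streak = 0
    for outcome in won:
        # A win resets the losing streak; a loss extends it.
        streak = 0 if outcome else streak + 1
        if streak >= limit:
            return True
    return False


if __name__ == "__main__":
    print(buyer_utility(bid=400.0, payment=280.0))
    print(has_dropped([True, False, False, False, False, False]))
    print(has_dropped([False] * 4 + [True] + [False] * 4))
```

The second history never reaches five losses in a row because the single win resets the streak, so that buyer is still counted as active.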


5.2 Evaluation

In this section, we evaluate and discuss the performance of the RTRM mechanism based on different characteristics. To begin with, Figure 2 depicts the average episodic utility of the seven providers over 12000 episodes. Figure 2 indicates that RTRM increases the utility of all the providers in the open market. One interesting observation is that certain providers with negligible utilities in the benchmarks have higher utilities under RTRM. For instance, providers d5 and d6 have the lowest utilities under the CDARA mechanism, but their utilities almost triple under RTRM. This shows that the introduced fairness mechanism helps the lowest-performing providers to enhance their utilities. Overall, the utilities of all the providers are higher in RTRM than in the other two benchmarks.


Figure 2: The utilities of the providers.

Further, Figure 3 depicts the average episodic utility of the 24 buyers who arrive at each time step. Figure 3 shows that buyers in the proposed RTRM approach pay lower prices, i.e., their utility is higher than in the other benchmarks. Specifically, the utilities of the buyers in the RTRM mechanism increase by at least 40% compared with the other two mechanisms.

Figure 3: The utilities of the buyers.

Furthermore, to isolate the impact of the fairness mechanism on the providers' wins in the market, we evaluate the RTRM mechanism with and without the fairness mechanism. Figure 4 depicts the episodic average cumulative sum of the number of allocations over 2000 episodes, each of length (rounds) t_max = 100. The figure shows that, unlike in RTRM, every provider in RTRM-Fairness wins at least once in an episode. Specifically, provider D4 has zero wins in RTRM. This shows that the fairness mechanism ensures fairness among the providers' wins. Some providers have comparatively fewer wins in RTRM-Fairness. However, this promotes social welfare in the market by giving a chance to non-performing, less competitive providers.

Figure 4: Impact of fairness on providers win.

Finally, we evaluate the bidder drop in the market by comparing the total number of buyers dropped in RTRM and RTRM-Fairness. Figure 5 illustrates the average cumulative sum of the number of buyer drops.

Figure 5: The number of buyer drops.

Overall, Figure 5 shows that the number of drops gradually increases with the number of episodes. However, the number of drops in RTRM is exponentially higher than in RTRM-Fairness. This means the fairness mechanism improves resource utilisation in the market. In brief, the above results and discussion show that the proposed approach outperforms the benchmarks: it maximises the utility of the participants and minimises the bidder drop problem in the market through a novel fairness mechanism.

6 CONCLUSION

In this paper, we introduce a real-time resource matching mechanism for online open markets. In such an open market setting, the proposed RTRM matches the resource requests from the buyers to providers using a double-auction paradigm. Specifically, RTRM implements a multi-agent environment that optimises the offered selling prices for all the independent providers based on the online pricing algorithm. In addition, RTRM implements a fair matching algorithm to dynamically match the buyers' resource demands in the open market. The proposed approach thereby enables both types of participants to maximise their utilities and their participation rate. Besides, the proposed mechanism enhances resource utilisation and minimises the bidder drop problem through the novel fairness mechanism. The experimental results evaluate the efficiency of RTRM by comparing the utilities of both types of participants. In the future, we aim to develop a mechanism that encourages cooperative behaviour in such a competitive market by designing a resource sharing mechanism among the different providers to fulfil resource requests.

REFERENCES

Buchbinder, N., Jain, K., and Naor, J. S. (2007). Online

primal-dual algorithms for maximizing ad-auctions

revenue. In European Symposium on Algorithms,

pages 253–264. Springer.

Cai, H., Ren, K., Zhang, W., Malialis, K., Wang, J., Yu, Y.,

and Guo, D. (2017). Real-time bidding by reinforce-

ment learning in display advertising. In Proceedings

of the Tenth ACM International Conference on Web

Search and Data Mining, pages 661–670. ACM.

Charpentier, A., Elie, R., and Remlinger, C. (2021). Rein-

forcement learning in economics and ﬁnance. Com-

putational Economics, pages 1–38.

Chen, Y., Ganapathi, A. S., Grifﬁth, R., and Katz, R. H.

(2010). Analysis and lessons from a publicly available

google cluster trace. EECS Department, University of

California, Berkeley, Tech. Rep. UCB/EECS-2010-95,

94.

Fink, A. M. (1964). Equilibrium in a stochastic n-person game. Journal of Science of the Hiroshima University, Series A-I (Mathematics), 28(1):89–93.

Hassanzadeh, R., Movaghar, A., and Hassanzadeh, H. R.

(2016). A multi-dimensional fairness combinatorial

double-sided auction model in cloud environment. In

2016 8th International Symposium on Telecommuni-

cations (IST), pages 672–677. IEEE.

Hu, J., Wellman, M. P., et al. (1998). Multiagent rein-

forcement learning: theoretical framework and an al-

gorithm. In ICML, volume 98, pages 242–250. Cite-

seer.

Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algo-

rithms. In Advances in neural information processing

systems, pages 1008–1014.

Kong, Y., Zhang, M., and Ye, D. (2015). An auction-based

approach for group task allocation in an open network

environment. The Computer Journal, 59(3):403–422.

Krishna, V. (2009). Auction theory. Academic press.

Kumar, D., Baranwal, G., Raza, Z., and Vidyarthi, D. P.

(2019). Fair mechanisms for combinatorial reverse

auction-based cloud market. In Information and Com-

munication Technology for Intelligent Systems, pages

267–277. Springer.

Lee, C., Wang, P., and Niyato, D. (2013). A real-time group

auction system for efﬁcient allocation of cloud inter-

net applications. IEEE Transactions on Services Com-

puting, 8(2):251–268.

Li, X., Ding, R., Liu, X., Liu, X., Zhu, E., and Zhong, Y.

(2016). A dynamic pricing reverse auction-based re-

source allocation mechanism in cloud workﬂow sys-

tems. Scientiﬁc Programming, 2016:17.

Murillo, J., Muñoz, V., López, B., and Busquets, D. (2008).

A fair mechanism for recurrent multi-unit auctions. In

German Conference on Multiagent System Technolo-

gies, pages 147–158. Springer.

Myerson, R. B. (1981). Optimal auction design. Mathemat-

ics of operations research, 6(1):58–73.

Prasad, G. V., Prasad, A. S., and Rao, S. (2016). A combi-

natorial auction mechanism for multiple resource pro-

curement in cloud computing. IEEE Transactions on

Cloud Computing, 6(4):904–914.

Samimi, P., Teimouri, Y., and Mukhtar, M. (2016). A com-

binatorial double auction resource allocation model

in cloud computing. Information Sciences, 357:201–

216.

Shen, W., Feng, Y., and Lopes, C. V. (2019). Multi-winner

contests for strategic diffusion in social networks. In

Proceedings of the AAAI Conference on Artiﬁcial In-

telligence, volume 33, pages 6154–6162.

Skaperdas, S. (1996). Contest success functions. Economic

theory, 7(2):283–290.

Sutton, R. S., Barto, A. G., et al. (1998). Introduction to

reinforcement learning, volume 2. MIT press Cam-

bridge.

Toosi, A. N., Vanmechelen, K., Khodadadi, F., and Buyya,

R. (2016). An auction mechanism for cloud spot mar-

kets. ACM Transactions on Autonomous and Adaptive

Systems (TAAS), 11(1):2.

Wilkes, J. (2011). More Google cluster data. Google re-

search blog. Posted at http://googleresearch.blogspot.

com/2011/11/more-google-cluster-data.html.

Yuan, S., Wang, J., and Zhao, X. (2013). Real-time bidding

for online advertising: measurement and analysis. In

Proceedings of the Seventh International Workshop on

Data Mining for Online Advertising, page 3. ACM.

Zaman, S. and Grosu, D. (2013). Combinatorial auction-

based allocation of virtual machine instances in

clouds. Journal of Parallel and Distributed Comput-

ing, 73(4):495–508.

Zeng, L., Benatallah, B., Dumas, M., Kalagnanam, J., and

Sheng, Q. Z. (2003). Quality driven web services

composition. In Proceedings of the 12th interna-

tional conference on World Wide Web, pages 411–421.

ACM.
