Data Partitioning Strategies for Simulating non-IID Data Distributions in the DDM-PS-Eval Evaluation Platform

Mikołaj Markiewicz, Jakub Koperwas

2022

Abstract

Nowadays, the size of the various datasets collected worldwide is growing rapidly. These data are stored in different data centres or directly on IoT devices, and are thus located in different places. Data stored in different locations may be uniformly distributed and convergent in terms of the information carried, and are then known as independent and identically distributed (IID) data. In the real world, data collected in different geographic regions tend to differ slightly or have completely different characteristics, and are then known as non-IID data. Increasing numbers of new algorithms have been implemented to work with such distributed data without the need to download all the data to one place. However, there is no standardised way of validating these algorithms, and they are typically tested on IID data, which are uniformly distributed. The issue of non-IID data is still an open problem for many algorithms, although the main categories of "non-IID-ness" have been defined. The purpose of this paper is to introduce new data partitioning strategies and to demonstrate the impact of non-IID data on the quality of the results of distributed processing. We propose multiple strategies for dividing a single dataset into multiple partitions to simulate each of the major categories of non-IID data problems faced by distributed algorithms. The proposed methods of data partitioning, integrated with the DDM-PS-Eval platform, will enable the validation of future algorithms on datasets with different data distributions. A brief evaluation of the proposed methods is presented using several distributed clustering and classification algorithms.
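To make the idea of simulating non-IID partitions from a single dataset more concrete, the following is a minimal sketch of one widely used approach, label-skew partitioning with Dirichlet-distributed class shares. It is not taken from the paper and is not the DDM-PS-Eval API; the function name `partition_label_skew` and its parameters `num_partitions`, `alpha` and `seed` are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict


def partition_label_skew(labels, num_partitions, alpha, seed=0):
    """Split sample indices into partitions with skewed label proportions.

    Illustrative sketch only (hypothetical helper, not the paper's method).
    For every class, the share given to each partition is drawn from a
    symmetric Dirichlet(alpha) distribution: a small alpha (> 0) gives
    strongly skewed, non-IID partitions; a large alpha approaches IID.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    partitions = [[] for _ in range(num_partitions)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)

        # A Dirichlet sample is a vector of Gamma draws divided by their sum.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_partitions)]
        total = sum(weights)

        # Cut the shuffled indices at the cumulative share boundaries.
        start = 0
        cumulative = 0.0
        for p, weight in enumerate(weights):
            cumulative += weight
            if p == num_partitions - 1:
                end = len(indices)  # the last partition takes the remainder
            else:
                end = round(len(indices) * cumulative / total)
            partitions[p].extend(indices[start:end])
            start = end
    return partitions


if __name__ == "__main__":
    # Toy dataset: 300 samples spread evenly over 3 classes.
    toy_labels = [i % 3 for i in range(300)]
    for p, part in enumerate(partition_label_skew(toy_labels, 4, alpha=0.5)):
        counts = [sum(1 for i in part if toy_labels[i] == c) for c in range(3)]
        print(f"partition {p}: per-class counts {counts}")
```

Every sample is assigned to exactly one partition, so the union of the partitions reproduces the original dataset. Label skew is only one of the non-IID categories mentioned in the abstract; other categories, such as feature skew or quantity skew, would need different splitting rules.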


Paper Citation


in Harvard Style

Markiewicz M. and Koperwas J. (2022). Data Partitioning Strategies for Simulating non-IID Data Distributions in the DDM-PS-Eval Evaluation Platform. In Proceedings of the 17th International Conference on Software Technologies - Volume 1: ICSOFT, ISBN 978-989-758-588-3, pages 307-318. DOI: 10.5220/0011290200003266


in Bibtex Style

@conference{icsoft22,
author={Mikołaj Markiewicz and Jakub Koperwas},
title={Data Partitioning Strategies for Simulating non-IID Data Distributions in the DDM-PS-Eval Evaluation Platform},
booktitle={Proceedings of the 17th International Conference on Software Technologies - Volume 1: ICSOFT},
year={2022},
pages={307-318},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011290200003266},
isbn={978-989-758-588-3},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 17th International Conference on Software Technologies - Volume 1: ICSOFT
TI - Data Partitioning Strategies for Simulating non-IID Data Distributions in the DDM-PS-Eval Evaluation Platform
SN - 978-989-758-588-3
AU - Markiewicz M.
AU - Koperwas J.
PY - 2022
SP - 307
EP - 318
DO - 10.5220/0011290200003266