Scoring-based DOM Content Selection with Discrete Periodicity Analysis

Thomas Osterland, Thomas Rose

2022

Abstract

The comprehensive analysis of large data volumes is shaping the future. It enables decision-making based on empirical evidence instead of expert experience, and its use for training machine learning models enables new use cases in image recognition, speech analysis, regression, and classification. One problem with data is that it is often not readily available in aggregated form. Instead, it is necessary to search the web for information and laboriously mine websites for specific data. This is known as web scraping. In this paper we present an interactive, scoring-based approach for scraping specific information from websites. We propose a scoring function that enables the adaptation of threshold values to select specific sets of data. We combine the scoring of paths in a web page's DOM with periodicity analysis to enable the selection of complex patterns in structured data. This allows non-expert users to train content selection models and to label classification data for supervised learning.


Paper Citation


in Harvard Style

Osterland T. and Rose T. (2022). Scoring-based DOM Content Selection with Discrete Periodicity Analysis. In Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-569-2, pages 280-289. DOI: 10.5220/0011116300003179


in Bibtex Style

@conference{iceis22,
author={Thomas Osterland and Thomas Rose},
title={Scoring-based DOM Content Selection with Discrete Periodicity Analysis},
booktitle={Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2022},
pages={280-289},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011116300003179},
isbn={978-989-758-569-2},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 24th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Scoring-based DOM Content Selection with Discrete Periodicity Analysis
SN - 978-989-758-569-2
AU - Osterland T.
AU - Rose T.
PY - 2022
SP - 280
EP - 289
DO - 10.5220/0011116300003179