vide an approach to implement information extraction
for arbitrary documents or data formats, e.g., tables.
Additionally, ARTIFACT addresses more complex IE
problems because it includes tasks such as converting
non-machine-readable documents into machine-readable
ones, e.g., PDF to text.
By applying the pattern in a real-world project, we
have shown that it helps companies automate their
information extraction process step by step and gain
business value.
In the course of future development, we would like
to add a classification mechanism to the information
extraction processes. We assume that there are several
document classes with different characteristics, and
that a given pipeline may perform differently on each
document class.
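
The idea of routing documents to class-specific pipelines can be illustrated with a minimal sketch. It is not part of ARTIFACT: the class names, the placeholder classification rule, and the functions `classify`, `table_pipeline`, `text_pipeline`, and `extract` are all hypothetical and only show how a classifier could select a pipeline per document class.

```python
from typing import Callable, Dict, List

# A pipeline maps a document to extracted information (hypothetical type).
Pipeline = Callable[[str], Dict[str, object]]


def classify(document: str) -> str:
    """Placeholder rule: a real system would use a trained classifier."""
    return "table" if "|" in document else "free_text"


def table_pipeline(document: str) -> Dict[str, object]:
    # Split a pipe-separated row into trimmed, non-empty cells.
    cells: List[str] = [c.strip() for c in document.split("|") if c.strip()]
    return {"pipeline": "table", "cells": cells}


def text_pipeline(document: str) -> Dict[str, object]:
    # Tokenize free text on whitespace.
    return {"pipeline": "free_text", "tokens": document.split()}


# Assumed mapping from document class to the pipeline chosen for it.
PIPELINES: Dict[str, Pipeline] = {
    "table": table_pipeline,
    "free_text": text_pipeline,
}


def extract(document: str) -> Dict[str, object]:
    """Classify the document, then run the pipeline registered for its class."""
    return PIPELINES[classify(document)](document)
```

In such a design, the mapping from class to pipeline could be filled from evaluation results per document class rather than fixed by hand.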
Additionally, we would like to add caching mechanisms
to the pipeline runs. As shown in Section 5, some
pipeline parts recur when processing a specific
document. For performance reasons, the system could
cache the intermediate results of specific steps.
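
One possible way to cache intermediate results is sketched below. This is an assumption-laden illustration, not the implementation used in the project: the class `StepCache`, the step names, and the toy step functions are hypothetical. The cache key is derived from the document content and the sequence of steps applied so far, so two pipelines that share a prefix of steps reuse the shared results.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# A step is a (name, function) pair; names are assumed to identify the step.
Step = Tuple[str, Callable[[bytes], bytes]]


class StepCache:
    """Caches the output of every pipeline prefix per document (hypothetical)."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.hits = 0

    def run(self, document: bytes, steps: List[Step]) -> bytes:
        data = document
        # Start the key from the document content itself.
        key = hashlib.sha256(document).hexdigest()
        for name, func in steps:
            # Extend the key with the step name, so it encodes the full prefix.
            key = hashlib.sha256((key + "|" + name).encode()).hexdigest()
            if key in self._store:
                self.hits += 1
                data = self._store[key]
            else:
                data = func(data)
                self._store[key] = data
        return data


# Toy steps that record when they are actually executed.
calls: List[str] = []


def pdf_to_text(data: bytes) -> bytes:
    calls.append("pdf_to_text")
    return data.upper()


def extract_a(data: bytes) -> bytes:
    calls.append("extract_a")
    return data + b"-A"


def extract_b(data: bytes) -> bytes:
    calls.append("extract_b")
    return data + b"-B"


cache = StepCache()
result_a = cache.run(b"doc", [("pdf_to_text", pdf_to_text), ("extract_a", extract_a)])
result_b = cache.run(b"doc", [("pdf_to_text", pdf_to_text), ("extract_b", extract_b)])
```

In this sketch the second run reuses the cached output of the shared `pdf_to_text` step, so that step executes only once for the document. The scheme assumes that a step name uniquely identifies a deterministic step.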
Beyond that, we would like to optimize the choice
among possible pipelines when their results are nearly
equal. The system should be able to take other metrics,
such as the expected pipeline performance, into account when
