• DJ aka. dj-on-github/sp800_22_tests (Johnston, 2017) (114 stars, 29 forks): The earliest implementation in the list was created by David Johnston, an engineer at Intel and author of "Random Number Generators, Principles and Practices" (Johnston, 2018). This implementation is "simplistic" in terms of coding concepts.
• LP aka. InsaneMonster/NistRng (Pasqualini, 2019) (18 stars, 8 forks): This work by Luca Pasqualini (SAILab) of the University of Siena is inspired by the work of David Johnston. The implementation is the "most advanced" in terms of coding concepts and is available as a pip package; it is therefore the most convenient to use and was the starting point for this work (a minimal usage sketch follows this list).
• SA aka. stevenang/randomness_testsuite (Ang, 2018) (50 stars, 12 forks): The independent entry in this comparison, implemented by Steven Kho Ang. This implementation is characterised by the provision of an additional graphical user interface.
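Since the LP implementation is distributed as a pip package and served as the starting point for this work, a minimal usage sketch is given below. The function and battery names (pack_sequence, check_eligibility_all_battery, run_all_battery, SP800_22R1A_BATTERY) are taken from the package's public documentation and should be treated as assumptions that may differ between versions.

# Hedged sketch: testing a single sequence with the NistRng (LP) package.
# Names follow the package documentation and may differ between versions.
import numpy as np
from nistrng import (pack_sequence, check_eligibility_all_battery,
                     run_all_battery, SP800_22R1A_BATTERY)

# Example input: 1000 integers standing in for a sample drawn from a dataset.
sequence = np.random.randint(-128, 128, 1000, dtype=int)
binary_sequence = pack_sequence(sequence)  # unpack the integers into a bit array

# Keep only the SP 800-22 tests that are applicable to this input length.
eligible = check_eligibility_all_battery(binary_sequence, SP800_22R1A_BATTERY)

# Run the eligible tests and report the P-value (score) and pass/fail per test.
for result, elapsed_time in run_all_battery(binary_sequence, eligible, False):
    print(f"{result.name}: P-value={result.score:.3f}, passed={result.passed}")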
4.2 Datasets
This work is not focused on evaluating a single source of data of a given length. Instead, results are acquired by feeding data to each implementation and validating the number of passed tests in relation to input length. This yields a two-dimensional representation per implementation, with input length on the x-axis and the sum of S-values or P-values on the y-axis. A third dimension is added by the different datasets, which are staggered by the amount of expected "structure" (a small sketch of this sweep follows the dataset list). Each employed dataset is a large blob of more than 100 MB of data in a common "language", from which a sample is extracted at random for each specific test run. The datasets are chosen to cover all interesting cases rather than to focus on any specific scenario; therefore, they are introduced without greater generation details. Depending on the generation process, the employed datasets are categorized into classes: random, cipher, or encoding. Finally, the datasets are ordered as listed, from expected "randomness" to "structure". Many different types of data may be, and have been, compared using our method. This list reflects only a small selection of datasets with noteworthy attributes spanning the whole spectrum:
1. RND: "True" randomness acquired via RANDOM.ORG (Haahr, 2018) from atmospheric noise, as the epitome of randomness.
2. DES: Weak block cipher in ECB mode using a weak key, chosen to add structural distortion compared to other cipher modes; generated with OpenSSL.
3. ZIP: File archive of the aclImdb (Maas et al., 2011) dataset to evaluate the amount of chaos introduced by compression.
4. TXT: English text consisting of printable ASCII, based on a large movie review dataset (Maas et al., 2011), which further serves as the basis for generation of the DES and ZIP datasets.
5. NUL: Straight binary sequence of all zeros (null bytes), which, together with the all-ones sequence, can be argued to be the epitome of structure.
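The sweep described above can be summarised in a few lines of illustrative Python. The helper run_sp800_22 below is hypothetical and stands in for any of the three wrapped implementations; it is assumed to return one S-value (0 or 1) per executed test.

# Illustrative sketch of the evaluation sweep, not the framework's actual code.
# run_sp800_22() is a hypothetical wrapper around one of the implementations
# and is assumed to return one S-value (0 or 1) per executed test.
import random

def sample_from_blob(blob: bytes, length: int) -> bytes:
    # Draw a random contiguous sample of the requested length from a dataset blob.
    start = random.randrange(0, len(blob) - length + 1)
    return blob[start:start + length]

def s_value_sum(data: bytes, run_sp800_22) -> int:
    # y-axis value: how many tests consider the input random.
    return sum(run_sp800_22(data))

# x-axis: input lengths; y-axis: summed S-values; z-axis: one curve per dataset.
lengths = [2 ** k for k in range(10, 21)]
# curve_rnd = [s_value_sum(sample_from_blob(rnd_blob, n), run_sp800_22) for n in lengths]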
4.3 Framework
A custom framework has been developed to overcome considerable differences in how each implementation handles inputs and provides output. These differences make it cumbersome to operate the implementations and to compare their results manually. For this reason, a common interface was designed for all implementations, and wrapper functions were developed to enable large-scale automatic tests to be carried out. This interface was then embedded into custom applications based on Jupyter notebooks, providing an easy-to-use way to benchmark, plot, and analyze implementation characteristics such as result quality and runtime. The open-source Python libraries pandas (McKinney et al., 2010; Pandas Development Team, 2020) and matplotlib (Hunter, 2007) were invaluable here.
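The common interface itself is not reproduced in this paper; the sketch below only illustrates the kind of wrapper that makes the implementations interchangeable for automated benchmarking. Class and method names are hypothetical and not the framework's actual code.

# Hypothetical illustration of a common wrapper interface; names and structure
# are assumptions, not the framework's actual code.
from abc import ABC, abstractmethod

class SuiteWrapper(ABC):
    # Uniform entry point hiding each implementation's input/output quirks.

    @abstractmethod
    def run(self, data: bytes) -> dict:
        # Run all applicable SP 800-22 tests and return {test_name: p_value}.
        ...

class NistRngWrapper(SuiteWrapper):  # wrapper for the LP implementation
    def run(self, data: bytes) -> dict:
        # Convert the bytes into the bit format NistRng expects and collect P-values.
        ...

# Benchmark code can then iterate over wrappers, datasets, and input lengths
# without caring about implementation-specific details.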
All randomness tests return one straightforward metric, namely a probability, or P-value for short, in the range of zero to one, which summarizes the strength of the evidence against the null hypothesis. Tests are then evaluated against a decision rule based on a significance level of 10% for each test in the SP 800-22 test suite: any input for which a test returns a P-value above the decision mark is considered random and passes the test, with a P-value of 1 indicating that the input appears perfectly random under that test. The result of this comparison is hereafter defined as the S-Value (Success-Value). Some tests may have lower significance levels or additional checks for a positive S-Value. This work mainly focuses on S-Values as Key Performance Indicators (KPI) to ease comparisons over the whole spectrum, rather than gathering enough P-Values for generic P-Graphs. P-Values are suitable for evaluating data sources by generating vast amounts of data, but they are too volatile for evaluating isolated sequences, as can be seen in Figure 2.
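In code, the decision rule that turns a P-value into an S-value is a simple threshold comparison. The sketch below follows the 10% significance level named above; the exact comparison operator and the additional per-test checks vary between implementations and are omitted here.

# Decision rule sketch: map a test's P-value to an S-value (1 = pass, 0 = fail).
# The 10% significance level follows the description above; some tests apply
# lower levels or additional checks, which are omitted here.
SIGNIFICANCE_LEVEL = 0.10

def s_value(p_value: float, significance: float = SIGNIFICANCE_LEVEL) -> int:
    # The input is considered random by this test if its P-value clears the mark.
    return 1 if p_value >= significance else 0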
The test spectrum covers different scenarios in which the test suite may be applied, including the number of tests, the quality of the results, and the calculation speed in relation to different data lengths to be tested. One would think the number of tests is always the same when using only the same test suite,