Overall, 597 CVEs remain where the scoring sys-
tem must be applied. For 179 CVEs, the true patch
commits were not included in the retrieved com-
mits. The main reason was that the version range
was wrongly specified in the CVE. These CVEs are
skipped because the preceding steps are a prerequisite for evaluating the scoring system. The evaluation of the scoring system is conducted at the commit level. In total, 195,507 commits are included, and out of these, 549 patch commits must be identified, which shows that finding patch commits is a highly imbalanced problem.
For the scoring system, appropriate scores for the
five heuristics must be determined. Except for the
CWE Classifier heuristic, each heuristic score is rep-
resented by its F-measure value because it reflects the
heuristic’s precision and reliability. By creating three
different scoring systems using the F0.25-, F0.5-, or
F1-Score, the users can decide how much they value
precision and recall.
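For reference, the F-measure used for the three variants is the standard weighted harmonic mean of precision P and recall R,

    F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R},

where smaller values of \beta (e.g., 0.25) weight precision more strongly and \beta = 1 weights precision and recall equally.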
For the Description Fit heuristic, a commit receives the heuristic's points if its fit value is greater than or equal to a threshold. We evaluated different absolute and relative thresholds and found that the absolute threshold 0.2, with a precision of 86.29% and a recall of 12.54%, should be used if precision is highly important. Otherwise, both the F0.5- and F1-score are maximized at threshold 0.1, with a precision of 58.41% and a recall of 32.35%.
This shows that the heuristic works well in a few cases
but is limited in its applicability because commit mes-
sages are short and often use abbreviations for terms
occurring in the CVE description.
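The exact fit computation is not reproduced here; the following is a minimal sketch of one plausible variant, assuming the fit is the fraction of CVE description terms that also occur in the commit message (the function names and the matching rule are illustrative assumptions, not the paper's implementation):

    import re

    def tokenize(text):
        # lowercase word tokens; a simplistic stand-in for the actual preprocessing
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def description_fit(cve_description, commit_message):
        # hypothetical fit: share of CVE description terms found in the commit message
        desc_terms = tokenize(cve_description)
        msg_terms = tokenize(commit_message)
        if not desc_terms:
            return 0.0
        return len(desc_terms & msg_terms) / len(desc_terms)

    def description_fit_points(cve_description, commit_message, points, threshold=0.1):
        # the commit receives the heuristic's points only if the fit reaches the threshold
        fit = description_fit(cve_description, commit_message)
        return points if fit >= threshold else 0.0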
For the heuristics git-vuln-finder, URLs, and Files, no thresholds exist; these heuristics either succeed for a commit or they do not. Thus, precision, recall, and F-measure are calculated from the results over the entire data set. The performance values are listed in Table 2.
Table 2: Performance of the Heuristics URLs, Files and git-vuln-finder.

                      Num. CVEs   Precision in %   Recall in %
     URLs                     7           100.00        100.00
     Files                   50            49.68         93.75
     git-vuln-finder        295            22.70         41.67
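To illustrate how such binary heuristics can be checked, the following is a minimal sketch of the URLs and Files heuristics, assuming the URLs heuristic succeeds when a CVE reference URL points directly to the commit and the Files heuristic succeeds when a file changed by the commit is mentioned in the CVE text (names and matching rules are assumptions, not the paper's exact implementation):

    import os

    def urls_heuristic(cve_reference_urls, commit_hash):
        # succeeds if any reference URL of the CVE contains the commit hash,
        # e.g. a ".../commit/<hash>" link
        return any(commit_hash in url for url in cve_reference_urls)

    def files_heuristic(cve_text, changed_files):
        # succeeds if a file name touched by the commit appears in the CVE text
        names = {os.path.basename(path) for path in changed_files}
        return any(name and name in cve_text for name in names)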
The last heuristic is the CWE classifier, which was evaluated in further experiments. Three common classification algorithms suitable for small labeled data sets were evaluated using 10-fold cross-validation. The Gradient Boosting classifier outperformed the Random Forest and the Support Vector Machine (SVM). The average precision over the cross-validation folds is 0.85 and the recall is 0.73. Furthermore, undersampling techniques applied to the majority class were evaluated because the class CWE-79 included approx. 700 CVEs while the other classes included between 150 and 300 CVEs. The Tomek Links
method performed best and increased the precision by
2% and the recall by 1%. Thus, the final classifier
uses Gradient Boosting with Tomek Links undersam-
pling for CWE-79.
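A minimal sketch of this setup using scikit-learn and imbalanced-learn is shown below; the TF-IDF feature extraction and default hyperparameters are assumptions, as the paper does not specify them:

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_validate
    from imblearn.under_sampling import TomekLinks

    def evaluate_cwe_classifier(descriptions, cwe_labels):
        # descriptions: CVE description strings, cwe_labels: their CWE classes
        X = TfidfVectorizer().fit_transform(descriptions).toarray()

        # remove Tomek links to undersample the majority class (CWE-79)
        X_res, y_res = TomekLinks().fit_resample(X, cwe_labels)

        clf = GradientBoostingClassifier()
        scores = cross_validate(clf, X_res, y_res, cv=10,
                                scoring=("precision_macro", "recall_macro"))
        return (scores["test_precision_macro"].mean(),
                scores["test_recall_macro"].mean())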
Since the CWE classifier only categorizes vulnerability types, it is used for discarding commits unrelated to the vulnerability type. Only the performance values from the CWE classifier evaluation can be used as a metric because there are too few CVEs in the test data set with corresponding CWEs to reevaluate the classifier. We use recall as the score because it reflects how reliably the classifier discards negative commits. The following recall values apply: 0.97 (CWE-79), 0.90 (CWE-119), 0.61 (CWE-22), 0.46 (CWE-89). The CWE classifier can be applied to 22% of the CVEs included in the commit scoring system evaluation.
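One way this recall-based score could enter the scoring is sketched below, assuming a commit whose predicted CWE differs from the CVE's CWE is penalized by the class-specific recall; the exact combination with the other heuristic scores is an assumption, not the paper's stated rule:

    CWE_RECALL = {"CWE-79": 0.97, "CWE-119": 0.90, "CWE-22": 0.61, "CWE-89": 0.46}

    def cwe_penalty(cve_cwe, predicted_cwe):
        # hypothetical: penalize a commit by the class-specific recall when the
        # classifier predicts a different vulnerability type than the CVE's CWE
        if cve_cwe not in CWE_RECALL:
            return 0.0          # classifier not applicable to this CVE
        if predicted_cwe == cve_cwe:
            return 0.0          # no penalty when the types agree
        return CWE_RECALL[cve_cwe]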
After creating the scores for all heuristics, the final
threshold, below which commits are discarded, needs
to be determined by applying the heuristics to all com-
mits. For each CVE, precision and recall are calcu-
lated so that all CVEs are equally weighted. Finally,
the F-measures are calculated to select the best thresh-
olds. To evaluate how well this approach performs
on unseen data, 10-fold cross-validation is conducted.
The heuristic scores and the thresholds are selected
based on the training set and the performance is eval-
uated on the test set. The F0.25 strategy shows a precision of 80.17% and a recall of 22.70%, whereas the F1 strategy has a precision of 47.55% and a recall of 47.41%. The metrics of F0.5 lie between these values. Considering that the data set is highly imbalanced, with 549 patches and 195,507 non-patches, a precision of 80% is quite good. Moreover, it is reasonable to offer different precision levels as there are notable differences in the metrics.
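A minimal sketch of this threshold selection, assuming per-CVE precision and recall are averaged before computing the F-beta score (function and variable names are illustrative, not the paper's code):

    def select_threshold(cve_commit_scores, cve_patches, thresholds, beta):
        # cve_commit_scores: {cve_id: {commit: score}}
        # cve_patches: {cve_id: set of true patch commits}
        best = None
        for t in thresholds:
            precisions, recalls = [], []
            for cve_id, scores in cve_commit_scores.items():
                predicted = {c for c, s in scores.items() if s >= t}
                true = cve_patches[cve_id]
                tp = len(predicted & true)
                precisions.append(tp / len(predicted) if predicted else 0.0)
                recalls.append(tp / len(true) if true else 0.0)
            # average per CVE so that every CVE is weighted equally
            p = sum(precisions) / len(precisions)
            r = sum(recalls) / len(recalls)
            f = (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0
            if best is None or f > best[1]:
                best = (t, f)
        return best[0]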
The approach was applied to the entire data set to
create the final scoring system. Figure 1 shows the
precision-recall curves of the three precision levels.
For both the highest precision level with the F0.25-score and the F0.5-score precision level, the final threshold is 0.3; for the F1-score, it is 0.2. The plot illustrates that any threshold above zero results in a recall decrease of 40%, meaning that for many patch commits no heuristic succeeds.
In summary, the results show that it is not trivial to
identify the patch commits. The biggest challenge is
that patches are not always described appropriately in
the commit message and might even be included in
a bigger merge commit. Additionally, it is hard to
identify the correct patch if multiple vulnerabilities of
the same type exist in the repository.