ers. The similarity was relatively low for the three
detection layers in the head.
Comparing CKA similarity values for all layer pairs
revealed a block-like structure that mirrors the
different parts of the YOLOv3 architecture.
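The layer-vs-layer comparison rests on linear CKA (Kornblith et al., 2019), which measures how similarly two layers represent the same set of inputs. As a minimal sketch (the function name and matrix shapes are illustrative, not the paper's actual implementation): given activation matrices with one row per example, linear CKA is the normalized Frobenius norm of their cross-covariance.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape
    (n_examples, n_features); features may differ between X and Y."""
    # Center each feature over the examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 normalized by ||X^T X||_F * ||Y^T Y||_F,
    # giving a similarity in [0, 1].
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (norm_x * norm_y)

# Filling an L x L matrix of linear_cka values over all layer pairs
# of two networks produces the kind of block structure described above.
```

Because CKA is invariant to isotropic scaling and orthogonal rotation of the features, it can compare layers of different widths, which is what makes the backbone-vs-head block structure visible.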
Model F-Synthetic (frozen backbone) and model
U-Synthetic (unfrozen backbone), both further trained
on synthetic data, achieved comparable mAP on both
the BDD and GTAV datasets. Likewise, the CKA analy-
sis showed no particular difference between the frozen
and unfrozen backbone.
Neither the unfrozen model U-Synthetic nor the
frozen model F-Synthetic differed in average similar-
ity to the unfrozen model U-Real. Thus, freezing the
backbone had no overall impact according to CKA
similarity.
The largest difference between models U-Synthetic
and F-Synthetic according to CKA was in the head
part. Hence, the two models were more similar to each
other in the backbone than in the head, even though
their backbones had different training settings.
With this similarity analysis, we aim to give in-
sights into how training on synthetic data affects each
layer and to provide a better understanding of the inner
workings of complex neural networks. Such under-
standing is a step towards using synthetic data effec-
tively and towards explainable and trustworthy models.
REFERENCES
Alain, G. and Bengio, Y. (2016). Understanding intermedi-
ate layers using linear classifier probes. In ICLR 2017
workshop.
Cabon, Y., Murray, N., and Humenberger, M. (2020). Vir-
tual KITTI 2. CoRR, abs/2001.10773.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele,
B. (2016). The cityscapes dataset for semantic urban
scene understanding. In Proc. of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Dosovitskiy, A., Ros, G., Codevilla, F., López, A., and
Koltun, V. (2017). CARLA: An open urban driving
simulator. In Proceedings of the 1st Annual Confer-
ence on Robot Learning, pages 1–16.
Fong, R. C. and Vedaldi, A. (2017). Interpretable explana-
tions of black boxes by meaningful perturbation. In
The IEEE International Conference on Computer Vi-
sion (ICCV).
Gaidon, A., Wang, Q., Cabon, Y., and Vig, E. (2016). Vir-
tual worlds as proxy for multi-object tracking analy-
sis. In Proceedings of the IEEE conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
4340–4349.
Ge, Y., Xiao, Y., Xu, Z., Zheng, M., Karanam, S., Chen,
T., Itti, L., and Wu, Z. (2021). A peek into the reason-
ing of neural networks: Interpreting with structural vi-
sual concepts. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 2195–2204.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013). Vi-
sion meets robotics: The KITTI dataset. International
Journal of Robotics Research (IJRR).
Golub, G. H. and Reinsch, C. (1971). Singular Value De-
composition and Least Squares Solutions, pages 134–
151. Springer Berlin Heidelberg, Berlin, Heidelberg.
Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J. (2004).
Canonical correlation analysis: An overview with ap-
plication to learning methods. Neural Computation,
16(12):2639–2664.
Hattori, H., Lee, N., Boddeti, V. N., Beainy, F., Kitani,
K. M., and Kanade, T. (2018). Synthesizing a scene-
specific pedestrian detector and pose estimator for
static video surveillance. International Journal of
Computer Vision, 126:1027–1044.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Hermann, K. and Lampinen, A. (2020). What shapes fea-
ture representations? Exploring datasets, architec-
tures, and training. In Advances in Neural Information
Processing Systems, volume 33, pages 9995–10006.
Curran Associates, Inc.
Hinterstoisser, S., Lepetit, V., Wohlhart, P., and Konolige,
K. (2018). On pre-trained image features and syn-
thetic images for deep learning. In Proceedings of the
European Conference on Computer Vision (ECCV)
Workshops.
Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S. N.,
Rosaen, K., and Vasudevan, R. (2017). Driving in the
matrix: Can virtual worlds replace human-generated
annotations for real world tasks? In IEEE Interna-
tional Conference on Robotics and Automation, pages
1–8.
Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. E.
(2019). Similarity of neural network representations
revisited. In Proceedings of the 36th International
Conference on Machine Learning, ICML, volume 97
of Proceedings of Machine Learning Research, pages
3519–3529. PMLR.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2017). Focal loss for dense object detection. In
Proceedings of the IEEE International Conference on
Computer Vision (ICCV).
Liu, T. and Mildner, A. (2020). Training deep neu-
ral networks on synthetic data. http://lup.lub.lu.se/
student-papers/record/9030153. Master’s Thesis.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S. E.,
Fu, C., and Berg, A. C. (2016). SSD: Single shot
multibox detector. In Computer Vision – ECCV 2016,
pages 21–37. Springer International Publishing.
Morcos, A., Raghu, M., and Bengio, S. (2018). Insights
on representational similarity in neural networks with