Fusion of Different Features by Cross Cooperative Learning for
Semantic Segmentation
Ryota Ikedo and Kazuhiro Hotta
Meijo University, Japan
Keywords: Semantic Segmentation, Cooperative Learning, Multiple Backbones, Fusion of Different Features.
Abstract: Deep neural networks have achieved high accuracy in the field of image recognition, and the technology is
expected to be applied to medicine, autonomous driving, and other domains. Consequently, various deep learning
methods have been studied for many years. Many recent studies use a backbone network as an encoder for feature
extraction, and the extracted features naturally change when the backbone network is changed. This paper
focuses on the differences between the features extracted by two backbone networks. By combining them, it becomes
possible to obtain information that a single backbone network cannot capture, giving richer information for solving
a task. In addition, we use cross cooperative learning to fuse the features of the different backbone networks
effectively. In experiments on two image segmentation datasets, our proposed method achieved
better segmentation accuracy than a conventional method using a single backbone network and than the ensemble
of networks.
1 INTRODUCTION
Convolutional Neural Networks (Krizhevsky, A.,
2012) have achieved high accuracy on various kinds of
image recognition problems such as image
classification (Szegedy, C., 2015)(Wang, F., 2017),
object detection (Redmon, J., 2016)(Liu, W., 2016),
and pose estimation (Cao, Z., 2018). In addition,
semantic segmentation assigns a class label to every
pixel in an input image, recognizing various classes
at the pixel level. CNN-based semantic segmentation
is applied to cartography (Isola, P., 2017)(
Ronneberger, O., 2015), autonomous driving (Chen,
L.C., 2018)(Yang, M., 2018), and medicine and cell
biology (Ji, X., 2015)(Havaei, M., 2017). In
autonomous driving in particular, various classes
such as people, cars, and signs must be predicted
instantly from in-vehicle images, so semantic
segmentation is an important technology for realizing
autonomous driving. In this paper, we work on a
semantic segmentation task for autonomous driving.
In previous work, we proposed a cooperative
learning method (Ikedo, R., 2021). Just as neural
networks are inspired by the structure of the human
brain, cooperative learning is inspired by the group
learning of humans, applied to neural networks.
The basic cooperative structure is shown in Figure 1.
Figure 1: The structure of the one-way cooperative network.
In Figure 1, we prepare two CNNs with the same
structure and introduce paths between the two
networks for sending feature maps. With this
structure, the bottom CNN can obtain new feature maps
from the top network. However, the previous study
used exactly the same CNN structure for both networks;
in other words, previous cooperative learning amounts
to consulting the same person. The problem is that
the bottom CNN cannot obtain genuinely new information
from the top CNN. Therefore, we propose to exchange
fundamentally different feature maps in cooperative
learning. We use two kinds of backbone networks to
extract different features, and we use cross
cooperative learning to effectively fuse the features
obtained from the different backbone networks. By
sending the features to each other, each network
acquires richer features and the accuracy improves.
Figure 2: The overview of our cross cooperative network method.
We conducted experiments on two well-known
datasets. The first is PASCAL VOC 2012
(Everingham, M., 2010). The second is
Cityscapes (Cordts, M., 2016), which consists of
images captured by an in-vehicle camera. The
proposed method achieved higher accuracy than a
single network, the previous cooperative network,
and the ensemble of networks.
This paper is organized as follows. In Section 2, we
describe related works. The details of the proposed
method are explained in Section 3. In Section 4, we
evaluate our proposed cross cooperative learning on
segmentation tasks. Finally, we present conclusions
in Section 5.
Figure 3: The structure of DeepLabV3+.
2 RELATED WORKS
State-of-the-art approaches for semantic
segmentation are based on CNNs. Well-known
approaches build on the Fully Convolutional Network
(FCN), such as SegNet (Badrinarayanan, V., 2017)
and U-Net (Ronneberger, O., 2015). These models have
the simple structure of an FCN, and sharp accuracy
improvements have been achieved by new
architectures in recent years. One problem in
semantic segmentation is that a CNN loses spatial
information by reducing resolution during feature
extraction. Dilated convolution was proposed
to solve this problem: it extracts features while
preserving spatial information by expanding the
receptive field sparsely without reducing resolution.
In other works, PSPNet (Zhao, H., 2017) and
DeepLab (Chen, L.C., 2018) proposed the pyramid
pooling and ASPP modules, respectively, which
aggregate feature information at multiple scales.
These works obtain multi-scale contextual
information and achieved high accuracy.
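To make this concrete, the following is a minimal PyTorch sketch of parallel dilated convolutions in an ASPP-style module (PyTorch is the library used in Section 4.2). The dilation rates and channel widths are illustrative assumptions, not the exact configuration of PSPNet or DeepLab.

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Parallel dilated (atrous) convolutions, concatenated and projected.

    A simplified sketch of multi-scale context aggregation; the rates
    and channel widths are assumptions for illustration only.
    """
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        # padding = dilation keeps the spatial resolution unchanged.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3,
                      padding=r, dilation=r, bias=False)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch has a different receptive field but the same resolution.
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

x = torch.randn(1, 256, 33, 33)
print(MiniASPP()(x).shape)  # torch.Size([1, 256, 33, 33])
```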
Figure 4: Cross cooperative connection in our method.
Recent semantic segmentation methods use deep and
large backbone networks such as ResNet
(He, K., 2016), VGG (Simonyan, K., 2014), and
Xception (Chollet, F., 2017), and achieve high
accuracy with them. For example, PSPNet used
ResNet-101 and showed high accuracy on an
in-vehicle dataset, and DeepLabV3+ used ResNet-101
or Xception for feature extraction in its encoder.
On the other hand, when inference time must be
reduced, light backbone networks such as
MobileNet (Sandler, M., 2018) and
EfficientNet (Tan, M., 2019) should be used. As
described above, many works choose a backbone
network that suits their purpose.
The extracted features generally differ when the
backbone network is changed: depending on the
architecture, some information is easy to extract
and some is difficult. It is therefore important to
select the backbone architecture. In this paper, we
focus on the varied information obtained from
different backbone networks, and we improve the
segmentation accuracy by effectively fusing the
features of different backbones.
3 PROPOSED METHOD
3.1 Overview
Conventional cooperative learning (Ikedo, R., 2021)
used two completely identical networks, as in Figure 1.
In other words, this structure corresponds to learning
by two identical persons. Thus, the networks hold only
similar information even if feature maps are shared
between the two CNNs. To overcome this problem, our
new cooperative learning uses the information of other
"persons" rather than the same person: the aim is a
cooperative setup in which the two networks differ
from each other like two different people. We
therefore use different backbones in the cooperative
network so that each network obtains different
information, and we integrate the different features
from the backbone networks by cooperative learning to
solve the segmentation task.
Different backbone networks extract different
features, but each backbone has information that is
easy to extract and information that is hard to
extract. If the two different backbones supplement
each other's information through cooperative
learning, we can overcome this weakness.
In addition, by using the varied features from two
backbones, our method can exploit features that a
single backbone network cannot extract. For these
reasons, we expect our method to improve the
segmentation accuracy.
We explain the details of the networks in Section 3.2
and the connection method between the two backbone
networks in Section 3.3.
3.2 Details of Network
Our proposed method is built on DeepLabV3+. We use
ResNet and Xception as backbones because ResNet is
used as the backbone network in many tasks, and
Xception has higher feature extraction ability than
ResNet by separating channel-wise and spatial
convolutions. The overview of the proposed method is
shown in Figure 2; backbone1 is Xception-65 and
backbone2 is ResNet-101. Both backbone networks are
pretrained on ImageNet.
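As a sketch, the two ImageNet-pretrained encoders could be obtained as follows in PyTorch; torchvision provides ResNet-101, while Xception-65 is not in torchvision, so the commented timm call is an assumption rather than the authors' setup.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# ResNet-101 pretrained on ImageNet, truncated to a feature extractor.
resnet = models.resnet101(pretrained=True)
resnet_encoder = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc

# Xception-65 is not in torchvision; one option (an assumption, not the
# authors' code) is the timm library:
# import timm
# xception_encoder = timm.create_model("xception65", pretrained=True,
#                                      features_only=True)

x = torch.randn(1, 3, 513, 513)
print(resnet_encoder(x).shape)  # torch.Size([1, 2048, 17, 17])
```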
Next, we explain the structure of our cooperative
learning in Figure 2. The previous cooperative
connection was only a one-way path from the top
network to the bottom network (Ikedo, R., 2021). Our
method uses cross connections, which send feature
maps in both directions; we introduce them because we
want to fuse the different information held by the top
and bottom networks. We use this connection at all
layers of the decoders and the ASPP modules of
DeepLabV3+. With these connections, the top and
bottom networks are expected to obtain information
that a single network cannot have, so the cross
connection is more effective than the conventional
one-way cooperative network. Finally, we obtain one
output from each CNN and compute a loss for each, and
the two losses are used to train the networks
simultaneously for cooperative learning. We use
softmax cross entropy (CE) as the loss function:

Loss = Loss1 + Loss2 (1)

where Loss1 is the CE loss for the top CNN and Loss2
is that for the bottom CNN. Both losses are optimized
simultaneously.
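A minimal sketch of this joint optimization in PyTorch follows; `model` and `train_loader` are placeholders, and the momentum value is our assumption (the paper specifies only SGD, the batch size, and the learning rates).

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()  # softmax cross entropy over classes
optimizer = optim.SGD(model.parameters(), lr=0.007, momentum=0.9)

for images, labels in train_loader:      # placeholder data loader
    out_top, out_bottom = model(images)  # one output per cooperative branch
    # Eq. (1): the two CE losses are summed and optimized simultaneously.
    loss = criterion(out_top, labels) + criterion(out_bottom, labels)
    optimizer.zero_grad()
    loss.backward()   # gradients also flow through the cross connections
    optimizer.step()
```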
3.3 Connection Method
The structure of DeepLabV3+ is shown in Figure 3.
This model has a backbone and ASPP as its encoder and
uses a decoder to predict the segmentation result. We
introduce cross cooperative connections at each layer
of the ASPP module and the decoder to effectively use
the features from the different backbones.
Table 1: Accuracy on the PASCAL VOC2012 dataset.
Figure 5: Ensemble output and ensemble method.
Connection 1 (red line) and Connection 2 (blue
line) in Figure 3 are the outputs of the backbone
network and the ASPP module, respectively. These
connections carry the original information extracted
by the encoder. Thus, if we also introduce cross
cooperative connections after Connections 1 and 2, as
shown in Figure 4, we can improve the accuracy further.
Table 2: Comparison with other models on the
PASCAL VOC2012 val set.
Therefore, we also add cross cooperative
connections to the outputs of Connections 1 and 2.
The cross cooperative connection after Connection 1
feeds the information from the two different backbones
into each ASPP module, and the cross cooperative
connection after Connection 2 feeds the information
enhanced by each ASPP into each decoder. This
structure is expected to provide more useful
information for learning.
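The paper does not spell out the fusion operator inside a cross cooperative connection, so the sketch below assumes channel concatenation followed by a 1 x 1 convolution that projects back to each branch's width; other choices (e.g., addition) would fit the same structure.

```python
import torch
import torch.nn as nn

class CrossConnection(nn.Module):
    """Exchange feature maps between the top and bottom branches (a sketch)."""
    def __init__(self, ch_top, ch_bottom):
        super().__init__()
        self.to_top = nn.Conv2d(ch_top + ch_bottom, ch_top, kernel_size=1)
        self.to_bottom = nn.Conv2d(ch_top + ch_bottom, ch_bottom, kernel_size=1)

    def forward(self, f_top, f_bottom):
        # Both branches see the concatenation of the two feature maps,
        # then project it back to their own channel width.
        mixed = torch.cat([f_top, f_bottom], dim=1)
        return self.to_top(mixed), self.to_bottom(mixed)

cc = CrossConnection(256, 256)
a, b = cc(torch.randn(1, 256, 33, 33), torch.randn(1, 256, 33, 33))
print(a.shape, b.shape)  # both torch.Size([1, 256, 33, 33])
```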
In the experiments, we also evaluate the proposed
method without Connections 1 and 2 to investigate
their effectiveness.
4 EXPERIMENTS
In this section, we present the experimental results.
Section 4.1 describes the details of the datasets,
Section 4.2 explains the implementation details,
Section 4.3 shows the results on the PASCAL VOC
dataset, and Section 4.4 presents the results on the
Cityscapes dataset. Finally, Section 4.5 shows the
comparison results for the connections.
4.1 Datasets
In this paper, we evaluate the proposed method on
the PASCAL VOC2012 and Cityscapes datasets, which
are described below.
4.1.1 PASCAL VOC2012
This dataset contains various kinds of images:
10,582 in the training set, 1,449 in the validation
set, and 1,456 in the test set, covering 20 foreground
object classes and one background class. We use the
validation set to select the best model and evaluate
the selected model on the test set. During training
we randomly crop 513 × 513 pixel regions from the
training images; in the validation and test phases we
crop a center region.
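As a sketch of this preprocessing (the crop size is from the paper; padding of small images is our assumption, and for segmentation the identical crop must also be applied to the label map):

```python
import torchvision.transforms as T

# Training: random 513 x 513 crops; pad_if_needed handles smaller images.
train_crop = T.RandomCrop(513, pad_if_needed=True)
# Validation / test: deterministic center crop of the same size.
eval_crop = T.CenterCrop(513)
```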
Figure 6: Segmentation results on the PASCAL VOC 2012 dataset (val).
4.1.2 Cityscapes
This dataset contains images captured by an in-
vehicle camera in Germany. All images are 2048 ×
1024 pixels, with high-quality annotations of 19
class labels for each pixel. There are 2,975 images
in the training set and 500 images in the validation
set. In this paper, we randomly crop 768 × 768 pixel
regions and use them for training.
4.2 Implementation Details
We implement our method with the PyTorch library,
building the cross cooperative learning on
DeepLabV3+. For a fair comparison, we evaluated
DeepLabV3+ and the proposed method under the same
conditions on the same PC, using single DeepLabV3+
models (ResNet-101 and Xception-65) and an ensemble
model (Figure 5) as baselines. In our method, the
batch size was fixed to 6 and SGD was used as the
optimizer; the learning rate was set to 0.007 for
PASCAL VOC and 0.035 for Cityscapes. We use
intersection over union (IoU) and mean IoU (mIoU)
as evaluation measures.
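For reference, below is a minimal sketch of how per-class IoU and mIoU can be computed from a confusion matrix; the function name and matrix convention are ours, not part of the paper.

```python
import numpy as np

def mean_iou(conf, eps=1e-9):
    """Per-class IoU and mIoU from a (C, C) confusion matrix.

    conf[i, j] counts pixels of ground-truth class i predicted as class j.
    """
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp  # predicted as the class, but wrong
    fn = conf.sum(axis=1) - tp  # ground truth is the class, but missed
    iou = tp / (tp + fp + fn + eps)
    return iou, iou.mean()

conf = np.array([[50, 2],
                 [3, 45]])
iou, miou = mean_iou(conf)
print(iou, miou)  # [0.909... 0.9] 0.9045...
```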
4.3 Evaluation Result on
PASCAL VOC2012 Dataset
We evaluated accuracy on the validation set of the
PASCAL VOC dataset. We compared four methods:
a single network, the conventional cooperative
learning, our proposed method, and our method
without any cross cooperative paths between the two
networks. Since our method has two outputs, one from
the top network and one from the bottom network,
Table 1 shows the results for each output.
The red number in Table 1 indicates the best
accuracy. Our proposed method achieved the highest
accuracy in Table 1, improving by more than 1.4% over
a single network (DeepLabV3+). This result shows the
effectiveness of the proposed method, which fuses the
features extracted from different backbones.
In conventional one-way cooperative learning
(Ikedo, R., 2021), the feature maps of the top network
are sent only to the bottom network. This causes an
accuracy gap between the top and bottom networks,
because the top network receives no additional
feature maps. Here we introduce the cross connection
to overcome this problem: the top network can also
obtain information from the bottom network, so both
networks achieve high accuracy. The proposed method
thus overcomes the weakness of the conventional
method and improves the accuracy. Next, we examine
the effect of the cooperative connection by comparing
with the ensemble of two networks shown in Figure 5,
where the final output is simply the sum of the
outputs of the two networks. Table 1 shows that our
proposed method is 1.0% higher than the ensemble,
indicating that it is more effective than a standard
ensemble of two networks. Thus, our proposed cross
connection is useful for fusing the feature maps of
different backbone networks.
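For clarity, here is a sketch of the ensemble baseline of Figure 5, which simply sums the two output maps before the per-pixel argmax; the model names are placeholders.

```python
import torch

@torch.no_grad()
def ensemble_predict(model_a, model_b, images):
    # Sum the two networks' logits, then take the per-pixel argmax.
    logits = model_a(images) + model_b(images)
    return logits.argmax(dim=1)  # (N, H, W) class-index map
```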
Figure 7: Segmentation results on the Cityscapes dataset (val).
Table 3: Accuracy on the Cityscapes dataset using
DeepLabV3+.
Table 4: Comparison with other models on the Cityscapes
val set.
Table 2 compares our method with other
segmentation models. Conventional segmentation
models have only one backbone network, and our
method with two backbone networks achieved higher
accuracy than those methods. These results
demonstrate the effectiveness of the proposed cross
cooperative learning method.
Figure 6 shows the segmentation results. In the
first row, the red area is recognized correctly by a
single DeepLabV3+ with Xception but not by the one
with ResNet-101. On the other hand, the blue area is
recognized correctly with ResNet-101, while the
Xception-based DeepLabV3+ cannot recognize it well.
Many segmentation results show the improvement of
our method, though we found a few cases with a
negative effect. The qualitative results thus also
demonstrate that our proposed method incorporates
the two feature maps effectively.
4.4 Evaluation Result on Cityscapes
Dataset
We also evaluated the proposed method on the
Cityscapes dataset. For a fair comparison under the
same conditions, the single DeepLabV3+ network was
evaluated with our own implementation.
Comparison results with DeepLabV3+ are shown
in Table 3. Our method, which uses DeepLabV3+ as
its baseline, improved mIoU by over 2% compared
with the single DeepLabV3+ using either backbone.
Table 4 compares our method with the other
segmentation models and shows that it achieves
higher accuracy than the other models, which use
ResNet-101 or dilated ResNet as the backbone. These
results show that our feature fusion method is more
effective.
Figure 7 shows the segmentation results on the
Cityscapes dataset. As on the PASCAL VOC 2012
dataset, the proposed method incorporates the
advantages of each backbone network, confirming
that our method is also useful for improving the
accuracy on another dataset.
4.5 Ablation Study
As introduced in Section 3.3, the proposed cross
cooperative learning contains two additional cross
cooperative connections, at Connections 1 and 2.
We therefore study their contributions on the
PASCAL VOC2012 dataset. As shown in Table 5,
the accuracy without the additional connections is
80.36%. The cross cooperative connection at
Connection 1 gains 0.16%, and adding the cross
connection at Connection 2 boosts accuracy by 1.15%
in comparison with the proposed method without the
additional connections. In particular, ASPP improves
the feature extraction ability through several
dilated convolutions and pooling operations, yielding
beneficial feature maps. Hence the cross connection
at Connection 1 benefits the ASPP modules, while the
cross connection at Connection 2 benefits the
decoding of the extracted information. These results
demonstrate the effectiveness of the additional cross
cooperative connections.
5 CONCLUSION
In this paper, we proposed a new cooperative learning
method for semantic segmentation that fuses the
features of different backbone networks. In
particular, we used cross cooperative learning with
two different backbones, improving on conventional
cooperative learning. We confirmed that our method
improves segmentation accuracy on the PASCAL
VOC2012 dataset and the Cityscapes dataset.
The proposed cross cooperative network requires
substantial computational resources because it needs
multiple backbone networks. Therefore, we would
like to realize cross cooperative learning with lower
computational cost while keeping high accuracy. This
is a subject for future work.
ACKNOWLEDGEMENTS
This work was partially supported by JSPS KAKENHI
Grant Number 18K11382.
REFERENCES
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet
classification with deep convolutional neural
networks. In: Advances in Neural Information
Processing Systems. pp. 1097–1105 (2012)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich,
A.: Going deeper with convolutions. In: Proceedings of
the IEEE conference on Computer Vision and Pattern
Recognition. pp. 1–9 (2015)
Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H.,
Wang, X., Tang, X.: Residual attention network for
image classification. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern
Recognition. pp. 3156–3164 (2017)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only
look once: Unified, real-time object detection. In:
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 779–788 (2016)
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.Y., Berg, A.C.: SSD: Single shot multibox
detector. In: Proceedings of the European Conference
on Computer Vision. pp. 21–37. Springer (2016)
Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.:
OpenPose: Realtime multi-person 2D pose estimation
using part affinity fields. arXiv preprint
arXiv:1812.08008 (2018)
Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-
person 2d pose estimation using part affinity fields. In:
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 7291–7299 (2017)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image
translation with conditional adversarial networks. In:
Proceedings of the IEEE conference on Computer
Vision and Pattern Recognition. pp. 1125–1134 (2017)
Ronneberger, O., Fischer, P., Brox, T.: U-net:
Convolutional networks for biomedical image
segmentation. In: International Conference on Medical
Image Computing and Computer-Assisted
Intervention. pp. 234–241. Springer (2015)
Chen, L.C., Collins, M., Zhu, Y., Papandreou, G., Zoph, B.,
Schroff, F., Adam, H., Shlens, J.: Searching for efficient
multi-scale architectures for dense image prediction. In:
Advances in Neural Information Processing Systems.
pp. 8699–8710 (2018)
Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: DenseASPP
for semantic segmentation in street scenes. In:
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 3684–3692 (2018)
Havaei, M., Davy, A., Warde-Farley, D., Biard, A.,
Courville, A., Bengio, Y., Pal, C., Jodoin, P.M.,
Larochelle, H.: Brain tumor segmentation with deep
neural networks. Medical image analysis 35, 18–31
(2017)
Ji, X., Li, Y., Cheng, J., Yu, Y., Wang, M.: Cell image
segmentation based on an improved watershed
algorithm. In: 2015 8th International Congress on
Image and Signal Processing. pp. 433–437. (2015)
Ikedo, R., Hotta, K.: Feature sharing cooperative
network for semantic segmentation. In: Proceedings
of the 16th International Joint Conference on Computer
Vision, Imaging and Computer Graphics Theory and
Applications. pp. 577–584 (2021)
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The
cityscapes dataset for semantic urban scene
understanding. In: Proceedings of the IEEE conference
on Computer Vision and Pattern Recognition. pp.
3213–3223 (2016)
VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications
588
Everingham, M., Van Gool, L., Williams, C.K., Winn, J.,
Zisserman, A.: The pascal visual object classes (voc)
challenge. International journal of computer vision
88(2), 303–338 (2010)
Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A
deep convolutional encoder-decoder architecture for
image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence 39(12), 2481–2495
(2017)
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene
parsing network. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern
Recognition. pp. 2881–2890 (2017)
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K.,
Yuille, A.L.: Deeplab: Semantic image segmentation
with deep convolutional nets, atrous convolution, and
fully connected crfs. IEEE Transactions on Pattern
Analysis and Machine Intelligence 40(4), 834–848
(2017)
Chen, L.C., Papandreou, G., Schroff, F., Adam, H.:
Rethinking atrous convolution for semantic image
segmentation. arXiv preprint arXiv:1706.05587 (2017)
Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.:
Encoder-decoder with atrous separable convolution for
semantic image segmentation. In: Proceedings of the
European Conference on Computer Vision. pp. 801–
818 (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning
for image recognition. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern
Recognition. pp. 770–778 (2016)
Simonyan, K., Zisserman, A.: Very deep convolutional
networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556 (2014)
Chollet, F.: Xception: Deep learning with depthwise
separable convolutions. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern
Recognition. pp. 1251–1258 (2017)
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen,
L.C.: MobileNetV2: Inverted residuals and linear
bottlenecks. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. pp. 4510–4520
(2018)
Tan, M., Le, Q.V.: EfficientNet: Rethinking model scaling
for convolutional neural networks. In: Proceedings of
the International Conference on Machine Learning. pp.
6105–6114 (2019)
Artacho, B., Savakis, A.: Waterfall atrous spatial pooling
architecture for efficient semantic segmentation.
Sensors 19(24), 5361 (2019)
Takikawa, T., Acuna, D., Jampani, V., Fidler, S.: Gated-
SCNN: Gated shape CNNs for semantic segmentation. In:
Proceedings of the IEEE International Conference on
Computer Vision. pp. 5229–5238 (2019)
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.:
UNet++: A nested U-Net architecture for medical image
segmentation. In: Deep Learning in Medical Image
Analysis and Multimodal Learning for Clinical Decision
Support. pp. 3–11 (2018)
Bai, S., Koltun, V., Kolter, J.Z.: Multiscale deep
equilibrium models. In: Advances in Neural Information
Processing Systems 33 (2020)
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.:
Learning a discriminative feature network for semantic
segmentation. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. pp. 1857–
1866 (2018)
Nirkin, Y., Wolf, L., Hassner, T.: HyperSeg: Patch-wise
hypernetwork for real-time semantic segmentation. In:
Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 4061–
4070 (2021)
Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel
matters -- improve semantic segmentation by global
convolutional network. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition.
pp. 4353–4361 (2017)