distinguish the training set and the test set. In the experiments, the model proposed in this paper is compared against three single-modal baselines: ResNet, ConvNet, and CNN-Transformer.
 
Figure 4: The average test accuracy of emotion detection. 
The average test results of these methods on the annotated dataset are shown in Figure 4. As can be seen from the figure, the accuracy of ConvNet and ResNet on the Sad and Angry classes is relatively low, because these two emotions are hard to tell apart. CNN-Transformer has a clear advantage over the former two on the Happy class, but it still confuses Sad and Angry. In contrast, the accuracy of the multi-modal transformer structure proposed in this paper on Sad and Angry is slightly improved. We attribute this to the fusion of multi-modal information, the transformer's ability to map both modalities into a shared space in which their similarity can be computed, and the abstraction ability of the semantic topology, which together further improve the discrimination among the four emotion classes.
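To make this fusion mechanism concrete, the following is a minimal sketch of cross-modal attention: features from each modality are projected into a shared space, and scaled dot-product attention computes similarities between them before classification. It illustrates the general technique only; the layer names, dimensions, and four-class head are assumptions for the example, not the exact architecture of this paper.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal sketch: project two modalities into a shared space and
    fuse them with scaled dot-product cross-attention. All dimensions
    and layer names are illustrative assumptions."""

    def __init__(self, audio_dim=128, visual_dim=512, shared_dim=256, num_classes=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, shared_dim)    # map audio tokens to the shared space
        self.visual_proj = nn.Linear(visual_dim, shared_dim)  # map visual tokens to the shared space
        self.attn = nn.MultiheadAttention(shared_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(shared_dim, num_classes)  # e.g. Happy / Sad / Angry / Neutral

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_a, audio_dim); visual_feats: (batch, T_v, visual_dim)
        q = self.audio_proj(audio_feats)
        kv = self.visual_proj(visual_feats)
        # Audio queries attend over visual keys/values: the scaled dot
        # product is the cross-modal similarity computed in the shared space.
        fused, _ = self.attn(q, kv, kv)
        # Pool over time and classify the fused representation.
        return self.classifier(fused.mean(dim=1))

# Usage with random features standing in for real encoder outputs.
model = CrossModalFusion()
logits = model(torch.randn(2, 10, 128), torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 4])
```

Because both projections share a target dimension, the dot product inside the attention layer is exactly the similarity between modalities computed in the shared space.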
5  CONCLUSION 
The field of deep learning on computer audio and video is developing rapidly, and combining linguistic and visual information for emotion detection and recognition is a new research hotspot in artificial intelligence. Beyond detecting and recognizing human emotions, emotion recognition and management for pets has begun to receive attention: by collecting and processing pet vocalizations and the corresponding facial expressions, it is possible to understand a pet's current condition and respond to it appropriately. This paper proposes a pre-trained multi-modal transformer emotion detection system. The model is first pre-trained on a human emotion detection dataset containing speech and facial expression data, and the labelled animal speech and expression data are then treated as a small-sample task. Because pre-training relies on a large unlabelled corpus, the model parameters can be trained adequately without overfitting, and the resulting representations are reused for the few-shot task. Experimental results on video datasets show that the proposed multi-modal transformer structure achieves good accuracy compared with the other algorithms.
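As a rough illustration of this two-stage pipeline, the sketch below first trains a generic encoder on the large human dataset and then fits only a lightweight classification head on the small labelled animal set, which is one standard way to train parameters adequately while preventing overfitting. The `encoder`, its `out_dim` attribute, and the data loaders are hypothetical placeholders; for brevity, pre-training is shown as supervised, whereas pre-training on an unlabelled corpus would substitute a self-supervised objective.

```python
import torch
import torch.nn as nn

def pretrain(encoder, human_loader, epochs=10, lr=1e-4):
    """Stage 1 (sketch): train the encoder end-to-end on the large
    human emotion dataset (speech + facial expression features).
    The temporary head is discarded after pre-training."""
    head = nn.Linear(encoder.out_dim, 4)  # four emotion classes (assumed)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in human_loader:
            loss = loss_fn(head(encoder(feats)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder

def few_shot_finetune(encoder, animal_loader, epochs=20, lr=1e-3):
    """Stage 2 (sketch): freeze the pre-trained encoder and fit only a
    small head on the few labelled animal samples, limiting the number
    of trainable parameters and thereby guarding against overfitting."""
    for p in encoder.parameters():
        p.requires_grad = False  # keep the pre-trained representations fixed
    head = nn.Linear(encoder.out_dim, 4)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in animal_loader:
            with torch.no_grad():
                reps = encoder(feats)  # reuse representations for the few-shot task
            loss = loss_fn(head(reps), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```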