possible to easily visualize which positions each position of the sequence can attend to.
After going through this process, the result enters the Transformer encoder as input. Unlike the original vanilla Transformer, this encoder is designed to facilitate learning even in deep layers: the encoder block is repeated n times, with Layer Norm applied before the Attention and the MLP. As components of the Transformer encoder, it can be seen from Figure 2 that the Multi-head Attention mechanism, the MLP block, Layer Norm (LN) before all blocks, and residual connections at the ends of all blocks are applied. Each block of the encoder applies LN followed by Multi-head Attention, then LN followed by the MLP, and adds the respective residual connections. Here, the MLP is composed of FCN-GELU-FCN.
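As a reference, the following is a minimal sketch of one such pre-LN encoder block in PyTorch; the dimension defaults and the use of torch.nn.MultiheadAttention are assumptions, not the paper's exact implementation.

    # Minimal sketch of a pre-LN Transformer encoder block:
    # LN -> Multi-head Attention -> residual, then LN -> MLP -> residual.
    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        def __init__(self, dim=768, num_heads=12, hidden_dim=3072):  # assumed sizes
            super().__init__()
            self.ln1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(dim)
            # MLP composed of FCN-GELU-FCN, as stated in the text.
            self.mlp = nn.Sequential(
                nn.Linear(dim, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, dim),
            )

        def forward(self, x):
            # LN before Attention, with a residual connection around it.
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            # LN before the MLP, again with a residual connection.
            return x + self.mlp(self.ln2(x))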
๏ Multi-head Attention: This function requires a Query, Key, and Value for each head. Since the embedded input is a tensor of size [batch size, patch length + 1, embedding dimension], a rearrangement is required to distribute the embedding dimension across the heads, yielding a tensor of the form [batch size, number of heads, patch length + 1, embedding dimension / number of heads]. To calculate Attention, the Query and Key tensors are multiplied to produce the Attention map, and the Attention map is then multiplied by the Value tensor. Finally, the Multi-head Attention outputs are flattened and concatenated so they can be projected into a new feature dimension, again creating a tensor of the form [batch size, sequence length, embedding dimension].
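A minimal sketch of this rearrangement and attention computation, with hypothetical weight matrices w_q, w_k, w_v, and w_o of size [D, D]:

    import torch
    import torch.nn.functional as F

    def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
        B, N, D = x.shape                       # [batch, patch length + 1, embed dim]
        d_head = D // num_heads
        # Distribute the embedding dimension across heads: [B, N, D] -> [B, H, N, D/H].
        def split_heads(t):
            return t.view(B, N, num_heads, d_head).transpose(1, 2)
        q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
        # Attention map from the scaled Query-Key product, softmax over keys.
        attn = F.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
        # Multiply the Attention map by the Value tensor.
        out = attn @ v                          # [B, H, N, D/H]
        # Concatenate the heads and project back to [B, N, D].
        out = out.transpose(1, 2).reshape(B, N, D)
        return out @ w_o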
๏ MLP: It simply expands the features into a hidden dimension and projects them back. Here, the Gaussian Error Linear Unit (GELU) is used as the activation function, which is characterized by faster convergence than other activation functions.
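For reference, GELU weights its input by the standard normal CDF, GELU(x) = x * Phi(x); a minimal sketch of the exact form:

    import math

    def gelu(x):
        # x * Phi(x), where Phi is the standard normal CDF.
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))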
๏ LayerNorm (LN): LayerNorm computes its statistics only over C (Channel), H (Height), and W (Width), so the mean and standard deviation are obtained regardless of N (Batch). It normalizes the features of each instance at once, across all channels.
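A minimal sketch of this per-instance normalization on a hypothetical [N, C, H, W] tensor:

    import torch

    x = torch.randn(8, 3, 4, 4)                      # hypothetical [N, C, H, W]
    # One mean/std per instance, computed over (C, H, W), independent of N.
    mean = x.mean(dim=(1, 2, 3), keepdim=True)
    std = x.std(dim=(1, 2, 3), keepdim=True, unbiased=False)
    y = (x - mean) / (std + 1e-5)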
Where Attention techniques have been used in the conventional imaging field, they have replaced certain components of a CNN while maintaining the overall CNN structure. ViT, however, does not rely on a CNN as its mechanism for viewing the entire data and determining where to attend. In addition, the Transformer, which uses image patch sequences as its input, has shown higher performance than conventional CNN-based models. Since the original Transformer structure is used almost as-is, it was verified to have good scalability and excellent performance for large-scale learning.
4  ESTIMATE OBJECT DEPTH 
MAP OF INTEREST 
All input images are images extracted through the proposed architecture. These images are then fed as input to the Transformer encoder. The image is divided into non-overlapping 16 x 16 patches, which are converted into tokens by a linear projection of the flattened patches. These tokens, together with an independent readout token (DRT, the dotted block in Figure 4), are passed to the Transformer encoder and go through several stages, as shown in Figure 3. First, Reconstitution modules reconstruct the representations of the images, and Fusion modules gradually fuse and resample these representations to make more detailed predictions. The input/output of the blocks in each module proceeds sequentially from left to right.
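A minimal sketch of this tokenization step; the patch size of 16 comes from the text, while the embedding dimension and the variable names are assumptions:

    import torch
    import torch.nn as nn

    patch, dim = 16, 768
    proj = nn.Linear(patch * patch * 3, dim)         # linear projection of flattened patches
    drt = nn.Parameter(torch.zeros(1, 1, dim))       # independent readout token (DRT)

    def tokenize(img):                               # img: [B, 3, H, W], H and W divisible by 16
        B, C, H, W = img.shape
        # Split into non-overlapping 16 x 16 patches and flatten each patch.
        patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
        tokens = proj(patches)                       # [B, (H/16)*(W/16), dim]
        # Prepend the DRT so patch length + 1 tokens enter the encoder.
        return torch.cat([drt.expand(B, -1, -1), tokens], dim=1)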
๏ Read block: The input of patch length + 1 tokens is mapped to a representation of patch length tokens; the DRT is read into this representation by learning the mapping. Adding the DRT to the patch embeddings helps the Read block perform more accurately and flexibly.
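One plausible form of this mapping, sketched below, fuses the DRT into every patch token by concatenation and a learned projection; the exact mapping is not specified in the text, so this variant is an assumption:

    import torch
    import torch.nn as nn

    class ReadBlock(nn.Module):
        def __init__(self, dim=768):
            super().__init__()
            self.project = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

        def forward(self, tokens):                   # [B, N + 1, dim], DRT first
            drt, patches = tokens[:, :1], tokens[:, 1:]
            # Concatenate the DRT onto every patch token, then project back to dim.
            fused = torch.cat([patches, drt.expand_as(patches)], dim=-1)
            return self.project(fused)               # [B, N, dim]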
๏ Concatenate block: It consists of simply placing the tokens back in patch order. After passing through this block, the feature map is arranged spatially in the same way as the image.
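A minimal sketch of this reordering, assuming h and w are the patch-grid height and width derived from the input resolution:

    import torch

    def concatenate(tokens, h, w):                   # h, w = H // 16, W // 16
        # [B, N, dim] token sequence -> [B, dim, h, w] image-like feature map.
        B, N, D = tokens.shape
        return tokens.transpose(1, 2).reshape(B, D, h, w)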
  
๏ Resample block: A 1x1 convolution is applied to project the input representation into a 256-dimensional space. This convolutional block then applies a different resampling convolution depending on the ViT encoder layer, at which point the internal parameters change as well.
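A minimal sketch of such a block; the choice of strided versus transposed convolution per layer, and the scale values, are assumptions:

    import torch.nn as nn

    def resample_block(in_dim, scale):
        # 1x1 convolution projecting into a 256-dimensional space.
        project = nn.Conv2d(in_dim, 256, kernel_size=1)
        if scale > 1:                                # upsample (shallower layers)
            resample = nn.ConvTranspose2d(256, 256, kernel_size=scale, stride=scale)
        elif scale < 1:                              # downsample (deeper layers)
            s = int(1 / scale)
            resample = nn.Conv2d(256, 256, kernel_size=3, stride=s, padding=1)
        else:
            resample = nn.Identity()
        return nn.Sequential(project, resample)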
The proposed method is shown in Figure 3.