possible to easily visualize which positions each position of the sequence can focus on.
After going through this process, the result enters the Transformer encoder as input. Unlike the original vanilla Transformer, this Transformer encoder is designed to facilitate learning even in deep layers by applying Layer Norm (LN) before Attention and MLP, and the encoder block is repeated n times. As components of the Transformer encoder, it can be seen from Figure 2 that the Multi-head Attention mechanism, the MLP block, LN before all blocks, and Residual connections at the ends of all blocks are applied. Each block of the encoder applies LN and Multi-head Attention, then LN and MLP, and adds the respective residual connections. Here, the MLP is composed of FCN-GELU-FCN.
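The following is a minimal sketch of one such pre-LN encoder block; the class name, default sizes, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of one pre-LN Transformer encoder block as described above.
# Class name, default sizes, and nn.MultiheadAttention usage are assumptions.
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, hidden_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)                        # LN before Attention
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)                        # LN before MLP
        self.mlp = nn.Sequential(                           # FCN-GELU-FCN
            nn.Linear(dim, dim * hidden_ratio),
            nn.GELU(),
            nn.Linear(dim * hidden_ratio, dim),
        )

    def forward(self, x):                                   # x: [batch, patch length + 1, dim]
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around Attention
        x = x + self.mlp(self.ln2(x))                       # residual around MLP
        return x
```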
๏ Multi-head Attention: This function requires a Query, Key, and Value for each head. Since the embedded input is a tensor of size [batch size, patch length + 1, embedding dimension], a rearrangement is required to distribute the embedding dimension over the heads, producing a tensor of the form [number of heads, patch length + 1, embedding dimension / number of heads]. To calculate Attention, the Query and Key tensors are multiplied to obtain the Attention map, and the Attention map is multiplied by the Value tensor to obtain the result. Finally, the per-head outputs of Multi-head Attention are concatenated along one dimension and projected into a new feature dimension, again creating a tensor of the form [batch size, sequence length, embedding dimension] (see the sketch after this list).
๏ MLP: It simply expands the features to a hidden dimension and projects them back. Here, the Gaussian Error Linear Unit (GELU) is used as the activation function, which is characterized by faster convergence than other activation functions.
๏ LayerNorm (LN): LayerNorm computes its statistics only over C (Channel), H (Height), and W (Width), so the mean and standard deviation are obtained independently of N (Batch). The features of each instance are normalized at once across all channels (see the check after this list).
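To make the Multi-head Attention item above concrete, the sketch below spells out the head-wise rearrangement and the Query-Key and map-Value products; the explicit weight matrices, the usual 1/sqrt(d) scaling, and the function name are assumptions beyond the text.

```python
# Sketch of the head-wise rearrangement and attention products described in the
# Multi-head Attention item above; names and weight shapes are assumptions.
import torch

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    # x: [batch, patch length + 1, dim]; wq, wk, wv, wo: [dim, dim]
    b, n, d = x.shape
    head_dim = d // num_heads
    # distribute the embedding dimension over the heads:
    # [batch, n, dim] -> [batch, heads, n, dim / heads]
    q = (x @ wq).view(b, n, num_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(b, n, num_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, n, num_heads, head_dim).transpose(1, 2)
    attn_map = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    out = attn_map @ v                                      # Attention map x Value
    # concatenate the heads and project to [batch, sequence length, dim]
    return out.transpose(1, 2).reshape(b, n, d) @ wo
```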
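The LayerNorm behaviour described above can also be checked numerically; the tensor shape below is an illustrative assumption.

```python
# Numerical check of the LayerNorm item above: statistics are taken over
# (C, H, W) for each instance, never over the batch dimension N.
import torch

x = torch.randn(8, 3, 32, 32)                              # [N, C, H, W]
mean = x.mean(dim=(1, 2, 3), keepdim=True)                 # one mean per instance
var = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)   # one variance per instance
x_ln = (x - mean) / torch.sqrt(var + 1e-5)                 # normalization independent of N

# the built-in module normalizes over the same per-instance dimensions
ln = torch.nn.LayerNorm([3, 32, 32], elementwise_affine=False)
print(torch.allclose(x_ln, ln(x), atol=1e-5))              # expected: True
```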
When Attention techniques have been used in the conventional imaging field, they have typically replaced certain components of a CNN while maintaining the CNN structure as a whole. However, ViT does not rely on a CNN as the mechanism for viewing the entire data and determining where to attend. In addition, the Transformer, which uses image patch sequences as input, has shown higher performance than conventional CNN-based models. Since the original Transformer structure is used almost as it is, it has been verified to have good scalability and excellent performance for large-scale learning.
4 ESTIMATE OBJECT DEPTH
MAP OF INTEREST
All input images are extracted through the proposed architecture. These images are then entered as inputs to the Transformer encoder. The image is divided into non-overlapping 16 x 16 patches, which are converted into tokens by a linear projection of the flattened patches. These tokens, together with an independent read token (DRT, the dotted block in Figure 4), are passed to the Transformer encoder and go through several stages, as shown in Figure 3. First, Reconstitution modules reconstruct the representations of the images, and Fusion modules gradually fuse and resample these representations to make more detailed predictions. The input/output of the blocks within each module proceeds sequentially from left to right.
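A minimal sketch of this tokenization step is given below; the class name, the learnable-parameter form of the DRT, and the default dimensions are assumptions, not the paper's code.

```python
# Sketch of the tokenization above: split the image into non-overlapping 16x16
# patches, flatten each patch, linearly project it, and prepend the DRT.
# The learnable-parameter form of the DRT and all names are assumptions.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Linear(patch * patch * in_ch, dim)  # linear projection of the flattened patch
        self.drt = nn.Parameter(torch.zeros(1, 1, dim))    # independent read token (DRT)
        self.patch = patch

    def forward(self, img):                                # img: [batch, C, H, W]
        b, c, _, _ = img.shape
        p = self.patch
        patches = img.unfold(2, p, p).unfold(3, p, p)      # [b, c, H/p, W/p, p, p]
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.proj(patches)                        # [b, num patches, dim]
        return torch.cat([self.drt.expand(b, -1, -1), tokens], dim=1)
```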
๏ Read block: The input of patch length + 1 tokens is mapped to a representation of patch length tokens. Here, the input is read by adding the DRT to the patch representations, and this mapping is learned. Adding the DRT to the patch embeddings helps the Read block perform more accurately and flexibly (see the sketch after this list).
๏ Concatenate block: It simply concatenates the tokens in patch order. After passing through this block, the feature map is expressed with the same spatial layout as the image (see the sketch after this list).
๏ Resample block: A 1x1 convolution is applied to project the input representation into a 256-dimensional space. This convolutional block applies a different convolution depending on the ViT encoder layer, and its internal parameters change accordingly (see the sketch after this list).
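For the Read block, one possible realization of combining the DRT with the patch representations is sketched below; the additive fusion is an assumption, since the text does not fix the exact operation.

```python
# Sketch of a Read block: the DRT is added to every patch token, mapping
# patch length + 1 input tokens to patch length output tokens.
# The additive fusion is an assumption, not the paper's exact choice.
import torch

def read_block(tokens):
    # tokens: [batch, num patches + 1, dim]; token 0 is the DRT
    drt, patches = tokens[:, :1], tokens[:, 1:]
    return patches + drt                                   # [batch, num patches, dim]
```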
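The Concatenate block then places the tokens back in patch order so that the feature map takes the spatial layout of the image; the explicit grid arguments are assumptions.

```python
# Sketch of the Concatenate block: tokens are arranged in patch order into an
# image-like feature map. The grid-size arguments are assumptions.
import torch

def concatenate_block(tokens, grid_h, grid_w):
    # tokens: [batch, grid_h * grid_w, dim] -> [batch, dim, grid_h, grid_w]
    b, n, d = tokens.shape
    assert n == grid_h * grid_w
    return tokens.transpose(1, 2).reshape(b, d, grid_h, grid_w)
```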
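Finally, the Resample block can be sketched as a 1x1 projection to 256 dimensions followed by a layer-dependent convolution; the scale argument and the choice between a strided and a transposed convolution are assumptions used only to illustrate that the parameters differ per encoder layer.

```python
# Sketch of the Resample block: a 1x1 convolution projects to a 256-dimensional
# space, followed by a convolution whose parameters depend on the encoder layer.
# The scale argument and the up/down-sampling choice are assumptions.
import torch.nn as nn

def resample_block(in_dim, scale, out_dim=256):
    project = nn.Conv2d(in_dim, out_dim, kernel_size=1)    # 1x1 projection to 256-d
    if scale > 1:                                          # upsample for early layers
        resample = nn.ConvTranspose2d(out_dim, out_dim, kernel_size=scale, stride=scale)
    elif scale < 1:                                        # downsample for late layers
        resample = nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=round(1 / scale), padding=1)
    else:
        resample = nn.Identity()
    return nn.Sequential(project, resample)
```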
The proposed method is shown in Figure 3.