which could result in similar or even out-of-context descriptions. In this step, we filter out those captions.
Initially, we convert all captions to lower case, remove
punctuation and duplicated blank spaces, and apply a
tokenization method to the words. Then, to reduce the
amount of similar dense captions, we create a high-
dimensional vector representation for each sentence
using word embeddings and calculate the cosine sim-
ilarity between all sentence representations. Dense
captions with cosine similarity greater than the threshold T_text_sim are discarded.
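A minimal sketch of this similarity filter, assuming sentence vectors are obtained by averaging pre-trained word embeddings (the embedding source and the default threshold value are illustrative, not prescribed by this work):

```python
import re
import numpy as np

def preprocess(caption):
    """Lower-case, strip punctuation and duplicated blanks, then tokenize."""
    caption = re.sub(r"[^\w\s]", " ", caption.lower())
    return re.sub(r"\s+", " ", caption).strip().split()

def sentence_vector(tokens, word_vectors, dim=300):
    """High-dimensional sentence representation: average of word embeddings."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def filter_similar(captions, word_vectors, t_text_sim=0.9):
    """Keep a caption only if it is not too similar to any already kept one."""
    kept, kept_vecs = [], []
    for cap in captions:
        vec = sentence_vector(preprocess(cap), word_vectors)
        if all(cosine(vec, v) <= t_text_sim for v in kept_vecs):
            kept.append(cap)
            kept_vecs.append(vec)
    return kept
```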
Next, we remove descriptions that are out of our context, i.e., captions that are not associated with the speaker.
Linguistic annotation attributes are verified for each
token in the sentences. If no tokens related to nouns,
pronouns or nominal subjects are found, the sentence
is discarded. Then, a double check is performed over
the sentences that were kept after this second filtering.
This is done using WordNet cognitive synonym sets
(synsets) (Miller, 1995), verifying whether the tokens
are associated with the concept person, i.e., whether they are one of its synonyms or hyponyms (Krishna et al., 2017).
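One possible realization of both checks, assuming spaCy for the linguistic annotations and the NLTK interface to WordNet (illustrative choices; the method only requires part-of-speech and dependency attributes plus WordNet synsets):

```python
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")    # provides POS tags and dependency labels
PERSON = wn.synset("person.n.01")     # reference concept for the double check

def has_subject(sentence):
    """Keep sentences that contain a noun, pronoun or nominal subject."""
    return any(tok.pos_ in {"NOUN", "PROPN", "PRON"} or tok.dep_ == "nsubj"
               for tok in nlp(sentence))

def refers_to_person(sentence):
    """Double check: some token must be a synonym or hyponym of 'person'."""
    for tok in nlp(sentence):
        for syn in wn.synsets(tok.lemma_, pos=wn.NOUN):
            if syn == PERSON or PERSON in syn.closure(lambda s: s.hypernyms()):
                return True
    return False

def filter_out_of_context(sentences):
    return [s for s in sentences if has_subject(s) and refers_to_person(s)]
```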
It is important to filter out short sentences since
they are more likely to contain obvious details and to be less informative for visually impaired people (ABNT, 2016; Lewis, 2018), for example, “a man has two eyes” or “the man’s lips”. We remove all sentences
shorter than the median length of the image captions.
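The length filter reduces to a comparison against the median token count, e.g.:

```python
import statistics

def drop_short_captions(captions):
    """Discard captions shorter than the median caption length (in tokens)."""
    median_len = statistics.median(len(c.split()) for c in captions)
    return [c for c in captions if len(c.split()) >= median_len]
```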
On the set of kept sentences, we standardize the
subject/person in the captions using the most frequent
one, aiming to achieve better results in the text gener-
ator. At the end of this step, we have a set of semanti-
cally diverse captions, related to the concept of person
and the surroundings. Due to the filtering process, the
kept sentences tend to be long enough to not contain
obvious or useless information.
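The subject standardization can be sketched as follows, where person_nouns denotes the person-related tokens already identified by the WordNet check (the helper name is ours):

```python
from collections import Counter

def standardize_subject(captions, person_nouns):
    """Replace every person-related noun with the most frequent one."""
    counts = Counter(tok for cap in captions for tok in cap.split()
                     if tok in person_nouns)
    if not counts:
        return captions
    most_frequent = counts.most_common(1)[0][0]
    return [" ".join(most_frequent if tok in person_nouns else tok
                     for tok in cap.split())
            for cap in captions]
```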
External Classifiers. To add information regard-
ing relevant characteristics for visually impaired peo-
ple, complementing the Dense Captioning results, we
apply people-focused learning models, such as age
detection and emotion recognition, and a scene clas-
sification model.
Due to the challenge of predicting the correct age
of a person, we aggregate the returned values into the age groups proposed by the World Health Organiza-
tion (Ahmad et al., 2001), i.e., Child, Young, Adult,
Middle-aged, and Elderly. Regarding emotion recog-
nition and scene classification models, we use their
outputs in cases where the model confidence is greater
than a threshold T_model_confidence. To avoid inconsis-
tencies in the output, age and emotion models are only
applied if a single person is detected in the image.
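A sketch of this aggregation and filtering follows; the numeric age boundaries and the default confidence value are assumptions, since the text only names the WHO groups and the threshold T_model_confidence, and age_model and emotion_model stand for the external classifiers:

```python
# Illustrative age boundaries for the WHO-based groups: (upper bound, label).
AGE_GROUPS = [(14, "child"), (24, "young"), (44, "adult"),
              (64, "middle-aged"), (200, "elderly")]

def age_group(predicted_age):
    """Map a raw age prediction to its WHO-based age group."""
    for upper_bound, label in AGE_GROUPS:
        if predicted_age <= upper_bound:
            return label

def confident(prediction, confidence, t_model_confidence=0.7):
    """Use a classifier output only when its confidence exceeds the threshold."""
    return prediction if confidence > t_model_confidence else None

def person_attributes(detected_people, age_model, emotion_model):
    """Age and emotion models run only when a single person is detected."""
    if len(detected_people) != 1:
        return None, None
    person = detected_people[0]
    emotion, emotion_conf = emotion_model(person)
    return age_group(age_model(person)), confident(emotion, emotion_conf)
```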
Sentence Generation and Text Concatenation.
From the output produced by the external classifiers,
coherent sentences are created to include information
about age group, emotion, and scene in the set of fil-
tered descriptions for the input image. The gener-
ated sentences follow the default structure: “there is
a/an <AGE> <NOUN>”, “there is a/an <NOUN>
who is <EMOTION>”, and “there is a/an <NOUN>
in the <SCENE>”, where AGE, EMOTION and
SCENE are the output of the learning methods, and
the <NOUN> is the most frequent person-related noun used to standardize the descriptions. Examples of generated sentences are “there is a middle-aged woman”, “there is a man who is sad”, and “there is a boy in the office”. These sentences are concatenated with the set
of previously filtered captions into a single text.
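The template filling itself is straightforward, as in the sketch below (the article handling is a simplification):

```python
def template_sentences(noun, age=None, emotion=None, scene=None):
    """Fill the default templates; missing classifier outputs produce no sentence."""
    def article(word):
        return "an" if word[0] in "aeiou" else "a"
    sentences = []
    if age:
        sentences.append(f"there is {article(age)} {age} {noun}")
    if emotion:
        sentences.append(f"there is {article(noun)} {noun} who is {emotion}")
    if scene:
        sentences.append(f"there is {article(noun)} {noun} in the {scene}")
    return sentences

# e.g. template_sentences("woman", age="middle-aged", emotion="sad", scene="office")
# -> ["there is a middle-aged woman", "there is a woman who is sad",
#     "there is a woman in the office"]
```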
3.2 Context Generation
In this phase, a neural language model is fed with the output of the first phase and generates a coherent and connected summary, which goes through a new cleaning process to create a quality image context.
Summary Generation. To create a human-like sen-
tence, i.e., a coherent and semantically connected
structure, we apply a neural language model to pro-
duce a summary from the concatenated descriptions
resulting from the previous phase. Five distinct sum-
maries are generated by the neural language model.
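For illustration, assuming a Hugging Face abstractive summarizer as the neural language model (the concrete model is an assumption), five distinct candidates can be obtained by sampling:

```python
from transformers import pipeline

# Illustrative model choice; any abstractive summarizer with a similar
# interface could play the role of the neural language model.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def generate_summaries(concatenated_text, n=5):
    """Produce n candidate summaries from the concatenated descriptions."""
    summaries = []
    for _ in range(n):
        out = summarizer(concatenated_text, do_sample=True,
                         min_length=30, max_length=120)
        summaries.append(out[0]["summary_text"])
    return summaries
```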
Summary Cleaning. One important step is to verify whether the summary produced by the neural language model achieves high similarity with the input text provided by the Image Analyzer phase. We ensure this similarity by filtering out phrases inside the summary whose cosine similarity with the input is less than a threshold α. A
second threshold β is used as an upper limit to the
similarity values between pairs of sentences inside the
summary to remove duplicated sentences. In pairs of
sentences in which the similarity is greater than β, one
of them is removed to decrease redundancy. As a result, after applying the two threshold-based filters, we obtain summaries that are related to the input and have a low probability of redundancy.
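A sketch of the two threshold-based filters, reusing the cosine function from the Image Analyzer sketch; embed stands for the same sentence-embedding function, and the default values of α and β are placeholders:

```python
def clean_summary(summary_sentences, input_text, embed, alpha=0.5, beta=0.9):
    """Keep summary sentences related to the input (>= alpha) and drop
    near-duplicate pairs inside the summary (> beta)."""
    input_vec = embed(input_text)
    # 1) relatedness to the Image Analyzer output
    related = [(s, embed(s)) for s in summary_sentences
               if cosine(embed(s), input_vec) >= alpha]
    # 2) redundancy inside the summary
    kept = []
    for sentence, vec in related:
        if all(cosine(vec, v) <= beta for _, v in kept):
            kept.append((sentence, vec))
    return [s for s, _ in kept]
```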
Quality Estimation. After generating and cleaning
the summaries, it is necessary to select one of the summaries returned by the neural language model. We model the selection process to address the needs of visually impaired people in understanding an image context by estimating the quality of the paragraphs.