alone try a sequence of queries. Thus, the onus is on
the query synthesis algorithm to provide a single
query that meets the user’s information needs well
when a searcher asks for help. This paper gives such
an algorithm to generate a query. Only an integrated
query provides access to resources that meet all
aspects of the user’s information needs as opposed
to some aspects of the needs.
Oyama et al. (2004) have suggested a different
approach to improve the quality of the web search
experience. They identify a domain specific keyword
spice to augment query terms so that only the
resources from the relevant domain are targeted by
the search. They illustrate their technique by
searching for beef cooking recipes. While a single
word query beef returns few links to the useful
resources, the keyword spiced query beef &
((ingredients & !season & !description) |
tablespoon) has good success in meeting the user’s
information needs. The main limitation of the
approach is that one needs to develop a keyword
spice for each information domain a user may be
interested in. Even if such a collection of keyword
spices could be developed, the problem remains
unabated as the users need to select correct keyword
spices for their information needs.
This paper presents an algorithm to construct
web queries that fit the query interface of Goolge
search engine. The algorithm uses Incremental
Learning (IL) algorithm (Sanchez, Triantaphyllou,
Chen and Liao, 2002) as modified in Malhotra et al.
(2005) to define an initial set of minterms. These
minterms collectively select all relevant examples
and each minterm rejects examples marked as
irrelevant by the viewing searcher. The algorithm
chooses and organizes these minterms to devise
query that accesses the best resources as measured
by the precision and recall characteristics and yet are
small to meet the size limit of the search engine.
Cohen et al. (1996) also generate a set of minterms
using a rule-learning technique RIPPER (Cohen,
1995). However, the RIPPER expressions are
already optimized to minimize misclassification
errors. This makes the expression less amenable to
further transformations to reduce their sizes to meet
limitations of search engine query interfaces.
In section 2, we briefly introduce relevance
feedback approaches used in text-categorization and
explain why we have chosen to use Boolean
expression based query synthesis approach. The
section also summarizes IL algorithm to construct a
set of minterms. This set of minterms constitutes the
primary input for our query synthesis algorithm
described in Section 3. The main goal of the
algorithm would be to synthesize web query that fits
the query interface of the search engine without
unduly compromising its access to the best web
resources. Section 4 presents results from a user
survey to establish the effectiveness of the
synthesized queries. In section 5 we conclude with a
description of planned work to integrate the
algorithm with a browser and also other
applications.
2 RELATED BACKGROUND
Relevance feedback has been studied extensively in
the context of text categorization (Baeza-Yates,
1991, Sebastiani, 2002). Given a corpus of
documents, certain terms are chosen as
discriminators. A query is a vector assigning weights
to the terms. Relevance feedback and query
expansion are used to adjust the terms and their
weights so that query is more aligned to documents
considered relevant and avoids documents
considered irrelevant (Ruthven and Lalmas, 2003).
Notwithstanding their success and usefulness in
text-categorization the vector based queries are little
used for web searching. Vector queries do not
express information needs in a way humans can
easily interpret. Thus, search engines use Boolean
expression based user interface for web searching.
Vector based approaches also use a large number of
terms in a query. On the other hand, it is important
for the search engines to limit the terms in queries to
deliver results efficiently and within an acceptable
time frame.
A web query also differs from the text
categorization in regards to its aims. An ideal text
categorization query for an information need is
required to locate all relevant documents without
retrieving any irrelevant document. The practical
algorithms – for example, RIPPER – aim for
misclassification minimization using a cost model
for errors. A web searcher typically views only a
few (usually, one) documents – clearly, one would
like these documents to meet the information need
perfectly.
A web search query is a list of terms (words)
punctuated by Boolean operators AND (&), OR (|)
and NOT (!). Following on from Google
conventions, AND is an implied operator and not
explicitly written. Operator NOT applies to a single
term and has highest precedence. The operator AND
(&) has lowest precedence. For a set of textual
documents, D, and a search query, Q, expression D σ
Q denotes the results of search by query Q over the
document set D with the following interpretation:
WEBIST 2006 - WEB INTERFACES AND APPLICATIONS
288