Abstract:
This thesis investigates the refinement of web search results with a special
focus on the use of clustering and the role of queries. It presents a
collection of new methods for evaluating clustering methods, for performing
clustering effectively, and for performing query refinement.
The thesis identifies different types of query, the situations where refinement
is necessary, and the factors affecting search difficulty. It then
analyses hard searches and argues that many of them fail because users
and search engines have different query models.
The thesis identifies best practice for evaluating web search results and
search refinement methods. It finds that none of the commonly used evaluation
measures for clustering has all of the properties of a good evaluation
measure. It then presents new quality and coverage measures that
satisfy all the desired properties and that rank clusterings correctly in all
web page clustering situations.
The thesis argues that current web page clustering methods work well
when different interpretations of the query have distinct vocabulary, but
still have several limitations and often produce incomprehensible clusters.
It then presents a new clustering method that uses the query to guide
the construction of semantically meaningful clusters. The new clustering
method significantly improves performance.
Finally, the thesis explores how searches and queries are composed of
different aspects and shows how to use aspects to reduce the distance between
the query models of search engines and users. It then presents fully
automatic methods that identify query aspects, identify underrepresented
aspects, and predict query difficulty. Used in combination, these methods
have many applications; the thesis describes methods for two of
them. The first method improves the search results for hard queries with
underrepresented aspects by automatically expanding the query using semantically
orthogonal keywords related to the underrepresented aspects.
The second method helps users refine hard ambiguous queries by identifying
the different query interpretations using a clustering of a diverse set
of refinements. Both methods significantly outperform existing methods.