Introspective knowledge acquisition for case retrieval networks in textual case base reasoning.
Abstract
Textual Case Based Reasoning (TCBR) aims at effective reuse of information
contained in unstructured documents. The key advantage of TCBR over traditional
Information Retrieval systems is its ability to incorporate domain-specific knowledge
to facilitate case comparison beyond simple keyword matching. However, substantial
human intervention is needed to acquire and transform this knowledge into a form
suitable for a TCBR system. In this research, we present automated approaches that
exploit statistical properties of document collections to alleviate this knowledge
acquisition bottleneck. We focus on two important knowledge containers: relevance
knowledge, which shows relatedness of features to cases, and similarity knowledge,
which captures the relatedness of features to each other. This terminology is derived
from the Case Retrieval Network (CRN), a retrieval architecture in TCBR that serves as
the underlying formalism in this thesis, where it is applied to text classification.
Latent Semantic Indexing (LSI) generated concepts are a useful resource for
relevance knowledge acquisition for CRNs. This thesis introduces a supervised LSI
technique called "sprinkling" that exploits class knowledge to bias LSI's concept
generation. An extension of this idea, called Adaptive Sprinkling (AS), is proposed to
handle inter-class relationships in complex domains like hierarchical (e.g. Yahoo
directory) and ordinal (e.g. product ranking) classification tasks. Experimental
evaluation results show the superiority of CRNs created with sprinkling and AS, not
only over LSI on its own, but also over state-of-the-art classifiers like Support Vector
Machines (SVM).
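
The idea of biasing LSI with class knowledge can be illustrated with a minimal sketch. It assumes that sprinkling appends artificial class-indicator terms to the term-document matrix before the singular value decomposition and discards them afterwards; the function name, the parameters n_sprinkle and rank, and the toy data are hypothetical and are not taken from the thesis.

```python
import numpy as np

def sprinkled_lsi(X, labels, n_classes, n_sprinkle=2, rank=2):
    """Sketch of class-biased LSI (assumed reading of "sprinkling").

    X      : terms x documents matrix
    labels : class index of each document
    Returns a rank-reduced terms x documents matrix.
    """
    n_terms, n_docs = X.shape
    # Assumption: each class contributes n_sprinkle artificial term rows,
    # set to 1 for every document belonging to that class.
    S = np.zeros((n_classes * n_sprinkle, n_docs))
    for d, c in enumerate(labels):
        S[c * n_sprinkle:(c + 1) * n_sprinkle, d] = 1.0
    A = np.vstack([X, S])
    # Truncated SVD of the augmented matrix.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    # Drop the artificial rows so only the original terms remain.
    return A_k[:n_terms, :]

# Toy data: 4 terms, 4 documents, 2 classes (illustrative only).
X = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 0]], dtype=float)
labels = [0, 0, 1, 1]
X_sprinkled = sprinkled_lsi(X, labels, n_classes=2)
```

With n_sprinkle=0 the sketch reduces to ordinary rank-limited LSI, which makes it easy to compare the two representations on the same data.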
Current statistical approaches based on feature co-occurrences can be utilized to
mine similarity knowledge for CRNs. However, related words often do not co-occur in
the same document, though they co-occur with similar words. We introduce an
algorithm to efficiently mine such indirect associations, called higher order
associations. Empirical results show that CRNs created with the acquired similarity
knowledge outperform both LSI and SVM.
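
To make the notion of an indirect association concrete, the following sketch counts first-order and second-order co-occurrences with plain matrix products. It is a naive illustration under our own assumptions, not the efficient mining algorithm of the thesis; the function name and the three-term vocabulary are hypothetical.

```python
import numpy as np

def cooccurrence_orders(B):
    """B: terms x documents incidence matrix.

    Returns (first, second):
      first[a, c]  = number of documents containing both a and c
      second[a, c] = number of intermediate terms b that co-occur
                     with a and with c
    """
    B = (np.asarray(B) > 0).astype(int)
    first = B @ B.T
    np.fill_diagonal(first, 0)          # ignore self co-occurrence
    linked = (first > 0).astype(int)    # 1 if two terms ever co-occur
    second = linked @ linked            # paths a - b - c of length two
    np.fill_diagonal(second, 0)
    return first, second

# Rows: "car", "automobile", "engine"; columns: two documents.
# "car" and "automobile" never share a document, but both occur
# with "engine".
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])
first, second = cooccurrence_orders(B)
```

In this toy example the first-order count between "car" and "automobile" is zero, while the second-order count is non-zero because "engine" links them.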
Incorporating acquired knowledge into the CRN transforms it into a densely
connected network. While improving retrieval effectiveness, this has the unintended
effect of slowing down retrieval. We propose a novel retrieval formalism called the
Fast Case Retrieval Network (FCRN) which eliminates redundant run-time
computations to improve retrieval speed. Experimental results show FCRN's ability to
scale up over high dimensional textual casebases.
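
One way to see where run-time redundancy can arise is to write retrieval as two matrix products: the query first spreads over the similarity knowledge and then over the relevance knowledge. The sketch below assumes that the two stages can be folded into a single precomputed matrix; this is our assumption about how the saving might be obtained, and all function and variable names are hypothetical.

```python
import numpy as np

def crn_retrieve(query, sim, rel):
    """Two-stage retrieval: features activate similar features
    (sim: features x features), then cases (rel: cases x features)."""
    activated = sim @ query
    return rel @ activated

def precompute_effective_relevance(sim, rel):
    """Done once, offline: combine similarity and relevance knowledge."""
    return rel @ sim

def fast_retrieve(query, effective_rel):
    """Single product per query against the precomputed matrix."""
    return effective_rel @ query

# Toy network: 3 features, 2 cases (illustrative values only).
sim = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.2],
                [0.0, 0.2, 1.0]])
rel = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
query = np.array([1.0, 0.0, 0.0])

effective_rel = precompute_effective_relevance(sim, rel)
scores_two_stage = crn_retrieve(query, sim, rel)
scores_fast = fast_retrieve(query, effective_rel)
```

Because matrix multiplication is associative, both routes give the same case scores; the precomputed route simply moves the feature-to-feature propagation out of the query loop.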
Finally, we investigate novel ways of visualizing and estimating the complexity of
textual casebases, which can help explain performance differences across casebases.
Visualization provides a qualitative insight into the casebase, while complexity is a
quantitative measure that characterizes classification or retrieval hardness intrinsic to a
dataset. We study correlations of experimental results from the proposed approaches
against complexity measures over diverse casebases.