Hybrid models for combination of visual and textual features in context-based image retrieval.
MetadataShow full item record
Visual Information Retrieval poses a challenge to intelligent information search systems. This is due to the semantic gap, the difference between human perception (information needs) and the machine representation of multimedia objects. Most existing image retrieval systems are monomodal, as they utilize only visual or only textual information about images. The semantic gap can be reduced by improving existing visual representations, making them suitable for a large-scale generic image retrieval. The best up-to-date candidates for a large-scale Content-based Image Retrieval are models based on the “Bag of Visual Words” framework. Existing approaches, however, produce high dimensional and thus expensive representations for data storage and computation. Because the standard “Bag of Visual Words” framework disregards the relationships between the histogram bins, the model can be further enhanced by exploiting the correlations between the “visual words”. Even the improved visual features will find it hard to capture an abstract semantic meaning of some queries, e.g. “straight road in the USA”. Textual features, on the other hand, would struggle with such queries as “church with more than two towers” as in many cases the information about the number of towers would be missing. Thus, both visual and textual features represent complementary yet correlated aspects of the same information object, an image. Existing hybrid approaches for the combination of visual and textual features do not take these inherent relationships into account and thus the combinations’ performance improvement is limited. Visual and textual features can be also combined in the context of relevance feedback. The relevance feedback can help us narrow down and “correct” the search. The feedback mechanism would produce subsets of visual query and feedback representations as well as subsets of textual query and textual feedback representations. A meaningful feature combination in the context of relevance feedback should take the inherent inter (visual-textual) and intra (visual-visual, textualtextual) relationships into account. In this work, we propose a principled framework for the semantic gap reduction in large scale generic image retrieval. The proposed framework comprises development and enhancement of novel visual features, a hybrid model for the visual and textual features combination, and a hybrid model for the combination of features in the context of relevance feedback, with both fixed and adaptive weighting schemes (importance of a query and its context). Apart from the experimental evaluation of our models, theoretical validations of some interesting discoveries on feature fusion strategies were also performed. The proposed models were incorporated into our prototype system with an interactive user interface.