DELOS Task 2.6 Advanced Access Structures for Complex Similarity Measures

oint work of CNR, MUNI and UMIT

The Task on Advanced Access Structures for Complex Similarity Measures, which is part of the research cluster on Information Access and Personalization, has investigated different approaches to the problem of indexing and retrieving objects based on metric indexes and similarity measures. Three different approaches have been implemented as demonstrators, and they are now available for testing. Comments and feedback are mostly welcome.

These demonstrators are also the fulfillment of Deliverable 2.6.1 (Prototype implementation of indexing and processing algorithms for similarity search), which is part of the second Joint Program of Activities (JPA2) in the contract with the Commission.

Quantization Techniques in Metric Indexing

The objective of this activity was to analyze metric indexing approaches that extend the filter-and-refinement approach towards metric indexing. The outcome was an indexing strategy for metric spaces that (1) efficiently supports nearest neighbor, range, and ranking queries, (2) operates on any kind of multimedia data, (3) is generic in the metric distance measure to be employed, and (4) allows for straightforward parallelization of query processing that comes with good scalability. Our approach investigates ways to integrate (1) existing metric indexing approaches, and (2) VA-file like quantization approaches into a novel indexing approach. We intended to tackle two problems of metrics-based similarity retrieval simultaneously. On the one hand, we addressed the I/O issue which is to reduce the amount of hard disk accesses. On the other hand, we saved computational expenses by rarely invoking a possibly time-consuming distance measure between media objects. We implemented an indexing (Java) and retrieval (mixed Java/C++) prototype framework focusing on the Earth Movers Distance between colour signatures of images. We are working at a refined quantization scheme that yields even smaller approximation errors, permits retrieval parallelization by disjoint index partitioning, obeys a strictly constrained main memory consumption, incorporates various clustering/pivot finding techniques, and considers other data domains and similarity measures. More information and a demo of the prototype are temporarily unavailable due to transfer from UMIT to University of Basel.

Similarity search approach in digital library applications exploiting XML encoded metadata

XML is becoming one of the primarily used formats for the representation of heterogeneous information in many and diverse application sectors, such as multimedia digital libraries, public administration, EDI, insurances, etc. This widespread use has posed a significant number of technical requirements to systems used for storage and content-based retrieval of XML data, and many others is posing today. In particular, retrieval of XML data based on content and structure has been widely studied and it has been solved with the definition of query languages such as XPath and XQuery and with the development of systems able to execute queries expressed in these languages. However, many other research issues are still open.
There are cases where the content of elements of XML documents cannot be exactly matched against constants expressed in query, as for instance in case of large text context or low-level feature descriptors, as in MPEG-7 visual or audio descriptors. The standardization effort carried-out by MPEG-7, intending to provide a normative framework for multimedia content description, has permitted several features for images to be represented as visual descriptors to be encoded in XML.
We investigated the architecture of a native XML search engine that approximate content match to be combined with traditional exact match search operations. Our XML database can store and retrieve any valid XML document without need of specifying or defining their schema. Our system store XML documents natively and uses special indexes for efficient path expression execution, exact content match search, and approximate match search. For instance, in case of an MPEG-7 visual descriptor, the system administrator can associate an approximate match search index to a specific XML element so that it can be efficiently searched by similarity. The XQuery language has been extended with new operators that deal with approximate match and ranking, in order to deal with these new search functionality.
The XML search engine is the core component of the MILOS multimedia content management system.
Further information and demonstrations of the system can be found at http://milos.isti.cnr.it

Scalable and Distributed Similarity Search Structures

Centralized metric indexes achieve a significant speedup (both in terms of distance computations and disk-page reads) when compared to a baseline approach, the sequential scan. However, experience with centralized methods reveals a strong correlation between the dataset size and search costs. More specifically, costs increase linearly with the growth of the dataset, i.e., it is practically twice as expensive to compute a similarity query in a dataset of a given size as it would be with a dataset of half that size. Thus, the ability of centralized indexes to maintain a reasonable query response time when the dataset multiplies in size, its scalability, is limited.
An important activity of this cluster was the research of scalable and distributed similarity search structures. The work capitalized on our previous research on centralized structures as well as the GHT* , i.e. the first attempt to make such structures scalable through distribution. It supports similarity search in generic metric spaces, and it is based on the idea of the Generalized Hyperplane Tree. The structure allows storing datasets from any metric space and has many essential properties of the distributed and P2P approaches. It is scalable, because every peer can perform an autonomous split and distribute the data over more peers at any time. It has no hotspot, and all peers use an addressing schema as precise as possible, while learning from misaddressing. Updates are performed locally and splitting never requires sending multiple messages to many peers. Finally, every peer can store data and perform similarity queries simultaneously. Though the first results are encouraging, much more research in this direction is necessary.
According to this view, a new similarity search structure was defined: The Metric Content Addressable Network (MCAN). MCAN is a distributed data structure for searching in Metric Spaces. Based on the Content Addressable Network (CAN), the MCAN uses pivots to map objects from the original Metric Space in a Vector Space which is then divided in zones to distribute objects between the participating peers. In MCAN it's possible to perform Range Query and K-Nearest Neighbour searches in a scalable way. Distributing objects among the participating peers, the parallel cost of a single operation can be limited while the search operation doesn't involve all the participating node because of the mapping is contractive.
For more information please contact Prof. Pavel Zezula from Masaryk University of Brno.