IMFD researchers present advances in search systems for graphs in SIGMOD/PODS

"Worst-Case-Optimal Similarity Joins on Graph Databases" is the title of the research presented at SIGMOD/PODS 2024 by Millennium Institute Foundational Research on Data (IMFD) researchers Benjamín Bustos, Aidan Hogan (IMFD alternate director) and Gonzalo Navarro, from Computation at U. de Chile; Diego Arroyuelo and IMFD director Juan Reutter, from Computation at U. Católica, and Adrián Gómez-Brandón (Universidade da Coruña).

Regarding this research, Gonzalo Navarro explains: "There is a lot of recent work on so-called graph databases, which represent information in the form of nodes (the objects) and edges connecting nodes (and representing relationships between objects), because they are more flexible than relational databases. These networks are queried mainly by giving a small graph pattern, which has nodes and edges but where the identities of the nodes and edges can be variable. Each way to give a value to the variables so that the pattern appears in the graph is an answer to the query. As a very simplified example, in a network with information about pharmacies, a query like (F, sells, X), (X, component, paracetamol), (X, price, P), (F, has-agreement, Fonasa), gives me the pharmacies F with agreement with Fonasa that sell some medicine X with paracetamol, and the price P at which they sell it. Responding to these queries efficiently is not easy, and it is not uncommon for some of them to take seconds or even minutes. A quality standard for these algorithms is to be worst-case optimal. These types of algorithms are relatively recent and there is still a lot of research on them."

In this context, the academic says that one line of work developed at IMFD is called multimodal graphs, "which contain more semantics than mere connectivity between nodes. This type of information is very frequent in real life and to be effectively consulted requires enriching the language of pattern graphs with specific expressions of data types. For example, some nodes may be geographical places and have associated location information, and I would like to be able to look up the above query, but restricted to the pharmacy being within a certain distance radius. Or that the current time is between its opening and closing hours. Or that the price is less than a certain limit."

Professor Gonzalo Navarro points out that this scientific paper addresses the problem of being able to add metric information to objects and proximity conditions to pattern graphs. He explains that metrics are generalizations of Euclidean distances (such as those of the plane or space), covering not only the example of geographical distance but others such as similarity between images, audio, video, text, etc. (which are modeled as distances in very high-dimensional spaces). "We also wanted to maintain the worst-case-optimality condition of the search algorithms," he says.

In this way, the researchers show that it is possible to include this type of information and support a particular type of queries, "which is k nearest neighbors. It forms, among the objects that have the metric, a graph where each node knows which are the k nearest/closest nodes to it. The query can then ask for an object to be among the k nearest/closest to another. This allows us to ask, for example, for churches in my own city whose facade resembles (as an image) one I am interested in. We show that the k-neighbor graph can be joined to the database graph so as to integrate it elegantly into the same worst-case-optimal algorithm, reusing in a way all the known development on this algorithm. We show that, in several cases, we can retain worst-case-optimality, although not in all. And we implement the new algorithm and show that it is considerably better than a basic solution. This opens the door to support other types of queries (by distance, as in the example of pharmacies), and other types of data (topological, geographical regions, temporal, etc.)," comments the academic.

SOURCE: Communications DCC U. de Chile