Friday, June 7, 2024

SQL/PGQ and graph theory

The story goes almost two decades back. I was studying computer engineering in College of Engineering Pune. Prof Vinayak Joshi had recently joined COEP and had research interest in discrete mathematic, especially Lattices. He needed some software for exploring some patterns in graphs. Matlab, though almost three decades old by then, was new to us and quite expensive. He approached my scholar friend Amit Bose and requested him to write a program for exploring the desired graph patterns. If memory serves, Amit wrote the program in C and quite possibly used TurboC as an IDE. That was a remarkable feat given that graphs could consume a huge memory, C is a very basic language and TurboC was clumsy and clunky back then. May be Amit knew how to use gcc and linux which allowed huge memory models. (I am sure today's software engineers haven't heard about memory models.) In the due course we studied discrete mathematics, graph theory and also lattices.

Few years later I joined post-graduate program in Computer Science and Engineering at IIT Bombay. The institute employed eminent faculty in the area of theoretical computer science, including graph theory. I worked as a teaching assistant to Prof. Ajit Diwan and Prof. Abhiram Ranade. I also took a course run by Prof. Sundar Vishwanathan. That developed my interest in the graph theory. But I felt any work in graph theory required a lot of patience (more details to follow) and it was too difficult and intangible. Like many others in my class, I dared not take a research project in the subject.

Graph theory still fascinates me. Graphs are still my first go-to-tool to model and solve any problem in profession or life. If graph theory fails I use other tools. Fast forward to the present, when I saw opportunity to work with graphs in RDBMS, I immediately grabbed it. Follow pgsql-hackers thread for more about that work. I don't know whether Prof. Ajit had graph databases in his mind when he said to our class, "Do you think that I can not offer projects in databases? I can offer a lot of them". But that's quite true. In the age of AI and analytics, graph databases are once again a hot topic.

In Prof. Sundar prescribed "Introduction to graph theory (second edition)" by Douglas West for his course. the exercises in that book were like puzzles for me. I liked to work those out myself. Most of the times I didn't. One such problem was related to "king". If memory serves me, it was 1.4.38 on page 66 of that book, which I still have with me. I spent hours trying to prove the theorem but did not succeed. I went to Prof. Sundar for help. He patiently listened to all the things I had tried to solve that problem. He said I was very close to the solution and any hint from him would be as good as the solution itself. He suggested that I sit in his room and try again. After an hour of struggle, I left his room without any success. The problem still haunts me.

I don't know whether Matlab is still expensive and whether Prof. Joshi is still using programs to explore his graphs. But if he is, SQL/PGQ might come handy. Having it in PostgreSQL means they can use it for free. All the database capabilities allow them to store and retrieve the graphs they have tried in their research. Let's take a simple example, of a king.

In a digraph, a king is a vertex from which every vertex is reachable by a path of length at most 2. In other words, if a vertex v is a king, it is connected to every other vertex by a path of length at most two. Let's see how to do that with SQL/PGQ. Assume a table "vertexes" which contains all the vertexes of the graph and a table "edges" which contains all the edges in the graph connecting those vertices.

create table vertexes (id int primary key,
                        name varchar(10));
create table edges (id int primary key,
                    src int references vertexes(id),
                    dest int references vertexes(id),
                    name varchar(10));
create property graph tournament
    vertex tables (vertexes default label)
    edge tables (edges source key (src) references vertexes(id)
                        destination key(dest) references vertexes(id)
                        default label)

Let's build the query to find a king step by step. First step would be to find all the nodes reachable from a given node by a path of length at most 2.

select src_name, dest_name 
    from graph_table (tournament 
                        match (src is vertexes)->{1,2}(dest is vertexes)
                        where (src.id <> dest.id)
                        columns (src.name as src_name, dest.name as dest_name))
    order by src_name;;

I have discussed most of the constructs in my previous posts on DBaaG and its components. {1, 2} is a new construct being used here which indicates that the path between src and dest is of length maximum 2. Thus it lists all src nodes and respective dest nodes which are connected to their respective src node by a path of length at most 2. Also notice that we are eliminating the src node being reported as dest node to simplify the next step in the query.

Now we have to find node/s which is connected to all the nodes in such a way. To do that we simply count the distinct dest nodes an src node is reachable to. If this count is same as the number of vertexes in the graph but one, corresponding src node is the king.

select src_name, count(distinct dest_name) num_reachable_nodes 
    from graph_table (tournament
                match (src is vertexes)->{1,2}(dest is vertexes)
                where (src.id <> dest.id)
                columns (src.name as src_name, dest.name as dest_name))
    group by src_name
    having count(distinct dest_name) = (select count(*) - 1 from vertexes);

distinct in aggregate count makes sure to count each dest node only once. having clause filters every node which is not connected to all  the other nodes in the graph. If we populate vertex and edge tables as follows:

insert into vertexes values (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');
insert into edges values (1, 1, 2, 'a-b'), (2, 2, 3, 'b-c'), (3, 1, 3, 'a-c'), (4, 3, 4, 'c-d'), (5, 4, 5, 'd-e'), (6, 2, 1, 'b-a');

This will create a graph as shown in the figure below. Notice the cycle a->b->a. Node e is reachable from node a by a path of length 3. All other nodes are reachable from node a by a path of length at most 2.

Above query does not return any rows since there is no king in this graph.







Let's remove outlier node e and the edge connecting nodes d and e.

delete from edges where name = 'd-e';
delete from vertexes where name = 'e';
With that the above query returns two king nodes a and b.

src_name   num_reachable_nodes
---------- -------------------
a   | 3
b   | 3

PostgreSQL is loved by developers. Hope introduction of SQL/PGQ makes it popular among graph theory researchers as well.

No comments:

Post a Comment