Groups | Blog | Home
all groups > sql server programming > august 2003 >

sql server programming : Database Representation of Undirected Graph (Peer Nodes)


Dave
8/7/2003 11:10:15 PM
I'm trying to figure out the best way to represent a
network of nodes in an undirected graph.
Each node can have n peers.

I'd like to be able to derive the following:

1) Count all nodes in a given network
2) Retrieve all nodes closer than X steps away from a
given node
3) Retrieve all nodes further than X steps away from a
given node

What's the best way to represent this?
David Portas
8/8/2003 7:40:13 AM
If your graph is a straightforward hierarchy then see these articles:

http://www.intelligententerprise.com/001020/celko1_1.shtml
http://www.dbazine.com/tropashko4.html

If it's a reconvergent graph there is a discussion in this thread which
gives some pointers:

http://groups.google.com/groups?selm=O4d55UwzCHA.2368%40TK2MSFTNGP12

--
David Portas
------------
Please reply only to the newsgroup
--


Joe Celko
8/8/2003 11:46:16 AM
[quoted text, click to view]

The best way I have found in SQL is an adjacency list.

CREATE TABLE DirectedGraph
(source CHAR(3) NOT NULL,
destination CHAR(3) NOT NULL,
cost INTEGER NOT NULL,
CHECK(source < node_2),
PRIMARY KEY (source, destination));

then use

CREATE VIEW Graph (node_1, node_2, cost)
AS SELECT source, destination, cost FROM DirectedGraph
UNION ALL
SELECT destination, source, cost FROM DirectedGraph;

1) Count all nodes in a given network

SELECT COUNT(DISTINCT node_1)
FROM Graph;

2) Retrieve all nodes closer than X steps away from a
given node.

This is very ugly, but it is the best I have. The best way is probably
to do the Floyd-Warshall or Johnson algorithm in a procedural language
and load a table with the results. But I want to do this in pure SQL as
an exercise. Let's start with a simple graph and represent it as an
adjacency list with weights on the edges.

CREATE TABLE Graph
(source CHAR(2) NOT NULL,
destination CHAR(2) NOT NULL,
cost INTEGER NOT NULL,
PRIMARY KEY (source, destination));

I got data for this table from the book Introduction to Algorithms by
Cormen, Leiserson and Rivest (ISBN 0-262-03141-8), page 518. This book
is very popular in college courses in the United States. I made one
decision that will be important later; I added self-traversal edges; the
node is both the source and the destination so the cost of those paths
is zero.

INSERT INTO Graph
VALUES ('s', 's', 0),
('s', 'u', 3),
('s', 'x', 5),
('u', 'u', 0),
('u', 'v', 6),
('u', 'x', 2),
('v', 'v', 0),
('v', 'y', 2),
('x', 'u', 1),
('x', 'v', 4),
('x', 'x', 0),
('x', 'y', 6),
('y', 's', 3),
('y', 'v', 7),
('y', 'y', 0);

I am not happy about this approach, because I have to decide the maximum
number of edges in path before I start looking for an answer. But this
will work and I know that a path will have no more than the total number
of nodes in the graph. Let's create a table to hold the paths:

CREATE TABLE Paths
(step_1 CHAR(2) NOT NULL,
step_2 CHAR(2) NOT NULL,
step_3 CHAR(2) NOT NULL,
step_4 CHAR(2) NOT NULL,
step_5 CHAR(2) NOT NULL,
total_cost INTEGER NOT NULL,
path_length INTEGER NOT NULL,
PRIMARY KEY (step_1, step_2, step_3, step_4, step_5));


The step_1 node is where I begin the path. The other columns are the
second step, third step, fourth step, and so forth. The last step
column is the end of the journey. The total_cost column is the total
cost, based on the sum of the weights of the edges, on this path. The
path length column is harder to explain, but for now, let's just say
that it is a count of the nodes visited in the path.

To keep things easier, let's look at all the paths from "s" to "y" in
the graph. The INSERT INTO statement for construction that set looks
like this:

INSERT INTO Paths
SELECT G1.source, -- it is 's' in this example
G2.source,
G3.source,
G4.source,
G4.destination, -- it is 'y' in this example
(G1.cost + G2.cost + G3.cost + G4.cost),
(CASE WHEN G1.source
NOT IN (G2.source, G3.source, G4.source)
THEN 1 ELSE 0 END
+ CASE WHEN G2.source
NOT IN (G1.source, G3.source, G4.source)
THEN 1 ELSE 0 END
+ CASE WHEN G3.source
NOT IN (G1.source, G2.source, G4.source)
THEN 1 ELSE 0 END
+ CASE WHEN G4.source
NOT IN (G1.source, G2.source, G3.source)
THEN 1 ELSE 0 END)
FROM Graph AS G1, Graph AS G2, Graph AS G3, Graph AS G4
WHERE G1.source = 's'
AND G1.destination = G2.source
AND G2.destination = G3.source
AND G3.destination = G4.source
AND G4.destination = 'y';

I put in "s" and "y" as the source and destination of the path, and made
sure that the destination of one step in the path was the source of the
next step in the path. This is a combinatorial explosion, but it is
easy to read and understand.

The sum of the weights is the cost of the path, which is easy to
understand. The path_length calculation is a bit harder. This sum of
CASE expressions looks at each node in the path. If it is unique within
the row, it is assigned a value of one, if it is not unique within the
row, it is assigned a value of zero.

All paths will have five steps in them because that is the way to table
is declared. But what if a path exists between the two nodes which is
shorter than five steps? That is where the self-traversal rows are used!
Consecutive pairs of steps in the same row can be repetitions of the
same node.

Here is what the rows of the Paths table look like after this INSERT
INTO statement, ordered by descending path_length, and then by ascending
cost.

Paths
step_1 step_2 step_3 step_4 step_5 total_cost path_length
======================================================
s s x x y 11 0
s s s x y 11 1
s x x x y 11 1
s x u x y 14 2
s s u v y 11 2
s s u x y 11 2
s s x v y 11 2
s s x y y 11 2
s u u v y 11 2
s u u x y 11 2
s u v v y 11 2
s u x x y 11 2
s x v v y 11 2
s x x v y 11 2
s x x y y 11 2
s x y y y 11 2
s x y v y 20 4
s x u v y 14 4
s u v y y 11 4
s u x v y 11 4
s u x y y 11 4
s x v y y 11 4

Many of these rows are equivalent to each other. For example, the paths
('s', 'x', 'v', 'v', 'y', 11, 2) and ('s', 'x', 'x', 'v', 'y', 11, 2)
are both really the same path as ('s', 'x', 'v', 'y').

In this example, the total_cost column defines the cost of a path, so we
can eliminate some of the paths from the table with this statement, if
we want the lowest cost.

DELETE FROM Paths
WHERE total_cost
[quoted text, click to view]
FROM Paths);

In this example, it got rid of 3 out of 22 possible paths. Let's
consider another cost factor; the number of paths. People do not like
to change airplanes or trains. If they can go from Amsterdam to New
AddThis Social Bookmark Button