Recursively traversing a partially ordered set - Mailing list pgsql-sql

From Jason Grout
Subject Recursively traversing a partially ordered set
Date
Msg-id 465C3380.1080207@creativetrax.com
Whole thread Raw
Responses Re: Recursively traversing a partially ordered set  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-sql
I have a performance related question (and a few code snippets to
contribute to the list archives).

I am trying to write a function that recursively walks a hierarchal
structure to find all ancestors or descendants of a particular item or
list of items.  My relationships are set up with the table:

CREATE TABLE relation (parent_id integer, child_id integer);
CREATE INDEX relation_parent ON relation(parent_id);
CREATE INDEX relation_child ON relation(child_id);

My data is *not* a tree (otherwise I'd probably use Celko's method).
Instead, my structure is a (graded) partially ordered set, which is more
general than a tree. In other words, each item has a level in the
hierarchy and can be connected to a number of nodes in the level above 
it and a number of nodes in the level below it. Items on the same level 
have no relation between them, but they generally have many common 
parents and many common children. In my case, items will generally have 
far more parents than children. Right now my relation table has several 
hundred thousand rows and 9 levels, but will eventually have tens of 
millions of rows and more levels. Note that all of these algorithms work 
just fine for trees too.

Given a list of item ids, I need to determine all items above them in
the hierarchy (or below them).  Since determining ancestors is the
inverse problem of determining descendants, I'll only treat the
ancestors problem in the rest of this email.  Note that the list I am
given may include items on different levels.

After playing with it for a few days, I ended up with four functions
(after trying lots of variations on these, based on things found in the
mailing list, documentation, and thinking about the problem).  The
functions are listed from slowest to fastest. My questions are at the
end of the email.

1. this rather nice, succinct, recursive (and slow) SQL function:

CREATE OR REPLACE FUNCTION get_ancestors_sql(integer[])   RETURNS SETOF integer AS
$BODY$
SELECT DISTINCT parent_id from subgraphs
WHERE child_id = ANY ($1 ||    CASE    WHEN EXISTS(SELECT child_id FROM relation            WHERE child_id = ANY($1))
    THEN ARRAY(SELECT * FROM            get_ancestors_sql(            ARRAY(SELECT parent_id FROM relation WHERE
       child_id=ANY($1))))    ELSE '{}'::int[] END)
 
$BODY$   LANGUAGE 'sql' STABLE;

2. this more verbose recursive (and still slow) pl/pgsql function:

CREATE OR REPLACE FUNCTION get_ancestors_pg(integer[])   RETURNS SETOF integer AS
$BODY$
DECLARE
r record;
s record;
BEGIN
SELECT INTO s child_id FROM relation WHERE child_id=ANY($1) LIMIT 1;
IF FOUND THEN
FOR r IN SELECT DISTINCT parent_id from relation    WHERE child_id = ANY ($1) OR    child_id IN (SELECT * FROM
get_ancestors_pg(          ARRAY(SELECT DISTINCT parent_id FROM relation WHERE            child_id=ANY($1)))) LOOP
RETURNNEXT r.parent_id;
 
END LOOP;
END IF;
END;
$BODY$   LANGUAGE 'plpgsql' STABLE;


3. this faster function that uses a table to cache results.  This could
be modified so that consecutive calls use the cache to provide much
quicker responses, but currently I delete the information at the end of
a function call.

CREATE OR REPLACE FUNCTION get_ancestors_pg_temp(ids integer[])   RETURNS SETOF integer AS
$BODY$
DECLARE
-- This is a "session" variable so that two concurrent calls
-- to the function don't step on each other and mutilate
-- each other's data.
randvar integer:=round(random()*2000000)::int;
i integer:=0;
r record;
BEGIN

INSERT INTO temp    SELECT DISTINCT g.parent_id, randvar from relation as g        WHERE g.child_id=ANY(ids);

LOOP
INSERT INTO temp    SELECT DISTINCT g.parent_id, randvar+1 from relation as g    WHERE g.child_id IN        (SELECT
parent_idFROM temp WHERE session=randvar);
 
EXIT WHEN NOT FOUND;
i:=i+1;
randvar:=randvar+1;
END LOOP;

FOR r IN SELECT parent_id FROM temp    WHERE session BETWEEN randvar-i AND randvar LOOP    RETURN NEXT r.parent_id;
END LOOP;

DELETE FROM temp WHERE session BETWEEN randvar-i and randvar;

END;
$BODY$   LANGUAGE 'plpgsql' VOLATILE;


4. This much, much faster version that loops and just throws everything
into an array and then spits out the array at the end.

CREATE OR REPLACE FUNCTION get_ancestors_pg_array(ids integer[])   RETURNS integer[] AS
$BODY$
DECLARE
result int[]:='{}'::int[];
nextiter int[]:='{}'::int[];
temp int[]:=ids;
BEGIN

LOOP
SELECT INTO nextiter    ARRAY(SELECT DISTINCT g.parent_id from relation as g        WHERE g.child_id=ANY(temp));
result:=nextiter||result;
EXIT WHEN nextiter[1] IS NULL;
temp:=nextiter;
END LOOP;

RETURN result;

END;
$BODY$   LANGUAGE 'plpgsql' STABLE;


The fourth function, get_ancestors_pg_array, seems to execute in about a
tenth the time that the other functions execute in.  My questions:

1. Is there a huge overhead for recursion?  In general, is it better to
stuff temporary results into a table (incurring the cost of an INSERT)
rather than recurse?

2. Is there a big difference in speed between using an array versus
using a SELECT in a WHERE condition?  In other words, which is generally
going to be faster:

SELECT * from table where field IN (some function returning a SETOF);

or

SELECT * from table where field = ANY(some function returning an array);

3. Is there a strong reason I should strip out duplicates in either of
the two cases in question 2?  Or is the performance about the same when
doing the queries whether or not the SETOF or arrays contain duplicates?

4. Can you see any obvious optimizations to the above functions
(particularly the last one)?

Thanks for your help. Thanks for the absolutely wonderful database and
solid documentation.  I originally did this project in MySQL and had the
weirdest errors (the errors turned out to be due to the default
case-insensitive collation of MySQL!).  That's when I decided to move to
postgresql when I updated the project.

Thanks,

Jason


pgsql-sql by date:

Previous
From: "Raj A"
Date:
Subject: Re: aggregate query
Next
From: Gregory Stark
Date:
Subject: Re: Recursively traversing a partially ordered set