Re: Making a tree with "millions and millions" of dynamic - Mailing list pgsql-general

From Arjen van der Meijden
Subject Re: Making a tree with "millions and millions" of dynamic
Date
Msg-id 001401c3bbf3$0e9f6500$3ac15e91@acm
Whole thread Raw
In response to Re: Making a tree with "millions and millions" of dynamic  ("Rick Gigger" <rick@alpinenetworking.com>)
List pgsql-general
> Rick Gigger wrote:
>
> I was glad to see this topic come up on the list as I was
> about to start asking about some of these issues myself.  I
> would like to discuss each of the methods I have researched
> so far for doing trees in sql and see if anyone has some
> experience or insite into the topic that could help me.

Have a look here: http://www.scenic-route.com/program/db/lists.htm
It is quite a nice overview with some advantages and disadvantages of
each method.

> I have a lot of thouhts on this but here is my first question:
>
> In the first article at the end of the section on materialized paths
> and the beginning of the nested set section Tropashko basically says
> that neither one really does a good job at answering "who are all of
> my ancestors" but that the materialized path method has a nice kludge
> that you can do with some funky string manipulation.

I don't see how the nested intervals-query is better/faster than the
nested set approach, but both seem to be possibly using indices.
The nested set uses a simple "where left between parent.left and
parent.right", if you happen to have those numbers in your application's
memory, you could even build your query directly like that and save
yourself a self join.
Another approach, I used, is something like:
Select * from nset n1, (select (distinct) * from nset where nset.id =
$parentid) as n2 where n1.left between n2.left and n2.right
(the distinct will force a materialization, which might improve the
performance or not, in my tests it did)

With the materialized path you'll have to use a like-search (child.path
like parent.path || '%'), where hopefully the optimiser notices that it
can chop off the end of the string and just look at the first part

> Is this true?  Celko in his articles seems to make it sound
> like this query will be very fast.  But Tropashko it seems
It'll use an index range-search, so I don't see how it can be faster,
except for a hash/index lookup or so (which is possible with the
transitive closure version of an adjacency list).

> is saying that it will be slow. Am I reading this right?  If
> so is Tropashko right?  Any ideas on this?  Any articles / papers
> that might shed some light on this specific question or the
> topic in general?
The above article is a nice one to read I think.

I'll try to summarize my thoughts on them, since I'm looking into an
efficient tree approach myself aswell:
All methods have both advantages and disadvantages and it'll depend on
your situation whether that is a showstopper or not. All are bad when
you want to remove a node from the tree, but the insert/update penalties
depend on the situation.

* The adjancency list is lightning fast with inserts, updates but a bit
clumsy with deletions (you need to connect the direct children to
another element, that's all). It'll be very fast with the two simple
queries "who are my direct children" and "who is my parent", but you
need either a recursive approach or some enhancement like a transitive
closure-graph to enhance queries like "who are my predecessors" and "who
are all my ancestors".

So if you'll have a large tree, only few a levels deep and a lot of
inserts and/or subtree movements, this seems to be the best approach.
And you can store "max int"-number of nodes.

* The materialized path is not so very efficient in terms of storage, it
does inserts fast, but updates and deletions are slow. Retrieval of all
ancestors is fast, retrieval of the predecessors is extremely trivial,
but retrieval of the direct parent and/or children is somewhat more
difficult (you'll need to look at the depth, if you do that often a
functional-index might be handy).
Perhaps the ltree-implementation in contrib/ltree is worth a look and
encoding the trees in much more dense encodings (like a 256bit encoding
I saw mentioned earlier and skipping the dot for just a single-bit
terminator) makes it less inefficient.

So if you do a lot of inserts, not so many updates and deletes and have
much need for the entire list of predecessors/ancestors, I suppose this
one is quite nice. But it is not so efficient in terms of storage, so
very large/deep trees impose a huge overhead since each node stores its
entire path.

* The nested set is inefficient with insertions, updates and deletions
but much more efficient with storage than the MP. To speed up the
process of selecting only the direct parent/children you can use a depth
field (redundant, but handy) which'll save you one or two extra
selfjoins.
For relatively static trees the nested set is good, but actively
changing trees are very bad for your performance. You can store "max
int"/2 nodes in this approach.

* The nested intervals use the same approach as the nested set, but uses
a static encoding. This allows you to do very efficient inserts, so the
author claims. I have my doubts dough, if you don't know the number of
children of your parent and just want to do "connect this child to that
parent", it'll be more difficult, you'll need to find that parent, count
it's children, decide the encoding-number of this new child and then you
can calculate its new numbers.
Still more efficient than the nested set, but not so efficient and you
DO need to look into your database, while the author claimed you didn't
need to.
I havent't worked with this approach yet, but I saw the author use a lot
of functions in his queries, I'm not sure whether that is a good idea,
it might affect the usability of your indices...
Another disadvantage is its very inefficient use of the integer-values
available, it can only have a depth of 32 with normal integers, afaik
and not so many children on a level as you'd like perhaps. The author
suggests defining a unlimited integer, which is not so good for your
performance at all (search for '"nested interval" sql' in google or so
and you'll find his messages on the dbforums).

* His other approach, the single-integer version, is even worse in terms
of efficient integer-use, you can have only 32 levels at the best case
(the left most branch only!) and only about 30 nodes next to each other
in the same bestcase (since the numbers increase exponantional: level1 =
5, 11, 23, 47, 95, etc). The worst case (the right most branch) can
possibly store only one or two levels with only one or two nodes...

So the claimed advantage of the better performance is a bit limited,
you'll either need *very* large integers to store large trees, or you
are simply limited in the size of your tree by some relatively small
numbers. And that is a bit sad, since bad performance is only a
problem/visible with large(r) trees.

I'm not sure whether the above summary is correct, anyone any idea? I've
to use it in a research paper for my study anyway, so expanding my
thoughts in an email is a nice test :)
Please comment.

Best regards,

Arjen van der Meijden




pgsql-general by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: postgresql locks the whole table!
Next
From: Bruce Momjian
Date:
Subject: Re: Any *current* summary of postgres-r 7.2 status? (was Re: