Re: Making a tree with "millions and millions" of dynamic - Mailing list pgsql-general

From Arjen van der Meijden
Subject Re: Making a tree with "millions and millions" of dynamic
Date
Msg-id 3FD62927.70904@vulcanus.its.tudelft.nl
Whole thread Raw
In response to Re: Making a tree with "millions and millions" of dynamic nodes  (Christian Fowler <google@NOSPAM.gravesweeper.com>)
List pgsql-general
Christian Fowler wrote:

> Here is another approach (only first half of the page applies here)
> I was emailed personally in response to my original post:
>
> http://fungus.teststation.com/~jon/treehandling/TreeHandling.htm
>
> Basically, the (too obvious for me to think of ;-) concept appears to
> be storing a mapping of Node / Parent for every parent in the
> chain. Basically it splits up the materialized view concept into
> multiple rows, or one can think of it as the adjacency list with all
> parents stored. Perhaps call this the "Materialized List".
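For what it's worth, here is a minimal sketch of that "materialized list" / transitive-closure idea. Table and column names are my own invention, and I'm using SQLite through Python purely for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE closure (
    node     INTEGER NOT NULL,   -- the node itself
    ancestor INTEGER NOT NULL,   -- one ancestor in its chain (or the node itself)
    PRIMARY KEY (node, ancestor)
);
""")

# Tree: 1 -> 2 -> 3 (3 is a grandchild of 1).
# For every node we store one row per ancestor in its chain.
rows = [(1, 1),
        (2, 2), (2, 1),
        (3, 3), (3, 2), (3, 1)]
con.executemany("INSERT INTO closure VALUES (?, ?)", rows)

# "Who are all my ancestors?" becomes a single indexed lookup:
ancestors = [r[0] for r in con.execute(
    "SELECT ancestor FROM closure WHERE node = 3 AND ancestor <> 3 "
    "ORDER BY ancestor")]
print(ancestors)            # [1, 2]

# "Who are all my descendants?" is the mirror query:
descendants = [r[0] for r in con.execute(
    "SELECT node FROM closure WHERE ancestor = 1 AND node <> 1 "
    "ORDER BY node")]
print(descendants)          # [2, 3]
```

The price, of course, is one row per (node, ancestor) pair, so a deep tree multiplies the row count by its depth.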

Afaik, this is what Celko (in "SQL for Smarties") calls the "transitive
closure version of an adjacency list", or something like that.
It's in the same chapter as the adjacency list-based trees, near the
end, if you want to read that book.

> Personally, this is my likely favorite since I tend to "trust" the
> performance of giant tables that are nothing more than 2 columns of
> integers. Seems like this approach would handle most of the desired
> tree cases (select path & child count, insert, delete, move) very
> well.

I currently have (for now; I'm not sure yet which approach is best, or
good enough) a combination of this, nested sets and adjacency lists,
but I'm getting quite a bit of overhead for a few relations ;)

I have two types of data: categories (in a tree, with interlinks between
categories besides the parent-child relations) and documents.
I have a nested set + adjacency list for the tree itself and a
"transitive closure" table for the document-category relations (that
saves me some 70% of the querying time, compared to looking up those
relations via the nested-set table on the fly).

Anyway, my tables won't reach multiple millions of rows, and although
they might be heavily used in web pages, they will be relatively static.
So I can afford quite a bit of overhead.

> Hopefully postgres can handle a very active, very narrow 50 million
> row table without breaking a sweat.

It probably can; bear in mind that if you disable OIDs for that table,
it saves you 50M * 4 bytes of storage or so... And I don't think you'd
really need them (unless you want to do some lookups on the "newly
inserted row").

Best regards,

Arjen

> "Arjen van der Meijden" <acmmailing@vulcanus.its.tudelft.nl> wrote:
>
>>>Rick Gigger wrote:
>>>
>>>I was glad to see this topic come up on the list as I was
>>>about to start asking about some of these issues myself.  I
>>>would like to discuss each of the methods I have researched
>>>so far for doing trees in sql and see if anyone has some
>>>experience or insight into the topic that could help me.
>>
>>Have a look here: http://www.scenic-route.com/program/db/lists.htm
>>It is quite a nice overview with some advantages and disadvantages of
>>each method.
>>
>>
>>>I have a lot of thoughts on this, but here is my first question:
>>>
>>>In the first article at the end of the section on materialized paths
>>>and the beginning of the nested set section Tropashko basically says
>>>that neither one really does a good job at answering "who are all of
>>>my ancestors" but that the materialized path method has a nice kludge
>>>that you can do with some funky string manipulation.
>>
>>I don't see how the nested intervals query is better/faster than the
>>nested set approach, but both seem to be able to use indices.
>>The nested set uses a simple "where left between parent.left and
>>parent.right"; if you happen to have those numbers in your application's
>>memory, you could even build your query directly like that and save
>>yourself a self-join.
>>Another approach I used is something like:
>>SELECT * FROM nset n1, (SELECT DISTINCT * FROM nset WHERE nset.id =
>>$parentid) AS n2 WHERE n1.left BETWEEN n2.left AND n2.right
>>(the DISTINCT will force a materialization, which may or may not
>>improve performance; in my tests it did)
>>
>>With the materialized path you'll have to use a LIKE search (child.path
>>LIKE parent.path || '%'), where hopefully the optimiser notices that it
>>can treat this as a prefix match and use an index on the first part.
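A toy version of that prefix search (the path syntax and names are made up; SQLite via Python as a sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tree (id INTEGER PRIMARY KEY, path TEXT)")
# Each node stores its full path from the root, dot-terminated.
con.executemany("INSERT INTO tree VALUES (?, ?)",
                [(1, '1.'), (2, '1.2.'), (3, '1.2.3.'), (4, '1.4.')])

# Descendants of node 2: every path that starts with node 2's own path.
parent_path = '1.2.'
kids = [r[0] for r in con.execute(
    "SELECT id FROM tree WHERE path LIKE ? || '%' AND path <> ?",
    (parent_path, parent_path)).fetchall()]
print(kids)   # [3]
```

Whether the LIKE actually uses an index depends on the engine and, in PostgreSQL, on things like the operator class of the index.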
>>
>>
>>>Is this true?  Celko in his articles seems to make it sound
>>>like this query will be very fast.  But Tropashko it seems
>>
>>It'll use an index range-search, so I don't see how it can be faster,
>>except for a hash/index lookup or so (which is possible with the
>>transitive closure version of an adjacency list).
>>
>>
>>>is saying that it will be slow. Am I reading this right?  If
>>>so is Tropashko right?  Any ideas on this?  Any articles / papers
>>>that might shed some light on this specific question or the
>>>topic in general?
>>
>>The above article is a nice one to read I think.
>>
>>I'll try to summarize my thoughts on them, since I'm looking into an
>>efficient tree approach myself as well:
>>All methods have both advantages and disadvantages, and it'll depend on
>>your situation whether a given drawback is a showstopper or not. All are
>>bad when you want to remove a node from the tree, but the insert/update
>>penalties depend on the situation.
>>
>>* The adjacency list is lightning fast with inserts and updates, but a
>>bit clumsy with deletions (you need to reconnect the direct children to
>>another element, that's all). It'll be very fast with the two simple
>>queries "who are my direct children" and "who is my parent", but you
>>need either a recursive approach or some enhancement like a transitive
>>closure graph for queries like "who are all my descendants" and "who
>>are all my ancestors".
>>
>>So if you'll have a large tree, only a few levels deep, with a lot of
>>inserts and/or subtree movements, this seems to be the best approach.
>>And you can store a "max int" number of nodes.
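A quick sketch of both query styles on an adjacency list (names invented, SQLite via Python; the recursive part uses WITH RECURSIVE, which assumes an engine that supports it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE adj (id INTEGER PRIMARY KEY, parent INTEGER)")
# Tree: 1 is the root; 2 and 3 are its children; 4 is a child of 2.
con.executemany("INSERT INTO adj VALUES (?, ?)",
                [(1, None), (2, 1), (3, 1), (4, 2)])

# Direct children of 1: a trivial indexed lookup.
children = [r[0] for r in con.execute(
    "SELECT id FROM adj WHERE parent = 1 ORDER BY id")]
print(children)     # [2, 3]

# All descendants of 1 need recursion:
descendants = [r[0] for r in con.execute("""
    WITH RECURSIVE sub(id) AS (
        SELECT id FROM adj WHERE parent = 1
        UNION ALL
        SELECT a.id FROM adj a JOIN sub s ON a.parent = s.id
    )
    SELECT id FROM sub ORDER BY id
""")]
print(descendants)  # [2, 3, 4]
```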
>>
>>* The materialized path is not so efficient in terms of storage; it
>>does inserts fast, but updates and deletions are slow. Retrieval of all
>>descendants is fast, retrieval of the ancestors is extremely trivial
>>(they are embedded in the path itself), but retrieval of the direct
>>parent and/or children is somewhat more difficult (you'll need to look
>>at the depth; if you do that often, a functional index might be handy).
>>Perhaps the ltree implementation in contrib/ltree is worth a look, and
>>encoding the trees in much denser encodings (like the 256-bit encoding
>>I saw mentioned earlier, skipping the dot for just a single-bit
>>terminator) makes it less inefficient.
>>
>>So if you do a lot of inserts, not so many updates and deletes, and
>>have much need for entire ancestor/descendant lists, I suppose this
>>one is quite nice. But it is not so efficient in terms of storage, so
>>very large/deep trees impose a huge overhead, since each node stores
>>its entire path.
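The "ancestors are trivial" part can even be done without touching the database at all, e.g. (path format assumed, purely illustrative):

```python
# Ancestor paths of a materialized path can be derived by slicing the
# string; no query needed.
path = '1.2.3.'
labels = path.rstrip('.').split('.')
ancestor_paths = ['.'.join(labels[:i]) + '.' for i in range(1, len(labels))]
print(ancestor_paths)   # ['1.', '1.2.']
```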
>>
>>* The nested set is inefficient with insertions, updates and deletions,
>>but much more efficient with storage than the MP. To speed up
>>selecting only the direct parent/children you can use a depth
>>field (redundant, but handy), which will save you one or two extra
>>self-joins.
>>For relatively static trees the nested set is good, but actively
>>changing trees are very bad for your performance. You can store "max
>>int"/2 nodes in this approach.
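A sketch of that depth-field shortcut for direct children (toy data, invented names, SQLite via Python):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nset "
            "(id INTEGER PRIMARY KEY, lft INTEGER, rgt INTEGER, depth INTEGER)")
# Tree: 1 (1,8) at depth 0; children 2 (2,5) and 4 (6,7) at depth 1;
# node 3 (3,4) is a grandchild at depth 2.
con.executemany("INSERT INTO nset VALUES (?, ?, ?, ?)",
                [(1, 1, 8, 0), (2, 2, 5, 1), (3, 3, 4, 2), (4, 6, 7, 1)])

# Direct children of node 1: descendants at exactly depth + 1,
# avoiding the extra self-joins a plain nested set would need.
rows = con.execute("""
    SELECT c.id
    FROM nset p JOIN nset c
      ON c.lft BETWEEN p.lft AND p.rgt
     AND c.depth = p.depth + 1
    WHERE p.id = 1
    ORDER BY c.lft
""").fetchall()
print([r[0] for r in rows])   # [2, 4]
```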
>>
>>* The nested intervals approach uses the same idea as the nested set,
>>but with a static encoding. This allows you to do very efficient
>>inserts, or so the author claims. I have my doubts, though: if you
>>don't know the number of children of your parent and just want to do
>>"connect this child to that parent", it'll be more difficult; you'll
>>need to find that parent, count its children, decide the encoding
>>number of this new child and only then can you calculate its new
>>numbers.
>>Still more efficient than the nested set, but not as efficient as
>>claimed, and you DO need to look into your database, while the author
>>claimed you didn't need to.
>>I haven't worked with this approach yet, but I saw the author use a lot
>>of functions in his queries; I'm not sure whether that is a good idea,
>>as it might affect the usability of your indices...
>>Another disadvantage is its very inefficient use of the available
>>integer values: it can only reach a depth of 32 with normal integers,
>>afaik, and not as many children on a level as you'd perhaps like. The
>>author suggests defining an unlimited-precision integer, which is not
>>good for your performance at all (search for '"nested interval" sql'
>>on Google or so and you'll find his messages on the dbforums).
>>
>>* His other approach, the single-integer version, is even worse in
>>terms of efficient integer use: you can have only 32 levels in the best
>>case (the leftmost branch only!) and only about 30 nodes next to each
>>other in that same best case (since the numbers increase exponentially:
>>at level 1 they run 5, 11, 23, 47, 95, etc). The worst case (the
>>rightmost branch) can possibly store only one or two levels with only
>>one or two nodes...
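Taking the 5, 11, 23, 47, 95 sequence above (each next sibling number is 2n + 1), a quick back-of-the-envelope check of how soon a signed 32-bit integer runs out — this is my own sketch of the growth, not the author's exact encoding:

```python
# Count how many sibling numbers of the form n -> 2n + 1, starting at 5,
# fit below the signed 32-bit limit.
n, count = 5, 0
while n < 2**31:
    count += 1
    n = 2 * n + 1
print(count)   # 29 -- so only about 30 siblings fit, matching the text
```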
>>
>>So the claimed performance advantage is a bit limited: you'll either
>>need *very* large integers to store large trees, or you are simply
>>limited in the size of your tree by some relatively small numbers. And
>>that is a bit sad, since bad performance is only a problem/visible
>>with large(r) trees.
>>
>>I'm not sure whether the above summary is correct; anyone any ideas? I
>>have to use this in a research paper for my studies anyway, so
>>expanding my thoughts in an email is a nice test :)
>>Please comment.
>>
>>Best regards,
>>
>>Arjen van der Meijden
>>
>>
>>
>>
>
>



