Re: Making a tree with "millions and millions" of dynamic - Mailing list pgsql-general
From: Rick Gigger
Subject: Re: Making a tree with "millions and millions" of dynamic
Msg-id: 004301c3be98$3ec0e040$0700a8c0@trogdor
In response to: Re: Making a tree with "millions and millions" of dynamic ("Arjen van der Meijden" <acmmailing@vulcanus.its.tudelft.nl>)
List: pgsql-general
> > Rick Gigger wrote:
> >
> > I was glad to see this topic come up on the list as I was about to
> > start asking about some of these issues myself. I would like to
> > discuss each of the methods I have researched so far for doing trees
> > in SQL and see if anyone has some experience or insight into the
> > topic that could help me.
>
> Have a look here: http://www.scenic-route.com/program/db/lists.htm
> It is quite a nice overview with some advantages and disadvantages of
> each method.

Thanks for the link. It cleared up some things on the nested intervals
system.

> > I have a lot of thoughts on this, but here is my first question:
> >
> > In the first article, at the end of the section on materialized
> > paths and the beginning of the nested set section, Tropashko
> > basically says that neither one really does a good job of answering
> > "who are all of my ancestors", but that the materialized path method
> > has a nice kludge that you can do with some funky string
> > manipulation.
>
> I don't see how the nested intervals query is better/faster than the
> nested set approach, but both seem to be able to use indices.

Well, the nested intervals approach doesn't require you to update the
whole table if you insert a new node into the far left of the tree. That
one is kind of a deal breaker for me.

> The nested set uses a simple "where left between parent.left and
> parent.right"; if you happen to have those numbers in your
> application's memory, you could even build your query directly like
> that and save yourself a self join.

So with simple b-tree indexes on left and right, wouldn't it have to
first scan the index on left to get the whole set of items with
left < child.left, then scan the index on right to get the set where
right > child.left, and then intersect the tuples referenced by those
two scans? Would that be fast? Wouldn't it be faster to just do a
sequential scan? Or is there a better way? How would you need to set up
the indexes?
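For concreteness, here is the kind of thing I mean, as a throwaway
sketch using SQLite from Python rather than Postgres (the sample tree
and the column names lft/rgt are made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tree (id INTEGER, name TEXT, lft INTEGER, rgt INTEGER)")
db.executemany("INSERT INTO tree VALUES (?, ?, ?, ?)", [
    (1, "food",  1, 10),   # food
    (2, "fruit", 2, 5),    #   fruit
    (3, "apple", 3, 4),    #     apple
    (4, "meat",  6, 9),    #   meat
    (5, "pork",  7, 8),    #     pork
])

# "Who are all my ancestors?" is a single pair of range predicates,
# no recursion needed:
ancestors = db.execute(
    "SELECT t.name FROM tree t, tree c "
    "WHERE c.name = 'apple' AND c.lft > t.lft AND c.lft < t.rgt "
    "ORDER BY t.lft").fetchall()
print([r[0] for r in ancestors])   # ['food', 'fruit']

# And the downside I complained about: adding a node with lft = 2 forces
# every lft/rgt value >= 2 -- i.e. nearly the whole table -- to shift:
db.execute("UPDATE tree SET rgt = rgt + 2 WHERE rgt >= 2")
db.execute("UPDATE tree SET lft = lft + 2 WHERE lft >= 2")
db.execute("INSERT INTO tree VALUES (6, 'veg', 2, 3)")
```

What the Postgres planner actually does with the two inequalities is
exactly what I'd want to check with EXPLAIN before trusting it.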
Would you need a single index on (left, right)? How does it even handle
>, < in a situation like that? Or would an rtree index work better? I
guess I would need to try it myself before I could be certain that it
would actually use an index for that.

> With the materialized path you'll have to use a like-search
> (child.path like parent.path || '%'), where hopefully the optimiser
> notices that it can chop off the end of the string and just look at
> the first part.

Once again, I guess I would need to try it out to see if the optimizer
could handle that. If it could handle both of these cases, then
Tropashko's assertion that those are hard problems was simply untrue.
That would invalidate much of the need for his system.

> > Is this true? Celko in his articles seems to make it sound
> > like this query will be very fast. But Tropashko it seems
>
> It'll use an index range-search, so I don't see how it can be faster,
> except for a hash/index lookup or so (which is possible with the
> transitive closure version of an adjacency list).
>
> > is saying that it will be slow. Am I reading this right? If
> > so is Tropashko right? Any ideas on this? Any articles / papers
> > that might shed some light on this specific question or the
> > topic in general?
>
> The above article is a nice one to read, I think.

See above comments.

> I'll try to summarize my thoughts on them, since I'm looking into an
> efficient tree approach myself as well:
>
> All methods have both advantages and disadvantages, and it'll depend
> on your situation whether that is a showstopper or not. All are bad
> when you want to remove a node from the tree, but the insert/update
> penalties depend on the situation.
>
> * The adjacency list is lightning fast with inserts and updates, but a
> bit clumsy with deletions (you need to connect the direct children to
> another element, that's all). It'll be very fast with the two simple
> queries "who are my direct children" and "who is my parent", but you
> need either a recursive approach or some enhancement like a transitive
> closure graph for queries like "who are my predecessors" and "who are
> all my ancestors".
>
> So if you'll have a large tree, only a few levels deep, with a lot of
> inserts and/or subtree movements, this seems to be the best approach.
> And you can store "max int" nodes.
>
> * The materialized path is not so very efficient in terms of storage;
> it does inserts fast, but updates and deletions are slow. Retrieval of
> all ancestors is fast, and retrieval of the predecessors is extremely
> trivial.

But in order to use a "where row is a predecessor" clause in a more
complex SQL statement, you would need to write a user function that did
a funky string parsing operation and then returned the parsed-out ids
in a result set (unless it will use the index on "where child.path like
parent.path || '%'" -- I am doubtful on this one).

> but retrieval of the direct parent and/or children is somewhat more
> difficult (you'll need to look at the depth; if you do that often, a
> functional index might be handy).
>
> Perhaps the ltree implementation in contrib/ltree is worth a look, and
> encoding the trees in much denser encodings (like a 256-bit encoding I
> saw mentioned earlier, skipping the dot for just a single-bit
> terminator) makes it less inefficient.
>
> So if you do a lot of inserts, not so many updates and deletes, and
> have much need for the entire list of predecessors/ancestors, I
> suppose this one is quite nice. But it is not so efficient in terms of
> storage, so very large/deep trees impose a huge overhead since each
> node stores its entire path.

I don't really do a ton of them, but I would prefer to keep them as
fast as possible since the user will be waiting when they happen. Space
is not really a concern for me; I've only got about 40,000 rows.
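The funky string parsing itself is tiny in any language, and the
like-search on a prefix is equivalent to a plain range scan over an
ordinary b-tree index -- that range rewrite is exactly what the
optimizer would have to notice (in Postgres it depends on the locale,
as far as I know). A quick sketch, again SQLite from Python with a
made-up path format:

```python
import sqlite3

def ancestor_paths(path, sep="."):
    """Materialized paths of every ancestor of `path` (hypothetical
    format: "1.5.23" means node 23 under node 5 under root 1)."""
    parts = path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts))]

print(ancestor_paths("1.5.23"))   # ['1', '1.5']

# The descendant search goes the other way: "child.path LIKE
# parent.path || '%'" can be rewritten as a closed/open range scan:
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tree (id INTEGER, path TEXT)")
db.execute("CREATE INDEX tree_path_idx ON tree (path)")
db.executemany("INSERT INTO tree VALUES (?, ?)",
               [(1, "1"), (2, "1.2"), (3, "1.2.3"), (4, "1.4")])
prefix = "1.2"
rows = db.execute("SELECT id FROM tree WHERE path >= ? AND path < ?",
                  (prefix, prefix + "\uffff")).fetchall()
print(sorted(r[0] for r in rows))   # [2, 3]
```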
> * The nested set is inefficient with insertions, updates and
> deletions, but much more efficient with storage than the MP. To speed
> up the process of selecting only the direct parent/children you can
> use a depth field (redundant, but handy), which'll save you one or two
> extra self joins.
>
> For relatively static trees the nested set is good, but actively
> changing trees are very bad for your performance. You can store "max
> int"/2 nodes in this approach.

Once again, the user will be waiting while nodes are moved around in
the tree.

> * The nested intervals use the same approach as the nested set, but
> use a static encoding. This allows you to do very efficient inserts,
> so the author claims. I have my doubts though: if you don't know the
> number of children of your parent and just want to do "connect this
> child to that parent", it'll be more difficult. You'll need to find
> that parent, count its children, decide the encoding number of this
> new child, and only then can you calculate its new numbers.
>
> Still more efficient than the nested set, but not so efficient, and
> you DO need to look into your database, while the author claimed you
> didn't need to.
>
> I haven't worked with this approach yet, but I saw the author use a
> lot of functions in his queries. I'm not sure whether that is a good
> idea; it might affect the usability of your indices...
>
> Another disadvantage is its very inefficient use of the available
> integer values: it can only have a depth of 32 with normal integers,
> afaik, and not so many children on a level as you'd like, perhaps.
> The author suggests defining an unlimited integer, which is not so
> good for your performance at all (search for '"nested interval" sql'
> in google or so and you'll find his messages on the dbforums).
>
> * His other approach, the single-integer version, is even worse in
> terms of efficient integer use: you can have only 32 levels in the
> best case (the left-most branch only!) and only about 30 nodes next
> to each other in the same best case (since the numbers increase
> exponentially: level 1 = 5, 11, 23, 47, 95, etc). The worst case (the
> right-most branch) can possibly store only one or two levels with
> only one or two nodes...
>
> So the claimed advantage of better performance is a bit limited:
> you'll either need *very* large integers to store large trees, or you
> are simply limited in the size of your tree by some relatively small
> numbers. And that is a bit sad, since bad performance is only a
> problem/visible with large(r) trees.

You hit my main concerns with it dead on. I don't want to touch it
until I can find at least one person who has made it work for them. I
have yet to do so. I have read postings from many people who had some
big problems with it. The overflow issue is probably a deal breaker for
me. It might not be if I understood under what exact conditions it
would happen, but the whole thing has too many questions for me to want
to spend time on it. If I find someone who has gotten it to work in a
real-world, non-trivial situation then I will revisit it.

> I'm not sure whether the above summary is correct, anyone any idea?
> I've to use it in a research paper for my study anyway, so expanding
> my thoughts in an email is a nice test :)
>
> Please comment.

I have a variation of nested sets that involves updating only your
descendants on a move. That way changes to the left-most part of the
tree won't affect the whole table. I simply use a string for left and
right. It is obviously not hard to pick two strings for left and right
that have an infinite number of strings in between them. Thus it
doesn't have the same volatility problems as Celko's nested sets. It
doesn't add much to the materialized path though, except that no string
operations are required for getAllAncestors, and that if Celko's nested
sets will use the indexes, so will mine.
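To illustrate the "pick a string between two strings" step: here is a
hypothetical sketch only -- it maps keys to base-26 fractions, which a
real implementation probably wouldn't bother with, and it assumes keys
never end in 'a' (the zero digit) -- but it shows there is always room
between any two keys:

```python
from fractions import Fraction

DIGITS = "abcdefghijklmnopqrstuvwxyz"   # 'a' plays the role of the zero digit

def to_frac(key):
    # Read the key as the base-26 fraction 0.d1 d2 d3 ...
    f = Fraction(0)
    for i, ch in enumerate(key, start=1):
        f += Fraction(DIGITS.index(ch), 26 ** i)
    return f

def from_frac(f, ndigits):
    # Emit the first ndigits base-26 digits of the fraction f.
    out = []
    for _ in range(ndigits):
        f *= 26
        d = int(f)
        out.append(DIGITS[d])
        f -= d
    return "".join(out)

def between(lo, hi):
    """Return a key strictly between lo and hi (assumes lo < hi and
    that neither key ends in 'a')."""
    mid = (to_frac(lo) + to_frac(hi)) / 2
    n = 1
    while True:
        key = from_frac(mid, n)
        if lo < key < hi:
            return key
        n += 1

print(between("b", "d"))   # 'c'
print(between("b", "c"))   # 'bn' -- and between("b", "bn") works too, ad infinitum
```

The keys just grow a digit each time you split the same gap, which is
the storage cost this variant pays instead of Celko's mass renumbering.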
If materialized path will use indexes on getAllAncestors, though, mine
doesn't add any real functionality. It will probably come down to which
method is fastest at moving nodes at the top of the tree around.

rg