Thread: Making a tree with "millions and millions" of dynamic nodes
I have a VERY LARGE pile of geographic data that I am importing into a database (db of choice is postgres, though I may hop to oracle if necessary). The data is strictly hierarchical - each node has one, and only one, parent. The depth should not exceed 6 or 7 levels. The initial import will have about 6 million leaves and 3 million branches. I would expect the leaves to grow significantly, easily tripling in number. However, the branches will likely stay very constant in number, though I expect their locations to shift around somewhat (affecting up to thousands of children).

For selection, it is crucial for me to get:

1. path generation speed
2. immediate sibling speed
3. children count speed

I have implemented a first run using Joe Celko's Nested Sets (w/ a mod to store tree level for speed). The update propagation issue is the achilles heel of this approach for me. I have read Vadim Tropashko's Nested Interval concept ( http://www.dbazine.com/tropashko4.html ), and my brain is painfully stretched enough to get the general idea. I have a feeling I will hit the arithmetic issues others have reported.

So it seems Materialized Path is my only option, however I am concerned about LIKE performance for the right hand side of the tree, where the path is 8 digits x 6 levels = 48 chars. Should I be concerned? I need split-second real-time performance, and can't imagine it will be as fast as the Nested Set arithmetic approach. I can flatten out the import to ensure the upper tree has the smallest numbers, however, it will save at most 8 chars on the path.

I've been googling through USENET archives watching the big debates of years gone by. I've grazed much knowledge, but now am curious if anyone has any thoughts/advice/war stories about a data set this large.

(and if any fellow postgres fans have some reassuring words about real-time performance of a 20 million row tree, i'd love to hear ;-)

--
Christian Fowler | http://www.steelsun.com/
"Christian Fowler" <google@NOSPAM.gravesweeper.com> wrote in message news:6b-dnQJnDI6YvlCiXTWc-g@speakeasy.net...
> For selection, it is crucial for me to get:
>
> 1. path generation speed

Path to the root? It is coded explicitly in the materialized path. All you need to do is parse the path and generate a list of prefixes that you use within your query like this:

select * from hierarchy where path in ('1', '1.2', '1.2.1', '1.2.1.1')

If you have an index, and your rdbms optimiser is smart enough, the query should execute fast.

> 2. immediate sibling speed

Here is the kludge:

select * from hierarchy where path in ('1.2.1.1', '1.2.1.2', '1.2.1.3', '1.2.1.4', '1.2.1.5')

Again, I assume that your query would use an index here.

If you get exactly 5 records, then you can't be sure that you have found all the children. You have to repeat the query like this:

select * from hierarchy where path in (
  '1.2.1.1',  '1.2.1.2',  '1.2.1.3',  '1.2.1.4',  '1.2.1.5',
  '1.2.1.6',  '1.2.1.7',  '1.2.1.8',  '1.2.1.9',  '1.2.1.10',
  '1.2.1.11', '1.2.1.12', '1.2.1.13', '1.2.1.14', '1.2.1.15',
  '1.2.1.16', '1.2.1.17', '1.2.1.18', '1.2.1.19', '1.2.1.20',
  '1.2.1.21', '1.2.1.22', '1.2.1.23', '1.2.1.24', '1.2.1.25')

Yet again, there might be more than 25 children, so you'll have to run a yet more "generous" query. The pitfall here is that at some point the optimiser may give up on doing the "or" expansion, so that you might end up with a full table scan. But this is implementation dependent, so you might be able to influence the optimizer's decision.

As you see, I'm not worried about how many bytes the materialized path has, or whether parsing it takes more time than multiplying 2 integers. My concern is whether your query can leverage an index or not.

> 3. children count speed

Children, or descendants? I guess no method can beat Celko's formula

#nodes in subtree = (rgt - lft + 1) / 2  (so #descendants = (rgt - lft - 1) / 2)

The count is implicitly stored at the root node, so that even for a hierarchy of 1M nodes we can answer how many nodes are there instantly.
This is also an Achilles' heel of the method: any update to the hierarchy triggers a refresh of the "counter". One aggregate is never enough, though: it's often useful to know the total salary too :-)

If you meant not descendants, but children, then use a method similar to bullet #2.
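A runnable sketch of the prefix-list ancestors query above, using Python's sqlite3 as a stand-in for postgres (the hierarchy table and its rows are invented for illustration):

```python
import sqlite3

# Toy materialized-path table; schema and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hierarchy (path TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO hierarchy VALUES (?, ?)", [
    ("1", "root"), ("1.2", "europe"), ("1.2.1", "france"),
    ("1.2.1.1", "paris"), ("1.2.1.2", "lyon"), ("1.2.2", "spain"),
])

def ancestors(path):
    """Path to the root: expand '1.2.1.1' into its prefixes and use IN,
    so the query can hit the index on path once per prefix."""
    parts = path.split(".")
    prefixes = [".".join(parts[:i + 1]) for i in range(len(parts))]
    marks = ",".join("?" * len(prefixes))
    cur = conn.execute(
        f"SELECT path, name FROM hierarchy WHERE path IN ({marks}) ORDER BY path",
        prefixes,
    )
    return cur.fetchall()

print(ancestors("1.2.1.1"))
# -> [('1', 'root'), ('1.2', 'europe'), ('1.2.1', 'france'), ('1.2.1.1', 'paris')]
```

No LIKE is involved for this direction of the query - the ancestor paths are computed client-side, so the database only does point lookups.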
Re: Making a tree with "millions and millions" of dynamic nodes
From: "Joe \"Nuke Me Xemu\" Foster"
"Mikito Harakiri" <mikharakiri@iahu.com> wrote in message <news:3kczb.23$2A3.117@news.oracle.com>...
> If you get exactly 5 records, then you can't be sure that your node has that
> many children. You have to repeat query like this: [snip]
> Yet again, there might be more than 25 children, so you'll have to run yet
> again more "generos" query.

Here's an alternative which may not perform very well but may still be better than risking a full table-scan...

select * from hierarchy
where path like '1.2.1.%' and path not like '1.2.1.%.%'

What if you use a fixed number of chars per level?
The sibling and immediate children queries then become something more like:

select * from hierarchy where path like '0001.0002.0001.____'

select count(*) from hierarchy where path like '0001.0002.0001.1234.____'

Since the number of characters per level is fixed, you can get rid of the dots too, and parsing becomes a matter of length and substring. Finding all descendants could be implemented using BETWEEN in either materialized path scheme:

select * from hierarchy where path between '1.2.1.1234.' and '1.2.1.1234~'

select * from hierarchy where path between '0001.0002.0001.1234.' and '0001.0002.0001.1234~'

select * from hierarchy where path between '0001000200011234.' and '0001000200011234~'

The '.' and '~' may need changing depending on collating order.

--
Joe Foster <mailto:jlfoster%40znet.com>  DC8s in Spaace: <http://www.xenu.net/>
WARNING: I cannot be held responsible for the above because my cats have apparently learned to type.
They're coming to take me away, ha ha!
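Joe's fixed-width scheme can be sketched as follows (Python's sqlite3 standing in for postgres; the 4-digits-per-level paths are invented for illustration):

```python
import sqlite3

# Fixed-width materialized paths: 4 digits per level, dot-separated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hierarchy (path TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO hierarchy VALUES (?)", [(p,) for p in [
    "0001", "0001.0002", "0001.0002.0001",
    "0001.0002.0001.0001", "0001.0002.0001.0002",
    "0001.0002.0001.0001.0001", "0001.0002.0002",
]])

# Immediate children: a single LIKE with a fixed-width '____' tail.
# No repeated "generous" IN-lists needed, however many children there are.
children = [r[0] for r in conn.execute(
    "SELECT path FROM hierarchy WHERE path LIKE '0001.0002.0001.____' "
    "ORDER BY path")]

# All descendants: an index-friendly BETWEEN range scan. '~' (0x7E) sorts
# after '.' and the digits in ASCII, so it closes the range.
descendants = [r[0] for r in conn.execute(
    "SELECT path FROM hierarchy WHERE path BETWEEN '0001.0002.0001.' "
    "AND '0001.0002.0001~' ORDER BY path")]

print(children)     # ['0001.0002.0001.0001', '0001.0002.0001.0002']
print(descendants)  # the two children plus '0001.0002.0001.0001.0001'
```

Note the '____' pattern matches exactly one more level because every level has the same width, which is what makes the sibling/children queries a single statement.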
Hi,

"Joe \"Nuke Me Xemu\" Foster" <bftsi0!joe@news.hub.org> wrote:

> Here's an alternative which may not perform very well but may
> still be better than risking a full table-scan...

I use exactly this method in a forum software, and it performs VERY well. I tested it with about a million rows, and it is really very fast, even with deep trees (1000 sub-branches). The only difference is that I use a base-255 encoded text column instead of only 0-9. But attention: the character set must be ASCII (ordering!) ...

I want to change this to bytea to avoid the base-255 encoding and the character set problems, but there are still some bugs with bytea and the like operator. :-(

Ciao
Alvar

--
** Alvar C.H. Freude -- http://alvar.a-blast.org/ -- http://odem.org/
** Berufsverbot? http://odem.org/aktuelles/staatsanwalt.de.html
** ODEM.org-Tour: http://tour.odem.org/
** Informationsgesellschaft: http://www.wsis-koordinierungskreis.de/
Alvar Freude wrote:
> I want to change this to bytea to avoid base255 encoding and the character
> set problems, but there are still some Bugs with bytea and the like
> operator. :-(

Be patient, I'm working on it. Apparently you are the first person to ever really try to *use* the like operator for bytea, because the bug has been there since the feature was originally committed.

Joe
> Christian Fowler wrote:
>
> So it seems Materialized Path is my only option, however I am
> concerned about LIKE performance for the right hand side of
> the tree, where the path is 8digits x 6 levels = 48 chars.
> Should I be concerned? I need split-second real-time
> performance, and can't imagine it will be as fast the Nested
> Set arithmatic approach.

Perhaps the contrib/ltree stuff is interesting if you're going to do materialized paths. It's an indexable tree format for postgresql based on materialized paths (so it seems, at least). I haven't really tested it, but wanted to point it out to you.

Regards,

Arjen van der Meijden
Christian,

you may try our contrib/ltree module,
see http://www.sai.msu.su/~megera/postgres/gist/ltree
With tuned postgresql and reasonable hardware you might get what you're looking for.

Oleg

On Tue, 2 Dec 2003, Christian Fowler wrote:
> [original post snipped]

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
I was glad to see this topic come up on the list as I was about to start asking about some of these issues myself. I would like to discuss each of the methods I have researched so far for doing trees in sql and see if anyone has some experience or insight into the topic that could help me. That being said, the four methods that seem to be the prevailing ideas are the following:

1) Adjacency List - Just use a parent field, nothing interesting or non-obvious here.
2) Materialized Path - Basically you store the path up to the node in a field called path.
3) Nested Sets - uh, read the articles below.
4) Nested Intervals - definitely read the first article.

Here are some articles explaining these concepts:

http://www.dbazine.com/tropashko4.html (this is from an earlier posting in this thread)
http://www.intelligententerprise.com/001020/celko.shtml (the nested set or Celko method, as it seems it should be called)
http://www.dbmsmag.com/9603d06.html (more on the Celko method)
http://www.dbmsmag.com/9604d06.html
http://www.dbmsmag.com/9605d06.html
http://www.dbmsmag.com/9606d06.html

I have a lot of thoughts on this, but here is my first question:

In the first article, at the end of the section on materialized paths and the beginning of the nested set section, Tropashko basically says that neither one really does a good job at answering "who are all of my ancestors", but that the materialized path method has a nice kludge that you can do with some funky string manipulation. Is this true? Celko in his articles seems to make it sound like this query will be very fast. But Tropashko, it seems, is saying that it will be slow. Am I reading this right? If so, is Tropashko right? Any ideas on this? Any articles / papers that might shed some light on this specific question or the topic in general?

thanks in advance

rg
> Rick Gigger wrote:
>
> I was glad to see this topic come up on the list as I was
> about to start asking about some of these issues myself.

Have a look here: http://www.scenic-route.com/program/db/lists.htm
It is quite a nice overview with some advantages and disadvantages of each method.

> In the first article at the end of the section on materialized paths
> and the beginning of the nested set section Tropashko basically says
> that neither one really does a good job at answering "who are all of
> my ancestors" but that the materialized path method has a nice kludge
> that you can do with some funky string manipulation.

I don't see how the nested intervals query is better/faster than the nested set approach; both seem to be able to use indices. The nested set uses a simple "where left between parent.left and parent.right"; if you happen to have those numbers in your application's memory, you could even build your query directly like that and save yourself a self join.

Another approach I used is something like:

select * from nset n1, (select (distinct) * from nset where nset.id = $parentid) as n2 where n1.left between n2.left and n2.right

(the distinct will force a materialization, which might or might not improve the performance; in my tests it did)

With the materialized path you'll have to use a like-search (child.path like parent.path || '%'), where hopefully the optimiser notices that it can chop off the end of the string and just look at the first part.

> Is this true? Celko in his articles seems to make it sound
> like this query will be very fast.
> But Tropashko it seems

It'll use an index range-search, so I don't see how it can be faster, except for a hash/index lookup or so (which is possible with the transitive closure version of an adjacency list).

> is saying that it will be slow. Am I reading this right? If
> so is Tropashko right? Any ideas on this? Any articles / papers
> that might shed some light on this specific question or the
> topic in general?

The above article is a nice one to read, I think.

I'll try to summarize my thoughts on them, since I'm looking into an efficient tree approach myself as well. All methods have both advantages and disadvantages, and it'll depend on your situation whether that is a showstopper or not. All are bad when you want to remove a node from the tree, but the insert/update penalties depend on the situation.

* The adjacency list is lightning fast with inserts and updates, but a bit clumsy with deletions (you need to connect the direct children to another element, that's all). It'll be very fast with the two simple queries "who are my direct children" and "who is my parent", but you need either a recursive approach or some enhancement like a transitive closure-graph for queries like "who are my predecessors" and "who are all my descendants". So if you'll have a large tree, only a few levels deep, and a lot of inserts and/or subtree movements, this seems to be the best approach. And you can store "max int"-number of nodes.

* The materialized path is not so very efficient in terms of storage; it does inserts fast, but updates and deletions are slow. Retrieval of all descendants is fast, retrieval of the predecessors is extremely trivial (they are encoded in the path), but retrieval of the direct parent and/or children is somewhat more difficult (you'll need to look at the depth; if you do that often, a functional index might be handy).
Perhaps the ltree implementation in contrib/ltree is worth a look, and encoding the trees in much denser encodings (like the base-255 encoding I saw mentioned earlier, and skipping the dot for just a single-byte terminator) makes it less inefficient. So if you do a lot of inserts, not so many updates and deletes, and have much need for the entire list of predecessors/descendants, I suppose this one is quite nice. But it is not so efficient in terms of storage, so very large/deep trees impose a huge overhead, since each node stores its entire path.

* The nested set is inefficient with insertions, updates and deletions, but much more efficient with storage than the MP. To speed up the process of selecting only the direct parent/children you can use a depth field (redundant, but handy), which'll save you one or two extra self joins. For relatively static trees the nested set is good, but actively changing trees are very bad for your performance. You can store "max int"/2 nodes in this approach.

* The nested intervals use the same approach as the nested set, but with a static encoding. This allows you to do very efficient inserts, so the author claims. I have my doubts though: if you don't know the number of children of your parent and just want to do "connect this child to that parent", it'll be more difficult - you'll need to find that parent, count its children, decide the encoding number of this new child, and only then can you calculate its new numbers. Still more efficient than the nested set, but not so efficient, and you DO need to look into your database, while the author claimed you didn't need to. I haven't worked with this approach yet, but I saw the author use a lot of functions in his queries; I'm not sure whether that is a good idea, it might affect the usability of your indices...
Another disadvantage is its very inefficient use of the available integer values: it can only have a depth of 32 with normal integers, afaik, and not so many children on a level as you'd like, perhaps. The author suggests defining an unlimited integer, which is not so good for your performance at all (search for '"nested interval" sql' in google or so and you'll find his messages on the dbforums).

* His other approach, the single-integer version, is even worse in terms of efficient integer use: you can have only 32 levels in the best case (the leftmost branch only!) and only about 30 nodes next to each other in that same best case (since the numbers increase exponentially: level 1 = 5, 11, 23, 47, 95, etc). The worst case (the rightmost branch) can possibly store only one or two levels with only one or two nodes...

So the claimed performance advantage is a bit limited: you'll either need *very* large integers to store large trees, or you are simply limited in the size of your tree by some relatively small numbers. And that is a bit sad, since bad performance is only a problem/visible with large(r) trees.

I'm not sure whether the above summary is correct - anyone any idea? I have to use it in a research paper for my study anyway, so expanding my thoughts in an email is a nice test :)
Please comment.

Best regards,

Arjen van der Meijden
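The nested-set containment queries described above can be sketched like this (Python's sqlite3 as a stand-in for postgres; the tree and its lft/rgt numbers are invented for illustration):

```python
import sqlite3

# Tiny nested-set tree (lft/rgt assigned by a depth-first walk):
#   A(1,10) -> B(2,5) -> D(3,4)
#           -> C(6,9) -> E(7,8)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nset (id TEXT PRIMARY KEY, lft INT, rgt INT)")
conn.executemany("INSERT INTO nset VALUES (?,?,?)", [
    ("A", 1, 10), ("B", 2, 5), ("D", 3, 4), ("C", 6, 9), ("E", 7, 8),
])

# Descendants of B: one BETWEEN containment test, an index range scan.
desc = [r[0] for r in conn.execute("""
    SELECT n1.id FROM nset n1, nset n2
    WHERE n2.id = 'B' AND n1.lft BETWEEN n2.lft AND n2.rgt
      AND n1.id <> n2.id
    ORDER BY n1.lft""")]

# Ancestors of E: the same containment test, in the other direction.
anc = [r[0] for r in conn.execute("""
    SELECT n1.id FROM nset n1, nset n2
    WHERE n2.id = 'E' AND n2.lft BETWEEN n1.lft AND n1.rgt
      AND n1.id <> n2.id
    ORDER BY n1.lft""")]

# Celko's subtree-size formula: (rgt - lft + 1) / 2, counting the node itself.
size = conn.execute(
    "SELECT (rgt - lft + 1) / 2 FROM nset WHERE id = 'A'").fetchone()[0]

print(desc, anc, size)  # ['D'] ['A', 'C'] 5
```

This is what makes the method attractive for static trees: both directions are pure arithmetic on two integers, with no string parsing at all.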
Oleg,

I worked for a company where we crunched 16+ terabytes of multiple drug company sales data per month. We of course used Sun's BIGGEST boxes and Oracle. But! the secret to doing it was LOTS of memory, putting the indexes, pointers, etc. in arrays, and using Oracle OCI calls.

If it is possible to use some lower level funcs, perhaps funcs like you see in the .c output of ecpg, this might theoretically be possible in postgres. I know that putting the index references (or functionally similar data) in arrays, and resolving the data location down to the logical or even physical disk location, is not only extremely FAST but possible, having done it on raw partitions and optical disks myself in custom designed databases. This approach would work extremely fast with only 20 million rows.

Yes, it would require some extra programming, but if you need REAL TIME response this approach could help solve your problem.

Best Regards,
Lynn P. Tilby
Unix Consultant
ltilby@asu.edu
Ph: 480 632-8635

Quoting Oleg Bartunov <oleg@sai.msu.su>:
> Christian,
>
> you may try our contrib/ltree module
> see http://www.sai.msu.su/~megera/postgres/gist/ltree
> With tuned postgresql and reasonable hardware you might get what you're
> looking for.
>
> Oleg
>
> [original post snipped]
Here is another approach (only the first half of the page applies here) I was emailed personally in response to my original post:

http://fungus.teststation.com/~jon/treehandling/TreeHandling.htm

Basically, the (too obvious for me to think of ;-) concept appears to be storing a mapping of Node / Parent for every parent in the chain. It splits up the materialized path concept into multiple rows, or one can think of it as the adjacency list with all parents stored. Perhaps call this the "Materialized List".

Personally, this is my likely favorite, since I tend to "trust" the performance of giant tables that are nothing more than 2 columns of integers. Seems like this approach would handle most of the desired tree cases (select path & child count, insert, delete, move) very well. Hopefully postgres can handle a very active, very narrow 50 million row table without breaking a sweat.

"Arjen van der Meijden" <acmmailing@vulcanus.its.tudelft.nl> wrote:
> [summary of tree methods snipped]

--
cfowler@steelsun.com | http://www.steelsun.com/
Christian Fowler wrote: > Here is another approach (only the first half of the page applies here) > I was emailed personally in response to my original post: > > http://fungus.teststation.com/~jon/treehandling/TreeHandling.htm > > Basically, the (too obvious for me to think of ;-) concept appears to > be storing a mapping of the Node / Parent for every parent in the > chain. Basically it splits up the materialized path concept into > multiple rows, or one can think of it as the adjacency list with all > ancestors stored. Perhaps call this the "Materialized List". Afaik, this is what Celko (in SQL for Smarties) calls the "transitive closure version of an adjacency list" or something like that. It's in the same chapter as the adjacency-list-based trees, near the end, if you want to read that book. > Personally, this is my likely favorite since I tend to "trust" the > performance of giant tables that are nothing more than 2 columns of > integers. Seems like this approach would handle most of the desired > tree cases (select path & child count, insert, delete, move) very > well. I have now (and for now, not sure which approach is best/good enough) a combination of this, nested sets and adjacency lists, but I'm getting quite a bit of overhead for a few relations ;) I have two types of data: categories (in a tree with interlinks between categories, besides the parent-child relations) and documents. I have a nested set + adjacency list for the tree itself and a "transitive closure" table for the document-category relations (that saves me some 70% of the querying time, compared to looking up those relations via the nested-set table on the fly). Anyway, my tables won't be multi-million row, although they might be heavily used in webpages, and will be relatively static. So I can allow quite a bit of overhead. > Hopefully postgres can handle a very active, very narrow 50 million > row table without breaking a sweat. 
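For concreteness, the "Materialized List" / transitive-closure scheme discussed above can be sketched in a few lines. This uses an in-memory SQLite database purely to keep the example self-contained; the table layout and names are my own, not from the linked page:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE node    (id INTEGER PRIMARY KEY, parent INTEGER);
-- one row per (ancestor, descendant) pair, plus a depth column for convenience
CREATE TABLE closure (ancestor INTEGER, descendant INTEGER, depth INTEGER);
CREATE INDEX closure_anc  ON closure (ancestor);
CREATE INDEX closure_desc ON closure (descendant);
""")

def insert_node(node_id, parent_id):
    """Add a node: copy the parent's ancestor rows and add a self-row."""
    con.execute("INSERT INTO node VALUES (?, ?)", (node_id, parent_id))
    con.execute("INSERT INTO closure VALUES (?, ?, 0)", (node_id, node_id))
    if parent_id is not None:
        con.execute(
            "INSERT INTO closure SELECT ancestor, ?, depth + 1 "
            "FROM closure WHERE descendant = ?", (node_id, parent_id))

# world(1) -> europe(2) -> france(3) -> paris(4), plus spain(5) under europe
for nid, pid in [(1, None), (2, 1), (3, 2), (4, 3), (5, 2)]:
    insert_node(nid, pid)

# path generation: a single indexed lookup, no recursion
path = [r[0] for r in con.execute(
    "SELECT ancestor FROM closure WHERE descendant = 4 ORDER BY depth DESC")]

# children count: direct children are exactly the depth-1 rows
children = con.execute(
    "SELECT COUNT(*) FROM closure WHERE ancestor = 2 AND depth = 1").fetchone()[0]

print(path, children)  # [1, 2, 3, 4] 2
```

The appeal is visible here: path generation and child counts are plain index scans on a two-integer (plus depth) table, and a subtree move touches one closure row per descendant rather than renumbering unrelated nodes.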
It probably can; bear in mind that if you disable OIDs for that table, it saves you 50M * 4 bytes of storage or so... And I don't think you'd really need them (unless you want to do some lookups on the "newly inserted row"). Best regards, Arjen > "Arjen van der Meijden" <acmmailing@vulcanus.its.tudelft.nl> wrote: > >>>Rick Gigger wrote: >>> >>>I was glad to see this topic come up on the list as I was >>>about to start asking about some of these issues myself. I >>>would like to discuss each of the methods I have researched >>>so far for doing trees in sql and see if anyone has some >>>experience or insight into the topic that could help me. >> >>Have a look here: http://www.scenic-route.com/program/db/lists.htm >>It is quite a nice overview with some advantages and disadvantages of >>each method. >> >> >>>I have a lot of thoughts on this but here is my first question: >>> >>>In the first article at the end of the section on materialized paths >>>and the beginning of the nested set section Tropashko basically says >>>that neither one really does a good job at answering "who are all of >>>my ancestors" but that the materialized path method has a nice kludge >>>that you can do with some funky string manipulation. >> >>I don't see how the nested intervals query is better/faster than the >>nested set approach, but both seem to be possibly using indices. >>The nested set uses a simple "where left between parent.left and >>parent.right"; if you happen to have those numbers in your application's >>memory, you could even build your query directly like that and save >>yourself a self join. 
>>Another approach I used is something like: >>Select * from nset n1, (select (distinct) * from nset where nset.id = >>$parentid) as n2 where n1.left between n2.left and n2.right >>(the distinct will force a materialization, which might improve the >>performance or not; in my tests it did) >> >>With the materialized path you'll have to use a like-search (child.path >>like parent.path || '%'), where hopefully the optimiser notices that it >>can chop off the end of the string and just look at the first part >> >> >>>Is this true? Celko in his articles seems to make it sound >>>like this query will be very fast. But Tropashko it seems >> >>It'll use an index range-search, so I don't see how it can be faster, >>except for a hash/index lookup or so (which is possible with the >>transitive closure version of an adjacency list). >> >> >>>is saying that it will be slow. Am I reading this right? If >>>so is Tropashko right? Any ideas on this? Any articles / papers >>>that might shed some light on this specific question or the >>>topic in general? >> >>The above article is a nice one to read, I think. >> >>I'll try to summarize my thoughts on them, since I'm looking into an >>efficient tree approach myself as well: >>All methods have both advantages and disadvantages, and it'll depend on >>your situation whether that is a showstopper or not. All are bad when >>you want to remove a node from the tree, but the insert/update penalties >>depend on the situation. >> >>* The adjacency list is lightning fast with inserts and updates, but a bit >>clumsy with deletions (you need to connect the direct children to >>another element, that's all). It'll be very fast with the two simple >>queries "who are my direct children" and "who is my parent", but you >>need either a recursive approach or some enhancement like a transitive >>closure-graph to enhance queries like "who are my predecessors" and "who >>are all my ancestors". 
>> >>So if you'll have a large tree, only a few levels deep, and a lot of >>inserts and/or subtree movements, this seems to be the best approach. >>And you can store "max int"-number of nodes. >> >>* The materialized path is not so very efficient in terms of storage; it >>does inserts fast, but updates and deletions are slow. Retrieval of all >>ancestors is fast, retrieval of the predecessors is extremely trivial, >>but retrieval of the direct parent and/or children is somewhat more >>difficult (you'll need to look at the depth; if you do that often, a >>functional index might be handy). >>Perhaps the ltree implementation in contrib/ltree is worth a look, and >>encoding the trees in much denser encodings (like a 256-bit encoding >>I saw mentioned earlier, skipping the dot for just a single-bit >>terminator) makes it less inefficient. >> >>So if you do a lot of inserts, not so many updates and deletes, and have >>much need for the entire list of predecessors/ancestors, I suppose this >>one is quite nice. But it is not so efficient in terms of storage, so >>very large/deep trees impose a huge overhead since each node stores its >>entire path. >> >>* The nested set is inefficient with insertions, updates and deletions >>but much more efficient with storage than the MP. To speed up the >>process of selecting only the direct parent/children you can use a depth >>field (redundant, but handy), which'll save you one or two extra >>self-joins. >>For relatively static trees the nested set is good, but actively >>changing trees are very bad for your performance. You can store "max >>int"/2 nodes in this approach. >> >>* The nested intervals use the same approach as the nested set, but use >>a static encoding. This allows you to do very efficient inserts, so the >>author claims. 
I have my doubts though: if you don't know the number of >>children of your parent and just want to do "connect this child to that >>parent", it'll be more difficult; you'll need to find that parent, count >>its children, decide the encoding-number of this new child, and then you >>can calculate its new numbers. >>Still more efficient than the nested set, but not so efficient, and you >>DO need to look into your database, while the author claimed you didn't >>need to. >>I haven't worked with this approach yet, but I saw the author use a lot >>of functions in his queries; I'm not sure whether that is a good idea, >>it might affect the usability of your indices... >>Another disadvantage is its very inefficient use of the integer values >>available: it can only have a depth of 32 with normal integers, afaik, >>and not so many children on a level as you'd like, perhaps. The author >>suggests defining an unlimited integer, which is not so good for your >>performance at all (search for '"nested interval" sql' in google or so >>and you'll find his messages on the dbforums). >> >>* His other approach, the single-integer version, is even worse in terms >>of efficient integer use: you can have only 32 levels in the best case >>(the left-most branch only!) and only about 30 nodes next to each other >>in that same best case (since the numbers increase exponentially: level 1 = >>5, 11, 23, 47, 95, etc). The worst case (the right-most branch) can >>possibly store only one or two levels with only one or two nodes... 
I've >>to use it in a research paper for my study anyway, so expanding my >>thoughts in an email is a nice test :) >>Please comment. >> >>Best regards, >> >>Arjen van der Meijden >> >> >> >> >>---------------------------(end of broadcast)--------------------------- >>TIP 9: the planner will ignore your desire to choose an index scan if your >> joining column's datatypes do not match >> > >
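Arjen's BETWEEN self-join and the no-join subtree-size arithmetic can be tried directly. A minimal sketch, assuming a hand-numbered nested-set table (SQLite in-memory for self-containment; the toy tree and names are mine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nset (id INTEGER PRIMARY KEY, lft INTEGER, rgt INTEGER)")
# root(1,8) with children a(2,5) and b(6,7); a has one leaf(3,4)
con.executemany("INSERT INTO nset VALUES (?, ?, ?)",
                [(1, 1, 8), (2, 2, 5), (3, 3, 4), (4, 6, 7)])

# all ancestors of node 3 (self included), root first: one BETWEEN self-join
ancestors = [r[0] for r in con.execute(
    "SELECT n1.id FROM nset n1, nset n2 "
    "WHERE n2.id = 3 AND n2.lft BETWEEN n1.lft AND n1.rgt "
    "ORDER BY n1.lft")]

# subtree size read straight off the encoding, no join at all
size = con.execute(
    "SELECT (rgt - lft + 1) / 2 FROM nset WHERE id = 1").fetchone()[0]

print(ancestors, size)  # [1, 2, 3] 4
```

This is the payoff of the nested-set encoding: ancestor chains and subtree counts are pure range predicates and arithmetic. The cost, as discussed throughout the thread, is renumbering lft/rgt on inserts and moves.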
The one performance concern I would have here (and I think it's there for any of these except the simple adjacency list) is: when I want to move a node high up in the tree, how long is it going to take? That is the only change that I really need to make that will be initiated by a user, thus I would like response times to be fast. My tree is not that big; it's only got about 40,000 rows. But if I move a node near the top of the tree, all of a sudden I've got to delete 20,000 rows and insert 20,000 rows. That's not going to be that fast, it seems, and the user is going to be sitting around waiting for it to finish. At the same time, any system that gives you a fast getAllDescendants and getAllAncestors is going to require a large number of updates in this situation. I don't think it's even possible to get around it. I have another idea that I think will give me the same quick performance on getAllDescendants and getAllAncestors but will require exactly one row to be updated per descendant node on a move. The method above will require several operations (inserts and updates) per descendant node. I guess I will just have to try the two and see which one is faster. rg ----- Original Message ----- From: "Christian Fowler" <google@NOSPAM.gravesweeper.com> To: <pgsql-general@postgresql.org> Sent: Monday, December 08, 2003 11:28 PM Subject: Re: [GENERAL] Making a tree with "millions and millions" of dynamic nodes > Here is another approach (only the first half of the page applies here) > I was emailed personally in response to my original post: > > http://fungus.teststation.com/~jon/treehandling/TreeHandling.htm > > Basically, the (too obvious for me to think of ;-) concept appears to > be storing a mapping of the Node / Parent for every parent in the > chain. Basically it splits up the materialized path concept into > multiple rows, or one can think of it as the adjacency list with all > ancestors stored. Perhaps call this the "Materialized List". 
> > Personally, this is my likely favorite since I tend to "trust" the > performance of giant tables that are nothing more than 2 columns of > integers. Seems like this approach would handle most of the desired > tree cases (select path & child count, insert, delete, move) very > well. > > Hopefully postgres can handle a very active, very narrow 50 million > row table without breaking a sweat.
> > Rick Gigger wrote: > > > > I was glad to see this topic come up on the list as I was > > about to start asking about some of these issues myself. I > > would like to discuss each of the methods I have researched > > so far for doing trees in sql and see if anyone has some > > experience or insight into the topic that could help me. > > Have a look here: http://www.scenic-route.com/program/db/lists.htm > It is quite a nice overview with some advantages and disadvantages of > each method. Thanks for the link. It cleared up some things on the nested intervals system. > > I have a lot of thoughts on this but here is my first question: > > > > In the first article at the end of the section on materialized paths > > and the beginning of the nested set section Tropashko basically says > > that neither one really does a good job at answering "who are all of > > my ancestors" but that the materialized path method has a nice kludge > > that you can do with some funky string manipulation. > > I don't see how the nested intervals query is better/faster than the > nested set approach, but both seem to be possibly using indices. Well, it doesn't require you to update the whole table if you insert a new node into the far left of the tree. That one is kind of a deal breaker for me. > The nested set uses a simple "where left between parent.left and > parent.right", if you happen to have those numbers in your application's > memory, you could even build your query directly like that and save > yourself a self join. So with simple b-tree indexes on left and right, wouldn't it have to first scan the index on left to get everything with left < child.left, then scan the index on right to get everything with right > child.left, and then compute the intersection of the tuples referenced in those two scans? Would that be fast? Wouldn't it be faster to just do a sequential scan? Or is there a better way? How would you need to set up the indexes? 
Would you need to add a single index on (left, right)? How does it even handle >, < in a situation like that? Or would an rtree index work better? I guess I would need to try it myself before I could be certain that it would actually use an index on that. > With the materialized path you'll have to use a like-search (child.path > like parent.path || '%'), where hopefully the optimiser notices that it > can chop off the end of the string and just look at the first part Once again, I guess I would need to try it out to see if the optimizer could handle that. If it could handle both of these cases then Tropashko's assertion that those are hard problems was simply untrue. That would invalidate much of the need for his system. > > Is this true? Celko in his articles seems to make it sound > > like this query will be very fast. But Tropashko it seems > It'll use an index range-search, so I don't see how it can be faster, > except for a hash/index lookup or so (which is possible with the > transitive closure version of an adjacency list). > > is saying that it will be slow. Am I reading this right? If > > so is Tropashko right? Any ideas on this? Any articles / papers > > that might shed some light on this specific question or the > > topic in general? > The above article is a nice one to read I think. see above comments > > I'll try to summarize my thoughts on them, since I'm looking into an > efficient tree approach myself as well: > All methods have both advantages and disadvantages and it'll depend on > your situation whether that is a showstopper or not. All are bad when > you want to remove a node from the tree, but the insert/update penalties > depend on the situation. > > * The adjacency list is lightning fast with inserts and updates, but a bit > clumsy with deletions (you need to connect the direct children to > another element, that's all). 
It'll be very fast with the two simple > queries "who are my direct children" and "who is my parent", but you > need either a recursive approach or some enhancement like a transitive > closure-graph to enhance queries like "who are my predecessors" and "who > are all my ancestors". > > So if you'll have a large tree, only a few levels deep, and a lot of > inserts and/or subtree movements, this seems to be the best approach. > And you can store "max int"-number of nodes. > > * The materialized path is not so very efficient in terms of storage, it > does inserts fast, but updates and deletions are slow. Retrieval of all > ancestors is fast, retrieval of the predecessors is extremely trivial But in order to use a "where row is a predecessor" clause in a more complex sql statement you would need to write a user function that did a funky string-parsing operation and then returned the parsed-out ids in a result set (unless it will use the index for "where child.path like parent.path || '%'" -- I am doubtful on this one.) > but retrieval of the direct parent and/or children is somewhat more > difficult (you'll need to look at the depth, if you do that often a > functional-index might be handy). > Perhaps the ltree-implementation in contrib/ltree is worth a look and > encoding the trees in much more dense encodings (like a 256bit encoding > I saw mentioned earlier and skipping the dot for just a single-bit > terminator) makes it less inefficient. > > So if you do a lot of inserts, not so many updates and deletes and have > much need for the entire list of predecessors/ancestors, I suppose this > one is quite nice. But it is not so efficient in terms of storage, so > very large/deep trees impose a huge overhead since each node stores its > entire path. I don't really do a ton of them, but I would prefer to keep them as fast as possible since the user will be waiting when they happen. Space is not really a concern for me. I've only got about 40,000 rows. 
> * The nested set is inefficient with insertions, updates and deletions > but much more efficient with storage than the MP. To speed up the > process of selecting only the direct parent/children you can use a depth > field (redundant, but handy) which'll save you one or two extra > self-joins. > For relatively static trees the nested set is good, but actively > changing trees are very bad for your performance. You can store "max > int"/2 nodes in this approach. Once again, the user will be waiting while nodes are moved around in the tree. > * The nested intervals use the same approach as the nested set, but use > a static encoding. This allows you to do very efficient inserts, so the > author claims. I have my doubts though: if you don't know the number of > children of your parent and just want to do "connect this child to that > parent", it'll be more difficult, you'll need to find that parent, count > its children, decide the encoding-number of this new child and then you > can calculate its new numbers. > Still more efficient than the nested set, but not so efficient and you > DO need to look into your database, while the author claimed you didn't > need to. > I haven't worked with this approach yet, but I saw the author use a lot > of functions in his queries, I'm not sure whether that is a good idea, > it might affect the usability of your indices... > > Another disadvantage is its very inefficient use of the integer values > available, it can only have a depth of 32 with normal integers, afaik, > and not so many children on a level as you'd like perhaps. The author > suggests defining an unlimited integer, which is not so good for your > performance at all (search for '"nested interval" sql' in google or so > and you'll find his messages on the dbforums). > > * His other approach, the single-integer version, is even worse in terms > of efficient integer use: you can have only 32 levels in the best case > (the left-most branch only!) 
and only about 30 nodes next to each other > in that same best case (since the numbers increase exponentially: level 1 = > 5, 11, 23, 47, 95, etc). The worst case (the right-most branch) can > possibly store only one or two levels with only one or two nodes... > > So the claimed advantage of the better performance is a bit limited: > you'll either need *very* large integers to store large trees, or you > are simply limited in the size of your tree by some relatively small > numbers. And that is a bit sad, since bad performance is only a > problem/visible with large(r) trees. You hit my main concerns with it dead on. I don't want to touch it until I can find at least one person who has made it work for them. I have yet to do so. I have read postings from many people who had some big problems with it. The overflow issue is probably a deal breaker for me. It might not be if I understood under what exact conditions it would happen, but the whole thing has too many questions for me to want to spend time on it. If I find someone who has gotten it to work in a real-world, non-trivial situation then I will revisit it. > I'm not sure whether the above summary is correct, anyone any idea? I've > to use it in a research paper for my study anyway, so expanding my > thoughts in an email is a nice test :) > Please comment. I have a variation of nested sets that will involve only updating your descendants on a move. That way changes to the left-most part of the tree won't affect the whole table. I simply use a string for left and right. It is obviously not hard to pick two strings for left and right that have an infinite number of strings in between them. Thus it doesn't have the same volatility problems as Celko's nested sets. It doesn't add much to the materialized path, though, except that no string operations are required to getAllAncestors, and that if Celko's nested sets will use the indexes, so will mine. 
If materialized path will use indexes on getAllAncestors, though, mine doesn't add any real functionality. It will probably come down to which method is fastest at moving nodes near the top of the tree. rg
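Rick's string-valued left/right idea hinges on always being able to generate a new key strictly between two existing keys. One possible sketch of such a midpoint function (entirely my own construction, not his: keys are read as base-26 fractions over 'a'..'z', and it assumes keys never end in 'a'):

```python
from fractions import Fraction

DIGITS = "abcdefghijklmnopqrstuvwxyz"

def to_frac(s):
    """Read s as a base-26 fraction in [0, 1): 'b' -> 1/26, 'n' -> 13/26, ..."""
    f = Fraction(0)
    for i, c in enumerate(s):
        f += Fraction(DIGITS.index(c), 26 ** (i + 1))
    return f

def from_frac(f):
    """Render a fraction back as its base-26 digit string (no trailing 'a's)."""
    out = []
    while f:
        f *= 26
        d = int(f)          # integer part is the next digit
        out.append(DIGITS[d])
        f -= d
    return "".join(out)

def key_between(a, b):
    """A string strictly between a and b (requires a < b, keys not ending in 'a')."""
    return from_frac((to_frac(a) + to_frac(b)) / 2)

print(key_between("b", "c"))  # bn
```

Since 26 is even, the midpoint of two finite base-26 fractions is itself finite, so this always terminates; pathological insert orders still grow key length, which is the usual trade-off in this family of schemes.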
Lynn.Tilby@asu.edu wrote: > Oleg, > > I worked for a company where we crunched 16+ terabytes of > multiple drug company sales data per month. We of course > used Sun's BIGGEST boxes and Oracle. But! the secret > to doing it was LOTS of memory and putting the indexes, > pointers, etc. in arrays and using Oracle OCI calls. > > If it is possible to use some lower level funcs, perhaps > funcs like you see in the .c output of ecpg, this might > theoretically be possible in postgres. I know that > putting the index references or functionally similar data > in arrays, and doing the data location resolution down to > the logical or even physical disk location, is not only > extremely FAST but possible, having done it on raw > partitions and optical disks myself in custom designed > databases. This approach would work extremely fast > with only 20 million rows. Yes, it would require some > extra programming, but if you need REAL TIME response > this approach could help solve your problem. Each row has a ctid (tid) value that represents the physical page and offset on the page, and you could store that, but an update to the row would change the tid, as would a VACUUM FULL, and a delete could reuse the tid. Not pretty, but could be done. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Re: Making a tree with "millions and millions" of dynamic nodes
From: jisaacks@yahoo.com (John Isaacks)
> > > > We did it with C++ and a memory mapped file for persistence. > > We had to use a 64bit cpu and 12G of RAM. > > Wow! That's only 300 bytes of RAM for every 10 digit phone number! We didn't use all of that; last time I checked it was using approximately 4G of RAM. We had to plan for every possible USA phone number being ported. The memory overhead of indexing goes down as more numbers are placed in the system.
Re: Making a tree with "millions and millions" of dynamic nodes
From: jisaacks@yahoo.com (John Isaacks)
Christian Fowler <google@NOSPAM.gravesweeper.com> wrote in message news:<6b-dnQJnDI6YvlCiXTWc-g@speakeasy.net>... > I have a VERY LARGE pile of geographic data that I am importing into a database (db of choice is postgres, > though may hop to oracle if necessary). The data is strictly hierarchical - each node has one, and only one > parent. The depth should not exceed 6 or 7 levels. The initial import will have about 6 million leaves, and > 3 million branches. > For selection, it is crucial for me to get: > > 1. path generation speed > 2. immediate sibling speed > 3. children count speed > I don't know how high your requirement for having an SQL interface is. At my company we wrote a real-time database for local number portability, with over 40M entries and support for variable-length phone numbers (up to 10 digits). The customer wanted the longest phone number match possible; for example, there could be 1, 12, and 123 in the database, and the number 1234 would match 123. We did it with C++ and a memory-mapped file for persistence. We had to use a 64-bit cpu and 12G of RAM. We can do over 100K insert/search/delete operations per second, with relatively flat performance for each operation (it depends on number length). If real-time performance is that important, then you might have to customize.
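John's longest-prefix match can be reproduced outside a database with a simple character trie. A generic sketch (my own toy version, not their actual implementation):

```python
def build_trie(numbers):
    """Build a character trie; a "$" entry marks a stored number."""
    root = {}
    for num in numbers:
        node = root
        for ch in num:
            node = node.setdefault(ch, {})
        node["$"] = num  # terminal marker stores the full entry
    return root

def longest_match(root, dialed):
    """Walk the trie along the dialed digits, remembering the last terminal seen."""
    node, best = root, None
    for ch in dialed:
        if "$" in node:
            best = node["$"]
        node = node.get(ch)
        if node is None:
            return best
    return node.get("$", best)

trie = build_trie(["1", "12", "123"])
print(longest_match(trie, "1234"))  # -> 123
```

Each lookup is O(length of the dialed number) regardless of how many entries are stored, which matches the "relatively flat performance per operation" John describes.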
"John Isaacks" <jisaacks@yahoo.com> wrote in message news:8a9a4c2a.0312101423.4d31c3d4@posting.google.com... > Christian Fowler <google@NOSPAM.gravesweeper.com> wrote in message news:<6b-dnQJnDI6YvlCiXTWc-g@speakeasy.net>... > > I have a VERY LARGE pile of geographic data that I am importing into a database (db of choice is postgres, > > though may hop to oracle if necessary). The data is strictly hierarchical - each node has one, and only one > > parent. The depth should not exceed 6 or 7 levels. The initial import will have about 6 million leaves, and > > 3 million branches. > > For selection, it is crucial for me to get: > > > > 1. path generation speed > > 2. immediate sibling speed > > 3. children count speed > > > > I don't know how high your requirement for having an SQL interface is. > At my company we wrote a real-time database for the local number > portability, with over 40M entries, with support for variable length > phone numbers ( up to 10 digits ). The customer wanted the longest > phone number match possible. example there could 1, 12, 123 in the > database and the number 1234 would match to 123. > > We did it with C++ and a memory mapped file for persistence. > We had to use a 64bit cpu and 12G of RAM. Wow! That's only 300 bytes of RAM for every 10 digit phone number!
Re: Making a tree with "millions and millions" of dynamic nodes
From: joe.celko@northface.edu (--CELKO--)
>> The depth should not exceed 6 or 7 levels. The initial import will have about 6 million leaves, and 3 million branches. I would expect the leaves to grow significantly, in number easily tripling. << That is not too bad -- depth would worry me more than breadth. >> 1. path generation speed << That is pretty easy in a nested set model: SELECT T1.node, (T1.rgt - T1.lft) AS depth FROM Tree AS T1, Tree AS T2 WHERE T2.lft BETWEEN T1.lft AND T1.rgt AND T2.node = :my_guy ORDER BY depth; The greater (T1.rgt - T1.lft) is, the closer the node is to the root. You can also convert depth into a sequential number by using: SELECT T2.node, COUNT(T1.node) AS depth FROM Tree AS T1, Tree AS T2 WHERE T2.lft BETWEEN T1.lft AND T1.rgt GROUP BY T2.node; >> 2. immediate sibling speed << SELECT B.node AS boss, P.node FROM Tree AS P LEFT OUTER JOIN Tree AS B ON B.lft = (SELECT MAX(lft) FROM Tree AS S WHERE P.lft > S.lft AND P.lft < S.rgt); >> 3. children count speed << SELECT :my_guy, (rgt - lft + 1)/2 AS subtree_size FROM Tree WHERE node = :my_guy; >> I have implemented a first run using Joe Celko's Nested Sets (w/ a mod to store tree level for speed). The update propagation issue is the Achilles heel of this approach for me. << It is not as bad as people first think. The tree structure is in one table and the nodes are in another, so a move means manipulating two integers and one key (perhaps another integer). Since new siblings are added to the right side of the existing line of siblings under the parent, you average a bit less than half of the nodes being scanned in practice. There are some other tricks, like leaving a large gap between (lft) and (rgt) numbers so you have room to insert new nodes. >> So it seems Materialized Path is my only option, however I am concerned about LIKE performance for the right hand side of the tree, where the path is 8 digits x 6 levels = 48 chars. Should I be concerned? << It does not sound too bad, assuming an ordered index. 
I wonder if you could improve performance of the LIKE predicates by reversing the path string ... >> ... some reassuring words about real-time performance of a 20 million row tree, i'd love to hear << Sorry, the most I have played with is a few hundred thousand rows with the nested set model.
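Celko's "ordered index" point is easy to sanity-check with a toy materialized-path table. A sketch (SQLite for self-containment; note that in Postgres, a LIKE prefix search typically needs a text_pattern_ops index to become a range scan under a non-C locale -- worth verifying with EXPLAIN on the real data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mpath (id INTEGER PRIMARY KEY, path TEXT)")
# fixed-width 8-digit labels per level, concatenated into the path
con.executemany("INSERT INTO mpath VALUES (?, ?)", [
    (1, "00000001"),
    (2, "00000001" "00000001"),
    (3, "00000001" "00000001" "00000001"),
    (4, "00000001" "00000002"),
])
con.execute("CREATE INDEX mpath_idx ON mpath (path)")

parent = "0000000100000001"
# anchored prefix search: the constant prefix lets the planner turn
# LIKE 'prefix%' into a b-tree range scan instead of a full scan
descendants = [r[0] for r in con.execute(
    "SELECT id FROM mpath WHERE path LIKE ? || '%' AND path <> ?",
    (parent, parent))]

print(descendants)  # [3]
```

The 48-character worst-case path only widens the index keys; the lookup itself stays a single range scan on the ordered index, which is why the width alone should not be alarming.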
Re: Making a tree with "millions and millions" of dynamic nodes
From: "Joe \"Nuke Me Xemu\" Foster"
"Bob Badour" <bbadour@golden.net> wrote in message <news:rJmdndW9eNtgeEqiRVn-iw@golden.net>... > "John Isaacks" <jisaacks@yahoo.com> wrote in message > news:8a9a4c2a.0312101423.4d31c3d4@posting.google.com... > > I don't know how high your requirement for having an SQL interface is. > > At my company we wrote a real-time database for the local number > > portability, with over 40M entries, with support for variable length > > phone numbers ( up to 10 digits ). The customer wanted the longest > > phone number match possible. example there could 1, 12, 123 in the > > database and the number 1234 would match to 123. > > > > We did it with C++ and a memory mapped file for persistence. > > We had to use a 64bit cpu and 12G of RAM. > > Wow! That's only 300 bytes of RAM for every 10 digit phone number! They must have been using Prevayler... g/d/r -- Joe Foster <mailto:jlfoster%40znet.com> Sign the Check! <http://www.xenu.net/> WARNING: I cannot be held responsible for the above They're coming to because my cats have apparently learned to type. take me away, ha ha!