Thread: self referencing tables/ nested sets etc...
Hi all,
I mostly just lurk on the list, but I need to vent. And I could use any
advice/experiences you'd be willing to share.

For those of you that don't want to read my rant:
What solutions are there for storing a wide/shallow self referencing tree
in a database? Any pointers to previous discussions, articles or other
background information are very much appreciated. I seem to be googling in
circles; the same articles keep popping up. If I learned one thing today,
it's that I need to educate myself a bit before settling on an approach
for this ;-)

Cheers,
Rob

Pre-P.S. If anyone is interested in my plpgsql version of the approach
from the dbazine.com article from my rant, you're welcome to it. I'd be
happy to post it to the list or send it in the mail.

<RANT>
Today I was confronted with the problem of storing self referencing data.
(The popular tutorial material seems to be employees with a
boss/subordinate relationship.) I'm sure many of you have been there.
So like a good boy I went to trawl the pgsql archives and found some
references to the Celko 'nested set' model
[http://www.intelligententerprise.com/001020/celko.shtml]. After some more
googling around I found http://www.dbazine.com/tropashko4.shtml.

I'm not sure if any of you are familiar with this approach, but it's
similar to the 'nested sets' approach, and somehow it appealed to me more
than the Celko 'nested sets'. (And the 'Real World Performance group at
Oracle Corp.' sounds like a ringing endorsement.)

I don't have any mathematical background, no formal CS education of note,
and not a lot of experience with plpgsql programming. But... I'm a sucker
for a challenge. So, equipped with my limited skillset, I have spent today
wrestling with my lack of skill, my unfamiliarity with the problem domain,
inaccuracies in the article, and oracle to postgres translation quirks.
But I pressed on, never losing sight of my goal: the reward and pride of
overcoming this academic challenge.
It's been a long day; I managed to squeeze in a lunch break, two bathroom
visits and a few trips to the coffee machine. But my heart was light at
the prospect of a job well done.

And now, at the end of the day, I am proud to announce to the world:
"I have actually done it! Today I have achieved something I have not done
before. I am a better man tonight than I was this morning."

But... it's a bittersweet victory, because the provided solution is
completely useless to me. It appears that when adding more than 48 sub
nodes to any node in the tree, it craps out because of an INT8 column
overflowing.

So after this anticlimax I've set aside my pride and turn to the list for
guidance.
</RANT>
Rob,

I'm not sure what application you're doing this for, so I don't know how
to advise. However, I implemented my own interpretation of "nested sets"
recently, which I will describe for you. I was inspired by reading
http://www.yafla.com/papers/sqlhierarchies/sqlhierarchies.htm by Dennis
W. Forbes. But my own approach is to let the unique id of each object
serve as its "key" in the nested set key-chain (which I call its
"ancestry"), and separate the ids with a path separator, e.g.:

sah=# select id, parent, ancestry, depth from objects;
 id | parent | ancestry | depth
----+--------+----------+-------
  2 |      1 | 1/2      |     1
  1 |      1 | 1        |     0
  3 |      1 | 1/3      |     1
  4 |      3 | 1/3/4    |     2
  5 |      4 | 1/3/4/5  |     3

The ancestry attribute (type varchar) shows the "path" to each item. Then
I can do queries like the following:

-- selects all objects where the ancestry attribute has '3' as a complete word.
select * from objects where ancestry ~ '.*[[:<:]]3[[:>:]].*';

The ancestry and depth are created automatically by trigger. This has
been working fine for me, but I haven't tested it with a tree of any
great depth. It would seem to be perfect for wide, shallow trees.

The trick is the trigger to create ancestry and depth on insert & update
to objects. I just use a "recursive trigger" -- it updates the ancestry
and depth for each direct child of the updated object, which in turn
calls the trigger on each of those children. I'm not aware of any
recursion limits on this type of thing in pgsql -- I'd be interested to
know from the list if there are any.

HTH,
Shawn Harrison

----- Original Message -----
From: "Rob Hoopman" <rob@tuna.nl>
To: <pgsql-general@postgresql.org>
Sent: Tuesday, March 23, 2004 3:25 PM
Subject: [GENERAL] self referencing tables/ nested sets etc...

> Hi all,
> I mostly just lurk on the list, but I need to vent. And I could use any
> advice/ experiences you'd be willing to share.
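Shawn's trigger-maintained ancestry column is easy to model outside the database. The following is a minimal Python sketch (the helper names are mine, not from the thread): it derives ancestry and depth from parent pointers, the way the recursive trigger does, and shows how matching an id as a complete path component selects a subtree.

```python
def build_ancestry(parents):
    """Compute (ancestry, depth) for every object id, mimicking what the
    recursive trigger maintains: ancestry is the id path from the root,
    joined by '/'; depth is the number of edges from the root."""
    def resolve(obj):
        parent = parents[obj]
        if parent == obj:                       # the root points at itself
            return str(obj), 0
        ancestry, depth = resolve(parent)
        return "%s/%s" % (ancestry, obj), depth + 1
    return {obj: resolve(obj) for obj in parents}

# id -> parent, matching the example table in Shawn's post
objects = {1: 1, 2: 1, 3: 1, 4: 3, 5: 4}
ancestry = build_ancestry(objects)

# Descendants of 3: rows whose ancestry contains '3' as a complete
# path component (the role of the [[:<:]]3[[:>:]] regex in the SQL).
descendants = [o for o, (a, _) in ancestry.items()
               if "3" in a.split("/") and o != 3]
```

Splitting on the separator instead of substring matching is what keeps id 3 from matching ids like 13 or 31, which is exactly what the word-boundary regex does in the SQL version.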
Shawn Harrison wrote:
> Forbes. But my own approach is to let the unique id of each object serve as
> its "key" in the nested set key-chain (which I call its "ancestry"), and
> separate the ids with path separator, e.g.:

See connectby() in contrib/tablefunc. It produces much the same output
dynamically.

Joe
On Tue, 23 Mar 2004 22:25:17 +0100, Rob Hoopman <rob@tuna.nl> wrote:
>What solutions are there to storing a wide/shallow self referencing tree in a
>database?

contrib/ltree (http://www.sai.msu.su/~megera/postgres/gist/ltree/) or
Joe's connectby() might help.

>googling around I found http://www.dbazine.com/tropashko4.shtml.

Thanks for the link. I really enjoyed reading that article. It has
beautiful coloured graphics and is nicely spiced with buzzwords like
*fractal* etc.

>( And the 'Real World Performance group at
>Oracle Corp.' sounds like a ringing endorsement )

Ok, that made me change my sig. <bg>

> I have spent today wrestling with [...]
> inaccuracies in the article,
        ^ ... and bugs ...

>But,... it's a bittersweet victory because the provided solution is
>completely useless to me.

So you had your fun. Now throw it away and start working on a real
solution.

> It appears that when adding more than 48 sub nodes
>to any node in the tree, craps out because of an INT8 column overflowing.

AFAICS it doesn't depend on the number of siblings; it fails when the
sum of the numbers in dotted path notation exceeds 62.

<rant>
Tropashko's Nested Intervals concept as presented in dbazine is broken
by design. It is essentially a Materialized Path implementation using
bit patterns to store path information. I'll call it the Obfuscated
Materialized Path Method, OMPM.
Given the definitions as found in the article, the following conditions
hold for each tree node:

    0 < numer < 2*denom
    denom = 2^n, n >= 1

Adding the first child to a node identified by path p:

    denom(p.1) = 2 * denom(p)
    numer(p.1) = 2 * numer(p) + 1

Adding child c+1 to a parent node identified by path p:

    denom(p.(c+1)) = 2 * denom(p.c)
    numer(p.(c+1)) = 2 * numer(p.c) - 3

For the example nodes used in the article we get the following values:

    path     numer/denom  bin(numer)
    <root>   1/1          1
    1        3/2          11
    1.1      7/4          111
    1.1.1    15/8         1111
    1.1.1.1  31/16        11111
    1.2      11/8         1011
    1.2.1    23/16        10111
    1.3      19/16        10011
    1.3.1    39/32        100111
    2        3/4          011
    2.1      7/8          0111
    2.1.1    15/16        01111
    2.1.2    27/32        011011
    2.2      11/16        01011
    2.2.1    19/32        010111
    3        3/8          0011
    3.1      7/16         00111
    4        3/16         00011

The third column in this table shows the binary representation of the
numerator, padded to the length of the denominator with leading zeros.
The last two bits are always 11; they are redundant and we subsequently
ignore them. The remaining bits can be used to find a node in this
binary tree:

    1----------------------2------------3------------4---- ...
    |                      |            |            |
    1.1-----1.2-----1.3    2.1          3.1---3.2    4.1---4.2
    |       |              |                         |
    1.1.1   |              2.1.1---2.1.2             4.1.1
            |
            1.2.1---1.2.2
            |
            1.2.1.1

Horizontal edges correspond to 0 bits; they connect each node to its
immediate siblings on the same tree level. Vertical edges correspond to
1 bits; they lead from a parent node to its first child. The length of
the edges doesn't have any significance.

The node corresponding to a (numer, denom) pair is found by scanning
the bits of the binary representation of numer from left (most
significant) to right (least significant), starting at the position of
the single bit which is set in denom (and ignoring the last two bits as
has already been mentioned). We locate the upper left node "1" in the
tree and move to the right for each 0 bit and down for each 1 bit.
Using dotted path notation this means that we start with path "1". For
each 1 bit we append ".1" to the path; for each 0 bit we increase the
last component of the path by 1. Converting a path to a bit pattern and
a bit pattern to numer and denom is left for the reader.

So path strings and OMPM (numer, denom) pairs uniquely correspond to
each other, and the number of bits required for numer and denom depends
on the *sum* of the path components. E.g. the path "4.3.24.22.15"
cannot be stored if you use 64 bit integers for numer and denom.

It seems the author is aware of this limitation because he writes:

| More thorough implementation of the method would involve a domain
| of integers with a unlimited range

Does Oracle have integers with unlimited range? AFAIK Postgres does
not. But we have strings with practically unlimited length (text).
contrib/ltree stores path information in strings. It seems to have a
limit of 256 bytes.

Storage efficiency: Even if we had variable length integers with
unlimited range, the number of bits required to store a path in OMPM
would depend on the sum of the path components. With 64 bit integers we
could store paths like "1.1.1.1. ... .1.1.1" (62 levels) or "1.61" in
16 bytes (2 * 64 bits). In 16 bytes Postgres can store a string with a
length of up to 12 bytes. For "1.1" to "9.99" and "10.1" to "99.9" even
8 bytes are sufficient. So for wide/shallow trees dotted path strings
are more storage efficient than OMPM bitmaps with long runs of 0 bits.
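The recurrences and the sum-of-components bound above are easy to verify numerically. Here is a minimal Python sketch (the function name is mine; the thread's actual implementation is plpgsql):

```python
def path_to_interval(path):
    """Encode a dotted materialized path as an OMPM (numer, denom) pair
    using the recurrences from the article:
      first child:  numer -> 2*numer + 1, denom -> 2*denom
      next sibling: numer -> 2*numer - 3, denom -> 2*denom
    The denominator comes out as 2**(sum of the path components)."""
    numer, denom = 1, 1                              # <root> is 1/1
    for component in map(int, path.split(".")):
        numer, denom = 2 * numer + 1, 2 * denom      # descend to first child
        for _ in range(component - 1):               # then step across siblings
            numer, denom = 2 * numer - 3, 2 * denom
    return numer, denom
```

Checked against the value table above (and against the (numer, denom) pairs Rob posts later in the thread) the encoding agrees, and the path "4.3.24.22.15" indeed needs a denominator of 2**68, far beyond INT8.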
Regarding performance, both queries given near the end of the article

| * All the descendants of JONES, excluding himself:
| select h1.name from hierarchy h1, hierarchy h2
|  where h2.name = 'JONES'
|    and distance(h1.numer, h1.denom,
|                 h2.numer, h2.denom) > 0;

and

| * All the ancestors of FORD, excluding himself:
| select h2.name from hierarchy h1, hierarchy h2
|  where h1.name = 'FORD'
|    and distance(h1.numer, h1.denom,
|                 h2.numer, h2.denom) > 0;

require a full table scan over hierarchy, because the condition
involving a distance() function call cannot use an index. So with OMPM
the performance of common queries is worse than with classic
materialized path strings (e1.path LIKE e2.path || '%'). One more
reason to look at contrib/ltree.
</rant>

Good luck!
Servus
 Manfred, self-appointed member of the Global PG Performance Group
Thank you both, I should have known to have a look at the contribs first.

On Thursday 25 March 2004 06:37, Joe Conway wrote:
> Shawn Harrison wrote:
> > Forbes. But my own approach is to let the unique id of each object serve
> > as its "key" in the nested set key-chain (which I call its "ancestry"),
> > and separate the ids with path separator, e.g.:
>
> See connectby() in contrib/tablefunc. It produces much the same output
> dynamically.
>
> Joe
On Thursday 25 March 2004 12:19, Manfred Koizar wrote:
> On Tue, 23 Mar 2004 22:25:17 +0100, Rob Hoopman <rob@tuna.nl> wrote:
> >What solutions are there to storing a wide/shallow self referencing tree
> > in a database?
>
> contrib/ltree (http://www.sai.msu.su/~megera/postgres/gist/ltree/) or
> Joe's connectby() might help.

That's just what I need, thanks!

> >( And the 'Real World Performance group at
> >Oracle Corp.' sounds like a ringing endorsement )
>
> Ok, that made me change my sig. <bg>

:-) I should have known that Oracle Corp's 'Real World' has little or
nothing to do with my real world.

> > I have spent today wrestling with [...]
> > inaccuracies in the article,
>         ^ ... and bugs ...

That's what I was thinking, but I thought surely this Oracle dude knows
better than me.

> > It appears that when adding more than 48 sub nodes
> >to any node in the tree, craps out because of an INT8 column overflowing.
>
> AFAICS it doesn't depend on the number of siblings, but it fails when
> the sum of the numbers in dotted path notation exceeds 62.

Maybe, but some of the intermediate steps are larger than the number
that gets stored in the end. Actually that's where this implementation
broke for me.

> <rant>
SNIP
> </rant>

Thanks, that was educational.

Cheers,
Rob
On Thu, 25 Mar 2004 20:56:35 +0100, Rob Hoopman <rob@tuna.nl> wrote:
>> > It appears that when adding more than 48 sub nodes
>> >to any node in the tree, craps out because of an INT8 column overflowing.
>>
>> AFAICS it doesn't depend on the number of siblings, but it fails when
>> the sum of the numbers in dotted path notation exceeds 62.
>
>Maybe, but some of the intermediate steps are larger than the number that
>gets stored in the end. Actually that's where this implementation broke
>for me.

Rob, do you still have the functions and the data that led to the
overflow? If so, would you care to locate the parent of the node you
failed to insert and this parent's last child. Then please post the
output of

    SELECT pk, numer, denom, path(numer, denom)
      FROM yourtable
     WHERE pk = 'parentpk' OR pk = 'childpk';

I'd like to find out whether OMPM is flawed or my theory about it.
Thanks.

Servus
 Manfred
On Thursday 25 March 2004 22:01, Manfred Koizar wrote:
> On Thu, 25 Mar 2004 20:56:35 +0100, Rob Hoopman <rob@tuna.nl> wrote:
> >> > It appears that when adding more than 48 sub nodes
> >> >to any node in the tree, craps out because of an INT8 column
> >> > overflowing.
> >>
> >> AFAICS it doesn't depend on the number of siblings, but it fails when
> >> the sum of the numbers in dotted path notation exceeds 62.
> >
> >Maybe, but some of the intermediate steps are larger than the number
> > that gets stored in the end. Actually that's where this implementation
> > broke for me.
>
> Rob, do you still have the functions

Yes

> and the data

No, but I can reproduce it. I wrote a function that lets me insert a
number of child nodes into the tree.

> that led to the
> overflow? If so, would you care to locate the parent of the node you
> failed to insert and this parent's last child. Then please post the
> output of
>
> SELECT pk, numer, denom, path(numer, denom)
> FROM yourtable
> WHERE pk = 'parentpk' OR pk = 'childpk';

Here you go:

test=> SELECT name, numer, denom, path(numer,denom)
test->   FROM emps
test->  WHERE name = 'Drone 1.1.10.8' OR name = 'Drone 1.1.10.8.29';
       name        |      numer      |      denom      |     path
-------------------+-----------------+-----------------+--------------
 Drone 1.1.10.8    |         1573379 |         1048576 | .1.1.10.8
 Drone 1.1.10.8.29 | 844700881780739 | 562949953421312 | .1.1.10.8.29
(2 rows)

Have another:

test=> SELECT name, numer, denom, path(numer,denom)
test->   FROM emps
test->  WHERE name = 'KING' OR name = 'Drone 1.48';
    name    |      numer      |      denom      | path
------------+-----------------+-----------------+-------
 KING       |               3 |               2 | .1
 Drone 1.48 | 562949953421315 | 562949953421312 | .1.48
test=>

So it seems that you are right, but the magic number seems to be 49.

> I'd like to find out whether OMPM is flawed or my theory about it.

I wouldn't want to rule out the possibility of me screwing up somewhere
in the code. I had never done any plpgsql before yesterday.
Cheers,
Rob

Some more info: The insert fails like this after enabling some debugging
(ni_insert_nodes is the function that autogenerates child nodes):

test2=> SELECT ni_insert_nodes( '1', 1 );
NOTICE:  Current Child: 49
NOTICE:  Last Child: 49
NOTICE:  Inserting Drone 1.49
NOTICE:   >>>> start child_number
NOTICE:  num is: 1
NOTICE:  den is: 1
NOTICE:  child is: 1
NOTICE:   <<<< end child_number
NOTICE:   >>>> start child_number
NOTICE:  num is: 3
NOTICE:  den is: 2
NOTICE:  child is: 49
NOTICE:   <<<< end child_number
WARNING:  Error occurred while executing PL/pgSQL function child_numer
WARNING:  while casting return value to function's return type
ERROR:  Bad int8 external representation "1.12589990684263e+15"

The function that fails:

CREATE FUNCTION child_numer( INT8, INT8, INT8) RETURNS INT8 AS '
DECLARE
    num ALIAS FOR $1;
    den ALIAS FOR $2;
    child ALIAS FOR $3;
BEGIN
    RAISE NOTICE '' >>>> start child_number'';
    RAISE NOTICE ''num is: %'', num;
    RAISE NOTICE ''den is: %'', den;
    RAISE NOTICE ''child is: %'', child;
    RAISE NOTICE '' <<<< end child_number'';
    RETURN num*pow(2, child)+3-pow(2, child);
END
' LANGUAGE plpgsql;
On Fri, 26 Mar 2004 01:03:38 +0100, Rob Hoopman <rob@tuna.nl> wrote:
>So it seems that you are right, but the magic number seems to be 49

>NOTICE:  num is: 3
>NOTICE:  den is: 2
>NOTICE:  child is: 49
>WARNING:  Error occurred while executing PL/pgSQL function child_numer
>WARNING:  while casting return value to function's return type
>ERROR:  Bad int8 external representation "1.12589990684263e+15"

> RETURN num*pow(2, child)+3-pow(2, child);

pow() is a floating point function. A double has a 53 bit significand,
so 2^53 is the largest number that can reliably be converted from
double to bigint. Better use integer arithmetic:

CREATE OR REPLACE FUNCTION pow2(bigint) RETURNS bigint AS '
DECLARE
    e bigint;
    ret bigint;
BEGIN
    e = $1;
    IF (e < 0) THEN
        RAISE EXCEPTION ''2 ^ % does not fit into a bigint'', e;
    END IF;
    IF (e > 62) THEN
        RAISE EXCEPTION ''2 ^ % does not fit into a bigint'', e;
    END IF;
    ret = 1;
    WHILE (e > 0) LOOP
        ret = 2 * ret;
        e = e - 1;
    END LOOP;
    RETURN ret;
END
' LANGUAGE plpgsql;

Thus you could use OMPM to store up to 61 child nodes, but only for the
very first parent node ("1.1", ..., "1.61").

BTW, I've read Tropashko's follow-up articles
http://arxiv.org/html/cs.DB/0401014 and http://arxiv.org/pdf/cs.DB/0402051,
as well as various discussions on comp.databases.theory. My conclusion
is that OMPM is irreparably broken. With every kludge added to it
(Farey Intervals, Continued Fractions) the correspondence to
materialized path strings gets even more obfuscated, but the main
shortcoming remains: if you try to store materialized path information
in a fixed number of integers, you run out of bits very soon. And if
you use floating point, you lose accuracy. Fuzzy trees: "My boss is
JONES. Or is it CLARK?"

Servus
 Manfred
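The failure mode is easy to reproduce outside the database. A minimal Python sketch of the same arithmetic (function names are mine; this assumes, as Manfred does, that pow() works in IEEE double precision):

```python
def child_numer_float(num, child):
    # Rob's formula the way pow() computes it: in double precision.
    # At child = 49 the value 2**50 + 3 is still exactly representable,
    # but its default text rendering loses digits ("1.12589990684263e+15"),
    # which appears to be what the int8 cast choked on.
    return num * 2.0**child + 3 - 2.0**child

def child_numer_int(num, child):
    # The same formula in exact integer arithmetic; 1 << child == 2**child.
    return num * (1 << child) + 3 - (1 << child)

def pow2(e):
    # Integer analogue of the plpgsql pow2() above: exact, no floats.
    if e < 0 or e > 62:
        raise ValueError("2 ^ %d does not fit into a bigint" % e)
    return 1 << e
```

By child = 53 the float formula silently returns the wrong integer, while the integer version stays exact until it genuinely overflows 64 bits.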
Joe,

That's cool! Thanks for the tip. It looks like your "branch" is my
"ancestry". I'll look in contrib more often. I could use connectby()
instead of my homebrew plpgsql functions.

Shawn

----- Original Message -----
From: "Joe Conway" <mail@joeconway.com>
To: "Shawn Harrison" <harrison@tbc.net>
Cc: <pgsql-general@postgresql.org>
Sent: Wednesday, March 24, 2004 11:37 PM
Subject: Re: [GENERAL] self referencing tables/ nested sets etc...

> Shawn Harrison wrote:
> > Forbes. But my own approach is to let the unique id of each object serve as
> > its "key" in the nested set key-chain (which I call its "ancestry"), and
> > separate the ids with path separator, e.g.:
>
> See connectby() in contrib/tablefunc. It produces much the same output
> dynamically.
>
> Joe
Subject: Re: self referencing tables/ nested sets etc...
From: vadimtro_invalid@yahoo.com (Vadim Tropashko)
mkoi-pg@aon.at (Manfred Koizar) wrote in message
news:<lao7605t695dguk7ehql7ikcm56rufhc0s@email.aon.at>...
> BTW, I've read Tropashko's follow-up articles
> http://arxiv.org/html/cs.DB/0401014, http://arxiv.org/pdf/cs.DB/0402051,
> as well as various discussions on comp.databases.theory. My conclusion
> is that OMPM is irreparably broken. With every kludge added to it
> (Farey Intervals, Continued Fractions) the correspondence to
> materialized path strings gets even more obfuscated, but the main
> shortcoming remains: If you try to store materialized path information
> in a fixed number of integers you run out of bits very soon.

The article series Binary Fractions (dbazine) -> Farey Fractions ->
Continued Fractions might give you the wrong impression that it's one
kludge upon another. In reality it's just the trail of a discovery
process. Let me summarize what I think has withstood the test of time:

http://www.dbazine.com/tropashko4.shtml
The idea of generalizing nested integers to nested fractions is still
valid, although the particular encoding schema with Binary Rationals
proved not to be practical.

http://www.dbazine.com/tropashko5.shtml
Uses the same Binary Rationals encoding schema, solving the tree
relocation problem.

Farey Fractions
A different encoding schema, which is still waiting for somebody to
refute it. Unlike the previous articles, I did some volume testing
there.

Continued Fractions
That is just a different perspective on essentially the same encoding.
Now the connection to Materialized Path is transparent: whenever you
concatenate paths in path encoding, for example

    11.2 + 7.3.5 = 11.2.7.3.5

you nest the continued fraction x = 7+1/(3+...) inside 11+1/(2+1/x) to
get 11+1/(2+1/(7+...)). Technically, this can be done much more easily
than by nesting continued fractions.
The encoding is just four integer numbers <a,b,c,d> that can be
arranged either into the Moebius function

    (a*x + c) / (b*x + d)

so that concatenation of paths is substitution of functions, or,
alternatively, into the 2x2 matrix

    Matrix([a,c],[b,d])

so that concatenation of paths is matrix multiplication.

Example: the path 3.1.1.9.9.9.12.31.24.500.17.4.39 corresponds to the
continued fraction

    3+1/(1+1/(1+1/(9+1/(9+1/(9+1/(12+1/(31+1/(24+1/(500+1/(17+1/(4+1/39)))))))))))

which corresponds to

    Matrix([68075613118554,1734570625891],[19306670376781,491935096655])

All these matrices have the additional property that the absolute value
of the determinant is 1 (Proposition 1). In our case

    | 68075613118554*491935096655 - 1734570625891*19306670376781 | = 1

In short, I'm taking back my previous assessment that for all practical
purposes Materialized Path is the best encoding. Continued
Fractions/Moebius transformations/2x2 matrices are much nicer IMO. But
this totally depends upon how comfortable you are with math.
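The matrix encoding described above is easy to exercise. A minimal Python sketch (function names are mine, not from the articles; each path component n contributes the matrix [[n, 1], [1, 0]]):

```python
def path_to_matrix(path):
    """Encode a dotted path as the 2x2 integer matrix [[a, c], [b, d]]:
    the left-to-right product of [[n, 1], [1, 0]] over the components n.
    The value of the corresponding continued fraction is a/b."""
    a, c, b, d = 1, 0, 0, 1                  # start from the identity
    for n in map(int, path.split(".")):
        # [[a, c], [b, d]] times [[n, 1], [1, 0]]
        a, c, b, d = a * n + c, a, b * n + d, b
    return a, c, b, d

def concat(m1, m2):
    """Concatenating two paths is multiplying their matrices."""
    a1, c1, b1, d1 = m1
    a2, c2, b2, d2 = m2
    return (a1 * a2 + c1 * b2, a1 * c2 + c1 * d2,
            b1 * a2 + d1 * b2, b1 * c2 + d1 * d2)

def matrix_to_path(m):
    """Recover the path by running the Euclidean algorithm on a/b,
    i.e. re-expanding the continued fraction. (A trailing path
    component of 1 would be ambiguous here; the examples avoid it.)"""
    a, c, b, d = m
    components = []
    while b:
        q = a // b
        components.append(q)
        a, b = b, a - q * b
    return ".".join(map(str, components))
```

With this, the determinant property (Proposition 1) and the path/matrix round trip can both be checked directly, and concatenating "11.2" with "7.3.5" via matrix multiplication reproduces the encoding of "11.2.7.3.5".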