Re: self referencing tables/ nested sets etc... - Mailing list pgsql-general
From | Manfred Koizar |
---|---|
Subject | Re: self referencing tables/ nested sets etc... |
Date | |
Msg-id | rha560l2v6ot23sp1e5kj24vvgs6i5bgo1@email.aon.at Whole thread Raw |
In response to | self referencing tables/ nested sets etc... (Rob Hoopman <rob@tuna.nl>) |
Responses |
Re: self referencing tables/ nested sets etc...
|
List | pgsql-general |
On Tue, 23 Mar 2004 22:25:17 +0100, Rob Hoopman <rob@tuna.nl> wrote: >What solutions are there to storing a wide/shallow self referencing tree in a >database? contrib/ltree (http://www.sai.msu.su/~megera/postgres/gist/ltree/) or Joe's connectby() might help. >googling around I found http://www.dbazine.com/tropashko4.shtml. Thanks for the link. I really enjoyed reading that article. It has beautiful coloured graphics and is nicely spiced with buzz words like *fractal* etc. >( And the 'Real World Performance group at >Oracle Corp.' sounds like a ringing endorsement ) Ok, that made me change my sig. <bg> > I have spent today wrestling with [...] > inaccuracies in the article, ^ ... and bugs ... >But,... it's a bittersweet victory because the provided solutions is >completely useless to me. So you had your fun. Now throw it away and start working on a real solution. > It appears that when adding more than 48 sub nodes >to any node in the tree, craps out because of an INT8 column overflowing. AFAICS it doesn't depend on the number of siblings, but it fails when the sum of the numbers in dotted path notation exceeds 62. <rant> Tropashko's Nested Intervals concept as presented in dbazine is broken by design. It is essentially a Materialized Path implementation using bit patterns to store path information. I'll call it Obfuscated Materialized Path Method, OMPM. Given the definitions as found in the article the following conditions hold for each tree node: 0 < numer < 2*denom denom = 2^n, n >= 1 Adding the first child to a node identified by path p: denom(p.1) = 2 * denom(p) numer(p.1) = 2 * numer(p) + 1 Adding child c+1 to a parent node identified by path p: denom(p.(c+1)) = 2 * denom(p.c) numer(p.(c+1)) = 2 * numer(p.c) - 3 For the example nodes used in the article we get the following values: path x+y bin(numer) <root> 1/1 1 1 3/2 11 1.1 7/4 111 1.1.1 15/8 1111 1.1.1.1 31/16 11111 1.2 11/8 1011 1.2.1 23/16 10111 1.3 19/16 10011 1.3.1 39/32 100111 2 3/4 011 2.1 7/8 0111 2.1.1 15/16 01111 2.1.2 27/32 011011 2.2 11/16 01011 2.2.1 19/32 010111 3 3/8 0011 3.1 7/16 00111 4 3/16 00011 The third column in this table shows the binary representation of the numerator, padded to the length of the denominator with leading zeros. The last two bits are always 11, they are redundant and we subsequently ignore them. The remaining bits can be used to find a node in this binary tree: 0 0 0 1---------------------2------------3------------4---- - . . . | | | | 1| 1| 1| 1| | 0 0 | | | 1.1-----1.2-----1.3 2.1 3.1---3.2 4.1---4.2 | | | | 1| 1| 1| | | | | | | 1.2.1---1.2.2 | | | | | | | 1.2.1.1 | | | | 0 | 1.1.1 2.1.1---2.1.2 4.1.1 | | Horizontal edges are marked with 0, they connect each node to its immediate siblings on the same tree level. Vertical edges are marked with 1, they lead from a parent node to its first child. The length of the edges doesn't have any significance. The node corresponding to a (numer, denom) pair is found by scanning the bits of the binary representation of numer from left (most significant) to right (least significant) starting at the position of the single bit which is set in denom (and ignoring the last two bits as has already been mentioned). We locate the upper left node "1" in the tree and move to the right for each 0 bit and down for each 1 bit. Using dotted path notation this means that we start with path "1". For each 1 bit we append ".1" to the path, for each 0 bit we increase the last component of the path by 1. Converting a path to a bit pattern and a bit pattern to numer and denom is left for the reader. So path strings and OMPM (numer, denom) pairs uniquely correspond to each other, and the number of bits required for numer and denom depends on the *sum* of the path components. E.g. the path "4.3.24.22.15" cannot be stored if you use 64 bit integers for numer and denom. It seems the author is aware of this limitation because he writes: | More thorough implementation of the method would involve a domain | of integers with a unlimited range Does Oracle have integers with unlimited range? AFAIK Postgres does not. But we have strings with practically unlimited length (text). contrib/ltree stores path information in strings. It seems to have a limit of 256 bytes. Storage efficiency: Even if we had variable length integers with unlimited range, the number of bits required to store a path in OMPM would depend on the sum of the path components. With 64 bit integers we could store paths like "1.1.1.1. ... .1.1.1" (62 levels) or "1.61" in 16 bytes (2 * 64 bits). In 16 bytes Postgres can store a string with a length of up to 12 bytes. For "1.1" to "9.99" and "10.1" to "99.9" even 8 bytes are sufficient. So for wide/shallow trees dotted path strings are more storage efficient than OMPM bitmaps with long runs of 0 bits. Regarding performance, both queries given near the end of the article | * All the descendants of JONES, excluding himself: | select h1.name from hierarchy h1, hierarchy h2 | where h2.name = 'JONES' | and distance(h1.numer, h1.denom, | h2.numer, h2.denom)>0; and | * All the ancestors of FORD, excluding himself: | select h2.name from hierarchy h1, hierarchy h2 | where h1.name = 'FORD' | and distance(h1.numer, h1.denom, | h2.numer, h2.denom)>0; require a full table scan over hierarchy because the condition involving a distance() function call cannot use an index. So with OMPM the performance of common queries is worse than with classic materialized path strings (e1.path LIKE e2.path || '%'). One more reason to look at contrib/ltree. </rant> Good luck! Servus Manfred, self-appointed member of the Global PG Performance Group
pgsql-general by date: