Re: [CFReview] Red-Black Treey - Mailing list pgsql-hackers
From | Oleg Bartunov |
---|---|
Subject | Re: [CFReview] Red-Black Treey |
Date | |
Msg-id | Pine.LNX.4.64.1001292150010.16860@sn.sai.msu.ru Whole thread Raw |
In response to | [CFReview] Red-Black Tree (Mark Cave-Ayland <mark.cave-ayland@siriusit.co.uk>) |
List | pgsql-hackers |
Mark, do you need my data to reproduce results from http://www.sai.msu.su/~megera/wiki/2009-07-27 ? Oleg On Fri, 29 Jan 2010, Mark Cave-Ayland wrote: > Hi Robert, > > I've also spent some time reviewing this patch since it is a > pre-requisite to the KNNGiST patch. I did have a much more comprehensive > list of suggestions, but it seems you've managed to resolve most of > these with your latest re-write. Please find some more comments inline: > >> Here's an edited version, which I've now reviewed more fully. Some >> more substantive review comments: > > Firstly: the re-worked patch that you have posted seems to contain > remnants of at least 2 other patches. I have extracted the rbtree-only > sections and re-attached to this email. > > The patch was tested against git head 124a3cc... and applied without any > fuzz or other issues. > >> 1. I'm pretty satisfied that the rbtree code is generally sane, >> although I wonder if we should think about putting it in access/common >> rather than utils/misc. I'm not sure that I have a sufficiently >> clearly defined notion of what each subdirectory is for to draw a >> definitive conclusion on this point; hopefully someone else will weigh >> in. > > I'm happy that the code is a reasonable implementation of an RB-Tree, at > least with respect to the link to the related public domain source that > was posted. In terms of location, I think utils/misc is a reasonable > place for it to live since I see it as analogous to the hash table > implementation, i.e. it's a template RB-Tree implementation designed to > be used as required throughout the codebase. backend/access seems to be > the home of index AMs only. > > Other code points: > > - The new names for the iterator enum seem much better to me - or at > least it helped understand the meaning of the iterator code. > > - You correctly picked up on the namespace issue, although I noticed > that you left RED and BLACK as they were. Maybe RBRED and RBBLACK would > be better, though not that there are any colour related defines around > in a database backend to make a name clash probable. > > - I found the name of the "appendator" method misleading - perhaps > "updater" would make more sense? > >> 2. I'm a little concerned about the performance implications of this >> patch. Looking at http://www.sai.msu.su/~megera/wiki/2009-04-03, it's >> clear that the patch is a huge win in some cases. But I'm also >> surprised by how much performance is lost in some of the other cases >> that you tested. I suspect, on balance, that it's probably still a >> good idea to put this in, but I wonder if you've profiled this at all >> to see where the extra time is going and/or explored possible ways of >> squashing that overhead down a little more. >> >> 3. In ginInsertEntry(), the "else" branch of the if statement really >> looks like magic when you first read it. I wonder if it would make >> sense to pull the contents of EAAllocate() directly into this >> function, since there's only one call site anyhow. > > God yes. This is not a good example of maintainable code - even with your > comment I struggled for a while to try and figure it out :( I would suggest > that this is refactored appropriately before commit. > >> I still have not done any performance testing of my own on this code, >> and it probably needs that. If anyone else has time to step up and >> help with that, it would be much appreciated. It would be useful to >> have some plain old benchmarks as well as some profiling data as >> mentioned above. > > As part of my testing, I attempted to duplicate some of the benchmarks > at http://www.sai.msu.su/~megera/wiki/2009-04-03. Unfortunately I was not > really able to reproduce the RND (teodor's) dataset, nor the random array > test as the SQL used to test the implementation was not present on the page > above. > > For each test, I dropped and recreated the database to ensure that any > autovacuum impact would be the same. > > > 1) Fixed length random & sequential string arrays > > SELECT array_to_string(ARRAY(select '' || a || '.' || b from > generate_series(1,50) b), ' ')::tsvector AS i INTO seq FROM > generate_series(1,100000) a; > > SELECT array_to_string(ARRAY(select '' || random() from > generate_series(1,50) b), ' ')::tsvector AS i INTO rnd FROM > generate_series(1,100000) a; > > > Before patch: > > test=# create index seq_idx on seq using gin (i); > CREATE INDEX > Time: 103205.790 ms > test=# create index rnd_idx on rnd using gin (i); > CREATE INDEX > Time: 6770.131 ms > > > After patch: > > test=# create index seq_idx on seq using gin (i); > CREATE INDEX > Time: 87724.953 ms > test=# create index rnd_idx on rnd using gin (i); > CREATE INDEX > Time: 5596.666 ms > > > 2) Identical records, variable length test > > select ARRAY(select generate_series(1,len)) as a50 into arr50 from > generate_series(1,100000) b; > > > Before patch: > > i) len=3 > > select ARRAY(select generate_series(1,3)) as a3 into arr3 from > generate_series(1,100000) b; > > test=# create index arr3_idx on arr3 using gin (a3); > CREATE INDEX > Time: 324.251 ms > > > ii) len=30 > > select ARRAY(select generate_series(1,30)) as a30 into arr30 from > generate_series(1,100000) b; > > test=# create index arr30_idx on arr30 using gin (a30); > CREATE INDEX > Time: 2163.786 ms > > > iii) len=50 > > select ARRAY(select generate_series(1,50)) as a50 into arr50 from > generate_series(1,100000) b; > > test=# create index arr50_idx on arr50 using gin (a50); > CREATE INDEX > Time: 3306.074 ms > > > iv) len=random > > select ARRAY(select generate_series(1, (random() * 100)::int)) as arand into > arrrand from generate_series(1,100000) b; > > test=# create index arrrand_idx on arrrand using gin (arand); > CREATE INDEX > Time: 4725.556 ms > > > After patch: > > i) len=3 > > select ARRAY(select generate_series(1,3)) as a3 into arr3 from > generate_series(1,100000) b; > > test=# create index arr3_idx on arr3 using gin (a3); > CREATE INDEX > Time: 299.090 ms > > > ii) len=30 > > select ARRAY(select generate_series(1,30)) as a30 into arr30 from > generate_series(1,100000) b; > > test=# create index arr30_idx on arr30 using gin (a30); > CREATE INDEX > Time: 2828.424 ms > > > iii) len=50 > > select ARRAY(select generate_series(1,50)) as a50 into arr50 from > generate_series(1,100000) b; > > test=# create index arr50_idx on arr50 using gin (a50); > CREATE INDEX > Time: 3984.456 ms > > > iv) len=random > > select ARRAY(select generate_series(1, (random() * 100)::int)) as arand into > arrrand from generate_series(1,100000) b; > > test=# create index arrrand_idx on arrrand using gin (arand); > CREATE INDEX > Time: 3514.972 ms > > > Summary > ======= > > I believe Robert has done a good deal of the groundwork required to get this > patch ready for inclusion. With the current version, I was able to see a > measurable performance improvement in some test cases with no significant > performance regression. It would have been nice to be able to reproduce the > whole set of test cases but this was not possible from the information > specified. > > With perhaps some minor tweaks to some of the names and a rework of the else > clause in ginInsertEntry(), I feel this patch is reasonably close to commit. > > I shall now continue my review of the associated KNNGiST patch. > > > ATB, > > Mark. > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
pgsql-hackers by date: