Thread: WIP: SP-GiST, Space-Partitioned GiST
Hi there, attached is our WIP-patch for 9.2 development source tree, which provides implementation of SP-GiST (prototype was presented at PGCon-2011, see http://www.pgcon.org/2011/schedule/events/309.en.html and presentation for details) as a core feature. Main differences from prototype version: 1. Now it's part of pg core, not contrib module 2. It provides more operations for quadtree and suffix tree 3. It uses clustering algorithm of nodes on disk and has much better utilization of disk space. Fillfactor is supported 4. Some corner cases were eliminated 5. It provides support for concurency and recovery (inserts are logged, supports for deletes, and log replay will be added really soon) So, now code contains almost all possible overhead of production code and we ask hackers to test performance on real data sets. We expect the same performance for random data (since almost no overlaps) and much better performance on real-life data, plus much better index creation time. Also, we appreciate your comments and suggestions about API. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
This is updates SP-GiST patch, which fixed one bug and replaced test to the locale independent one. On Wed, 31 Aug 2011, Oleg Bartunov wrote: > Hi there, > > attached is our WIP-patch for 9.2 development source tree, which provides > implementation of SP-GiST (prototype was presented at PGCon-2011, see > http://www.pgcon.org/2011/schedule/events/309.en.html and presentation > for details) as a core feature. Main differences from prototype version: > > 1. Now it's part of pg core, not contrib module > 2. It provides more operations for quadtree and suffix tree > 3. It uses clustering algorithm of nodes on disk and has much better > utilization of disk space. Fillfactor is supported > 4. Some corner cases were eliminated > 5. It provides support for concurency and recovery (inserts are > logged, supports for deletes, and log replay will be added really > soon) > > So, now code contains almost all possible overhead of production code > and we ask hackers to test performance on real data sets. We expect > the same performance for random data (since almost no overlaps) and > much better performance on real-life data, plus much better index > creation time. Also, we appreciate your comments and suggestions about > API. > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Attached is updated SP-GiST patch, which provides full logging support and fixed several bugs (Thanks ALexander Korotkov for help). Regards, Oleg On Thu, 1 Sep 2011, Oleg Bartunov wrote: > This is updates SP-GiST patch, which fixed one bug and replaced test to the > locale independent one. > > On Wed, 31 Aug 2011, Oleg Bartunov wrote: > >> Hi there, >> >> attached is our WIP-patch for 9.2 development source tree, which provides >> implementation of SP-GiST (prototype was presented at PGCon-2011, see >> http://www.pgcon.org/2011/schedule/events/309.en.html and presentation >> for details) as a core feature. Main differences from prototype version: >> >> 1. Now it's part of pg core, not contrib module >> 2. It provides more operations for quadtree and suffix tree >> 3. It uses clustering algorithm of nodes on disk and has much better >> utilization of disk space. Fillfactor is supported >> 4. Some corner cases were eliminated >> 5. It provides support for concurency and recovery (inserts are >> logged, supports for deletes, and log replay will be added really >> soon) >> >> So, now code contains almost all possible overhead of production code >> and we ask hackers to test performance on real data sets. We expect >> the same performance for random data (since almost no overlaps) and >> much better performance on real-life data, plus much better index >> creation time. Also, we appreciate your comments and suggestions about >> API. >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
I attached wrong patch in previous message, sorry ! Here is a last version. This is a new WIP of SP-GiST patch, which provides support for: 1. Concurrent vacuum 2. Vacuum logging 3. WAL replay Oleg On Thu, 1 Sep 2011, Oleg Bartunov wrote: > This is updates SP-GiST patch, which fixed one bug and replaced test to the > locale independent one. > > On Wed, 31 Aug 2011, Oleg Bartunov wrote: > >> Hi there, >> >> attached is our WIP-patch for 9.2 development source tree, which provides >> implementation of SP-GiST (prototype was presented at PGCon-2011, see >> http://www.pgcon.org/2011/schedule/events/309.en.html and presentation >> for details) as a core feature. Main differences from prototype version: >> >> 1. Now it's part of pg core, not contrib module >> 2. It provides more operations for quadtree and suffix tree >> 3. It uses clustering algorithm of nodes on disk and has much better >> utilization of disk space. Fillfactor is supported >> 4. Some corner cases were eliminated >> 5. It provides support for concurency and recovery (inserts are >> logged, supports for deletes, and log replay will be added really >> soon) >> >> So, now code contains almost all possible overhead of production code >> and we ask hackers to test performance on real data sets. We expect >> the same performance for random data (since almost no overlaps) and >> much better performance on real-life data, plus much better index >> creation time. Also, we appreciate your comments and suggestions about >> API. >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Here is the latest spgist patch, which has all planned features as well as all overhead, introduced by concurrency and recovery, so performance measurement should be realistic now. Oleg On Mon, 5 Sep 2011, Oleg Bartunov wrote: > I attached wrong patch in previous message, sorry ! Here is a last version. > > This is a new WIP of SP-GiST patch, which provides support for: > > 1. Concurrent vacuum > 2. Vacuum logging > 3. WAL replay > > Oleg > On Thu, 1 Sep 2011, Oleg Bartunov wrote: > >> This is updates SP-GiST patch, which fixed one bug and replaced test to the >> locale independent one. >> >> On Wed, 31 Aug 2011, Oleg Bartunov wrote: >> >>> Hi there, >>> >>> attached is our WIP-patch for 9.2 development source tree, which provides >>> implementation of SP-GiST (prototype was presented at PGCon-2011, see >>> http://www.pgcon.org/2011/schedule/events/309.en.html and presentation >>> for details) as a core feature. Main differences from prototype version: >>> >>> 1. Now it's part of pg core, not contrib module >>> 2. It provides more operations for quadtree and suffix tree >>> 3. It uses clustering algorithm of nodes on disk and has much better >>> utilization of disk space. Fillfactor is supported >>> 4. Some corner cases were eliminated >>> 5. It provides support for concurency and recovery (inserts are >>> logged, supports for deletes, and log replay will be added really >>> soon) >>> >>> So, now code contains almost all possible overhead of production code >>> and we ask hackers to test performance on real data sets. We expect >>> the same performance for random data (since almost no overlaps) and >>> much better performance on real-life data, plus much better index >>> creation time. Also, we appreciate your comments and suggestions about >>> API. >>> >>> Regards, >>> Oleg >>> _____________________________________________________________ >>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >>> Sternberg Astronomical Institute, Moscow University, Russia >>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >>> phone: +007(495)939-16-83, +007(495)939-23-83 >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On 09/06/2011 07:34 PM, Oleg Bartunov wrote: > Here is the latest spgist patch, which has all planned features as > well as > all overhead, introduced by concurrency and recovery, so performance > measurement should be realistic now. > > Oleg Sorry for not getting the might-be-obvious here, but will this patch bring indexed substring-search to PG? So queries conceptually equal to this will be possible to index: WHERE som_col @@ ':substr1:&:substr2!substr3:' meaning "contains substr1" AND "ends with substr2" OR "starts with substr3"? -- Andreas Joseph Krogh <andreak@officenet.no> - mob: +47 909 56 963 Senior Software Developer / CTO - OfficeNet AS - http://www.officenet.no Public key: http://home.officenet.no/~andreak/public_key.asc
On Tue, Sep 6, 2011 at 10:21 PM, Andreas Joseph Krogh <andreak@officenet.no> wrote:
------
With best regards,
Alexander Korotkov.
Sorry for not getting the might-be-obvious here, but will this patchbring indexed substring-search to PG? So queries conceptually equal to
this will be possible to index: WHERE som_col @@
':substr1:&:substr2!substr3:' meaning "contains substr1" AND "ends with
substr2" OR "starts with substr3"?
Such applications are potentionally possible, but aren't presented yet. Currently only k-d-tree and suffix-tree are presented. Suffix-tree supports prefix search like btree with *_pattern_ops. (Oleg will corect me if I'm missing something)
For the queries you mentioned you may see LIKE acceleration in wildspeed module (http://www.sai.msu.su/~megera/wiki/wildspeed) and pg_trgm of 9.1 (http://www.postgresql.org/docs/9.1/static/pgtrgm.html).
With best regards,
Alexander Korotkov.
On 05.09.2011 09:39, Oleg Bartunov wrote: > I attached wrong patch in previous message, sorry ! Here is a last version. One little detail caught my eye: In spgSplitNodeAction, you call SpGistGetBuffer() within a critical section. That should be avoided, SpGistGetBuffer() can easily fail if you e.g run out of disk space. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 06.09.2011 20:34, Oleg Bartunov wrote: > Here is the latest spgist patch, which has all planned features as well as > all overhead, introduced by concurrency and recovery, so performance > measurement should be realistic now. I'm ignoring the text suffix-tree part of this for now, because of the issue with non-C locales that Alexander pointer out. Regarding the quadtree, have you compared the performance of that with Alexander's improved split algorithm? I ran some tests using the test harness I still had lying around from the fast GiST index build tests: testname | time | accesses | indexsize -------------------------+-----------------+----------+----------- points unordered auto | 00:03:58.188866 | 378779 |522 MB points ordered auto | 00:07:14.362355 | 177534 | 670 MB points unordered auto | 00:02:59.130176 | 46561| 532 MB points ordered auto | 00:04:00.50756 | 45066 | 662 MB points unordered spgist | 00:03:05.569259 | 78871 | 394 MB points ordered spgist | 00:01:46.06855 | 422104 | 417 MB (8 rows) These tests were with a table with 7500000 random points. In the ordered-tests, the table is sorted by x,y coordinates. 'time' is the time used to build the index on it, and 'accesses' is the total number of index blocks hit by a series of 10000 bounding box queries, measured from pg_statio_user_indexes.idx_blks_hit + idx_blks_read. The first two tests in the list are with a GiST index on unpatched PostgreSQL. The next six tests are with Alexander's double-sorting split patch. The last two tests are with an SP-GiST index. It looks like the query performance with GiST using the double-sorting split is better than SP-GiST, although the SP-GiST index is somewhat smaller. The ordered case seems pathologically bad, is that some sort of a worst-case scenario for quadtrees? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, Sep 22, 2011 at 2:05 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
------
With best regards,
Alexander Korotkov.
Regarding the quadtree, have you compared the performance of that with Alexander's improved split algorithm? I ran some tests using the test harness I still had lying around from the fast GiST index build tests:
testname | time | accesses | indexsize
-------------------------+-----------------+----------+-----------
points unordered auto | 00:03:58.188866 | 378779 | 522 MB
points ordered auto | 00:07:14.362355 | 177534 | 670 MB
points unordered auto | 00:02:59.130176 | 46561 | 532 MB
points ordered auto | 00:04:00.50756 | 45066 | 662 MB
points unordered spgist | 00:03:05.569259 | 78871 | 394 MB
points ordered spgist | 00:01:46.06855 | 422104 | 417 MB
(8 rows)
I assume first two rows to be produced by new linear split algorithm(current) and secound two rows by double sorting split algorithm(my patch).
These tests were with a table with 7500000 random points. In the ordered-tests, the table is sorted by x,y coordinates. 'time' is the time used to build the index on it, and 'accesses' is the total number of index blocks hit by a series of 10000 bounding box queries, measured from pg_statio_user_indexes.idx_blks_hit + idx_blks_read.
The first two tests in the list are with a GiST index on unpatched PostgreSQL. The next six tests are with Alexander's double-sorting split patch. The last two tests are with an SP-GiST index.
It looks like the query performance with GiST using the double-sorting split is better than SP-GiST, although the SP-GiST index is somewhat smaller. The ordered case seems pathologically bad, is that some sort of a worst-case scenario for quadtrees?
Comparison of search speed using number of page accesses is quite comprehensive for various GiST indexes. But when we're comparing SP-GiST vs GiST we should take into accoung that they have different CPU/IO ratio. GiST scans whole page which it accesses. SP-GiST can scan only fraction of page because several nodes can be packed into single page. Thereby it would be interesting to compare also CPU load GiST vs. SP-GiST. Also, there is some hope to reduce number of page accesses in SP-GiST by improving clustering algorithm.
With best regards,
Alexander Korotkov.
On Thu, 22 Sep 2011, Heikki Linnakangas wrote: > On 06.09.2011 20:34, Oleg Bartunov wrote: >> Here is the latest spgist patch, which has all planned features as well as >> all overhead, introduced by concurrency and recovery, so performance >> measurement should be realistic now. > > I'm ignoring the text suffix-tree part of this for now, because of the issue > with non-C locales that Alexander pointer out. > > Regarding the quadtree, have you compared the performance of that with > Alexander's improved split algorithm? I ran some tests using the test harness > I still had lying around from the fast GiST index build tests: > > testname | time | accesses | indexsize > -------------------------+-----------------+----------+----------- > points unordered auto | 00:03:58.188866 | 378779 | 522 MB > points ordered auto | 00:07:14.362355 | 177534 | 670 MB > points unordered auto | 00:02:59.130176 | 46561 | 532 MB > points ordered auto | 00:04:00.50756 | 45066 | 662 MB > points unordered spgist | 00:03:05.569259 | 78871 | 394 MB > points ordered spgist | 00:01:46.06855 | 422104 | 417 MB > (8 rows) > > These tests were with a table with 7500000 random points. In the > ordered-tests, the table is sorted by x,y coordinates. 'time' is the time > used to build the index on it, and 'accesses' is the total number of index > blocks hit by a series of 10000 bounding box queries, measured from > pg_statio_user_indexes.idx_blks_hit + idx_blks_read. > > The first two tests in the list are with a GiST index on unpatched > PostgreSQL. The next six tests are with Alexander's double-sorting split > patch. The last two tests are with an SP-GiST index. > > It looks like the query performance with GiST using the double-sorting split > is better than SP-GiST, although the SP-GiST index is somewhat smaller. The > ordered case seems pathologically bad, is that some sort of a worst-case > scenario for quadtrees? random points are not good use-case, on real data sp-gist is much faster than GiST. I attached the latest version of patch, which descrease wal-traffic and introduce kd-tree example. real-life example from geonames database can be downloaded from http://mira.sai.msu.su/~megera/geo-all.dump.gz Below is a log of test run on my notebook: zcat /home/megera/app/pgsql/knn/geo-all.dump.gz | psql test create index spg_kd_idx on geo using spgist(point kd_point_ops); CREATE INDEX Time: 97180.047 ms test=# select pg_total_relation_size('spg_quad_idx'); pg_total_relation_size ------------------------ 363126784 test=# create index spg_quad_idx on geo using spgist(point); LOG: checkpoints are occurring too frequently (22 seconds apart) HINT: Consider increasing the configuration parameter "checkpoint_segments". LOG: checkpoints are occurring too frequently (19 seconds apart) HINT: Consider increasing the configuration parameter "checkpoint_segments". LOG: checkpoints are occurring too frequently (20 seconds apart) HINT: Consider increasing the configuration parameter "checkpoint_segments". CREATE INDEX Time: 68565.279 ms test=# select pg_total_relation_size('spg_quad_idx'); pg_total_relation_size ------------------------ 363126784 test=# create index gist_idx on geo using gist(point); CREATE INDEX Time: 354446.965 ms test=# select pg_total_relation_size('gist_idx'); pg_total_relation_size ------------------------ 542793728 Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > On 06.09.2011 20:34, Oleg Bartunov wrote: >> Here is the latest spgist patch, which has all planned features as well as >> all overhead, introduced by concurrency and recovery, so performance >> measurement should be realistic now. > I'm ignoring the text suffix-tree part of this for now, because of the > issue with non-C locales that Alexander pointer out. It seems to me that SP-GiST simply cannot work for full text comparisons in non-C locales, because it's critically dependent on the assumption that comparisons of strings are consistent with comparisons of prefixes of those strings ... an assumption that's just plain false for most non-C locales. We can dodge that problem in the same way that we did in the btree pattern_ops opclasses, namely implement the opclass only for the = operator and the special operators ~<~ etc. I think I favor doing this for the first round, because it's a simple matter of removing code that's currently present in the patch. Even with only = support the opclass would be extremely useful. Something we could consider later is a way to use the index for the regular text comparison operators (< etc), but only when the operator is using C collation. This is not so much a matter for the index implementation as it is about teaching the planner to optionally consider collation when matching an operator call to the index. It's probably going to tie into the unfinished business of marking which operators are collation sensitive and which are not. In other news, I looked at the patch briefly, but I don't think I want to review it fully without some documentation. The absolute minimum requirement IMO is documentation comparable to what we have for GIN, ie a specification for the support methods and some indication of when you'd use this index type in preference to others. I'd be willing to help copy-edit and SGML-ize such documentation, but I do not care to reverse-engineer it from the code. regards, tom lane
We are working on the hackers documentation, hope to submit it before my himalaya track. Oleg On Sun, 2 Oct 2011, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> On 06.09.2011 20:34, Oleg Bartunov wrote: >>> Here is the latest spgist patch, which has all planned features as well as >>> all overhead, introduced by concurrency and recovery, so performance >>> measurement should be realistic now. > >> I'm ignoring the text suffix-tree part of this for now, because of the >> issue with non-C locales that Alexander pointer out. > > It seems to me that SP-GiST simply cannot work for full text comparisons > in non-C locales, because it's critically dependent on the assumption > that comparisons of strings are consistent with comparisons of prefixes > of those strings ... an assumption that's just plain false for most > non-C locales. > > We can dodge that problem in the same way that we did in the btree > pattern_ops opclasses, namely implement the opclass only for the = > operator and the special operators ~<~ etc. I think I favor doing this > for the first round, because it's a simple matter of removing code > that's currently present in the patch. Even with only = support > the opclass would be extremely useful. > > Something we could consider later is a way to use the index for the > regular text comparison operators (< etc), but only when the operator > is using C collation. This is not so much a matter for the index > implementation as it is about teaching the planner to optionally > consider collation when matching an operator call to the index. It's > probably going to tie into the unfinished business of marking which > operators are collation sensitive and which are not. > > In other news, I looked at the patch briefly, but I don't think I want > to review it fully without some documentation. The absolute minimum > requirement IMO is documentation comparable to what we have for GIN, > ie a specification for the support methods and some indication of when > you'd use this index type in preference to others. I'd be willing to > help copy-edit and SGML-ize such documentation, but I do not care to > reverse-engineer it from the code. > > regards, tom lane > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Hi there, attached is the latest version of patch (v.110) for current 9.2. Now it includes README and spgist.sgml. There is one annoying problem under MAC OS (Linux, FreeBSD have no problem), we just can't figure out how to find it, since we are not familiar with MAC OS - it fails to restart after 'kill -9' backend, but only if sources were compiled with -O2 option (no problem occured with -O0). Since the fail happens not every time, we use following script to reproduce the problem. We ask MAC OS guru to help us debugging this problem. ============================================================================ #!/bin/sh # enable core file dumping by a system ulimit -c unlimited # create test data createdb test psql test -c 'drop table if exists qq; create table qq as select point( p.lat, p.long) as p from ( select (0.5-random())*180 as lat, random()*360 as long from generate_series(1,10000000) ) as p; ' while : do psql test -c 'create index spgist_idx on qq using spgist(p)' & sleep 20 kill -9 `ps ax | grep postgres:| grep -v grep | awk '{print $1}' | xargs` rc=$? if [ $rc -gt 0 ] then echoSomething is going wrong (rc=$rc) ... break fi sleep 10 done Regads, Oleg On Sun, 2 Oct 2011, Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> On 06.09.2011 20:34, Oleg Bartunov wrote: >>> Here is the latest spgist patch, which has all planned features as well as >>> all overhead, introduced by concurrency and recovery, so performance >>> measurement should be realistic now. > >> I'm ignoring the text suffix-tree part of this for now, because of the >> issue with non-C locales that Alexander pointer out. > > It seems to me that SP-GiST simply cannot work for full text comparisons > in non-C locales, because it's critically dependent on the assumption > that comparisons of strings are consistent with comparisons of prefixes > of those strings ... an assumption that's just plain false for most > non-C locales. > > We can dodge that problem in the same way that we did in the btree > pattern_ops opclasses, namely implement the opclass only for the = > operator and the special operators ~<~ etc. I think I favor doing this > for the first round, because it's a simple matter of removing code > that's currently present in the patch. Even with only = support > the opclass would be extremely useful. > > Something we could consider later is a way to use the index for the > regular text comparison operators (< etc), but only when the operator > is using C collation. This is not so much a matter for the index > implementation as it is about teaching the planner to optionally > consider collation when matching an operator call to the index. It's > probably going to tie into the unfinished business of marking which > operators are collation sensitive and which are not. > > In other news, I looked at the patch briefly, but I don't think I want > to review it fully without some documentation. The absolute minimum > requirement IMO is documentation comparable to what we have for GIN, > ie a specification for the support methods and some indication of when > you'd use this index type in preference to others. I'd be willing to > help copy-edit and SGML-ize such documentation, but I do not care to > reverse-engineer it from the code. > > regards, tom lane > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes: > There is one annoying problem under MAC OS (Linux, FreeBSD have no problem), we > just can't figure out how to find it, since we are not familiar with MAC OS - > it fails to restart after 'kill -9' backend, but only if sources were > compiled with -O2 option (no problem occured with -O0). Since the fail happens > not every time, we use following script to reproduce the problem. We ask > MAC OS guru to help us debugging this problem. I don't think it's Mac-specific at all; it looks to me like garden variety uninitialized data, specifically that there are paths through doPickSplit that don't set xlrec.newPage. The crash I'm seeing is TRAP: FailedAssertion("!(offset <= (((PageHeader) (page))->pd_lower <= (__builtin_offsetof (PageHeaderData, pd_linp)) ? 0: ((((PageHeader) (page))->pd_lower - (__builtin_offsetof (PageHeaderData, pd_linp))) / sizeof(ItemIdData))) + 1)", File:"spgxlog.c", Line: 81) #0 0x00007fff883f982a in __kill () #1 0x00007fff85bdda9c in abort () #2 0x0000000103165a71 in ExceptionalCondition (conditionName=<value temporarily unavailable, due to optimizations>, errorType=<valuetemporarily unavailable, due to optimizations>, fileName=<value temporarily unavailable, due to optimizations>,lineNumber=<value temporarily unavailable, due to optimizations>) at assert.c:57 #3 0x0000000102eeec73 in addOrReplaceTuple (page=0x74cc <Address 0x74cc out of bounds>, tuple=0x7faa1182d64c " ", size=88,offset=70) at spgxlog.c:81 #4 0x0000000102eed4bc in spgRedoPickSplit [inlined] () at /Users/tgl/pgsql/src/backend/access/spgist/spgxlog.c:504 #5 0x0000000102eed4bc in spg_redo (record=0x7fff62a5ccf0) at spgxlog.c:803 #6 0x0000000102ec4f48 in StartupXLOG () at xlog.c:6534 #7 0x0000000103054378 in StartupProcessMain () at startup.c:220 #8 0x0000000102ef4449 in AuxiliaryProcessMain (argc=2, argv=0x7fff62a60030) at bootstrap.c:414 The xlog record it's working on is (gdb) p *(spgxlogPickSplit*)(0x7fcb20826600 + 32) $6 = { node = { spcNode = 1663, dbNode = 41578, relNode = 204800 }, nTuples = 75, nNodes = 4, blknoSrc = 988, nDelete = 74, blknoInner = 929, offnumInner = 70, newPage = 1 '\001', blknoParent = 929, offnumParent = 13, nodeI= 2, stateSrc = { attType_attlen = 16, fakeTupleSize = 32, isBuild = 1 } } Since newPage is set, addOrReplaceTuple gets called on a freshly initialized page, and not surprisingly complains that offset 70 is way out of range. Maybe there's something wrong with the replay logic, but what I'm thinking is that newPage should not have been true here, which means that doPickSplit failed to set it correctly, which doesn't look at all improbable. I added a memset at the top of doPickSplit to force the whole struct to zeroes, and so far haven't seen the crash again. regards, tom lane
> I don't think it's Mac-specific at all; it looks to me like garden > variety uninitialized data, specifically that there are paths through > doPickSplit that don't set xlrec.newPage. The crash I'm seeing is Thank you a lot. I wasn't able to reproduce it on FreeBSD/Linux. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Attachment
Teodor Sigaev <teodor@sigaev.ru> writes: > [ spgist patch ] I've been working through this patch and fixing assorted things. There are a couple of issues that require some discussion: 1. It took me awhile to realize it, but there are actually three different datatypes that can be stored in an SPGist index: the prefix value of an inner tuple, the key value for each node (downlink) in an inner tuple, and the data value of a leaf tuple. One reason this was not obvious is that only two of these types can be configured by the opclass config function; the leaf tuple datatype is hard-wired to be the same as the indexed column's type. Why is that? It seems to me to be both confusing and restrictive. For instance, if you'd designed the suffix tree opclass just a little differently, it would be wanting to store "char" not text in leaf nodes. Shouldn't we change this to allow the leaf data type to be specified explicitly? 2. The code is a bit self-contradictory about whether the index is complete or not. The top-level ambuild and aminsert code rejects NULLs (but there seems to be some provision for storing nulls in leaf tuples --- I'm confused what that's used for, but apparently not for actual data nulls). If this is a designed, permanent limitation then SPGist indexes can never be expected to represent every heap tuple, which means there is no value in full-index scans. However there is code in spgWalk() to support a full-index scan. It's dead code, because pg_am.amoptionalkey is not set for spgist and so the planner will never create a qual-free index scan plan. Perhaps we should rip out the full-index-scan code? Or do you have plans to support indexing nulls later? regards, tom lane
I wrote: > ... the leaf tuple datatype is hard-wired to be > the same as the indexed column's type. Why is that? It seems to me > to be both confusing and restrictive. For instance, if you'd designed > the suffix tree opclass just a little differently, it would be wanting > to store "char" not text in leaf nodes. Shouldn't we change this to > allow the leaf data type to be specified explicitly? After another day's worth of hacking, I now understand the reason for the above: when an index is less than a page and an incoming value would still fit on the root page, the incoming value is simply dumped into a leaf tuple without ever calling any opclass-specific function at all. To allow the leaf datatype to be different from the indexed column, we'd need at least one more opclass support function, and it's not clear that the potential gain is worth any extra complexity. However, I now have another question: what is the point of the allTheSame mechanism? It seems to add quite a great deal of complexity, without saving much of any space. At least for the node key types used so far, the null bitmap that's added to the node tuples eats just as much space as storing a normal key would. We could probably avoid that by using custom tuple construction code instead of index_form_tuple, but on the whole I think it'd be better to remove the concept. For one thing, it's giving me fits while attempting to fix the limitation on storing long indexed values. There's no reason why a suffix tree representation shouldn't work for long strings, but you have to be willing to cap the length of any given inner tuple's prefix to something that will fit on a page --- and that breaks the badly-underdocumented allTheSame logic in spgtextproc.c. You can't choose to just drop the node key (i.e., the next byte in the original string) unless there isn't any because you reached the end of the string. In general it's not clear to me why it's sensible to drop a node key ever. I'm also still wondering what your thoughts are on storing null values versus full-index-scan capability. I'm leaning towards getting rid of the dead code, but if you have an idea how to remove the limitation, maybe we should do that instead. regards, tom lane
> I wrote: >> ... the leaf tuple datatype is hard-wired to be > After another day's worth of hacking, I now understand the reason for > the above: when an index is less than a page and an incoming value would > still fit on the root page, the incoming value is simply dumped into a > leaf tuple without ever calling any opclass-specific function at all. Exactly. > To allow the leaf datatype to be different from the indexed column, > we'd need at least one more opclass support function, and it's not clear > that the potential gain is worth any extra complexity. Agree, all opclasses which I could imagine for sp-gist use the same type. Without clear example I don't like an idea to add one more support function and it could be easily added later as an optional support function as it's already done for distance function for GiST > However, I now have another question: what is the point of the > allTheSame mechanism? It seems to add quite a great deal of complexity, I thought about two options: separate code path in core to support a-lot-of-the-same-values with minimal support in support functions and move all logic about this case to support functions. Second option is demonstrated in k-d-tree implementation, where split axis is contained by each half-plane. May be it is a simpler solution although it moves responsibility to opclass developers. > one thing, it's giving me fits while attempting to fix the limitation > on storing long indexed values. There's no reason why a suffix tree > representation shouldn't work for long strings, but you have to be > willing to cap the length of any given inner tuple's prefix to something I don't see clear interface for now: let we have an empty index and we need to insert a long string (more than even several page). So, it's needed to have support function to split input value to several ones. I supposed that sp-gist is already complex enough for first step to add support for this non very useful case. Of course, for future we have a plans to add support of long values, NULLs/IS NULL, knn-search at least. > I'm also still wondering what your thoughts are on storing null values > versus full-index-scan capability. I'm leaning towards getting rid of > the dead code, but if you have an idea how to remove the limitation, > maybe we should do that instead. I didn't have a plan to support NULLs in first stage, because it's not clear for me how and where to store them. It seems to me that it should be fully separated from normal path, like a linked list of pages with only ItemPointer data (similar to leaf data pages in GIN) I missed that planner will not create qual-free scan, because I thought it's still possible with NOT NULL columns. If not, this code could be removed/commented/ifdefed. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes: >> However, I now have another question: what is the point of the >> allTheSame mechanism? It seems to add quite a great deal of complexity, > I thought about two options: separate code path in core to support > a-lot-of-the-same-values with minimal support in support functions and move all > logic about this case to support functions. Second option is demonstrated in > k-d-tree implementation, where split axis is contained by each half-plane. > May be it is a simpler solution although it moves responsibility to opclass > developers. I eventually realized that you have to have it to ensure that you can split a leaf-tuple list across pages even when the opclass picksplit function thinks the values are all the same. What made no sense to me was (a) having the property forcibly inherit to child inner tuples, and (b) suppressing node labels in allTheSame tuples. That could make it impossible for the opclass to reconstruct values. In my local copy I've changed this behavior a bit and improved the documentation about what opclasses have to do with the flag. >> one thing, it's giving me fits while attempting to fix the limitation >> on storing long indexed values. There's no reason why a suffix tree >> representation shouldn't work for long strings, but you have to be >> willing to cap the length of any given inner tuple's prefix to something > I don't see clear interface for now: let we have an empty index and we > need to insert a long string (more than even several page). So, it's > needed to have support function to split input value to several > ones. I supposed that sp-gist is already complex enough for first step > to add support for this non very useful case. Well, I have it working well enough to satisfy me locally. The picksplit function can handle the prefixing perfectly well, as long as it's not surprised by getting called on a single oversized leaf tuple. (I changed the format of leaf tuples to let the size field be 30 bits, so we can have an oversized leaf tuple in memory even if we can't store it. This doesn't cost anything space-wise because of alignment rules.) > Of course, for future we have a plans to add support of long values, > NULLs/IS NULL, knn-search at least. I think if we're going to do nulls we can't wait; that will necessarily change the on-disk representation, which is going to be a hard sell once 9.2 is out the door. Neither leaf nor inner tuples have any good way to deal with nulls at present. Maybe if you invent a completely separate representation for nulls it could be added on after the fact, but that seems like an ugly answer. > I missed that planner will not create qual-free scan, because I thought it's > still possible with NOT NULL columns. If not, this code could be > removed/commented/ifdefed. Well, it won't do so because pg_am.amoptionalkey is not set. But we can't set that if we don't store NULLs. Right at the moment, my local copy has completely broken handling of WAL, because I've been focusing on the opclass interface and didn't worry about WAL while I was whacking the picksplit code around. I'll try to clean that up today and then post a new version of the patch. regards, tom lane
I wrote: > Right at the moment, my local copy has completely broken handling of > WAL, because I've been focusing on the opclass interface and didn't > worry about WAL while I was whacking the picksplit code around. > I'll try to clean that up today and then post a new version of the > patch. I promised a patch, so here is what I've got now. At this point I've been through most of the code at least once. I'm fairly happy with the opclass interface details, and have brought the documentation for that to what seems a committable state. I have a lot of other loose ends still to look at, but I think the major commit blocker at this point is the VACUUM code, which I don't like/trust at all. I'll comment on that more in a separate message. regards, tom lane
Attachment
Since this was marked WIP and Tom has now kicked off two side discussions, what I've done is tagged references to each of them as comments to the main patch, then marked this as returned with feedback. Surely what I do in the CF app isn't going to influence what Tom wants to work on, so I'll leave this one to his watch now. If it does turn out to be committed soon instead of re-submitted as I'm presuming right now, I can always go back and fix that. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
Greg Smith <greg@2ndQuadrant.com> writes: > Since this was marked WIP and Tom has now kicked off two side > discussions, what I've done is tagged references to each of them as > comments to the main patch, then marked this as returned with feedback. > Surely what I do in the CF app isn't going to influence what Tom wants > to work on, so I'll leave this one to his watch now. If it does turn > out to be committed soon instead of re-submitted as I'm presuming right > now, I can always go back and fix that. Yeah, I've been head-down on this patch for more than a week now. I would probably have bounced it back except that I think this is a pretty important feature for 9.2. At this point I've fixed all the stuff I think is must-fix before commit, but it still needs another read-through. Might get done tomorrow if I'm feeling less under the weather than I was today. regards, tom lane
Attachment
Teodor Sigaev <teodor@sigaev.ru> writes: > [ spgist patch ] I've applied this after a lot of revisions, some cosmetic (better comments etc), some not so much (less bogus vacuum and WAL handling). There are still a number of loose ends that need to be worked on: * The question of whether to store nulls so SPGiST can support full-index scans. I'm not convinced this is worth doing, because a full-index scan is likely to suck performance-wise in SPGiST anyway. But if it's going to be done then IMO it needs to happen for 9.2, because it will affect both on-disk data and the opclass API. * Supporting index-only scans. This I think is worth doing, and I plan to go do it in the next few days. However, this is going to complicate any desire to support full-index scans, because right now value reconstruction is done by inner_consistent and leaf_consistent, so there's no way to do it unless there's a qual to check. * Perhaps it'd be a good idea to move the loop over scankeys to inside the opclass consistent methods, ie call them just once to check all the scankeys. Then we could meaningfully define zero scankeys as a full index scan, and we would also get rid of redundant value reconstruction work when there's more than one scankey. (Though I'm not sure multiple-scankey cases will occur often enough to make that really worth worrying about.) * SPGiST inserts XIDs into redirect tuples so that it can tell when it's safe for VACUUM to delete them. This is similar to something btree does, and as I recall, that was a headache for hot standby. So probably SPGiST also needs some work to be safe for hot standby. * The index free space management seems pretty questionable. I can understand the use of the small local cache of recently-used pages in each backend, but I'm much less than convinced that it's a good idea to share that across backends through the metapage. It's too small for that. I think the code needs to exploit the FSM a lot more. For one thing, AFAICS only *completely empty* pages will ever get added to FSM by VACUUM, which means that once the backend that initialized a given page has dropped it from its (tiny) local cache, it's basically impossible for that page to ever receive any additions except as splits from its existing entries. Perhaps that's intentional but it seems likely to lead to poor space utilization. * It bothers me a bit that the config function gets called on every single operation. Maybe the rd_amcache struct should be used to cache the config function results (as well as the datatype lookup results). On the other hand, I have no proof that that's a noticeable time sink. * It occurs to me that it might be worth sorting the ScanStackEntry list by block number after each inner-tuple visit, so as to improve locality of access. I don't have any evidence for that either, though. * I pulled out the spgstat() function because it seemed like something that didn't belong in core. If we're going to have it at all, it should be in contrib/pgstattuple. I'm willing to do the work to put it there, but I wonder if anyone has any thoughts about whether to keep its existing API (return a text value) or change it to look more like pgstatindex(), ie return a set of columns. I'd favor the latter if I thought the set of columns was well-defined, but I'm not sure it is. (As a reminder, I've attached the last state of the function before I pulled it out.) regards, tom lane Datum spgstat(PG_FUNCTION_ARGS) { text *name = PG_GETARG_TEXT_P(0); RangeVar *relvar; Relation index; BlockNumber blkno; BlockNumbertotalPages = 0, innerPages = 0, leafPages = 0, emptyPages = 0, deletedPages = 0; double usedSpace = 0.0; char res[1024]; int bufferSize = -1; int64 innerTuples = 0, leafTuples = 0, nAllTheSame = 0, nLeafPlaceholder =0, nInnerPlaceholder = 0, nLeafRedirect = 0, nInnerRedirect = 0; #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX) #define IS_SPGIST(r) ((r)->rd_rel->relam == SPGIST_AM_OID) relvar = makeRangeVarFromNameList(textToQualifiedNameList(name)); index = relation_openrv(relvar, AccessExclusiveLock); if (!IS_INDEX(index) || !IS_SPGIST(index)) elog(ERROR, "relation \"%s\" is not an SPGiST index", RelationGetRelationName(index)); totalPages = RelationGetNumberOfBlocks(index); for (blkno = SPGIST_HEAD_BLKNO; blkno < totalPages; blkno++) { Buffer buffer; Page page; int pageFree; buffer = ReadBuffer(index, blkno); LockBuffer(buffer, BUFFER_LOCK_SHARE); page = BufferGetPage(buffer); if (PageIsNew(page) || SpGistPageIsDeleted(page)) { deletedPages++; UnlockReleaseBuffer(buffer); continue; } if (SpGistPageIsLeaf(page)) { leafPages++; leafTuples += PageGetMaxOffsetNumber(page); nLeafPlaceholder += SpGistPageGetOpaque(page)->nPlaceholder; nLeafRedirect += SpGistPageGetOpaque(page)->nRedirection; } else { int i, max; innerPages++; max = PageGetMaxOffsetNumber(page); innerTuples += max; nInnerPlaceholder+= SpGistPageGetOpaque(page)->nPlaceholder; nInnerRedirect += SpGistPageGetOpaque(page)->nRedirection; for (i = FirstOffsetNumber; i <= max; i++) { SpGistInnerTupleit; it = (SpGistInnerTuple) PageGetItem(page, PageGetItemId(page,i)); if (it->allTheSame) nAllTheSame++; } } if (bufferSize < 0) bufferSize = BufferGetPageSize(buffer) - MAXALIGN(sizeof(SpGistPageOpaqueData)) - SizeOfPageHeaderData; pageFree = PageGetExactFreeSpace(page); usedSpace += bufferSize - pageFree; if (pageFree == bufferSize) emptyPages++; UnlockReleaseBuffer(buffer); } index_close(index, AccessExclusiveLock); totalPages--; /* discount metapage */ snprintf(res, sizeof(res), "totalPages: %u\n" "deletedPages: %u\n" "innerPages: %u\n" "leafPages: %u\n" "emptyPages: %u\n" "usedSpace: %.2f kbytes\n" "freeSpace: %.2f kbytes\n" "fillRatio: %.2f%%\n" "leafTuples: " INT64_FORMAT "\n" "innerTuples: " INT64_FORMAT "\n" "innerAllTheSame: " INT64_FORMAT "\n" "leafPlaceholders: " INT64_FORMAT "\n" "innerPlaceholders:" INT64_FORMAT "\n" "leafRedirects: " INT64_FORMAT "\n" "innerRedirects: " INT64_FORMAT, totalPages, deletedPages, innerPages, leafPages, emptyPages, usedSpace / 1024.0, (((double) bufferSize) * ((double) totalPages) - usedSpace) / 1024, 100.0 * (usedSpace / (((double) bufferSize)* ((double) totalPages))), leafTuples, innerTuples, nAllTheSame, nLeafPlaceholder, nInnerPlaceholder, nLeafRedirect, nInnerRedirect); PG_RETURN_TEXT_P(CStringGetTextDatum(res)); }