Thread: Minmax indexes
Hi,

Here's a reviewable version of what I've dubbed Minmax indexes. Some
people said they would like to use some other name for this feature, but
I have yet to hear usable ideas, so for now I will keep calling them
this way. I'm open to proposals, but if you pick something that cannot
be abbreviated "mm" I might have you prepare a rebased version which
renames the files and structs.

The implementation here has been simplified from what I originally
proposed at 20130614222805.GZ5491@eldon.alvh.no-ip.org -- in particular,
I noticed that there's no need to involve aggregate functions at all; we
can just use inequality operators. So the pg_amproc entries are gone;
only the pg_amop entries are necessary.

I've somewhat punted on the question of doing resummarization separately
from vacuuming. Right now, resummarization (as well as other necessary
index cleanup) takes place in amvacuumcleanup. This is not optimal; I
have stated elsewhere that I'd like to create separate maintenance
actions that can be carried out by autovacuum. That would be useful
both for Minmax indexes and GIN indexes (pending insertion list); maybe
others. That's not part of this patch, however.

The design of this stuff is in the file "minmax-proposal" at the top of
the tree. That file is up to date, though it still contains some open
questions that were present in the original proposal. (I have not fixed
some bogosities pointed out by Noah, for instance. I will do that
shortly.) In a final version, that file would be applied as
src/backend/access/minmax/README, most likely.

One area on which I needed to modify core code is IndexBuildHeapScan. I
needed a version that was able to scan only a certain range of pages,
not the entire table, so I introduced a new IndexBuildHeapRangeScan, and
added a quick "heap_scansetlimits" function. I haven't tested that this
works outside of the HeapRangeScan thingy, so it's probably completely
bogus; I'm open to suggestions if people think this should be
implemented differently. In any case, keeping that implementation
together with vanilla IndexBuildHeapScan makes a lot of sense.

One thing still to tackle is when to mark ranges as unsummarized. Right
now, any new tuple on a page range would cause a new index entry to be
created and a new revmap update. This would cause huge index bloat if,
say, a page is emptied and vacuumed and filled with new tuples with
increasing values outside the original range; each new tuple would
create a new index tuple. I have two ideas about this (1. mark the range
as unsummarized the 3rd time we touch the same page range; 2. vacuum the
affected index page if it's full, so we can always keep the index up to
date without causing undue bloat), but I haven't implemented anything
yet.

The "amcostestimate" routine is completely bogus; right now it returns
constant 0, meaning the index is always chosen if it exists.

There are opclasses for int4, numeric and text. The latter doesn't work
at all, because collation info is not passed down at all. I will have
to figure that out (even if I find it unlikely that minmax indexes have
any usefulness on top of text columns). I admit that numeric hasn't been
tested, and it's quite likely that it won't work, mainly because of the
lack of some datumCopy() calls, about which the code contains some
/* XXX */ lines. I think this should be relatively straightforward.
Ideally, the final version of this patch would contain opclasses for all
supported datatypes (i.e. the same ones that have btree opclasses).

I have messed up the opclass information, as evidenced by failures in
the opr_sanity regression test. I will research that later.

There's working contrib/pageinspect support; pg_xlogdump (and wal_debug)
seems to work sanely too.

This patch compiles cleanly under -Werror.

The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
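For readers new to the idea, here is a minimal sketch of what a min/max summary per page range buys at query time. It is not taken from the patch: the heap is modelled as a list of pages holding integers, and `PAGES_PER_RANGE`, `RangeSummary`, `build_index` and `candidate_pages` are hypothetical names chosen for illustration. The only thing it borrows from the message above is the concept of summarizing a range of pages and comparing with inequality operators.

```python
from dataclasses import dataclass
from typing import List, Optional

PAGES_PER_RANGE = 2  # hypothetical value, kept small so the demo is readable


@dataclass
class RangeSummary:
    first_page: int
    last_page: int
    min_val: Optional[int]  # None when the range holds no tuples
    max_val: Optional[int]


def summarize_range(heap: List[List[int]], first_page: int, last_page: int) -> RangeSummary:
    """Scan only pages first_page..last_page and record their min and max."""
    values = [v for page in heap[first_page:last_page + 1] for v in page]
    if not values:
        return RangeSummary(first_page, last_page, None, None)
    return RangeSummary(first_page, last_page, min(values), max(values))


def build_index(heap: List[List[int]]) -> List[RangeSummary]:
    """One summary per page range, covering the whole heap."""
    return [
        summarize_range(heap, start, min(start + PAGES_PER_RANGE, len(heap)) - 1)
        for start in range(0, len(heap), PAGES_PER_RANGE)
    ]


def candidate_pages(index: List[RangeSummary], lo: int, hi: int) -> List[int]:
    """Pages that may contain a value in [lo, hi]; all other pages are skipped."""
    pages: List[int] = []
    for s in index:
        if s.min_val is None:
            continue  # empty range, nothing to visit
        # Only inequality comparisons are needed to rule a range in or out.
        if s.max_val >= lo and s.min_val <= hi:
            pages.extend(range(s.first_page, s.last_page + 1))
    return pages


heap = [[1, 3], [2, 5], [40, 42], [41, 47], [90, 95]]
index = build_index(heap)
print(candidate_pages(index, 41, 45))  # [2, 3]
```

The summaries only narrow down which pages to read; the values on those pages still have to be rechecked against the query condition.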
On Sat, 2013-09-14 at 21:14 -0300, Alvaro Herrera wrote:
> Here's a reviewable version of what I've dubbed Minmax indexes.

Please fix duplicate OID 3177.
On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Here's a reviewable version of what I've dubbed Minmax indexes.
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
PANIC: invalid xlog record length 0
Thom
On 15.09.2013 03:14, Alvaro Herrera wrote:
> + Partial indexes are not supported; since an index is concerned with minimum and
> + maximum values of the involved columns across all the pages in the table, it
> + doesn't make sense to exclude values. Another way to see "partial" indexes
> + here would be those that only considered some pages in the table instead of all
> + of them; but this would be difficult to implement and manage and, most likely,
> + pointless.

Something like this seems completely sensible to me:

create index i_accounts on accounts using minmax (ts) where valid = true;

The situation where that would be useful is if 'valid' accounts are
fairly well clustered, but invalid ones are scattered all over the
table. The minimum and maximum stored in the index would only concern
valid accounts.

- Heikki
> On 16 September 2013 at 11:03 Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
> Something like this seems completely sensible to me:
>
> create index i_accounts on accounts using minmax (ts) where valid = true;
>
> The situation where that would be useful is if 'valid' accounts are
> fairly well clustered, but invalid ones are scattered all over the
> table. The minimum and maximum stored in the index would only concern
> valid accounts.
Here's one that occurs to me:
CREATE INDEX i_billing_id_mm ON billing(id) WHERE paid_in_full IS NOT TRUE;
Note that this would be a frequently moving target and over years of billing, the subset would be quite small compared to the full system (imagine, say, 50k rows out of 20M).
Best Wishes,
Chris Travers
http://www.2ndquadrant.com
PostgreSQL Services, Training, and Support
On 2013-09-16 11:19:19 +0100, Chris Travers wrote:
> > On 16 September 2013 at 11:03 Heikki Linnakangas <hlinnakangas@vmware.com>
> > wrote:
> >
> > Something like this seems completely sensible to me:
> >
> > create index i_accounts on accounts using minmax (ts) where valid = true;
> >
> > The situation where that would be useful is if 'valid' accounts are
> > fairly well clustered, but invalid ones are scattered all over the
> > table. The minimum and maximum stored in the index would only concern
> > valid accounts.

Yes, I wondered the same myself.

> Here's one that occurs to me:
>
> CREATE INDEX i_billing_id_mm ON billing(id) WHERE paid_in_full IS NOT TRUE;
>
> Note that this would be a frequently moving target and over years of billing,
> the subset would be quite small compared to the full system (imagine, say, 50k
> rows out of 20M).

In that case you'd just use a normal btree index, no?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
> On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Here's a reviewable version of what I've dubbed Minmax indexes.
>>
> Thanks for the patch, but I seem to have immediately hit a snag:
>
> pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
> PANIC: invalid xlog record length 0
>

fwiw, this seems to be triggered by ANALYZE.
At least I can trigger it by executing ANALYZE on the table (attached
is a stacktrace of a backend exhibiting the failure).

Another thing is these messages I got when compiling:

"""
mmxlog.c: In function ‘minmax_xlog_revmap_set’:
mmxlog.c:161:14: warning: unused variable ‘blkno’ [-Wunused-variable]
bufpage.c: In function ‘PageIndexDeleteNoCompact’:
bufpage.c:1066:18: warning: ‘lastused’ may be used uninitialized in
this function [-Wmaybe-uninitialized]
"""

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157
Attachment
On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com> wrote:
> fwiw, this seems to be triggered by ANALYZE.
> At least i can trigger it by executing ANALYZE on the table (attached
> is a stacktrace of a backend exhibiting the failure)
I'm able to run ANALYSE manually without it dying:
pgbench=# analyse pgbench_accounts;
ANALYZE
pgbench=# analyse pgbench_accounts;
ANALYZE
pgbench=# create index minmaxtest on pgbench_accounts using minmax (aid);
PANIC: invalid xlog record length 0
Thom
On Tue, Sep 17, 2013 at 3:30 AM, Thom Brown <thom@linux.com> wrote:
> On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com> wrote:
>> fwiw, this seems to be triggered by ANALYZE.
>> At least i can trigger it by executing ANALYZE on the table (attached
>> is a stacktrace of a backend exhibiting the failure)
>
> I'm able to run ANALYSE manually without it dying:

Try inserting some data before the ANALYZE; that will force a
resummarization, which is mentioned in the stack trace of the failure.

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157
On 17 September 2013 14:37, Jaime Casanova <jaime@2ndquadrant.com> wrote:
> try inserting some data before the ANALYZE, that will force a
> resumarization which is mentioned in the stack trace of the failure

I've tried inserting 1 row then ANALYSE and 10,000 rows then ANALYSE, and in
both cases there's no error. But then trying to create the index again
results in my original error.

--
Thom
On Tue, Sep 17, 2013 at 8:43 AM, Thom Brown <thom@linux.com> wrote:
> On 17 September 2013 14:37, Jaime Casanova <jaime@2ndquadrant.com> wrote:
>> try inserting some data before the ANALYZE, that will force a
>> resumarization which is mentioned in the stack trace of the failure
>
> I've tried inserting 1 row then ANALYSE and 10,000 rows then ANALYSE, and in
> both cases there's no error. But then trying to create the index again
> results in my original error.

Ok

So, please confirm if this is the pattern you are following:

CREATE TABLE t1(i int);
INSERT INTO t1 SELECT generate_series(1, 10000);
CREATE INDEX idx1 ON t1 USING minmax (i);

If so, then the attached stack trace (index_failure_thom.txt) should
correspond to the failure you are seeing.

My test was slightly different:

CREATE TABLE t1(i int);
CREATE INDEX idx1 ON t1 USING minmax (i);
INSERT INTO t1 SELECT generate_series(1, 10000);
ANALYZE t1;

and the failure happened at a different time, in resummarization
(attached index_failure_jcm.txt).

But in the end, both failures seem to happen for the same reason: a
record of length 0 at XLogInsert time:

#4 XLogInsert at xlog.c:966
#5 mmSetHeapBlockItemptr at mmrevmap.c:169
#6 mm_doinsert at minmax.c:1410

Actually, if you create a temp table both tests work fine.

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157
Attachment
Thom Brown wrote:

Thanks for testing.

> Thanks for the patch, but I seem to have immediately hit a snag:
>
> pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
> PANIC: invalid xlog record length 0

Silly mistake I had already made in another patch. Here's an
incremental patch which fixes this bug. Apply this on top of previous
minmax-1.patch.

I also renumbered the duplicate OID pointed out by Peter, and fixed the
two compiler warnings reported by Jaime.

Note you'll need to re-initdb in order to get the right catalog entries.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 17 September 2013 22:03, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Silly mistake I had already made in another patch. Here's an
> incremental patch which fixes this bug. Apply this on top of previous
> minmax-1.patch.

Thanks.

Hit another issue with exactly the same procedure:

pgbench=# create index minmaxtest on pgbench_accounts using minmax (aid);
ERROR: lock 176475 is not held

--
Thom
On Tue, September 17, 2013 23:03, Alvaro Herrera wrote:
> [minmax-1.patch. + minmax-2-incr.patch. (and initdb)]

The patches apply and compile OK.

I've not yet really tested; I just wanted to mention that make check gives the following differences:

*** /home/aardvark/pg_stuff/pg_sandbox/pgsql.minmax/src/test/regress/expected/opr_sanity.out 2013-09-17 23:18:31.427356703 +0200
--- /home/aardvark/pg_stuff/pg_sandbox/pgsql.minmax/src/test/regress/results/opr_sanity.out 2013-09-17 23:20:48.208150824 +0200
***************
*** 1076,1081 ****
--- 1076,1086 ----
  2742 | 2 | @@@
  2742 | 3 | <@
  2742 | 4 | =
+ 3847 | 1 | <
+ 3847 | 2 | <=
+ 3847 | 3 | =
+ 3847 | 4 | >=
+ 3847 | 5 | >
  4000 | 1 | <<
  4000 | 1 | ~<~
  4000 | 2 | &<
***************
*** 1098,1104 ****
  4000 | 15 | >
  4000 | 16 | @>
  4000 | 18 | =
! (62 rows)

  -- Check that all opclass search operators have selectivity estimators.
  -- This is not absolutely required, but it seems a reasonable thing
--- 1103,1109 ----
  4000 | 15 | >
  4000 | 16 | @>
  4000 | 18 | =
! (67 rows)

  -- Check that all opclass search operators have selectivity estimators.
  -- This is not absolutely required, but it seems a reasonable thing
***************
*** 1272,1280 ****
  WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
  GROUP BY amname, amsupport, opcname, amprocfamily
  HAVING count(*) != amsupport OR amprocfamily IS NULL;
! amname | opcname | count
! --------+---------+-------
! (0 rows)

  SELECT amname, opcname, count(*)
  FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
--- 1277,1288 ----
  WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
  GROUP BY amname, amsupport, opcname, amprocfamily
  HAVING count(*) != amsupport OR amprocfamily IS NULL;
! amname | opcname | count
! --------+-------------+-------
! minmax | int4_ops | 1
! minmax | text_ops | 1
! minmax | numeric_ops | 1
! (3 rows)

  SELECT amname, opcname, count(*)
  FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
======================================================================

Erik Rijkers
Thom Brown wrote:
> Hit another issue with exactly the same procedure:
>
> pgbench=# create index minmaxtest on pgbench_accounts using minmax (aid);
> ERROR: lock 176475 is not held

That's what I get for restructuring the way buffers are acquired to use
the FSM, and then neglecting to test creation on decently-sized indexes.
Fix attached.

I just realized that xlog replay is also broken.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Erik Rijkers wrote:
> On Tue, September 17, 2013 23:03, Alvaro Herrera wrote:
>
> > [minmax-1.patch. + minmax-2-incr.patch. (and initdb)]
>
> The patches apply and compile OK.
>
> I've not yet really tested; I just wanted to mention that make check gives the following differences:

Oops, I forgot to update the expected file. I had to comment on this
when submitting minmax-2-incr.patch and forgot. First, those extra five
operators are supposed to be there; expected file needs an update. As
for this:

> --- 1277,1288 ----
> WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
> GROUP BY amname, amsupport, opcname, amprocfamily
> HAVING count(*) != amsupport OR amprocfamily IS NULL;
> ! amname | opcname | count
> ! --------+-------------+-------
> ! minmax | int4_ops | 1
> ! minmax | text_ops | 1
> ! minmax | numeric_ops | 1
> ! (3 rows)

I think the problem is that the query is wrong. This is the complete query:

SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
LEFT JOIN pg_amproc p ON amprocfamily = opcfamily AND
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
HAVING count(*) != amsupport OR amprocfamily IS NULL;

It should be, instead, this:

SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
LEFT JOIN pg_amproc p ON amprocfamily = opcfamily AND
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
HAVING count(*) != amsupport AND (amprocfamily IS NOT NULL);

This query is supposed to check that there are no opclasses with a
mismatching number of support procedures; but if the left join returns a
null-extended row for pg_amproc, that means there is no support proc,
yet count(*) will return 1. So count(*) will not match amsupport, and
the row is supposed to be excluded by the amprocfamily IS NULL clause in
HAVING.

Both queries return empty in HEAD, but only the second one correctly
returns empty with the patch applied.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
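The count(*) behaviour described above can be reproduced outside PostgreSQL. The following is a rough sketch using Python's sqlite3 module with three toy tables (`am`, `opclass`, `amproc`) that are hypothetical stand-ins for pg_am, pg_opclass and pg_amproc; the row values, including an access method with zero support procedures, are assumptions made for the demo and are not the real catalog contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins for the catalogs; all names and rows are made up.
    CREATE TABLE am      (amname TEXT, amsupport INTEGER);
    CREATE TABLE opclass (opcname TEXT, amname TEXT);
    CREATE TABLE amproc  (opcname TEXT, procnum INTEGER);
    INSERT INTO am      VALUES ('hash', 1), ('minmax', 0);
    INSERT INTO opclass VALUES ('int4_hash_ops', 'hash'), ('int4_mm_ops', 'minmax');
    INSERT INTO amproc  VALUES ('int4_hash_ops', 1);  -- no rows for int4_mm_ops
""")

QUERY = """
    SELECT am.amname, o.opcname, count(*), count(p.procnum)
    FROM am
    JOIN opclass o ON o.amname = am.amname
    LEFT JOIN amproc p ON p.opcname = o.opcname
    GROUP BY am.amname, am.amsupport, o.opcname, p.opcname
    HAVING {having}
"""


def run(having):
    """Run the sanity query with the given HAVING clause."""
    return conn.execute(QUERY.format(having=having)).fetchall()


# Shape of the existing check: the null-extended row counts as 1 under
# count(*), so an opclass with zero support procs is reported.
original = run("count(*) != am.amsupport OR p.opcname IS NULL")

# Shape of the proposed check: groups made only of a null-extended row
# are ignored.
proposed = run("count(*) != am.amsupport AND p.opcname IS NOT NULL")

print(original)  # [('minmax', 'int4_mm_ops', 1, 0)]
print(proposed)  # []
```

The fourth column shows that count(p.procnum) is 0 for the same group in which count(*) is 1, because count over a column ignores NULLs while count(*) counts the null-extended row.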
On Tue, Sep 17, 2013 at 4:03 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thom Brown wrote:
>
> Thanks for testing.
>
>> Thanks for the patch, but I seem to have immediately hit a snag:
>>
>> pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
>> PANIC: invalid xlog record length 0
>
> Silly mistake I had already made in another patch. Here's an
> incremental patch which fixes this bug. Apply this on top of previous
> minmax-1.patch.
>
> I also renumbered the duplicate OID pointed out by Peter, and fixed the
> two compiler warnings reported by Jaime.
>
> Note you'll need to re-initdb in order to get the right catalog entries.
>

Hi,

Found another problem with these steps:

create table t1 (i int);
create index idx_t1_i on t1 using minmax(i);
insert into t1 select generate_series(1, 2000000);

ERROR: could not read block 1 in file "base/12645/16397_vm": read only 0 of 8192 bytes
STATEMENT: insert into t1 select generate_series(1, 2000000);
ERROR: could not read block 1 in file "base/12645/16397_vm": read only 0 of 8192 bytes

After that, I keep receiving these messages (when autovacuum tries to
vacuum this table):

ERROR: could not truncate file "base/12645/16397_vm" to 2 blocks: it's only 1 blocks now
CONTEXT: automatic vacuum of table "postgres.public.t1"
ERROR: could not truncate file "base/12645/16397_vm" to 2 blocks: it's only 1 blocks now
CONTEXT: automatic vacuum of table "postgres.public.t1"

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157
Jaime Casanova wrote:
> Found another problem with the this steps:
>
> create table t1 (i int);
> create index idx_t1_i on t1 using minmax(i);
> insert into t1 select generate_series(1, 2000000);
> ERROR: could not read block 1 in file "base/12645/16397_vm": read
> only 0 of 8192 bytes

Thanks. This was a trivial off-by-one bug; fixed in the attached patch.
While studying it, I noticed that I was also failing to notice extension
of the fork by another process. I have tried to fix that also in the
current patch, but I'm afraid that a fully robust solution for this will
involve having a cached fork size in the index's relcache entry -- just
like we have smgr_vm_nblocks. In fact, since the revmap fork is
currently reusing the VM forknum, I might even be able to use the same
variable to keep track of the fork size. But I don't really like this
bit of reusing the VM forknum for revmap, so I've refrained from
extending that assumption into further code for the time being.

There was also a bug that we would try to initialize a revmap page twice
during recovery, if two backends thought they needed to extend it; that
would cause the data written by the first extender to be lost.

This patch applies on top of the two previous incremental patches. I
will send a full patch later, including all those fixes and the fix for
the opr_sanity regression test.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Wed, September 25, 2013 00:14, Alvaro Herrera wrote:
> [minmax-4-incr.patch]

After a --data-checksums initdb (successful), the following error came up
after the statement:

create index t_minmax_idx on t using minmax (r);

WARNING: page verification failed, calculated checksum 25951 but expected 0
ERROR: invalid page in block 1 of relation base/21324/26267_vm

It happens reliably, every time I run the program.

Below is the whole program that I used.

Thanks,

Erik Rijkers

#!/bin/sh
t=t
if [[ 1 -eq 1 ]]; then
echo "
drop table if exists $t ;
create table $t as select i, cast( random() * 10^9 as integer ) as r from generate_series(1, 1000000) as f(i) ;
analyze $t;
table $t limit 5;
select count(*) from $t;
explain analyze select min(r), max(r) from $t;
select min(r), max(r) from $t;
create index ${t}_minmax_idx on $t using minmax (r);
analyze $t;
explain analyze select min(r), max(r) from $t;
select min(r), max(r) from $t;
" | psql
fi
On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Hi, > > Here's a reviewable version of what I've dubbed Minmax indexes. Some > people said they would like to use some other name for this feature, but > I have yet to hear usable ideas, so for now I will keep calling them > this way. I'm open to proposals, but if you pick something that cannot > be abbreviated "mm" I might have you prepare a rebased version which > renames the files and structs. > > The implementation here has been simplified from what I originally > proposed at 20130614222805.GZ5491@eldon.alvh.no-ip.org -- in particular, > I noticed that there's no need to involve aggregate functions at all; we > can just use inequality operators. So the pg_amproc entries are gone; > only the pg_amop entries are necessary. > > I've somewhat punted on the question of doing resummarization separately > from vacuuming. Right now, resummarization (as well as other necessary > index cleanup) takes place in amvacuumcleanup. This is not optimal; I > have stated elsewhere that I'd like to create separate maintenance > actions that can be carried out by autovacuum. That would be useful > both for Minmax indexes and GIN indexes (pending insertion list); maybe > others. That's not part of this patch, however. > > The design of this stuff is in the file "minmax-proposal" at the top of > the tree. That file is up to date, though it still contains some open > questions that were present in the original proposal. (I have not fixed > some bogosities pointed out by Noah, for instance. I will do that > shortly.) In a final version, that file would be applied as > src/backend/access/minmax/README, most likely. > > One area on which I needed to modify core code is IndexBuildHeapScan. I > needed a version that was able to scan only a certain range of pages, > not the entire table, so I introduced a new IndexBuildHeapRangeScan, and > added a quick "heap_scansetlimits" function. 
I haven't tested that this > works outside of the HeapRangeScan thingy, so it's probably completely > bogus; I'm open to suggestions if people think this should be > implemented differently. In any case, keeping that implementation > together with vanilla IndexBuildHeapScan makes a lot of sense. > > One thing still to tackle is when to mark ranges as unsummarized. Right > now, any new tuple on a page range would cause a new index entry to be > created and a new revmap update. This would cause huge index bloat if, > say, a page is emptied and vacuumed and filled with new tuples with > increasing values outside the original range; each new tuple would > create a new index tuple. I have two ideas about this (1. mark range as > unsummarized if 3rd time we touch the same page range; Why only at 3rd time? Doesn't it need to be precise, like if someone inserts a row having value greater than max value of corresponding index tuple, then that index tuple's corresponding max value needs to be updated and I think its updated with the help of validity map. For example: considering we need to store below info for each index tuple: In each index tuple (corresponding to onepage range), we store: - first block this tuple applies to - last block this tuple applies to - for each indexedcolumn: * min() value across all tuples in the range * max() value across all tuples in the range Assume first and last block for index tuple is same (assume block no. 'x') and min value is 5 and max is 10. Now user insert/update value in block 'x' such that max value of index col. is 11, if we don't update corresponding index tuple or at least invalidate it, won't it lead to wrong results? > 2. vacuum the > affected index page if it's full, so we can maintain the index always up > to date without causing unduly bloat), but I haven't implemented > anything yet. > > The "amcostestimate" routine is completely bogus; right now it returns > constant 0, meaning the index is always chosen if it exists. 
I think for a first version you might want to keep things simple, but there should be some way for the optimizer to select this index. So rather than choosing it whenever it is present, we can make the optimizer choose it when someone says set enable_minmax index to true.

How about keeping this up-to-date during foreground operations? Vacuum/maintainer tasks maintaining things usually have problems with bloat, and then we need to optimize or work around issues. A lot of people have raised this or a similar point previously, and from what I read you are of the opinion that it would be slow. I really don't think it can be so slow that adding so much handling to get it up-to-date by some maintainer task is worthwhile. Currently there are systems like Oracle where index clean-up is mainly done during foreground operations, so this alone cannot be a reason for slowness.

Comparing the logic with IOS is also not completely right, as for IOS we need to know each tuple's visibility, which is not the case here.

Now it can happen that the min and max values are sometimes not right because the operation is later rolled back, but I think such cases will be rare, and we can find some way to handle them, maybe in a maintainer task only; the handling will be much simpler.
On Windows, patch gives below compilation errors:

src\backend\access\minmax\mmtuple.c(96): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(96): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(96): error C2133: 'values' : unknown size
src\backend\access\minmax\mmtuple.c(97): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(97): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(97): error C2133: 'nulls' : unknown size
src\backend\access\minmax\mmtuple.c(102): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(102): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(102): error C2133: 'phony_nullbitmap' : unknown size
src\backend\access\minmax\mmtuple.c(110): warning C4034: sizeof returns 0
src\backend\access\minmax\mmtuple.c(246): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(246): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(246): error C2133: 'values' : unknown size
src\backend\access\minmax\mmtuple.c(247): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(247): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(247): error C2133: 'allnulls' : unknown size
src\backend\access\minmax\mmtuple.c(248): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(248): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(248): error C2133: 'hasnulls' : unknown size

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
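The error triple above (C2057, C2466, C2133 on the same line) looks consistent with arrays whose size is a run-time value, which MSVC does not accept. As a rough sketch only, and assuming that is indeed the cause, the usual portable fix is to allocate such scratch arrays dynamically. The struct and function names below are hypothetical and are not taken from mmtuple.c; backend code would use palloc/pfree rather than calloc/free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical scratch space for forming a tuple: two slots per column. */
typedef struct ColumnScratch
{
    int   natts;    /* number of indexed columns */
    long *values;   /* natts * 2 entries: min and max per column */
    bool *nulls;    /* natts * 2 null flags */
} ColumnScratch;

/*
 * A declaration such as "long values[natts * 2];" needs natts to be a
 * compile-time constant on compilers without variable-length arrays.
 * Allocating dynamically avoids that requirement.
 * Returns NULL if any allocation fails.
 */
ColumnScratch *
scratch_create(int natts)
{
    ColumnScratch *s;

    assert(natts > 0);
    s = malloc(sizeof(ColumnScratch));
    if (s == NULL)
        return NULL;
    s->natts = natts;
    s->values = calloc((size_t) natts * 2, sizeof(long));
    s->nulls = calloc((size_t) natts * 2, sizeof(bool));
    if (s->values == NULL || s->nulls == NULL)
    {
        free(s->values);
        free(s->nulls);
        free(s);
        return NULL;
    }
    return s;
}

/* Release everything allocated by scratch_create(). */
void
scratch_free(ColumnScratch *s)
{
    if (s == NULL)
        return;
    free(s->values);
    free(s->nulls);
    free(s);
}
```

The trade-off is an extra allocation per call in exchange for code that builds on compilers without C99 variable-length arrays.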
Amit Kapila wrote:
> On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > One thing still to tackle is when to mark ranges as unsummarized. Right
> > now, any new tuple on a page range would cause a new index entry to be
> > created and a new revmap update. This would cause huge index bloat if,
> > say, a page is emptied and vacuumed and filled with new tuples with
> > increasing values outside the original range; each new tuple would
> > create a new index tuple. I have two ideas about this (1. mark range as
> > unsummarized if 3rd time we touch the same page range;
>
> Why only at 3rd time?
> Doesn't it need to be precise, like if someone inserts a row having
> value greater than max value of corresponding index tuple,
> then that index tuple's corresponding max value needs to be updated
> and I think its updated with the help of validity map.

Of course. Note I no longer have the concept of a validity map; I have switched things to use a "reverse range map", or revmap for short. The revmap is responsible for mapping each page number to an individual index TID. If the TID stored in the revmap is InvalidTid, that means the range is not summarized. Unsummarized ranges are always considered as "match query quals", and thus all tuples in them are returned in the bitmap for heap recheck.

The way it works currently is that any tuple insert (that's outside the bounds of the current index tuple) causes a new index tuple to be created, and the revmap is updated to point to the new index tuple. The old index tuple is orphaned and will be deleted at next vacuum. This works fine. However, the problem is excess orphaned tuples; I don't want a long series of updates to create many orphaned dead tuples. Instead I would like the system to, at some point, stop creating new index tuples and instead set the revmap to InvalidTid. That would stop the index bloat.
> For example: > considering we need to store below info for each index tuple: > In each index tuple (corresponding to one page range), we store: > - first block this tuple applies to > - last block this tuple applies to > - for each indexed column: > * min() value across all tuples in the range > * max() value across all tuples in the range > > Assume first and last block for index tuple is same (assume block > no. 'x') and min value is 5 and max is 10. > Now user insert/update value in block 'x' such that max value of > index col. is 11, if we don't update corresponding > index tuple or at least invalidate it, won't it lead to wrong results? Sure, that would result in wrong results. Fortunately that's not how I am suggesting to do it. I note you're reading an old version of the design. I realize now that this is my mistake because instead of posting the new design in the cover letter for the patch, I only put it in the "minmax-proposal" file. Please give that file a read to see how the design differs from the design I originally posted in the old thread. > > The "amcostestimate" routine is completely bogus; right now it returns > > constant 0, meaning the index is always chosen if it exists. > > I think for first version, you might want to keep things simple, but > there should be some way for optimizer to select this index. > So rather than choose if it is present, we can make optimizer choose > when some-one says set enable_minmax index to true. Well, enable_bitmapscan already disables minmax indexes, just like it disables other indexes. > How about keeping this up-to-date during foreground operations. > Vacuum/Maintainer task maintaining things usually have problems of > bloat and > then we need optimize/workaround issues. > Lot of people have raised this or similar point previously and what > I read you are of opinion that it seems to be slow. 
Well, the current code does keep the index up to date -- I did choose to implement what people suggested :-) > Now it can so happen that min and max values are sometimes not right > because later the operation is rolled back, but I think such cases > will > be less and we can find some way to handle such cases may be > maintainer task only, but the handling will be quite simpler. Agreed. > On Windows, patch gives below compilation errors: > src\backend\access\minmax\mmtuple.c(96): error C2057: expected > constant expression I have fixed all these compile errors (fix attached). Thanks for reporting them. I'll post a new version shortly. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
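The insert behaviour described in this reply (replace the index tuple when a value falls outside its bounds, orphan the old one, and eventually give up and mark the range unsummarized) can be sketched roughly as below. This is a toy in-memory model, not the patch's code: the struct layout, the fixed array sizes, INVALID_TID as -1 and the MAX_REPLACEMENTS cap are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_TID      (-1)  /* hypothetical marker: range not summarized */
#define MAX_RANGES       8     /* toy limits for this sketch */
#define MAX_INDEX_TUPLES 64
#define MAX_REPLACEMENTS 2     /* hypothetical cap on tuple replacements */

typedef struct { int32_t min, max; bool dead; } MMTuple;

typedef struct
{
    uint32_t pages_per_range;
    int      revmap[MAX_RANGES];    /* range number -> slot in tuples[] */
    int      replaced[MAX_RANGES];  /* replacements since last summarization */
    MMTuple  tuples[MAX_INDEX_TUPLES];
    int      ntuples;
} MMIndex;

void
mm_init(MMIndex *idx, uint32_t pages_per_range)
{
    int i;

    assert(pages_per_range > 0);
    idx->pages_per_range = pages_per_range;
    idx->ntuples = 0;
    for (i = 0; i < MAX_RANGES; i++)
    {
        idx->revmap[i] = INVALID_TID;
        idx->replaced[i] = 0;
    }
}

/* Store a fresh summary for a range, as a (re)summarization pass would. */
void
mm_summarize(MMIndex *idx, uint32_t range, int32_t min, int32_t max)
{
    assert(range < MAX_RANGES && idx->ntuples < MAX_INDEX_TUPLES);
    idx->tuples[idx->ntuples].min = min;
    idx->tuples[idx->ntuples].max = max;
    idx->tuples[idx->ntuples].dead = false;
    idx->revmap[range] = idx->ntuples++;
    idx->replaced[range] = 0;
}

/* Maintain the index for a heap insert of "value" into "heap_block". */
void
mm_insert(MMIndex *idx, uint32_t heap_block, int32_t value)
{
    uint32_t range = heap_block / idx->pages_per_range;
    int      tid;
    MMTuple *old;
    MMTuple *fresh;

    assert(range < MAX_RANGES);
    tid = idx->revmap[range];
    if (tid == INVALID_TID)
        return;                 /* unsummarized: nothing to maintain */
    old = &idx->tuples[tid];
    if (value >= old->min && value <= old->max)
        return;                 /* still inside the stored bounds */

    old->dead = true;           /* orphaned; a later vacuum removes it */
    if (idx->replaced[range] >= MAX_REPLACEMENTS)
    {
        idx->revmap[range] = INVALID_TID;   /* stop the bloat */
        return;
    }
    assert(idx->ntuples < MAX_INDEX_TUPLES);
    fresh = &idx->tuples[idx->ntuples];
    fresh->min = value < old->min ? value : old->min;
    fresh->max = value > old->max ? value : old->max;
    fresh->dead = false;
    idx->revmap[range] = idx->ntuples++;
    idx->replaced[range]++;
}
```

With the cap set to 2, the third out-of-bounds insert into the same range marks it unsummarized, which loosely corresponds to idea 1 from the original post.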
Erik Rijkers wrote: > After a --data-checksums initdb (successful), the following error came up: > > after the statement: create index t_minmax_idx on t using minmax (r); > > WARNING: page verification failed, calculated checksum 25951 but expected 0 > ERROR: invalid page in block 1 of relation base/21324/26267_vm > > it happens reliably. every time I run the program. Thanks for the report. That's fixed with the attached. > Below is the whole program that I used. Hmm, this test program shows that you're trying to use the index to optimize min() and max() queries, but that's not what these indexes do. You will need to use operators > >= = <= < (or BETWEEN, which is the same thing) to see your index in action. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
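To illustrate why these indexes serve comparison quals rather than min()/max() aggregates, here is a rough sketch of the per-range check a scan could perform: given a range's stored minimum and maximum, decide whether any tuple in it could satisfy the qual. The enum, struct and function names are hypothetical and not taken from the patch; the only behaviour assumed is that a range without a summary must be treated as a possible match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operator tags for the five comparison operators. */
typedef enum { OP_LT, OP_LE, OP_EQ, OP_GE, OP_GT } CmpOp;

typedef struct
{
    bool    summarized;   /* false: no min/max known for this range */
    int32_t min, max;
} RangeSummary;

/*
 * Return true if some value v with min <= v <= max could satisfy
 * "v op key".  A true result only means the range must be rechecked
 * against the heap; a false result lets the scan skip the range.
 */
bool
range_may_match(const RangeSummary *r, CmpOp op, int32_t key)
{
    if (!r->summarized)
        return true;              /* nothing known: cannot exclude */
    switch (op)
    {
        case OP_LT: return r->min < key;
        case OP_LE: return r->min <= key;
        case OP_EQ: return r->min <= key && key <= r->max;
        case OP_GE: return r->max >= key;
        case OP_GT: return r->max > key;
    }
    return true;                  /* unknown operator: be conservative */
}

/* BETWEEN lo AND hi is the conjunction of ">= lo" and "<= hi". */
bool
range_may_match_between(const RangeSummary *r, int32_t lo, int32_t hi)
{
    return range_may_match(r, OP_GE, lo) && range_may_match(r, OP_LE, hi);
}
```

The check can only exclude ranges; it never yields the minimum or maximum of the whole table, so a bare min() or max() query gains nothing from it in this design.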
Here's an updated version of this patch, with fixes to all the bugs reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and Amit Kapila for the reports. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
> [minmax-5.patch]

I have the impression it's not quite working correctly.

The attached program returns different results for different values of enable_bitmapscan (consistently).

( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected? For a 100M row table, max_locks_per_transaction=1024 was not enough; I set it to 2048. Might be worth some documentation, eventually. )

From eyeballing the results it looks like the minmax result (i.e. the result set with enable_bitmapscan = 1) yields only the last part, because only the 'last' rows seem to be present (see the values in column i in table tmm in the attached program).

Thanks,

Erikjan Rijkers
On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
> On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
>
>> [minmax-5.patch]
>
> I have the impression it's not quite working correctly.
>
> The attached program returns different results for different values of enable_bitmapscan (consistently).
>
> ( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected? For a 100M row
> table, max_locks_per_transaction=1024 was not enough; I set it to 2048. Might be worth some documentation, eventually. )
>
> From eyeballing the results it looks like the minmax result (i.e. the result set with enable_bitmapscan = 1) yields only
> the last part because the only 'last' rows seem to be present (see the values in column i in table tmm in the attached
> program).

Looking back at that, I realize I should have added a bit more detail on that test.sh program and its output (attached on previous mail).

test.sh creates a table tmm and a minmax index on that table:

testdb=# \d tmm
      Table "public.tmm"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer |
 r      | integer |
Indexes:
    "tmm_minmax_idx" minmax (r)

The following shows the problem: the same search with the minmax index on versus off gives different result sets:

testdb=# set enable_bitmapscan=0; select count(*) from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.473 ms
 count
-------
  1261
(1 row)

Time: 7.764 ms

testdb=# set enable_bitmapscan=1; select count(*) from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.471 ms
 count
-------
     3
(1 row)

Time: 1.014 ms

testdb=# set enable_bitmapscan =1; select * from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.615 ms
  i   |     r
------+-----------
 9945 |  45405603
 9951 | 102552485
 9966 |  63763962
(3 rows)

Time: 0.984 ms

testdb=# set enable_bitmapscan=0; select * from ( select * from tmm where r between symmetric 19494484 and 145288238 order by i desc limit 10) f order by i ;
SET
Time: 0.470 ms
  i   |     r
------+-----------
 9852 | 114996906
 9858 |  69907169
 9875 |  43341583
 9894 | 127862657
 9895 |  44740033
 9911 |  51797553
 9916 |  58538774
 9945 |  45405603
 9951 | 102552485
 9966 |  63763962
(10 rows)

Time: 8.704 ms
testdb=#

If enable_bitmapscan=1 (i.e. using the minmax index), then only some values are retrieved (in this case 3 rows). It turns out those are always the last N rows of the full result set (i.e. with enable_bitmapscan=0).

Erikjan Rijkers
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Here's an updated version of this patch, with fixes to all the bugs > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > Amit Kapila for the reports. I'm not very happy with the use of a separate relation fork for storing this data. Using an existing fork number rather than creating a new one avoids some of them (like, the fact that we loop over all known fork numbers in various places, and adding another one will add latency in all of those places, particularly when there is a system call in the loop) but not all of them (like, what happens if the index is unlogged? we have provisions to reset the main fork but any others are just removed; is that OK?), and it also creates some new ones (like, files having misleading names). More generally, I fear we really opened a bag of worms with this relation fork stuff. Every time I turn around I run into a problem that could be solved by adding another relation fork. I'm not terribly sure that it was a good idea to go that way to begin with, because we've got customers who are unhappy about 3 files/heap due to inode consumption and slow directory lookups. I think we would have been smarter to devise a strategy for storing the fsm and vm pages within the main fork in some fashion, and I tend to think that's the right solution here as well. Of course, it may be hopeless to put the worms back in the can at this point, and surely these indexes will be lightly used compared to heaps, so it's not incrementally exacerbating the problems all that much. But I still feel uneasy about widening use of that mechanism. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas escribió: > On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Here's an updated version of this patch, with fixes to all the bugs > > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > > Amit Kapila for the reports. > > I'm not very happy with the use of a separate relation fork for > storing this data. I understand this opinion, as I considered it myself while developing it. Also, GIN already does things this way. Perhaps I should just bite the bullet and do this. > Using an existing fork number rather than creating > a new one avoids some of them (like, the fact that we loop over all > known fork numbers in various places, and adding another one will add > latency in all of those places, particularly when there is a system > call in the loop) but not all of them (like, what happens if the index > is unlogged? we have provisions to reset the main fork but any others > are just removed; is that OK?), and it also creates some new ones > (like, files having misleading names). All good points. Index scans will normally access the revmap in sequential fashion; it would be enough to chain revmap pages, keeping a single block number in the metapage pointing to the first one, and subsequent ones are accessed from a "next" block number in each page. However, heap insertion might need to access a random revmap page, and this would be too slow. I think it would be enough to keep an array of block numbers in the index's metapage; the metapage would be share locked on every scan and insert, but that's not a big deal because exclusive lock would only be needed on the metapage to extend the revmap, which would be a very infrequent operation. As this will require some rework to this code, I think it's fair to mark this as returned with feedback for the time being. I will return with an updated version soon, fixing the relation fork issue as well as the locking and visibility bugs reported by Erik. 
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
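The metapage idea in the message above (an array of block numbers in the index's metapage, so that a heap insert can reach an arbitrary revmap page without walking a chain) might look roughly like this. It is a sketch only: the struct layout, the array size and the entries-per-page figure are invented for illustration and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_BLOCK           UINT32_MAX
#define ENTRIES_PER_REVMAP_PAGE 1000   /* hypothetical: TIDs per revmap page */
#define MAX_REVMAP_PAGES        16     /* hypothetical metapage array size */

/* Hypothetical metapage contents. */
typedef struct
{
    uint32_t pages_per_range;                  /* heap pages per summary */
    uint32_t nrevmap_pages;                    /* array entries in use */
    uint32_t revmap_blocks[MAX_REVMAP_PAGES];  /* logical page -> index block */
} MMMetaPage;

typedef struct
{
    uint32_t block;    /* index block holding the revmap entry */
    uint32_t offset;   /* entry position within that block */
} RevmapLocation;

/*
 * Find where the revmap entry for a heap block lives.  Returns a location
 * with block == INVALID_BLOCK when the revmap does not yet cover the heap
 * block, i.e. when it would have to be extended first.
 */
RevmapLocation
revmap_locate(const MMMetaPage *meta, uint32_t heap_block)
{
    RevmapLocation loc = { INVALID_BLOCK, 0 };
    uint32_t       range;
    uint32_t       logical;

    assert(meta->pages_per_range > 0);
    range = heap_block / meta->pages_per_range;
    logical = range / ENTRIES_PER_REVMAP_PAGE;
    if (logical >= meta->nrevmap_pages)
        return loc;
    loc.block = meta->revmap_blocks[logical];
    loc.offset = range % ENTRIES_PER_REVMAP_PAGE;
    return loc;
}
```

A lookup like this only reads the metapage, which fits the share-lock argument made above; only extending the array would need the exclusive lock.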
Erik Rijkers wrote: > I have the impression it's not quite working correctly. > > The attached program returns different results for different values of enable_bitmapscan (consistently). Clearly there's some bug somewhere. I'll investigate it more. > ( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected? For a 100Mrow > table, max_locks_per_transaction=1024 was not enough; I set it to 2048. Might be worth some documentation, eventually.) Not documentation -- that would also be a bug which needs to be fixed. Thanks for testing. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 9/26/13 12:00 PM, Robert Haas wrote:
> More generally, I fear we really opened a bag of worms with this
> relation fork stuff. Every time I turn around I run into a problem
> that could be solved by adding another relation fork. I'm not
> terribly sure that it was a good idea to go that way to begin with,
> because we've got customers who are unhappy about 3 files/heap due to
> inode consumption and slow directory lookups. I think we would have
> been smarter to devise a strategy for storing the fsm and vm pages
> within the main fork in some fashion, and I tend to think that's the
> right solution here as well. Of course, it may be hopeless to put the
> worms back in the can at this point, and surely these indexes will be
> lightly used compared to heaps, so it's not incrementally exacerbating
> the problems all that much. But I still feel uneasy about widening
> use of that mechanism.

Why would we add additional code complexity when forks do the trick? That seems like a step backwards, not forward.

If the only complaint about forks is directory traversal why wouldn't we go with the well established practice of using multiple directories instead of glomming everything into one place?

--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby <jim@nasby.net> wrote: > Why would we add additional code complexity when forks do the trick? That > seems like a step backwards, not forward. Well, they sorta do the trick, but see e.g. commit ece01aae479227d9836294b287d872c5a6146a11. I doubt that's the only code that's poorly-optimized for multiple forks; IOW, every time someone adds a new fork, there's a system-wide cost to that, even if that fork is only used in a tiny percentage of the relations that exist in the system. It's tempting to think that we can use the fork mechanism every time we have multiple logical "streams" of blocks within a relation and don't want to figure out a way to multiplex them onto the same physical file. However, the reality is that the fork mechanism isn't up to the job. I certainly don't want to imply that we shouldn't have gone in that direction - both the fsm and the vm are huge steps forward, and we wouldn't have gotten them in 8.4 without that mechanism. But they haven't been entirely without their own pain, too, and that pain is going to grow the more we push in the direction of relying on forks. > If the only complaint about forks is directory traversal why wouldn't we go > with the well established practice of using multiple directories instead of > glomming everything into one place? That's not the only complaint about forks - but I support what you're proposing there anyhow, because it will be helpful to users with lots of relations regardless of what we do or do not decide to do about forks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Amit Kapila wrote:
>> On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>
>> On Windows, patch gives below compilation errors:
>> src\backend\access\minmax\mmtuple.c(96): error C2057: expected
>> constant expression
>
> I have fixed all these compile errors (fix attached). Thanks for
> reporting them. I'll post a new version shortly.

Thanks for fixing it. In the last few days I have spent some time reading about minmax or equivalent indexes in other databases (Netezza and Oracle) and going through some parts of your proposal. It's a fairly big patch and needs much more time, but I would like to share the findings/thoughts I have developed so far.

Firstly, about the interface and use case: as far as I could understand, other databases provide this index automatically rather than having a separate Create Index command, which may be because such an index is mainly useful when the data is ordered, or distributed in such a way that it helps repeatedly executed queries. You have proposed it as a command, which means the user needs to take care of it. I find that okay for a first version; later maybe we can also have some optimisations so that it gets created automatically.

For the page range, if I read correctly, you have currently used a hash define. Do you want to expose it to the user in some way, like a GUC, or maintain it internally and assign the right value based on the performance of different queries?

Operations on this index seem to be very fast; for example, Oracle has this as an in-memory structure, and I read that in Netezza write operations don't carry any significant overhead for zone maps compared to other indexes. So shouldn't we consider not WAL-logging it?
OTOH, those structures get created automatically in those databases, so that might be okay there; but if we provide it as a command, then the user might be bothered if he doesn't find the index there automatically after a server restart.

Few questions and observations:

1.
+ When a new heap tuple is inserted in a summarized page range, it is possible to
+ compare the existing index tuple with the new heap tuple. If the heap tuple is
+ outside the minimum/maximum boundaries given by the index tuple for any indexed
+ column (or if the new heap tuple contains null values but the index tuple
+ indicate there are no nulls), it is necessary to create a new index tuple with
+ the new values. To do this, a new index tuple is inserted, and the reverse range
+ map is updated to point to it. The old index tuple is left in place, for later
+ garbage collection.

Is there a reason why we can't directly update the value rather than doing a new insert in the index? As I understand it, for other indexes like btree we do this because we might need to roll back, but here, even if a rollback happens after updating the min or max value, it will not cause any harm (tuple loss).

2.
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is not summarized.

3.
It might be better if you could mention when the range map will point to an invalid TID. It's not explained in your proposal, but you have used it in your proposal to explain some other things.

4.
Range reverse map is good terminology, but isn't range translation map better? I don't mind either way; it's just a thought that came to my mind while understanding the concept of the range reverse map.

5.
    /*
     * As above, except that instead of scanning the complete heap, only the given
     * range is scanned.  Scan to end-of-rel can be signalled by passing
     * InvalidBlockNumber as end block number.
     */
    double
    IndexBuildHeapRangeScan(Relation heapRelation,
                            Relation indexRelation,
                            IndexInfo *indexInfo,
                            bool allow_sync,
                            BlockNumber start_blockno,
                            BlockNumber numblocks,
                            IndexBuildCallback callback,
                            void *callback_state)

In the comment you have used "end block number"; which parameter does it refer to? I could see only start_blockno and numblocks.

6.
Currently you are passing 0 as start block and InvalidBlockNumber as number of blocks; what's the logic for it?

    return IndexBuildHeapRangeScan(heapRelation, indexRelation,
                                   indexInfo, allow_sync,
                                   0, InvalidBlockNumber,
                                   callback, callback_state);

7.
In mmbuildCallback, a tuple is only added to the minmax index if it satisfies the page range; otherwise this can lead to wasting a big scan in case the page range is large (1280 pages, as you mentioned in one of your mails). Why can't we include it at the end of the scan?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
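Questions 5 and 6 both turn on how a (start_blockno, numblocks) pair maps onto the blocks actually scanned. A rough sketch of one plausible reading follows: InvalidBlockNumber as numblocks means "through the end of the relation", so (0, InvalidBlockNumber) is a full scan. The ScanLimits struct and resolve_scan_limits function are hypothetical helpers for illustration only; BlockNumber and InvalidBlockNumber are redefined locally as stand-ins mirroring PostgreSQL's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins mirroring PostgreSQL's block number definitions. */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical result type: scan blocks first .. end - 1. */
typedef struct
{
    BlockNumber first;
    BlockNumber end;     /* exclusive upper bound */
} ScanLimits;

/*
 * Translate (start_blockno, numblocks) into a half-open block interval.
 * numblocks == InvalidBlockNumber, or any count reaching past the end of
 * the relation, is clamped to the relation's current size.
 */
ScanLimits
resolve_scan_limits(BlockNumber rel_nblocks,
                    BlockNumber start_blockno,
                    BlockNumber numblocks)
{
    ScanLimits lim;

    assert(start_blockno <= rel_nblocks);
    lim.first = start_blockno;
    if (numblocks == InvalidBlockNumber ||
        numblocks > rel_nblocks - start_blockno)
        lim.end = rel_nblocks;
    else
        lim.end = start_blockno + numblocks;
    return lim;
}
```

Under this reading the code comment's "end block number" would more accurately say "number of blocks", which is the mismatch question 5 points at.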
On Fri, Sep 27, 2013 at 11:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Amit Kapila escribió: >>> On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera >>> <alvherre@2ndquadrant.com> wrote: >> > >> >>> On Windows, patch gives below compilation errors: >>> src\backend\access\minmax\mmtuple.c(96): error C2057: expected >>> constant expression >> >> I have fixed all these compile errors (fix attached). Thanks for >> reporting them. I'll post a new version shortly. > > Thanks for fixing it. In last few days I had spent some time > reading about minmax or equivalent indexes in other databases (Netezza > and Oracle) and going through some parts of your proposal. Its a bit > bigger patch and needs much more time, but I would like to share my > findings/thoughts I had developed till now. > > Firstly about interface and use case, as far as I could understand > other databases provide this index automatically rather than having a > separate Create Index command which may be because such an index can > be mainly useful when the data is ordered or if it's distributed in > such a way that it's quite useful for repeatedly executing queries. > You have proposed it as a command which means user needs to take care > of it which I find is okay for first version, later may be we can also > have some optimisations so that it can get created automatically. > For the page range, If I read correctly, currently you have used hash > define, do you want to expose it to user in some way like GUC or > maintain it internally and assign the right value based on performance > of different queries? > > Operations on this index seems to be very fast, like Oracle has this > as an in-memory structure and I read in Netezza that write operations > doesn't carry any significant overhead for zone maps as compare to > other indexes, so shouldn't we consider it to be without WAL logged? 
> OTOH I think because these structures get automatically created in > those databases, so it might be okay but if we provide it as a > command, then user might be bothered if he didn't find it > automatically on server restart. > > Few Questions and observations: > 1. > + When a new heap tuple is inserted in a summarized page range, it is > possible to > + compare the existing index tuple with the new heap tuple. If the > heap tuple is > + outside the minimum/maximum boundaries given by the index tuple for > any indexed > + column (or if the new heap tuple contains null values but the index tuple > + indicate there are no nulls), it is necessary to create a new index tuple with > + the new values. To do this, a new index tuple is inserted, and the > reverse range > + map is updated to point to it. The old index tuple is left in > place, for later > + garbage collection. > > > Is there a reason why we can't directly update the value rather then > new insert in index, as I understand for other indexes like btree > we do this because we might need to rollback, but here even if after > updating the min or max value, rollback happens, it will not cause > any harm (tuple loss). > > 2. > + If the reverse range map points to an invalid TID, the corresponding > page range > + is not summarized. > > 3. > It might be better if you can mention when range map will point to an > invalid TID, it's not explained in your proposal, but you have used it > in you proposal to explain some other things. > > 4. > Range reverse map is a good terminology, but isn't Range translation > map better. I don't mind either way, it's just a thought came to my > mind while understanding concept of Range Reverse map. > > 5. > /* > * As above, except that instead of scanning the complete heap, only the given > * range is scanned. Scan to end-of-rel can be signalled by passing > * InvalidBlockNumber as end block number. 
> */ > double > IndexBuildHeapRangeScan(Relation heapRelation, > Relation indexRelation, > IndexInfo *indexInfo, > bool allow_sync, > BlockNumber start_blockno, > BlockNumber numblocks, > IndexBuildCallback callback, > void *callback_state) > > In comments you have used end block number, which parameter does it > refer to? I could see only start_blockno and numb locks? > > 6. > currently you are passing 0 as start block and InvalidBlockNumber as > number of blocks, what's the logic for it? > return IndexBuildHeapRangeScan(heapRelation, indexRelation, > indexInfo, allow_sync, > 0, InvalidBlockNumber, > callback, callback_state); I got it, I think here it means scan all the pages. > 7. > In mmbuildCallback, it only add's tuple to minmax index, if it > satisfies page range, else this can lead to waste of big scan incase > page range is large (1280 pages as you mentiones in one of your > mails). Why can't we include it end of scan? > > > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com
On 9/26/13 2:46 PM, Robert Haas wrote:
> On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby <jim@nasby.net> wrote:
>> Why would we add additional code complexity when forks do the trick? That
>> seems like a step backwards, not forward.
>
> Well, they sorta do the trick, but see e.g. commit
> ece01aae479227d9836294b287d872c5a6146a11. I doubt that's the only
> code that's poorly-optimized for multiple forks; IOW, every time
> someone adds a new fork, there's a system-wide cost to that, even if
> that fork is only used in a tiny percentage of the relations that
> exist in the system.

Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There's improvements that need to be made. But IMHO that's not reason to automatically avoid forks; we need to consider the cost of improving them vs what we gain by using them.

Of course there's always some added cost so we shouldn't just blindly use them all over the place without considering the fork cost either...

> It's tempting to think that we can use the fork mechanism every time
> we have multiple logical "streams" of blocks within a relation and
> don't want to figure out a way to multiplex them onto the same
> physical file. However, the reality is that the fork mechanism isn't
> up to the job. I certainly don't want to imply that we shouldn't have
> gone in that direction - both the fsm and the vm are huge steps
> forward, and we wouldn't have gotten them in 8.4 without that
> mechanism. But they haven't been entirely without their own pain,
> too, and that pain is going to grow the more we push in the direction
> of relying on forks.

Agreed.

Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably shouldn't just blindly stick with 1G (though of course initial implementation might do that to reduce complexity, but we better still consider where we're headed).

>> If the only complaint about forks is directory traversal why wouldn't we go
>> with the well established practice of using multiple directories instead of
>> glomming everything into one place?
>
> That's not the only complaint about forks - but I support what you're
> proposing there anyhow, because it will be helpful to users with lots
> of relations regardless of what we do or do not decide to do about
> forks.
>

--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Fri, Sep 27, 2013 at 7:22 PM, Jim Nasby <jim@nasby.net> wrote:
>
> Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There's improvements that need to be made. But IMHO that's not reason to automatically avoid forks; we need to consider the cost of improving them vs what we gain by using them.

I think this gives short shrift to the decision to introduce forks. If you go back to the discussion at the time, it was a topic of debate, and the argument which won the day is that interleaving different streams of data in one storage system is exactly what the file system is designed to do, and we would just be reinventing the wheel if we tried to do it ourselves. I think that makes a lot of sense for things like the fsm or vm, which grow indefinitely and are maintained by a different piece of code from the main heap. The tradeoff might be somewhat different for the pieces of a data structure like a bitmap index or gin index, where the code responsible for maintaining it is all the same.

> Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably shouldn't just blindly stick with 1G (though of course initial implementation might do that to reduce complexity, but we better still consider where we're headed).

The ultimate goal here would be to get the filesystem to issue a TRIM call so an SSD storage system can reuse the underlying blocks. Truncating 1GB files might be a convenient way to do it, especially if we have some new kind of vacuum full that can pack tuples within each 1GB file.
But there may be easier ways to achieve the same thing. If we can notify the filesystem that we're not using some of the blocks in the middle of the file we might be able to just leave things where they are and have holes in the files. Or we might be better off not depending on truncate and just look for ways to mark entire 1GB files as "deprecated" and move tuples out of them until we can just remove that whole file. -- greg
On 9/27/13 1:43 PM, Greg Stark wrote:
>> Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably shouldn't just blindly stick with 1G (though of course initial implementation might do that to reduce complexity, but we better still consider where we're headed).
>
> The ultimate goal here would be to get the filesystem to issue a TRIM call so an SSD storage system can reuse the underlying blocks. Truncating 1GB files might be a convenient way to do it, especially if we have some new kind of vacuum full that can pack tuples within each 1GB file.
>
> But there may be easier ways to achieve the same thing. If we can notify the filesystem that we're not using some of the blocks in the middle of the file we might be able to just leave things where they are and have holes in the files. Or we might be better off not depending on truncate and just look for ways to mark entire 1GB files as "deprecated" and move tuples out of them until we can just remove that whole file.

Yeah, there's a ton of different things we might do. And dealing with free space is just one example... things like the VM give us the ability to detect areas of the heap that have gone "dormant"; imagine if we could seamlessly move that data to its own storage, possibly compressing it at the same time. (Yes, I realize there's partitioning and tablespaces and compressing filesystems, but those are a lot more work and will never be as efficient as what the database itself can do).

Anyway, I think we're all on the same page. We should stop hijacking Alvaro's thread... ;)

-- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On 27.09.2013 21:43, Greg Stark wrote:
> On Fri, Sep 27, 2013 at 7:22 PM, Jim Nasby <jim@nasby.net> wrote:
>> Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There's improvements that need to be made. But IMHO that's not reason to automatically avoid forks; we need to consider the cost of improving them vs what we gain by using them.
>
> I think this gives short shrift to the decision to introduce forks. If you go back to the discussion at the time it was a topic of debate and the argument which won the day is that interleaving different streams of data in one storage system is exactly what the file system is designed to do and we would just be reinventing the wheel if we tried to do it ourselves. I think that makes a lot of sense for things like the fsm or vm which grow indefinitely and are maintained by a different piece of code from the main heap.
>
> The tradeoff might be somewhat different for the pieces of a data structure like a bitmap index or gin index where the code responsible for maintaining it is all the same.

There are quite a few cases where we have several "streams" of data, all related to a single relation. We've solved them all in slightly different ways:

1. TOAST. A separate heap relation with accompanying b-tree index is created.
2. GIN. GIN contains a b-tree, and data pages (and some other kinds of pages too IIRC). It would be natural to use the regular B-tree code for the B-tree, but instead it contains a completely separate implementation. All the different kinds of streams are stored in the main fork.
3. Free space map. Stored as a separate fork.
4. Visibility map. Stored as a separate fork.

And upcoming:

5. Minmax indexes, with the linearly-addressed range reverse map and variable length index tuples.
6. Bitmap indexes. Like in GIN, there's a B-tree and the data pages containing the bitmaps.
A nice property of the VM and FSM forks currently is that they are just auxiliary information to speed things up. You can safely remove them (when the server is shut down), and the system will recreate them on next vacuum. It's not carved in stone that it has to be that way for all extra forks, but it is today and I like it.

I feel we need a new kind of a relation fork, something more heavy-weight than the current forks, but not as heavy-weight as the way TOAST does it. It would be nice if GIN and bitmap indexes could use the regular nbtree code. Or any other index type - imagine a bitmap index using a SP-GiST index instead of a B-tree! You could create a bitmap index for 2d points, and use it to speed up operations like overlap for example.

The nbtree code expects the data to be in the main fork and uses the FSM fork too. Maybe it could be abstracted, so that the regular b-tree could be used as part of another index type. The same goes for other index AMs. Perhaps relation forks need to be made more flexible, allowing access methods to define what forks exist.

IOW, let's not avoid using relation forks, let's make them better instead.

- Heikki
What would it take to abstract the minmax indexes to allow maintaining a bounding box for points, instead of a plain min/max? Or for ranges. In other words, why is this restricted to b-tree operators? - Heikki
On Mon, Sep 30, 2013 at 02:17:39PM +0300, Heikki Linnakangas wrote:
> What would it take to abstract the minmax indexes to allow maintaining a bounding box for points, instead of a plain min/max? Or for ranges. In other words, why is this restricted to b-tree operators?

If I had to guess, I'd guess, "first cut." I take it this also occurred to you and that you believe that this approach makes the more general case harder, or at least further out than it would need to be. Am I close?

Cheers, David.
-- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter wrote:
> On Mon, Sep 30, 2013 at 02:17:39PM +0300, Heikki Linnakangas wrote:
> > What would it take to abstract the minmax indexes to allow maintaining a bounding box for points, instead of a plain min/max? Or for ranges. In other words, why is this restricted to b-tree operators?
>
> If I had to guess, I'd guess, "first cut."

Yeah, there were a few other simplifications in the design too, though I admit allowing for multidimensional datatypes hadn't occurred to me (though I will guess Simon did think about it and just didn't tell me, to avoid me going overboard with stuff that would make the first version take forever).

I think we'd better add version numbers and stuff to the metapage to allow for extensions and proper upgradability.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 30.09.2013 19:49, Alvaro Herrera wrote:
> David Fetter wrote:
>> On Mon, Sep 30, 2013 at 02:17:39PM +0300, Heikki Linnakangas wrote:
>>> What would it take to abstract the minmax indexes to allow maintaining a bounding box for points, instead of a plain min/max? Or for ranges. In other words, why is this restricted to b-tree operators?
>>
>> If I had to guess, I'd guess, "first cut."
>
> Yeah, there were a few other simplifications in the design too, though I admit allowing for multidimensional datatypes hadn't occurred to me

You can almost create a bounding box opclass in the current implementation, by mapping < operator to "contains" and > to "not contains". But there's no support for creating a new, larger, bounding box on insert. It will just replace the max with the new value if it's "greater than", when it should create a whole new value to store in the index that covers both the old and the new values. (or less than? I'm not sure which way those operators would work..)

When you think of the general case, it's weird that the current implementation requires storing both the min and the max. For a bounding box, you store the bounding box that covers all heap tuples in the range. If that corresponds to "max", what does "min" mean?

In fact, even with regular b-tree operators, over integers for example, you don't necessarily want to store both min and max. If you only ever perform queries like "WHERE col > ?", there's no need to track the min value. So to make this really general, you should be able to create an index on only the minimum or maximum. Or if you want both, you can store them as separate index columns. Something like:

    CREATE INDEX minindex ON table (col ASC); -- For min
    CREATE INDEX minindex ON table (col DESC); -- For max
    CREATE INDEX minindex ON table (col ASC, col DESC); -- For both

That said, in practice most people probably want to store both min and max. Maybe it's a bit too finicky if we require people to write "col ASC, col DESC" to get that. Some kind of a shorthand probably makes sense.

> (though I will guess Simon did think about it and just didn't tell me to avoid me going overboard with stuff that would make the first version take forever).

Heh, and I ruined that great plan :-).

- Heikki
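The remark about one-sided queries can be illustrated with a few lines of Python. This is only a sketch and not code from the patch: the names `summarize_max` and `ranges_to_scan` are made up, and a "page range" is modelled as a fixed-size slice of a list. It shows that a query of the form "WHERE col > ?" can be pruned using nothing but the per-range maximum.

```python
from typing import List


def summarize_max(values: List[int], range_size: int) -> List[int]:
    """Keep only the maximum of each consecutive block of range_size values."""
    return [max(values[i:i + range_size])
            for i in range(0, len(values), range_size)]


def ranges_to_scan(maxima: List[int], bound: int) -> List[int]:
    """Return the ranges that may contain a value with col > bound.

    A range whose maximum is <= bound cannot match, so the minimum
    is never consulted.
    """
    return [i for i, m in enumerate(maxima) if m > bound]


if __name__ == "__main__":
    heap = [1, 3, 2, 10, 12, 11, 5, 4, 6]   # three ranges of three values
    maxima = summarize_max(heap, 3)
    print(maxima)
    print(ranges_to_scan(maxima, 5))
```

The mirror-image case ("WHERE col < ?") would need only the per-range minimum, which is why storing both is a convenience rather than a requirement.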
On Mon, Sep 30, 2013 at 1:20 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> You can almost create a bounding box opclass in the current implementation, by mapping < operator to "contains" and > to "not contains". But there's no support for creating a new, larger, bounding box on insert. It will just replace the max with the new value if it's "greater than", when it should create a whole new value to store in the index that covers both the old and the new values. (or less than? I'm not sure which way those operators would work..)

This sounds an awful lot like GiST's "union" operation. Actually, following the GiST model of having "union" and "consistent" operations might be a smart way to go. Then the exact index semantics could be decided by the opclass. This might not even be that much extra code; the existing consistent and union functions for GiST are pretty short. That way, it'd be easy to add new opclasses with somewhat different behavior; the common thread would be that every opclass of this new AM works by summarizing a physical page range into a single indexed value. You might call the AM something like "summary" or "sparse" and then have "minmax_ops" for your first opclass.

> In fact, even with regular b-tree operators, over integers for example, you don't necessarily want to store both min and max. If you only ever perform queries like "WHERE col > ?", there's no need to track the min value. So to make this really general, you should be able to create an index on only the minimum or maximum. Or if you want both, you can store them as separate index columns. Something like:
>
> CREATE INDEX minindex ON table (col ASC); -- For min
> CREATE INDEX minindex ON table (col DESC); -- For max
> CREATE INDEX minindex ON table (col ASC, col DESC); -- For both

This doesn't seem very general, since you're relying on the fact that ASC and DESC map to < and >. It's not clear what you'd write here if you wanted to optimize #$ and @!. But something based on opclasses will work, since each opclass can support an arbitrary set of operators.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
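To make the "union" and "consistent" idea concrete, here is a rough Python sketch. It is not the GiST API and not anything in the patch: `SummaryOpclass`, `build` and `candidate_ranges` are hypothetical names, and page ranges are again modelled as slices of a list. The only point is that the range-summarizing machinery stays the same while the opclass decides what a summary means.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class SummaryOpclass:
    # union(summary_or_None, value) -> new summary covering both
    union: Callable[[Optional[Any], Any], Any]
    # consistent(summary, query) -> True if the range might match the query
    consistent: Callable[[Any, Any], bool]


# Scalar min/max: summary is (lo, hi), query is an inclusive interval (lo, hi).
def minmax_union(summary, value):
    if summary is None:
        return (value, value)
    return (min(summary[0], value), max(summary[1], value))


def minmax_consistent(summary, query):
    return summary[0] <= query[1] and summary[1] >= query[0]


# 2-D bounding box: summary and query are (xmin, ymin, xmax, ymax).
def box_union(summary, point):
    x, y = point
    if summary is None:
        return (x, y, x, y)
    return (min(summary[0], x), min(summary[1], y),
            max(summary[2], x), max(summary[3], y))


def box_consistent(summary, query):
    return (summary[0] <= query[2] and summary[2] >= query[0]
            and summary[1] <= query[3] and summary[3] >= query[1])


def build(opclass: SummaryOpclass, rows: List[Any], range_size: int) -> List[Any]:
    """Summarize each block of range_size rows into a single value."""
    summaries = []
    for start in range(0, len(rows), range_size):
        summary = None
        for value in rows[start:start + range_size]:
            summary = opclass.union(summary, value)
        summaries.append(summary)
    return summaries


def candidate_ranges(opclass: SummaryOpclass, summaries: List[Any], query: Any) -> List[int]:
    """Ranges that cannot be excluded and therefore must be scanned."""
    return [i for i, s in enumerate(summaries) if opclass.consistent(s, query)]


minmax_ops = SummaryOpclass(minmax_union, minmax_consistent)
box_ops = SummaryOpclass(box_union, box_consistent)

if __name__ == "__main__":
    ints = build(minmax_ops, [1, 2, 3, 40, 41, 42], 3)
    print(candidate_ranges(minmax_ops, ints, (2, 5)))
    points = build(box_ops, [(0, 0), (1, 1), (10, 10), (12, 11)], 2)
    print(candidate_ranges(box_ops, points, (9, 9, 11, 11)))
```

With this shape, inserting a new heap value is just another call to `union` on the existing summary, which is exactly the step the bounding-box case was said to be missing.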
Robert Haas escribió: > On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Here's an updated version of this patch, with fixes to all the bugs > > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > > Amit Kapila for the reports. > > I'm not very happy with the use of a separate relation fork for > storing this data. I have been playing with having the revmap in the main fork of the index rather than a separate one. On the surface many things stay just what they are; I only had to add a layer beneath the revmap that maps its logical block numbers to physical block numbers. The problem with this is that it needs more disk access, because revmap block numbers cannot be hardcoded. After doing some quick math, what I ended up with was to keep an array of BlockNumbers in the metapage. Each element in this array points to array pages; each array page is, in turn, filled with more BlockNumbers, which this time correspond to the logical revmap pages we used to have in the revmap fork. (I initially feared that this design would not allow me to address enough revmap pages for the largest of tables; but fortunately this is sufficient unless you configure very small pages, say BLCKSZ 2kB, use small page ranges, and use small datatypes, say "char". I have no problem with saying that that scenario is not supported if you want to have minmax indexes on 32 TB tables. I mean, who uses BLCKSZ smaller than 8kB anyway?). The advantage of this design is that in order to find any particular logical revmap page, you always have to do a constant number of page accesses. You read the metapage, then read the array page, then read the revmap page; done. Another idea I considered was chaining revmap pages (so each would have a pointer-to-next), or chaining array pages; but this would have meant that to locate an individual page to the end of the revmap, you might need to do many accesses. Not good. 
As an optimization for relatively small indexes, we hardcode the page number for the first revmap page: it's always the page right after the metapage (so BlockNumber 1). A revmap page can store, with the default page size, about 1350 item pointers; so with an index built for page ranges of 1000 pages per range, you can point to enough index entries for a ~10 GB table without having the need to examine the first array page. This seems pretty acceptable; people with larger tables can likely spare one extra page accessed every now and then. (For comparison, each regular minmax page can store about 500 index tuples, if it's built for a single 4-byte column; this means that the 10 GB table requires a 5-page index.) This is not complete yet; although I have a proof-of-concept working, I still need to write XLog support code and update the pageinspect code to match. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
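The lookup path and the sizing figures in the mail above can be sketched as follows. This is a hedged illustration, not the patch's code: `locate_revmap_entry`, `ARRAY_ENTRIES_PER_PAGE` and the dictionary standing in for disk pages are assumptions made for the example, and indexing array slots directly by logical revmap page number is a guess at the layout. The constants 1350 and 500 are the approximate figures quoted in the mail.

```python
import math

BLCKSZ = 8192                  # default page size
REVMAP_ITEMS_PER_PAGE = 1350   # "about 1350 item pointers" per revmap page
ARRAY_ENTRIES_PER_PAGE = 2000  # assumption, only for illustration
FIRST_REVMAP_BLOCK = 1         # first revmap page sits right after the metapage


def locate_revmap_entry(heap_block, pages_per_range, metapage_array, read_page):
    """Return (physical revmap block, offset within it) for a heap block.

    metapage_array lists the block numbers of the array pages; read_page(blk)
    returns the list of block numbers stored in an array page.
    """
    range_no = heap_block // pages_per_range
    logical_page, offset = divmod(range_no, REVMAP_ITEMS_PER_PAGE)
    if logical_page == 0:
        # Small-index shortcut: no array page needs to be read.
        return FIRST_REVMAP_BLOCK, offset
    array_idx, slot = divmod(logical_page, ARRAY_ENTRIES_PER_PAGE)
    array_page = read_page(metapage_array[array_idx])
    return array_page[slot], offset


def bytes_covered_by_first_revmap_page(pages_per_range):
    """Table size addressable without touching any array page."""
    return REVMAP_ITEMS_PER_PAGE * pages_per_range * BLCKSZ


def index_pages_for_table(table_bytes, pages_per_range, tuples_per_page=500):
    """Metapage + revmap pages + regular index pages (array pages ignored)."""
    heap_pages = math.ceil(table_bytes / BLCKSZ)
    ranges = math.ceil(heap_pages / pages_per_range)
    revmap_pages = math.ceil(ranges / REVMAP_ITEMS_PER_PAGE)
    data_pages = math.ceil(ranges / tuples_per_page)
    return 1 + revmap_pages + data_pages


if __name__ == "__main__":
    fake_disk = {7: [1, 42, 43]}   # array page at block 7
    print(locate_revmap_entry(999_999, 1000, [7], fake_disk.__getitem__))
    print(locate_revmap_entry(1_350_000, 1000, [7], fake_disk.__getitem__))
    print(bytes_covered_by_first_revmap_page(1000) / 2**30)   # GiB
    print(index_pages_for_table(10 * 2**30, 1000))
```

Under these assumptions the first revmap page covers a little over 10 GiB of heap at 1000 pages per range, and one plausible way to arrive at the "5-page index" figure is one metapage, one revmap page and three regular index pages.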
Erik Rijkers wrote: > On Thu, September 26, 2013 00:34, Erik Rijkers wrote: > > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: > > > >> [minmax-5.patch] > > > > I have the impression it's not quite working correctly. Here's a version 7 of the patch, which fixes these bugs and adds opclasses for a bunch more types (timestamp, timestamptz, date, time, timetz), courtesy of Martín Marqués. It's also been rebased to apply cleanly on top of today's master branch. I have also added a selectivity function, but I'm not positive that it's very useful yet. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Alvaro Herrera escribió: > I have been playing with having the revmap in the main fork of the index > rather than a separate one. ... > This is not complete yet; although I have a proof-of-concept working, I > still need to write XLog support code and update the pageinspect code to > match. Just to be clear: the v7 published elsewhere in this thread does not contain this revmap-in-main-fork code. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
> Here's a version 7 of the patch, which fixes these bugs and adds opclasses for a bunch more types (timestamp, timestamptz, date, time, timetz), courtesy of Martín Marqués. It's also been rebased to apply cleanly on top of today's master branch.
>
> I have also added a selectivity function, but I'm not positive that it's very useful yet.
>
> [minmax-7.patch]

The earlier errors are indeed fixed; now, I've been trying with the attached test case but I'm unable to find a query that improves with minmax index use. (It gets used sometimes but the speedup is negligible.)

That probably means I'm doing something wrong; could you (or anyone) give some hints about what use-case would be expected? (Or is it just the unfinished selectivity function?)

Thanks, Erikjan Rijkers
Attachment
On Mon, November 11, 2013 09:53, Erik Rijkers wrote:
> On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
>> Here's a version 7 of the patch, which fixes these bugs and adds opclasses for a bunch more types (timestamp, timestamptz, date, time, timetz), courtesy of Martín Marqués. It's also been rebased to apply cleanly on top of today's master branch.
>>
>> I have also added a selectivity function, but I'm not positive that it's very useful yet.
>>
>> [minmax-7.patch]
>
> The earlier errors are indeed fixed; now, I've been trying with the attached test case but I'm unable to find a query that improves with minmax index use. (it gets used sometimes but speedup is negligible).

Another issue (I think): Attached is a program (and output as a .txt file) that gives the following (repeatable) error:

$ ./casanova_test.sh
\timing on
drop table if exists t1;
Time: 333.159 ms
create table t1 (i int);
Time: 155.827 ms
create index t1_i_idx on t1 using minmax(i);
Time: 204.031 ms
insert into t1 select generate_series(1, 25000000);
Time: 126312.302 ms
analyze t1;
ERROR: could not truncate file base/21324/26339_vm to 41 blocks: it's only 1 blocks now
Time: 472.504 ms
[...]

Thanks, Erik Rijkers
Attachment
On Mon, Nov 11, 2013 at 12:53 AM, Erik Rijkers <er@xs4all.nl> wrote:
> On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
>> Here's a version 7 of the patch, which fixes these bugs and adds
>> opclasses for a bunch more types (timestamp, timestamptz, date, time,
>> timetz), courtesy of Martín Marqués. It's also been rebased to apply
>> cleanly on top of today's master branch.
>>
>> I have also added a selectivity function, but I'm not positive that it's
>> very useful yet.
>>
>> [minmax-7.patch]
>
> The earlier errors are indeed fixed; now, I've been trying with the attached test case but I'm unable to find a query that
> improves with minmax index use. (it gets used sometimes but speedup is negligible).
Your data set seems to be completely random. I believe that minmax indices would only be expected to be useful when the data is clustered. Perhaps you could try it on a table where it is populated something like i+random()/10*max_i.
Cheers,
Jeff
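The effect of clustering can be seen in a small simulation. The following Python sketch is an illustration only, with hypothetical helper names (`summarize`, `fraction_scanned`, `compare`); it treats every 1000 consecutive rows as one page range, fills a table with the suggested `i + random()/10*max_i` distribution, and compares how many ranges a narrow query must visit when the rows are stored in that order versus shuffled.

```python
import random


def summarize(values, range_size):
    """One (min, max) pair per block of range_size consecutive rows."""
    chunks = (values[i:i + range_size] for i in range(0, len(values), range_size))
    return [(min(c), max(c)) for c in chunks]


def fraction_scanned(summaries, lo, hi):
    """Share of ranges whose [min, max] overlaps the query interval [lo, hi]."""
    hits = sum(1 for mn, mx in summaries if mn <= hi and mx >= lo)
    return hits / len(summaries)


def compare(n=100_000, range_size=1000, seed=0):
    rng = random.Random(seed)
    max_i = n
    # Roughly increasing values with noise of up to 10% of the maximum.
    clustered = [i + rng.random() / 10 * max_i for i in range(n)]
    shuffled = clustered[:]
    rng.shuffle(shuffled)
    query = (50_000, 51_000)
    return (fraction_scanned(summarize(clustered, range_size), *query),
            fraction_scanned(summarize(shuffled, range_size), *query))


if __name__ == "__main__":
    clustered_frac, shuffled_frac = compare()
    print(f"clustered: {clustered_frac:.2f} of ranges scanned")
    print(f"shuffled:  {shuffled_frac:.2f} of ranges scanned")
```

On the correlated data only a small share of the ranges overlaps the query interval, while on shuffled data essentially every range spans the whole value domain and nothing can be skipped, which would explain a negligible speedup on a completely random data set.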
On Mon, November 11, 2013 09:53, Erik Rijkers wrote:
> On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
>> Here's a version 7 of the patch, which fixes these bugs and adds
>>
>> [minmax-7.patch]
> [...]
> some hints about use-case would be expected?

I've been messing with minmax indexes some more so here are some results of that. Perhaps someone finds these timings useful.

Centos 5.7, 32 GB memory, 2 quadcores.

'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--enable-depend' '--enable-cassert' '--enable-debug' '--with-perl' '--with-openssl' '--with-libxml' '--enable-dtrace'

Detail is in the attached files; the below is a grep through these.

-- rowcount (size_string): 10_000
368,640 | size table
245,760 | size btree index
16,384 | size minmax index
Total runtime: 0.167 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.046 ms
Total runtime: 0.046 ms
Total runtime: 0.049 ms
Total runtime: 0.102 ms <-- minmax (4x)
Total runtime: 0.047 ms
Total runtime: 0.047 ms
Total runtime: 0.047 ms
Total runtime: 1.066 ms <-- seqscan

-- rowcount (size_string): 100_000
3,629,056 | size table
2,260,992 | size btree index
16,384 | size minmax index
Total runtime: 0.090 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.046 ms
Total runtime: 0.426 ms
Total runtime: 0.287 ms
Total runtime: 0.391 ms <-- minmax (4x)
Total runtime: 0.285 ms
Total runtime: 0.285 ms
Total runtime: 0.291 ms
Total runtime: 14.065 ms <-- seqscan

-- rowcount (size_string): 1_000_000
36,249,600 | size table
22,487,040 | size btree index
57,344 | size minmax index
Total runtime: 0.077 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.048 ms
Total runtime: 0.044 ms
Total runtime: 0.038 ms
Total runtime: 2.284 ms <-- minmax (4x)
Total runtime: 1.812 ms
Total runtime: 1.813 ms
Total runtime: 1.809 ms
Total runtime: 142.958 ms <-- seqscan

-- rowcount (size_string): 100_000_000
3,624,779,776 | size table
2,246,197,248 | size btree index
4,456,448 | size minmax index
Total runtime: 0.091 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.047 ms
Total runtime: 0.046 ms
Total runtime: 0.038 ms
Total runtime: 181.874 ms <-- minmax (4x)
Total runtime: 175.084 ms
Total runtime: 175.104 ms
Total runtime: 174.349 ms
Total runtime: 14833.994 ms <-- seqscan

-- rowcount (size_string): 1_000_000_000
36,247,789,568 | size table
22,461,628,416 | size btree index
44,433,408 | size minmax index
Total runtime: 14.735 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.046 ms
Total runtime: 0.044 ms
Total runtime: 0.041 ms
Total runtime: 1790.591 ms <-- minmax (4x)
Total runtime: 1750.129 ms
Total runtime: 1747.987 ms
Total runtime: 1748.476 ms
Total runtime: 169770.455 ms <-- seqscan

The messy "program" is attached too (although it still has Jaime's name, the mess is mine).

hth, Erik Rijkers

PS. The bug I reported earlier is (of course) still there; but I noticed that it only occurs on larger table sizes (e.g. +1M rows).
Attachment
On 2013-11-15 17:11:46 +0100, Erik Rijkers wrote:
> I've been messing with minmax indexes some more so here are some results of that.
>
> Perhaps someone finds these timings useful.
>
> Centos 5.7, 32 GB memory, 2 quadcores.
>
> '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--enable-depend' '--enable-cassert'
> '--enable-debug' '--with-perl' '--with-openssl' '--with-libxml' '--enable-dtrace'

Just some general advice: doing timings with --enable-cassert isn't that meaningful - it often can distort results significantly.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Erik Rijkers <er@xs4all.nl> wrote: > Perhaps someone finds these timings useful. > '--enable-cassert' Assertions can really distort the timings, and not always equally for all code paths. Any chance of re-running those tests without that? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, November 15, 2013 17:33, Kevin Grittner wrote:
> Erik Rijkers <er@xs4all.nl> wrote:
>
>> Perhaps someone finds these timings useful.
>
>> '--enable-cassert'
>
> Assertions can really distort the timings, and not always equally
> for all code paths. Any chance of re-running those tests without
> that?

Fair enough. It seems it doesn't make all that much difference for this case, here are the results:

'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--enable-depend' '--with-perl' '--with-openssl' '--with-libxml'

-- rowcount (size_string): 10_000
368640 | size table | 360 kB
245760 | size btree index | 240 kB
16384 | size minmax index | 16 kB
Total runtime: 0.121 ms
Total runtime: 0.041 ms
Total runtime: 0.039 ms
Total runtime: 0.040 ms
Total runtime: 0.043 ms
Total runtime: 0.041 ms
Total runtime: 0.040 ms
Total runtime: 0.040 ms
Total runtime: 0.948 ms

-- rowcount (size_string): 100_000
3629056 | size table | 3544 kB
2260992 | size btree index | 2208 kB
16384 | size minmax index | 16 kB
Total runtime: 0.082 ms
Total runtime: 0.039 ms
Total runtime: 0.396 ms
Total runtime: 0.252 ms
Total runtime: 0.339 ms
Total runtime: 0.245 ms
Total runtime: 0.240 ms
Total runtime: 0.241 ms
Total runtime: 13.268 ms

-- rowcount (size_string): 1_000_000
36249600 | size table | 35 MB
22487040 | size btree index | 21 MB
57344 | size minmax index | 56 kB
Total runtime: 0.096 ms
Total runtime: 0.039 ms
Total runtime: 0.039 ms
Total runtime: 0.034 ms
Total runtime: 1.975 ms
Total runtime: 1.527 ms
Total runtime: 1.523 ms
Total runtime: 1.519 ms
Total runtime: 145.125 ms

-- rowcount (size_string): 100_000_000
3624779776 | size table | 3457 MB
2246197248 | size btree index | 2142 MB
4456448 | size minmax index | 4352 kB
Total runtime: 0.074 ms
Total runtime: 0.039 ms
Total runtime: 0.040 ms
Total runtime: 0.033 ms
Total runtime: 150.450 ms
Total runtime: 147.039 ms
Total runtime: 145.410 ms
Total runtime: 145.142 ms
Total runtime: 15068.171 ms

-- rowcount (size_string): 1_000_000_000
36247789568 | size table | 34 GB
22461628416 | size btree index | 21 GB
44433408 | size minmax index | 42 MB
Total runtime: 15.454 ms <-- 4x btree
Total runtime: 0.040 ms
Total runtime: 0.040 ms
Total runtime: 0.034 ms
Total runtime: 1502.353 ms <-- 4x minmax
Total runtime: 1482.322 ms
Total runtime: 1489.522 ms
Total runtime: 1481.424 ms
Total runtime: 162213.392 ms <-- seqscan

I'd say minmax indexes give spectacular gains for a very small index size.

Erik Rijkers
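To put numbers on "spectacular gains for a very small index size", the ratios can be computed directly from the figures posted above (the run without --enable-cassert). This Python snippet is only arithmetic on those figures; it assumes that the unlabelled blocks list their timings in the same order as the labelled one (four btree runs, four minmax runs, one seqscan) and uses the fastest minmax run of each block.

```python
# rowcount: (btree index bytes, minmax index bytes, best minmax ms, seqscan ms)
RESULTS = {
    1_000_000:     (22_487_040,     57_344,     1.519,    145.125),
    100_000_000:   (2_246_197_248,  4_456_448,  145.142,  15_068.171),
    1_000_000_000: (22_461_628_416, 44_433_408, 1481.424, 162_213.392),
}


def ratios(rowcount):
    """Return (btree size / minmax size, seqscan time / minmax time)."""
    btree_bytes, minmax_bytes, minmax_ms, seqscan_ms = RESULTS[rowcount]
    return btree_bytes / minmax_bytes, seqscan_ms / minmax_ms


if __name__ == "__main__":
    for rows in RESULTS:
        size_ratio, speedup = ratios(rows)
        print(f"{rows:>13,} rows: minmax index {size_ratio:6.1f}x smaller than btree, "
              f"{speedup:6.1f}x faster than seqscan")
```

For the larger tables the minmax index comes out roughly 500 times smaller than the btree while still being about 100 times faster than a sequential scan; the btree lookups themselves remain far faster in absolute terms.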
Attachment
On Fri, Nov 8, 2013 at 12:11 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Erik Rijkers wrote:
>> On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
>>> On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
>>>
>>>> [minmax-5.patch]
>>>
>>> I have the impression it's not quite working correctly.
>
> Here's a version 7 of the patch, which fixes these bugs and adds
> opclasses for a bunch more types (timestamp, timestamptz, date, time,
> timetz), courtesy of Martín Marqués. It's also been rebased to apply
> cleanly on top of today's master branch.
>
> I have also added a selectivity function, but I'm not positive that it's
> very useful yet.
I tested it with the attached script, but broke out of the "for" loop after 5 iterations (when it had 300,000,005 rows inserted).
Then I did an analyze, and got the error message below:
jjanes=# analyze;
ERROR: could not truncate file "base/16384/16388_vm" to 488 blocks: it's only 82 blocks now
16388 is the index's relfilenode.
Here is the backtrace upon entry to the truncate that is going to fail:
#0 mdtruncate (reln=0x23c91b0, forknum=VISIBILITYMAP_FORKNUM, nblocks=488) at md.c:858
#1 0x000000000048eb4a in mmRevmapTruncate (rmAccess=0x26ad878, heapNumBlocks=1327434) at mmrevmap.c:360
#2 0x000000000048d37a in mmvacuumcleanup (fcinfo=<value optimized out>) at minmax.c:1264
#3 0x000000000072dcef in FunctionCall2Coll (flinfo=<value optimized out>, collation=<value optimized out>, arg1=<value optimized out>,
arg2=<value optimized out>) at fmgr.c:1323
#4 0x000000000048c1e5 in index_vacuum_cleanup (info=<value optimized out>, stats=0x0) at indexam.c:715
#5 0x000000000052a7ce in do_analyze_rel (onerel=0x7f59798589e8, vacstmt=0x23b0bd8, acquirefunc=0x5298d0 <acquire_sample_rows>, relpages=1327434,
inh=0 '\000', elevel=13) at analyze.c:634
#6 0x000000000052b320 in analyze_rel (relid=<value optimized out>, vacstmt=0x23b0bd8, bstrategy=<value optimized out>) at analyze.c:267
#7 0x000000000057cba7 in vacuum (vacstmt=0x23b0bd8, relid=<value optimized out>, do_toast=1 '\001', bstrategy=<value optimized out>,
for_wraparound=0 '\000', isTopLevel=<value optimized out>) at vacuum.c:249
#8 0x0000000000663177 in standard_ProcessUtility (parsetree=0x23b0bd8, queryString=<value optimized out>, context=<value optimized out>, params=0x0,
dest=<value optimized out>, completionTag=<value optimized out>) at utility.c:682
#9 0x00007f598290b791 in pgss_ProcessUtility (parsetree=0x23b0bd8, queryString=0x23b0220 "analyze \n;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
dest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pg_stat_statements.c:825
#10 0x000000000065fcf7 in PortalRunUtility (portal=0x24195e0, utilityStmt=0x23b0bd8, isTopLevel=1 '\001', dest=0x23b0f18, completionTag=0x7fffd3442f30 "")
at pquery.c:1187
#11 0x0000000000660c6d in PortalRunMulti (portal=0x24195e0, isTopLevel=1 '\001', dest=0x23b0f18, altdest=0x23b0f18, completionTag=0x7fffd3442f30 "")
at pquery.c:1318
#12 0x0000000000661323 in PortalRun (portal=0x24195e0, count=9223372036854775807, isTopLevel=1 '\001', dest=0x23b0f18, altdest=0x23b0f18,
completionTag=0x7fffd3442f30 "") at pquery.c:816
#13 0x000000000065dbb4 in exec_simple_query (query_string=0x23b0220 "analyze \n;") at postgres.c:1048
#14 0x000000000065f259 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x2347be8 "jjanes", username=<value optimized out>)
at postgres.c:3992
#15 0x000000000061b7d0 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4085
#16 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:3774
#17 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1585
#18 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1240
#19 0x00000000005b5e90 in main (argc=3, argv=0x2346cd0) at main.c:196
Cheers,
Jeff
Attachment
On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Erik Rijkers wrote: >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: >> > >> >> [minmax-5.patch] >> > >> > I have the impression it's not quite working correctly. > > Here's a version 7 of the patch, which fixes these bugs and adds > opclasses for a bunch more types (timestamp, timestamptz, date, time, > timetz), courtesy of Martín Marqués. It's also been rebased to apply > cleanly on top of today's master branch. > > I have also added a selectivity function, but I'm not positive that it's > very useful yet. This patch doesn't appear to have been submitted to any Commitfest. Is this still a feature undergoing research then? -- Thom
Thom Brown wrote: > On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Erik Rijkers wrote: > >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: > >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: > >> > > >> >> [minmax-5.patch] > >> > > >> > I have the impression it's not quite working correctly. > > > > Here's a version 7 of the patch, which fixes these bugs and adds > > opclasses for a bunch more types (timestamp, timestamptz, date, time, > > timetz), courtesy of Martín Marqués. It's also been rebased to apply > > cleanly on top of today's master branch. > > > > I have also added a selectivity function, but I'm not positive that it's > > very useful yet. > > This patch doesn't appear to have been submitted to any Commitfest. > Is this still a feature undergoing research then? It's still a planned feature, but I didn't have time to continue work for 2014-01. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 24 January 2014 17:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Thom Brown wrote: >> On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> > Erik Rijkers wrote: >> >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: >> >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: >> >> > >> >> >> [minmax-5.patch] >> >> > >> >> > I have the impression it's not quite working correctly. >> > >> > Here's a version 7 of the patch, which fixes these bugs and adds >> > opclasses for a bunch more types (timestamp, timestamptz, date, time, >> > timetz), courtesy of Martín Marqués. It's also been rebased to apply >> > cleanly on top of today's master branch. >> > >> > I have also added a selectivity function, but I'm not positive that it's >> > very useful yet. >> >> This patch doesn't appear to have been submitted to any Commitfest. >> Is this still a feature undergoing research then? > > It's still a planned feature, but I didn't have time to continue work > for 2014-01. All clear. Thanks -- Thom
On Fri, Jan 24, 2014 at 2:54 PM, Thom Brown <thom@linux.com> wrote: > On 24 January 2014 17:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> Thom Brown wrote: >>> On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>> > Erik Rijkers wrote: >>> >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: >>> >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: >>> >> > >>> >> >> [minmax-5.patch] >>> >> > >>> >> > I have the impression it's not quite working correctly. >>> > >>> > Here's a version 7 of the patch, which fixes these bugs and adds >>> > opclasses for a bunch more types (timestamp, timestamptz, date, time, >>> > timetz), courtesy of Martín Marqués. It's also been rebased to apply >>> > cleanly on top of today's master branch. >>> > >>> > I have also added a selectivity function, but I'm not positive that it's >>> > very useful yet. >>> >>> This patch doesn't appear to have been submitted to any Commitfest. >>> Is this still a feature undergoing research then? >> >> It's still a planned feature, but I didn't have time to continue work >> for 2014-01. What's the status? I believe I have more than one use for minmax indexes, and wouldn't mind lending a hand if it's within my grasp.
On Fri, Jan 24, 2014 at 12:58 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > > What's the status? > > I believe I have more than a use for minmax indexes, and wouldn't mind > lending a hand if it's within my grasp. I'm also interested in looking at this. Mostly because I have ideas for other "summary" functions that would be interesting and could use the same infrastructure otherwise. -- greg
Robert Haas wrote:
> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Here's an updated version of this patch, with fixes to all the bugs
> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
> > Amit Kapila for the reports.
>
> I'm not very happy with the use of a separate relation fork for
> storing this data.

Here's a new version of this patch. Now the revmap is not stored in a separate fork, but together with all the regular data, as explained elsewhere in the thread.

I added a few pageinspect functions that let one explore the data in the index. With these you can start by reading the metapage, and from there obtain the block numbers for the revmap array pages; and explore revmap array pages to read regular revmap pages, which contain the TIDs of index entries. None of these pageinspect functions has any documentation yet, but using them is as easy as

  with idxname as (select 'ti'::text as idxname)
  select *
    from idxname,
         generate_series(0, pg_relation_size(idxname) / 8192 - 1) i,
         minmax_page_type(get_raw_page(idxname, i::int));

  select *   -- data in metapage
    from minmax_metapage_info(get_raw_page('ti', 0));

  select *   -- data in revmap array pages
    from minmax_revmap_array_data(get_raw_page('ti', 6));

  select logblk, unnest(pages)   -- data in regular revmap pages
    from minmax_revmap_data(get_raw_page('ti', 15));

  select *   -- data in regular index pages
    from minmax_page_items(get_raw_page('ti', 2), 'ti'::regclass);

Note that in this last case you need to give it the OID of the index as the second parameter, so that it can construct a tupledesc for decoding the min/max data.

I have followed the suggestion by Amit to overwrite the index tuple when a new heap tuple is inserted, instead of creating a separate index tuple. This saves a lot of index bloat. This required a new entry point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new function PageIndexDeleteNoCompact which is similar in spirit to PageIndexMultiDelete except that item pointers do not change. This is necessary because the revmap stores item pointers, and such references would break if we were to renumber items in index pages.

I have also added a reloption for the size of each page range, so you can do

  create index ti on t using minmax (a) with (pages_per_range = 2);

The default is 128 pages per range, and I have an arbitrary maximum of 131072 pages (the default size of a 1GB segment). There doesn't seem to be much point in having larger page ranges; intuitively I think page ranges should be more or less the size of kernel readahead, but I haven't tested this.

I didn't want to rebase past 0ef0b6784 in a hurry. I only know this applies cleanly on top of fe7337f2dc, so please use that if you want to play with it. I will post a rebased version shortly.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
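The reason PageIndexDeleteNoCompact has to leave item pointers alone can be illustrated with a toy model. This is only a sketch under stated assumptions, not the bufpage.c code: `ToyPage`, `delete_compacting`, `delete_no_compact` and the dictionary standing in for the revmap are all hypothetical names invented for the illustration.

```python
class ToyPage:
    """An index page modelled as a list of slots; item numbers are 1-based."""

    def __init__(self, items):
        self.slots = list(items)

    def lookup(self, offnum):
        """Return the item stored at offnum, or None if there is no such slot."""
        if 1 <= offnum <= len(self.slots):
            return self.slots[offnum - 1]
        return None

    def delete_compacting(self, offnum):
        # Later items shift down, so their item numbers change.
        del self.slots[offnum - 1]

    def delete_no_compact(self, offnum):
        # The slot is only marked unused; every other item keeps its number.
        self.slots[offnum - 1] = None


# Hypothetical revmap: page range number -> item number on the index page.
revmap = {0: 1, 1: 2, 2: 3}

compacted = ToyPage(["summary0", "summary1", "summary2"])
compacted.delete_compacting(2)

stable = ToyPage(["summary0", "summary1", "summary2"])
stable.delete_no_compact(2)

# After compaction the revmap entry for range 2 points past the last slot.
print(compacted.lookup(revmap[2]))
# Without compaction the same revmap entry still finds the right summary.
print(stable.lookup(revmap[2]))
```

In the compacting case every revmap entry that pointed past the deleted item would have to be rewritten, which is the cost the non-compacting variant avoids.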
On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >> > Here's an updated version of this patch, with fixes to all the bugs >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and >> > Amit Kapila for the reports. >> >> I'm not very happy with the use of a separate relation fork for >> storing this data. > > Here's a new version of this patch. Now the revmap is not stored in a > separate fork, but together with all the regular data, as explained > elsewhere in the thread. Cool. Have you thought more about this comment from Heikki? http://www.postgresql.org/message-id/52495DD3.9010809@vmware.com I'm concerned that we could end up with one index type of this general nature for min/max type operations, and then another very similar index type for geometric operators or text-search operators or what have you. Considering the overhead in adding and maintaining an index AM, I think we should try to be sure that we've done a reasonably solid job making each one as general as we reasonably can. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-06-17 10:26:11 -0400, Robert Haas wrote: > On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Robert Haas wrote: > >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > >> <alvherre@2ndquadrant.com> wrote: > >> > Here's an updated version of this patch, with fixes to all the bugs > >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > >> > Amit Kapila for the reports. > >> > >> I'm not very happy with the use of a separate relation fork for > >> storing this data. > > > > Here's a new version of this patch. Now the revmap is not stored in a > > separate fork, but together with all the regular data, as explained > > elsewhere in the thread. > > Cool. > > Have you thought more about this comment from Heikki? > > http://www.postgresql.org/message-id/52495DD3.9010809@vmware.com Is there actually a significant usecase behind that wish or just a general demand for being generic? To me it seems fairly unlikely you'd end up with something useful by doing a minmax index over bounding boxes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 17, 2014 at 3:31 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Is there actually a significant usecase behind that wish or just a > general demand for being generic? To me it seems fairly unlikely you'd > end up with something useful by doing a minmax index over bounding > boxes. Isn't min/max just a 2d bounding box? If you do a bulk data load of something like the census data then sure, every page will have data points for some geometrically clustered set of data. I had in mind to do a small bloom filter per block. In general any kind of predicate like bounding box should work. -- greg
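The per-block Bloom filter idea mentioned above could look roughly like the following. This is a minimal sketch, not anything from the patch: the class name `BlockBloom`, the 64-bit size, the three hash functions and the use of SHA-256 are all assumptions chosen to keep the example short.

```python
import hashlib


class BlockBloom:
    """Tiny Bloom filter summarizing the values stored in one block (toy sizes)."""

    def __init__(self, nbits=64, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0  # the bitmap, kept in a Python int

    def _positions(self, value):
        # Derive nhashes bit positions from salted SHA-256 digests.
        for salt in range(self.nhashes):
            digest = hashlib.sha256(f"{salt}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "read the block and recheck".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


block = BlockBloom()
for name in ("alice", "bob", "carol"):
    block.add(name)

# Values that were added are never reported as absent.
print(all(block.might_contain(name) for name in ("alice", "bob", "carol")))
```

Like a min/max pair, such a filter can produce false positives but never false negatives, so it fits the same "skip blocks that cannot match" scan strategy, and unlike min/max it answers equality lookups on data with no useful ordering.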
On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-06-17 10:26:11 -0400, Robert Haas wrote: >> On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >> > Robert Haas wrote: >> >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera >> >> <alvherre@2ndquadrant.com> wrote: >> >> > Here's an updated version of this patch, with fixes to all the bugs >> >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and >> >> > Amit Kapila for the reports. >> >> >> >> I'm not very happy with the use of a separate relation fork for >> >> storing this data. >> > >> > Here's a new version of this patch. Now the revmap is not stored in a >> > separate fork, but together with all the regular data, as explained >> > elsewhere in the thread. >> >> Cool. >> >> Have you thought more about this comment from Heikki? >> >> http://www.postgresql.org/message-id/52495DD3.9010809@vmware.com > > Is there actually a significant usecase behind that wish or just a > general demand for being generic? To me it seems fairly unlikely you'd > end up with something useful by doing a minmax index over bounding > boxes. Well, I'm not the guy who does things with geometric data, but I don't want to ignore the significant percentage of our users who are. As you must surely know, the GIST implementations for geometric data types store bounding boxes on internal pages, and that seems to be useful to people. What is your reason for thinking that it would be any less useful in this context? I do also think that a general demand for being generic ought to carry some weight. We have gone to great lengths to make sure that our indexing can handle more than just < and >, where a lot of other products have not bothered. I think we have gotten a lot of mileage out of that decision and feel that we shouldn't casually back away from it. 
Obviously, we do already have some special-case optimizations and will likely have more in the future, and there can certainly be valid reasons for taking that approach. But it needs to be justified in some way; we shouldn't accept a less-generic approach blindly, without questioning whether it's possible to do better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-06-17 11:48:10 -0400, Robert Haas wrote: > On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-06-17 10:26:11 -0400, Robert Haas wrote: > >> On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera > >> <alvherre@2ndquadrant.com> wrote: > >> > Robert Haas wrote: > >> >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > >> >> <alvherre@2ndquadrant.com> wrote: > >> >> > Here's an updated version of this patch, with fixes to all the bugs > >> >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > >> >> > Amit Kapila for the reports. > >> >> > >> >> I'm not very happy with the use of a separate relation fork for > >> >> storing this data. > >> > > >> > Here's a new version of this patch. Now the revmap is not stored in a > >> > separate fork, but together with all the regular data, as explained > >> > elsewhere in the thread. > >> > >> Cool. > >> > >> Have you thought more about this comment from Heikki? > >> > >> http://www.postgresql.org/message-id/52495DD3.9010809@vmware.com > > > > Is there actually a significant usecase behind that wish or just a > > general demand for being generic? To me it seems fairly unlikely you'd > > end up with something useful by doing a minmax index over bounding > > boxes. > > Well, I'm not the guy who does things with geometric data, but I don't > want to ignore the significant percentage of our users who are. As > you must surely know, the GIST implementations for geometric data > types store bounding boxes on internal pages, and that seems to be > useful to people. What is your reason for thinking that it would be > any less useful in this context? For me minmax indexes are helpful because they allow generating *small* 'coarse' indexes over large volumes of data. From my point of view that's possible because they don't contain item pointers for every contained row.
That'll IMO work well if there are consecutive rows in the table that can be summarized into one min/max range. That's quite likely to happen for common applications of a number of scalar datatypes. But the likelihood of placing sufficiently many rows with very similar bounding boxes close together seems much less relevant in practice. And I think that's generally likely for operations which can't be well represented as btree opclasses - the substructure that implies inside a Datum will make correlation between consecutive rows less likely. Maybe I've a major intuition failure here though... > I do also think that a general demand for being generic ought to carry > some weight. Agreed. It's always a balancing act. But it's not like this doesn't use a datatype abstraction concept... > We have gone to great lengths to make sure that our > indexing can handle more than just < and >, where a lot of other > products have not bothered. I think we have gotten a lot of mileage > out of that decision and feel that we shouldn't casually back away > from it. I don't see this as a case of backing away from that though? > we shouldn't accept a less-generic > approach blindly, without questioning whether it's possible to do > better. But the aim shouldn't be to add genericity that's not going to be used, but to add it where it's somewhat likely to help... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> Well, I'm not the guy who does things with geometric data, but I don't >> want to ignore the significant percentage of our users who are. As >> you must surely know, the GIST implementations for geometric data >> types store bounding boxes on internal pages, and that seems to be >> useful to people. What is your reason for thinking that it would be >> any less useful in this context? > > For me minmax indexes are helpful because they allow to generate *small* > 'coarse' indexes over large volumes of data. From my pov that's possible > possible because they don't contain item pointers for every contained > row. > That'ill imo work well if there are consecutive rows in the table that > can be summarized into one min/max range. That's quite likely to happen > for common applications of number of scalar datatypes. But the > likelihood of placing sufficiently many rows with very similar bounding > boxes close together seems much less relevant in practice. And I think > that's generally likely for operations which can't be well represented > as btree opclasses - the substructure that implies inside a Datum will > make correlation between consecutive rows less likely. Well, I don't know: suppose you're loading geospatial data showing the location of every building in some country. It might easily be the case that the data is or can be loaded in an order that provides pretty good spatial locality, leading to tight bounding boxes over physically consecutive data ranges. But I'm not trying to say that we absolutely have to support that kind of thing; what I am trying to say is that there should be a README or a mailing list post or some such that says: "We thought about how generic to make this. We considered A, B, and C. We rejected C as too narrow, and A because if we made it that general it would have greatly enlarged the disk footprint for the following reasons. Therefore we selected B." 
Basically, I think Heikki asked a good question - which was "could we abstract this more?" - and I can't recall seeing a clear answer explaining why we could or couldn't and what the trade-offs would be. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-06-17 12:14:00 -0400, Robert Haas wrote: > On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> Well, I'm not the guy who does things with geometric data, but I don't > >> want to ignore the significant percentage of our users who are. As > >> you must surely know, the GIST implementations for geometric data > >> types store bounding boxes on internal pages, and that seems to be > >> useful to people. What is your reason for thinking that it would be > >> any less useful in this context? > > > > For me minmax indexes are helpful because they allow to generate *small* > > 'coarse' indexes over large volumes of data. From my pov that's possible > > possible because they don't contain item pointers for every contained > > row. > > That'ill imo work well if there are consecutive rows in the table that > > can be summarized into one min/max range. That's quite likely to happen > > for common applications of number of scalar datatypes. But the > > likelihood of placing sufficiently many rows with very similar bounding > > boxes close together seems much less relevant in practice. And I think > > that's generally likely for operations which can't be well represented > > as btree opclasses - the substructure that implies inside a Datum will > > make correlation between consecutive rows less likely. > > Well, I don't know: suppose you're loading geospatial data showing the > location of every building in some country. It might easily be the > case that the data is or can be loaded in an order that provides > pretty good spatial locality, leading to tight bounding boxes over > physically consecutive data ranges. Well, it might be doable to correlate them along one axis, but along both? That's more complicated... And even alongside one axis you already get into problems if your geometries are irregularly sized. 
A single large polygon will completely destroy indexability for anything stored physically close by because suddenly the minmax range will be huge... So you'll need to cleverly sort for that as well. I think hierarchical data structures are so much better suited for this that there's little point trying to fit them into minmax. I can very well imagine that there's benefit in GiST support for only storing one pointer per block instead of one pointer per item or such. But that seems like a separate feature. > But I'm not trying to say that we absolutely have to support that kind > of thing; what I am trying to say is that there should be a README or > a mailing list post or some such that says: "We thought about how > generic to make this. We considered A, B, and C. We rejected C as > too narrow, and A because if we made it that general it would have > greatly enlarged the disk footprint for the following reasons. > Therefore we selected B." Isn't 'simpler implementation' a valid reason that's already been discussed onlist? Obviously simpler implementation doesn't trump everything, but it's one valid reason... Note that I have nothing to do with the design of this feature. I work for the same company as Alvaro, but that's pretty much it... > Basically, I think Heikki asked a good > question - which was "could we abstract this more?" - and I can't > recall seeing a clear answer explaining why we could or couldn't and > what the trade-offs would be. 'could we abstract more' IMO is a pretty bad design guideline. The question should be 'is there benefit in abstracting more'. Otherwise you end up with way too complicated systems. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 17, 2014 at 1:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: > For me minmax indexes are helpful because they allow to generate *small* > 'coarse' indexes over large volumes of data. From my pov that's possible > possible because they don't contain item pointers for every contained > row. But minmax is just a specific form of bloom filter. This could certainly be generalized to a bloom filter index with some set of bloom&hashing operators (minmax being just one).
On 06/17/2014 09:14 AM, Robert Haas wrote: > Well, I don't know: suppose you're loading geospatial data showing the > location of every building in some country. It might easily be the > case that the data is or can be loaded in an order that provides > pretty good spatial locality, leading to tight bounding boxes over > physically consecutive data ranges. I admin a production application which has exactly this. However, that application doesn't have big enough data to benefit from minmax indexes; it uses the basic spatial indexes. So, my $0.02: bounding box minmax falls under the heading of "would be nice to have, but not if it delays the feature". -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Well, it might be doable to correlate them along one axis, but along
> both? That's more complicated... And even alongside one axis you
> already get into problems if your geometries are irregularly sized.
> Asingle large polygon will completely destroy indexability for anything
> stored physically close by because suddently the minmax range will be
> huge... So you'll need to cleverly sort for that as well.

I think there's a misunderstanding here, possibly mine. My understanding is that a min/max index will always be exactly the same size for a given size table. It stores the minimum and maximum value of the key for each page. Then you can do a bitmap scan by comparing the search key with each page's minimum and maximum to see if that page needs to be included in the scan. The failure mode is not that the index is large but that a page that has an outlier will be included in every scan as a false positive, incurring an extra I/O.

I don't think it's implausible at all that geometric data would work well. If you load geometric data it's very common to load data by geographic area so that all objects in San Francisco are in one part of the data load, probably even by zip code or census block.

What operations would an opclass for min/max need? I think it would be pretty similar to the operators that GiST needs (thankfully minus the baroque page split function):

An aggregate to generate a min/max "bounding box" from several values
A function which takes a "bounding box" and a new value and returns the new "bounding box"
A function which tests if a value is in a "bounding box"
A function which tests if a "bounding box" overlaps a "bounding box"

The nice thing is this would let us add things like range @> (contains element) to the plain integer min/max case.

-- greg
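The four support operations listed above can be sketched for both the scalar case and a two-dimensional bounding box. This is an illustration only, not the patch's opclass API: `MinMax`, `BBox`, `from_values`, `add`, `contains`, `overlaps` and `candidate_ranges` are hypothetical names, and points are assumed to be plain `(x, y)` tuples.

```python
class MinMax:
    """Summary for a scalar column: the closed interval [lo, hi]."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    @classmethod
    def from_values(cls, values):      # 1. aggregate over several values
        values = list(values)
        return cls(min(values), max(values))

    def add(self, value):              # 2. summary + new value -> new summary
        return MinMax(min(self.lo, value), max(self.hi, value))

    def contains(self, value):         # 3. is the value inside the summary?
        return self.lo <= value <= self.hi

    def overlaps(self, other):         # 4. do two summaries overlap?
        return self.lo <= other.hi and other.lo <= self.hi


class BBox:
    """The same four operations for 2-D points: one MinMax per axis."""

    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys

    @classmethod
    def from_values(cls, points):
        points = list(points)
        return cls(MinMax.from_values(p[0] for p in points),
                   MinMax.from_values(p[1] for p in points))

    def add(self, point):
        return BBox(self.xs.add(point[0]), self.ys.add(point[1]))

    def contains(self, point):
        return self.xs.contains(point[0]) and self.ys.contains(point[1])

    def overlaps(self, other):
        return self.xs.overlaps(other.xs) and self.ys.overlaps(other.ys)


def candidate_ranges(summaries, key):
    """Ranges a scan must visit; every hit still needs a recheck against the heap."""
    return [i for i, summary in enumerate(summaries) if summary.contains(key)]


ints = [MinMax.from_values(chunk) for chunk in ([1, 5, 9], [10, 14, 19], [20, 30])]
print(candidate_ranges(ints, 14))          # [1]

boxes = [BBox.from_values([(0, 0), (2, 3)]), BBox.from_values([(10, 10), (12, 11)])]
print(candidate_ranges(boxes, (11, 10)))   # [1]
```

The scan code in `candidate_ranges` is identical for both summary types, which is the point of the generalization being discussed: only the four operations differ per datatype.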
On 06/17/2014 09:16 PM, Andres Freund wrote: > On 2014-06-17 12:14:00 -0400, Robert Haas wrote: >> On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote: >>>> Well, I'm not the guy who does things with geometric data, but I don't >>>> want to ignore the significant percentage of our users who are. As >>>> you must surely know, the GIST implementations for geometric data >>>> types store bounding boxes on internal pages, and that seems to be >>>> useful to people. What is your reason for thinking that it would be >>>> any less useful in this context? >>> >>> For me minmax indexes are helpful because they allow to generate *small* >>> 'coarse' indexes over large volumes of data. From my pov that's possible >>> possible because they don't contain item pointers for every contained >>> row. >>> That'ill imo work well if there are consecutive rows in the table that >>> can be summarized into one min/max range. That's quite likely to happen >>> for common applications of number of scalar datatypes. But the >>> likelihood of placing sufficiently many rows with very similar bounding >>> boxes close together seems much less relevant in practice. And I think >>> that's generally likely for operations which can't be well represented >>> as btree opclasses - the substructure that implies inside a Datum will >>> make correlation between consecutive rows less likely. >> >> Well, I don't know: suppose you're loading geospatial data showing the >> location of every building in some country. It might easily be the >> case that the data is or can be loaded in an order that provides >> pretty good spatial locality, leading to tight bounding boxes over >> physically consecutive data ranges. > > Well, it might be doable to correlate them along one axis, but along > both? That's more complicated... And even alongside one axis you > already get into problems if your geometries are irregularly sized. Sure, there are cases where it would be useless. 
But it's easy to imagine scenarios where it would work well, where points are loaded in clusters and points that are close to each other also end up physically close to each other. > Asingle large polygon will completely destroy indexability for anything > stored physically close by because suddently the minmax range will be > huge... So you'll need to cleverly sort for that as well. That's an inherent risk with minmax indexes: insert a few rows into the "wrong" locations in the heap, and the selectivity of the index degrades rapidly. The main problem with using it for geometric types is that you can't easily CLUSTER the table to make the minmax index effective again. But there are ways around that. >> But I'm not trying to say that we absolutely have to support that kind >> of thing; what I am trying to say is that there should be a README or >> a mailing list post or some such that says: "We thought about how >> generic to make this. We considered A, B, and C. We rejected C as >> too narrow, and A because if we made it that general it would have >> greatly enlarged the disk footprint for the following reasons. >> Therefore we selected B." > > Isn't 'simpler implementation' a valid reason that's already been > discussed onlist? Obviously simpler implementation doesn't trump > everything, but it's one valid reason... > Note that I have zap to do with the design of this feature. I work for > the same company as Alvaro, but that's pretty much it... Without some analysis (e.g. implementing it and comparing), I don't buy that it makes the implementation simpler to restrict it in this way. Maybe it does, but often it's actually simpler to solve the general case. - Heikki
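How quickly a misplaced row hurts can be shown with a small simulation. This is a sketch under simplifying assumptions: rows are plain integers, a "range" is a fixed number of consecutive rows rather than pages, and the helper names (`summarize`, `ranges_to_scan`, `false_positive_rate`) are hypothetical.

```python
def summarize(rows, rows_per_range):
    """Return one (min, max) pair per consecutive group of rows."""
    chunks = (rows[i:i + rows_per_range] for i in range(0, len(rows), rows_per_range))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def ranges_to_scan(summaries, key):
    """Indexes of the ranges whose [min, max] interval covers key."""
    return [i for i, (lo, hi) in enumerate(summaries) if lo <= key <= hi]


def false_positive_rate(rows, rows_per_range, keys):
    """Fraction of range visits in which the range holds no matching row."""
    summaries = summarize(rows, rows_per_range)
    visits = wasted = 0
    for key in keys:
        for i in ranges_to_scan(summaries, key):
            chunk = rows[i * rows_per_range:(i + 1) * rows_per_range]
            visits += 1
            wasted += key not in chunk
    return wasted / visits


clustered = list(range(1000))      # values arrive in physical order
polluted = list(clustered)
polluted[5] = 999_999              # one outlier lands in the first range

keys = range(0, 1000, 10)
print(false_positive_rate(clustered, 100, keys))   # 0.0
print(false_positive_rate(polluted, 100, keys))
```

With the single outlier in place, the first range's interval covers every search key, so it is visited by every lookup even though it can satisfy only the ones below 100.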
On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote: > On 06/17/2014 09:16 PM, Andres Freund wrote: > >Well, it might be doable to correlate them along one axis, but along > >both? That's more complicated... And even alongside one axis you > >already get into problems if your geometries are irregularly sized. > > Sure, there are cases where it would be useless. But it's easy to imagine > scenarios where it would work well, where points are loaded in clusters and > points that are close to each other also end up physically close to each > other. > >Asingle large polygon will completely destroy indexability for anything > >stored physically close by because suddently the minmax range will be > >huge... So you'll need to cleverly sort for that as well. > > That's an inherent risk with minmax indexes: insert a few rows to the > "wrong" locations in the heap, and the selectivity of the index degrades > rapidly. Sure. But it's fairly normal to have natural clusteredness in many columns (surrogate keys, date-series type of data). Even if you insert geometric types in geographic clusters you'll have worse results because some bounding boxes will be big and such. And: > The main problem with using it for geometric types is that you can't easily > CLUSTER the table to make the minmax index effective again. But there are > ways around that. Which are? Sure, you can try stuff like recreating the table, sorting rows with a bounding-box area above a threshold first, and then sorting by the top left corner of the bounding box. But that'll be neither builtin, nor convenient, nor perfect. That's in contrast to a normal CLUSTER for types with a btree opclass, which will yield the perfect order. > >Isn't 'simpler implementation' a valid reason that's already been > >discussed onlist? Obviously simpler implementation doesn't trump > >everything, but it's one valid reason... > >Note that I have zap to do with the design of this feature.
I work for > >the same company as Alvaro, but that's pretty much it... > > Without some analysis (e.g implementing it and comparing), I don't buy that > it makes the implementation simpler to restrict it in this way. Maybe it > does, but often it's actually simpler to solve the general case. So to implement a feature one now has to implement the most generic variant as a prototype first? Really? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-06-17 16:48:07 -0700, Greg Stark wrote: > On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > Well, it might be doable to correlate them along one axis, but along > > both? That's more complicated... And even alongside one axis you > > already get into problems if your geometries are irregularly sized. > > Asingle large polygon will completely destroy indexability for anything > > stored physically close by because suddently the minmax range will be > > huge... So you'll need to cleverly sort for that as well. > > I think there's a misunderstanding here, possibly mine. My > understanding is that a min/max index will always be exactly the same > size for a given size table. It stores the minimum and maximum value > of the key for each page. Then you can do a bitmap scan by comparing > the search key with each page's minimum and maximum to see if that > page needs to be included in the scan. The failure mode is not that > the index is large but that a page that has an outlier will be > included in every scan as a false positive incurring an extra iop. I just rechecked, and no, it doesn't, by default, store a range for each page. It's MINMAX_DEFAULT_PAGES_PER_RANGE=128 pages by default... I haven't checked what's the lowest it can be set to. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
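The page-range numbers quoted in this thread can be checked with a little arithmetic. This is a sketch that assumes the default 8 kB block size; the constant and function names are made up for the example, and only the values 128 and 131072 come from the thread.

```python
BLOCK_SIZE = 8192                 # default PostgreSQL block size in bytes (assumed)
DEFAULT_PAGES_PER_RANGE = 128     # default quoted in the thread
MAX_PAGES_PER_RANGE = 131072      # arbitrary maximum quoted in the thread


def range_number(heap_block, pages_per_range=DEFAULT_PAGES_PER_RANGE):
    """Page range that a given heap block falls into."""
    return heap_block // pages_per_range


def range_count(heap_blocks, pages_per_range=DEFAULT_PAGES_PER_RANGE):
    """Number of summaries needed for a table of heap_blocks pages (rounded up)."""
    return -(-heap_blocks // pages_per_range)


print(DEFAULT_PAGES_PER_RANGE * BLOCK_SIZE)        # 1048576 bytes, i.e. 1 MiB per range
print(MAX_PAGES_PER_RANGE * BLOCK_SIZE == 2**30)   # True: one 1 GB segment
print(range_count(131072))                         # 1024 summaries for a 1 GB table
print(range_number(300, pages_per_range=2))        # 150
```

So with the default setting each summary covers 1 MiB of heap, and a single outlier widens the interval for 128 pages at once rather than for one page.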
On 06/18/2014 01:46 PM, Andres Freund wrote: > On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote: >> The main problem with using it for geometric types is that you can't easily >> CLUSTER the table to make the minmax index effective again. But there are >> ways around that. > > Which are? Sure you can try stuff like recreating the table, sorting > rows with boundary boxes area above threshold first, and then go on to > sort by the lop left corner of the bounding box. Right, something like that. Or cluster using some other column that correlates with the geometry, like a zip code. > But that'll be neither > builtin, nor convenient, nor perfect. In contrast to a normal CLUSTER > for types with a btree opclass which will yield the perfect order. Sure. BTW, CLUSTERing by a geometric type would be useful anyway, even without minmax indexes. >>> Isn't 'simpler implementation' a valid reason that's already been >>> discussed onlist? Obviously simpler implementation doesn't trump >>> everything, but it's one valid reason... >>> Note that I have zap to do with the design of this feature. I work for >>> the same company as Alvaro, but that's pretty much it... >> >> Without some analysis (e.g implementing it and comparing), I don't buy that >> it makes the implementation simpler to restrict it in this way. Maybe it >> does, but often it's actually simpler to solve the general case. > > So to implement a feature one now has to implement the most generic > variant as a prototype first? Really? Implementing something is a good way to demonstrate what it would look like. But no, I don't insist on implementing every possible design whenever a new feature is proposed. I liked Greg's sketch of what the opclass support functions would be. It doesn't seem significantly more complicated than what's in the patch now. - Heikki
On Tue, Jun 17, 2014 at 2:16 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> But I'm not trying to say that we absolutely have to support that kind >> of thing; what I am trying to say is that there should be a README or >> a mailing list post or some such that says: "We thought about how >> generic to make this. We considered A, B, and C. We rejected C as >> too narrow, and A because if we made it that general it would have >> greatly enlarged the disk footprint for the following reasons. >> Therefore we selected B." > > Isn't 'simpler implementation' a valid reason that's already been > discussed onlist? Obviously simpler implementation doesn't trump > everything, but it's one valid reason... > Note that I have zap to do with the design of this feature. I work for > the same company as Alvaro, but that's pretty much it... It really *hasn't* been discussed on-list. See these emails, discussing the same ideas, from 8 months ago: http://www.postgresql.org/message-id/5249B2D3.6030606@vmware.com http://www.postgresql.org/message-id/CA+TgmoYSCbW-UC8LQV96sziGnXSuzAyQbfdQmK-FGu22HdKkaw@mail.gmail.com Now, Alvaro did not respond to those emails, nor did anyone involved in the development of the feature. There may be an argument that implementing that would be too complicated, but Heikki said he didn't think it would be, and nobody's made a concrete argument as to why he's wrong (and Heikki knows a lot about indexing). >> Basically, I think Heikki asked a good >> question - which was "could we abstract this more?" - and I can't >> recall seeing a clear answer explaining why we could or couldn't and >> what the trade-offs would be. > > 'could we abstract more' imo is a pretty bad design guideline. It's 'is > there benefit in abstracting more'. Otherwise you end up with way to > complicated systems. 
On the flip side, if you don't abstract enough, you end up being able to cover only a small set of the relevant use cases, or else you end up with a bunch of poorly-coordinated tools to cover slightly different use cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/18/2014 12:46 PM, Andres Freund wrote: >>> Isn't 'simpler implementation' a valid reason that's already been >>> discussed onlist? Obviously simpler implementation doesn't trump >>> everything, but it's one valid reason... >>> Note that I have zap to do with the design of this feature. I work for >>> the same company as Alvaro, but that's pretty much it... >> >> Without some analysis (e.g implementing it and comparing), I don't buy that >> it makes the implementation simpler to restrict it in this way. Maybe it >> does, but often it's actually simpler to solve the general case. > > So to implement a feature one now has to implement the most generic > variant as a prototype first? Really? Well, there is the inventor's paradox to consider. -- Vik
On 06/18/2014 06:09 PM, Claudio Freire wrote: > On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark <stark@mit.edu> wrote: >> An aggregate to generate a min/max "bounding box" from several values >> A function which takes an "bounding box" and a new value and returns >> the new "bounding box" >> A function which tests if a value is in a "bounding box" >> A function which tests if a "bounding box" overlaps a "bounding box" > > Which I'd generalize a bit further by renaming "bounding box" with > "compressed set", and allow it to be parameterized. What do you mean by parameterized? > So, you have: > > An aggregate to generate a "compressed set" from several values > A function which adds a new value to the "compressed set" and returns > the new "compressed set" > A function which tests if a value is in a "compressed set" > A function which tests if a "compressed set" overlaps another > "compressed set" of equal type Yeah, something like that. I'm not sure I like the "compressed set" term any more than bounding box, though. GiST seems to have avoided naming the thing, and just talks about "index entries". But if we can come up with a good name, that would be more clear. > One problem with such a generalized implementation would be, that I'm > not sure in-place modification of the "compressed set" on-disk can be > assumed to be safe on all cases. Surely, for strictly-enlarging sets > it would, but while min/max and bloom filters both fit the bill, it's > not clear that one can assume this for all structures. I don't understand what you mean. It's a fundamental property of minmax indexes that you can always replace the "min" or "max" or "compressing set" or "bounding box" or whatever with another datum that represents all the keys that the old one did, plus some. - Heikki
Vik Fearing <vik.fearing@dalibo.com> writes: > On 06/18/2014 12:46 PM, Andres Freund wrote: >> So to implement a feature one now has to implement the most generic >> variant as a prototype first? Really? > Well, there is the inventor's paradox to consider. I have not seen anyone demanding a different implementation in this thread. What *has* been asked for, and not supplied, is a concrete defense of the particular level of generality that's been selected in this implementation. It's not at all clear to the rest of us whether it was the right choice, and that is something that ought to be asked now not later. regards, tom lane
On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Implementing something is a good way to demonstrate how it would look like. > But no, I don't insist on implementing every possible design whenever a new > feature is proposed. > > I liked Greg's sketch of what the opclass support functions would be. It > doesn't seem significantly more complicated than what's in the patch now. As a counter-point to my own point, there will be nothing stopping us in the future from generalizing things. Dealing with catalogs is mostly book-keeping headaches and careful work. It's something that might be well-suited for a GSOC or first patch from someone looking to familiarize themselves with the system architecture. It's hard to invent a whole new underlying infrastructure at the same time as dealing with all that book-keeping, and it's hard for someone familiarizing themselves with the system to also have a great new idea. Having tasks like this that are easy to explain and that the mentor understands well can be easier to manage than tasks where the newcomer has some radical new idea. -- greg
On Thu, Jun 19, 2014 at 10:06 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 06/18/2014 06:09 PM, Claudio Freire wrote: >> >> On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark <stark@mit.edu> wrote: >>> >>> An aggregate to generate a min/max "bounding box" from several values >>> A function which takes an "bounding box" and a new value and returns >>> the new "bounding box" >>> A function which tests if a value is in a "bounding box" >>> A function which tests if a "bounding box" overlaps a "bounding box" >> >> >> Which I'd generalize a bit further by renaming "bounding box" with >> "compressed set", and allow it to be parameterized. > > > What do you mean by parameterized? Bloom filters can be parameterized by the number of hashes, the number of bit positions, and the hash function, so it's not simply "a bloom filter index", but a bloom filter index with N SHA-1-based hashes spread over a K-length bitmap. >> So, you have: >> >> An aggregate to generate a "compressed set" from several values >> A function which adds a new value to the "compressed set" and returns >> the new "compressed set" >> A function which tests if a value is in a "compressed set" >> A function which tests if a "compressed set" overlaps another >> "compressed set" of equal type > > > Yeah, something like that. I'm not sure I like the "compressed set" term any > more than bounding box, though. GiST seems to have avoided naming the thing, > and just talks about "index entries". But if we can come up with a good > name, that would be more clear. I don't want to use the term bloom filter since it's specific to one technique, but it's basically that - an approximate set without false negatives. I.e., a compressed set. It's not a bounding box either when using bloom filters. So... >> One problem with such a generalized implementation would be, that I'm >> not sure in-place modification of the "compressed set" on-disk can be >> assumed to be safe on all cases.
Surely, for strictly-enlarging sets >> it would, but while min/max and bloom filters both fit the bill, it's >> not clear that one can assume this for all structures. > > > I don't understand what you mean. It's a fundamental property of minmax > indexes that you can always replace the "min" or "max" or "compressing set" > or "bounding box" or whatever with another datum that represents all the > keys that the old one did, plus some. Yes, and bloom filters happen to fall into that category too. Never mind what I said. I was thinking of some other hypothetical implementation that supports removal or updates, which might need care with transaction lifetimes, but that's easily fixed by letting vacuum or some lazy process do the deleting, just as happens with other indexes anyway. So, I guess the interface must also include the invariant that compressed sets only grow, and never shrink except through a rebuild or a vacuum operation.
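The four-function interface discussed above, together with the "only grows, never shrinks" invariant, can be sketched roughly as follows. This is an illustration only, not code from the patch: the class names MinMaxSet and BloomSet, the method names, and the default bitmap size and hash count are all assumptions made up for this example.

```python
import hashlib


class MinMaxSet:
    """Hypothetical "compressed set" that keeps only the min and max seen."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def add(self, value):
        # Widening only: the summary can grow but never shrink.
        if self.lo is None:
            self.lo = self.hi = value
        else:
            self.lo = min(self.lo, value)
            self.hi = max(self.hi, value)
        return self

    def may_contain(self, value):
        # False positives are allowed, false negatives are not.
        return self.lo is not None and self.lo <= value <= self.hi

    def overlaps(self, other):
        if self.lo is None or other.lo is None:
            return False
        return self.lo <= other.hi and other.lo <= self.hi


class BloomSet:
    """Hypothetical bloom-filter "compressed set", parameterized by
    bitmap length (nbits) and number of SHA-1-based hashes (nhashes)."""

    def __init__(self, nbits=256, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0

    def _positions(self, value):
        for i in range(self.nhashes):
            digest = hashlib.sha1(f"{i}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, value):
        # Bits are only ever set, so this set is strictly enlarging too.
        for pos in self._positions(value):
            self.bits |= 1 << pos
        return self

    def may_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

    def overlaps(self, other):
        # Conservative test, valid only for filters with equal parameters:
        # two sets sharing an element necessarily share some set bit.
        return (self.bits & other.bits) != 0
```

The "aggregate" in the list is then just repeated calls to add() starting from an empty set. Both classes answer "maybe" or "definitely not", which is all an index over page ranges needs.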
On Wed, Jun 18, 2014 at 8:51 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > > I liked Greg's sketch of what the opclass support functions would be. It > doesn't seem significantly more complicated than what's in the patch now. Which was On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark <stark@mit.edu> wrote: > An aggregate to generate a min/max "bounding box" from several values > A function which takes an "bounding box" and a new value and returns > the new "bounding box" > A function which tests if a value is in a "bounding box" > A function which tests if a "bounding box" overlaps a "bounding box" Which I'd generalize a bit further by renaming "bounding box" to "compressed set", and allowing it to be parameterized. So, you have: An aggregate to generate a "compressed set" from several values A function which adds a new value to the "compressed set" and returns the new "compressed set" A function which tests if a value is in a "compressed set" A function which tests if a "compressed set" overlaps another "compressed set" of equal type If you can define different compressed sets, you can use this to generate both min/max indexes as well as bloom filter indexes. Whether we'd want to have both is perhaps questionable, but having the ability to is probably desirable. One problem with such a generalized implementation is that I'm not sure in-place modification of the "compressed set" on-disk can be assumed to be safe in all cases. Surely, for strictly-enlarging sets it would, but while min/max and bloom filters both fit the bill, it's not clear that one can assume this for all structures. Also adding an "in-place updateable" bit to the "type" would perhaps inflate the complexity of the patch due to the need to provide both code paths?
I'm sorry if I missed something, but ISTM this is beginning to look a lot like GiST. This was pointed out by Robert Haas last year. On Wed, Jun 18, 2014 at 12:09:42PM -0300, Claudio Freire wrote: > So, you have: > > An aggregate to generate a "compressed set" from several values Which GiST does by calling 'compress' on each value, and then 'union'ing the results together. > A function which adds a new value to the "compressed set" and returns > the new "compressed set" Again, 'compress' + 'union' > A function which tests if a value is in a "compressed set" Which GiST does using 'compress' + 'consistent' > A function which tests if a "compressed set" overlaps another > "compressed set" of equal type Which GiST calls 'consistent' So I'm wondering why you can't just reuse the btree_gist functions we already have in contrib. It seems to me that these MinMax indexes are in fact a variation on GiST that indexes the pages of a table based upon the 'union' of all the elements in a page. By reusing the GiST operator class you get support for many datatypes for free. > If you can define different compressed sets, you can use this to > generate both min/max indexes as well as bloom filter indexes. Whether > we'd want to have both is perhaps questionable, but having the ability > to is probably desirable. You could implement a bloom filter in GiST too. It's been discussed before but I can't find any implementation. Probably because the filter needs to be parameterised and if you store the bloom filter for each element it gets expensive very quickly. However, hooked into a minmax structure which only indexes whole pages it could be quite efficient. > One problem with such a generalized implementation would be, that I'm > not sure in-place modification of the "compressed set" on-disk can be > assumed to be safe on all cases.
Surely, for strictly-enlarging sets > it would, but while min/max and bloom filters both fit the bill, it's > not clear that one can assume this for all structures. I think GiST has already solved this problem. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Some comments, aside from the design wrt. bounding boxes etc.: On 06/15/2014 05:34 AM, Alvaro Herrera wrote: > Robert Haas wrote: >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >>> Here's an updated version of this patch, with fixes to all the bugs >>> reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and >>> Amit Kapila for the reports. >> >> I'm not very happy with the use of a separate relation fork for >> storing this data. > > Here's a new version of this patch. Now the revmap is not stored in a > separate fork, but together with all the regular data, as explained > elsewhere in the thread. Thanks! Please update the README accordingly. If I understand the code correctly, the revmap is a three-level deep structure. The bottom level consists of "regular revmap pages", and each regular revmap page is filled with ItemPointerDatas, which point to the index tuples. The middle level consists of "array revmap pages", and each array revmap page contains an array of BlockNumbers of the "regular revmap" pages. The top level is an array of BlockNumbers of the array revmap pages, and it is stored in the metapage. With 8k block size, that's just enough to cover the full range of 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each regular revmap page can store about 8192/6 = 1365 item pointers, each array revmap page can store about 8192/4 = 2048 block references, and the size of the top array is 8192/4. That's just enough; to store the required number of array pages in the top array, the array needs to be (2^32/1365)/2048 = 1536 elements large. But with 4k or smaller blocks, it's not enough. I wonder if it would be simpler to just always store the revmap pages at the beginning of the index, before any other pages. Finding the revmap page would then be just as easy as with a separate fork.
When the table/index is extended so that a new revmap page is needed, move the existing page at that block out of the way. Locking needs some consideration, but I think it would be feasible and simpler than you have now. > I have followed the suggestion by Amit to overwrite the index tuple when > a new heap tuple is inserted, instead of creating a separate index > tuple. This saves a lot of index bloat. This required a new entry > point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new > function PageIndexDeleteNoCompact which is similar in spirit to > PageIndexMultiDelete except that item pointers do not change. This is > necessary because the revmap stores item pointers, and such reference > would break if we were to renumber items in index pages. ISTM that when the old tuple cannot be updated in-place, the new index tuple is inserted with mm_doinsert(), but the old tuple is never deleted. - Heikki
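The revmap capacity estimate in the message above can be checked with a few lines of arithmetic. The sketch below follows the same simplification as the message (page headers are ignored, and the top array in the metapage is assumed to hold block_size/4 entries); the function name revmap_capacity is made up for this example. Rounding up rather than down gives 1537 instead of 1536 top-array entries for 8k blocks, which does not change the conclusion.

```python
ITEM_POINTER_SIZE = 6    # bytes per ItemPointerData
BLOCK_NUMBER_SIZE = 4    # bytes per BlockNumber
MAX_HEAP_BLOCKS = 2**32  # page ranges to cover when pages_per_range = 1


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def revmap_capacity(block_size):
    """Return (array pages needed, top-array slots available).

    Simplified model: page headers and special space are ignored.
    """
    tids_per_regular_page = block_size // ITEM_POINTER_SIZE
    refs_per_array_page = block_size // BLOCK_NUMBER_SIZE
    top_array_slots = block_size // BLOCK_NUMBER_SIZE

    regular_pages = ceil_div(MAX_HEAP_BLOCKS, tids_per_regular_page)
    array_pages = ceil_div(regular_pages, refs_per_array_page)
    return array_pages, top_array_slots


for size in (8192, 4096):
    needed, available = revmap_capacity(size)
    verdict = "fits" if needed <= available else "does NOT fit"
    print(f"{size}-byte blocks: need {needed} of {available} slots, {verdict}")
```

With 4k blocks the model needs several thousand top-array entries but has room for only 1024, which is the shortfall pointed out above.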
Heikki Linnakangas wrote: > Some comments, aside from the design wrt. bounding boxes etc. : Thanks. I haven't commented on that sub-thread because I think it's possible to come up with a reasonable design that solves the issue by adding a couple of amprocs. I need to do some more thinking to ensure it is really workable, and then I'll post my ideas. > On 06/15/2014 05:34 AM, Alvaro Herrera wrote: > >Robert Haas wrote: > If I understand the code correctly, the revmap is a three-level deep > structure. The bottom level consists of "regular revmap pages", and > each regular revmap page is filled with ItemPointerDatas, which > point to the index tuples. The middle level consists of "array > revmap pages", and each array revmap page contains an array of > BlockNumbers of the "regular revmap" pages. The top level is an > array of BlockNumbers of the array revmap pages, and it is stored in > the metapage. Yep, that's correct. Essentially, we still have the revmap as a linear space (containing TIDs); the other two levels on top of that are only there to enable locating the physical page numbers for each revmap logical page. I make one exception that the first logical revmap page is always stored in page 1, to optimize the case of a smallish table (~1360 page ranges; approximately 1.3 gigabytes of data at 128 pages per range, or 170 megabytes at 16 pages per range.) Each page has a page header (24 bytes) and special space (4 bytes), so it has 8192-28=8164 bytes available for data, so 1360 item pointers. > With 8k block size, that's just enough to cover the full range of > 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each > regular revmap page can store about 8192/6 = 1365 item pointers, > each array revmap page can store about 8192/4 = 2048 block > references, and the size of the top array is 8192/4. That's just > enough; to store the required number of array pages in the top > array, the array needs to be (2^32/1365)/2048)=1536 elements large. 
> > But with 4k or smaller blocks, it's not enough. Yeah. As I said elsewhere, actual useful values are likely to be close to the read-ahead setting of the underlying disk; by default that'd be 16 pages (128kB), but I think it's common advice to increase the kernel setting to improve performance. I don't think we need to prevent minmax indexes with pages_per_range=1, but I don't think we need to ensure that that setting works with the largest tables, either, because it doesn't make any sense to set it up like that. Also, while there are some recommendations to set up a system with larger page sizes (32kB), I have never seen any recommendation to set it lower. It wouldn't make sense to build a system that has very large tables and use a smaller page size. So in other words, yes, you're correct that the mechanism doesn't work in some cases (small page size and index configured for highest level of detail), but the conditions are such that I don't think it matters. ISTM the thing to do here is to do the math at index creation time, and if we find that we don't have enough space in the metapage for all array revmap pointers we need, bail out and require the index to be created with a larger pages_per_range setting. > I wonder if it would be simpler to just always store the revmap > pages in the beginning of the index, before any other pages. Finding > the revmap page would then be just as easy as with a separate fork. > When the table/index is extended so that a new revmap page is > needed, move the existing page at that block out of the way. Locking > needs some consideration, but I think it would be feasible and > simpler than you have now. Moving index items around is not easy, because you'd have to adjust the revmap to rewrite the item pointers. > >I have followed the suggestion by Amit to overwrite the index tuple when > >a new heap tuple is inserted, instead of creating a separate index > >tuple. This saves a lot of index bloat.
This required a new entry > >point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new > >function PageIndexDeleteNoCompact which is similar in spirit to > >PageIndexMultiDelete except that item pointers do not change. This is > >necessary because the revmap stores item pointers, and such reference > >would break if we were to renumber items in index pages. > > ISTM that when the old tuple cannot be updated in-place, the new > index tuple is inserted with mm_doinsert(), but the old tuple is > never deleted. It's deleted by the next vacuum. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
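The "do the math at index creation time" idea from the reply above might look roughly like the sketch below. It is an illustration under stated assumptions, not the patch's code: the 28 bytes of per-page overhead come from the figures in the message (24-byte header plus 4 bytes of special space), the metapage's top array is assumed to hold as many entries as one array revmap page, the special case of the first revmap page living in block 1 is ignored, and all function names are hypothetical.

```python
ITEM_POINTER_SIZE = 6    # bytes per ItemPointerData
BLOCK_NUMBER_SIZE = 4    # bytes per BlockNumber
PAGE_OVERHEAD = 28       # 24-byte page header + 4 bytes of special space
MAX_HEAP_BLOCKS = 2**32


def tids_per_revmap_page(block_size):
    return (block_size - PAGE_OVERHEAD) // ITEM_POINTER_SIZE


def max_ranges(block_size):
    """Page ranges addressable through the three revmap levels (assumed model)."""
    refs = (block_size - PAGE_OVERHEAD) // BLOCK_NUMBER_SIZE
    # top array slots * refs per array page * TIDs per regular page
    return refs * refs * tids_per_revmap_page(block_size)


def check_pages_per_range(block_size, pages_per_range):
    """Bail out if the revmap cannot address the largest possible table."""
    needed = -(-MAX_HEAP_BLOCKS // pages_per_range)  # ceiling division
    if needed > max_ranges(block_size):
        raise ValueError(
            f"pages_per_range={pages_per_range} is too small for "
            f"{block_size}-byte blocks; use a larger value"
        )
    return needed


def min_pages_per_range(block_size):
    """Smallest pages_per_range that check_pages_per_range() accepts."""
    return -(-MAX_HEAP_BLOCKS // max_ranges(block_size))


def first_page_coverage(block_size, pages_per_range):
    """Bytes of heap covered by the first revmap page alone."""
    return tids_per_revmap_page(block_size) * pages_per_range * block_size
```

Under these assumptions first_page_coverage(8192, 128) is 1,426,063,360 bytes (about 1.3 GB) and first_page_coverage(8192, 16) is 178,257,920 bytes (170 MB), matching the figures quoted for the small-table case.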
On 06/23/2014 08:07 PM, Alvaro Herrera wrote: > Heikki Linnakangas wrote: >> With 8k block size, that's just enough to cover the full range of >> 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each >> regular revmap page can store about 8192/6 = 1365 item pointers, >> each array revmap page can store about 8192/4 = 2048 block >> references, and the size of the top array is 8192/4. That's just >> enough; to store the required number of array pages in the top >> array, the array needs to be (2^32/1365)/2048)=1536 elements large. >> >> But with 4k or smaller blocks, it's not enough. > > Yeah. As I said elsewhere, actual useful values are likely to be close > to the read-ahead setting of the underlying disk; by default that'd be > 16 pages (128kB), but I think it's common advice to increase the kernel > setting to improve performance. My gut feeling is that it might well be best to set pages_per_range=1. Even if you do the same amount of I/O, thanks to kernel read-ahead, you might still avoid processing a lot of tuples. But we would need to see some benchmarks to know. > I don't think we need to prevent > minmax indexes with pages_per_range=1, but I don't think we need to > ensure that that setting works with the largest tables, either, because > it doesn't make any sense to set it up like that. > > Also, while there are some recommendations to set up a system with > larger page sizes (32kB), I have never seen any recommendation to set it > lower. It wouldn't make sense to build a system that has very large > tables and use a smaller page size. > > So in other words, yes, you're correct that the mechanism doesn't work > in some cases (small page size and index configured for highest level of > detail), but the conditions are such that I don't think it matters.
> > ISTM the thing to do here is to do the math at index creation time, and > if we find that we don't have enough space in the metapage for all array > revmap pointers we need, bail out and require the index to be created > with a larger pages_per_range setting. Yeah, I agree that would be acceptable. I feel that the below would nevertheless be simpler: >> I wonder if it would be simpler to just always store the revmap >> pages in the beginning of the index, before any other pages. Finding >> the revmap page would then be just as easy as with a separate fork. >> When the table/index is extended so that a new revmap page is >> needed, move the existing page at that block out of the way. Locking >> needs some consideration, but I think it would be feasible and >> simpler than you have now. > > Moving index items around is not easy, because you'd have to adjust the > revmap to rewrite the item pointers. Hmm. Two alternative schemes come to mind: 1. Move each index tuple off the page individually, updating the revmap while you do it, until the page is empty. Updating the revmap for a single index tuple isn't difficult; you have to do it anyway when an index tuple is replaced. (MMTuples don't contain a heap block number ATM, but IMHO they should, see below) 2. Store the new block number of the page that you moved out of the way in the revmap page, and leave the revmap pointers unchanged. The revmap pointers can be updated later, lazily. Both of those seem pretty straightforward. >>> I have followed the suggestion by Amit to overwrite the index tuple when >>> a new heap tuple is inserted, instead of creating a separate index >>> tuple. This saves a lot of index bloat. This required a new entry >>> point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new >>> function PageIndexDeleteNoCompact which is similar in spirit to >>> PageIndexMultiDelete except that item pointers do not change. 
This is >>> necessary because the revmap stores item pointers, and such reference >>> would break if we were to renumber items in index pages. >> >> ISTM that when the old tuple cannot be updated in-place, the new >> index tuple is inserted with mm_doinsert(), but the old tuple is >> never deleted. > > It's deleted by the next vacuum. Ah I see. Vacuum reads the whole index, and builds an in-memory hash table that contains an ItemPointerData for every tuple in the index. Doesn't that require a lot of memory, for a large index? That might be acceptable - you ought to have plenty of RAM if you're pushing around multi-terabyte tables - but it would nevertheless be nice not to have such a hard memory requirement for something as essential as vacuum. In addition to the hash table, remove_deletable_tuples() pallocs an array to hold an ItemPointer for every index tuple about to be removed. A single palloc is limited to 1GB, so that will fail outright if there are more than ~179 million dead index tuples. You're unlikely to hit that in practice, but if you do, you'll never be able to vacuum the index. So that's not very nice. Wouldn't it be simpler to remove the old tuple atomically with inserting the new tuple and updating the revmap? Or at least mark the old tuple as deletable, so that vacuum can just delete it, without building the large hash table to determine that it's deletable. As it is, remove_deletable_tuples looks racy: 1. Vacuum begins, and remove_deletable_tuples performs the first pass over the regular, non-revmap index pages, building the hash table of all items in the index. 2. Another process inserts a new row into the heap, which causes a new minmax tuple to be inserted and the revmap to be updated to point to the new tuple. 3. Vacuum proceeds to scan the revmap. It will find the updated revmap entry that points to the new index tuple. The new index tuple is not found in the hash table, so it throws an error: "reverse map references nonexistant (sic) index tuple".
I think to fix that you can just ignore tuples that are not found in the hash table. (Although as I said above I think it would be simpler to not leave behind any dead index tuples in the first place and get rid of the vacuum scans altogether) Regarding locking, I think it would be good to mention explicitly the order that the pages must be locked if you need to lock multiple pages at the same time, to avoid deadlock. Based on the Locking considerations-section in the README, I believe the order is that you always lock the regular index page first, and then the revmap page. There's no mention of the order of locking two regular or two revmap pages, but I guess you never do that ATM. I'm quite surprised by the use of LockTuple on the index tuples. I think the main reason for needing that is the fact that MMTuple doesn't store the heap (range) block number that the tuple points to: LockTuple is required to ensure that the tuple doesn't go away while a scan is following a pointer from the revmap to it. If the MMTuple contained the BlockNumber, a scan could check that and go back to the revmap if it doesn't match. Alternatively, you could keep the revmap page locked when you follow a pointer to the regular index page. The lack of a block number on index tuples also makes my idea of moving tuples out of the way of extending the revmap much more difficult; there's no way to find the revmap entry pointing to an index tuple, short of scanning the whole revmap. And also on general robustness grounds, and for debugging purposes, it would be nice to have the block number there. - Heikki
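The race described above, and the suggested fix of ignoring revmap entries that are missing from the hash table, can be modelled in a few lines. This is a toy simulation, not the patch's remove_deletable_tuples(): TIDs are plain (block, offset) tuples, the function name find_deletable and its strict flag are invented for the example, and locking is not modelled at all. The last line reproduces the "~179 million" figure, taking the 1GB palloc limit from the message as 2^30 bytes.

```python
def find_deletable(snapshot, revmap_entries, strict):
    """Return the index TIDs that no revmap entry references.

    snapshot:       set of TIDs seen in pass 1 (scan of regular index pages)
    revmap_entries: TIDs read from the revmap in pass 2
    strict:         if True, mimic the behaviour criticized above and fail
                    on a revmap entry that pass 1 did not see
    """
    deletable = set(snapshot)
    for tid in revmap_entries:
        if tid in snapshot:
            deletable.discard(tid)   # still referenced, keep it
        elif strict:
            raise LookupError("reverse map references nonexistent index tuple")
        # Lenient mode: the tuple was inserted after pass 1, so it is
        # live by construction and can simply be skipped.
    return deletable


# Pass 1: vacuum sees two index tuples, one per page range.
snapshot = {(10, 1), (10, 2)}

# Concurrent insert: the summary for the second range no longer fits in
# place, so a new tuple (11, 1) is inserted and the revmap is repointed.
revmap_after_insert = [(10, 1), (11, 1)]

# Pass 2, lenient: the old tuple (10, 2) is correctly found to be dead.
print(find_deletable(snapshot, revmap_after_insert, strict=False))

# Upper bound on dead tuples per vacuum if their 6-byte ItemPointers
# must fit in one 1GB allocation.
max_dead_tuples = 2**30 // 6
```

Running the same call with strict=True raises the error from step 3 of the scenario.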
On Thu, Jun 19, 2014 at 12:32 PM, Greg Stark <stark@mit.edu> wrote: > On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> Implementing something is a good way to demonstrate how it would look like. >> But no, I don't insist on implementing every possible design whenever a new >> feature is proposed. >> >> I liked Greg's sketch of what the opclass support functions would be. It >> doesn't seem significantly more complicated than what's in the patch now. > > As a counter-point to my own point there will be nothing stopping us > in the future from generalizing things. Dealing with catalogs is > mostly book-keeping headaches and careful work. it's something that > might be well-suited for a GSOC or first patch from someone looking to > familiarize themselves with the system architecture. It's hard to > invent a whole new underlying infrastructure at the same time as > dealing with all that book-keeping and it's hard for someone > familiarizing themselves with the system to also have a great new > idea. Having tasks like this that are easy to explain and that mentor > understands well can be easier to manage than tasks where the newcomer > has some radical new idea. Generalizing this in the future would be highly likely to change the on-disk format for existing indexes, which would be a problem for pg_upgrade. I think we will likely be stuck with whatever the initial on-disk format looks like for a very long time, which is why I think we need to try rather hard to get this right the first time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Claudio Freire wrote: > An aggregate to generate a "compressed set" from several values > A function which adds a new value to the "compressed set" and returns > the new "compressed set" > A function which tests if a value is in a "compressed set" > A function which tests if a "compressed set" overlaps another > "compressed set" of equal type > > If you can define different compressed sets, you can use this to > generate both min/max indexes as well as bloom filter indexes. Whether > we'd want to have both is perhaps questionable, but having the ability > to is probably desirable. Here's a new version of this patch, which is more generic than the original versions, and similar to what you describe. The way it works now, each opclass needs to have three support procedures; I've called them getOpers, maybeUpdateValues, and compare. (I realize these names are pretty bad, and will be changing them.) getOpers is used to obtain information about what is stored for that data type; it says how many datum values are stored for a column of that type (two for sortable: min and max), and how many operators it needs set up. Then, the generic code fills in a MinmaxDesc(riptor) and creates an initial DeformedMMTuple (which is a rather ugly name for a minmax tuple held in memory). The maybeUpdateValues amproc can then be called when there's a new heap tuple, which updates the DeformedMMTuple to account for the new tuple (in essence, it's a union of the original values and the new tuple). This can be done repeatedly (when a new index is being created) or only once (when a new heap tuple is inserted into an existing index). There is no need for an "aggregate". This DeformedMMTuple can easily be turned into the on-disk representation; there is no hardcoded assumption on the number of index values stored per heap column, so it is possible to build an opclass that stores a bounding box column for a geometry heap column, for instance. Then we have the "compare" amproc.
This is used during index scans; after extracting an index tuple, it is turned into DeformedMMTuple, and the "compare" amproc for each column is called with the values of scan keys. (Now that I think about this, it seems pretty much what "consistent" is for GiST opclasses). A true return value indicates that the scan key matches the page range boundaries and thus all pages in the range are added to the output TID bitmap. Of course, you can have multicolumn indexes, and (as would be expected) each column can have totally different opclasses; so for instance you could have an integer column and a geometric column in the same index, and it should work fine. In a query that constrained both columns, only those page ranges that satisfied the scan keys for both columns would be returned. I think this level of abstraction is good --- AFAICS it is easy to implement opclasses for datatypes not suitable for btree opclasses such as geometric ones, etc. This answers the concerns of those who wanted to see this support datatypes that don't have the concept of min/max at all. I'm not sure about bloom filters, as I've forgotten how those work. Of course, the implementation of min/max is there: that logic has been abstracted out into what I've called "sortable opfamilies" for now (name suggestions welcome) --- it can be used for any datatype with a btree opclass. I think design-wise it ended up making a lot of sense, because all the opclass-specific assumptions about usage of "min" and "max" values and comparisons using the less-than etc operators are contained in the mmsortable.c file, and the basic minmax.c file only needs to know to call the right opclass-specific procedures. 
The basic code might need some tweaks to ensure that we're not assuming anything about the datatypes of the stored values (as opposed to the datatypes of the indexed columns), but this is a matter of tweaking the MinmaxDesc and the getOpers amprocs; it wouldn't require changing the on-disk representation, and thus is upgrade-compatible. There's a bit of boilerplate code in the amproc routines which would be nice to be able to remove (mainly involving null values), but I think to do that we would need to split those three amprocs into maybe four or five, which is not as nice, so I'm not real sure about doing it. All this being said, I'm sticking to the name "Minmax indexes". There was a poll in pgsql-advocacy http://www.postgresql.org/message-id/53A0B4F8.8080803@agliodbs.com about a new name, but there were no suggestions supported by more than one person. If a brilliant new name comes up, I'm open to changing it. Another thing I noticed is that version 8 of the patch blindly believed the "pages_per_range" declared in catalogs. This meant that if somebody did "alter index foo set pages_per_range=123" the index would immediately break (i.e. return corrupted results when queried). I have fixed this by storing the pages_per_range value used to construct the index in the metapage. Now if you do the ALTER INDEX thing, the new value is only used when the index is recreated by REINDEX. There are still things to go over before this is committable, particularly regarding vacuuming the index, but as far as index creation and scanning it should be good to test. (Vacuuming should work just fine most of the time also, but there are a few wrinkles pointed out by Heikki.) One thing I've disabled for now is the pageinspect code that displays index items. I need to rewrite that. Closing thought: thinking more about it, the bit about returning function OIDs by getOpers when creating a MinmaxDesc seems unnecessary. 
I think we could get by with just returning the number of values stored in the column, and have the operators be part of an opaque struct that's initialized and only touched by the opclass amprocs, not by the generic code. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
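To make the scan-side behaviour described in this message concrete, here is a minimal Python sketch (not the patch's C code) of a "compare"/"consistent"-style check for a sortable opclass, including the multicolumn case where every scan key must be satisfied. The names `MAY_MATCH`, `range_may_match` and `scan` are made up for illustration, and the summaries are plain (min, max) pairs rather than a DeformedMMTuple.

```python
# For each operator strategy, decide whether a page range summarized as
# (lo, hi) could contain some value v for which "v <op> key" is true.
MAY_MATCH = {
    "<":  lambda lo, hi, key: lo < key,
    "<=": lambda lo, hi, key: lo <= key,
    "=":  lambda lo, hi, key: lo <= key <= hi,
    ">=": lambda lo, hi, key: hi >= key,
    ">":  lambda lo, hi, key: hi > key,
}


def range_may_match(summary, scan_keys):
    """summary: one (min, max) pair per indexed column.
    scan_keys: list of (column_no, strategy, value); all of them must hold."""
    return all(MAY_MATCH[op](*summary[col], key) for col, op, key in scan_keys)


def scan(index, pages_per_range, scan_keys):
    """index: list of per-range summaries, in heap order.
    Returns the heap page numbers that would go into the lossy bitmap."""
    pages = []
    for range_no, summary in enumerate(index):
        if range_may_match(summary, scan_keys):
            start = range_no * pages_per_range
            pages.extend(range(start, start + pages_per_range))
    return pages
```

A true result only says the range might contain matches, which is why every page of the range is returned and the heap tuples still have to be rechecked.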
Heikki Linnakangas wrote: > On 06/23/2014 08:07 PM, Alvaro Herrera wrote: > I feel that the below would nevertheless be simpler: > > >>I wonder if it would be simpler to just always store the revmap > >>pages in the beginning of the index, before any other pages. Finding > >>the revmap page would then be just as easy as with a separate fork. > >>When the table/index is extended so that a new revmap page is > >>needed, move the existing page at that block out of the way. Locking > >>needs some consideration, but I think it would be feasible and > >>simpler than you have now. > > > >Moving index items around is not easy, because you'd have to adjust the > >revmap to rewrite the item pointers. > > Hmm. Two alternative schemes come to mind: > > 1. Move each index tuple off the page individually, updating the > revmap while you do it, until the page is empty. Updating the revmap > for a single index tuple isn't difficult; you have to do it anyway > when an index tuple is replaced. (MMTuples don't contain a heap > block number ATM, but IMHO they should, see below) > > 2. Store the new block number of the page that you moved out of the > way in the revmap page, and leave the revmap pointers unchanged. The > revmap pointers can be updated later, lazily. > > Both of those seem pretty straightforward. The trouble I have with moving blocks around to make space, is that it would cause the index to have periodic hiccups to make room for the new revmap pages. One nice property that these indexes are supposed to have is that the effect into insertion times should be pretty minimal. That would cease to be the case if we have to do your proposed block moves. > >>ISTM that when the old tuple cannot be updated in-place, the new > >>index tuple is inserted with mm_doinsert(), but the old tuple is > >>never deleted. > > > >It's deleted by the next vacuum. > > Ah I see. 
Vacuum reads the whole index, and builds an in-memory hash > table that contains an ItemPointerData for every tuple in the index. > Doesn't that require a lot of memory, for a large index? That might > be acceptable - you ought to have plenty of RAM if you're pushing > around multi-terabyte tables - but it would nevertheless be nice to > not have a hard requirement for something as essential as vacuum. I guess if you're expecting that pages_per_range=1 is a common case, yeah it might become an issue eventually. One idea I just had is to have a bit for each index tuple, which is set whenever the revmap no longer points to it. That way, vacuuming is much easier: just scan the index and delete all tuples having that bit set. No need for this hash table stuff. I am still concerned about adding more overhead whenever a page range is modified; I want insertions in the table to continue to be fast. If we're going to dirty the index every time, it might not be so fast anymore. But then maybe I'm worrying about nothing; I will have to measure how much slower it is. > Wouldn't it be simpler to remove the old tuple atomically with > inserting the new tuple and updating the revmap? Or at least mark > the old tuple as deletable, so that vacuum can just delete it, > without building the large hash table to determine that it's > deletable. Yes, it might be simpler, but it'd require dirtying more pages on insertions (and holding more page-level locks, for longer. Not good for concurrent access). > I'm quite surprised by the use of LockTuple on the index tuples. I > think the main reason for needing that is the fact that MMTuple > doesn't store the heap (range) block number that the tuple points > to: LockTuple is required to ensure that the tuple doesn't go away > while a scan is following a pointer from the revmap to it. If the > MMTuple contained the BlockNumber, a scan could check that and go > back to the revmap if it doesn't match.
Alternatively, you could > keep the revmap page locked when you follow a pointer to the regular > index page. There's the intention that these accesses be kept as concurrent as possible; this is why we don't want to block the whole page. Locking individual TIDs is fine in this case (which is not in SELECT FOR UPDATE) because we can only lock a single tuple in any one index scan, so there's no unbounded growth of the lock table. I prefer not to have BlockNumbers in index tuples, because that would make them larger for not much gain. That data would mostly be redundant, and would be necessary only for vacuuming. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
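The "one bit per index tuple" idea floated above can be sketched as follows. This is only a toy model in Python under simplifying assumptions: the index is a flat list, the revmap is a dict, and the names `ToyIndex`, `set_summary` and `dead` are hypothetical, not anything in the patch.

```python
class IndexTuple:
    def __init__(self, summary):
        self.summary = summary
        self.dead = False  # hypothetical "revmap no longer points here" bit


class ToyIndex:
    def __init__(self):
        self.tuples = []   # stand-in for the regular index pages
        self.revmap = {}   # page range number -> position in self.tuples

    def set_summary(self, range_no, summary):
        old = self.revmap.get(range_no)
        if old is not None:
            # This is the extra write the message worries about: the page
            # holding the old tuple is dirtied on every replacement.
            self.tuples[old].dead = True
        self.tuples.append(IndexTuple(summary))
        self.revmap[range_no] = len(self.tuples) - 1

    def vacuum(self):
        # One sequential pass; no hash table of live item pointers needed.
        # Slots are cleared rather than compacted so revmap positions stay valid.
        removed = 0
        for i, tup in enumerate(self.tuples):
            if tup is not None and tup.dead:
                self.tuples[i] = None
                removed += 1
        return removed
```

The trade-off is visible in `set_summary`: vacuum becomes trivial, but each replacement touches the old tuple as well as the new one.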
On 07/09/2014 02:16 PM, Alvaro Herrera wrote: > The way it works now, each opclass needs to have three support > procedures; I've called them getOpers, maybeUpdateValues, and compare. > (I realize these names are pretty bad, and will be changing them.) I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update the values, maybe not." ;-) -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > All this being said, I'm sticking to the name "Minmax indexes". There > was a poll in pgsql-advocacy > http://www.postgresql.org/message-id/53A0B4F8.8080803@agliodbs.com > about a new name, but there were no suggestions supported by more than > one person. If a brilliant new name comes up, I'm open to changing it. How about "summarizing indexes"? That seems reasonably descriptive. -- Peter Geoghegan
Josh Berkus wrote: > On 07/09/2014 02:16 PM, Alvaro Herrera wrote: > > The way it works now, each opclass needs to have three support > > procedures; I've called them getOpers, maybeUpdateValues, and compare. > > (I realize these names are pretty bad, and will be changing them.) > > I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update > the values, maybe not." ;-) :-) Well, that's exactly what happens. If we insert a new tuple into the table, and the existing summarizing tuple (to use Peter's term) already covers it, then we don't need to update the index tuple at all. What this name doesn't say is what values are to be maybe-updated. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
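As a rough illustration of the "maybe" in maybeUpdateValues for a sortable opclass, here is a small Python sketch. The function name and the list-based summary are assumptions made for the example; the real amproc operates on a DeformedMMTuple.

```python
def maybe_update_values(summary, value):
    """summary is a mutable [min, max] list, or [None, None] for a range
    that has no summary yet. Returns True only if the summary changed,
    i.e. only if the index tuple would have to be rewritten."""
    lo, hi = summary
    if lo is None:
        summary[:] = [value, value]
        return True
    changed = False
    if value < lo:
        summary[0] = value
        changed = True
    if value > hi:
        summary[1] = value
        changed = True
    return changed
```

A new heap tuple whose value already falls inside [min, max] returns False, which is the case where the index is not touched at all.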
On Wed, Jul 9, 2014 at 10:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > there is no hardcoded assumption on the number of index > values stored per heap column, so it is possible to build an opclass > that stores a bounding box column for a geometry heap column, for > instance. I think the more Postgresy thing to do is to store one datum per heap column. It's up to the opclass to find or make a composite data type that stores all the necessary state. So you could make a minmax_accum data type like NumericAggState in numeric.c:numeric_accum() or the array of floats in float8_accum. For a bounding box a 2d geometric min/max index could use the "box" data type for example. The way you've done it seems more convenient but there's something to be said for using the same style for different areas. A single bounding box accumulator function would probably suffice for both an aggregate and index opclass for example. But this sounds pretty great. I think it would let me do the bloom filter index I had in mind fairly straightforwardly. The result would be something very similar to a bitmap index. I'm not sure if there's a generic term that includes bitmap indexes or other summary functions like bounding boxes (which min/max is basically -- a 1D bounding box). Thanks a lot for listening and being so open, I think what you describe is a lot more flexible than what you had before and I can see some pretty great things coming out of it (including min/max itself of course). -- greg
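A sketch of the "one datum per heap column" style suggested here, using a single box value as the accumulator for 2D points. This is illustrative Python only; `Box`, `box_add_point` and `box_overlaps` are made-up names and do not refer to the PostgreSQL "box" type's actual functions.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    xlo: float
    ylo: float
    xhi: float
    yhi: float


def box_add_point(box: Optional[Box], x: float, y: float) -> Box:
    """Accumulator step: usable both as an aggregate transition function
    and as the index's 'add a value to the summary' step."""
    if box is None:
        return Box(x, y, x, y)
    return Box(min(box.xlo, x), min(box.ylo, y),
               max(box.xhi, x), max(box.yhi, y))


def box_overlaps(a: Box, b: Box) -> bool:
    """Scan-time test: could a range summarized as 'a' hold points in 'b'?"""
    return (a.xlo <= b.xhi and b.xlo <= a.xhi and
            a.ylo <= b.yhi and b.ylo <= a.yhi)
```

Dropping the y coordinates leaves exactly the min/max case, which is the sense in which min/max is a 1D bounding box.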
On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Another thing I noticed is that version 8 of the patch blindly believed > the "pages_per_range" declared in catalogs. This meant that if somebody > did "alter index foo set pages_per_range=123" the index would > immediately break (i.e. return corrupted results when queried). I have > fixed this by storing the pages_per_range value used to construct the > index in the metapage. Now if you do the ALTER INDEX thing, the new > value is only used when the index is recreated by REINDEX. This seems a lot like parameterizing. So I guess the only thing left is to issue a NOTICE when said alter takes place (I don't see that on the patch, but maybe it's there?)
Claudio Freire wrote: > On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Another thing I noticed is that version 8 of the patch blindly believed > > the "pages_per_range" declared in catalogs. This meant that if somebody > > did "alter index foo set pages_per_range=123" the index would > > immediately break (i.e. return corrupted results when queried). I have > > fixed this by storing the pages_per_range value used to construct the > > index in the metapage. Now if you do the ALTER INDEX thing, the new > > value is only used when the index is recreated by REINDEX. > > This seems a lot like parameterizing. I don't understand what that means -- care to elaborate? > So I guess the only thing left is to issue a NOTICE when said alter > takes place (I don't see that on the patch, but maybe it's there?) That's not in the patch. I don't think we have an appropriate place to emit such a notice. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 07/10/2014 12:20 PM, Alvaro Herrera wrote: >> So I guess the only thing left is to issue a NOTICE when said alter >> > takes place (I don't see that on the patch, but maybe it's there?) > That's not in the patch. I don't think we have an appropriate place to > emit such a notice. What do you mean by "don't have an appropriate place"? The suggestion is that when a user does: ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100 they should get a NOTICE: "NOTICE: changes to pages per range will not take effect until the index is REINDEXed" otherwise, we're going to get a lot of "I Altered the pages per range, but performance didn't change" emails. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 07/10/2014 12:20 PM, Alvaro Herrera wrote: >>> So I guess the only thing left is to issue a NOTICE when said alter >>> > takes place (I don't see that on the patch, but maybe it's there?) >> That's not in the patch. I don't think we have an appropriate place to >> emit such a notice. > > What do you mean by "don't have an appropriate place"? > > The suggestion is that when a user does: > > ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100 > > they should get a NOTICE: > > "NOTICE: changes to pages per range will not take effect until the index > is REINDEXed" > > otherwise, we're going to get a lot of "I Altered the pages per range, > but performance didn't change" emails. > How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ? we don't get a notice for these cases either -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
Josh Berkus wrote: > On 07/10/2014 12:20 PM, Alvaro Herrera wrote: > >> So I guess the only thing left is to issue a NOTICE when said alter > >> > takes place (I don't see that on the patch, but maybe it's there?) > > That's not in the patch. I don't think we have an appropriate place to > > emit such a notice. > > What do you mean by "don't have an appropriate place"? What I think should happen is that if the value is changed, the index should be rebuilt right there. But there is no way to have this occur from the generic tablecmds.c code. Maybe we should extend the AM interface so that AMs are notified of changes and can take action. Inserting AM-specific code into tablecmds.c seems pretty wrong to me -- existing stuff for WITH CHECK OPTION views notwithstanding. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 07/10/2014 02:30 PM, Jaime Casanova wrote: > How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or > from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ? > > we don't get a notice for these cases either Good idea. We should also emit notices for those. Well, maybe not for fillfactor. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Thu, Jul 10, 2014 at 2:30 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:
> On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 07/10/2014 12:20 PM, Alvaro Herrera wrote:
>>>> So I guess the only thing left is to issue a NOTICE when said alter
>>>> takes place (I don't see that on the patch, but maybe it's there?)
>>> That's not in the patch. I don't think we have an appropriate place to
>>> emit such a notice.
>>
>> What do you mean by "don't have an appropriate place"?
>>
>> The suggestion is that when a user does:
>>
>> ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100
>>
>> they should get a NOTICE:
>>
>> "NOTICE: changes to pages per range will not take effect until the index
>> is REINDEXed"
>>
>> otherwise, we're going to get a lot of "I Altered the pages per range,
>> but performance didn't change" emails.
>
> How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or
> from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ?
>
> we don't get a notice for these cases either
I think those are different. They don't rewrite existing data in the table, but they are applied to new (and updated) data. My understanding is that changing PAGES_PER_RANGE will have no effect on future data until a re-index is done, even if the entire table eventually turns over.
Cheers,
Jeff
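The behaviour discussed in this subthread (the scan trusts the value stored in the metapage at build time, not the catalog value) can be modelled with a toy index. This is a Python sketch under simplifying assumptions: every heap page is a non-empty list of integers, and the class and attribute names are invented for the example.

```python
class ToyMinmaxIndex:
    def __init__(self, heap_pages, pages_per_range):
        self.reloption_pages_per_range = pages_per_range  # "catalog" value
        self.build(heap_pages)

    def build(self, heap_pages):
        """Plays the role of CREATE INDEX / REINDEX."""
        # The value actually used is frozen in the "metapage" here.
        self.meta_pages_per_range = self.reloption_pages_per_range
        self.nblocks = len(heap_pages)
        n = self.meta_pages_per_range
        self.summaries = []
        for start in range(0, self.nblocks, n):
            values = [v for page in heap_pages[start:start + n] for v in page]
            self.summaries.append((min(values), max(values)))

    def alter_pages_per_range(self, new_value):
        # Only the catalog value changes; existing summaries are untouched.
        self.reloption_pages_per_range = new_value

    def pages_for(self, key):
        n = self.meta_pages_per_range  # deliberately not the reloption
        pages = []
        for r, (lo, hi) in enumerate(self.summaries):
            if lo <= key <= hi:
                pages.extend(range(r * n, min(r * n + n, self.nblocks)))
        return pages
```

If `pages_for` used the reloption instead, the range-number-to-block arithmetic would no longer match how the summaries were built, which is the breakage described for version 8.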
On Thu, Jul 10, 2014 at 10:29 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > What I think should happen is that if the value is changed, the index > should be rebuilt right there. I disagree. It would be a non-orthogonal interface if ALTER TABLE sometimes causes the index to be rebuilt and sometimes just makes a configuration change. I already see a lot of user confusion when some ALTER TABLE commands rewrite the table and some are quick metadata changes. Especially in this case where the type of configuration being changed is just an internal storage parameter and the user visible shape of the index is unchanged it would be weird to rebuild the index. IMHO the "right" thing to do is just to say this parameter is read-only and have the AM throw an error when the user changes it. But even that would require an AM callback for the AM to even know about the change. -- greg
On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Claudio Freire wrote: > >> An aggregate to generate a "compressed set" from several values >> A function which adds a new value to the "compressed set" and returns >> the new "compressed set" >> A function which tests if a value is in a "compressed set" >> A function which tests if a "compressed set" overlaps another >> "compressed set" of equal type >> >> If you can define different compressed sets, you can use this to >> generate both min/max indexes as well as bloom filter indexes. Whether >> we'd want to have both is perhaps questionable, but having the ability >> to is probably desirable. > > Here's a new version of this patch, which is more generic than the original > versions, and similar to what you describe. I've not read the discussion so far at all, but I found a problem when I played with this patch. Sorry if this has already been discussed.

=# create table test as select num from generate_series(1,10) num;
SELECT 10
=# create index testidx on test using minmax (num);
CREATE INDEX
=# alter table test alter column num type text;
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.

Regards, -- Fujii Masao
On 9 July 2014 23:54, Peter Geoghegan <pg@heroku.com> wrote: > On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> All this being said, I'm sticking to the name "Minmax indexes". There >> was a poll in pgsql-advocacy >> http://www.postgresql.org/message-id/53A0B4F8.8080803@agliodbs.com >> about a new name, but there were no suggestions supported by more than >> one person. If a brilliant new name comes up, I'm open to changing it. > > How about "summarizing indexes"? That seems reasonably descriptive. -1 for another name change. That boat sailed some months back. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 10 July 2014 00:13, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Josh Berkus wrote: >> On 07/09/2014 02:16 PM, Alvaro Herrera wrote: >> > The way it works now, each opclass needs to have three support >> > procedures; I've called them getOpers, maybeUpdateValues, and compare. >> > (I realize these names are pretty bad, and will be changing them.) >> >> I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update >> the values, maybe not." ;-) > > :-) Well, that's exactly what happens. If we insert a new tuple into > the table, and the existing summarizing tuple (to use Peter's term) > already covers it, then we don't need to update the index tuple at all. > What this name doesn't say is what values are to be maybe-updated. There are lots of functions that maybe-do-things, that's just modular programming. Not sure we need to prefix things with maybe to explain that, otherwise we'd have maybeXXX everywhere. More descriptive name would be MaintainIndexBounds() or similar. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Fujii Masao wrote: > On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Here's a new version of this patch, which is more generic than the original > > versions, and similar to what you describe. > > I've not read the discussion so far at all, but I found the problem > when I played with this patch. Sorry if this has already been discussed. > > =# create table test as select num from generate_series(1,10) num; > SELECT 10 > =# create index testidx on test using minmax (num); > CREATE INDEX > =# alter table test alter column num type text; > ERROR: could not determine which collation to use for string comparison > HINT: Use the COLLATE clause to set the collation explicitly. Ah, yes, I need to pass down collation OIDs to comparison functions. That's marked as XXX in various places in the code. Sorry I forgot to mention that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 10, 2014 at 4:20 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Claudio Freire wrote: >> On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> > Another thing I noticed is that version 8 of the patch blindly believed >> > the "pages_per_range" declared in catalogs. This meant that if somebody >> > did "alter index foo set pages_per_range=123" the index would >> > immediately break (i.e. return corrupted results when queried). I have >> > fixed this by storing the pages_per_range value used to construct the >> > index in the metapage. Now if you do the ALTER INDEX thing, the new >> > value is only used when the index is recreated by REINDEX. >> >> This seems a lot like parameterizing. > > I don't understand what that means -- care to elaborate? We've been talking about bloom filters, and how their shape differs according to the parameters of the bloom filter (number of hashes, hash type, etc). But after seeing this case of pages_per_range, I noticed it's an effective-enough mechanism. Like: CREATE INDEX ix_blah ON some_table USING bloom (somecol) WITH (BLOOM_HASHES=15, BLOOM_BUCKETS=1024, PAGES_PER_RANGE=64); Marking as read-only is ok, or emitting a NOTICE so that if anyone changes those parameters that change the shape of the index, they know it needs a rebuild would be OK too. Both mechanisms work for me.
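To show why parameters such as BLOOM_HASHES and BLOOM_BUCKETS would fix the "shape" of a summary in the same way pages_per_range does, here is a toy bloom filter in Python. It is a sketch only: the class `ToyBloom`, the use of SHA-256 and the way positions are derived are all assumptions for the example, not a proposal for the actual opclass.

```python
import hashlib


class ToyBloom:
    def __init__(self, n_hashes, n_buckets):
        self.n_hashes = n_hashes
        self.n_buckets = n_buckets
        self.bits = 0  # the bitmap, kept in a single Python int

    def _positions(self, value):
        # Every bit position depends on both parameters, so a filter built
        # with one setting cannot be probed correctly with another.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_buckets

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def may_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))
```

For example, `ToyBloom(15, 1024)` mirrors the WITH clause in the message above; one such filter would be stored per page range.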
On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Marking as read-only is ok, or emitting a NOTICE so that if anyone > changes those parameters that change the shape of the index, they know > it needs a rebuild would be OK too. Both mechanisms work for me. We don't actually have any of these mechanisms. They wouldn't be bad things to have but I don't think we should gate adding new types of indexes on adding them. In particular, the index could just hard code a value for these parameters and having them be parameterized is clearly better even if that doesn't produce all the warnings or rebuild things automatically or whatever. -- greg
On Fri, Jul 11, 2014 at 3:47 PM, Greg Stark <stark@mit.edu> wrote: > On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> Marking as read-only is ok, or emitting a NOTICE so that if anyone >> changes those parameters that change the shape of the index, they know >> it needs a rebuild would be OK too. Both mechanisms work for me. > > We don't actually have any of these mechanisms. They wouldn't be bad > things to have but I don't think we should gate adding new types of > indexes on adding them. In particular, the index could just hard code > a value for these parameters and having them be parameterized is > clearly better even if that doesn't produce all the warnings or > rebuild things automatically or whatever. No, I agree, it's just a nice to have. But at least the docs should mention it.
On Wed, Jul 9, 2014 at 5:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > The way it works now, each opclass needs to have three support > procedures; I've called them getOpers, maybeUpdateValues, and compare. > (I realize these names are pretty bad, and will be changing them.) > getOpers is used to obtain information about what is stored for that > data type; it says how many datum values are stored for a column of that > type (two for sortable: min and max), and how many operators it needs > setup. Then, the generic code fills in a MinmaxDesc(riptor) and creates > an initial DeformedMMTuple (which is a rather ugly name for a minmax > tuple held in memory). The maybeUpdateValues amproc can then be called > when there's a new heap tuple, which updates the DeformedMMTuple to > account for the new tuple (in essence, it's a union of the original > values and the new tuple). This can be done repeatedly (when a new > index is being created) or only once (when a new heap tuple is inserted > into an existing index). There is no need for an "aggregate". > > This DeformedMMTuple can easily be turned into the on-disk > representation; there is no hardcoded assumption on the number of index > values stored per heap column, so it is possible to build an opclass > that stores a bounding box column for a geometry heap column, for > instance. > > Then we have the "compare" amproc. This is used during index scans; > after extracting an index tuple, it is turned into DeformedMMTuple, and > the "compare" amproc for each column is called with the values of scan > keys. (Now that I think about this, it seems pretty much what > "consistent" is for GiST opclasses). A true return value indicates that > the scan key matches the page range boundaries and thus all pages in the > range are added to the output TID bitmap. This sounds really great. I agree that it needs some renaming. 
I think renaming what you are calling "compare" to "consistent" would be an excellent idea, to match GiST. "maybeUpdateValues" sounds like it does the equivalent of GIST's "compress" on the new value followed by a "union" with the existing summary item. I don't think it's necessary to separate those out, though. You could perhaps call it something like "add_item". Also, FWIW, I liked Peter's idea of calling these "summarizing indexes" or perhaps "summary" would be a bit shorter and mean the same thing. "minmax" wouldn't be the end of the world, but since you've gone to the trouble of making this more generic I think giving it a more generic name would be a very good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 07/10/2014 12:41 AM, Alvaro Herrera wrote: > Heikki Linnakangas wrote: >> On 06/23/2014 08:07 PM, Alvaro Herrera wrote: > >> I feel that the below would nevertheless be simpler: >> >>>> I wonder if it would be simpler to just always store the revmap >>>> pages in the beginning of the index, before any other pages. Finding >>>> the revmap page would then be just as easy as with a separate fork. >>>> When the table/index is extended so that a new revmap page is >>>> needed, move the existing page at that block out of the way. Locking >>>> needs some consideration, but I think it would be feasible and >>>> simpler than you have now. >>> >>> Moving index items around is not easy, because you'd have to adjust the >>> revmap to rewrite the item pointers. >> >> Hmm. Two alternative schemes come to mind: >> >> 1. Move each index tuple off the page individually, updating the >> revmap while you do it, until the page is empty. Updating the revmap >> for a single index tuple isn't difficult; you have to do it anyway >> when an index tuple is replaced. (MMTuples don't contain a heap >> block number ATM, but IMHO they should, see below) >> >> 2. Store the new block number of the page that you moved out of the >> way in the revmap page, and leave the revmap pointers unchanged. The >> revmap pointers can be updated later, lazily. >> >> Both of those seem pretty straightforward. > > The trouble I have with moving blocks around to make space, is that it > would cause the index to have periodic hiccups to make room for the new > revmap pages. One nice property that these indexes are supposed to have > is that the effect into insertion times should be pretty minimal. That > would cease to be the case if we have to do your proposed block moves. Approach 2 above is fairly quick, quick enough that no-one would notice the "hiccup". Moving the tuples individually (approach 1) would be slower. 
>>>> ISTM that when the old tuple cannot be updated in-place, the new >>>> index tuple is inserted with mm_doinsert(), but the old tuple is >>>> never deleted. >>> >>> It's deleted by the next vacuum. >> >> Ah I see. Vacuum reads the whole index, and builds an in-memory hash >> table that contains an ItemPointerData for every tuple in the index. >> Doesn't that require a lot of memory, for a large index? That might >> be acceptable - you ought to have plenty of RAM if you're pushing >> around multi-terabyte tables - but it would nevertheless be nice to >> not have a hard requirement for something as essential as vacuum. > > I guess if you're expecting that pages_per_range=1 is a common case, > yeah it might become an issue eventually. Not sure, but I find it easier to think of the patch that way. In any case, it would be nice to avoid the problem, even if it's not common. > One idea I just had is to > have a bit for each index tuple, which is set whenever the revmap no > longer points to it. That way, vacuuming is much easier: just scan the > index and delete all tuples having that bit set. The bit needs to be set atomically with the insertion of the new tuple, so why not just remove the old tuple right away? >> Wouldn't it be simpler to remove the old tuple atomically with >> inserting the new tuple and updating the revmap? Or at least mark >> the old tuple as deletable, so that vacuum can just delete it, >> without building the large hash table to determine that it's >> deletable. > > Yes, it might be simpler, but it'd require dirtying more pages on > insertions (and holding more page-level locks, for longer. Not good for > concurrent access). I wouldn't worry much about the performance and concurrency of this operation. Remember that the majority of updates are expected to not have to update the index, otherwise the minmax index will degenerate quickly and performance will suck anyway. 
And even when updating the index is needed, in most cases the new tuple fits on the same page, after removing the old one. So the case where you have to insert a new index tuple, remove old one (or mark it dead), and update the revmap to point to the new tuple, is rare. >> I'm quite surprised by the use of LockTuple on the index tuples. I >> think the main reason for needing that is the fact that MMTuple >> doesn't store the heap (range) block number that the tuple points >> to: LockTuple is required to ensure that the tuple doesn't go away >> while a scan is following a pointer from the revmap to it. If the >> MMTuple contained the BlockNumber, a scan could check that and go >> back to the revmap if it doesn't match. Alternatively, you could >> keep the revmap page locked when you follow a pointer to the regular >> index page. > > There's the intention that these accesses be kept as concurrent as > possible; this is why we don't want to block the whole page. Locking > individual TIDs is fine in this case (which is not in SELECT FOR UPDATE) > because we can only lock a single tuple in any one index scan, so > there's no unbounded growth of the lock table. > > I prefer not to have BlockNumbers in index tuples, because that would > make them larger for not much gain. That data would mostly be > redundant, and would be necessary only for vacuuming. Don't underestimate the value of easier debugging. I wouldn't worry much about shaving four bytes from the tuple, these indexes are tiny in any case. Keep it simple at first, and optimize later if necessary. In fact, I'd suggest just using normal IndexTuple instead of the custom MMTuple struct, store the block number in t_tid and leave the offset number field of that unused. That wastes 2 more bytes per tuple, but that's insignificant too. I feel that it probably would be worth it just to keep things simple, and you'd e.g. be able to use index_deform_tuple() as is. - Heikki
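The "store the heap block number in the index tuple and re-check it" scheme argued for above might look roughly like this. It is a single-threaded Python sketch with invented names (`lookup_summary`, dict-based revmap and tuple storage), so it cannot show real concurrency; it only shows the validation and the fallback to the revmap.

```python
def lookup_summary(revmap, index_tuples, range_start, max_retries=10):
    """revmap: dict mapping a range's first heap block -> slot id.
    index_tuples: dict mapping slot id -> (heap_block, summary).
    Returns the summary, or None if the range is not summarized."""
    for _ in range(max_retries):
        slot = revmap.get(range_start)
        if slot is None:
            return None
        tup = index_tuples.get(slot)
        # The block number stored in the tuple lets the scan detect that the
        # slot was emptied or reused after the revmap was read, without
        # holding a tuple lock in between.
        if tup is not None and tup[0] == range_start:
            return tup[1]
        # Mismatch: go back to the revmap and follow the pointer again.
    raise RuntimeError("revmap pointer never became consistent")
```

The retry limit is an arbitrary choice for the sketch; a real implementation would need to reason about when a mismatch can legitimately persist.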
Thanks for all the feedback on version 9. Here's version 13. (The intermediate versions are just tags in my private tree which I created each time I rebased. Please bear with me here.) I have chosen to keep the name "minmax", even if the opclasses now let one implement completely different things on top of it such as geometry bounding boxes and bloom filters (aka bitmap indexes). I don't see a need for a rename: essentially, in PR we can just say "we have these neat minmax indexes that other databases also have, but instead of just being used for integer data, they can also be used for geometry, GIS and bitmap indexes, so as always we're more powerful than everyone else when implementing new database features". This new version includes some changes per feedback. Most notably, the opclass definition is different now: instead of relying on the "sortable" opclass implementation extracting the oprcode for each operator strategy (i.e. the functions that underlie < <= >= >), I chose to have catalog entries in pg_amproc for the underlying support functions. The new definition makes a lot of sense to me now, after thinking long about this stuff and carefully reading the "Catalog Entries for Indexes" chapter in docs. The way it works now is that there are five pg_amop entries in an opclass, just like previously (corresponding to the underlying < <= = >= > operators). This lets the optimizer choose the index when a query uses those operators. There are also seven pg_amproc entries. The first three are common to all minmax opclasses: "opcinfo" (version 9 called it "getopers"), "consistent" (v9 name "compare") and "add_value" (v9 name "maybeUpdateValues", not a loved name evidently). A minmax opclass on top of a sortable datatype has four additional support functions: one for each function underlying the < <= >= > operators.
Other opclasses would define their own support functions here, which would correspond to functions used to implement the "consistent" and "compare" functions internally. I don't claim this is 100% correct, but in particular I think it's now possible to implement cross-datatype comparisons, so that a minmax index defined on an int8 column works when the query uses an int4 operator, for example. (The current patch doesn't actually add such catalog entries, though. I think some minor code changes are required for this to actually work. However with the previous opclass definition it would have been outright impossible.) I fixed the bug reported by Masao-kun that collatable datatypes weren't cleanly supported. Collation OIDs are passed down now, although I don't claim that it is bulletproof. This could use some more testing. I haven't yet updated the revmap definition per Heikki's review. I am not sure I want to do that right away. I think we could live with what we have now, and see about changing this later on in the 9.5 cycle if we think a different definition is better. I think what we have is pretty solid even if there are some theoretical holes. As a very quick test, I created a 10 million tuples table with an int4 column on my laptop. The table is ~346 MB. Creating a btree index on it takes 8 seconds. A minmax index takes 1.6 seconds. The btree index is 214 MB. The minmax index, with pages_per_range=1 is 1 MB. With pages_per_range=16 (default) it is 48kB. Very unscientific results follow. 
This is the btree doing an index-only scan:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using bti2 on t  (cost=0.43..1692.75 rows=54416 width=4) (actual time=0.106..23.329 rows=54518 loops=1)
   Index Cond: ((a > 991243) AND (a < 1045762))
   Heap Fetches: 0
   Buffers: shared hit=1 read=152
 Planning time: 0.695 ms
 Execution time: 31.565 ms
(6 rows)

Time: 33.662 ms

Turn off index-only scan, do a regular index scan:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Scan using bti2 on t  (cost=0.43..1932.75 rows=54416 width=4) (actual time=0.066..31.027 rows=54518 loops=1)
   Index Cond: ((a > 991243) AND (a < 1045762))
   Buffers: shared hit=394
 Planning time: 0.250 ms
 Execution time: 39.218 ms
(5 rows)

Time: 40.385 ms

Use the 16-pages-per-range minmax index:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=16.60..47402.01 rows=54416 width=4) (actual time=4.266..43.948 rows=54518 loops=1)
   Recheck Cond: ((a > 991243) AND (a < 1045762))
   Rows Removed by Index Recheck: 32266
   Heap Blocks: lossy=384
   Buffers: shared hit=244 read=142
   ->  Bitmap Index Scan on ti2  (cost=0.00..3.00 rows=54416 width=0) (actual time=1.061..1.061 rows=3840 loops=1)
         Index Cond: ((a > 991243) AND (a < 1045762))
         Buffers: shared hit=2
 Planning time: 0.215 ms
 Execution time: 51.820 ms
(10 rows)

This is the 1-page-per-range minmax index:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=157.60..47543.01 rows=54416 width=4) (actual time=82.479..98.642 rows=54518 loops=1)
   Recheck Cond: ((a > 991243) AND (a < 1045762))
   Rows Removed by Index Recheck: 174
   Heap Blocks: lossy=242
   Buffers: shared hit=385
   ->  Bitmap Index Scan on ti  (cost=0.00..144.00 rows=54416 width=0) (actual time=82.448..82.448 rows=2420 loops=1)
         Index Cond: ((a > 991243) AND (a < 1045762))
         Buffers: shared hit=143
 Planning time: 0.280 ms
 Execution time: 103.542 ms
(10 rows)

Time: 104.952 ms

This is a seqscan. Notice the high number of buffer accesses:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..194248.00 rows=54416 width=4) (actual time=161.338..1201.535 rows=54518 loops=1)
   Filter: ((a > 991243) AND (a < 1045762))
   Rows Removed by Filter: 9945482
   Buffers: shared hit=10672 read=33576
 Planning time: 0.189 ms
 Execution time: 1204.501 ms
(6 rows)

Time: 1205.304 ms

Of course, this isn't nearly a worst-case scenario for minmax, as the data is perfectly correlated. The pages_per_range=16 index benefits particularly from that.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
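To make the role of the "add_value" and "consistent" support functions more concrete, here is a minimal Python sketch of a min/max summary kept per page range. It is only a model, not the patch's C code: the names RangeSummary, build and matching_ranges, and the idea of a fixed number of tuples per range, are assumptions made for the example. The query shape mirrors the `a > X and a < Y` test above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RangeSummary:
    """Smallest and largest value seen in one page range (hypothetical model)."""
    lo: Optional[int] = None
    hi: Optional[int] = None


def add_value(summary: RangeSummary, value: int) -> bool:
    """Widen the summary to cover value; return True if it changed."""
    changed = False
    if summary.lo is None or value < summary.lo:
        summary.lo = value
        changed = True
    if summary.hi is None or value > summary.hi:
        summary.hi = value
        changed = True
    return changed


def consistent(summary: RangeSummary, lower: int, upper: int) -> bool:
    """Could any value v in this range satisfy lower < v < upper?"""
    if summary.lo is None:
        return True  # nothing summarized yet: the range has to be scanned
    return summary.hi > lower and summary.lo < upper


def build(values: List[int], tuples_per_range: int) -> List[RangeSummary]:
    """Summarize consecutive chunks of values, one chunk per page range."""
    summaries = []
    for start in range(0, len(values), tuples_per_range):
        summary = RangeSummary()
        for value in values[start:start + tuples_per_range]:
            add_value(summary, value)
        summaries.append(summary)
    return summaries


def matching_ranges(summaries: List[RangeSummary], lower: int, upper: int) -> List[int]:
    """Return the numbers of the ranges a bitmap scan would have to visit."""
    return [i for i, s in enumerate(summaries) if consistent(s, lower, upper)]
```

With perfectly correlated data, only the ranges overlapping the query interval come back, which is why the plans above touch a few hundred lossy heap blocks instead of the whole table; every returned range still needs a recheck of its tuples.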
On 08/05/2014 04:41 PM, Alvaro Herrera wrote:
> I have chosen to keep the name "minmax", even if the opclasses now let
> one implement completely different things on top of it such as geometry
> bounding boxes and bloom filters (aka bitmap indexes). I don't see a
> need for a rename: essentially, in PR we can just say "we have these
> neat minmax indexes that other databases also have, but instead of just
> being used for integer data, they can also be used for geometry, GIS and
> bitmap indexes, so as always we're more powerful than everyone else when
> implementing new database features".

Plus we haven't come up with a better name ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
FWIW I think I haven't responded appropriately to the points raised by Heikki. Basically, as I see it there are three main items:

1. the revmap physical-to-logical mapping is too complex; let's use something else.

We had revmap originally in a separate fork. The current approach grew out of the necessity of putting it in the main fork while ensuring that fast access to individual pages is possible. There are of course many ways to skin this cat; Heikki's proposal is to have it always occupy the first few physical pages, rather than require a logical-to-physical mapping table. To implement this he proposes to move other pages out of the way as the index grows. I don't really have much love for this idea. We can change how this is implemented later in the cycle, if we find that a different approach is better than my proposal. I don't want to spend endless time meddling with this (and I definitely don't want to have this delay the eventual commit of the patch.)

2. vacuuming is not optimal

Right now, to eliminate garbage index tuples we need to first scan the revmap to figure out which tuples are unreferenced. There is a concern that if there's an excess of dead tuples, the index becomes unvacuumable because palloc() fails due to request size. This is largely theoretical because in order for this to happen there need to be several million dead index tuples. As a minimal fix to alleviate this problem without requiring a complete rework of vacuuming, we can cap that palloc request to maintenance_work_mem and remove dead tuples in a loop instead of trying to remove all of them in a single pass.

Another thing proposed was to store range numbers (or just heap page numbers) within each index tuple. I felt that this would add more bloat unnecessarily. However, there is some padding space in index tuple that maybe we can use to store range numbers. I will think some more about how we can use this to simplify vacuuming.

3. avoid MMTuple as it is just unnecessary extra complexity.

The main thing that MMTuple adds is not the fact that we save 2 bytes by storing BlockNumber as is instead of within a TID field. Instead, it's that we can construct and deconstruct using our own design, which means we can use however many Datum entries we want and however many "null" flags. In normal heap and index tuples, there are always the same number of datum/nulls. In minmax, the number of nulls is twice the number of indexed columns; the number of datum values is determined by how many datum values are stored per opclass ("sortable" opclasses store 2 columns, but geometry would store only one). If we were to use regular IndexTuples, we would lose that .. and I have no idea how it would work.

In other words, MMTuples look fine to me.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
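The "minimal fix" for vacuuming in item 2 (cap the allocation and remove dead tuples in a loop) can be pictured with the following Python sketch. It is an illustration under assumptions, not the patch's code: tuples are modelled as integer ids, `referenced` stands in for what a revmap scan would report, and `max_batch` plays the role of the cap derived from maintenance_work_mem.

```python
from typing import Set


def remove_dead_in_batches(index_tuples: Set[int], referenced: Set[int], max_batch: int) -> int:
    """Delete unreferenced tuples, remembering at most max_batch of them at a time.

    Returns the number of removal passes that were needed.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    passes = 0
    while True:
        dead = []
        for tid in index_tuples:
            if tid not in referenced:
                dead.append(tid)
                if len(dead) == max_batch:
                    break  # the bounded array is full; clean up before continuing
        if not dead:
            return passes
        index_tuples.difference_update(dead)
        passes += 1
```

The trade-off is more passes over the index in exchange for a bounded memory request; with a generous cap the loop degenerates into the single-pass behaviour.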
On Tue, Aug 5, 2014 at 7:55 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/05/2014 04:41 PM, Alvaro Herrera wrote:
>> I have chosen to keep the name "minmax", even if the opclasses now let
>> one implement completely different things on top of it such as geometry
>> bounding boxes and bloom filters (aka bitmap indexes). I don't see a
>> need for a rename: essentially, in PR we can just say "we have these
>> neat minmax indexes that other databases also have, but instead of just
>> being used for integer data, they can also be used for geometry, GIS and
>> bitmap indexes, so as always we're more powerful than everyone else when
>> implementing new database features".
>
> Plus we haven't come up with a better name ...

Several good suggestions have been made, like "summarizing" or "summary" indexes and "compressed range" indexes. I still really dislike the present name - you might think this is a type of index that has something to do with optimizing "min" and "max", but what it really is is a kind of small index for a big table. The current name couldn't make that less clear.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> "Summary" seems good. If I get enough votes I can change it to that.
>
> CREATE INDEX foo ON t USING summary (cols)
>
> "Summarizing" seems weird on that command. Not sure about "compressed
> range", as you would have to use an abbreviation or run the words
> together.

Summarizing index sounds better to my ears, but both ideas based on "summary" are quite succinct and to-the-point descriptions of what's happening, so I vote for those.
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"?
> CREATE INDEX foo ON t USING comprange (cols)
> CREATE INDEX foo ON t USING compressedrng (cols) -- ugh
> -- or use an identifier with whitespace:
> CREATE INDEX foo ON t USING "compressed range" (cols)

The word you'd use there is not necessarily the one you use on the framework, since the framework applies to many such techniques, but the index type there is one specific one.

The create command can still use minmax, or rangemap if you prefer that, while the framework's code uses summary or summarizing.
On Wed, Aug 6, 2014 at 01:31:14PM -0300, Claudio Freire wrote:
> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"?
> > CREATE INDEX foo ON t USING comprange (cols)
> > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh
> > -- or use an identifier with whitespace:
> > CREATE INDEX foo ON t USING "compressed range" (cols)
>
> The word you'd use there is not necessarily the one you use on the
> framework, since the framework applies to many such techniques, but
> the index type there is one specific one.

"Block filter" indexes?

> The create command can still use minmax, or rangemap if you prefer
> that, while the framework's code uses summary or summarizing.

"Summary" sounds like materialized views to me.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +
On Wed, Aug 6, 2014 at 1:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Aug 6, 2014 at 01:31:14PM -0300, Claudio Freire wrote:
>> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"?
>> > CREATE INDEX foo ON t USING comprange (cols)
>> > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh
>> > -- or use an identifier with whitespace:
>> > CREATE INDEX foo ON t USING "compressed range" (cols)
>>
>> The word you'd use there is not necessarily the one you use on the
>> framework, since the framework applies to many such techniques, but
>> the index type there is one specific one.
>
> "Block filter" indexes?

Nice one
On Wed, Aug 6, 2014 at 1:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Claudio Freire wrote:
>> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"?
>> > CREATE INDEX foo ON t USING comprange (cols)
>> > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh
>> > -- or use an identifier with whitespace:
>> > CREATE INDEX foo ON t USING "compressed range" (cols)
>>
>> The word you'd use there is not necessarily the one you use on the
>> framework, since the framework applies to many such techniques, but
>> the index type there is one specific one.
>>
>> The create command can still use minmax, or rangemap if you prefer
>> that, while the framework's code uses summary or summarizing.
>
> I think you're confusing the AM name with the opclass name. The name
> you specify in that part of the command is the access method name. You
> can specify the opclass together with each column, like so:
>
> CREATE INDEX foo ON t USING blockfilter
> (order_date date_minmax_ops, geometry gis_bbox_ops);

Oh, uh... no, I'm not confusing them, but now I just realized how one would implement other classes of block filtering indexes, and yeah... you do it through the opclasses.

I'm sticking to bloom filters:

CREATE INDEX foo ON t USING blockfilter (order_date date_minmax_ops, path character_bloom_ops);

Cool. Very cool.

So, I like blockfilter a lot. I change my vote to blockfilter ;)
2014-08-06 Claudio Freire <klaussfreire@gmail.com>:

> So, I like blockfilter a lot. I change my vote to blockfilter ;)

+1 for blockfilter, because it stresses the fact that the "physical" arrangement of rows in blocks matters for this index.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>
>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>
> +1 for blockfilter, because it stresses the fact that the "physical"
> arrangement of rows in blocks matters for this index.

I don't like that quite as well as summary, but I'd prefer either to the current naming.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
> <nicolas.barbier@gmail.com> wrote:
>> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>>
>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>
>> +1 for blockfilter, because it stresses the fact that the "physical"
>> arrangement of rows in blocks matters for this index.
>
> I don't like that quite as well as summary, but I'd prefer either to
> the current naming.

Yes, "summary index" isn't good. I'm not sure where the block or the filter part comes in though, so -1 to "block filter", not least because it doesn't have a good abbreviation (bfin??).

A better description would be "block range index" since we are indexing a range of blocks (not just one block). Perhaps a better one would be simply "range index", which we could abbreviate to RIN or BRIN.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Aug 7, 2014 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
>> <nicolas.barbier@gmail.com> wrote:
>>> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>>>
>>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>>
>>> +1 for blockfilter, because it stresses the fact that the "physical"
>>> arrangement of rows in blocks matters for this index.
>>
>> I don't like that quite as well as summary, but I'd prefer either to
>> the current naming.
>
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).

Block filter would refer to the index property that selects blocks, not tuples, and it does so through a "filter function" (for min-max, it's a range check, but for other opclasses it could be anything).
Simon Riggs wrote:
> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
> > <nicolas.barbier@gmail.com> wrote:
> >> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
> >>
> >>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
> >>
> >> +1 for blockfilter, because it stresses the fact that the "physical"
> >> arrangement of rows in blocks matters for this index.
> >
> > I don't like that quite as well as summary, but I'd prefer either to
> > the current naming.
>
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).

I was thinking just "blockfilter" (I did show a sample command). Claudio explained the name downthread; personally, of all the options suggested thus far, it's the one I like the most (including minmax).

At this point, the naming issue is what is keeping me from committing this patch, so the quicker we can solve it, the merrier.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Aug 7, 2014 at 10:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
>> <nicolas.barbier@gmail.com> wrote:
>>> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>>>
>>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>>
>>> +1 for blockfilter, because it stresses the fact that the "physical"
>>> arrangement of rows in blocks matters for this index.
>>
>> I don't like that quite as well as summary, but I'd prefer either to
>> the current naming.
>
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).
>
> A better description would be "block range index" since we are
> indexing a range of blocks (not just one block). Perhaps a better one
> would be simply "range index", which we could abbreviate to RIN or
> BRIN.

range index might get confused with range types; block range index seems better. I like summary, but I'm fine with block range index or block filter index, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
+1 for BRIN !

On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
>> <nicolas.barbier@gmail.com> wrote:
>>> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>>>
>>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>>
>>> +1 for blockfilter, because it stresses the fact that the "physical"
>>> arrangement of rows in blocks matters for this index.
>>
>> I don't like that quite as well as summary, but I'd prefer either to
>> the current naming.
>
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).
>
> A better description would be "block range index" since we are
> indexing a range of blocks (not just one block). Perhaps a better one
> would be simply "range index", which we could abbreviate to RIN or
> BRIN.
>
> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
2014-08-07 Oleg Bartunov <obartunov@gmail.com>:

> +1 for BRIN !

+1, rolls off the tongue smoothly and captures the essence :-).

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
On 07/08/14 16:16, Simon Riggs wrote:
>
> A better description would be "block range index" since we are
> indexing a range of blocks (not just one block). Perhaps a better one
> would be simply "range index", which we could abbreviate to RIN or
> BRIN.
>

+1 for block range index (BRIN)

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
> A better description would be "block range index" since we are
> indexing a range of blocks (not just one block). Perhaps a better one
> would be simply "range index", which we could abbreviate to RIN or
> BRIN.

Seems a lot of people liked BRIN. I will be adopting that by renaming files and directories soon.

Here's v14. I fixed a few bugs; most notably, queries with IS NULL and IS NOT NULL now work correctly. Also I made the pageinspect extension be able to display existing index tuples (I had disabled that when generalizing the opclass stuff). It only works with minmax opclasses for now; it should be easy to fix if/when we add more stuff though.

I also added some docs. These are not finished by any means. They talk about the index using the BRIN term.

All existing opclasses were renamed to "<type>_minmax_ops".

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 08/07/2014 08:38 AM, Oleg Bartunov wrote:
> +1 for BRIN !
>
> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>> A better description would be "block range index" since we are
>> indexing a range of blocks (not just one block). Perhaps a better one
>> would be simply "range index", which we could abbreviate to RIN or
>> BRIN.

How about Block Range Dynamic indexes?

Or Range Usage Metadata indexes?

You see what I'm getting at:

BRanDy

RUM

... to keep with our "new indexes" naming scheme ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/07/2014 08:38 AM, Oleg Bartunov wrote:
>> +1 for BRIN !
>>
>> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>>> A better description would be "block range index" since we are
>>> indexing a range of blocks (not just one block). Perhaps a better one
>>> would be simply "range index", which we could abbreviate to RIN or
>>> BRIN.
>
> How about Block Range Dynamic indexes?
>
> Or Range Usage Metadata indexes?
>
> You see what I'm getting at:
>
> BRanDy
>
> RUM
>
> ... to keep with our "new indexes" naming scheme ...

Not the best fit for kids, fine for grad students.

BRIN seems to be a perfect consensus, so +1 for it.

--
Michael
On Thu, Aug 7, 2014 at 7:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> range index might get confused with range types; block range index
> seems better. I like summary, but I'm fine with block range index or
> block filter index, too.

+1

--
Peter Geoghegan
On 08/07/2014 05:52 PM, Michael Paquier wrote:
> On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 08/07/2014 08:38 AM, Oleg Bartunov wrote:
>>> +1 for BRIN !
>>>
>>> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> A better description would be "block range index" since we are
>>>> indexing a range of blocks (not just one block). Perhaps a better one
>>>> would be simply "range index", which we could abbreviate to RIN or
>>>> BRIN.
>>
>> How about Block Range Dynamic indexes?
>>
>> Or Range Usage Metadata indexes?
>>
>> You see what I'm getting at:
>>
>> BRanDy
>>
>> RUM
>>
>> ... to keep with our "new indexes" naming scheme ...
> Not the best fit for kids, fine for grad students.

But, it goes perfectly with our GIN and VODKA indexes.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 08/06/2014 05:35 AM, Alvaro Herrera wrote:
> FWIW I think I haven't responded appropriately to the points raised by
> Heikki. Basically, as I see it there are three main items:
>
> 1. the revmap physical-to-logical mapping is too complex; let's use
> something else.
>
> We had revmap originally in a separate fork. The current approach grew
> out of the necessity of putting it in the main fork while ensuring that
> fast access to individual pages is possible. There are of course many
> ways to skin this cat; Heikki's proposal is to have it always occupy the
> first few physical pages, rather than require a logical-to-physical
> mapping table. To implement this he proposes to move other pages out of
> the way as the index grows. I don't really have much love for this
> idea. We can change how this is implemented later in the cycle, if we
> find that a different approach is better than my proposal. I don't want
> to spend endless time meddling with this (and I definitely don't want to
> have this delay the eventual commit of the patch.)

Please also note that LockTuple is pretty expensive, compared to lightweight locks. Remember how Robert made hash indexes significantly faster a couple of years ago (commit 76837c15) by removing the need for heavy-weight locks during queries. To demonstrate that, I applied your patch, and ran a very simple test:

create table numbers as select i*1000+j as n from generate_series(0, 19999) i, generate_series(1, 1000) j;
create index number_minmax on numbers using minmax (n) with (pages_per_range=1);

I ran "explain analyze select * from numbers where n = 10;" a few times under "perf" profiler. The full profile is attached, but here's the top 10:

Samples: 3K of event 'cycles', Event count (approx.): 2332550418
+  24.15%  postmaster  postgres  [.] hash_search_with_hash_value
+  10.55%  postmaster  postgres  [.] LWLockAcquireCommon
+   7.12%  postmaster  postgres  [.] hash_any
+   6.77%  postmaster  postgres  [.] minmax_deform_tuple
+   6.67%  postmaster  postgres  [.] LWLockRelease
+   5.55%  postmaster  postgres  [.] AllocSetAlloc
+   4.37%  postmaster  postgres  [.] SetupLockInTable.isra.2
+   2.79%  postmaster  postgres  [.] LockRelease
+   2.67%  postmaster  postgres  [.] LockAcquireExtended
+   2.54%  postmaster  postgres  [.] mmgetbitmap

If you drill into those functions, you'll see that most of the time spent in hash_search_with_hash_value, LWLockAcquireCommon and hash_any is coming from heavy-weight lock handling. At a rough estimate, about 1/3 of the CPU time is spent on LockTuple/UnlockTuple.

Maybe we don't care because it's fast enough anyway, but it just seems like we're leaving a lot of money on the table. Because of that, and all the other reasons already discussed, I strongly feel that this design should be changed.

> 3. avoid MMTuple as it is just unnecessary extra complexity.
>
> The main thing that MMTuple adds is not the fact that we save 2 bytes
> by storing BlockNumber as is instead of within a TID field. Instead,
> it's that we can construct and deconstruct using our own design, which
> means we can use however many Datum entries we want and however many
> "null" flags. In normal heap and index tuples, there are always the
> same number of datum/nulls. In minmax, the number of nulls is twice the
> number of indexed columns; the number of datum values is determined by
> how many datum values are stored per opclass ("sortable" opclasses
> store 2 columns, but geometry would store only one).

Hmm. Why is the number of null bits 2x the number of indexed columns? I would expect there to be one null bit per stored Datum. (/me looks at the patch):

> /*
> * We need a double-length bitmap on an on-disk minmax index tuple;
> * the first half stores the "allnulls" bits, the second stores
> * "hasnulls".
> */

So, one bit means whether there are any heap tuples with a NULL in the indexed column, and the other bit means if the value stored for that column is a NULL. Does that mean that it's not possible to store a NULL minimum, but non-NULL maximum, for a single column? I can't immediately think of an example where you'd want to do that, but I'm also not convinced that no opclass would ever want that. Individual bits are cheap, so I'm inclined to rather have too many of them than regret later.

In any case, it should be documented in minmax_tuple.h what those null-bits are and how they're laid out in the bitmap. The comment there currently just says that there are "two null bits for each value stored" (which isn't actually correct, because you're storing two bits per indexed column, not two bits per value stored (but I just suggested changing that, after which the comment would be correct)).

PS. Please add regression tests. It would also be good to implement at least one other opclass than the b-tree based ones, to make sure that the code actually works with something else too. I'd suggest implementing the bounding box opclass for points, that seems simple.

- Heikki
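As an illustration of the double-length null bitmap quoted above, here is a small Python sketch that packs one "allnulls" and one "hasnulls" flag per indexed column. It is a model only: the function names and the exact bit positions (bit i for allnulls of column i, bit n+i for hasnulls) are assumptions chosen to match "first half / second half", not a description of the patch's on-disk format.

```python
from typing import List, Tuple


def pack_null_bits(allnulls: List[bool], hasnulls: List[bool]) -> int:
    """Pack per-column flags into one integer: first half allnulls, second half hasnulls."""
    n = len(allnulls)
    if len(hasnulls) != n:
        raise ValueError("need one allnulls and one hasnulls flag per column")
    bits = 0
    for i in range(n):
        if allnulls[i]:
            bits |= 1 << i        # column i: every value in the range is NULL
        if hasnulls[i]:
            bits |= 1 << (n + i)  # column i: at least one heap tuple has a NULL
    return bits


def unpack_null_bits(bits: int, ncolumns: int) -> Tuple[List[bool], List[bool]]:
    """Inverse of pack_null_bits for an index with ncolumns indexed columns."""
    allnulls = [bool((bits >> i) & 1) for i in range(ncolumns)]
    hasnulls = [bool((bits >> (ncolumns + i)) & 1) for i in range(ncolumns)]
    return allnulls, hasnulls
```

Note that the bitmap length depends only on the number of indexed columns, not on how many Datums the opclass stores per column, which is the point raised about a NULL minimum with a non-NULL maximum.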
I think there's a race condition in mminsert, if two backends insert a tuple to the same heap page range concurrently. mminsert does this:

1. Fetch the MMtuple for the page range
2. Check if any of the stored datums need updating
3. Unlock the page.
4. Lock the page again in exclusive mode.
5. Update the tuple.

It's possible that two backends arrive at step 3 at the same time, with different values. For example, backend A wants to update the minimum to contain 10, and backend B wants to update it to 5. Now, if backend B gets to update the tuple first, to 5, backend A will update the tuple to 10 when it gets the lock, which is wrong.

The simplest solution would be to get the buffer lock in exclusive mode to begin with, so that you don't need to release it between steps 2 and 5. That might be a significant hit on concurrency, though, when most of the insertions don't in fact have to update the value. Another idea is to re-check the updated values after acquiring the lock in exclusive mode, to see if they match the previous values.

- Heikki
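The interleaving in the example (backend A wants minimum 10, backend B wants 5) can be replayed deterministically with a small Python sketch. This is only a single-threaded model with hypothetical names (RangeTuple, plan_update, apply_blind, apply_rechecked); locks are implied by the order of the calls rather than taken for real.

```python
from typing import Optional


class RangeTuple:
    """Stand-in for the stored summary of one page range; only the minimum is modelled."""

    def __init__(self, minimum: int) -> None:
        self.minimum = minimum


def plan_update(tup: RangeTuple, value: int) -> Optional[int]:
    """Steps 1-2, under a share lock: decide whether the stored minimum must be lowered."""
    return value if value < tup.minimum else None


def apply_blind(tup: RangeTuple, planned: Optional[int]) -> None:
    """Step 5 as described: write the value planned before the lock was released."""
    if planned is not None:
        tup.minimum = planned


def apply_rechecked(tup: RangeTuple, planned: Optional[int]) -> None:
    """Step 5 with a re-check under the exclusive lock: never raise the minimum."""
    if planned is not None and planned < tup.minimum:
        tup.minimum = planned
```

Replaying "both plan, B applies, then A applies" leaves the minimum at 10 with apply_blind and at 5 with apply_rechecked, so the re-check costs one extra comparison only for insertions that wanted to update the tuple in the first place.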
Another race condition:

If a new tuple is inserted to the range while summarization runs, it's possible that the new tuple isn't included in the tuple that the summarization calculated, nor does the insertion itself update it.

1. There is no index tuple for page range 1-10
2. Summarization begins. It scans pages 1-5.
3. A new insertion inserts a heap tuple to page 1.
4. The insertion sees that there is no index tuple covering range 1-10, so it doesn't update it.
5. The summarization finishes scanning pages 6-10, and inserts the new index tuple. The summarization didn't see the newly inserted heap tuple, and hence it's not included in the calculated index tuple.

One idea is to do the summarization in two stages. First, insert a placeholder tuple, with no real value in it. A query considers the placeholder tuple the same as a missing tuple, ie. always considers it a match. An insertion updates the placeholder tuple with the value inserted, as if it was a regular mmtuple. After summarization has finished scanning the page range, it turns the placeholder tuple into a regular tuple, by unioning the placeholder value with the value formed by scanning the heap.

- Heikki
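A minimal Python sketch of the two-stage idea follows. It models a summary as a (min, max) pair and uses hypothetical names (RangeEntry, on_heap_insert, finish_summarization); the real patch would work on index tuples and opclass-specific union logic, so treat this purely as an illustration of the ordering.

```python
from typing import Optional, Tuple

Summary = Optional[Tuple[int, int]]  # (min, max), or None when no value was seen


def union(a: Summary, b: Summary) -> Summary:
    """Combine two summaries so the result covers both."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


class RangeEntry:
    """Index entry for one page range; starts life as a placeholder."""

    def __init__(self) -> None:
        self.placeholder = True
        self.summary: Summary = None


def matches(entry: Optional[RangeEntry], lower: int, upper: int) -> bool:
    """Query side: a missing or placeholder entry always counts as a match."""
    if entry is None or entry.placeholder:
        return True
    if entry.summary is None:
        return False
    lo, hi = entry.summary
    return hi > lower and lo < upper


def on_heap_insert(entry: Optional[RangeEntry], value: int) -> None:
    """Insert side: update the entry if one exists, placeholder or not."""
    if entry is None:
        return  # no index tuple for the range: nothing to update
    entry.summary = union(entry.summary, (value, value))


def finish_summarization(entry: RangeEntry, scanned: Summary) -> None:
    """Summarizer: merge what the heap scan saw with what inserters recorded."""
    entry.summary = union(entry.summary, scanned)
    entry.placeholder = False
```

In the five-step scenario above, the insertion in step 3 now finds the placeholder and records its value there, so the final union in step 5 includes it even though the heap scan had already passed page 1.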
I couldn't resist starting to hack on this, and implemented the scheme I've been having in mind:

1. MMTuple contains the block number of the heap page (range) that the tuple represents. Vacuum is no longer needed to clean up old tuples; when an index tuple is updated, the old tuple is deleted atomically with the insertion of a new tuple and updating the revmap, so no garbage is left behind.

2. LockTuple is gone. When following the pointer from revmap to MMTuple, the block number is used to check that you land on the right tuple. If not, the search is started over, looking at the revmap again.

I'm sure this still needs some cleanup, but here's the patch, based on your v14. Now that I know what this approach looks like, I still like it much better. The insert and update code is somewhat more complicated, because you have to be careful to lock the old page, new page, and revmap page in the right order. But it's not too bad, and it gets rid of all the complexity in vacuum.

- Heikki
Attachment
On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > 1. MMTuple contains the block number of the heap page (range) that the tuple > represents. Vacuum is no longer needed to clean up old tuples; when an index > tuples is updated, the old tuple is deleted atomically with the insertion of > a new tuple and updating the revmap, so no garbage is left behind. What happens if the transaction that does this aborts? Surely that means the new value is itself garbage? What cleans up that? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 8 August 2014 10:01, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > It's possible that two backends arrive at phase 3 at the same time, with > different values. For example, backend A wants to update the minimum to > contain 10, and and backend B wants to update it to 5. Now, if backend B > gets to update the tuple first, to 5, backend A will update the tuple to 10 > when it gets the lock, which is wrong. > > The simplest solution would be to get the buffer lock in exclusive mode to > begin with, so that you don't need to release it between steps 2 and 5. That > might be a significant hit on concurrency, though, when most of the > insertions don't in fact have to update the value. Another idea is to > re-check the updated values after acquiring the lock in exclusive mode, to > see if they match the previous values. Simplest solution is to re-apply the test just before update, so in the above example, if we think we want to lower the minimum to 10 and when we get there it is already 5, we just don't update. We don't need to do the re-check always, though. We can read the page LSN while holding share lock, then re-read it once we acquire exclusive lock. If LSN is the same, no need for datatype specific re-checks at all. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
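The LSN shortcut suggested above might look roughly like this. It is a sketch only: it assumes every change to the page bumps a counter standing in for the page LSN, and SummaryPage, PendingUpdate, page_store, page_plan_update and page_apply_update are hypothetical names rather than PostgreSQL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page holding one summary; lsn stands in for the page LSN. */
typedef struct SummaryPage
{
    uint64_t lsn;
    int      min;
    int      max;
} SummaryPage;

/* What a backend decided while holding only the share lock. */
typedef struct PendingUpdate
{
    uint64_t seen_lsn;
    int      new_min;
    int      new_max;
    bool     needed;
} PendingUpdate;

/* Every modification advances the LSN (assumption of this sketch). */
void
page_store(SummaryPage *p, int min, int max)
{
    assert(min <= max);
    p->min = min;
    p->max = max;
    p->lsn++;
}

/* Under share lock: remember the LSN and compute the wanted bounds. */
PendingUpdate
page_plan_update(const SummaryPage *p, int value)
{
    PendingUpdate u;

    u.seen_lsn = p->lsn;
    u.new_min = value < p->min ? value : p->min;
    u.new_max = value > p->max ? value : p->max;
    u.needed = (u.new_min != p->min) || (u.new_max != p->max);
    return u;
}

/*
 * Under exclusive lock. If the LSN is unchanged, nobody touched the page and
 * the planned bounds can be stored as-is. Otherwise redo the comparison.
 * Returns true when the datatype-specific recheck had to run.
 */
bool
page_apply_update(SummaryPage *p, const PendingUpdate *u, int value)
{
    if (p->lsn == u->seen_lsn)
    {
        page_store(p, u->new_min, u->new_max);
        return false;
    }

    PendingUpdate fresh = page_plan_update(p, value);

    if (fresh.needed)
        page_store(p, fresh.new_min, fresh.new_max);
    return true;
}
```

In this sketch the LSN comparison is a cheap integer test, so the comparison operators of the indexed datatype are only invoked again when another backend really did modify the page in between.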
On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > I couldn't resist starting to hack on this, and implemented the scheme I've > been having in mind: > > 1. MMTuple contains the block number of the heap page (range) that the tuple > represents. Vacuum is no longer needed to clean up old tuples; when an index > tuples is updated, the old tuple is deleted atomically with the insertion of > a new tuple and updating the revmap, so no garbage is left behind. > > 2. LockTuple is gone. When following the pointer from revmap to MMTuple, the > block number is used to check that you land on the right tuple. If not, the > search is started over, looking at the revmap again. Part 2 sounds interesting, especially because of the reduction in CPU that it might allow. Part 1 doesn't sound good yet. Are they connected? More importantly, can't we tweak this after commit? Delaying commit just means less time for other people to see, test, understand tune and fix. I see you (Heikki) doing lots of incremental development, lots of small commits. Can't we do this one the same? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 08/10/2014 12:22 PM, Simon Riggs wrote:
> On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
>> 1. MMTuple contains the block number of the heap page (range) that the tuple
>> represents. Vacuum is no longer needed to clean up old tuples; when an index
>> tuples is updated, the old tuple is deleted atomically with the insertion of
>> a new tuple and updating the revmap, so no garbage is left behind.
>
> What happens if the transaction that does this aborts? Surely that
> means the new value is itself garbage? What cleans up that?

It's no different from Alvaro's patch. The updated MMTuple covers the aborted value, but that's OK from a correctness point of view.

- Heikki
On 08/10/2014 12:42 PM, Simon Riggs wrote: > On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > >> I couldn't resist starting to hack on this, and implemented the scheme I've >> been having in mind: >> >> 1. MMTuple contains the block number of the heap page (range) that the tuple >> represents. Vacuum is no longer needed to clean up old tuples; when an index >> tuples is updated, the old tuple is deleted atomically with the insertion of >> a new tuple and updating the revmap, so no garbage is left behind. >> >> 2. LockTuple is gone. When following the pointer from revmap to MMTuple, the >> block number is used to check that you land on the right tuple. If not, the >> search is started over, looking at the revmap again. > > Part 2 sounds interesting, especially because of the reduction in CPU > that it might allow. > > Part 1 doesn't sound good yet. > Are they connected? Yes. The optimistic locking in part 2 is based on checking that the block number on the MMTuple matches what you're searching for, and that there is never more than one MMTuple in the index with the same block number. > More importantly, can't we tweak this after commit? Delaying commit > just means less time for other people to see, test, understand tune > and fix. I see you (Heikki) doing lots of incremental development, > lots of small commits. Can't we do this one the same? Well, I wouldn't consider "let's redesign how locking and vacuuming works and change the on-disk format" as incremental development ;-). It's more like, well, redesigning the whole thing. Any testing and tuning would certainly need to be redone after such big changes. If you agree that these changes make sense, let's do them now and not waste people's time testing and tuning a dead-end design. If you don't agree, then let's discuss that. - Heikki
On Fri, Aug 8, 2014 at 6:01 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > It's possible that two backends arrive at phase 3 at the same time, with > different values. For example, backend A wants to update the minimum to > contain 10, and and backend B wants to update it to 5. Now, if backend B > gets to update the tuple first, to 5, backend A will update the tuple to 10 > when it gets the lock, which is wrong. > > The simplest solution would be to get the buffer lock in exclusive mode to > begin with, so that you don't need to release it between steps 2 and 5. That > might be a significant hit on concurrency, though, when most of the > insertions don't in fact have to update the value. Another idea is to > re-check the updated values after acquiring the lock in exclusive mode, to > see if they match the previous values. No, the simplest solution is to re-check the bounds after acquiring the exclusive lock. So instead of doing the addValue with share lock, do a consistency check first, and if it's not consistent, do the addValue with exclusive lock.
Heikki Linnakangas wrote: > I couldn't resist starting to hack on this, and implemented the > scheme I've been having in mind: > > 1. MMTuple contains the block number of the heap page (range) that > the tuple represents. Vacuum is no longer needed to clean up old > tuples; when an index tuples is updated, the old tuple is deleted > atomically with the insertion of a new tuple and updating the > revmap, so no garbage is left behind. > > 2. LockTuple is gone. When following the pointer from revmap to > MMTuple, the block number is used to check that you land on the > right tuple. If not, the search is started over, looking at the > revmap again. Thanks, looks good, yeah. Did you just forget to attach the access/rmgrdesc/minmaxdesc.c file, or did you ignore it altogether? Anyway I hacked one up, and cleaned up some other things. > I'm sure this still needs some cleanup, but here's the patch, based > on your v14. Now that I know what this approach looks like, I still > like it much better. The insert and update code is somewhat more > complicated, because you have to be careful to lock the old page, > new page, and revmap page in the right order. But it's not too bad, > and it gets rid of all the complexity in vacuum. It seems there is some issue here, because pageinspect tells me the index is not growing properly for some reason. 
minmax_revmap_data gives me this array of TIDs after a bunch of insert/vacuum/delete/ etc: "(2,1)","(2,2)","(2,3)","(2,4)","(2,5)","(4,1)","(5,1)","(6,1)","(7,1)","(8,1)","(9,1)","(10,1)","(11,1)","(12,1)","(13,1)","(14,1)","(15,1)","(16,1)","(17,1)","(18,1)","(19,1)","(20,1)","(21,1)","(22,1)","(23,1)","(24,1)","(25,1)","(26,1)","(27,1)","(28,1)","(29,1)","(30,1)","(31,1)","(32,1)","(33,1)","(34,1)","(35,1)","(36,1)","(37,1)","(38,1)","(39,1)","(40,1)","(41,1)","(42,1)","(43,1)","(44,1)","(45,1)","(46,1)","(47,1)","(48,1)","(49,1)","(50,1)","(51,1)","(52,1)","(53,1)","(54,1)","(55,1)","(56,1)","(57,1)","(58,1)","(59,1)","(60,1)","(61,1)","(62,1)","(63,1)","(64,1)","(65,1)","(66,1)","(67,1)","(68,1)","(69,1)","(70,1)","(71,1)","(72,1)","(73,1)","(74,1)","(75,1)","(76,1)","(77,1)","(78,1)","(79,1)","(80,1)","(81,1)","(82,1)","(83,1)","(84,1)","(85,1)","(86,1)","(87,1)","(88,1)","(89,1)","(90,1)","(91,1)","(92,1)","(93,1)","(94,1)","(95,1)","(96,1)","(97,1)","(98,1)","(99,1)","(100,1)","(101,1)","(102,1)","(103,1)","(104,1)","(105,1)","(106,1)","(107,1)","(108,1)","(109,1)","(110,1)","(111,1)","(112,1)","(113,1)","(114,1)","(115,1)","(116,1)","(117,1)","(118,1)","(119,1)","(120,1)","(121,1)","(122,1)","(123,1)","(124,1)","(125,1)","(126,1)","(127,1)","(128,1)","(129,1)","(130,1)","(131,1)","(132,1)","(133,1)","(134,1)" There are some who would think that getting one item per page is suboptimal. (Maybe it's just a missing FSM update somewhere.) I've been hacking away a bit more at this; will post updated patch probably tomorrow (was about to post but just found a memory stomp in pageinspect.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote: > Heikki Linnakangas wrote: > > I'm sure this still needs some cleanup, but here's the patch, based > > on your v14. Now that I know what this approach looks like, I still > > like it much better. The insert and update code is somewhat more > > complicated, because you have to be careful to lock the old page, > > new page, and revmap page in the right order. But it's not too bad, > > and it gets rid of all the complexity in vacuum. > > It seems there is some issue here, because pageinspect tells me the > index is not growing properly for some reason. minmax_revmap_data gives > me this array of TIDs after a bunch of insert/vacuum/delete/ etc: I fixed this issue, and did a lot more rework and bugfixing. Here's v15, based on v14-heikki2. I think remaining issues are mostly minimal (pageinspect should output block number alongside each tuple, now that we have it, for example.) I haven't tested the new xlog records yet. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 08/15/2014 02:02 AM, Alvaro Herrera wrote: > Alvaro Herrera wrote: >> Heikki Linnakangas wrote: > >>> I'm sure this still needs some cleanup, but here's the patch, based >>> on your v14. Now that I know what this approach looks like, I still >>> like it much better. The insert and update code is somewhat more >>> complicated, because you have to be careful to lock the old page, >>> new page, and revmap page in the right order. But it's not too bad, >>> and it gets rid of all the complexity in vacuum. >> >> It seems there is some issue here, because pageinspect tells me the >> index is not growing properly for some reason. minmax_revmap_data gives >> me this array of TIDs after a bunch of insert/vacuum/delete/ etc: > > I fixed this issue, and did a lot more rework and bugfixing. Here's > v15, based on v14-heikki2. Thanks! > I think remaining issues are mostly minimal (pageinspect should output > block number alongside each tuple, now that we have it, for example.) There's this one issue I left in my patch version that I think we should do something about: > + /* > + * No luck. Assume that the revmap was updated concurrently. > + * > + * XXX: it would be nice to add some kind of a sanity check here to > + * avoid looping infinitely, if the revmap points to wrong tuple for > + * some reason. > + */ This happens when we follow the revmap to a tuple, but find that the tuple points to a different block than what the revmap claimed. Currently, we just assume that it's because the tuple was updated concurrently, but while hacking, I frequently had a broken index where the revmap pointed to bogus tuples or the tuples had a missing/wrong block number on them, and ran into infinite loop here. It's clearly a case of a corrupt index and shouldn't happen, but I would imagine that it's a fairly typical way this would fail in production too because of hardware issues or bugs. So I think we need to work a bit harder to stop the looping and throw an error instead. 
Perhaps something as simple as keeping a loop counter and giving up after 1000 attempts would be good enough. The window between releasing the lock on the revmap, and acquiring the lock on the page containing the MMTuple is very narrow, so the chances of losing that race to a concurrent update more than 1-2 times in a row is vanishingly small. - Heikki
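A bounded version of the retry loop could be sketched like this. The structures are toy stand-ins (a revmap that is just an array of slot numbers, no buffer locks), and ToyTuple, ToyIndex, toy_lookup and MAX_LOOKUP_ATTEMPTS are hypothetical names; only the limit of 1000 comes from the message above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOOKUP_ATTEMPTS 1000

/* Hypothetical index tuple carrying the heap block it summarizes. */
typedef struct ToyTuple
{
    unsigned heap_blk;
    int      min;
    int      max;
} ToyTuple;

/* Hypothetical index: revmap[i] is the slot of the tuple for range i. */
typedef struct ToyIndex
{
    const size_t   *revmap;
    size_t          nranges;
    const ToyTuple *tuples;
    size_t          ntuples;
} ToyIndex;

/*
 * Follow the revmap and verify the block number on the tuple. A mismatch is
 * assumed to be a concurrent update, so the revmap is read again, but only
 * up to MAX_LOOKUP_ATTEMPTS times. Returns NULL when giving up, which a real
 * caller would presumably report as index corruption.
 */
const ToyTuple *
toy_lookup(const ToyIndex *idx, size_t range_no, unsigned heap_blk,
           int *attempts)
{
    int tries = 0;

    assert(range_no < idx->nranges);
    while (tries < MAX_LOOKUP_ATTEMPTS)
    {
        size_t slot = idx->revmap[range_no];    /* re-read on every retry */

        tries++;
        if (slot < idx->ntuples && idx->tuples[slot].heap_blk == heap_blk)
        {
            *attempts = tries;
            return &idx->tuples[slot];
        }
    }
    *attempts = tries;
    return NULL;
}
```

In a healthy index the first attempt succeeds; with a revmap entry that points at the wrong tuple the function returns NULL after 1000 attempts instead of spinning forever.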
On 08/15/2014 10:26 AM, Heikki Linnakangas wrote: > On 08/15/2014 02:02 AM, Alvaro Herrera wrote: >> Alvaro Herrera wrote: >>> Heikki Linnakangas wrote: >> >>>> I'm sure this still needs some cleanup, but here's the patch, based >>>> on your v14. Now that I know what this approach looks like, I still >>>> like it much better. The insert and update code is somewhat more >>>> complicated, because you have to be careful to lock the old page, >>>> new page, and revmap page in the right order. But it's not too bad, >>>> and it gets rid of all the complexity in vacuum. >>> >>> It seems there is some issue here, because pageinspect tells me the >>> index is not growing properly for some reason. minmax_revmap_data gives >>> me this array of TIDs after a bunch of insert/vacuum/delete/ etc: >> >> I fixed this issue, and did a lot more rework and bugfixing. Here's >> v15, based on v14-heikki2. > > Thanks! > >> I think remaining issues are mostly minimal (pageinspect should output >> block number alongside each tuple, now that we have it, for example.) > > There's this one issue I left in my patch version that I think we should > do something about: > >> + /* >> + * No luck. Assume that the revmap was updated concurrently. >> + * >> + * XXX: it would be nice to add some kind of a sanity check here to >> + * avoid looping infinitely, if the revmap points to wrong tuple for >> + * some reason. >> + */ > > This happens when we follow the revmap to a tuple, but find that the > tuple points to a different block than what the revmap claimed. > Currently, we just assume that it's because the tuple was updated > concurrently, but while hacking, I frequently had a broken index where > the revmap pointed to bogus tuples or the tuples had a missing/wrong > block number on them, and ran into infinite loop here. 
It's clearly a > case of a corrupt index and shouldn't happen, but I would imagine that > it's a fairly typical way this would fail in production too because of > hardware issues or bugs. So I think we need to work a bit harder to stop > the looping and throw an error instead. > > Perhaps something as simple as keeping a loop counter and giving up > after 1000 attempts would be good enough. The window between releasing > the lock on the revmap, and acquiring the lock on the page containing > the MMTuple is very narrow, so the chances of losing that race to a > concurrent update more than 1-2 times in a row is vanishingly small. Reading the patch more closely, I see that you added a check that when we loop, we throw an error if the new item pointer in the revmap is the same as before. In theory, it's possible that two concurrent updates happen: one that moves the tuple we're looking for elsewhere, and another that moves it back again. The probability of that is also vanishingly small, so maybe that's OK. Or we could check the LSN; if the revmap has been updated, its LSN must've changed. - Heikki
On Fri, Aug 15, 2014 at 8:02 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Alvaro Herrera wrote: >> Heikki Linnakangas wrote: > >> > I'm sure this still needs some cleanup, but here's the patch, based >> > on your v14. Now that I know what this approach looks like, I still >> > like it much better. The insert and update code is somewhat more >> > complicated, because you have to be careful to lock the old page, >> > new page, and revmap page in the right order. But it's not too bad, >> > and it gets rid of all the complexity in vacuum. >> >> It seems there is some issue here, because pageinspect tells me the >> index is not growing properly for some reason. minmax_revmap_data gives >> me this array of TIDs after a bunch of insert/vacuum/delete/ etc: > > I fixed this issue, and did a lot more rework and bugfixing. Here's > v15, based on v14-heikki2. I've not read the patch yet. But while testing the feature, I found that * Brin index cannot be created on CHAR(n) column. Maybe other data types have the same problem. * FILLFACTOR cannot be set in brin index. Are these intentional? Regards, -- Fujii Masao
On 08/15/2014 02:02 AM, Alvaro Herrera wrote: > Alvaro Herrera wrote: >> Heikki Linnakangas wrote: > >>> I'm sure this still needs some cleanup, but here's the patch, based >>> on your v14. Now that I know what this approach looks like, I still >>> like it much better. The insert and update code is somewhat more >>> complicated, because you have to be careful to lock the old page, >>> new page, and revmap page in the right order. But it's not too bad, >>> and it gets rid of all the complexity in vacuum. >> >> It seems there is some issue here, because pageinspect tells me the >> index is not growing properly for some reason. minmax_revmap_data gives >> me this array of TIDs after a bunch of insert/vacuum/delete/ etc: > > I fixed this issue, and did a lot more rework and bugfixing. Here's > v15, based on v14-heikki2. So, the other design change I've been advocating is to store the revmap in the first N blocks, instead of having the two-level structure with array pages and revmap pages. Attached is a patch for that, to be applied after v15. When the revmap needs to be expanded, all the tuples on it are moved elsewhere one-by-one. That adds some latency to the unfortunate guy who needs to do that, but as the patch stands, the revmap is only ever extended by VACUUM or CREATE INDEX, so I think that's fine. Like with my previous patch, the point is to demonstrate how much simpler the code becomes this way; I'm sure there are bugs and cleanup still necessary. PS. Spotted one oversight in patch v15: callers of mm_doupdate must check the return value, and retry the operation if it returns false. - Heikki
Attachment
Fujii Masao wrote:
> I've not read the patch yet. But while testing the feature, I found that
>
> * Brin index cannot be created on CHAR(n) column.
>   Maybe other data types have the same problem.

Yeah, it's just a matter of adding an opclass for it -- pretty simple stuff really, because you don't need to write any code, just add a bunch of catalog entries and an OPCINFO line in mmsortable.c.

Right now there are opclasses for the following types:

int4
numeric
text
date
timestamp with time zone
timestamp
time with time zone
time
"char"

We can eventually extend to cover all types that have btree opclasses, but we can do that in a separate commit. I'm also considering removing the opclass for time with time zone, as it's a pretty useless type. I mostly added the ones that are there as a way to test that it behaved reasonably in the various cases (pass by val vs. not, variable width vs. fixed, different alignment requirements)

Of course, the real interesting part is adding a completely different opclass, such as one that stores bounding boxes.

> * FILLFACTOR cannot be set in brin index.

I hadn't added this one because I didn't think there was much point previously, but I think it might now be useful to allow same-page updates.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas wrote: > So, the other design change I've been advocating is to store the > revmap in the first N blocks, instead of having the two-level > structure with array pages and revmap pages. > > Attached is a patch for that, to be applied after v15. When the > revmap needs to be expanded, all the tuples on it are moved > elsewhere one-by-one. That adds some latency to the unfortunate guy > who needs to do that, but as the patch stands, the revmap is only > ever extended by VACUUM or CREATE INDEX, so I think that's fine. > Like with my previous patch, the point is to demonstrate how much > simpler the code becomes this way; I'm sure there are bugs and > cleanup still necessary. Thanks for the prodding. I didn't like this too much initially, but after going over it a few times I agree that having less code and a less complex physical representation is better. Your proposed approach is to just call the update routine on every tuple in the page we're evacuating. There are optimizations possible (such as doing bulk updates; and instead of updating the revmap, keep a redirection pointer in the page we just evacuated, so that the revmap can be updated lazily later), but I have spent way too long on this already that I am fine with keeping what we have here. If somebody later wants to contribute improvements to this, it'd be welcome. But on the other hand the operation is not that frequent and as you say it's not executed by user-facing queries, so perhaps it's okay. I cleaned it up some: mainly I created a separate file (mmpageops.c) that now hosts the routines related to page operations: mm_doinsert, mm_doupdate, mm_start_evacuating_page, mm_evacuate_page. There are other rather very minor changes here and there; also added CHECK_FOR_INTERRUPTS in all relevant loops. 
This bit in mm_doupdate I just couldn't understand:

    /* If both tuples are in fact equal, there is nothing to do */
    if (!minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
    {
        LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
        return false;
    }

Isn't the test exactly reversed? I don't see how this would work. I updated it to

    /*
     * If both tuples are identical, there is nothing to do; except that if we
     * were requested to move the tuple across pages, we do it even if they are
     * equal.
     */
    if (samepage && minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
    {
        LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
        return false;
    }

> PS. Spotted one oversight in patch v15: callers of mm_doupdate must
> check the return value, and retry the operation if it returns false.

Right, thanks. Fixed.

So here's v16, rebased on top of 9bac66020. As far as I am concerned, this is the last version before I start renaming everything to BRIN and then commit.

 contrib/pageinspect/Makefile             |   2 +-
 contrib/pageinspect/mmfuncs.c            | 407 +++++++++++++
 contrib/pageinspect/pageinspect--1.2.sql |  36 ++
 contrib/pg_xlogdump/rmgrdesc.c           |   1 +
 doc/src/sgml/brin.sgml                   | 248 ++++++++
 doc/src/sgml/filelist.sgml               |   1 +
 doc/src/sgml/indices.sgml                |  36 +-
 doc/src/sgml/postgres.sgml               |   1 +
 minmax-proposal                          | 306 ++++++++++
 src/backend/access/Makefile              |   2 +-
 src/backend/access/common/reloptions.c   |   7 +
 src/backend/access/heap/heapam.c         |  22 +-
 src/backend/access/minmax/Makefile       |  17 +
 src/backend/access/minmax/minmax.c       | 942 +++++++++++++++++++++++++++++++
 src/backend/access/minmax/mmpageops.c    | 638 +++++++++++++++++++++
 src/backend/access/minmax/mmrevmap.c     | 451 +++++++++++++++
 src/backend/access/minmax/mmsortable.c   | 287 ++++++++++
 src/backend/access/minmax/mmtuple.c      | 478 ++++++++++++++++
 src/backend/access/minmax/mmxlog.c       | 323 +++++++++++
 src/backend/access/rmgrdesc/Makefile     |   3 +-
 src/backend/access/rmgrdesc/minmaxdesc.c |  89 +++
 src/backend/access/transam/rmgr.c        |   1 +
 src/backend/catalog/index.c              |  24 +
 src/backend/replication/logical/decode.c |   1 +
 src/backend/storage/page/bufpage.c       | 179 +++++-
 src/backend/utils/adt/selfuncs.c         |  24 +
 src/include/access/heapam.h              |   2 +
 src/include/access/minmax.h              |  52 ++
 src/include/access/minmax_internal.h     |  86 +++
 src/include/access/minmax_page.h         |  70 +++
 src/include/access/minmax_pageops.h      |  29 +
 src/include/access/minmax_revmap.h       |  36 ++
 src/include/access/minmax_tuple.h        |  90 +++
 src/include/access/minmax_xlog.h         | 106 ++++
 src/include/access/reloptions.h          |   3 +-
 src/include/access/relscan.h             |   4 +-
 src/include/access/rmgrlist.h            |   1 +
 src/include/catalog/index.h              |   8 +
 src/include/catalog/pg_am.h              |   2 +
 src/include/catalog/pg_amop.h            |  81 +++
 src/include/catalog/pg_amproc.h          |  73 +++
 src/include/catalog/pg_opclass.h         |   9 +
 src/include/catalog/pg_opfamily.h        |  10 +
 src/include/catalog/pg_proc.h            |  52 ++
 src/include/storage/bufpage.h            |   2 +
 src/include/utils/selfuncs.h             |   1 +
 src/test/regress/expected/opr_sanity.out |  14 +-
 src/test/regress/sql/opr_sanity.sql      |   7 +-
 48 files changed, 5248 insertions(+), 16 deletions(-)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Alvaro Herrera wrote: > So here's v16, rebased on top of 9bac66020. As far as I am concerned, > this is the last version before I start renaming everything to BRIN and > then commit. FWIW in case you or others have interest, here's the diff between your patch and v16. Also, for illustrative purposes, the diff between versions yours and mine of the code that got moved to mmpageops.c because it's difficult to see it from the partial patch. (There's nothing to do with that partial diff other than read it directly.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Here's version 18. I have renamed it: These are now BRIN indexes. I have fixed numerous race conditions and deadlocks. In particular I fixed this problem you noted: Heikki Linnakangas wrote: > Another race condition: > > If a new tuple is inserted to the range while summarization runs, > it's possible that the new tuple isn't included in the tuple that > the summarization calculated, nor does the insertion itself udpate > it. I did it mostly in the way you outlined, i.e. by way of a placeholder tuple that gets updated by concurrent inserters and then the tuple resulting from the scan is unioned with the values in the updated placeholder tuple. This required the introduction of one extra support proc for opclasses (pretty simple stuff anyhow). There should be only minor items left now, such as silencing the WARNING: concurrent insert in progress within table "sales" which is emitted by IndexBuildHeapScan (possibly thousands of times) when doing a summarization of a range being inserted into or otherwise modified. Basically the issue here is that IBHS assumes it's being run with ShareLock in the heap (which blocks inserts), but here we're using it with ShareUpdateExclusive only, which lets inserts in. There is no harm AFAICS because of the placeholder tuple stuff I describe above. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Mon, September 8, 2014 18:02, Alvaro Herrera wrote:
> Here's version 18. I have renamed it: These are now BRIN indexes.
>

I get into a BadArgument after:

$ cat crash.sql
-- drop table if exists t_100_000_000 cascade;
create table t_100_000_000 as select cast(i as integer) from generate_series(1,100000000) as f(i) ;
-- drop index if exists t_100_000_000_i_brin_idx;
create index t_100_000_000_i_brin_idx on t_100_000_000 using brin(i);
select pg_size_pretty(pg_relation_size('t_100_000_000_i_brin_idx'));
select i from t_100_000_000 where i between 10000 and 1009999; -- ( + 999999 )

Log file says:

TRAP: BadArgument("!(((context) != ((void *)0) && (((((const Node*)((context)))->type) == T_AllocSetContext))))", File: "mcxt.c", Line: 752)
2014-09-08 19:54:46.071 CEST 30151 LOG: server process (PID 30336) was terminated by signal 6: Aborted
2014-09-08 19:54:46.071 CEST 30151 DETAIL: Failed process was running: select i from t_100_000_000 where i between 10000 and 1009999;

The crash is caused by the last select statement; the table and index create are OK.

it only happens with a largish table; small tables are OK.

Linux / Centos / 32 GB.

PostgreSQL 9.5devel_minmax_20140908_1809_0640c1bfc091 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.9.1, 64-bit

         setting          |              current_setting
--------------------------+--------------------------------------------
 autovacuum               | off
 port                     | 6444
 shared_buffers           | 100MB
 effective_cache_size     | 4GB
 work_mem                 | 10MB
 maintenance_work_mem     | 1GB
 checkpoint_segments      | 20
 data_checksums           | on
 server_version           | 9.5devel_minmax_20140908_1809_0640c1bfc091
 pg_postmaster_start_time | 2014-09-08 19:53 (uptime: 0d 0h 6m 54s)

'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax'
'--with-pgport=6444'
'--bindir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/bin'
'--libdir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/lib'
'--enable-depend'
'--enable-cassert'
'--enable-debug'
'--with-perl'
'--with-openssl'
'--with-libxml'
'--with-extra-version=_minmax_20140908_1809_0640c1bfc091'

pgpatches/0095/minmax/20140908/minmax-18.patch

thanks,

Erik Rijkers
Erik Rijkers wrote: > Log file says: > > TRAP: BadArgument("!(((context) != ((void *)0) && (((((const Node*)((context)))->type) == T_AllocSetContext))))", File: > "mcxt.c", Line: 752) > 2014-09-08 19:54:46.071 CEST 30151 LOG: server process (PID 30336) was terminated by signal 6: Aborted > 2014-09-08 19:54:46.071 CEST 30151 DETAIL: Failed process was running: select i from t_100_000_000 where i between 10000 > and 1009999; A double-free mistake -- here's a patch. Thanks. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 09/08/2014 07:02 PM, Alvaro Herrera wrote:
> Here's version 18. I have renamed it: These are now BRIN indexes.
>
> I have fixed numerous race conditions and deadlocks. In particular I
> fixed this problem you noted:
>
> Heikki Linnakangas wrote:
>> Another race condition:
>>
>> If a new tuple is inserted to the range while summarization runs,
>> it's possible that the new tuple isn't included in the tuple that
>> the summarization calculated, nor does the insertion itself udpate
>> it.
>
> I did it mostly in the way you outlined, i.e. by way of a placeholder
> tuple that gets updated by concurrent inserters and then the tuple
> resulting from the scan is unioned with the values in the updated
> placeholder tuple. This required the introduction of one extra support
> proc for opclasses (pretty simple stuff anyhow).

Hmm. So the union support proc is only called if there is a race condition? That makes it very difficult to test, I'm afraid.

It would make more sense to pass BrinValues to the support functions, rather than DeformedBrTuple. An opclass's support function should never need to access the values for other columns.

Does minmaxUnion handle NULLs correctly?

minmaxUnion pfrees the old values. Is that necessary? What memory context does the function run in? If the code runs in a short-lived memory context, you might as well just let them leak. If it runs in a long-lived context, well, perhaps it shouldn't. It's nicer to write functions that can leak freely. IIRC, GiST and GIN run the support functions in a temporary context. In any case, it might be worth noting explicitly in the docs which functions may leak and which may not.

If you add a new datatype, and define b-tree operators for it, what is required to create a minmax opclass for it? Would it be possible to generalize the functions in brin_minmax.c so that they can be reused for any datatype (with b-tree operators) without writing any new C code?
I think we're almost there; the only thing that differs between each data type is the opcinfo function. Let's pass the type OID as argument to the opcinfo function. You could then have just a single minmax_opcinfo function, instead of the macro to generate a separate function for each built-in datatype. In general, this patch is in pretty good shape now, thanks! - Heikki
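As a rough illustration of the NULL-handling question raised above, here is a minimal standalone sketch of what a minmax "union" has to do. It is not the patch's minmaxUnion: the struct, its field names, and the use of plain int instead of Datum are assumptions made only to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one column's summary of a block range. */
typedef struct MinmaxSummary
{
    bool all_nulls;   /* no non-NULL value has been seen in the range */
    bool has_nulls;   /* at least one NULL has been seen in the range */
    int  min;         /* meaningful only when all_nulls is false */
    int  max;         /* meaningful only when all_nulls is false */
} MinmaxSummary;

/* Widen *a so that it covers everything that either *a or *b covered. */
static void
minmax_union(MinmaxSummary *a, const MinmaxSummary *b)
{
    assert(a != NULL && b != NULL);

    /* NULLs seen on either side must survive the merge. */
    if (b->has_nulls)
        a->has_nulls = true;

    /* b carries no values: nothing else to merge. */
    if (b->all_nulls)
        return;

    /* a carries no values yet: adopt b's bounds wholesale. */
    if (a->all_nulls)
    {
        a->all_nulls = false;
        a->min = b->min;
        a->max = b->max;
        return;
    }

    /* Both sides carry values: take the wider bounds. */
    if (b->min < a->min)
        a->min = b->min;
    if (b->max > a->max)
        a->max = b->max;
}
```

With pass-by-value ints there is nothing to free here; with pass-by-reference types the "adopt" and "widen" steps are where a copy of the new bound and the release of the old one would come in, which is the memory-context question asked above.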
On 08/09/14 13:02, Alvaro Herrera wrote:
> Here's version 18. I have renamed it: These are now BRIN indexes.
>
> I have fixed numerous race conditions and deadlocks. In particular I
> fixed this problem you noted:
>
> Heikki Linnakangas wrote:
>> Another race condition:
>>
>> If a new tuple is inserted to the range while summarization runs,
>> it's possible that the new tuple isn't included in the tuple that
>> the summarization calculated, nor does the insertion itself update
>> it.
>
> I did it mostly in the way you outlined, i.e. by way of a placeholder
> tuple that gets updated by concurrent inserters and then the tuple
> resulting from the scan is unioned with the values in the updated
> placeholder tuple. This required the introduction of one extra support
> proc for opclasses (pretty simple stuff anyhow).
>
> There should be only minor items left now, such as silencing the
>
> WARNING: concurrent insert in progress within table "sales"
>
> which is emitted by IndexBuildHeapScan (possibly thousands of times)
> when doing a summarization of a range being inserted into or otherwise
> modified. Basically the issue here is that IBHS assumes it's being run
> with ShareLock in the heap (which blocks inserts), but here we're using
> it with ShareUpdateExclusive only, which lets inserts in. There is no
> harm AFAICS because of the placeholder tuple stuff I describe above.

Debugging VACUUM VERBOSE ANALYZE over a table that is concurrently being updated/inserted into.

(gdb)
Breakpoint 1, errfinish (dummy=0) at elog.c:411
411 ErrorData *edata = &errordata[errordata_stack_depth];

The complete backtrace is at http://pastebin.com/gkigSNm7

Also, I found pages with an unknown type (using default parameters for the index creation):

 brin_page_type | array_agg
----------------+-----------
 unknown (00)   | {3,4}
 revmap         | {1}
 regular        | {2}
 meta           | {0}
(4 rows)

--
Emanuel Calvo
@3manuek
Here's an updated version, rebased to current master.

Erik Rijkers wrote:
> I get into a BadArgument after:

Fixed in the attached, thanks.

Emanuel Calvo wrote:
> Debugging VACUUM VERBOSE ANALYZE over a concurrent table being
> updated/insert.
>
> (gdb)
> Breakpoint 1, errfinish (dummy=0) at elog.c:411
> 411 ErrorData *edata = &errordata[errordata_stack_depth];
>
> The complete backtrace is at http://pastebin.com/gkigSNm7

The file/line info in the backtrace says that this is reporting this message:

    ereport(elevel,
            (errmsg("scanned index \"%s\" to remove %d row versions",
                    RelationGetRelationName(indrel),
                    vacrelstats->num_dead_tuples),
             errdetail("%s.", pg_rusage_show(&ru0))));

Not sure why you're reporting it, since this is expected.

There were thousands of WARNINGs being emitted by IndexBuildHeapScan when concurrent insertions occurred; I fixed that by setting the ii_Concurrent flag, which makes that function obtain a snapshot to use for the scan. This is okay because concurrent insertions will be detected via the placeholder tuple mechanism as previously described. (There is no danger of serializable transactions etc, because this only runs in vacuum. I added an Assert() nevertheless.)

> Also, I found pages with an unknown type (using default parameters for
> the index creation):
>
>  brin_page_type | array_agg
> ----------------+-----------
>  unknown (00)   | {3,4}
>  revmap         | {1}
>  regular        | {2}
>  meta           | {0}
> (4 rows)

Ah, we had an issue with the vacuuming of the FSM. I had to make that more aggressive; I was able to reproduce the problem and it is fixed now.

Heikki Linnakangas wrote:
> Hmm. So the union support proc is only called if there is a race
> condition? That makes it very difficult to test, I'm afraid.

Yes. I guess we can fix that by having an assert-only block that uses the union support proc to verify consistency of generated tuples. This might be difficult for types involving floating point arithmetic.
> It would make more sense to pass BrinValues to the support > functions, rather than DeformedBrTuple. An opclass'es support > function should never need to access the values for other columns. Agreed -- fixed. I added attno to BrinValues, which makes this easier. > Does minmaxUnion handle NULLs correctly? Nope, fixed. > minmaxUnion pfrees the old values. Is that necessary? What memory > context does the function run in? If the code runs in a short-lived > memory context, you might as well just let them leak. If it runs in > a long-lived context, well, perhaps it shouldn't. It's nicer to > write functions that can leak freely. IIRC, GiST and GIN runs the > support functions in a temporary context. In any case, it might be > worth noting explicitly in the docs which functions may leak and > which may not. Yeah, I had tried playing with contexts in general previously but it turned out that there was too much bureaucratic overhead (quite visible in profiles), so I ripped it out and did careful retail pfree instead (it's not *that* difficult). Maybe I went overboard with it, and that with more careful planning we can do better; I don't think this is critical ATM -- we can certainly stand later cleanup in this area. > If you add a new datatype, and define b-tree operators for it, what > is required to create a minmax opclass for it? Would it be possible > to generalize the functions in brin_minmax.c so that they can be > reused for any datatype (with b-tree operators) without writing any > new C code? I think we're almost there; the only thing that differs > between each data type is the opcinfo function. Let's pass the type > OID as argument to the opcinfo function. You could then have just a > single minmax_opcinfo function, instead of the macro to generate a > separate function for each built-in datatype. Yeah, that's how I had that initially. 
I changed it to what it's now as part of a plan to enable building cross-type opclasses, so you could have "WHERE int8col=42" without requiring a cast of the constant to type int8. This might have been a thinko, because AFAICS it's possible to build them with a constant opcinfo as well (I changed several other things to support this, as described in a previous email.) I will look into this later. Thanks for the review! -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
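The placeholder-tuple scheme discussed in this message can be sketched as follows. This is a single-threaded simulation written for illustration only: the struct, the function names, and the way "concurrent" inserts are passed in as a second array are all assumptions, and real locking and WAL logging are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minmax summary of one block range (int column, NULLs ignored). */
typedef struct RangeSummary
{
    bool empty;   /* true until the first value is added */
    int  min;
    int  max;
} RangeSummary;

/* Widen the summary so that it covers v. */
static void
summary_add_value(RangeSummary *s, int v)
{
    assert(s != NULL);
    if (s->empty)
    {
        s->empty = false;
        s->min = s->max = v;
        return;
    }
    if (v < s->min)
        s->min = v;
    if (v > s->max)
        s->max = v;
}

/* Merge src into dst. */
static void
summary_union(RangeSummary *dst, const RangeSummary *src)
{
    assert(dst != NULL && src != NULL);
    if (src->empty)
        return;
    summary_add_value(dst, src->min);
    summary_add_value(dst, src->max);
}

/*
 * Summarize one range.  "scanned" holds the values the heap scan saw;
 * "concurrent" holds values inserted while the scan was running, which
 * the scan missed and which therefore only reached the placeholder.
 */
static RangeSummary
summarize_range(const int *scanned, size_t nscanned,
                const int *concurrent, size_t nconcurrent)
{
    RangeSummary placeholder = { true, 0, 0 };  /* 1. placeholder goes in first */
    RangeSummary result = { true, 0, 0 };
    size_t i;

    for (i = 0; i < nscanned; i++)              /* 2. scan the range */
        summary_add_value(&result, scanned[i]);

    for (i = 0; i < nconcurrent; i++)           /* inserters update the placeholder */
        summary_add_value(&placeholder, concurrent[i]);

    summary_union(&result, &placeholder);       /* 3. union scan result and placeholder */
    return result;
}
```

Without step 3 the values that arrived during the scan would be covered by neither the scan result nor the insertion path, which is the race described earlier in the thread.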
Alvaro Herrera wrote: > Heikki Linnakangas wrote: > > If you add a new datatype, and define b-tree operators for it, what > > is required to create a minmax opclass for it? Would it be possible > > to generalize the functions in brin_minmax.c so that they can be > > reused for any datatype (with b-tree operators) without writing any > > new C code? I think we're almost there; the only thing that differs > > between each data type is the opcinfo function. Let's pass the type > > OID as argument to the opcinfo function. You could then have just a > > single minmax_opcinfo function, instead of the macro to generate a > > separate function for each built-in datatype. > > Yeah, that's how I had that initially. I changed it to what it's now as > part of a plan to enable building cross-type opclasses, so you could > have "WHERE int8col=42" without requiring a cast of the constant to type > int8. This might have been a thinko, because AFAICS it's possible to > build them with a constant opcinfo as well (I changed several other > things to support this, as described in a previous email.) I will look > into this later. I found out that we don't really throw errors in such cases anymore; we insert casts instead. Maybe there's a performance argument that it might be better to use existing cross-type operators than casting, but justifying this work just turned a lot harder. Here's a patch that reverts opcinfo into a generic function that receives the type OID. I will look into adding some testing mechanism for the union support proc; with that I will just consider the patch ready for commit and will push. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
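To make the "single generic opcinfo function that receives the type OID" idea concrete, here is a small standalone sketch. ToyOpcInfo and toy_minmax_opcinfo are hypothetical names, not the patch's BrinOpcInfo or its opcinfo function; the only assumption carried over from the discussion is that a minmax summary stores two values (minimum and maximum) of the indexed column's type.

```c
#include <assert.h>
#include <stdlib.h>

/* Redefined locally so the sketch compiles without PostgreSQL headers. */
typedef unsigned int Oid;

/* Hypothetical description of what one column's summary stores. */
typedef struct ToyOpcInfo
{
    int nstored;          /* number of values kept per block range */
    Oid stored_types[2];  /* type of each stored value */
} ToyOpcInfo;

/*
 * One function serves every datatype: the caller passes the column's
 * type OID instead of calling a macro-generated per-type function.
 * Returns NULL if memory cannot be allocated; the caller frees the result.
 */
static ToyOpcInfo *
toy_minmax_opcinfo(Oid type_oid)
{
    ToyOpcInfo *info = malloc(sizeof(ToyOpcInfo));

    if (info == NULL)
        return NULL;

    info->nstored = 2;                  /* min and max */
    info->stored_types[0] = type_oid;   /* min has the column's type */
    info->stored_types[1] = type_oid;   /* so does max */
    return info;
}
```

For example, toy_minmax_opcinfo(23) and toy_minmax_opcinfo(20) would describe int4 and int8 columns respectively with the same code path.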
On Tue, Sep 23, 2014 at 3:04 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Alvaro Herrera wrote: >> Heikki Linnakangas wrote: >> > If you add a new datatype, and define b-tree operators for it, what >> > is required to create a minmax opclass for it? Would it be possible >> > to generalize the functions in brin_minmax.c so that they can be >> > reused for any datatype (with b-tree operators) without writing any >> > new C code? I think we're almost there; the only thing that differs >> > between each data type is the opcinfo function. Let's pass the type >> > OID as argument to the opcinfo function. You could then have just a >> > single minmax_opcinfo function, instead of the macro to generate a >> > separate function for each built-in datatype. >> >> Yeah, that's how I had that initially. I changed it to what it's now as >> part of a plan to enable building cross-type opclasses, so you could >> have "WHERE int8col=42" without requiring a cast of the constant to type >> int8. This might have been a thinko, because AFAICS it's possible to >> build them with a constant opcinfo as well (I changed several other >> things to support this, as described in a previous email.) I will look >> into this later. > > I found out that we don't really throw errors in such cases anymore; we > insert casts instead. Maybe there's a performance argument that it > might be better to use existing cross-type operators than casting, but > justifying this work just turned a lot harder. Here's a patch that > reverts opcinfo into a generic function that receives the type OID. > > I will look into adding some testing mechanism for the union support > proc; with that I will just consider the patch ready for commit and will > push. With all respect, I think this is a bad idea. I know you've put a lot of energy into this patch and I'm confident it's made a lot of progress. 
But as with Stephen's patch, the final form deserves a thorough round of looking over by someone else before it goes in. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 24, 2014 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Sep 23, 2014 at 3:04 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Alvaro Herrera wrote: >> I will look into adding some testing mechanism for the union support >> proc; with that I will just consider the patch ready for commit and will >> push. > > With all respect, I think this is a bad idea. I know you've put a lot > of energy into this patch and I'm confident it's made a lot of > progress. But as with Stephen's patch, the final form deserves a > thorough round of looking over by someone else before it goes in. Would this person be an extra committer or a simple reviewer? It would give more assurance if such huge patches (a couple of thousand lines) got an extra +1 from another committer, proving that the code has been reviewed by people well-experienced with backend code. Now, as this would put more pressure on committers, an extra external pair of eyes, be it a non-committer but, let's say, a seasoned reviewer, would be fine IMO. -- Michael
Robert Haas wrote: > With all respect, I think this is a bad idea. I know you've put a lot > of energy into this patch and I'm confident it's made a lot of > progress. But as with Stephen's patch, the final form deserves a > thorough round of looking over by someone else before it goes in. As you can see in the thread, Heikki's put a lot of review effort into it (including important code contributions); I don't feel I'm rushing it at this point. If you or somebody else want to give it a look, I have no problem waiting a bit longer. I don't want to delay indefinitely, though, because I think it's better shipped early in the release cycle than later, to allow for further refinements and easier testing by other interested parties. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 23, 2014 at 7:35 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Would this person be an extra committer or a simple reviewer? It > would give more assurance if such huge patches (a couple of thousand > lines) got an extra +1 from another committer, proving that the code > has been reviewed by people well-experienced with backend code. Now, as > this would put more pressure on committers, an extra > external pair of eyes, be it a non-committer but, let's say, a seasoned > reviewer, would be fine IMO. If you're volunteering, I certainly wouldn't say "no". The more the merrier. Same with anyone else. Since Heikki looked at it before, I also think it would be appropriate to give him a bit of time to see if he feels satisfied with it now - nobody on this project has more experience with indexing than he does, but he may not have the time, and even if he does, someone else might spot something he misses. Alvaro's quite right to point out that there is no sense in waiting a long time for a review that isn't coming. That just backs everything up against the end of the release cycle to no benefit. But if there's review available from experienced people within the community, taking advantage of that now might find things that could be much harder to fix later. That's a win for everybody. And it's not like we're pressed up against the end of the cycle, nor is it as if this feature has been through endless rounds of review already. It's certainly had some, and it's gotten better as a result. But it's also changed a lot in the process. And much of the review to date has been high-level design review, like "how should the opclasses look?" and "what should we call this thing anyway?". Going through it for logic errors, documentation shortcomings, silly thinkos, etc. has not been done too much, I think, and definitely not on the latest version. So, some of that might not be out of place.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 23, 2014 at 9:23 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> With all respect, I think this is a bad idea. I know you've put a lot >> of energy into this patch and I'm confident it's made a lot of >> progress. But as with Stephen's patch, the final form deserves a >> thorough round of looking over by someone else before it goes in. > > As you can see in the thread, Heikki's put a lot of review effort into > it (including important code contributions); I don't feel I'm rushing it > at this point. Yeah, I was really glad Heikki looked at it. That seemed good. > If you or somebody else want to give it a look, I have > no problem waiting a bit longer. I don't want to delay indefinitely, > though, because I think it's better shipped early in the release cycle > than later, to allow for further refinements and easier testing by other > interested parties. I agree with that. I'd like to look at it, and I will if I get time, but as I said elsewhere, I also think it's appropriate to give a little time around the final version of any big, complex patch just because people may have thoughts, and they may not have time to deliver those thoughts the minute the patch hits the list. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Tue, Sep 23, 2014 at 9:23 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > If you or somebody else want to give it a look, I have > > no problem waiting a bit longer. I don't want to delay indefinitely, > > though, because I think it's better shipped early in the release cycle > > than later, to allow for further refinements and easier testing by other > > interested parties. > > I agree with that. I'd like to look at it, and I will if I get time, > but as I said elsewhere, I also think it's appropriate to give a > little time around the final version of any big, complex patch just > because people may have thoughts, and they may not have time to > deliver those thoughts the minute the patch hits the list. Fair enough -- I'll keep it open for the time being. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 09/23/2014 10:04 PM, Alvaro Herrera wrote: > + <para> > + The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname> > + is primarily maintained by Álvaro Herrera. > + </para> We don't usually have such verbiage in the docs. The GIN and GiST pages do, but I think those are a historic exceptions, not something we want to do going forward. > + <variablelist> > + <varlistentry> > + <term><function>BrinOpcInfo *opcInfo(void)</></term> > + <listitem> > + <para> > + Returns internal information about the indexed columns' summary data. > + </para> > + </listitem> > + </varlistentry> I think you should explain what that internal information is. The minmax-19a.patch adds the type OID argument to this; remember to update the docs. In SP-GiST, the similar function is called "config". It might be good to use the same name here, for consistency across indexams (although I actually like the "opcInfo" name better than "config") The docs for the other support functions need to be updated, now that you changed the arguments from DeformedBrTuple to BrinValues. > + <!-- this needs improvement ... --> > + To implement these methods in a generic ways, normally the opclass > + defines its own internal support functions. For instance, minmax > + opclasses add the support functions for the four inequality operators > + for the datatype. > + Additionally, the operator class must supply appropriate > + operator entries, > + to enable the optimizer to use the index when those operators are > + used in queries. The above needs improvement ;-) > + BRIN indexes (a shorthand for Block Range indexes) > + store summaries about the values stored in consecutive table physical block ranges. "consecutive table physical block ranges" is quite a mouthful. 
> + For datatypes that have a linear sort order, the indexed data > + corresponds to the minimum and maximum values of the > + values in the column for each block range, > + which support indexed queries using these operators: > + > + <simplelist> > + <member><literal><</literal></member> > + <member><literal><=</literal></member> > + <member><literal>=</literal></member> > + <member><literal>>=</literal></member> > + <member><literal>></literal></member> > + </simplelist> That's the built-in minmax indexing strategy, yes, but you could have others, even for datatypes with a linear sort order. > + To find out the index tuple for a particular page range, we have an internal s/find out/find/ > + new heap tuple contains null values but the index tuple indicate there are no s/indicate/indicates/ > + Open questions > + -------------- > + > + * Same-size page ranges? > + Current related literature seems to consider that each "index entry" in a > + BRIN index must cover the same number of pages. There doesn't seem to be a What is the related literature? Is there an academic paper or something that should be cited as a reference for BRIN? > + * TODO > + * * ScalarArrayOpExpr (amsearcharray -> SK_SEARCHARRAY) > + * * add support for unlogged indexes > + * * ditto expressional indexes We don't have unlogged indexes in general, so no need to list that here. What would be needed to implement ScalarArrayOpExprs? I didn't realize that expression indexes are still not supported. And I see that partial indexes are not supported either. Why not? I wouldn't expect BRIN to need to care about those things in particular; the expressions for an expressional or partial index are handled in the executor, no? > + /* > + * A tuple in the heap is being inserted. 
To keep a brin index up to date, > + * we need to obtain the relevant index tuple, compare its stored values with > + * those of the new tuple; if the tuple values are consistent with the summary > + * tuple, there's nothing to do; otherwise we need to update the index. s/compare/and compare/. Perhaps replace one of the semicolons with a full stop. > + * If the range is not currently summarized (i.e. the revmap returns InvalidTid > + * for it), there's nothing to do either. > + */ > + Datum > + brininsert(PG_FUNCTION_ARGS) There is no InvalidTid, as a constant or a #define. Perhaps replace with "invalid item pointer". > + /* > + * XXX We need to know the size of the table so that we know how long to > + * iterate on the revmap. There's room for improvement here, in that we > + * could have the revmap tell us when to stop iterating. > + */ The revmap doesn't know how large the table is. Remember that you have to return all blocks that are not in the revmap, so you can't just stop when you reach the end of the revmap. I think the current design is fine. I have to stop now to do some other stuff. Overall, this is in pretty good shape. In addition to little cleanup of things I listed above, and similar stuff elsewhere that I didn't read through right now, there are a few medium-sized items I'd still like to see addressed before you commit this: * expressional/partial index support * the difficulty of testing the union support function that we discussed earlier * clarify the memory context stuff of support functions that we also discussed earlier - Heikki
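Two of the points in this review can be sketched in a few lines: what the insert path has to decide, and why a scan must walk every page range of the table rather than stop at the end of the revmap. The types and function names below are hypothetical stand-ins, with the revmap modelled as a plain array and summaries as int bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-range entry: either unsummarized or a [min, max] summary. */
typedef struct ToyRange
{
    bool summarized;
    int  min;
    int  max;
} ToyRange;

/*
 * Insert path: returns true only if the stored summary had to be widened.
 * An unsummarized range, or a value already inside the bounds, needs no work.
 */
static bool
toy_insert(ToyRange *r, int value)
{
    assert(r != NULL);
    if (!r->summarized)
        return false;
    if (value >= r->min && value <= r->max)
        return false;
    if (value < r->min)
        r->min = value;
    if (value > r->max)
        r->max = value;
    return true;
}

/*
 * Scan path for "lo <= col <= hi": fill must_visit[] for every range of the
 * table and return how many ranges have to be read.  Ranges beyond the end
 * of the revmap, or present but unsummarized, can never be excluded.
 */
static size_t
toy_ranges_to_visit(const ToyRange *revmap, size_t revmap_len,
                    size_t heap_ranges, int lo, int hi, bool *must_visit)
{
    size_t i;
    size_t n = 0;

    assert(must_visit != NULL);
    for (i = 0; i < heap_ranges; i++)
    {
        bool visit;

        if (i >= revmap_len || !revmap[i].summarized)
            visit = true;   /* no summary: must be returned */
        else
            visit = revmap[i].max >= lo && revmap[i].min <= hi;

        must_visit[i] = visit;
        if (visit)
            n++;
    }
    return n;
}
```

The loop bound is heap_ranges, not revmap_len, which is why the scan needs to know the size of the table.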
On Tue, September 23, 2014 21:04, Alvaro Herrera wrote: > Alvaro Herrera wrote: > > [minmax-19.patch] > [minmax-19a.patch] Although admittedly it is not directly likely for us to need it, and although I see that there is a BRIN Extensibility chapter added (good!), I am still a bit surprised by the absence of a built-in BRIN operator class for bigint, as the BRIN index type is specifically useful for huge tables (where after all huge values are more likely to occur). Will a brin int8 be added operator class for 9.5? (I know, quite some time left...) (btw, so far the patch proves quite stable under my abusive testing...) thanks, Erik Rijkers
Heikki Linnakangas wrote: > On 09/23/2014 10:04 PM, Alvaro Herrera wrote: > >+ <para> > >+ The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname> > >+ is primarily maintained by Álvaro Herrera. > >+ </para> > > We don't usually have such verbiage in the docs. The GIN and GiST > pages do, but I think those are a historic exceptions, not something > we want to do going forward. Removed. > >+ <variablelist> > >+ <varlistentry> > >+ <term><function>BrinOpcInfo *opcInfo(void)</></term> > >+ <listitem> > >+ <para> > >+ Returns internal information about the indexed columns' summary data. > >+ </para> > >+ </listitem> > >+ </varlistentry> > > I think you should explain what that internal information is. The > minmax-19a.patch adds the type OID argument to this; remember to > update the docs. Updated. > In SP-GiST, the similar function is called "config". It might be > good to use the same name here, for consistency across indexams > (although I actually like the "opcInfo" name better than "config") Well, I'm not sure that there's any value in being consistent if the new name is better than the old one. Most likely, a person trying to implement an spgist opclass wouldn't try to do a brin opclass at the same time, so it's not like there's a lot of value in being consistent there, anyway. > The docs for the other support functions need to be updated, now > that you changed the arguments from DeformedBrTuple to BrinValues. Updated. > >+ <!-- this needs improvement ... --> > >+ To implement these methods in a generic ways, normally the opclass > >+ defines its own internal support functions. For instance, minmax > >+ opclasses add the support functions for the four inequality operators > >+ for the datatype. > >+ Additionally, the operator class must supply appropriate > >+ operator entries, > >+ to enable the optimizer to use the index when those operators are > >+ used in queries. 
> > The above needs improvement ;-) I rechecked and while I tweaked it here and there, I wasn't able to add much more to it. > >+ BRIN indexes (a shorthand for Block Range indexes) > >+ store summaries about the values stored in consecutive table physical block ranges. > > "consecutive table physical block ranges" is quite a mouthful. I reworded this introduction. I hope it makes more sense now. > >+ For datatypes that have a linear sort order, the indexed data > >+ corresponds to the minimum and maximum values of the > >+ values in the column for each block range, > >+ which support indexed queries using these operators: > >+ > >+ <simplelist> > >+ <member><literal><</literal></member> > >+ <member><literal><=</literal></member> > >+ <member><literal>=</literal></member> > >+ <member><literal>>=</literal></member> > >+ <member><literal>></literal></member> > >+ </simplelist> > > That's the built-in minmax indexing strategy, yes, but you could > have others, even for datatypes with a linear sort order. I "fixed" this by removing this list. It's not possible to be comprehensive here, I think, and anyway I don't think there's much point. > >+ To find out the index tuple for a particular page range, we have an internal > > s/find out/find/ > > >+ new heap tuple contains null values but the index tuple indicate there are no > > s/indicate/indicates/ Both fixed. > >+ Open questions > >+ -------------- > >+ > >+ * Same-size page ranges? > >+ Current related literature seems to consider that each "index entry" in a > >+ BRIN index must cover the same number of pages. There doesn't seem to be a > > What is the related literature? Is there an academic paper or > something that should be cited as a reference for BRIN? In the original "minmax-proposal" file, I had these four URLs: : Other database systems already have similar features.
Some examples: : : * Oracle Exadata calls this "storage indexes" : http://richardfoote.wordpress.com/category/storage-indexes/ : : * Netezza has "zone maps" : http://nztips.com/2010/11/netezza-integer-join-keys/ : : * Infobright has this automatically within their "data packs" according to a : May 3rd, 2009 blog post : http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/ : : * MonetDB also uses this technique, according to a published paper : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662 : "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS" I gave them all a quick look and none of them touches the approach in detail; in fact other than the Oracle Exadata one, they are all talking about something else and mention the "minmax" stuff only in passing. I don't think any of them is worth citing. > >+ * TODO > >+ * * ScalarArrayOpExpr (amsearcharray -> SK_SEARCHARRAY) > >+ * * add support for unlogged indexes > >+ * * ditto expressional indexes > > We don't have unlogged indexes in general, so no need to list that > here. What would be needed to implement ScalarArrayOpExprs? Well, it requires a different way to handle ScanKeys. Anyway the queries that it is supposed to serve can already be served in some other ways for AMs that don't have amsearcharray, so I don't think it's a huge loss if we don't implement it. We can add that later. > I didn't realize that expression indexes are still not supported. > And I see that partial indexes are not supported either. Why not? I > wouldn't expect BRIN to need to care about those things in > particular; the expressions for an expressional or partial index are > handled in the executor, no? Yeah; those restrictions were leftovers from back when I didn't really know how they were supposed to be implemented. I took out the restrictions and there wasn't anything else required to support both these features. > >+ /* > >+ * A tuple in the heap is being inserted. 
To keep a brin index up to date, > >+ * we need to obtain the relevant index tuple, compare its stored values with > >+ * those of the new tuple; if the tuple values are consistent with the summary > >+ * tuple, there's nothing to do; otherwise we need to update the index. > > s/compare/and compare/. Perhaps replace one of the semicolons with a > full stop. Fixed. > >+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid > >+ * for it), there's nothing to do either. > >+ */ > >+ Datum > >+ brininsert(PG_FUNCTION_ARGS) > > There is no InvalidTid, as a constant or a #define. Perhaps replace > with "invalid item pointer". Fixed -- actually it doesn't return invalid TID anymore, only NULL. > >+ /* > >+ * XXX We need to know the size of the table so that we know how long to > >+ * iterate on the revmap. There's room for improvement here, in that we > >+ * could have the revmap tell us when to stop iterating. > >+ */ > > The revmap doesn't know how large the table is. Remember that you > have to return all blocks that are not in the revmap, so you can't > just stop when you reach the end of the revmap. I think the current > design is fine. Yeah, I was leaning towards the same conclusion myself. I have removed the comment. (We could think about having brininsert update the metapage so that the index keeps track of what's the last heap page, which could help us support this, but I'm not sure there's much point. Anyway we can tweak this later.) > I have to stop now to do some other stuff. Overall, this is in > pretty good shape. 
In addition to little cleanup of things I listed > above, and similar stuff elsewhere that I didn't read through right > now, there are a few medium-sized items I'd still like to see > addressed before you commit this: > > * expressional/partial index support > * the difficulty of testing the union support function that we > discussed earlier I added a USE_ASSERT_CHECKING-only block in brininsert that runs the union support proc and compares the output with the one from regular addValue. I haven't tested this too much yet. > * clarify the memory context stuff of support functions that we also > discussed earlier I re-checked this stuff. Turns out that the support functions don't palloc/pfree memory too much, except to update the stuff stored in BrinValues, by using datumCopy(). This memory is only freed when we need to update a previous Datum. There's no way for the brin.c code to know when the Datum is going to be released by the support proc, and thus no way for a temp context to be used. The memory context experiments I alluded to earlier are related to pallocs done in brininsert / bringetbitmap themselves, not in the opclass-provided support procs. All in all, I don't think there's much room for improvement, other than perhaps doing so in brininsert / bringetbitmap. Don't really care too much about this either way. Once again, many thanks for the review. Here's a new version. I have added operator classes for int8, text, and actually everything that btree supports except: bool, record, oidvector, anyarray, tsvector, tsquery, jsonb, range; I'm not sure that it makes sense to have opclasses for any of these -- at least not regular minmax opclasses. There are some interesting possibilities, for example for range types, whereby we store in the index tuple the union of all the ranges in the block range. (I had an opclass for anyenum too, but on further thought I removed it because it is going to be pointless in nearly all cases.)
 contrib/pageinspect/Makefile | 2 +-
 contrib/pageinspect/brinfuncs.c | 410 +++++++++++
 contrib/pageinspect/pageinspect--1.2.sql | 37 +
 contrib/pg_xlogdump/rmgrdesc.c | 1 +
 doc/src/sgml/brin.sgml | 498 +++++++++++++
 doc/src/sgml/filelist.sgml | 1 +
 doc/src/sgml/indices.sgml | 36 +-
 doc/src/sgml/postgres.sgml | 1 +
 src/backend/access/Makefile | 2 +-
 src/backend/access/brin/Makefile | 18 +
 src/backend/access/brin/README | 179 +++++
 src/backend/access/brin/brin.c | 1116 ++++++++++++++++++++++++++++++
 src/backend/access/brin/brin_minmax.c | 320 +++++++++
 src/backend/access/brin/brin_pageops.c | 712 +++++++++++++++++++
 src/backend/access/brin/brin_revmap.c | 473 +++++++++++++
 src/backend/access/brin/brin_tuple.c | 553 +++++++++++++++
 src/backend/access/brin/brin_xlog.c | 319 +++++++++
 src/backend/access/common/reloptions.c | 7 +
 src/backend/access/heap/heapam.c | 22 +-
 src/backend/access/rmgrdesc/Makefile | 3 +-
 src/backend/access/rmgrdesc/brindesc.c | 112 +++
 src/backend/access/transam/rmgr.c | 1 +
 src/backend/catalog/index.c | 24 +
 src/backend/replication/logical/decode.c | 1 +
 src/backend/storage/page/bufpage.c | 179 ++++-
 src/backend/utils/adt/selfuncs.c | 74 +-
 src/include/access/brin.h | 52 ++
 src/include/access/brin_internal.h | 87 +++
 src/include/access/brin_page.h | 70 ++
 src/include/access/brin_pageops.h | 36 +
 src/include/access/brin_revmap.h | 39 ++
 src/include/access/brin_tuple.h | 97 +++
 src/include/access/brin_xlog.h | 107 +++
 src/include/access/heapam.h | 2 +
 src/include/access/reloptions.h | 3 +-
 src/include/access/relscan.h | 4 +-
 src/include/access/rmgrlist.h | 1 +
 src/include/catalog/index.h | 8 +
 src/include/catalog/pg_am.h | 2 +
 src/include/catalog/pg_amop.h | 164 +++++
 src/include/catalog/pg_amproc.h | 245 +++++++
 src/include/catalog/pg_opclass.h | 32 +
 src/include/catalog/pg_opfamily.h | 28 +
 src/include/catalog/pg_proc.h | 38 +
 src/include/storage/bufpage.h | 2 +
 src/include/utils/selfuncs.h | 1 +
 src/test/regress/expected/opr_sanity.out | 14 +-
 src/test/regress/sql/opr_sanity.sql | 7 +-
 48 files changed, 6122 insertions(+), 18 deletions(-)

(I keep naming the patch file "minmax", but nothing in the code is
actually called that way anymore, except the opclasses).

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/07/2014 01:33 AM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> On 09/23/2014 10:04 PM, Alvaro Herrera wrote:
>>> + Open questions
>>> + --------------
>>> +
>>> + * Same-size page ranges?
>>> + Current related literature seems to consider that each "index entry" in a
>>> + BRIN index must cover the same number of pages. There doesn't seem to be a
>>
>> What is the related literature? Is there an academic paper or
>> something that should be cited as a reference for BRIN?
>
> In the original "minmax-proposal" file, I had these four URLs:
>
> : Other database systems already have similar features. Some examples:
> :
> : * Oracle Exadata calls this "storage indexes"
> : http://richardfoote.wordpress.com/category/storage-indexes/
> :
> : * Netezza has "zone maps"
> : http://nztips.com/2010/11/netezza-integer-join-keys/
> :
> : * Infobright has this automatically within their "data packs" according to a
> : May 3rd, 2009 blog post
> : http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/
> :
> : * MonetDB also uses this technique, according to a published paper
> : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
> : "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
>
> I gave them all a quick look and none of them touches the approach in
> detail; in fact other than the Oracle Exadata one, they are all talking
> about something else and mention the "minmax" stuff only in passing. I
> don't think any of them is worth citing.

I think the "current related literature" phrase should be removed, if
there isn't in fact any literature on this. If there's any literature
worth referencing, we should add a proper citation.

> I added an USE_ASSERTION-only block in brininsert that runs the union
> support proc and compares the output with the one from regular addValue.
> I haven't tested this too much yet.

Ok, that's better than nothing. I wonder if it's too strict, though. It
uses brin_tuple_equal(), which does a memcmp() on the tuples. That will
trip for any non-meaningful differences, like the scale in a numeric.

>> * clarify the memory context stuff of support functions that we also
>> discussed earlier
>
> I re-checked this stuff. Turns out that the support functions don't
> palloc/pfree memory too much, except to update the stuff stored in
> BrinValues, by using datumCopy(). This memory is only freed when we
> need to update a previous Datum. There's no way for the brin.c code to
> know when the Datum is going to be released by the support proc, and
> thus no way for a temp context to be used.
>
> The memory context experiments I alluded to earlier are related to
> pallocs done in brininsert / bringetbitmap themselves, not in the
> opclass-provided support procs.

At the very least, it needs to be documented.

> All in all, I don't think there's much
> room for improvement, other than perhaps doing so in brininsert/
> bringetbitmap. Don't really care too much about this either way.

Doing it in brininsert/bringetbitmap seems like the right approach.
GiST, GIN, and SP-GiST all use a temporary memory context like that.

It would be wise to reserve some more support procedure numbers, for
future expansion. Currently, support procs 1-4 are used by BRIN itself,
and higher numbers can be used by the opclass. minmax opclasses use 5-8
for the <, <=, >= and > operators. If we ever want to add a new,
optional, support function to BRIN, we're out of luck. Let's document
that e.g. support procs < 10 are reserved for BRIN.

The redo routines should be updated to follow the new
XLogReadBufferForRedo idiom (commit
f8f4227976a2cdb8ac7c611e49da03aa9e65e0d2).

- Heikki
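To make the memcmp() concern concrete, here is a minimal sketch in C. ToyDecimal and both comparison functions are hypothetical stand-ins invented for illustration; this is neither PostgreSQL's numeric type nor brin_tuple_equal() itself. It only shows how two values that differ in scale alone (1.0 vs 1.00) can be equal in value while a bytewise check reports them as different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy decimal: value = digits / 10^scale. */
typedef struct ToyDecimal
{
    int64_t digits;     /* unscaled integer value */
    int32_t scale;      /* digits after the decimal point */
    int32_t pad;        /* explicit padding, so memcmp sees no garbage */
} ToyDecimal;

/* Bytewise comparison, in the spirit of a memcmp()-based tuple check. */
static bool
toy_bytes_equal(const ToyDecimal *a, const ToyDecimal *b)
{
    return memcmp(a, b, sizeof(ToyDecimal)) == 0;
}

/*
 * Semantic comparison: bring both values to the larger scale, then
 * compare the unscaled digits.  Overflow is ignored in this sketch.
 */
static bool
toy_value_equal(const ToyDecimal *a, const ToyDecimal *b)
{
    int64_t da = a->digits;
    int64_t db = b->digits;
    int32_t sa = a->scale;
    int32_t sb = b->scale;

    assert(sa >= 0 && sb >= 0);
    while (sa < sb) { da *= 10; sa++; }
    while (sb < sa) { db *= 10; sb++; }
    return da == db;
}
```

With {10, 1, 0} (1.0) and {100, 2, 0} (1.00), toy_value_equal() reports equality while toy_bytes_equal() does not, which is the kind of spurious assertion failure described above.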
> Once again, many thanks for the review. Here's a new version. I have
> added operator classes for int8, text, and actually everything that btree
> supports except:
> bool
> record
> oidvector
> anyarray
> tsvector
> tsquery
> jsonb
> range
>
> since I'm not sure that it makes sense to have opclasses for any of
> these -- at least not regular minmax opclasses. There are some
> interesting possibilities, for example for range types, whereby we store
> in the index tuple the union of all the range in the block range.

I thought we can do better than minmax for the inet data type,
and ended up with a generalized opclass supporting both inet and range
types. Patch based on minmax-v20 attached. It works well except for a
few small problems. I will improve the patch and add it to a commitfest
after the BRIN framework is committed.

To support more operators I needed to change amstrategies and
amsupport on the catalog. It would be nice if amsupport could be set
to 0 like amstrategies.

Inet data types accept IP version 4 and version 6. It is not possible
to represent the union of addresses from different versions with a valid
inet type. So, I made the union function return NULL in this case.
Then, I tried to store whether the returned value is NULL or not in
column->values[] as a boolean, but it failed on the pfree() inside
brin_dtuple_initilize(). It doesn't seem right to free the values
based on attr->attbyval.

I think the same opclass can be used for geometric types. I can
rename it to inclusion_ops instead of range_ops. The GiST opclasses
for the geometric types use bounding boxes. It wouldn't be possible
to use a different data type in a generic opclass. Maybe the STORAGE
parameter can be used for that purpose.

> (I had an opclass for anyenum too, but on further thought I removed it
> because it is going to be pointless in nearly all cases.)

It can be useful in some circumstances. We wouldn't lose anything
by supporting more types. I think we should even add an operator
class for boolean.
Heikki Linnakangas wrote:
> On 10/07/2014 01:33 AM, Alvaro Herrera wrote:

> >I added an USE_ASSERTION-only block in brininsert that runs the union
> >support proc and compares the output with the one from regular addValue.
> >I haven't tested this too much yet.
>
> Ok, that's better than nothing. I wonder if it's too strict, though. It uses
> brin_tuple_equal(), which does a memcmp() on the tuples. That will trip for
> any non-meaningful differences, like the scale in a numeric.

True. I'm not real sure how to do better, though. For types that have a
btree opclass it's easy, because we can just use the btree equality
function to compare the values. But most interesting cases would not
have btree opclasses; those are covered by the minmax family of
opclasses.

> It would be wise to reserve some more support procedure numbers, for future
> expansion. Currently, support procs 1-4 are used by BRIN itself, and higher
> numbers can be used by the opclass. minmax opclasses use 5-8 for the <, <=,
> >= and > operators. If we ever want to add a new, optional, support function
> to BRIN, we're out of luck. Let's document that e.g. support procs < 10 are
> reserved for BRIN.

Sure. I hope we never need to add a seventh optional support function ...

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
+ <acronym>BRIN</acronym> indexes can satisfy queries via the bitmap
+ scanning facility, and will return all tuples in all pages within

"The bitmap scanning facility?" Does this mean a bitmap index scan?
Or something novel to BRIN? I think this could be clearer.

+ This enables them to work as very fast sequential scan helpers to avoid
+ scanning blocks that are known not to contain matching tuples.

Hmm, but they don't actually do anything about sequential scans per
se, right? I'd say something like: "Because a BRIN index is very
small, scanning the index adds little overhead compared to a
sequential scan, but may avoid scanning large parts of the table that
are known not to contain matching tuples."

+ depend on the operator class selected for the data type.

The operator class is selected for the index, not the data type.

+ The size of the block range is determined at index creation time with
+ the <literal>pages_per_range</> storage parameter.
+ The smaller the number, the larger the index becomes (because of the need to
+ store more index entries), but at the same time the summary data stored can
+ be more precise and more data blocks can be skipped during an index scan.

I would insert a sentence something like this: "The number of index
entries will be equal to the size of the relation in pages divided by
the selected value for pages_per_range. Therefore, the smaller the
number...." At least, I would insert that if it's actually true. My
point is that I think the effect of pages_per_range could be made more
clear.

+ The core <productname>PostgreSQL</productname> distribution includes
+ includes the <acronym>BRIN</acronym> operator classes shown in
+ <xref linkend="gin-builtin-opclasses-table">.

Shouldn't that say brin, not gin?

+ requiring the access method implementer only to implement the semantics

The naming of the reverse range map seems a little weird. It seems
like most operations go through it, so it feels more like the forward
direction. Maybe I'm misunderstanding. (I doubt it's worth renaming
it at this point either way, but I thought I'd mention it.)

+ errmsg("unlogged BRIN indexes are not supported")));

Why not? Shouldn't be particularly hard, I wouldn't think.

I'm pretty sure you need to create a pageinspect--1.3.sql, not just
update the 1.2 file. Because that's in 9.4, and this won't be.

I'm pretty excited about this feature. I think it's going to be very
good for PostgreSQL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
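The pages_per_range arithmetic suggested above can be sketched as follows. This is a worked illustration only: it assumes one summary entry per block range and that a trailing partial range still gets its own entry (hence the rounding up), which the message above explicitly leaves unconfirmed. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of block ranges needed to cover rel_pages heap pages.
 * Assumption for illustration: one summary entry per range, and a final
 * partial range counts as a full entry.
 */
static uint32_t
toy_brin_range_count(uint32_t rel_pages, uint32_t pages_per_range)
{
    assert(pages_per_range > 0);

    /* ceiling division, written so it cannot overflow */
    return rel_pages / pages_per_range +
        (rel_pages % pages_per_range != 0 ? 1 : 0);
}
```

Under these assumptions a 1000-page table needs 8 entries with pages_per_range = 128 but 1000 entries with pages_per_range = 1, which is the size/precision trade-off the documentation paragraph is trying to describe.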
Robert Haas wrote:
> [lots]

I have fixed all these items in the attached, thanks -- most
user-visible change was the pageinspect 1.3 thingy. pg_upgrade from 1.2
works fine now. I also fixed some things Heikki noted, mainly avoid
retail pfree where possible, and renumber the support procs to leave
room for future expansion of the framework. XLog replay code is updated
too.

Also, I made the summarization step callable directly from SQL without
having to invoke VACUUM.

So here's v21. I also attach a partial diff from v20, just in case
anyone wants to give it a look.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
> > [lots]
>
> I have fixed all these items in the attached, thanks -- most
> user-visible change was the pageinspect 1.3 thingy. pg_upgrade from 1.2
> works fine now. I also fixed some things Heikki noted, mainly avoid
> retail pfree where possible, and renumber the support procs to leave
> room for future expansion of the framework. XLog replay code is updated
> too.
>
> Also, I made the summarization step callable directly from SQL without
> having to invoke VACUUM.
>
> So here's v21. I also attach a partial diff from v20, just in case
> anyone wants to give it a look.
I get a couple compiler warnings with this:
brin.c: In function 'brininsert':
brin.c:97: warning: 'tupcxt' may be used uninitialized in this function
brin.c:98: warning: 'oldcxt' may be used uninitialized in this function
Also, I think it is missing a cat version bump. It let me start the patched server against an unpatched initdb run, but once started it didn't find the index method.
What would it take to make CLUSTER work on a brin index? Now I just added a btree index on the same column, clustered on that, then dropped that index.
Thanks,
Jeff
Jeff Janes wrote:
> On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:

> I get a couple compiler warnings with this:
>
> brin.c: In function 'brininsert':
> brin.c:97: warning: 'tupcxt' may be used uninitialized in this function
> brin.c:98: warning: 'oldcxt' may be used uninitialized in this function

Ah, that's easily fixed. My compiler (gcc 4.9 from Debian Jessie
nowadays) doesn't complain, but I can see that it's not entirely
trivial.

> Also, I think it is missing a cat version bump. It let me start the
> patched server against an unpatched initdb run, but once started it didn't
> find the index method.

Sure, that's expected (by me at least). I'm too lazy to maintain
catversion bumps in the patch before pushing, since that generates
constant conflicts as I rebase.

> What would it take to make CLUSTER work on a brin index? Now I just added
> a btree index on the same column, clustered on that, then dropped that
> index.

Interesting question. What's the most efficient way to pack a table to
minimize the intervals covered by each index entry? One thing that
makes this project a bit easier, I think, is that CLUSTER has already
been generalized so that it supports either an indexscan or a
seqscan+sort. If anyone wants to work on this, be my guest; I'm
certainly not going to add it to the initial commit.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 3 November 2014 22:18, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> So here's v21. I also attach a partial diff from v20, just in case
> anyone wants to give it a look.

Looks really good.

I'd like to reword this sentence in the readme, since one of the main
use cases would be tables without btrees

 It's unlikely that BRIN would be the only
+ indexes in a table, though, because primary keys can be btrees only, and so
+ we don't implement this optimization.

I don't see a regression test. Create, use, VACUUM, just so we know it
hasn't regressed after commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> So here's v21. I also attach a partial diff from v20, just in case
> anyone wants to give it a look.
This needs a bump to 1.3, or the extension won't install:
contrib/pageinspect/pageinspect.control
During crash recovery, I am getting a segfault:
#0 0x000000000067ee35 in LWLockRelease (lock=0x0) at lwlock.c:1161
#1 0x0000000000664f4a in UnlockReleaseBuffer (buffer=0) at bufmgr.c:2888
#2 0x0000000000465a88 in brin_xlog_revmap_extend (lsn=<value optimized out>, record=<value optimized out>) at brin_xlog.c:261
#3 brin_redo (lsn=<value optimized out>, record=<value optimized out>) at brin_xlog.c:284
#4 0x00000000004ce505 in StartupXLOG () at xlog.c:6795
I failed to preserve the data directory, I'll try to repeat this later this week if needed.
Cheers,
Jeff
Jeff Janes wrote:
> On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>
> > So here's v21. I also attach a partial diff from v20, just in case
> > anyone wants to give it a look.
>
> This needs a bump to 1.3, or the extension won't install:

Missed that, thanks.

> #0 0x000000000067ee35 in LWLockRelease (lock=0x0) at lwlock.c:1161
> #1 0x0000000000664f4a in UnlockReleaseBuffer (buffer=0) at bufmgr.c:2888
> #2 0x0000000000465a88 in brin_xlog_revmap_extend (lsn=<value optimized
> out>, record=<value optimized out>) at brin_xlog.c:261
> #3 brin_redo (lsn=<value optimized out>, record=<value optimized out>) at
> brin_xlog.c:284
> #4 0x00000000004ce505 in StartupXLOG () at xlog.c:6795
>
> I failed to preserve the data directory, I'll try to repeat this later this
> week if needed.

I was clearly too careless about testing the xlog code --- it had
numerous bugs. This version should be a lot better, but there might be
problems lurking still as I don't think I covered it all. Let me know
if you see anything wrong.

I also added pageinspect docs, which I had neglected and only realized
due to a comment in another thread (thanks Amit).

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 4, 2014 at 2:28 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I was clearly too careless about testing the xlog code --- it had
> numerous bugs. This version should be a lot better, but there might be
> problems lurking still as I don't think I covered it all. Let me know
> if you see anything wrong.
At line 252 of brin_xlog.c, should the UnlockReleaseBuffer(metabuf) be protected by a BufferIsValid?
XLogReadBufferForRedo says that it might return an invalid buffer under some situations. Perhaps it is known that those situations can't apply here?
Now I am getting segfaults during normal (i.e. no intentional crashes) operations. I think I was seeing them sometimes before as well, I just wasn't looking for them.
The attached script invokes the segfault within a few minutes. A lot of the stuff in the script is probably not necessary, I just didn't spend the time to pare it down to the essentials. It does not need to be in parallel, I get the crash when invoked with only one job (perl ~/brin_crash.pl 1).
I think this is related to having block ranges which have no tuples in them when they are first summarized. If I take out the "with t as (delete from foo returning *) insert into foo select * from t", then I don't see the crashes.
#0 0x000000000089ed3e in pg_detoast_datum_packed (datum=0x0) at fmgr.c:2270
#1 0x0000000000869be9 in text_le (fcinfo=0x7fff1bf6b9f0) at varlena.c:1661
#2 0x000000000089cfc7 in FunctionCall2Coll (flinfo=0x297e640, collation=100, arg1=0, arg2=43488216) at fmgr.c:1324
#3 0x00000000004678f8 in minmaxConsistent (fcinfo=0x7fff1bf6be40) at brin_minmax.c:213
#4 0x000000000089d0c9 in FunctionCall3Coll (flinfo=0x297b830, collation=100, arg1=43509512, arg2=43510296, arg3=43495856) at fmgr.c:1349
#5 0x0000000000462484 in bringetbitmap (fcinfo=0x7fff1bf6c310) at brin.c:469
#6 0x000000000089cfc7 in FunctionCall2Coll (flinfo=0x28f2440, collation=0, arg1=43495712, arg2=43497376) at fmgr.c:1324
#7 0x00000000004b3fc9 in index_getbitmap (scan=0x297b120, bitmap=0x297b7a0) at indexam.c:651
#8 0x000000000062ece0 in MultiExecBitmapIndexScan (node=0x297af30) at nodeBitmapIndexscan.c:89
#9 0x0000000000619783 in MultiExecProcNode (node=0x297af30) at execProcnode.c:550
#10 0x000000000062dea2 in BitmapHeapNext (node=0x2974750) at nodeBitmapHeapscan.c:104
Cheers,
Jeff
Jeff Janes wrote:

> At line 252 of brin_xlog.c, should the UnlockReleaseBuffer(metabuf) be
> protected by a BufferIsValid?

Yes, that was just me being careless. Fixed.

> Now I am getting segfaults during normal (i.e. no intentional crashes)
> operations. I think I was seeing them sometimes before as well, I just
> wasn't looking for them.

Interesting. I was neglecting to test for empty index tuples in the
Consistent support function. Should be fixed now, and I verified that
the other support functions check for this condition (AFAICS this was
the only straggler -- I had fixed all the others already).

> I think this is related to having block ranges which have no tuples in them
> when they are first summarized. If I take out the "with t as (delete from
> foo returning *) insert into foo select * from t", then I don't see the
> crashes

Exactly. After fixing that I noticed that there was an assertion
(about collations) failing under certain conditions with your script.
I also fixed that. I also added a test for regress. I didn't have time
to distill a standalone test case for your crash, though.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
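The class of bug described here (a consistent function dereferencing a summary that holds no values) can be sketched in a few lines of C. ToyRangeSummary and toy_minmax_consistent_eq() are hypothetical names invented for this illustration; this is not the code in brin_minmax.c, only the shape of the check, assuming a summary that carries an explicit "no non-null values" flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one block range for an int column. */
typedef struct ToyRangeSummary
{
    bool all_nulls;     /* true if the range had no non-null values,
                         * e.g. it was empty when summarized */
    int  min;           /* valid only when !all_nulls */
    int  max;           /* valid only when !all_nulls */
} ToyRangeSummary;

/*
 * Could the range contain a row with value == key?
 *
 * The all_nulls test must come first: min and max carry no meaning for
 * an empty summary, which is where an unchecked version would read
 * garbage (or, with pass-by-reference types, a null pointer).
 */
static bool
toy_minmax_consistent_eq(const ToyRangeSummary *s, int key)
{
    assert(s != NULL);

    if (s->all_nulls)
        return false;
    return s->min <= key && key <= s->max;
}
```

A range summarized while empty simply never matches an equality key, so the scan can skip it instead of crashing on it.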
On Wed, Nov 5, 2014 at 12:54 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Thanks for the updated patch.
Now when I run the test program (version with better error reporting attached), it runs fine until I open a psql session and issue:
reindex table foo;
Then it immediately falls over with some rows no longer being findable through the index.
-- use index
select count(*) from foo where text_array = md5(4611::text);
0
-- use seq scan
select count(*) from foo where text_array||'' = md5(4611::text);
1
Where the number '4611' was taken from the error message of the test program.
Jeff Janes wrote:
> On Wed, Nov 5, 2014 at 12:54 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>
> Thanks for the updated patch.
>
> Now when I run the test program (version with better error reporting
> attached), it runs fine until I open a psql session and issue:
>
> reindex table foo;

Interesting. This was a more general issue actually -- if you dropped
the index at that point and created it again, the resulting index would
also be corrupt in the same way. Inspecting with the supplied
pageinspect functions made the situation pretty obvious. The old code
was skipping page ranges in which it could not find any tuples, but
that's bogus and inefficient. I changed an "if" into a loop that
inserts intermediary tuples, if any are needed.

I cannot reproduce that problem anymore.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
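The "if" versus loop distinction above can be sketched as follows. This is an illustrative model, not the committed build code: toy_summarize() and its parameters are hypothetical, and it only records the starting block of each range that would receive a summary tuple while tuples arrive in ascending heap-block order.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk tuples in ascending heap-block order and record the first block
 * of every range that gets a summary.  Returns the number of ranges.
 */
static size_t
toy_summarize(const unsigned *tuple_blocks, size_t ntuples,
              unsigned pages_per_range,
              unsigned *range_starts, size_t capacity)
{
    size_t   nemitted = 0;
    unsigned cur_start = 0;     /* first block of the range being built */
    size_t   i;

    assert(pages_per_range > 0);
    if (ntuples == 0)
        return 0;

    for (i = 0; i < ntuples; i++)
    {
        /*
         * A single "if" here would close only the current range and
         * jump straight to the tuple's range, leaving the empty ranges
         * in between with no entry.  The loop closes each one passed.
         */
        while (tuple_blocks[i] >= cur_start + pages_per_range)
        {
            assert(nemitted < capacity);
            range_starts[nemitted++] = cur_start;
            cur_start += pages_per_range;
        }
        /* the tuple would be folded into the current summary here */
    }

    /* close the final, still-open range */
    assert(nemitted < capacity);
    range_starts[nemitted++] = cur_start;
    return nemitted;
}
```

For tuples in blocks 0, 1 and 9 with two pages per range, this records ranges starting at 0, 2, 4, 6 and 8; the three middle ranges are the empty "intermediary" ones.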
I just pushed this, after some more minor tweaks. Thanks, and please do continue testing! -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Nov 8, 2014 at 8:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I just pushed this, after some more minor tweaks. Thanks, and please do
> continue testing!
I'm having problems getting this to compile on MSVC. Attached is a patch which fixes the problem.
There also seems to be a bit of a problem with:
brin.c(250): warning C4700: uninitialized local variable 'newsz' used
/*
* Before releasing the lock, check if we can attempt a same-page
* update. Another process could insert a tuple concurrently in
* the same page though, so downstream we must be prepared to cope
* if this turns out to not be possible after all.
*/
samepage = brin_can_do_samepage_update(buf, origsz, newsz);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
newtup = brin_form_tuple(bdesc, heapBlk, dtup, &newsz);
Here newsz is passed to brin_can_do_samepage_update before being initialised. I'm not quite sure of the solution here as I've not spent much time looking at it, but perhaps brin_form_tuple needs to happen before brin_can_do_samepage_update, then the lock should be released? I didn't change this in the patch as I'm not sure if that's the proper fix or not.
The attached should fix the build problem that anole is having: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2014-11-07%2022%3A04%3A03
Regards
David Rowley
David Rowley <dgrowleyml@gmail.com> writes:
> I'm having problems getting this to compile on MSVC. Attached is a patch
> which fixes the problem.

The committed code is completely broken on compilers that don't accept
varargs macros, and this patch will not make them happier.

Probably what needs to happen is to put extra parentheses into the call
sites, along the lines of

#ifdef BRIN_DEBUG
#define BRIN_elog(args) elog args
#else
#define BRIN_elog(args) ((void) 0)
#endif

BRIN_elog((LOG, "fmt", ...));

Or we could decide we don't need this debugging crud anymore and just
nuke it all.

			regards, tom lane
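A self-contained sketch of the double-parentheses trick suggested above, for readers who want to try it outside the server. demo_elog(), DEMO_elog and DEMO_DEBUG are stand-ins invented for this example (demo_elog formats into a buffer rather than the server log); only the macro shape mirrors the suggestion. The point is that the macro takes exactly one argument, the whole parenthesized argument list, so no variadic-macro support is required from the compiler.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define DEMO_DEBUG 1            /* remove to compile the calls away */

static char demo_last_message[128];

/* Stand-in for elog(): an ordinary variadic *function*. */
static void
demo_elog(int level, const char *fmt, ...)
{
    va_list ap;

    (void) level;               /* unused in this sketch */
    assert(fmt != NULL);
    va_start(ap, fmt);
    vsnprintf(demo_last_message, sizeof(demo_last_message), fmt, ap);
    va_end(ap);
}

/* One macro parameter: the caller supplies the parenthesized list. */
#ifdef DEMO_DEBUG
#define DEMO_elog(args) demo_elog args
#else
#define DEMO_elog(args) ((void) 0)
#endif

static void
demo_report(int heap_block)
{
    /* Note the doubled parentheses at the call site. */
    DEMO_elog((1, "summarizing block %d", heap_block));
}
```

With DEMO_DEBUG defined the call expands to demo_elog (1, "summarizing block %d", heap_block); without it, the whole statement becomes ((void) 0).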
Tom Lane wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
> > I'm having problems getting this to compile on MSVC. Attached is a patch
> > which fixes the problem.
>
> The committed code is completely broken on compilers that don't accept
> varargs macros, and this patch will not make them happier.

I tried to make it fire only on GCC, which is known to support variadic
macros, but I evidently failed.

> Probably what needs to happen is to put extra parentheses into the call
> sites, along the lines of
>
> #ifdef BRIN_DEBUG
> #define BRIN_elog(args) elog args
> #else
> #define BRIN_elog(args) ((void) 0)
> #endif
>
>
> BRIN_elog((LOG, "fmt", ...));

That works for me, thanks for the suggestion.

> Or we could decide we don't need this debugging crud anymore and just
> nuke it all.

I'm removing one which seems pointless, but keeping the others for now.
We can always remove them later. (I also left BRIN_DEBUG turned on by
default; I'm turning it off.)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Nov 8, 2014 at 8:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I just pushed this, after some more minor tweaks. Thanks, and please do
> continue testing!
Here's another small fix for some unused variable warnings. Unfortunately this Microsoft compiler that I'm using does not know about __attribute__((unused)), so some warnings are generated for these:
BrinTuple *tmptup PG_USED_FOR_ASSERTS_ONLY;
BrinMemTuple *tmpdtup PG_USED_FOR_ASSERTS_ONLY;
Size tmpsiz PG_USED_FOR_ASSERTS_ONLY;
The attached patch moves these inside the #ifdef USE_ASSERT_CHECKING section.
I know someone will ask, so let me explain: the reason I don't see a bunch of other warnings for PG_USED_FOR_ASSERTS_ONLY vars when compiling without assert checks is that this Microsoft compiler seems to be ok with variables being assigned values that are never used, but if a variable is never assigned a value at all, then it warns about that.
Regards
David Rowley
On Sat, Nov 8, 2014 at 8:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I just pushed this, after some more minor tweaks. Thanks, and please do
> continue testing!
Please find attached another small fix. This time it's just a small typo in the README, and just some updates to some, now outdated docs.
Kind Regards
David Rowley
On Sat, Nov 8, 2014 at 5:40 PM, David Rowley <dgrowleyml@gmail.com> wrote:
> Please find attached another small fix. This time it's just a small typo in
> the README, and just some updates to some, now outdated docs.

Speaking about the feature... The index operators are still named with
"minmax", wouldn't it be better to switch to "brin"?

Regards,
--
Michael
On 11/09/2014 08:06 AM, Michael Paquier wrote:
> On Sat, Nov 8, 2014 at 5:40 PM, David Rowley <dgrowleyml@gmail.com> wrote:
>> Please find attached another small fix. This time it's just a small typo in
>> the README, and just some updates to some, now outdated docs.
> Speaking about the feature... The index operators are still named with
> "minmax", wouldn't it be better to switch to "brin"?

All the built-in opclasses still implement the min-max policy - they
store the min and max values. BRIN supports other kinds of opclasses,
like storing a containing box for points, but no such opclasses have
been implemented yet.

Speaking of which, Alvaro, any chance we could get such an opclass still
included into 9.5? It would be nice to have one, just to be sure that
nothing minmax-specific has crept into the BRIN code.

- Heikki
On Sat, Nov 8, 2014 at 4:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> I just pushed this, after some more minor tweaks.

Nice!

> Thanks, and please do continue testing!

I got the following PANIC error in the standby server when I set up
the replication servers and ran "make installcheck". Note that I was
repeating the manual CHECKPOINT every second while "installcheck" was
running. Without the checkpoints, I could not reproduce the problem.
I'm not sure if CHECKPOINT really triggers this problem, though.
Anyway BRIN seems to have a problem around its WAL replay.

2014-11-09 22:19:42 JST sby1 WARNING: page 547 of relation base/16384/30878 does not exist
2014-11-09 22:19:42 JST sby1 CONTEXT: xlog redo BRIN/UPDATE: rel 1663/16384/30878 heapBlk 6 revmapBlk 1 pagesPerRange 1 old TID (3,2) TID (547,2)
2014-11-09 22:19:42 JST sby1 PANIC: WAL contains references to invalid pages
2014-11-09 22:19:42 JST sby1 CONTEXT: xlog redo BRIN/UPDATE: rel 1663/16384/30878 heapBlk 6 revmapBlk 1 pagesPerRange 1 old TID (3,2) TID (547,2)
2014-11-09 22:19:47 JST sby1 LOG: startup process (PID 15230) was terminated by signal 6: Abort trap
2014-11-09 22:19:47 JST sby1 LOG: terminating any other active server processes

Regards,

--
Fujii Masao
On Sun, Nov 9, 2014 at 9:18 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Speaking of which, Alvaro, any chance we could get such an opclass still
> included into 9.5? It would be nice to have one, just to be sure that
> nothing minmax-specific has crept into the BRIN code.

I'm trying to do a bloom filter Brin index. I'm already a bit puzzled
by a few things but I've just started so maybe it'll become clear.

From what I've seen so far it feels more likely there's the opposite.
There's some boilerplate that I'm doing that feels like it could be
pushed down into general Brin code since it'll be the same for every
access method.

--
greg
On Sun, Nov 9, 2014 at 5:06 PM, Greg Stark <stark@mit.edu> wrote: > I'm trying to do a bloom filter Brin index. I'm already a bit puzzled > by a few things but I've just started so maybe it'll become clear. So some quick comments from pretty early goings -- partly because I'm afraid once I get past them I'll forget what it was I was confused by.... 1) The manual describes the exensibility API including the BrinOpcInfo struct -- but it doesn't define the BrinDesc struct that every API method takes. It's not clear what exactly that argument is for or how to make use of it. 2) The mention about additional opclass operators and to number them from 11 up is fine -- but there's no explanation of how to decide what operators need to be explicitly added like that. Specifically I gather from reading minmax that = is handled internally by Brin and you only need to add any other operators aside from = ? Is that right? 3) It's not entirely clear in the docs when each method is will be invoked. Specifically it's not clear whether opcInfo is invoked once when the index is defined or every time the definition is loaded to be used. I gather it's the latter? Perhaps there needs to be a method that's invoked specifically when the index is defined? I'm wondering where I'm going to hook in the logic to determine the size and number of hash functions to use for the bloom filter which needs to be decided once when the index is created and then static for the index in the future. 4) It doesn't look like BRIN handles cross-type operators at all. 
For example this query with btree indexes can use the index just fine because it looks up the operator based on both the left and right operands:

::***# explain select * from data where i = 1::smallint;
┌─────────────────────────────────────────────────────────────────────┐
│                             QUERY PLAN                              │
├─────────────────────────────────────────────────────────────────────┤
│ Index Scan using btree_i on data  (cost=0.42..8.44 rows=1 width=14) │
│   Index Cond: (i = 1::smallint)                                     │
└─────────────────────────────────────────────────────────────────────┘
(2 rows)

But Minmax opclasses don't contain the cross-type operators and in fact looking at the code I don't think minmax would be able to cope (minmax_get_procinfo doesn't even get passed the type in the qual, only the type of the column).

::***# explain select * from data2 where i = 1::smallint;
┌──────────────────────────────────────────────────────────┐
│                        QUERY PLAN                        │
├──────────────────────────────────────────────────────────┤
│ Seq Scan on data2  (cost=0.00..18179.00 rows=1 width=14) │
│   Filter: (i = 1::smallint)                              │
└──────────────────────────────────────────────────────────┘
(2 rows)

Time: 0.544 ms

-- greg
On Sun, Nov 9, 2014 at 5:57 PM, Greg Stark <stark@mit.edu> wrote: > 2) The mention about additional opclass operators and to number them > from 11 up is fine -- but there's no explanation of how to decide what > operators need to be explicitly added like that. Specifically I gather > from reading minmax that = is handled internally by Brin and you only > need to add any other operators aside from = ? Is that right? I see I totally misunderstood the use of the opclass procedure functions. I think I understand now but just to be sure -- If I can only handle BTEqualStrategyNumber keys then is it adequate to just define the opclass containing only the equality operator? Somehow I got confused between the amprocs that minmax uses to implement the consistency function and the amops that the brin index supports. -- greg
Heikki Linnakangas wrote: > Speaking of which, Alvaro, any chance we could get such on opclass still > included into 9.5? It would be nice to have one, just to be sure that > nothing minmax-specific has crept into the BRIN code. Emre Hasegeli contributed a patch for range types. I am hoping he will post a rebased version that we can consider including. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Nov 9, 2014 at 10:30 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sat, Nov 8, 2014 at 4:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> >> I just pushed this, after some more minor tweaks. > > Nice! > >> Thanks, and please do continue testing! > > I got the following PANIC error in the standby server when I set up > the replication servers and ran "make installcheck". Note that I was > repeating the manual CHECKPOINT every second while "installcheck" > was running. Without the checkpoints, I could not reproduce the > problem. I'm not sure if CHECKPOINT really triggers this problem, though. > Anyway BRIN seems to have a problem around its WAL replay. > > 2014-11-09 22:19:42 JST sby1 WARNING: page 547 of relation > base/16384/30878 does not exist > 2014-11-09 22:19:42 JST sby1 CONTEXT: xlog redo BRIN/UPDATE: rel > 1663/16384/30878 heapBlk 6 revmapBlk 1 pagesPerRange 1 old TID (3,2) > TID (547,2) > 2014-11-09 22:19:42 JST sby1 PANIC: WAL contains references to invalid pages > 2014-11-09 22:19:42 JST sby1 CONTEXT: xlog redo BRIN/UPDATE: rel > 1663/16384/30878 heapBlk 6 revmapBlk 1 pagesPerRange 1 old TID (3,2) > TID (547,2) > 2014-11-09 22:19:47 JST sby1 LOG: startup process (PID 15230) was > terminated by signal 6: Abort trap > 2014-11-09 22:19:47 JST sby1 LOG: terminating any other active server processes > I could reproduce this using the same steps. It's the same page 547 here too if that's any helpful. Thanks, Amit
Fujii Masao wrote: > I got the following PANIC error in the standby server when I set up > the replication servers and ran "make installcheck". Note that I was > repeating the manual CHECKPOINT every second while "installcheck" > was running. Without the checkpoints, I could not reproduce the > problem. I'm not sure if CHECKPOINT really triggers this problem, though. > Anyway BRIN seems to have a problem around its WAL replay. Hm, I think I see what's happening. The xl_brin_update record references two buffers, one which is target for the updated tuple and another which is the revmap buffer. When the update target buffer is being first used we set the INIT bit which removes the buffer reference from the xlog record; in that case, if the revmap buffer is first being modified after the prior checkpoint, that revmap buffer receives backup block number 0; but the code hardcodes it as 1 on the expectation that the buffer that's target for the update will receive 0. The attached patch should fix this. I cannot reproduce the issue after applying this patch, can you please confirm that it fixes the issue for you as well? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Alvaro Herrera wrote: > Hm, I think I see what's happening. The xl_brin_update record > references two buffers, one which is target for the updated tuple and > another which is the revmap buffer. When the update target buffer is > being first used we set the INIT bit which removes the buffer reference > from the xlog record; in that case, if the revmap buffer is first being > modified after the prior checkpoint, that revmap buffer receives backup > block number 0; but the code hardcodes it as 1 on the expectation that > the buffer that's target for the update will receive 0. The attached > patch should fix this. Pushed, thanks for the report. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
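The off-by-one described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the BRIN redo code: backup_slot() and its parameters are hypothetical names, and the model assumes, as the explanation says, that backup block numbers are handed out in order to whichever buffers are still referenced by the record.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of backup block numbering.  referenced[i] says whether
 * buffer i is still referenced by the record (a buffer whose page is being
 * initialized from scratch is not).  The backup block number of buffer
 * `which` is the count of earlier buffers that are still referenced.
 */
int
backup_slot(const bool *referenced, int nbuffers, int which)
{
    int slot = 0;

    assert(which >= 0 && which < nbuffers);
    assert(referenced[which]);

    for (int i = 0; i < which; i++)
    {
        if (referenced[i])
            slot++;
    }
    return slot;
}
```

With the update target at position 0 and the revmap buffer at position 1, the revmap buffer gets number 1 in the ordinary case but number 0 when the target page carries the INIT bit, which is the case a hardcoded 1 gets wrong.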
Greg Stark wrote:
> On Sun, Nov 9, 2014 at 5:57 PM, Greg Stark <stark@mit.edu> wrote:
> > 2) The mention about additional opclass operators and to number them
> > from 11 up is fine -- but there's no explanation of how to decide what
> > operators need to be explicitly added like that. Specifically I gather
> > from reading minmax that = is handled internally by Brin and you only
> > need to add any other operators aside from = ? Is that right?
>
> I see I totally misunderstood the use of the opclass procedure
> functions. I think I understand now but just to be sure -- If I can
> only handle BTEqualStrategyNumber keys then is it adequate to just
> define the opclass containing only the equality operator?

Yes. I agree that this deserves some more documentation. In a nutshell, the opclass must provide three separate groups of items:

1. the mandatory support functions, opcInfo, addValue, Union, Consistent. opcInfo is invoked each time the index is accessed (including during index creation).

2. the additional support functions; normally these are called from within addValue, Consistent, Union. For minmax, what we provide is the functions that implement the inequality operators for the type, that is < <= >= and >. Since minmax tries to be generic and support a whole lot of types, this is the way that the mandatory support functions know what functions to call to compare two given values. If the opclass is specific to one data type, you might not need anything here; or perhaps you have other ways to figure out a hash function to call, etc.

3. the operators. We only use these so that the optimizer picks up the index for queries.

> Somehow I got confused between the amprocs that minmax uses to
> implement the consistency function and the amops that the brin index
> supports.

I think it is somewhat confusing, yeah.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
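To make the split between the first two groups concrete, here is a minimal standalone sketch of a minmax-style summary. It is not the BRIN API: the Strategy enum, MinmaxSummary, cmp_fn and every function name are made up for illustration, and a single three-way comparator stands in for the per-type inequality functions that the real opclass would look up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical strategy numbers for <, <=, =, >=, > */
typedef enum { ST_LT = 1, ST_LE, ST_EQ, ST_GE, ST_GT } Strategy;

/* Stand-in for the "additional support functions": a per-type comparator */
typedef int (*cmp_fn) (int a, int b);

typedef struct
{
    int min;
    int max;
} MinmaxSummary;

int
int_cmp(int a, int b)
{
    return (a > b) - (a < b);
}

/* "addValue": widen the summary; returns true if it changed */
bool
minmax_add_value(MinmaxSummary *s, int v, cmp_fn cmp)
{
    bool changed = false;

    if (cmp(v, s->min) < 0) { s->min = v; changed = true; }
    if (cmp(v, s->max) > 0) { s->max = v; changed = true; }
    return changed;
}

/* "Union": fold summary b into summary a */
void
minmax_union(MinmaxSummary *a, const MinmaxSummary *b, cmp_fn cmp)
{
    if (cmp(b->min, a->min) < 0) a->min = b->min;
    if (cmp(b->max, a->max) > 0) a->max = b->max;
}

/* "Consistent": could any value in [min, max] satisfy "col OP key"? */
bool
minmax_consistent(const MinmaxSummary *s, Strategy st, int key, cmp_fn cmp)
{
    switch (st)
    {
        case ST_LT: return cmp(s->min, key) < 0;
        case ST_LE: return cmp(s->min, key) <= 0;
        case ST_EQ: return cmp(s->min, key) <= 0 && cmp(s->max, key) >= 0;
        case ST_GE: return cmp(s->max, key) >= 0;
        case ST_GT: return cmp(s->max, key) > 0;
    }
    assert(!"unknown strategy");
    return true;                /* be conservative: never skip the range */
}
```

The mandatory functions never compare values themselves; they go through the comparator. An opclass tied to one data type, such as a bloom filter over a known hash function, could drop that indirection.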
Greg Stark wrote: > 1) The manual describes the extensibility API including the BrinOpcInfo > struct -- but it doesn't define the BrinDesc struct that every API > method takes. It's not clear what exactly that argument is for or how > to make use of it. Hm, I guess this could use some expansion. > 2) The mention about additional opclass operators and to number them > from 11 up is fine -- but there's no explanation of how to decide what > operators need to be explicitly added like that. Specifically I gather > from reading minmax that = is handled internally by Brin and you only > need to add any other operators aside from = ? Is that right? I think I already replied to this in the other email. > 3) It's not entirely clear in the docs when each method will be > invoked. Specifically it's not clear whether opcInfo is invoked once > when the index is defined or every time the definition is loaded to be > used. I gather it's the latter? Perhaps there needs to be a method > that's invoked specifically when the index is defined? I'm wondering > where I'm going to hook in the logic to determine the size and number > of hash functions to use for the bloom filter which needs to be > decided once when the index is created and then static for the index > in the future. Every time the index is accessed, yeah. I'm not sure about figuring the initial creation details. Do you think we need another support procedure to help with that? We can add it if needed; minmax would just define it to InvalidOid. > 4) It doesn't look like BRIN handles cross-type operators at all. The idea here is that there is a separate opclass to handle cross-type operators, which would be together in the same opfamily as the opclass used to create the index. I haven't actually tried this yet, mind you. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 10, 2014 at 9:31 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Every time the index is accessed, yeah. I'm not sure about figuring the > initial creation details. Do you think we need another support > procedure to help with that? We can add it if needed; minmax would just > define it to InvalidOid. I have a working bloom filter with hard coded filter size and hard coded number of hash functions. I need to think about how I'm going to make it more general now. I think the answer is that I should have an index option that specifies the false positive rate and calculates the optimal filter size and number of hash functions. It might possibly need to peek at the table statistics to determine the population size though. Or perhaps I should bite the bullet and size the bloom filters based on the actual number of rows in a chunk since the BRIN infrastructure does allow each summary to be a different size. There's another API question I have. To implement Consistent I need to call the hash function which in the case of functions like hashtext could be fairly expensive and I even need to generate multiple hash values (though currently I'm slicing them all from the integer hash value so that's not too bad) and then test each of those bits. It would be natural to call hashtext once at the start of the scan and possibly build a bitmap and compare all of them in a single & operation. But afaict there's no way to hook the beginning of the scan and opaque is not associated with the specific scan so I don't think I can cache the hash value of the scan key there safely. Is there a good way to do it with the current API? On a side note I'm curious about something, I was stepping through my code in gdb and discovered that a single row insert appeared to construct a new summary then union it into the existing summary instead of just calling AddValue on the existing summary. Is that intentional? What led to that? -- greg
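For reference, this is the textbook sizing that an index option taking a false positive rate would presumably compute. It is the standard Bloom filter approximation, not something taken from a patch in this thread. The symbols are introduced here: n is the number of distinct values summarized per page range, p the target false positive rate, m the filter size in bits and k the number of hash functions.

```latex
m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n}\,\ln 2 = -\log_2 p

% Worked example: n = 1000, p = 0.01
m \approx \frac{1000 \times 4.605}{0.4805} \approx 9585 \text{ bits} \approx 1.2 \text{ kB},
\qquad k \approx 6.64 \;\Rightarrow\; 7 \text{ hash functions}
```

Since m grows linearly with n, sizing per chunk from the actual row count, as suggested above, would avoid needing the table statistics at all.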
Greg Stark wrote: > There's another API question I have. To implement Consistent I need to > call the hash function which in the case of functions like hashtext > could be fairly expensive and I even need to generate multiple hash > values (though currently I'm slicing them all from the integer hash > value so that's not too bad) and then test each of those bits. It > would be natural to call hashtext once at the start of the scan and > possibly build a bitmap and compare all of them in a single & > operation. But afaict there's no way to hook the beginning of the scan > and opaque is not associated with the specific scan so I don't think I > can cache the hash value of the scan key there safely. Is there a good > way to do it with the current API? I'm not sure why you say opaque is not associated with the specific scan. Are you thinking we could reuse opaque for a future scan? I think we could consider that opaque *is* the place to cache things such as the hashed value of the qual constants or whatever. > On a side note I'm curious about something, I was stepping through > my code in gdb and discovered that a single row insert appeared to > construct a new summary then union it into the existing summary > instead of just calling AddValue on the existing summary. Is that > intentional? What led to that? That's to test the Union procedure; if you look at the code, it's just used in assert-enabled builds. Now that I think about it, perhaps this can turn out to be problematic for your bloom filter opclass. I considered the idea of allowing the opclass to disable this testing procedure, but it isn't done (yet.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
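The "hash once, build a bitmap, test with a single &" idea could look roughly like the sketch below. Everything here is assumed for illustration: a fixed 64-bit filter, four probes sliced out of one 32-bit hash, and made-up function names. The real opclass would take the hash from something like hashtext and keep the resulting mask in opaque.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS     64      /* hypothetical fixed filter size */
#define NUM_HASHES      4       /* hypothetical number of probes */
#define BITS_PER_PROBE  6       /* log2(FILTER_BITS) */

/* Slice NUM_HASHES bit positions out of one 32-bit hash value */
uint64_t
bloom_mask(uint32_t hash)
{
    uint64_t mask = 0;

    assert(NUM_HASHES * BITS_PER_PROBE <= 32);

    for (int i = 0; i < NUM_HASHES; i++)
    {
        unsigned bit = (hash >> (i * BITS_PER_PROBE)) & (FILTER_BITS - 1);

        mask |= UINT64_C(1) << bit;
    }
    return mask;
}

/* Add one value, identified by its hash, to a range's filter */
void
bloom_add(uint64_t *filter, uint32_t hash)
{
    *filter |= bloom_mask(hash);
}

/* Per-range test: one AND against the mask computed once per scan */
bool
bloom_maybe_contains(uint64_t filter, uint64_t key_mask)
{
    return (filter & key_mask) == key_mask;
}
```

A scan would call bloom_mask() once for the scan key and then run only bloom_maybe_contains() against each range summary, so the expensive hash is not recomputed per range.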
On Tue, Nov 11, 2014 at 2:14 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I'm not sure why you say opaque is not associated with the specific > scan. Are you thinking we could reuse opaque for a future scan? I > think we could consider that opaque *is* the place to cache things such > as the hashed value of the qual constants or whatever. Oh. I guess this goes back to my original suggestion that the API docs need to explain some sense of when OpcInfo is called. I didn't realize it was tied to a specific scan. This does raise the question of why the scan information isn't available in OpcInfo though. That would let me build the hash value in a natural place instead of having to do it lazily which I find significantly more awkward. Is it possible for scan keys to change between calls for nested loop joins or quirky SQL with volatile functions in the scan or anything? I guess that would prevent the index scan from being used at all. But I can be reassured the Opcinfo call will be called again when a cached plan is reexecuted? Stable functions might have new values in a subsequent execution even if the plan hasn't changed at all for example. > That's to test the Union procedure; if you look at the code, it's just > used in assert-enabled builds. Now that I think about it, perhaps this > can turn out to be problematic for your bloom filter opclass. I > considered the idea of allowing the opclass to disable this testing > procedure, but it isn't done (yet.) No, it isn't a problem for my opclass other than performance, it was quite helpful in turning up bugs early in fact. It was just a bit confusing because I was trying to test things one by one and it turned out the assertion checks meant a simple insert turned up bugs in Union which I hadn't expected. But it seems perfectly sensible in an assertion check. -- greg
Greg Stark wrote: > On Tue, Nov 11, 2014 at 2:14 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > I'm not sure why you say opaque is not associated with the specific > > scan. Are you thinking we could reuse opaque for a future scan? I > > think we could consider that opaque *is* the place to cache things such > > as the hashed value of the qual constants or whatever. > > Oh. I guess this goes back to my original suggestion that the API docs > need to explain some sense of when OpcInfo is called. I didn't realize > it was tied to a specific scan. This does raise the question of why > the scan information isn't available in OpcInfo though. That would let > me build the hash value in a natural place instead of having to do it > lazily which I find significantly more awkward. Hmm. OpcInfo is also called in contexts other than scans, though, so passing down scan keys into it seems wrong. Maybe we do need another amproc that "initializes" the scan for the opclass, which would get whatever got returned from opcinfo as well as scankeys. There you would have the opportunity to run the hash and store it into the opaque. > Is it possible for scan keys to change between calls for nested loop > joins or quirky SQL with volatile functions in the scan or anything? I > guess that would prevent the index scan from being used at all. But I > can be reassured the Opcinfo call will be called again when a cached > plan is reexecuted? Stable functions might have new values in a > subsequent execution even if the plan hasn't changed at all for > example. As far as I understand, the scan keys don't change within any given scan; if they do, the rescan AM method is called, at which point we should reset whatever is cached about the previous scan. > > That's to test the Union procedure; if you look at the code, it's just > > used in assert-enabled builds. Now that I think about it, perhaps this > > can turn out to be problematic for your bloom filter opclass. 
I > > considered the idea of allowing the opclass to disable this testing > > procedure, but it isn't done (yet.) > > No, it isn't a problem for my opclass other than performance, it was > quite helpful in turning up bugs early in fact. It was just a bit > confusing because I was trying to test things one by one and it turned > out the assertion checks meant a simple insert turned up bugs in Union > which I hadn't expected. But it seems perfectly sensible in an > assertion check. Great, thanks. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 11, 2014 at 12:12 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > As far as I understand, the scan keys don't change within any given > scan; if they do, the rescan AM method is called, at which point we > should reset whatever is cached about the previous scan. But am I guaranteed that rescan will throw away the opcinfo struct and its opaque element? I guess that's the heart of the uncertainty I had. -- greg
Greg Stark wrote: > On Tue, Nov 11, 2014 at 12:12 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > As far as I understand, the scan keys don't change within any given > > scan; if they do, the rescan AM method is called, at which point we > > should reset whatever is cached about the previous scan. > > But am I guaranteed that rescan will throw away the opcinfo struct and > its opaque element? I guess that's the heart of the uncertainty I had. Well, it should, and if not that's a bug, which should be fixed by the attached (untested) patch. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
It might be clearer to have an opclassinfo and a scaninfo which can store information in separate opc_opaque and scan_opaque fields with distinct lifetimes. In the bloom filter case the long-lived info is the (initial?) size of the bloom filter and the number of hash functions. But I still haven't determined how much it will cost to recalculate them. Right now they're just hard coded so it doesn't hurt to do it on every rescan but if it involves peeking at the index reloptions or stats that might be impractical.
Greg Stark wrote: > It might be clearer to have an opclassinfo and a scaninfo which can > store information in separate opc_opaque and scan_opaque fields with > distinct lifetimes. > > In the bloom filter case the long-lived info is the (initial?) size of > the bloom filter and the number of hash functions. But I still haven't > determined how much it will cost to recalculate them. Right now > they're just hard coded so it doesn't hurt to do it on every rescan > but if it involves peeking at the index reloptions or stats that might > be impractical. Patches welcome :-) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 11, 2014 at 1:04 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> It might be clearer ... > > Patches welcome :-) Or perhaps there could still be a single opaque field but have two optional opclass methods "scaninit" and "rescan" which allow the op class to set or reset whichever fields inside opaque that need to be reset. -- greg
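A rough sketch of the single-opaque variant with two optional hooks follows. None of these names exist in BRIN; BloomOpaque, bloom_scaninit and bloom_rescan are hypothetical and only show which fields would survive a rescan and which would be reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical opaque area mixing two lifetimes */
typedef struct
{
    /* long-lived: decided when the index is created */
    int      filter_bits;
    int      num_hashes;

    /* per-scan: valid only while the scan keys are unchanged */
    bool     key_hash_valid;
    uint32_t key_hash;
} BloomOpaque;

/* Hypothetical "scaninit" hook: hash the scan key once and cache it */
void
bloom_scaninit(BloomOpaque *op, uint32_t key_hash)
{
    op->key_hash = key_hash;
    op->key_hash_valid = true;
}

/* Hypothetical "rescan" hook: drop per-scan state, keep the rest */
void
bloom_rescan(BloomOpaque *op)
{
    op->key_hash_valid = false;
    op->key_hash = 0;
}

/* What Consistent would call instead of rehashing the key */
uint32_t
bloom_scan_key_hash(const BloomOpaque *op)
{
    assert(op->key_hash_valid);
    return op->key_hash;
}
```

The trade-off against separate opc_opaque and scan_opaque fields is that here the opclass itself must know which members belong to which lifetime.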
On Sat, Nov 8, 2014 at 1:26 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>
> I just pushed this, after some more minor tweaks. Thanks, and please do
> continue testing!
>
A few typos and a few questions
1. * range. Need to an extra flag in mmtuples for that.
*/
Datum
brinbulkdelete(PG_FUNCTION_ARGS)
Doesn't the part of the comment referring to *mmtuples* require some change?
I think mmtuples was used in the initial version of the patch.
2.
/* ---------------
* mt_info is laid out in the following fashion:
*
* 7th (high) bit: has nulls
* 6th bit: is placeholder tuple
* 5th bit: unused
* 4-0 bit: offset of data
* ---------------
*/
uint8 bt_info;
} BrinTuple;
Here in the comments, bt_info is referred to as mt_info.
3.
/*
* t_info manipulation macros
*/
#define BRIN_OFFSET_MASK 0x1F
I think in the above comment it should be bt_info, rather than t_info.
4.
static void
revmap_physical_extend(BrinRevmap *revmap)
{
..
..
START_CRIT_SECTION();
/* the rm_tids array is initialized to all invalid by PageInit */
brin_page_init(page, BRIN_PAGETYPE_REVMAP);
MarkBufferDirty(buf);
metadata->lastRevmapPage = mapBlk;
MarkBufferDirty(revmap->rm_metaBuf);
..
}
Can't we update revmap->rm_lastRevmapPage along with metadata->lastRevmapPage?
5.
typedef struct BrinMemTuple
{
bool bt_placeholder; /* this is a placeholder tuple */
BlockNumber bt_blkno; /* heap blkno that the tuple is for */
MemoryContext bt_context; /* memcxt holding the dt_column values */
..
}
How is this memory context getting used?
I could see that it is used in brin_deform_tuple(), which gets called from
3 other places in core code: bringetbitmap(), brininsert() and union_tuples(),
and in all 3 places there is already another temporary memory context
used to avoid any form of memory leaks.
6.
Is there any way to force the brin index to be off? If not, do we need one,
as is present for other types of scans,
like set enable_indexscan=off;
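The bt_info layout quoted in point 2 can be decoded as in the sketch below. Only BRIN_OFFSET_MASK and its value come from the quoted header; the two flag mask values are derived from the quoted comment (6th and 7th bit), and the mask and function names used for them here are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BRIN_OFFSET_MASK        0x1F    /* bits 4-0: offset of data (quoted) */
#define BRIN_PLACEHOLDER_MASK   0x40    /* 6th bit: placeholder (assumed name) */
#define BRIN_NULLS_MASK         0x80    /* 7th bit: has nulls (assumed name) */

/* Pack the three fields into one info byte */
uint8_t
bt_info_make(bool has_nulls, bool placeholder, unsigned data_offset)
{
    assert(data_offset <= BRIN_OFFSET_MASK);

    return (uint8_t) ((has_nulls ? BRIN_NULLS_MASK : 0) |
                      (placeholder ? BRIN_PLACEHOLDER_MASK : 0) |
                      data_offset);
}

bool
bt_info_has_nulls(uint8_t info)
{
    return (info & BRIN_NULLS_MASK) != 0;
}

bool
bt_info_is_placeholder(uint8_t info)
{
    return (info & BRIN_PLACEHOLDER_MASK) != 0;
}

unsigned
bt_info_data_offset(uint8_t info)
{
    return info & BRIN_OFFSET_MASK;
}
```

The 5th bit (0x20) is left untouched, matching the "unused" entry in the quoted comment.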
> I thought we can do better than minmax for the inet data type, > and ended up with a generalized opclass supporting both inet and range > types. Patch based on minmax-v20 attached. It works well except > a few small problems. I will improve the patch and add into > a commitfest after BRIN framework is committed. I wanted to send a new version before the commitfest to get some feedback, but it is still work in progress. Patch attached rebased to the current HEAD. This version supports more operators and box from geometric data types. Opclasses are renamed to inclusion_ops to be more generic. The problems I mentioned remain because I couldn't solve them without touching the BRIN framework. > To support more operators I needed to change amstrategies and > amsupport on the catalog. It would be nice if amsupport can be set > to 0 like amstrategies. I think it would be nicer to get the functions from the operators with using the strategy numbers instead of adding them directly as support functions. I looked around a bit but couldn't find a sensible way to support it. Is it possible without adding them to the RelationData struct? > Inet data types accept IP version 4 and version 6. It isn't possible > to represent union of addresses from different versions with a valid > inet type. So, I made the union function return NULL in this case. > Then, I tried to store if returned value is NULL or not, in > column->values[] as boolean, but it failed on the pfree() inside > brin_dtuple_initilize(). It doesn't seem right to free the values > based on attr->attbyval. This problem remains. There is also a similar problem with the range types, namely empty ranges. There should be special cases for them on some of the strategies. I tried to solve the problems in several different ways, but got a segfault one line or another. This makes me think that BRIN framework doesn't support to store different types than the indexed column in the values array. 
For example, brin_deform_tuple() iterates over the values array and copies them using the length of the attr on the index, not the length of the type defined by OpcInfo function. If storing other types isn't supported, why is it required to return OIDs from the OpcInfo function? I am confused. I didn't try to support other geometric types than box as I couldn't manage to store a different type on the values array, but it would be nice to get some feedback about the overall design. I was thinking to add a STORAGE parameter to the index to support other geometric types. I am not sure that adding the STORAGE parameter to be used by the opclass implementation is the right way. It wouldn't be the actual thing that is stored by the index, it will be an element in the values array. Maybe data type specific opclasses are the way to go, not a generic one as I am trying.
Attachment
Hi, I made a quick review for your patch, but I would like to see someone who was involved in the BRIN work comment on Emre's design issues. I will try to answer them as best as I can below. I think minmax indexes on range types seem very useful, and inet/cidr too. I have no idea about geometric types. But we need to fix the issues with empty ranges and IPv4/IPv6 for these indexes to be useful. = Review The current code compiles but the brin test suite fails. I tested the indexes a bit and they seem to work fine, except for cases where we know it to be broken like IPv4/IPv6. The new code is generally clean and readable. I think some things should be broken out in separate patches since they are unrelated to this patch. - The addition of &< and >& on inet types. - The fix in brin_minmax.c. Your brin tests seem to forget &< and >& for inet types. The tests should preferably be extended to support ipv6 and empty ranges once we have fixed support for those cases. The /* If the it is all nulls, it cannot possibly be consistent. */ comment is different from the equivalent comment in brin_minmax.c. I do not see why they should be different. In brin_inclusion_union() the "if (col_b->bv_allnulls)" is done after handling has_nulls, which is unlike what is done in brin_minmax_union(). Which code is right? I am leaning towards the code in brin_inclusion_union() since you can have all_nulls without has_nulls. On 12/14/2014 09:04 PM, Emre Hasegeli wrote: >> To support more operators I needed to change amstrategies and >> amsupport on the catalog. It would be nice if amsupport can be set >> to 0 like am strategies. > > I think it would be nicer to get the functions from the operators > with using the strategy numbers instead of adding them directly as > support functions. I looked around a bit but couldn't find > a sensible way to support it. Is it possible without adding them > to the RelationData struct? 
Yes that would be nice, but I do not think the current solution is terrible. > This problem remains. There is also a similar problem with the > range types, namely empty ranges. There should be special cases > for them on some of the strategies. I tried to solve the problems > in several different ways, but got a segfault one line or another. > This makes me think that BRIN framework doesn't support to store > different types than the indexed column in the values array. > For example, brin_deform_tuple() iterates over the values array and > copies them using the length of the attr on the index, not the length > of the type defined by OpcInfo function. If storing another types > aren't supported, why is it required to return oid's on the OpcInfo > function. I am confused. I leave this to someone more knowledgable about BRIN to answer. > I didn't try to support other geometric types than box as I couldn't > managed to store a different type on the values array, but it would > be nice to get some feedback about the overall design. I was > thinking to add a STORAGE parameter to the index to support other > geometric types. I am not sure that adding the STORAGE parameter > to be used by the opclass implementation is the right way. It > wouldn't be the actual thing that is stored by the index, it will be > an element in the values array. Maybe, data type specific opclasses > is the way to go, not a generic one as I am trying. I think a STORAGE parameter sounds like a good idea. Could it also be used to solve the issue with IPv4/IPv6 by setting the storage type to custom? Or is that the wrong way to fix things? -- Andreas Karlsson
Can you please break up this patch? I think I see three patches, 1. add sql-callable functions such as inet_merge, network_overright, etc etc. These need documentation and a trivial regression test somewhere. 2. necessary changes to header files (skey.h etc) 3. the inclusion opclass itself Thanks BTW the main idea behind having opcinfo return the type oid was to tell the index what was stored in the index. If that doesn't work right now, maybe it needs some tweak to the brin framework code. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for looking at my patch again. New version is attached with a lot of changes and point data type support. > I think minmax indexes on range types seem very useful, and inet/cidr too. > I have no idea about geometric types. But we need to fix the issues with > empty ranges and IPv4/IPv6 for these indexes to be useful. Both of the cases are fixed on the new version. > The current code compiles but the brin test suite fails. Now, only a test in . > I tested the indexes a bit and they seem to work fine, except for cases > where we know it to be broken like IPv4/IPv6. > > The new code is generally clean and readable. > > I think some things should be broken out in separate patches since they are > unrelated to this patch. Yes but they were also required by this patch. This version adds more functions and operators. I can split them appropriately after your review. > - The addition of &< and >& on inet types. I haven't actually added the operators, just the underlying procedures for them to support basic comparison operators with the BRIN opclass. I left them out on the new version because of its new design. We can add the operators later with documentation, tests and index support. > - The fix in brin_minmax.c. It is already committed by Alvaro Herrera. I can send another patch to use pg_amop instead of pg_amproc on brin_minmax.c, if it is acceptable. > The tests should preferably be extended to support ipv6 and empty ranges > once we have fixed support for those cases. Done. > The /* If the it is all nulls, it cannot possibly be consistent. */ comment > is different from the equivalent comment in brin_minmax.c. I do not see why > they should be different. Not to confuse with the empty ranges. Also, there it supports other types than ranges, like box. > In brin_inclusion_union() the "if (col_b->bv_allnulls)" is done after > handling has_nulls, which is unlike what is done in brin_minmax_union(), > which code is right? 
I am leaning towards the code in brin_inclusion_union() > since you can have all_nulls without has_nulls. >> I think it would be nicer to get the functions from the operators >> with using the strategy numbers instead of adding them directly as >> support functions. I looked around a bit but couldn't find >> a sensible way to support it. Is it possible without adding them >> to the RelationData struct? > > > Yes that would be nice, but I do not think the current solution is terrible. The new version does it this way. It was required to support strategies between different types. >> This problem remains. There is also a similar problem with the >> range types, namely empty ranges. There should be special cases >> for them on some of the strategies. I tried to solve the problems >> in several different ways, but got a segfault one line or another. >> This makes me think that BRIN framework doesn't support to store >> different types than the indexed column in the values array. >> For example, brin_deform_tuple() iterates over the values array and >> copies them using the length of the attr on the index, not the length >> of the type defined by OpcInfo function. If storing another types >> aren't supported, why is it required to return oid's on the OpcInfo >> function. I am confused. > > > I leave this to someone more knowledgable about BRIN to answer. I think I have fixed them. >> I didn't try to support other geometric types than box as I couldn't >> managed to store a different type on the values array, but it would >> be nice to get some feedback about the overall design. I was >> thinking to add a STORAGE parameter to the index to support other >> geometric types. I am not sure that adding the STORAGE parameter >> to be used by the opclass implementation is the right way. It >> wouldn't be the actual thing that is stored by the index, it will be >> an element in the values array. 
Maybe, data type specific opclasses >> is the way to go, not a generic one as I am trying. > > > I think a STORAGE parameter sounds like a good idea. Could it also be used > to solve the issue with IPv4/IPv6 by setting the storage type to custom? Or > is that the wrong way to fix things? I have fixed different addressed families by adding another support function. I used STORAGE parameter to support the point data type. To make it work I added some operators between box and point data type. We can support all geometric types with this method.
Attachment
On Thu, Feb 12, 2015 at 3:34 AM, Emre Hasegeli <emre@hasegeli.com> wrote:
> Thank you for looking at my patch again. New version is attached
> with a lot of changes and point data type support.
Patch is moved to next CF 2015-02 as work is still going on.
--
Michael
On 02/11/2015 07:34 PM, Emre Hasegeli wrote:
>> The current code compiles but the brin test suite fails.
>
> Now, only a test in .

Yeah, there is still a test which fails in opr_sanity.

> Yes but they were also required by this patch. This version adds more
> functions and operators. I can split them appropriately after your
> review.

Ok, sounds fine to me.

>>> This problem remains. There is also a similar problem with the
>>> range types, namely empty ranges. There should be special cases
>>> for them on some of the strategies. I tried to solve the problems
>>> in several different ways, but got a segfault one line or another.
>>> This makes me think that BRIN framework doesn't support to store
>>> different types than the indexed column in the values array.
>>> For example, brin_deform_tuple() iterates over the values array and
>>> copies them using the length of the attr on the index, not the length
>>> of the type defined by OpcInfo function. If storing another types
>>> aren't supported, why is it required to return oid's on the OpcInfo
>>> function. I am confused.
>>
>>
>> I leave this to someone more knowledgable about BRIN to answer.
>
> I think I have fixed them.

Looks good as far as I can tell.

> I have fixed different addressed families by adding another support
> function.
>
> I used STORAGE parameter to support the point data type. To make it
> work I added some operators between box and point data type. We can
> support all geometric types with this method.

Looks to me like this should work.

= New comments

- Searching for the empty range is slow since the empty range matches all brin ranges.

EXPLAIN ANALYZE SELECT * FROM foo WHERE r = '[1,1)';
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=12.01..16.02 rows=1 width=14) (actual time=47.603..47.605 rows=1 loops=1)
   Recheck Cond: (r = 'empty'::int4range)
   Rows Removed by Index Recheck: 200000
   Heap Blocks: lossy=1082
   ->  Bitmap Index Scan on foo_r_idx  (cost=0.00..12.01 rows=1 width=0) (actual time=0.169..0.169 rows=11000 loops=1)
         Index Cond: (r = 'empty'::int4range)
 Planning time: 0.062 ms
 Execution time: 47.647 ms
(8 rows)

- Found a typo in the docs: "withing the range"

- Why have you removed the USE_ASSERT_CHECKING code from brin.c?

- Remove redundant "or not" from "/* includes empty element or not */".

- Minor grammar gripe: Change "Check that" to "Check if" in the comments in brin_inclusion_add_value().

- Won't the code incorrectly return false if the first added element to an index page is empty?

- Would it be worth optimizing the code by checking for empty ranges after checking for overlap in brin_inclusion_add_value()? I would imagine that empty ranges are rare in most use cases.

- Typo in comment: "If the it" -> "If it"

- Typo in comment: "Note that this strategies" -> "Note that these strategies"

- Typo in comment: "inequality strategies does not" -> "inequality strategies do not"

- Typo in comment: "geometric types which uses" -> "geometric types which use"

- I get 'ERROR: missing strategy 7 for attribute 1 of index "bar_i_idx"' when running the query below. Why does this not fail in the test suite? The overlap operator works just fine. If I read your code correctly other strategies are also missing.

SELECT * FROM bar WHERE i = '::1';

- I do not think this comment is true "Used to determine the addresses have a common union or not". It actually checks if we can create a range which contains both ranges.

- Compact random spaces in "select numrange(1.0, 2.0) + numrange(2.5, 3.0); -- should fail"

--
Andreas Karlsson
> Yeah, there is still a test which fails in opr_sanity.

I attached an additional patch to remove extra pg_amproc entries from minmax operator classes. It fixes the test as a side effect.

>> Yes but they were also required by this patch. This version adds more
>> functions and operators. I can split them appropriately after your
>> review.
>
>
> Ok, sounds fine to me.

It is now split.

> = New comments
>
> - Searching for the empty range is slow since the empty range matches all
> brin ranges.
>
> EXPLAIN ANALYZE SELECT * FROM foo WHERE r = '[1,1)';
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
> Bitmap Heap Scan on foo (cost=12.01..16.02 rows=1 width=14) (actual
> time=47.603..47.605 rows=1 loops=1)
> Recheck Cond: (r = 'empty'::int4range)
> Rows Removed by Index Recheck: 200000
> Heap Blocks: lossy=1082
> -> Bitmap Index Scan on foo_r_idx (cost=0.00..12.01 rows=1 width=0)
> (actual time=0.169..0.169 rows=11000 loops=1)
> Index Cond: (r = 'empty'::int4range)
> Planning time: 0.062 ms
> Execution time: 47.647 ms
> (8 rows)

There is not much we can do about it. It looks like the problem here is the selectivity estimation.

> - Found a typo in the docs: "withing the range"

Fixed.

> - Why have you removed the USE_ASSERT_CHECKING code from brin.c?

Because it doesn't work with the new operator class. We don't set the union field when there are elements that are not mergeable.

> - Remove redundant "or not" from "/* includes empty element or not */".

Fixed.

> - Minor grammar gripe: Change "Check that" to "Check if" in the comments in
> brin_inclusion_add_value().

Fixed.

> - Wont the code incorrectly return false if the first added element to an
> index page is empty?

No, column->bv_values[2] is set to true for the first empty element.

> - Would it be worth optimizing the code by checking for empty ranges after
> checking for overlap in brin_inclusion_add_value()? I would imagine that
> empty ranges are rare in most use cases.

I changed it for all empty range checks.

> - Typo in comment: "If the it" -> "If it"
>
> - Typo in comment: "Note that this strategies" -> "Note that these
> strategies"
>
> - Typo in comment: "inequality strategies does not" -> "inequality
> strategies do not"
>
> - Typo in comment: "geometric types which uses" -> "geometric types which
> use"

All of them are fixed.

> - I get 'ERROR: missing strategy 7 for attribute 1 of index "bar_i_idx"'
> when running the query below. Why does this not fail in the test suite? The
> overlap operator works just fine. If I read your code correctly other
> strategies are also missing.
>
> SELECT * FROM bar WHERE i = '::1';

I fixed it in the new version. The tests weren't failing because they were using the minmax operator class for equality.

> - I do not think this comment is true "Used to determine the addresses have
> a common union or not". It actually checks if we can create range which
> contains both ranges.

Changed as you suggested.

> - Compact random spaces in "select numrange(1.0, 2.0) + numrange(2.5, 3.0); -- should fail"

There was a tab in there. Now it is replaced with a space.
Attachment
- brin-inclusion-v05-box-vs-point-operators.patch
- brin-inclusion-v05-fix-brin-deform-tuple.patch
- brin-inclusion-v05-inclusion-opclasses.patch
- brin-inclusion-v05-remove-assert-checking.patch
- brin-inclusion-v05-remove-minmax-amprocs.patch
- brin-inclusion-v05-sql-level-support-functions.patch
- brin-inclusion-v05-strategy-numbers.patch
Thanks for the updated patch; I will look at it as soon as time allows. (Not really all that soon, regrettably.)

Judging from a quick look, I think patches 1 and 5 can be committed quickly; they imply no changes to other parts of BRIN. (Not sure why 1 and 5 are separate. Any reason for this?) Also patch 2.

Patch 4 looks like a simple bugfix (or maybe a generalization) of BRIN framework code; should also be committable right away. Needs a closer look of course.

Patch 3 is a problem. That code is there because the union proc is only used in a corner case in Minmax, so if we remove it, user-written Union procs are very likely to remain buggy for a long time. If you have a better idea to test Union in Minmax, or some other way to turn that stuff off for the range stuff, I'm all ears. Let's just make sure the support procs are tested to avoid stupid bugs. Before I introduced that, my Minmax Union proc was all wrong.

Patch 7 I don't understand. Will have to look closer. Are you saying Minmax will depend on Btree opclasses? I remember thinking of doing it that way at some point, but wasn't convinced for some reason.

Patch 6 seems the real meat of your own stuff. I think there should be a patch 8 also but it's not attached ... ??

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> Judging from a quick look, I think patches 1 and 5 can be committed
> quickly; they imply no changes to other parts of BRIN. (Not sure why 1
> and 5 are separate. Any reason for this?) Also patch 2.

Not much reason except that 1 includes only functions, but 5 includes operators.

> Patch 4 looks like a simple bugfix (or maybe a generalization) of BRIN
> framework code; should also be committable right away. Needs a closer
> look of course.
>
> Patch 3 is a problem. That code is there because the union proc is only
> used in a corner case in Minmax, so if we remove it, user-written Union
> procs are very likely to remain buggy for long. If you have a better
> idea to test Union in Minmax, or some other way to turn that stuff off
> for the range stuff, I'm all ears. Just lets make sure the support
> procs are tested to avoid stupid bugs. Before I introduced that, my
> Minmax Union proc was all wrong.

I removed this test because I don't see a way to support it. I believe any other implementation that is more complicated than minmax will fail in there. It is better to catch such bugs with the regression tests, so I tried to improve them. GiST, SP-GiST and GIN don't have similar checks, but they have more complicated user defined functions.

> Patch 7 I don't understand. Will have to look closer. Are you saying
> Minmax will depend on Btree opclasses? I remember thinking in doing it
> that way at some point, but wasn't convinced for some reason.

No, there isn't any additional dependency. It makes minmax operator classes use the procedures from pg_amop instead of adding them to pg_amproc. It also makes the operator class safer for cross data type usage. Actually, I just checked and found out that we get wrong answers from the index on the current master without this patch. You can reproduce it with this query on the regression database:

select * from brintest where timestampcol = '1979-01-29 11:05:09'::timestamptz;

The inclusion-opclasses patch makes it possible to add cross type brin regression tests. I will add more of them in the next version.

> Patch 6 seems the real meat of your own stuff. I think there should be
> a patch 8 also but it's not attached ... ??

I had another commit that was not intended to be sent. Sorry about that.
On Mon, Apr 6, 2015 at 5:17 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Thanks for the updated patch; I will at it as soon as time allows. (Not > really all that soon, regrettably.) > > Judging from a quick look, I think patches 1 and 5 can be committed > quickly; they imply no changes to other parts of BRIN. (Not sure why 1 > and 5 are separate. Any reason for this?) Also patch 2. > > Patch 4 looks like a simple bugfix (or maybe a generalization) of BRIN > framework code; should also be committable right away. Needs a closer > look of course. Is this still pending? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Mon, Apr 6, 2015 at 5:17 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Thanks for the updated patch; I will at it as soon as time allows. (Not > > really all that soon, regrettably.) > > > > Judging from a quick look, I think patches 1 and 5 can be committed > > quickly; they imply no changes to other parts of BRIN. (Not sure why 1 > > and 5 are separate. Any reason for this?) Also patch 2. > > > > Patch 4 looks like a simple bugfix (or maybe a generalization) of BRIN > > framework code; should also be committable right away. Needs a closer > > look of course. > > Is this still pending? Yeah. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/06/2015 09:36 PM, Emre Hasegeli wrote: >>> Yes but they were also required by this patch. This version adds more >>> functions and operators. I can split them appropriately after your >>> review. >> >> >> Ok, sounds fine to me. > > It is now split. In which order should I apply the patches? I also agree with Alvaro's comments. -- Andreas Karlsson
> In which order should I apply the patches? I rebased and renamed them with numbers.
Attachment
- brin-inclusion-v06-01-sql-level-support-functions.patch
- brin-inclusion-v06-02-strategy-numbers.patch
- brin-inclusion-v06-03-remove-assert-checking.patch
- brin-inclusion-v06-04-fix-brin-deform-tuple.patch
- brin-inclusion-v06-05-box-vs-point-operators.patch
- brin-inclusion-v06-06-inclusion-opclasses.patch
- brin-inclusion-v06-07-remove-minmax-amprocs.patch
From my point of view as a reviewer this patch set is very close to being committable.

= brin-inclusion-v06-01-sql-level-support-functions.patch

This patch looks good.

= brin-inclusion-v06-02-strategy-numbers.patch

This patch looks good, but shouldn't it be merged with 07?

= brin-inclusion-v06-03-remove-assert-checking.patch

As you wrote earlier this is needed because the new range indexes would violate the asserts. I think it is fine to remove the assertion.

= brin-inclusion-v06-04-fix-brin-deform-tuple.patch

This patch looks good and can be committed separately.

= brin-inclusion-v06-05-box-vs-point-operators.patch

This patch looks good and can be committed separately.

= brin-inclusion-v06-06-inclusion-opclasses.patch

- "operator classes store the union of the values in the indexed column" is not technically true. It stores something which covers all of the values.
- Missing space in "except box and point*/".
- Otherwise looks good.

= brin-inclusion-v06-07-remove-minmax-amprocs.patch

Shouldn't this be merged with 02? Otherwise it looks good.

--
Andreas Karlsson
Hi,

2015-05-05 2:51 GMT+02:00 Andreas Karlsson <andreas@proxel.se>:
> From my point of view as a reviewer this patch set is very close to being
> committable.

I'd like to thank all committers and reviewers already, and I hope BRIN makes it into PG 9.5. As a database instructor, conference organiser and geospatial specialist, I'm looking forward to this clever new index. I'm keen to see if a PostGIS specialist jumps in and adds PostGIS geometry support.

Yours, S.
Stefan Keller wrote: > Hi, > > 2015-05-05 2:51 GMT+02:00 Andreas Karlsson <andreas@proxel.se>: > > From my point of view as a reviewer this patch set is very close to being > > committable. > > I'd like to thank already now to all committers and reviewers and hope > BRIN makes it into PG 9.5. > As a database instructor, conference organisator and geospatial > specialist I'm looking forward for this clever new index. Appreciated. The base BRIN code is already in 9.5, so barring significant issues you should see it in the next major release. Support for geometry types and the like is still pending, but I hope to get to it shortly. > I'm keen to see if a PostGIS specialist jumps in and adds PostGIS > geometry support. Did you test the patch proposed here already? It could be a very good contribution. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 05/05/2015 04:24 AM, Alvaro Herrera wrote: > Stefan Keller wrote: >> I'm keen to see if a PostGIS specialist jumps in and adds PostGIS >> geometry support. > > Did you test the patch proposed here already? It could be a very good > contribution. Indeed, I have done some testing of the patch but more people testing would be nice. Andreas
> From my point of view as a reviewer this patch set is very close to being > committable. Thank you. The new versions are attached. > - "operator classes store the union of the values in the indexed column" is > not technically true. It stores something which covers all of the values. I rephrased it as " operator classes store a value which includes the values in the indexed column". > - Missing space in "except box and point*/". Fixed. > = brin-inclusion-v06-07-remove-minmax-amprocs.patch > > Shouldn't this be merged with 02? Otherwise it looks good. It doesn't have any relation with the 02-strategy-numbers.patch. Maybe you mean 01-sql-level-support-functions.patch and 05-box-vs-point-operators.patch should be merged. They can always be committed together.
Attachment
- brin-inclusion-v07-01-sql-level-support-functions.patch
- brin-inclusion-v07-02-strategy-numbers.patch
- brin-inclusion-v07-03-remove-assert-checking.patch
- brin-inclusion-v07-04-fix-brin-deform-tuple.patch
- brin-inclusion-v07-05-box-vs-point-operators.patch
- brin-inclusion-v07-06-inclusion-opclasses.patch
- brin-inclusion-v07-07-remove-minmax-amprocs.patch
> Indeed, I have done some testing of the patch but more people testing would
> be nice.

The inclusion opclass should work for other data types as long as the required operators and SQL level support functions are supplied. Maybe it would work for PostGIS, too.
On 05/05/2015 11:57 AM, Emre Hasegeli wrote: >> From my point of view as a reviewer this patch set is very close to being >> committable. > > Thank you. The new versions are attached. Nice, I think it is ready now other than the issues Alvaro raised in his review[1]. Have you given those any thought? Notes 1. http://www.postgresql.org/message-id/20150406211724.GH4369@alvh.no-ip.org Andreas
> Nice, I think it is ready now other than the issues Alvaro raised in his
> review[1]. Have you given those any thought?

I already replied to his email [1]. Which issues do you mean?

[1] http://www.postgresql.org/message-id/CAE2gYzxQ-Gk3q3jYWT=1eNLEbSgCgU28+1axML4oMCwjBkPuqw@mail.gmail.com
After looking at 05 again, I don't like the "same as %" business. Creating a whole new class of exceptions is not my thing, particularly not in a regression test whose sole purpose is to look for exceptional (a.k.a. "wrong") cases. I would much rather define the opclasses for those two datatypes using the existing @> operators rather than create && operators for this purpose. We can add a note to the docs, "for historical reasons the brin opclass for datatype box/point uses the <@ operator instead of &&", or something like that. AFAICS this is just some pretty small changes to patches 05 and 06. Will you please resubmit? I just pushed patch 01, and I'm looking at 04 next. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Looking at patch 04, it seems to me that it would be better to have the OpcInfo struct carry the typecache struct rather than the type OID, so that we can avoid repeated typecache lookups in brin_deform_tuple; something like

    /* struct returned by "OpcInfo" amproc */
    typedef struct BrinOpcInfo
    {
        /* Number of columns stored in an index column of this opclass */
        uint16      oi_nstored;

        /* Opaque pointer for the opclass' private use */
        void       *oi_opaque;

        /* Typecache entries of the stored columns */
        TypeCacheEntry oi_typcache[FLEXIBLE_ARRAY_MEMBER];
    } BrinOpcInfo;

Looking into it now.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
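For readers unfamiliar with the pattern, the struct above ends in a flexible array member, so it has to be allocated with room for the trailing entries. The following is only a standalone sketch of that allocation pattern, not the BRIN code: TypeCacheEntry is replaced by a two-field stand-in, the names OpcInfoSketch and opcinfo_alloc are hypothetical, and C99 `[]` is used on the assumption that it behaves like FLEXIBLE_ARRAY_MEMBER.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the real typecache entry; only the fields used here. */
typedef struct TypeCacheEntry
{
    uint32_t    type_id;
    int16_t     typlen;
} TypeCacheEntry;

/* Hypothetical mirror of the proposed struct, using a C99 flexible array. */
typedef struct OpcInfoSketch
{
    uint16_t    oi_nstored;     /* number of stored columns */
    void       *oi_opaque;      /* opclass-private data */
    TypeCacheEntry oi_typcache[];   /* one entry per stored column */
} OpcInfoSketch;

/*
 * Allocate the header plus nstored trailing entries in one zeroed block.
 * Returns NULL on allocation failure; the caller frees with free().
 */
OpcInfoSketch *
opcinfo_alloc(uint16_t nstored)
{
    size_t      size = offsetof(OpcInfoSketch, oi_typcache) +
                       (size_t) nstored * sizeof(TypeCacheEntry);
    OpcInfoSketch *info = calloc(1, size);

    if (info == NULL)
        return NULL;
    info->oi_nstored = nstored;
    return info;
}
```

With the entries filled in once when the opclass info is built, a per-tuple routine can read oi_typcache[i].typlen directly instead of looking the type up again for every tuple, which is the saving the proposal is after.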
Alvaro Herrera wrote: > Looking at patch 04, it seems to me that it would be better to have > the OpcInfo struct carry the typecache struct rather than the type OID, > so that we can avoid repeated typecache lookups in brin_deform_tuple; Here's the patch. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Can you please explain what is the purpose of patch 07? I'm not sure I understand; are we trying to avoid having to add pg_amproc entries for these operators and instead piggy-back on btree opclass definitions? Not too much in love with that idea; I see that there is less tedium in that the brin opclass definition is simpler. One disadvantage is a 3x increase in the number of syscache lookups to get the function you need, unless I'm reading things wrong. Maybe this is not performance critical.

Anyway I tried applying it in isolation, and found that it fails the assertion that tests the "union" support proc in brininsert. That doesn't seem okay. I mean, it's okay not to run the test for the inclusion opclasses, but why does it now fail in minmax which was previously passing? Couldn't figure it out.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 05/05/2015 01:10 PM, Emre Hasegeli wrote: > I already replied his email [1]. Which issues do you mean? Sorry, my bad please ignore the previous email. -- Andreas Karlsson
> Can you please explain what is the purpose of patch 07? I'm not sure I > understand; are we trying to avoid having to add pg_amproc entries for > these operators and instead piggy-back on btree opclass definitions? > Not too much in love with that idea; I see that there is less tedium in > that the brin opclass definition is simpler. One disadvantage is a 3x > increase in the number of syscache lookups to get the function you need, > unless I'm reading things wrong. Maybe this is not performance critical. It doesn't use btree opclass definitions. It uses brin opclass pg_amop entries instead of duplicating them in pg_amproc. The pg_amproc.h header says: > * The amproc table identifies support procedures associated with index > * operator families and classes. These procedures can't be listed in pg_amop > * since they are not the implementation of any indexable operator. In our case, these procedures can be listed in pg_amop as they are implementations of indexable operators. The more important change on this patch is to request procedures for the right data types. Minmax opclasses return wrong results without this patch. You can reproduce it with this query on the regression database: select * from brintest where timestampcol = '1979-01-29 11:05:09'::timestamptz; > Anyway I tried applying it on isolation, and found that it fails the > assertion that tests the "union" support proc in brininsert. That > doesn't seem okay. I mean, it's okay not to run the test for the > inclusion opclasses, but why does it now fail in minmax which was > previously passing? Couldn't figure it out. The regression tests passed when I tried it on the current master.
>> Looking at patch 04, it seems to me that it would be better to have
>> the OpcInfo struct carry the typecache struct rather than the type OID,
>> so that we can avoid repeated typecache lookups in brin_deform_tuple;
>
> Here's the patch.

Looks better to me. I will incorporate it into this patch.
> After looking at 05 again, I don't like the "same as %" business. > Creating a whole new class of exceptions is not my thing, particularly > not in a regression test whose sole purpose is to look for exceptional > (a.k.a. "wrong") cases. I would much rather define the opclasses for > those two datatypes using the existing @> operators rather than create > && operators for this purpose. We can add a note to the docs, "for > historical reasons the brin opclass for datatype box/point uses the <@ > operator instead of &&", or something like that. I worked around this by adding point <@ box operator as the overlap strategy and removed additional && operators. > AFAICS this is just some pretty small changes to patches 05 and 06. > Will you please resubmit? New series of patches are attached. Note that brin-inclusion-v08-04-fix-brin-deform-tuple.patch is the one from you.
Attachment
I again have to refuse the notion that removing the assert-only block without any replacement is acceptable. I just spent a lot of time tracking down what turned out to be a bug in your patch 07:

 	/* Adjust maximum, if B's max is greater than A's max */
-	needsadj = FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
-													 PROCNUM_GREATER),
-						  colloid, col_b->bv_values[1], col_a->bv_values[1]);
+	frmg = minmax_get_strategy_procinfo(bdesc, attno, attr->atttypid,
+										BTGreaterStrategyNumber);
+	needsadj = FunctionCall2Coll(frmg, colloid, col_b->bv_values[0],
+								 col_a->bv_values[0]);

Note the removed lines use array index 1, while the added lines use array index 0. The only reason I noticed this is because I applied this patch without the others and saw the assertion fire; how would I have noticed the problem had I just removed it?

Let's think together and try to find a reasonable way to get the union procedures tested regularly. It is pretty clear that having them run only when the race condition occurs is not acceptable; bugs go unnoticed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
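To make the index mix-up concrete, here is a minimal standalone sketch of a min/max union together with the kind of coverage check an assert-only block could run. It assumes, as in the diff quoted above, that slot 0 holds the minimum and slot 1 the maximum. Plain ints stand in for Datums, and the names MinMaxSummary, summary_union and summary_covers are hypothetical, not the actual BRIN functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a minmax summary: values[0] = min, values[1] = max. */
typedef struct MinMaxSummary
{
    int         values[2];
} MinMaxSummary;

/* Merge b into a so that a covers both original ranges. */
void
summary_union(MinMaxSummary *a, const MinMaxSummary *b)
{
    /* Adjust minimum, if B's min is less than A's min */
    if (b->values[0] < a->values[0])
        a->values[0] = b->values[0];

    /* Adjust maximum, if B's max is greater than A's max (slot 1, not 0) */
    if (b->values[1] > a->values[1])
        a->values[1] = b->values[1];
}

/* Sanity check: does the summary u fully cover the summary x? */
bool
summary_covers(const MinMaxSummary *u, const MinMaxSummary *x)
{
    return u->values[0] <= x->values[0] && u->values[1] >= x->values[1];
}
```

If the maximum branch compared slot 0 instead of slot 1, merging {5, 30} into {10, 20} would leave the maximum at 20, and summary_covers would report that the result no longer covers the second input. A check of this shape, run on every union in assert-enabled builds, is one way such a slip gets noticed.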
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Let's think together and try to find a reasonable way to get the union > procedures tested regularly. It is pretty clear that having them run > only when the race condition occurs is not acceptable; bugs go > unnoticed. [ just a drive-by comment... ] Maybe you could set up a testing mode that forces the race condition to occur? Then you could test the calling code paths, not only the union procedures per se. regards, tom lane
Emre Hasegeli wrote: > > After looking at 05 again, I don't like the "same as %" business. > > Creating a whole new class of exceptions is not my thing, particularly > > not in a regression test whose sole purpose is to look for exceptional > > (a.k.a. "wrong") cases. I would much rather define the opclasses for > > those two datatypes using the existing @> operators rather than create > > && operators for this purpose. We can add a note to the docs, "for > > historical reasons the brin opclass for datatype box/point uses the <@ > > operator instead of &&", or something like that. > > I worked around this by adding point <@ box operator as the overlap > strategy and removed additional && operators. That works for me. I pushed patches 04 and 07, as well as adopting some of the changes to the regression test in 06. I'm afraid I caused a bit of merge pain for you -- sorry about that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I pushed patches 04 and 07, as well as adopting some of the changes to > the regression test in 06. I'm afraid I caused a bit of merge pain for > you -- sorry about that. No problem. I rebased the remaining ones.
Attachment
Emre Hasegeli wrote: > > I pushed patches 04 and 07, as well as adopting some of the changes to > > the regression test in 06. I'm afraid I caused a bit of merge pain for > > you -- sorry about that. > > No problem. I rebased the remaining ones. In patch 05, you use straight > etc comparisons of point/box values. All the other code in that file AFAICS uses FPlt() macros and others; I assume we should do likewise. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote: > In patch 05, you use straight > etc comparisons of point/box values. > All the other code in that file AFAICS uses FPlt() macros and others; I > assume we should do likewise. Oooh, looking at the history of this I just realized that the comments signed "tgl" are actually Thomas G. Lockhart, not Tom G. Lane! See commit 9e2a87b62db87fc4175b00dabfd26293a2d072fa -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So, in reading these patches, it came to me that we might want to have pg_upgrade mark indexes invalid if we in the future change the implementation of some opclass. For instance, the inclusion opclass submitted here uses three columns: the indexed value itself, plus two booleans; each of these booleans is a workaround for some nasty design decision in the underlying datatypes. One boolean is "unmergeable": if a block range contains both IPv4 and IPv6 addresses, we mark it as 'unmergeable' and then every query needs to visit that block range always. The other boolean is "contains empty" and is used for range types: it is set if the empty value is present somewhere in the block range. If in the future, for instance, we come up with a way to store the ipv4 plus ipv6 info, we will want to change the page format. If we add a page version to the metapage, we can detect the change at pg_upgrade time and force a reindex of the index. Thoughts? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
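The three-slot layout described above can be sketched in standalone C. This is only an illustration under stated assumptions: the Value type, which carries both an address family and an empty flag, is a hypothetical blend of inet-like and range-like data, and the names InclusionSummary, summary_add and summary_may_contain_equal are invented for the sketch. The real opclass works on Datums through support procedures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical value: an interval [lo, hi] tagged with a family, or empty. */
typedef struct Value
{
    int         family;         /* e.g. 4 or 6 */
    bool        is_empty;
    int         lo;
    int         hi;
} Value;

/* Hypothetical per-block-range summary with the three stored "columns". */
typedef struct InclusionSummary
{
    bool        has_value;      /* false until a non-empty value is added */
    Value       covering;       /* slot 0: something covering all values */
    bool        unmergeable;    /* slot 1: mixed families were seen */
    bool        contains_empty; /* slot 2: an empty value was seen */
} InclusionSummary;

/* Fold one heap value into the summary. */
void
summary_add(InclusionSummary *s, Value v)
{
    if (v.is_empty)
    {
        s->contains_empty = true;
        return;
    }
    if (!s->has_value)
    {
        s->covering = v;
        s->has_value = true;
        return;
    }
    if (s->unmergeable)
        return;
    if (v.family != s->covering.family)
    {
        s->unmergeable = true;
        return;
    }
    if (v.lo < s->covering.lo)
        s->covering.lo = v.lo;
    if (v.hi > s->covering.hi)
        s->covering.hi = v.hi;
}

/* Could the block range hold a value equal to q? False means "skip it". */
bool
summary_may_contain_equal(const InclusionSummary *s, Value q)
{
    if (q.is_empty)
        return s->contains_empty;
    if (!s->has_value)
        return false;
    if (s->unmergeable)
        return true;            /* must always visit this block range */
    if (q.family != s->covering.family)
        return false;
    return q.lo >= s->covering.lo && q.hi <= s->covering.hi;
}
```

The sketch shows why each boolean is a workaround: once unmergeable is set the summary can no longer exclude anything, and contains_empty is the only thing that lets a search for the empty value skip a block range.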
On 05/12/2015 10:49 PM, Alvaro Herrera wrote:
> If in the future, for instance, we come up with a way to store the ipv4
> plus ipv6 info, we will want to change the page format. If we add a
> page version to the metapage, we can detect the change at pg_upgrade
> time and force a reindex of the index.

A version number in the metapage is certainly a good idea. But we already have that, don't we?

> /* Metapage definitions */
> typedef struct BrinMetaPageData
> {
>     uint32      brinMagic;
>     uint32      brinVersion;
>     BlockNumber pagesPerRange;
>     BlockNumber lastRevmapPage;
> } BrinMetaPageData;
>
> #define BRIN_CURRENT_VERSION 1
> #define BRIN_META_MAGIC 0xA8109CFA

Did you have something else in mind?

- Heikki
Heikki Linnakangas wrote: > On 05/12/2015 10:49 PM, Alvaro Herrera wrote: > >If in the future, for instance, we come up with a way to store the ipv4 > >plus ipv6 info, we will want to change the page format. If we add a > >page version to the metapage, we can detect the change at pg_upgrade > >time and force a reindex of the index. > > A version number in the metapage is a certainly a good idea. But we already > have that, don't we? : > > >/* Metapage definitions */ > >typedef struct BrinMetaPageData > >{ > > uint32 brinMagic; > > uint32 brinVersion; > > BlockNumber pagesPerRange; > > BlockNumber lastRevmapPage; > >} BrinMetaPageData; > > > >#define BRIN_CURRENT_VERSION 1 > >#define BRIN_META_MAGIC 0xA8109CFA > > Did you have something else in mind? Yeah, I was thinking we could have a separate version number for the opclass code as well. An external extension could change that, for instance. Also, we could change the 'inclusion' version and leave minmax alone. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
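As a rough illustration of the idea of a per-opclass version next to the existing metapage version, here is a standalone sketch. The struct and the two constants are copied from the quote above; everything else is an assumption: BlockNumber is a plain uint32_t stand-in, index_format_is_current is a hypothetical name, and the thread does not say where an opclass version would actually be stored, so it is passed in as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

#define BRIN_CURRENT_VERSION 1
#define BRIN_META_MAGIC 0xA8109CFA

/* Metapage definitions, as quoted in the thread */
typedef struct BrinMetaPageData
{
    uint32_t    brinMagic;
    uint32_t    brinVersion;
    BlockNumber pagesPerRange;
    BlockNumber lastRevmapPage;
} BrinMetaPageData;

/*
 * Hypothetical check an upgrade tool could run: the index is usable as-is
 * only if the metapage looks valid, the framework version matches, and the
 * opclass-specific version recorded for the index matches the one the
 * current opclass code expects. Otherwise the index would need a reindex.
 */
bool
index_format_is_current(const BrinMetaPageData *meta,
                        uint32_t stored_opclass_version,
                        uint32_t current_opclass_version)
{
    if (meta->brinMagic != BRIN_META_MAGIC)
        return false;
    if (meta->brinVersion != BRIN_CURRENT_VERSION)
        return false;
    return stored_opclass_version == current_opclass_version;
}
```

Keeping the two version numbers separate is what would allow the 'inclusion' format to change while minmax indexes stay valid.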
Emre Hasegeli wrote:
> > I pushed patches 04 and 07, as well as adopting some of the changes to
> > the regression test in 06. I'm afraid I caused a bit of merge pain for
> > you -- sorry about that.
>
> No problem. I rebased the remaining ones.

Thanks! After some back-and-forth between Emre and me, here's an updated patch. My changes are cosmetic; for a detailed rundown, see https://github.com/alvherre/postgres/commits/brin-inclusion

Note that datatype point was removed: it turns out that unless we get box_contain_pt changed to use FPlt() et al, indexes created with this opclass would be corrupt. And we cannot simply change box_contain_pt, because that would break existing GiST and SP-GiST indexes that use it today and are then pg_upgraded to 9.5! So that needs to be considered separately.

Also, removing point support means removing the CAST support procedure, because there is no use for it in the supported types. Also, patch 05 in the previous submissions goes away completely because there's no need for those (box,point) operators anymore.

There's nothing Earth-shattering here that hasn't been seen in previous submissions by Emre.

One item of note is that this patch is blindly removing the assert-only blocks as previously discussed, without any replacement. Need to think more on how to put something back ...

--
Álvaro Herrera
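For readers who have not looked at the geometric comparison code, the sketch below illustrates the general kind of mismatch that exact versus fuzzy comparisons can produce near a box edge. It is a standalone C toy, not PostgreSQL source: the `TOY_*` macros are modeled on the FPlt()-style fuzzy comparisons mentioned above, and the epsilon value, the struct layouts and both containment functions are assumptions for illustration. It does not claim to reproduce the exact failure in the opclass.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed tolerance and fuzzy comparisons, modeled on FPlt() et al. */
#define TOY_EPSILON   1.0E-06
#define TOY_FPle(A,B) ((A) - (B) <= TOY_EPSILON)
#define TOY_FPge(A,B) ((B) - (A) <= TOY_EPSILON)

typedef struct ToyPoint
{
    double x;
    double y;
} ToyPoint;

typedef struct ToyBox
{
    ToyPoint high;  /* upper right corner */
    ToyPoint low;   /* lower left corner */
} ToyBox;

/* Containment using plain floating-point comparisons. */
static bool
toy_contains_exact(const ToyBox *b, ToyPoint p)
{
    return p.x <= b->high.x && p.x >= b->low.x &&
           p.y <= b->high.y && p.y >= b->low.y;
}

/* Containment using the fuzzy comparisons. */
static bool
toy_contains_fuzzy(const ToyBox *b, ToyPoint p)
{
    return TOY_FPle(p.x, b->high.x) && TOY_FPge(p.x, b->low.x) &&
           TOY_FPle(p.y, b->high.y) && TOY_FPge(p.y, b->low.y);
}
```

For a unit box and a point whose x coordinate is 1.0 + 5e-7, the fuzzy test reports containment and the exact test does not. If a summary is maintained with one style and queries are answered with the other, the two can disagree about which block ranges may hold a match.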
Emre Hasegeli wrote:
> > I pushed patches 04 and 07, as well as adopting some of the changes to
> > the regression test in 06. I'm afraid I caused a bit of merge pain for
> > you -- sorry about that.
>
> No problem. I rebased the remaining ones.

Thanks, pushed.

There was a proposed change by Emre to renumber operator -|- to 17 for range types (from 6 I think). I didn't include that as I think it should be a separate commit.

Also, we still owe a test strategy for the union procedure. I will work with Emre in the coming days to get that sorted out. I'm now thinking that something in src/test/modules is the most appropriate.

--
Álvaro Herrera