Thread: Minmax indexes

Re: Minmax indexes

From

Peter Eisentraut

Date:

15 September 2013, 15:26:04

On Sat, 2013-09-14 at 21:14 -0300, Alvaro Herrera wrote:
> Here's a reviewable version of what I've dubbed Minmax indexes.

Please fix duplicate OID 3177.

Re: Minmax indexes

From

Thom Brown

Date:

16 September 2013, 08:48:09

On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Hi,

Here's a reviewable version of what I've dubbed Minmax indexes. Some
people said they would like to use some other name for this feature, but
I have yet to hear usable ideas, so for now I will keep calling them
this way. I'm open to proposals, but if you pick something that cannot
be abbreviated "mm" I might have you prepare a rebased version which
renames the files and structs.

The implementation here has been simplified from what I originally
proposed at 20130614222805.GZ5491@eldon.alvh.no-ip.org -- in particular,
I noticed that there's no need to involve aggregate functions at all; we
can just use inequality operators. So the pg_amproc entries are gone;
only the pg_amop entries are necessary.
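
As a rough illustration of that simplification (a sketch only, not code from the patch): a per-range summary can be maintained with nothing more than the type's `<` and `>` comparisons. The `RangeSummary` class and its `add` method are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RangeSummary:
    """Hypothetical min/max summary for one page range."""
    minimum: Optional[Any] = None
    maximum: Optional[Any] = None

    def add(self, value: Any) -> bool:
        """Fold a value into the summary; return True if the summary changed."""
        changed = False
        # Only inequality comparisons are used, no min()/max() aggregates.
        if self.minimum is None or value < self.minimum:
            self.minimum = value
            changed = True
        if self.maximum is None or value > self.maximum:
            self.maximum = value
            changed = True
        return changed


summary = RangeSummary()
for v in [7, 5, 10, 8]:
    summary.add(v)
```

A value inside the current bounds leaves the summary untouched, which is the case where no index update would be needed.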

I've somewhat punted on the question of doing resummarization separately
from vacuuming. Right now, resummarization (as well as other necessary
index cleanup) takes place in amvacuumcleanup. This is not optimal; I
have stated elsewhere that I'd like to create separate maintenance
actions that can be carried out by autovacuum. That would be useful
both for Minmax indexes and GIN indexes (pending insertion list); maybe
others. That's not part of this patch, however.

The design of this stuff is in the file "minmax-proposal" at the top of
the tree. That file is up to date, though it still contains some open
questions that were present in the original proposal. (I have not fixed
some bogosities pointed out by Noah, for instance. I will do that
shortly.) In a final version, that file would be applied as
src/backend/access/minmax/README, most likely.

One area on which I needed to modify core code is IndexBuildHeapScan. I
needed a version that was able to scan only a certain range of pages,
not the entire table, so I introduced a new IndexBuildHeapRangeScan, and
added a quick "heap_scansetlimits" function. I haven't tested that this
works outside of the HeapRangeScan thingy, so it's probably completely
bogus; I'm open to suggestions if people think this should be
implemented differently. In any case, keeping that implementation
together with vanilla IndexBuildHeapScan makes a lot of sense.

One thing still to tackle is when to mark ranges as unsummarized. Right
now, any new tuple on a page range would cause a new index entry to be
created and a new revmap update. This would cause huge index bloat if,
say, a page is emptied and vacuumed and filled with new tuples with
increasing values outside the original range; each new tuple would
create a new index tuple. I have two ideas about this (1. mark range as
unsummarized if 3rd time we touch the same page range; 2. vacuum the
affected index page if it's full, so we can maintain the index always up
to date without causing undue bloat), but I haven't implemented
anything yet.

The "amcostestimate" routine is completely bogus; right now it returns
constant 0, meaning the index is always chosen if it exists.

There are opclasses for int4, numeric and text. The latter doesn't work
at all, because collation info is not passed down at all. I will have
to figure that out (even if I find it unlikely that minmax indexes have any
usefulness on top of text columns). I admit that numeric hasn't been
tested, and it's quite likely that it won't work; mainly because of
lack of some datumCopy() calls, about which the code contains some
/* XXX */ lines. I think this should be relatively straightforward.
Ideally, the final version of this patch would contain opclasses for all
supported datatypes (i.e. the same that have got btree opclasses).

I have messed up the opclass information, as evidenced by failures in
opr_sanity regression test. I will research that later.

There's working contrib/pageinspect support; pg_xlogdump (and wal_debug)
seems to work sanely too.
This patch compiles cleanly under -Werror.

The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633

Thanks for the patch, but I seem to have immediately hit a snag:

pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);

PANIC: invalid xlog record length 0

--
Thom

Re: Minmax indexes

From

Heikki Linnakangas

Date:

16 September 2013, 10:04:12

On 15.09.2013 03:14, Alvaro Herrera wrote:
> + Partial indexes are not supported; since an index is concerned with minimum and
> + maximum values of the involved columns across all the pages in the table, it
> + doesn't make sense to exclude values.  Another way to see "partial" indexes
> + here would be those that only considered some pages in the table instead of all
> + of them; but this would be difficult to implement and manage and, most likely,
> + pointless.

Something like this seems completely sensible to me:

create index i_accounts on accounts using minmax (ts) where valid = true;

The situation where that would be useful is if 'valid' accounts are 
fairly well clustered, but invalid ones are scattered all over the 
table. The minimum and maximum stored in the index would only concern
valid accounts.
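
To make the scenario concrete, here is a minimal sketch (an illustration under assumed data, not how the patch works internally). The `summarize` helper, the two-pages-per-range setting and the sample rows are all hypothetical; the point is only that ignoring rows excluded by the predicate can give much tighter per-range bounds.

```python
def summarize(pages, predicate, pages_per_range=2):
    """Return one (min, max) pair per page range, or None if no row qualifies.

    Only rows for which predicate(ts, valid) is true contribute.
    """
    summaries = []
    for start in range(0, len(pages), pages_per_range):
        values = [
            ts
            for page in pages[start:start + pages_per_range]
            for (ts, valid) in page
            if predicate(ts, valid)
        ]
        summaries.append((min(values), max(values)) if values else None)
    return summaries


# Each page is a list of (ts, valid) rows. Valid rows are clustered by ts,
# while invalid rows are scattered all over the table.
pages = [
    [(1, True), (2, True), (90, False)],
    [(3, True), (4, True)],
    [(50, True), (5, False)],
    [(51, True), (52, True), (1, False)],
]

full = summarize(pages, lambda ts, valid: True)        # no predicate
partial = summarize(pages, lambda ts, valid: valid)    # WHERE valid = true
```

With these rows, the unrestricted summaries span almost the whole value domain, while the "partial" ones stay narrow, so a lookup on `ts` for valid accounts could skip more ranges.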

- Heikki

Re: Minmax indexes

From

Chris Travers

Date:

16 September 2013, 10:19:25

> On 16 September 2013 at 11:03 Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

>
> Something like this seems completely sensible to me:
>
> create index i_accounts on accounts using minmax (ts) where valid = true;
>
> The situation where that would be useful is if 'valid' accounts are
> fairly well clustered, but invalid ones are scattered all over the
> table. The minimum and maximum stored in the index would only concern
> valid accounts.

Here's one that occurs to me:

CREATE INDEX i_billing_id_mm ON billing(id) WHERE paid_in_full IS NOT TRUE;

Note that this would be a frequently moving target and over years of billing, the subset would be quite small compared to the full system (imagine, say, 50k rows out of 20M).

Best Wishes,
Chris Travers
http://www.2ndquadrant.com
PostgreSQL Services, Training, and Support

Re: Minmax indexes

From

Andres Freund

Date:

16 September 2013, 10:25:20

On 2013-09-16 11:19:19 +0100, Chris Travers wrote:
> 
> 
> > On 16 September 2013 at 11:03 Heikki Linnakangas <hlinnakangas@vmware.com>
> > wrote:
> 
> >
> > Something like this seems completely sensible to me:
> >
> > create index i_accounts on accounts using minmax (ts) where valid = true;
> >
> > The situation where that would be useful is if 'valid' accounts are
> > fairly well clustered, but invalid ones are scattered all over the
> > table. The minimum and maximum stored in the index would only concern
> > valid accounts.

Yes, I wondered the same myself.

> Here's one that occurs to me:
> 
> CREATE INDEX i_billing_id_mm ON billing(id) WHERE paid_in_full IS NOT TRUE;
> 
> Note that this would be a frequently moving target and over years of billing,
> the subset would be quite small compared to the full system (imagine, say, 50k
> rows out of 20M).

In that case you'd just use a normal btree index, no?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Jaime Casanova

Date:

17 September 2013, 06:20:39

On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
> On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Here's a reviewable version of what I've dubbed Minmax indexes.
>>
> Thanks for the patch, but I seem to have immediately hit a snag:
>
> pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
> PANIC:  invalid xlog record length 0
>

fwiw, this seems to be triggered by ANALYZE.
At least i can trigger it by executing ANALYZE on the table (attached
is a stacktrace of a backend exhibiting the failure)

Another thing is these messages I got when compiling:
"""
mmxlog.c: In function ‘minmax_xlog_revmap_set’:
mmxlog.c:161:14: warning: unused variable ‘blkno’ [-Wunused-variable]
bufpage.c: In function ‘PageIndexDeleteNoCompact’:
bufpage.c:1066:18: warning: ‘lastused’ may be used uninitialized in
this function [-Wmaybe-uninitialized]
"""

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157

Attachment

stacktrace.txt

Re: Minmax indexes

From

Thom Brown

Date:

17 September 2013, 08:30:49

On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
> On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> Here's a reviewable version of what I've dubbed Minmax indexes.
>>
> Thanks for the patch, but I seem to have immediately hit a snag:
>
> pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
> PANIC: invalid xlog record length 0
>

fwiw, this seems to be triggered by ANALYZE.
At least i can trigger it by executing ANALYZE on the table (attached
is a stacktrace of a backend exhibiting the failure)

Another thing is this messages i got when compiling:
"""
mmxlog.c: In function ‘minmax_xlog_revmap_set’:
mmxlog.c:161:14: warning: unused variable ‘blkno’ [-Wunused-variable]
bufpage.c: In function ‘PageIndexDeleteNoCompact’:
bufpage.c:1066:18: warning: ‘lastused’ may be used uninitialized in
this function [-Wmaybe-uninitialized]
"""

I'm able to run ANALYSE manually without it dying:

pgbench=# analyse pgbench_accounts;

ANALYZE

pgbench=# analyse pgbench_accounts;

ANALYZE

pgbench=# create index minmaxtest on pgbench_accounts using minmax (aid);

PANIC: invalid xlog record length 0

--
Thom

Re: Minmax indexes

From

Jaime Casanova

Date:

17 September 2013, 13:37:30

On Tue, Sep 17, 2013 at 3:30 AM, Thom Brown <thom@linux.com> wrote:
> On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com> wrote:
>>
>> On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
>> > On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> Here's a reviewable version of what I've dubbed Minmax indexes.
>> >>
>> > Thanks for the patch, but I seem to have immediately hit a snag:
>> >
>> > pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax
>> > (aid);
>> > PANIC:  invalid xlog record length 0
>> >
>>
>> fwiw, this seems to be triggered by ANALYZE.
>> At least i can trigger it by executing ANALYZE on the table (attached
>> is a stacktrace of a backend exhibiting the failure)
>>
>
> I'm able to run ANALYSE manually without it dying:
>

try inserting some data before the ANALYZE, that will force a
resummarization which is mentioned in the stack trace of the failure

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157

Re: Minmax indexes

From

Thom Brown

Date:

17 September 2013, 13:44:07

On 17 September 2013 14:37, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Tue, Sep 17, 2013 at 3:30 AM, Thom Brown <thom@linux.com> wrote:
> On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com> wrote:
>>
>> On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
>> > On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> Here's a reviewable version of what I've dubbed Minmax indexes.
>> >>
>> > Thanks for the patch, but I seem to have immediately hit a snag:
>> >
>> > pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax
>> > (aid);
>> > PANIC: invalid xlog record length 0
>> >
>>
>> fwiw, this seems to be triggered by ANALYZE.
>> At least i can trigger it by executing ANALYZE on the table (attached
>> is a stacktrace of a backend exhibiting the failure)
>>
>
> I'm able to run ANALYSE manually without it dying:
>

try inserting some data before the ANALYZE, that will force a
resummarization which is mentioned in the stack trace of the failure

I've tried inserting 1 row then ANALYSE and 10,000 rows then ANALYSE, and in both cases there's no error. But then trying to create the index again results in my original error.

--
Thom

Re: Minmax indexes

From

Jaime Casanova

Date:

17 September 2013, 16:05:02

On Tue, Sep 17, 2013 at 8:43 AM, Thom Brown <thom@linux.com> wrote:
> On 17 September 2013 14:37, Jaime Casanova <jaime@2ndquadrant.com> wrote:
>>
>> On Tue, Sep 17, 2013 at 3:30 AM, Thom Brown <thom@linux.com> wrote:
>> > On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com>
>> > wrote:
>> >>
>> >> On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
>> >> > On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com>
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Here's a reviewable version of what I've dubbed Minmax indexes.
>> >> >>
>> >> > Thanks for the patch, but I seem to have immediately hit a snag:
>> >> >
>> >> > pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax
>> >> > (aid);
>> >> > PANIC:  invalid xlog record length 0
>> >> >
>> >>
>> >> fwiw, this seems to be triggered by ANALYZE.
>> >> At least i can trigger it by executing ANALYZE on the table (attached
>> >> is a stacktrace of a backend exhibiting the failure)
>> >>
>> >
>> > I'm able to run ANALYSE manually without it dying:
>> >
>>
>> try inserting some data before the ANALYZE, that will force a
>> resummarization which is mentioned in the stack trace of the failure
>
>
> I've tried inserting 1 row then ANALYSE and 10,000 rows then ANALYSE, and in
> both cases there's no error.  But then trying to create the index again
> results in my original error.
>

Ok

So, please confirm if this is the pattern you are following:

CREATE TABLE t1(i int);
INSERT INTO t1 SELECT generate_series(1, 10000);
CREATE INDEX idx1 ON t1 USING minmax (i);

If so, then the attached stack trace (index_failure_thom.txt) should
correspond to the failure you are seeing.

My test was slightly different:

CREATE TABLE t1(i int);
CREATE INDEX idx1 ON t1 USING minmax (i);
INSERT INTO t1 SELECT generate_series(1, 10000);
ANALYZE t1;

and the failure happened at a different point, during resummarization
(attached index_failure_jcm.txt).

But in the end, both failures seem to happen for the same reason: a
record of length 0... at XLogInsert time

#4  XLogInsert at xlog.c:966
#5  mmSetHeapBlockItemptr at mmrevmap.c:169
#6  mm_doinsert at minmax.c:1410

Actually, if you create a temp table, both tests work fine.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157

Attachment

Re: Minmax indexes

From

Alvaro Herrera

Date:

17 September 2013, 21:03:19

Thom Brown wrote:

Thanks for testing.

> Thanks for the patch, but I seem to have immediately hit a snag:
>
> pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
> PANIC:  invalid xlog record length 0

Silly mistake I had already made in another patch.  Here's an
incremental patch which fixes this bug.  Apply this on top of previous
minmax-1.patch.

I also renumbered the duplicate OID pointed out by Peter, and fixed the
two compiler warnings reported by Jaime.

Note you'll need to re-initdb in order to get the right catalog entries.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-2-incr.patch

Re: Minmax indexes

From

Thom Brown

Date:

17 September 2013, 21:17:54

On 17 September 2013 22:03, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Thom Brown wrote:

Thanks for testing.

> Thanks for the patch, but I seem to have immediately hit a snag:
>
> pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
> PANIC: invalid xlog record length 0

Silly mistake I had already made in another patch. Here's an
incremental patch which fixes this bug. Apply this on top of previous
minmax-1.patch.

Thanks.

Hit another issue with exactly the same procedure:

pgbench=# create index minmaxtest on pgbench_accounts using minmax (aid);

ERROR: lock 176475 is not held

Thom

Re: Minmax indexes

From

"Erik Rijkers"

Date:

17 September 2013, 21:27:34

On Tue, September 17, 2013 23:03, Alvaro Herrera wrote:

> [minmax-1.patch. + minmax-2-incr.patch. (and initdb)]


The patches apply and compile OK.

I've not yet really tested; I just wanted to mention that  make check  gives the following differences:



*** /home/aardvark/pg_stuff/pg_sandbox/pgsql.minmax/src/test/regress/expected/opr_sanity.out    2013-09-17 23:18:31.427356703 +0200
--- /home/aardvark/pg_stuff/pg_sandbox/pgsql.minmax/src/test/regress/results/opr_sanity.out    2013-09-17 23:20:48.208150824 +0200
***************
*** 1076,1081 ****
--- 1076,1086 ----
         2742 |            2 | @@@
         2742 |            3 | <@
         2742 |            4 | =
+        3847 |            1 | <
+        3847 |            2 | <=
+        3847 |            3 | =
+        3847 |            4 | >=
+        3847 |            5 | >
         4000 |            1 | <<
         4000 |            1 | ~<~
         4000 |            2 | &<
***************
*** 1098,1104 ****
         4000 |           15 | >
         4000 |           16 | @>
         4000 |           18 | =
! (62 rows)
  
  -- Check that all opclass search operators have selectivity estimators.
  -- This is not absolutely required, but it seems a reasonable thing
--- 1103,1109 ----
         4000 |           15 | >
         4000 |           16 | @>
         4000 |           18 | =
! (67 rows)
  
  -- Check that all opclass search operators have selectivity estimators.
  -- This is not absolutely required, but it seems a reasonable thing
***************
*** 1272,1280 ****
  WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
  GROUP BY amname, amsupport, opcname, amprocfamily
  HAVING count(*) != amsupport OR amprocfamily IS NULL;
!  amname | opcname | count
! --------+---------+-------
! (0 rows)
  
  SELECT amname, opcname, count(*)
  FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
--- 1277,1288 ----
  WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
  GROUP BY amname, amsupport, opcname, amprocfamily
  HAVING count(*) != amsupport OR amprocfamily IS NULL;
!  amname |   opcname   | count
! --------+-------------+-------
!  minmax | int4_ops    |     1
!  minmax | text_ops    |     1
!  minmax | numeric_ops |     1
! (3 rows)
  
  SELECT amname, opcname, count(*)
  FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid

======================================================================




Erik Rijkers

Re: Minmax indexes

From

Alvaro Herrera

Date:

17 September 2013, 21:52:34

Thom Brown wrote:

> Hit another issue with exactly the same procedure:
>
> pgbench=# create index minmaxtest on pgbench_accounts using minmax (aid);
> ERROR:  lock 176475 is not held

That's what I get for restructuring the way buffers are acquired to use
the FSM, and then neglecting to test creation on decently-sized indexes.
Fix attached.

I just realized that xlog replay is also broken.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-3-incr.patch

Re: Minmax indexes

From

Alvaro Herrera

Date:

17 September 2013, 22:00:10

Erik Rijkers wrote:
> On Tue, September 17, 2013 23:03, Alvaro Herrera wrote:
> 
> > [minmax-1.patch. + minmax-2-incr.patch. (and initdb)]
> 
> 
> The patches apply and compile OK.
> 
> I've not yet really tested; I just wanted to mention that  make check  gives the following differences:

Oops, I forgot to update the expected file.  I meant to comment on this
when submitting minmax-2-incr.patch and forgot.  First, those extra five
operators are supposed to be there; the expected file needs an update.  As
for this:

> --- 1277,1288 ----
>   WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
>   GROUP BY amname, amsupport, opcname, amprocfamily
>   HAVING count(*) != amsupport OR amprocfamily IS NULL;
> !  amname |   opcname   | count
> ! --------+-------------+-------
> !  minmax | int4_ops    |     1
> !  minmax | text_ops    |     1
> !  minmax | numeric_ops |     1
> ! (3 rows)

I think the problem is that the query is wrong.  This is the complete query:

SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
     LEFT JOIN pg_amproc p ON amprocfamily = opcfamily AND
         amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
HAVING count(*) != amsupport OR amprocfamily IS NULL;

It should be, instead, this:

SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
     LEFT JOIN pg_amproc p ON amprocfamily = opcfamily AND
         amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
HAVING count(*) != amsupport AND (amprocfamily IS NOT NULL);

This query is supposed to check that there are no opclasses with
mismatching number of support procedures; but if the left join returns a
null-extended row for pg_amproc, that means there is no support proc,
yet count(*) will return 1.  So count(*) will not match amsupport, and
the row is supposed to be excluded by the amprocfamily IS NULL clause in
HAVING.

Both queries return empty in HEAD, but only the second one correctly
returns empty with the patch applied.
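
The count(*) behaviour on null-extended rows can be reproduced outside PostgreSQL. The following is a minimal sketch using Python's sqlite3 module with simplified stand-in tables (`am`, `opclass`, `amproc`); these are assumptions for illustration, not the real catalogs, and the join conditions are reduced to the family match only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for pg_am, pg_opclass and pg_amproc.
    CREATE TABLE am (amname TEXT, amsupport INTEGER);
    CREATE TABLE opclass (opcname TEXT, opcmethod TEXT, opcfamily INTEGER);
    CREATE TABLE amproc (amprocfamily INTEGER, amprocnum INTEGER);

    -- 'hash' needs one support proc and has it; 'minmax' needs none.
    INSERT INTO am VALUES ('hash', 1), ('minmax', 0);
    INSERT INTO opclass VALUES ('int4_ops', 'hash', 10), ('int4_ops', 'minmax', 20);
    INSERT INTO amproc VALUES (10, 1);
""")

BASE = """
    SELECT amname, opcname, count(*)
    FROM am JOIN opclass ON opcmethod = amname
         LEFT JOIN amproc ON amprocfamily = opcfamily
    GROUP BY amname, amsupport, opcname, amprocfamily
    HAVING {having}
"""

# The null-extended row for 'minmax' still counts as one row, so
# count(*) = 1 while amsupport = 0, and the first form reports it.
original = conn.execute(
    BASE.format(having="count(*) != amsupport OR amprocfamily IS NULL")
).fetchall()

# The second form ignores groups that have no amproc rows at all.
proposed = conn.execute(
    BASE.format(having="count(*) != amsupport AND amprocfamily IS NOT NULL")
).fetchall()
```

In this toy setup the first HAVING clause flags the opclass with zero support procedures, and the second returns nothing.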

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Jaime Casanova

Date:

18 September 2013, 07:03:00

On Tue, Sep 17, 2013 at 4:03 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Thom Brown wrote:
>
> Thanks for testing.
>
>> Thanks for the patch, but I seem to have immediately hit a snag:
>>
>> pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
>> PANIC:  invalid xlog record length 0
>
> Silly mistake I had already made in another patch.  Here's an
> incremental patch which fixes this bug.  Apply this on top of previous
> minmax-1.patch.
>
> I also renumbered the duplicate OID pointed out by Peter, and fixed the
> two compiler warnings reported by Jaime.
>
> Note you'll need to re-initdb in order to get the right catalog entries.
>

Hi,

Found another problem with these steps:

create table t1 (i int);
create index idx_t1_i on t1 using minmax(i);
insert into t1 select generate_series(1, 2000000);
ERROR:  could not read block 1 in file "base/12645/16397_vm": read
only 0 of 8192 bytes
STATEMENT:  insert into t1 select generate_series(1, 2000000);
ERROR:  could not read block 1 in file "base/12645/16397_vm": read
only 0 of 8192 bytes

After that, I keep receiving these messages (when autovacuum tries to
vacuum this table):

ERROR:  could not truncate file "base/12645/16397_vm" to 2 blocks:
it's only 1 blocks now
CONTEXT:  automatic vacuum of table "postgres.public.t1"
ERROR:  could not truncate file "base/12645/16397_vm" to 2 blocks:
it's only 1 blocks now
CONTEXT:  automatic vacuum of table "postgres.public.t1"

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157

Re: Minmax indexes

From

Alvaro Herrera

Date:

24 September 2013, 22:14:29

Jaime Casanova wrote:

> Found another problem with these steps:
>
> create table t1 (i int);
> create index idx_t1_i on t1 using minmax(i);
> insert into t1 select generate_series(1, 2000000);
> ERROR:  could not read block 1 in file "base/12645/16397_vm": read
> only 0 of 8192 bytes

Thanks.  This was a trivial off-by-one bug; fixed in the attached patch.
While studying it, I noticed that I was also failing to notice extension
of the fork by another process.  I have tried to fix that also in the
current patch, but I'm afraid that a fully robust solution for this will
involve having a cached fork size in the index's relcache entry -- just
like we have smgr_vm_nblocks.  In fact, since the revmap fork is
currently reusing the VM forknum, I might even be able to use the same
variable to keep track of the fork size.  But I don't really like this
bit of reusing the VM forknum for revmap, so I've refrained from
extending that assumption into further code for the time being.

There was also a bug that we would try to initialize a revmap page twice
during recovery, if two backends thought they needed to extend it; that
would cause the data written by the first extender to be lost.

This patch applies on top of the two previous incremental patches.  I
will send a full patch later, including all those fixes and the fix for
the opr_sanity regression test.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-4-incr.patch

Re: Minmax indexes

From

"Erik Rijkers"

Date:

25 September 2013, 06:40:05

On Wed, September 25, 2013 00:14, Alvaro Herrera wrote:
> [minmax-4-incr.patch]

After a  --data-checksums initdb (successful), the following error came up:



after the statement: create index t_minmax_idx on t using minmax (r);

WARNING:  page verification failed, calculated checksum 25951 but expected 0
ERROR:  invalid page in block 1 of relation base/21324/26267_vm

It happens reliably, every time I run the program.

Below is the whole program that I used.


Thanks,

Erik Rijkers






#!/bin/sh

t=t

if [ 1 -eq 1 ]; then
   echo "
       drop table if exists $t ;
       create table $t
           as
           select i, cast( random() * 10^9 as integer ) as r
           from generate_series(1, 1000000)  as f(i) ;
   analyze $t;
   table $t limit 5;
   select count(*) from $t;
   explain analyze select min(r), max(r) from $t;
                           select min(r), max(r) from $t;

   create index ${t}_minmax_idx on $t using minmax (r);
   analyze $t;

   explain analyze select min(r), max(r) from $t;
                           select min(r), max(r) from $t;
   " | psql

fi

Re: Minmax indexes

From

Amit Kapila

Date:

25 September 2013, 07:48:22

On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Hi,
>
> Here's a reviewable version of what I've dubbed Minmax indexes.  Some
> people said they would like to use some other name for this feature, but
> I have yet to hear usable ideas, so for now I will keep calling them
> this way.  I'm open to proposals, but if you pick something that cannot
> be abbreviated "mm" I might have you prepare a rebased version which
> renames the files and structs.
>
> The implementation here has been simplified from what I originally
> proposed at 20130614222805.GZ5491@eldon.alvh.no-ip.org -- in particular,
> I noticed that there's no need to involve aggregate functions at all; we
> can just use inequality operators.  So the pg_amproc entries are gone;
> only the pg_amop entries are necessary.
>
> I've somewhat punted on the question of doing resummarization separately
> from vacuuming.  Right now, resummarization (as well as other necessary
> index cleanup) takes place in amvacuumcleanup.  This is not optimal; I
> have stated elsewhere that I'd like to create separate maintenance
> actions that can be carried out by autovacuum.  That would be useful
> both for Minmax indexes and GIN indexes (pending insertion list); maybe
> others.  That's not part of this patch, however.
>
> The design of this stuff is in the file "minmax-proposal" at the top of
> the tree.  That file is up to date, though it still contains some open
> questions that were present in the original proposal.  (I have not fixed
> some bogosities pointed out by Noah, for instance.  I will do that
> shortly.)  In a final version, that file would be applied as
> src/backend/access/minmax/README, most likely.
>
> One area on which I needed to modify core code is IndexBuildHeapScan.  I
> needed a version that was able to scan only a certain range of pages,
> not the entire table, so I introduced a new IndexBuildHeapRangeScan, and
> added a quick "heap_scansetlimits" function.  I haven't tested that this
> works outside of the HeapRangeScan thingy, so it's probably completely
> bogus; I'm open to suggestions if people think this should be
> implemented differently.  In any case, keeping that implementation
> together with vanilla IndexBuildHeapScan makes a lot of sense.
>
> One thing still to tackle is when to mark ranges as unsummarized.  Right
> now, any new tuple on a page range would cause a new index entry to be
> created and a new revmap update.  This would cause huge index bloat if,
> say, a page is emptied and vacuumed and filled with new tuples with
> increasing values outside the original range; each new tuple would
> create a new index tuple.  I have two ideas about this (1. mark range as
> unsummarized if 3rd time we touch the same page range;

Why only at the 3rd time?  Doesn't it need to be precise?  If someone
inserts a row having a value greater than the max value of the
corresponding index tuple, then that index tuple's max value needs to be
updated, and I think it is updated with the help of the validity map.

For example, considering we need to store the below info for each index
tuple:

  In each index tuple (corresponding to one page range), we store:
   - first block this tuple applies to
   - last block this tuple applies to
   - for each indexed column:
     * min() value across all tuples in the range
     * max() value across all tuples in the range

Assume the first and last block for an index tuple are the same (assume
block no. 'x') and the min value is 5 and the max is 10.  Now the user
inserts/updates a value in block 'x' such that the max value of the index
col. is 11.  If we don't update the corresponding index tuple, or at least
invalidate it, won't it lead to wrong results?
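
The concern in this example can be sketched as follows (an illustration only; `ranges_to_scan` and the dictionaries are hypothetical and do not reflect the patch's data structures). A scan that trusts a stale summary skips the range and misses the new row.

```python
def ranges_to_scan(summaries, value):
    """Return the page ranges whose [min, max] summary could contain value."""
    return [rng for rng, (lo, hi) in sorted(summaries.items()) if lo <= value <= hi]


# Block 'x' is modelled as range 0, summarized as min 5, max 10.
summaries = {0: (5, 10)}
heap = {0: [5, 7, 10]}

# A new row with value 11 lands in the same block.
heap[0].append(11)

# Summary left untouched: the range is skipped and the row is missed.
stale = ranges_to_scan(summaries, 11)

# Summary widened at insert time: the range is scanned again.
lo, hi = summaries[0]
summaries[0] = (min(lo, 11), max(hi, 11))
fresh = ranges_to_scan(summaries, 11)
```

Marking the range as unsummarized would have the same effect on correctness as widening it, since an unsummarized range has to be scanned unconditionally.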

> 2. vacuum the
> affected index page if it's full, so we can maintain the index always up
> to date without causing undue bloat), but I haven't implemented
> anything yet.
>
> The "amcostestimate" routine is completely bogus; right now it returns
> constant 0, meaning the index is always chosen if it exists.

I think for the first version you might want to keep things simple, but
there should be some way for the optimizer to select this index.  So rather
than always choosing it if it is present, we can make the optimizer choose
it when someone sets enable_minmax index to true.

How about keeping this up to date during foreground operations?
Vacuum/maintainer tasks that maintain things usually have problems with
bloat, and then we need to optimize or work around issues.  A lot of people
have raised this or a similar point previously, and from what I read you
are of the opinion that it would be slow.  I really don't think it can be
so slow that adding so much handling to get it up to date via some
maintainer task is worthwhile.  Currently there are systems like Oracle
where index clean-up is mainly done during foreground operations, so this
alone cannot be a reason for slowness.

Comparing the logic with IOS is also not completely right, as for IOS we
need to know each tuple's visibility, which is not the case here.

Now it can happen that the min and max values are sometimes not right
because the operation is later rolled back, but I think such cases will be
rare, and we can find some way to handle them, maybe in a maintainer task
only; the handling will be much simpler.

 On Windows, patch gives below compilation errors:
src\backend\access\minmax\mmtuple.c(96): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(96): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(96): error C2133: 'values' : unknown size
src\backend\access\minmax\mmtuple.c(97): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(97): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(97): error C2133: 'nulls' : unknown size
src\backend\access\minmax\mmtuple.c(102): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(102): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(102): error C2133: 'phony_nullbitmap' : unknown size
src\backend\access\minmax\mmtuple.c(110): warning C4034: sizeof returns 0
src\backend\access\minmax\mmtuple.c(246): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(246): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(246): error C2133: 'values' : unknown size
src\backend\access\minmax\mmtuple.c(247): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(247): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(247): error C2133: 'allnulls' : unknown size
src\backend\access\minmax\mmtuple.c(248): error C2057: expected constant expression
src\backend\access\minmax\mmtuple.c(248): error C2466: cannot allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(248): error C2133: 'hasnulls' : unknown size



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Minmax indexes

From

Alvaro Herrera

Date:

25 September 2013, 20:17:09

Amit Kapila escribió:
> On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > One thing still to tackle is when to mark ranges as unsummarized.  Right
> > now, any new tuple on a page range would cause a new index entry to be
> > created and a new revmap update.  This would cause huge index bloat if,
> > say, a page is emptied and vacuumed and filled with new tuples with
> > increasing values outside the original range; each new tuple would
> > create a new index tuple.  I have two ideas about this (1. mark range as
> > unsummarized if 3rd time we touch the same page range;
>
>    Why only at 3rd time?
>    Doesn't it need to be precise, like if someone inserts a row having
> value greater than max value of corresponding index tuple,
>    then that index tuple's corresponding max value needs to be updated
> and I think its updated with the help of validity map.

Of course.  Note I no longer have the concept of a validity map; I have
switched things to use a "reverse range map", or revmap for short.  The
revmap is responsible for mapping each page number to an individual
index TID.  If the TID stored in the revmap is InvalidTid, that means
the range is not summarized.  Unsummarized ranges are always considered as
"match query quals", and thus all tuples in them are returned in the
bitmap for heap recheck.
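
A minimal sketch of the lookup described above, written as a toy Python model and not taken from the patch. The names INVALID_TID, Summary and candidate_ranges are assumptions made for illustration only. It shows why a range whose revmap entry is invalid has to be returned for heap recheck: there is no summary that could rule it out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

INVALID_TID = None  # stand-in for InvalidTid (assumption for this toy model)


@dataclass
class Summary:
    """Toy index tuple: min/max of one column over one page range."""
    min_value: int
    max_value: int


def candidate_ranges(revmap: List[Optional[int]],
                     index_tuples: Dict[int, Summary],
                     lo: int, hi: int) -> List[int]:
    """Return the page-range numbers that may hold values in [lo, hi].

    revmap[i] is the (toy) TID of the index tuple summarizing range i,
    or INVALID_TID when the range is not summarized.
    """
    result = []
    for range_no, tid in enumerate(revmap):
        if tid is INVALID_TID:
            # No summary available: the range cannot be excluded,
            # so its tuples must be rechecked against the heap.
            result.append(range_no)
            continue
        summary = index_tuples[tid]
        # Keep the range only if [min, max] overlaps [lo, hi].
        if summary.min_value <= hi and summary.max_value >= lo:
            result.append(range_no)
    return result


# Range 0 is summarized as 5..10, range 1 is unsummarized, range 2 is 50..60.
revmap = [100, INVALID_TID, 101]
index_tuples = {100: Summary(5, 10), 101: Summary(50, 60)}
matches = candidate_ranges(revmap, index_tuples, 8, 20)
```

In this model the unsummarized range 1 shows up in every result, whatever the query bounds are.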

The way it works currently, is that any tuple insert (that's outside the
bounds of the current index tuple) causes a new index tuple to be
created, and the revmap is updated to point to the new index tuple.  The
old index tuple is orphaned and will be deleted at next vacuum.  This
works fine.  However the problem is excess orphaned tuples; I don't want
a long series of updates to create many orphaned dead tuples.  Instead I
would like the system to, at some point, stop creating new index tuples
and instead set the revmap to InvalidTid.  That would stop the index
bloat.
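
The behaviour described above can be modelled roughly as follows. This is a toy Python sketch, not the patch's code: ToyMinmaxIndex, MAX_REPLACEMENTS and the way TIDs are numbered are all hypothetical. The cut-off of three replacements only echoes the "3rd time we touch the same page range" idea quoted earlier, which is stated there as not yet implemented.

```python
from typing import Dict, List, Optional, Tuple

MAX_REPLACEMENTS = 3  # hypothetical threshold, echoing the "3rd time" idea


class ToyMinmaxIndex:
    """Toy model: a revmap plus index tuples for a single indexed column."""

    def __init__(self, nranges: int) -> None:
        self.revmap: List[Optional[int]] = [None] * nranges  # None = InvalidTid
        self.tuples: Dict[int, Tuple[int, int]] = {}         # tid -> (min, max)
        self.orphans: List[int] = []                         # awaiting vacuum
        self.replacements = [0] * nranges
        self._next_tid = 0

    def _new_tuple(self, lo: int, hi: int) -> int:
        tid = self._next_tid
        self._next_tid += 1
        self.tuples[tid] = (lo, hi)
        return tid

    def summarize(self, range_no: int, lo: int, hi: int) -> None:
        self.revmap[range_no] = self._new_tuple(lo, hi)
        self.replacements[range_no] = 0

    def insert(self, range_no: int, value: int) -> None:
        tid = self.revmap[range_no]
        if tid is None:
            return                      # unsummarized: nothing to maintain
        lo, hi = self.tuples[tid]
        if lo <= value <= hi:
            return                      # still inside the stored bounds
        self.orphans.append(tid)        # old tuple is left for vacuum
        self.replacements[range_no] += 1
        if self.replacements[range_no] >= MAX_REPLACEMENTS:
            self.revmap[range_no] = None    # give up: stop creating tuples
        else:
            self.revmap[range_no] = self._new_tuple(min(lo, value),
                                                    max(hi, value))


idx = ToyMinmaxIndex(nranges=1)
idx.summarize(0, 5, 10)
for v in (11, 12, 13, 14):              # steadily increasing values
    idx.insert(0, v)
```

After the third out-of-bounds insert the range is marked unsummarized, so the fourth insert creates nothing and the number of orphaned tuples stays bounded.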

>    For example:
>    considering we need to store below info for each index tuple:
>    In each index tuple (corresponding to one page range), we store:
>     - first block this tuple applies to
>     - last block this tuple applies to
>     - for each indexed column:
>       * min() value across all tuples in the range
>       * max() value across all tuples in the range
>
>    Assume first and last block for index tuple is same (assume block
> no. 'x') and min value is 5 and max is 10.
>    Now user insert/update value in block 'x' such that max value of
> index col. is 11, if we don't update corresponding
>    index tuple or at least invalidate it, won't it lead to wrong results?

Sure, that would result in wrong results.  Fortunately that's not how I
am suggesting to do it.

I note you're reading an old version of the design.  I realize now that
this is my mistake because instead of posting the new design in the
cover letter for the patch, I only put it in the "minmax-proposal" file.
Please give that file a read to see how the design differs from the
design I originally posted in the old thread.

> > The "amcostestimate" routine is completely bogus; right now it returns
> > constant 0, meaning the index is always chosen if it exists.
>
>   I think for first version, you might want to keep things simple, but
> there should be some way for optimizer to select this index.
>   So rather than choose if it is present, we can make optimizer choose
> when some-one says set enable_minmax index to true.

Well, enable_bitmapscan already disables minmax indexes, just like it
disables other indexes.

>   How about keeping this up-to-date during foreground operations.
> Vacuum/Maintainer task maintaining things usually have problems of
> bloat and
>   then we need optimize/workaround issues.
>   Lot of people have raised this or similar point previously and what
> I read you are of opinion that it seems to be slow.

Well, the current code does keep the index up to date -- I did choose to
implement what people suggested :-)

>   Now it can so happen that min and max values are sometimes not right
> because later the operation is rolled back, but I think such cases
> will
>   be less and we can find some way to handle such cases may be
> maintainer task only, but the handling will be quite simpler.

Agreed.

>   On Windows, patch gives below compilation errors:
>   src\backend\access\minmax\mmtuple.c(96): error C2057: expected
> constant expression

I have fixed all these compile errors (fix attached).  Thanks for
reporting them.  I'll post a new version shortly.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-5a-incr.patch

Re: Minmax indexes

From

Alvaro Herrera

Date:

25 September 2013, 20:23:54

Erik Rijkers wrote:

> After a  --data-checksums initdb (successful), the following error came up:
>
> after the statement: create index t_minmax_idx on t using minmax (r);
>
> WARNING:  page verification failed, calculated checksum 25951 but expected 0
> ERROR:  invalid page in block 1 of relation base/21324/26267_vm
>
> it happens reliably. every time I run the program.

Thanks for the report.  That's fixed with the attached.

> Below is the whole program that I used.

Hmm, this test program shows that you're trying to use the index to
optimize min() and max() queries, but that's not what these indexes do.
You will need to use operators > >= = <= < (or BETWEEN, which is the
same thing) to see your index in action.
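
To make the distinction concrete, here is a rough Python sketch of how a stored [min, max] pair can answer "could this page range contain a matching row?" for each of those operators. The function names are invented for illustration and nothing here is taken from the patch.

```python
def range_may_match(min_value: int, max_value: int,
                    op: str, constant: int) -> bool:
    """Could any value in [min_value, max_value] satisfy `value op constant`?"""
    if op == "<":
        return min_value < constant
    if op == "<=":
        return min_value <= constant
    if op == "=":
        return min_value <= constant <= max_value
    if op == ">=":
        return max_value >= constant
    if op == ">":
        return max_value > constant
    raise ValueError("unsupported operator: " + op)


def range_may_match_between(min_value: int, max_value: int,
                            lo: int, hi: int) -> bool:
    """BETWEEN lo AND hi is just `>= lo` combined with `<= hi`."""
    return (range_may_match(min_value, max_value, ">=", lo)
            and range_may_match(min_value, max_value, "<=", hi))


# A page range summarized as min=5, max=10:
skip_gt_10 = not range_may_match(5, 10, ">", 10)       # range can be skipped
keep_ge_10 = range_may_match(5, 10, ">=", 10)          # range must be read
skip_between = not range_may_match_between(5, 10, 11, 20)
```

The summary only ever says "skip" or "maybe"; rows in a "maybe" range still have to be rechecked against the heap.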

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-5b-incr.patch

Re: Minmax indexes

From

Alvaro Herrera

Date:

25 September 2013, 20:34:47

Here's an updated version of this patch, with fixes to all the bugs
reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-5.patch

Re: Minmax indexes

From

"Erik Rijkers"

Date:

25 September 2013, 22:34:51

On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:

> [minmax-5.patch]

I have the impression it's not quite working correctly.

The attached program returns different results for different values of enable_bitmapscan (consistently).

( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected?  For a
100M row table, max_locks_per_transaction=1024 was not enough; I set it to 2048.  Might be worth some documentation,
eventually. )

From eyeballing the results it looks like the minmax result (i.e. the result set with enable_bitmapscan = 1) yields
only the last part, because only the 'last' rows seem to be present (see the values in column i in table tmm in the
attached program).

Thanks,

Erikjan Rijkers

Attachment

test.sh

Re: Minmax indexes

From

"Erik Rijkers"

Date:

26 September 2013, 06:54:59

On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
> On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
>
>> [minmax-5.patch]
>
> I have the impression it's not quite working correctly.
>
> The attached program returns different results for different values of enable_bitmapscan (consistently).
>
> ( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected?  For a
> 100M row table, max_locks_per_transaction=1024 was not enough; I set it to 2048.  Might be worth some documentation,
> eventually.)
>
> From eyeballing the results it looks like the minmax result (i.e. the result set with enable_bitmapscan = 1) yields
> only the last part because the only 'last' rows seem to be present (see the values in column i in table tmm in the
> attached program).

Looking back at that, I realize I should have added a bit more detail on that test.sh program and its output
(attached on previous mail).

test.sh creates a table tmm and a minmax index on that table:

testdb=# \d tmm
      Table "public.tmm"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer |
 r      | integer |
Indexes:
    "tmm_minmax_idx" minmax (r)


The following shows the problem: the same search with the minmax index on versus off gives different result sets:

testdb=# set enable_bitmapscan=0; select count(*) from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.473 ms
 count
-------
  1261
(1 row)

Time: 7.764 ms
testdb=# set enable_bitmapscan=1; select count(*) from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.471 ms
 count
-------
     3
(1 row)

Time: 1.014 ms



testdb=# set enable_bitmapscan =1; select * from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.615 ms
  i   |     r
------+-----------
 9945 |  45405603
 9951 | 102552485
 9966 |  63763962
(3 rows)

Time: 0.984 ms

testdb=# set enable_bitmapscan=0; select * from ( select * from tmm where r between symmetric 19494484 and 145288238
order by i desc limit 10) f order by i ;
SET
Time: 0.470 ms
  i   |     r
------+-----------
 9852 | 114996906
 9858 |  69907169
 9875 |  43341583
 9894 | 127862657
 9895 |  44740033
 9911 |  51797553
 9916 |  58538774
 9945 |  45405603
 9951 | 102552485
 9966 |  63763962
(10 rows)

Time: 8.704 ms
testdb=#

If enable_bitmapscan=1 (i.e. using the minmax index), then only some values are retrieved (in this case 3 rows).
It turns out those are always the last N rows of the full resultset (i.e. with enable_bitmapscan=0).


Erikjan Rijkers

Re: Minmax indexes

From

Robert Haas

Date:

26 September 2013, 17:01:01

On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Here's an updated version of this patch, with fixes to all the bugs
> reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
> Amit Kapila for the reports.

I'm not very happy with the use of a separate relation fork for
storing this data.  Using an existing fork number rather than creating
a new one avoids some of them (like, the fact that we loop over all
known fork numbers in various places, and adding another one will add
latency in all of those places, particularly when there is a system
call in the loop) but not all of them (like, what happens if the index
is unlogged?  we have provisions to reset the main fork but any others
are just removed; is that OK?), and it also creates some new ones
(like, files having misleading names).

More generally, I fear we really opened a bag of worms with this
relation fork stuff.  Every time I turn around I run into a problem
that could be solved by adding another relation fork.  I'm not
terribly sure that it was a good idea to go that way to begin with,
because we've got customers who are unhappy about 3 files/heap due to
inode consumption and slow directory lookups.  I think we would have
been smarter to devise a strategy for storing the fsm and vm pages
within the main fork in some fashion, and I tend to think that's the
right solution here as well.  Of course, it may be hopeless to put the
worms back in the can at this point, and surely these indexes will be
lightly used compared to heaps, so it's not incrementally exacerbating
the problems all that much.  But I still feel uneasy about widening
use of that mechanism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Alvaro Herrera

Date:

26 September 2013, 17:39:23

Robert Haas escribió:
> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Here's an updated version of this patch, with fixes to all the bugs
> > reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
> > Amit Kapila for the reports.
> 
> I'm not very happy with the use of a separate relation fork for
> storing this data.

I understand this opinion, as I considered it myself while developing
it.  Also, GIN already does things this way.  Perhaps I should just bite
the bullet and do this.

> Using an existing fork number rather than creating
> a new one avoids some of them (like, the fact that we loop over all
> known fork numbers in various places, and adding another one will add
> latency in all of those places, particularly when there is a system
> call in the loop) but not all of them (like, what happens if the index
> is unlogged?  we have provisions to reset the main fork but any others
> are just removed; is that OK?), and it also creates some new ones
> (like, files having misleading names).

All good points.

Index scans will normally access the revmap in sequential fashion; it
would be enough to chain revmap pages, keeping a single block number in
the metapage pointing to the first one, and subsequent ones are accessed
from a "next" block number in each page.  However, heap insertion might
need to access a random revmap page, and this would be too slow.  I
think it would be enough to keep an array of block numbers in the
index's metapage; the metapage would be share locked on every scan and
insert, but that's not a big deal because exclusive lock would only be
needed on the metapage to extend the revmap, which would be a very
infrequent operation.
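
The difference between the two layouts can be sketched as follows. This is a toy Python model under stated assumptions: PAGES_PER_RANGE, ENTRIES_PER_REVMAP_PAGE, the block numbers and both function names are invented for illustration, and real page layouts would differ.

```python
from typing import Dict, List, Optional, Tuple

PAGES_PER_RANGE = 128          # hypothetical heap pages per summarized range
ENTRIES_PER_REVMAP_PAGE = 4    # tiny on purpose, to keep the example readable


def locate_by_array(heap_block: int,
                    metapage_array: List[int]) -> Optional[Tuple[int, int]]:
    """Array-in-metapage layout: one lookup finds the revmap page.

    Returns (revmap block number, offset within that page), or None if
    the revmap has not been extended far enough to cover heap_block.
    """
    range_no = heap_block // PAGES_PER_RANGE
    slot, offset = divmod(range_no, ENTRIES_PER_REVMAP_PAGE)
    if slot >= len(metapage_array):
        return None
    return metapage_array[slot], offset


def locate_by_chain(heap_block: int, first_block: int,
                    next_block: Dict[int, int]) -> Tuple[int, int, int]:
    """Chained layout: follow "next" pointers from the first revmap page.

    Returns (revmap block number, offset, number of revmap pages read).
    """
    range_no = heap_block // PAGES_PER_RANGE
    slot, offset = divmod(range_no, ENTRIES_PER_REVMAP_PAGE)
    block, pages_read = first_block, 1
    for _ in range(slot):
        block = next_block[block]
        pages_read += 1
    return block, offset, pages_read


# Three revmap pages living at index blocks 1, 7 and 12.
metapage_array = [1, 7, 12]
next_block = {1: 7, 7: 12}

heap_block = 9 * PAGES_PER_RANGE              # falls in page range number 9
direct = locate_by_array(heap_block, metapage_array)
chained = locate_by_chain(heap_block, 1, next_block)
```

Both layouts find the same entry, but the chained one reads every earlier revmap page on the way, which is the cost a random heap insertion would pay.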

As this will require some rework to this code, I think it's fair to mark
this as returned with feedback for the time being.  I will return with
an updated version soon, fixing the relation fork issue as well as the
locking and visibility bugs reported by Erik.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Alvaro Herrera

Date:

26 September 2013, 17:40:53

Erik Rijkers wrote:

> I have the impression it's not quite working correctly.
> 
> The attached program returns different results for different values of enable_bitmapscan (consistently).

Clearly there's some bug somewhere.  I'll investigate it more.

> ( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected?  For a
> 100M row table, max_locks_per_transaction=1024 was not enough; I set it to 2048.  Might be worth some documentation,
> eventually.)

Not documentation -- that would also be a bug which needs to be fixed.

Thanks for testing.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Jim Nasby

Date:

26 September 2013, 18:58:58

On 9/26/13 12:00 PM, Robert Haas wrote:
> More generally, I fear we really opened a bag of worms with this
> relation fork stuff.  Every time I turn around I run into a problem
> that could be solved by adding another relation fork.  I'm not
> terribly sure that it was a good idea to go that way to begin with,
> because we've got customers who are unhappy about 3 files/heap due to
> inode consumption and slow directory lookups.  I think we would have
> been smarter to devise a strategy for storing the fsm and vm pages
> within the main fork in some fashion, and I tend to think that's the
> right solution here as well.  Of course, it may be hopeless to put the
> worms back in the can at this point, and surely these indexes will be
> lightly used compared to heaps, so it's not incrementally exacerbating
> the problems all that much.  But I still feel uneasy about widening
> use of that mechanism.

Why would we add additional code complexity when forks do the trick? That seems like a step backwards, not forward.

If the only complaint about forks is directory traversal why wouldn't we go with the well established practice of using
multiple directories instead of glomming everything into one place?
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net

Re: Minmax indexes

From

Robert Haas

Date:

26 September 2013, 19:46:34

On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby <jim@nasby.net> wrote:
> Why would we add additional code complexity when forks do the trick? That
> seems like a step backwards, not forward.

Well, they sorta do the trick, but see e.g. commit
ece01aae479227d9836294b287d872c5a6146a11.  I doubt that's the only
code that's poorly-optimized for multiple forks; IOW, every time
someone adds a new fork, there's a system-wide cost to that, even if
that fork is only used in a tiny percentage of the relations that
exist in the system.

It's tempting to think that we can use the fork mechanism every time
we have multiple logical "streams" of blocks within a relation and
don't want to figure out a way to multiplex them onto the same
physical file.  However, the reality is that the fork mechanism isn't
up to the job.  I certainly don't want to imply that we shouldn't have
gone in that direction - both the fsm and the vm are huge steps
forward, and we wouldn't have gotten them in 8.4 without that
mechanism.  But they haven't been entirely without their own pain,
too, and that pain is going to grow the more we push in the direction
of relying on forks.

> If the only complaint about forks is directory traversal why wouldn't we go
> with the well established practice of using multiple directories instead of
> glomming everything into one place?

That's not the only complaint about forks - but I support what you're
proposing there anyhow, because it will be helpful to users with lots
of relations regardless of what we do or do not decide to do about
forks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Amit Kapila

Date:

27 September 2013, 06:19:13

On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Amit Kapila escribió:
>> On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>

>
>>   On Windows, patch gives below compilation errors:
>>   src\backend\access\minmax\mmtuple.c(96): error C2057: expected
>> constant expression
>
> I have fixed all these compile errors (fix attached).  Thanks for
> reporting them.  I'll post a new version shortly.
  Thanks for fixing it. In the last few days I have spent some time
reading about minmax or equivalent indexes in other databases (Netezza
and Oracle) and going through some parts of your proposal. It's a
fairly big patch and needs much more time, but I would like to share
the findings/thoughts I have developed so far.

Firstly, about interface and use case: as far as I could understand,
other databases provide this index automatically rather than having a
separate Create Index command, which may be because such an index is
mainly useful when the data is ordered, or distributed in such a way
that it's quite useful for repeatedly executed queries.
You have proposed it as a command, which means the user needs to take
care of it. I find that okay for a first version; later, maybe we can
also have some optimisations so that it can get created automatically.
For the page range, if I read correctly, you have currently used a
#define. Do you want to expose it to the user in some way, like a GUC,
or maintain it internally and assign the right value based on the
performance of different queries?

Operations on this index seem to be very fast: Oracle has this as an
in-memory structure, and I read that in Netezza write operations don't
carry any significant overhead for zone maps as compared to other
indexes. So shouldn't we consider not WAL-logging it?
OTOH, these structures get created automatically in those databases,
so it might be okay there; but if we provide it as a command, the user
might be bothered if he doesn't find it automatically on server
restart.

Few Questions and observations:
1.
+ When a new heap tuple is inserted in a summarized page range, it is possible to
+ compare the existing index tuple with the new heap tuple.  If the heap tuple is
+ outside the minimum/maximum boundaries given by the index tuple for any indexed
+ column (or if the new heap tuple contains null values but the index tuple
+ indicate there are no nulls), it is necessary to create a new index tuple with
+ the new values.  To do this, a new index tuple is inserted, and the reverse range
+ map is updated to point to it.  The old index tuple is left in place, for later
+ garbage collection.

Is there a reason why we can't directly update the value rather than
insert a new index tuple? As I understand it, other indexes like btree
do this because we might need to roll back, but here, even if a
rollback happens after updating the min or max value, it will not
cause any harm (tuple loss).
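
The argument can be illustrated with a small Python sketch. It is only a toy model of the reasoning: widen and may_contain are invented names, and it says nothing about the locking or WAL questions an in-place update would raise in the real code.

```python
from typing import List, Tuple

Bounds = Tuple[int, int]


def widen(bounds: Bounds, value: int) -> Bounds:
    """Update the summary in place so that it also covers `value`."""
    lo, hi = bounds
    return (min(lo, value), max(hi, value))


def may_contain(bounds: Bounds, value: int) -> bool:
    """Index-side test: can the page range be skipped for `value`?"""
    lo, hi = bounds
    return lo <= value <= hi


committed: List[int] = [5, 7, 10]          # rows that really are in the range
bounds = (min(committed), max(committed))  # summary is (5, 10)

# A transaction inserts 11 and widens the summary in place ...
bounds = widen(bounds, 11)
# ... and then rolls back: 11 never becomes visible, the summary stays (5, 11).

# No committed row can be missed, because the bounds only ever grow.
no_false_negative = all(may_contain(bounds, v) for v in committed)
# The only cost is that a search for 11 visits this range for nothing.
false_positive = may_contain(bounds, 11) and 11 not in committed
```

Stale bounds after a rollback are therefore too wide, never too narrow: the heap recheck discards the extra candidates and no tuple is lost.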

2.
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is not summarized.

3.
It might be better if you can mention when the range map will point to
an invalid TID; it's not explained in your proposal, but you have used
it in your proposal to explain some other things.

4.
Range reverse map is good terminology, but isn't Range translation
map better? I don't mind either way; it's just a thought that came to
my mind while understanding the concept of the Range Reverse map.

5.
/*
 * As above, except that instead of scanning the complete heap, only the given
 * range is scanned.  Scan to end-of-rel can be signalled by passing
 * InvalidBlockNumber as end block number.
 */
double
IndexBuildHeapRangeScan(Relation heapRelation,
                        Relation indexRelation,
                        IndexInfo *indexInfo,
                        bool allow_sync,
                        BlockNumber start_blockno,
                        BlockNumber numblocks,
                        IndexBuildCallback callback,
                        void *callback_state)

In the comment you have used "end block number"; which parameter does
it refer to? I could see only start_blockno and numblocks.

6.
Currently you are passing 0 as start block and InvalidBlockNumber as
number of blocks; what's the logic for it?
return IndexBuildHeapRangeScan(heapRelation, indexRelation,
                               indexInfo, allow_sync,
                               0, InvalidBlockNumber,
                               callback, callback_state);

7.
In mmbuildCallback, it only adds a tuple to the minmax index if it
satisfies the page range; otherwise this can lead to the waste of a
big scan in case the page range is large (1280 pages, as you mention
in one of your mails). Why can't we include it at the end of the scan?
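
As one possible reading of items 5 to 7, here is a toy Python sketch of a range scan driven by a start block and a block count, with a sentinel meaning "scan to the end of the relation", and of a build loop that also emits the final, possibly partial, page range when the scan ends. The function names and the heap representation are invented; only the value 0xFFFFFFFF corresponds to PostgreSQL's InvalidBlockNumber. Whether the patch's mmbuildCallback behaves this way is not stated in this thread.

```python
from typing import List, Optional, Tuple

INVALID_BLOCK_NUMBER = 0xFFFFFFFF   # same value as PostgreSQL's InvalidBlockNumber


def blocks_to_scan(nblocks_in_rel: int, start_blockno: int,
                   numblocks: int) -> range:
    """Resolve (start_blockno, numblocks) to the blocks actually scanned."""
    if numblocks == INVALID_BLOCK_NUMBER:
        end = nblocks_in_rel                      # scan to end of relation
    else:
        end = min(start_blockno + numblocks, nblocks_in_rel)
    return range(start_blockno, end)


def build_summaries(heap: List[List[int]],
                    pages_per_range: int) -> List[Tuple[int, int, int]]:
    """Return (range number, min, max) for each non-empty page range.

    `heap` is a toy heap: a list of pages, each a list of column values.
    """
    summaries = []
    cur_range: Optional[int] = None
    lo: Optional[int] = None
    hi: Optional[int] = None
    for blkno in blocks_to_scan(len(heap), 0, INVALID_BLOCK_NUMBER):
        range_no = blkno // pages_per_range
        if range_no != cur_range:
            if lo is not None:
                summaries.append((cur_range, lo, hi))   # range is complete
            cur_range, lo, hi = range_no, None, None
        for value in heap[blkno]:
            lo = value if lo is None else min(lo, value)
            hi = value if hi is None else max(hi, value)
    if lo is not None:
        summaries.append((cur_range, lo, hi))   # flush last range at end of scan
    return summaries


heap = [[5, 9], [7], [20, 3], [8], [42]]        # five pages
summaries = build_summaries(heap, pages_per_range=2)
```

With five pages and two pages per range, the last range covers a single page; without the flush after the loop its values would have been scanned but never summarized.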

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Minmax indexes

From

Amit Kapila

Date:

27 September 2013, 06:40:49

On Fri, Sep 27, 2013 at 11:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Amit Kapila escribió:
>>> On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
>>> <alvherre@2ndquadrant.com> wrote:
>>
>
>>
>>>   On Windows, patch gives below compilation errors:
>>>   src\backend\access\minmax\mmtuple.c(96): error C2057: expected
>>> constant expression
>>
>> I have fixed all these compile errors (fix attached).  Thanks for
>> reporting them.  I'll post a new version shortly.
>
>    Thanks for fixing it. In last few days I had spent some time
> reading about minmax or equivalent indexes in other databases (Netezza
> and Oracle) and going through some parts of your proposal. Its a bit
> bigger patch and needs much more time, but I would like to share my
> findings/thoughts I had developed till now.
>
> Firstly about interface and use case, as far as I could understand
> other databases provide this index automatically rather than having a
> separate Create Index command which may be because such an index can
> be mainly useful when the data is ordered or if it's distributed in
> such a way that it's quite useful for repeatedly executing queries.
> You have proposed it as a command which means user needs to take care
> of it which I find is okay for first version, later may be we can also
> have some optimisations so that it can get created automatically.
> For the page range, If I read correctly, currently you have used hash
> define, do you want to expose it to user in some way like GUC or
> maintain it internally and assign the right value based on performance
> of different queries?
>
> Operations on this index seems to be very fast, like Oracle has this
> as an in-memory structure and I read in Netezza that write operations
> doesn't carry any significant overhead for zone maps as compare to
> other indexes, so shouldn't we consider it to be without WAL logged?
> OTOH I think because these structures get automatically created in
> those databases, so it might be okay but if we provide it as a
> command, then user might be bothered if he didn't find it
> automatically on server restart.
>
> Few Questions and observations:
> 1.
> + When a new heap tuple is inserted in a summarized page range, it is possible to
> + compare the existing index tuple with the new heap tuple.  If the heap tuple is
> + outside the minimum/maximum boundaries given by the index tuple for any indexed
> + column (or if the new heap tuple contains null values but the index tuple
> + indicate there are no nulls), it is necessary to create a new index tuple with
> + the new values.  To do this, a new index tuple is inserted, and the reverse range
> + map is updated to point to it.  The old index tuple is left in place, for later
> + garbage collection.
>
>
> Is there a reason why we can't directly update the value rather then
> new insert in index, as I understand for other indexes like btree
> we do this because we might need to rollback, but here even if after
> updating the min or max value, rollback happens, it will not cause
> any harm (tuple loss).
>
> 2.
> + If the reverse range map points to an invalid TID, the corresponding page range
> + is not summarized.
>
> 3.
> It might be better if you can mention when range map will point to an
> invalid TID, it's not explained in your proposal, but you have used it
> in you proposal to explain some other things.
>
> 4.
> Range reverse map is a good terminology, but isn't Range translation
> map better. I don't mind either way, it's just a thought came to my
> mind while understanding concept of Range Reverse map.
>
> 5.
> /*
>  * As above, except that instead of scanning the complete heap, only the given
>  * range is scanned.  Scan to end-of-rel can be signalled by passing
>  * InvalidBlockNumber as end block number.
>  */
> double
> IndexBuildHeapRangeScan(Relation heapRelation,
> Relation indexRelation,
> IndexInfo *indexInfo,
> bool allow_sync,
> BlockNumber start_blockno,
> BlockNumber numblocks,
> IndexBuildCallback callback,
> void *callback_state)
>
> In comments you have used end block number, which parameter does it
> refer to? I could see only start_blockno and numblocks?
>
> 6.
> currently you are passing 0 as start block and InvalidBlockNumber as
> number of blocks, what's the logic for it?
> return IndexBuildHeapRangeScan(heapRelation, indexRelation,
>   indexInfo, allow_sync,
>   0, InvalidBlockNumber,
>   callback, callback_state);

I got it, I think here it means scan all the pages.

> 7.
> In mmbuildCallback, it only add's tuple to minmax index, if it
> satisfies page range, else this can lead to waste of big scan incase
> page range is large (1280 pages as you mentiones in one of your
> mails). Why can't we include it end of scan?
>
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

Re: Minmax indexes

From

Jim Nasby

Date:

27 September 2013, 18:22:25

On 9/26/13 2:46 PM, Robert Haas wrote:
> On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby <jim@nasby.net> wrote:
>> Why would we add additional code complexity when forks do the trick? That
>> seems like a step backwards, not forward.
>
> Well, they sorta do the trick, but see e.g. commit
> ece01aae479227d9836294b287d872c5a6146a11.  I doubt that's the only
> code that's poorly-optimized for multiple forks; IOW, every time
> someone adds a new fork, there's a system-wide cost to that, even if
> that fork is only used in a tiny percentage of the relations that
> exist in the system.

Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There's improvements
that need to be made. But IMHO that's not reason to automatically avoid forks; we need to consider the cost of improving
them vs what we gain by using them.

Of course there's always some added cost so we shouldn't just blindly use them all over the place without considering
the fork cost either...

> It's tempting to think that we can use the fork mechanism every time
> we have multiple logical "streams" of blocks within a relation and
> don't want to figure out a way to multiplex them onto the same
> physical file.  However, the reality is that the fork mechanism isn't
> up to the job.  I certainly don't want to imply that we shouldn't have
> gone in that direction - both the fsm and the vm are huge steps
> forward, and we wouldn't have gotten them in 8.4 without that
> mechanism.  But they haven't been entirely without their own pain,
> too, and that pain is going to grow the more we push in the direction
> of relying on forks.

Agreed.

Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres...
we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual
relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is
a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably
shouldn't just blindly stick with 1G (though of course initial implementation might do that to reduce complexity, but we
better still consider where we're headed).

>> If the only complaint about forks is directory traversal why wouldn't we go
>> with the well established practice of using multiple directories instead of
>> glomming everything into one place?
>
> That's not the only complaint about forks - but I support what you're
> proposing there anyhow, because it will be helpful to users with lots
> of relations regardless of what we do or do not decide to do about
> forks.
>

-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net

Re: Minmax indexes

From

Greg Stark

Date:

27 September 2013, 18:44:41

On Fri, Sep 27, 2013 at 7:22 PM, Jim Nasby <jim@nasby.net> wrote:
>
> Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There's
> improvements that need to be made. But IMHO that's not reason to automatically avoid forks; we need to consider the cost
> of improving them vs what we gain by using them.

We think this gives short change to the decision to introduce forks.
If you go back to the discussion at the time it was a topic of debate
and the argument which won the day is that interleaving different
streams of data in one storage system is exactly what the file system
is designed to do and we would just be reinventing the wheel if we
tried to do it ourselves. I think that makes a lot of sense for things
like the fsm or vm which grow indefinitely and are maintained by a
different piece of code from the main heap.

The tradeoff might be somewhat different for the pieces of a data
structure like a bitmap index or gin index where the code responsible
for maintaining it is all the same.

> Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of
> postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to
> truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that
> case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in
> storage we probably shouldn't just blindly stick with 1G (though of course initial implementation might do that to
> reduce complexity, but we better still consider where we're headed).

The ultimate goal here would be to get the filesystem to issue a TRIM
call so an SSD storage system can reuse the underlying blocks.
Truncating 1GB files might be a convenient way to do it, especially if
we have some new kind of vacuum full that can pack tuples within each
1GB file.

But there may be easier ways to achieve the same thing. If we can
notify the filesystem that we're not using some of the blocks in the
middle of the file we might be able to just leave things where they
are and have holes in the files. Or we might be better off not
depending on truncate and just look for ways to mark entire 1GB files
as "deprecated" and move tuples out of them until we can just remove
that whole file.

--
greg

Re: Minmax indexes

From

Jim Nasby

Date:

27 September 2013, 19:14:48

On 9/27/13 1:43 PM, Greg Stark wrote:
>> Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of
>> postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to
>> truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that
>> case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in
>> storage we probably shouldn't just blindly stick with 1G (though of course initial implementation might do that to
>> reduce complexity, but we better still consider where we're headed).
 
> The ultimate goal here would be to get the filesystem to issue a TRIM
> call so an SSD storage system can reuse the underlying blocks.
> Truncating 1GB files might be a convenient way to do it, especially if
> we have some new kind of vacuum full that can pack tuples within each
> 1GB file.
>
> But there may be easier ways to achieve the same thing. If we can
> notify the filesystem that we're not using some of the blocks in the
> middle of the file we might be able to just leave things where they
> are and have holes in the files. Or we might be better off not
> depending on truncate and just look for ways to mark entire 1GB files
> as "deprecated" and move tuples out of them until we can just remove
> that whole file.

Yeah, there's a ton of different things we might do. And dealing with free space is just one example... things like the
VM give us the ability to detect areas of the heap that have gone "dormant"; imagine if we could seamlessly move that
data to its own storage, possibly compressing it at the same time. (Yes, I realize there's partitioning and tablespaces
and compressing filesystems, but those are a lot more work and will never be as efficient as what the database itself
can do).
 

Anyway, I think we're all on the same page. We should stop hijacking Alvaro's thread... ;)
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net

Re: Minmax indexes

From

Heikki Linnakangas

Date:

30 September 2013, 11:08:05

On 27.09.2013 21:43, Greg Stark wrote:
> On Fri, Sep 27, 2013 at 7:22 PM, Jim Nasby<jim@nasby.net>  wrote:
>>
>> Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There's
>> improvements that need to be made. But IMHO that's not reason to automatically avoid forks; we need to consider the cost
>> of improving them vs what we gain by using them.

>
> We think this gives short change to the decision to introduce forks.
> If you go back to the discussion at the time it was a topic of debate
> and the argument which won the day is that interleaving different
> streams of data in one storage system is exactly what the file system
> is designed to do and we would just be reinventing the wheel if we
> tried to do it ourselves. I think that makes a lot of sense for things
> like the fsm or vm which grow indefinitely and are maintained by a
> different piece of code from the main heap.
>
> The tradeoff might be somewhat different for the pieces of a data
> structure like a bitmap index or gin index where the code responsible
> for maintaining it is all the same.

There are quite a few cases where we have several "streams" of data, 
all related to a single relation. We've solved them all in slightly 
different ways:

1. TOAST. A separate heap relation with accompanying b-tree index is 
created.

2. GIN. GIN contains a b-tree, and data pages (and some other kinds of 
pages too IIRC). It would be natural to use the regular B-tree code for 
the B-tree, but instead it contains a completely separate 
implementation. All the different kinds of streams are stored in the 
main fork.

3. Free space map. Stored as a separate fork.

4. Visibility map. Stored as a separate fork.

And upcoming:

5. Minmax indexes, with the linearly-addressed range reverse map and 
variable length index tuples.

6. Bitmap indexes. Like in GIN, there's a B-tree and the data pages 
containing the bitmaps.

A nice property of the VM and FSM forks currently is that they are just 
auxiliary information to speed things up. You can safely remove them 
(when the server is shut down), and the system will recreate them on 
next vacuum. It's not carved in stone that it has to be that way for all 
extra forks, but it is today and I like it.

I feel we need a new kind of a relation fork, something more 
heavy-weight than the current forks, but not as heavy-weight as the way 
TOAST does it. It would be nice if GIN and bitmap indexes could use the 
regular nbtree code. Or any other index type - imagine a bitmap index 
using a SP-GiST index instead of a B-tree! You could create a bitmap 
index for 2d points, and use it to speed up operations like overlap for 
example.

The nbtree code expects the data to be in the main fork and uses the FSM 
fork too. Maybe it could be abstracted, so that the regular b-tree could 
be used as part of another index type. Same with other indexams.

Perhaps relation forks need to be made more flexible, allowing access 
methods to define what forks exist. IOW, let's not avoid using relation 
forks, let's make them better instead.

- Heikki

Re: Minmax indexes

From

Heikki Linnakangas

Date:

30 September 2013, 11:17:51

What would it take to abstract the minmax indexes to allow maintaining a 
bounding box for points, instead of a plain min/max? Or for ranges. In 
other words, why is this restricted to b-tree operators?

- Heikki

Re: Minmax indexes

From

David Fetter

Date:

30 September 2013, 16:29:27

On Mon, Sep 30, 2013 at 02:17:39PM +0300, Heikki Linnakangas wrote:
> What would it take to abstract the minmax indexes to allow maintaining
> a bounding box for points, instead of a plain min/max? Or for
> ranges. In other words, why is this restricted to b-tree operators?

If I had to guess, I'd guess, "first cut."

I take it this also occurred to you and that you believe that this
approach makes the more general case harder, or at least further out than it
would need to be.  Am I close?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Minmax indexes

From

Alvaro Herrera

Date:

30 September 2013, 16:49:19

David Fetter wrote:
> On Mon, Sep 30, 2013 at 02:17:39PM +0300, Heikki Linnakangas wrote:
> > What would it take to abstract the minmax indexes to allow maintaining
> > a bounding box for points, instead of a plain min/max? Or for
> > ranges. In other words, why is this restricted to b-tree operators?
> 
> If I had to guess, I'd guess, "first cut."

Yeah, there were a few other simplifications in the design too, though I
admit allowing for multidimensional datatypes hadn't occurred to me
(though I will guess Simon did think about it and just didn't tell me to
avoid me going overboard with stuff that would make the first version
take forever).

I think we'd better add version numbers and stuff to the metapage to
allow for extensions and proper upgradability.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Heikki Linnakangas

Date:

30 September 2013, 17:20:32

On 30.09.2013 19:49, Alvaro Herrera wrote:
> David Fetter wrote:
>> On Mon, Sep 30, 2013 at 02:17:39PM +0300, Heikki Linnakangas wrote:
>>> What would it take to abstract the minmax indexes to allow maintaining
>>> a bounding box for points, instead of a plain min/max? Or for
>>> ranges. In other words, why is this restricted to b-tree operators?
>>
>> If I had to guess, I'd guess, "first cut."
>
> Yeah, there were a few other simplifications in the design too, though I
> admit allowing for multidimensional datatypes hadn't occurred to me

You can almost create a bounding box opclass in the current 
implementation, by mapping < operator to "contains" and > to "not 
contains". But there's no support for creating a new, larger, bounding 
box on insert. It will just replace the max with the new value if it's 
"greater than", when it should create a whole new value to store in the 
index that covers both the old and the new values. (or less than? I'm 
not sure which way those operators would work..)

When you think of the general case, it's weird that the current 
implementation requires storing both the min and the max. For a bounding 
box, you store the bounding box that covers all heap tuples in the 
range. If that corresponds to "max", what does "min" mean?

In fact, even with regular b-tree operators, over integers for example, 
you don't necessarily want to store both min and max. If you only ever 
perform queries like "WHERE col > ?", there's no need to track the min 
value. So to make this really general, you should be able to create an 
index on only the minimum or maximum. Or if you want both, you can store 
them as separate index columns. Something like:

CREATE INDEX minindex ON table (col ASC); -- For min
CREATE INDEX minindex ON table (col DESC);  -- For max
CREATE INDEX minindex ON table (col ASC, col DESC); -- For both

That said, in practice most people probably want to store both min and 
max. Maybe it's a bit too finicky if we require people to write "col 
ASC, col DESC" to get that. Some kind of a shorthand probably makes sense.
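
The point that a query like "WHERE col > ?" never needs the per-range minimum can be illustrated with a small sketch. This is not code from the patch: summarize_max and ranges_to_scan_gt are hypothetical names, and a "range" here is simply a fixed-size slice of a Python list standing in for a range of heap pages.

```python
from typing import List


def summarize_max(values: List[int], range_size: int) -> List[int]:
    """Keep only the maximum of each consecutive block of `range_size` values."""
    return [max(values[i:i + range_size])
            for i in range(0, len(values), range_size)]


def ranges_to_scan_gt(maxima: List[int], bound: int) -> List[int]:
    """Indexes of ranges that may hold a value > bound.

    Only the stored maximum is consulted; a stored minimum would not
    change the answer for this operator.
    """
    return [idx for idx, hi in enumerate(maxima) if hi > bound]


# Toy "heap": three ranges of four values each (assumed data).
heap = [1, 2, 3, 4, 10, 11, 12, 13, 20, 21, 22, 23]
maxima = summarize_max(heap, range_size=4)
candidates = ranges_to_scan_gt(maxima, bound=12)
print(maxima, candidates)
```

A symmetric summary holding only minima would serve "WHERE col < ?" in the same way, which is why storing both is a convenience rather than a necessity.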

> (though I will guess Simon did think about it and just didn't tell me to
> avoid me going overboard with stuff that would make the first version
> take forever).

Heh, and I ruined that great plan :-).

- Heikki

Re: Minmax indexes

From

Robert Haas

Date:

01 October 2013, 10:18:30

On Mon, Sep 30, 2013 at 1:20 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> You can almost create a bounding box opclass in the current implementation,
> by mapping < operator to "contains" and > to "not contains". But there's no
> support for creating a new, larger, bounding box on insert. It will just
> replace the max with the new value if it's "greater than", when it should
> create a whole new value to store in the index that covers both the old and
> the new values. (or less than? I'm not sure which way those operators would
> work..)

This sounds an awful lot like GiST's "union" operation.  Actually,
following the GiST model of having "union" and "consistent" operations
might be a smart way to go.  Then the exact index semantics could be
decided by the opclass.  This might not even be that much extra code;
the existing consistent and union functions for GiST are pretty short.
That way, it'd be easy to add new opclasses with somewhat different
behavior; the common thread would be that every opclass of this new AM
works by summarizing a physical page range into a single indexed
value.  You might call the AM something like "summary" or "sparse" and
then have "minmax_ops" for your first opclass.
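
As a rough illustration of the "union" and "consistent" split suggested here, the sketch below defines two hypothetical opclasses, one for scalars (min/max) and one for 2-D points (bounding box), that share the same summarizing loop. All function names and the tuple layouts are assumptions made for the example; none of them come from the patch or from GiST's actual C interface.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]        # (min, max) summary for scalar values
Box = Tuple[int, int, int, int]   # (xmin, ymin, xmax, ymax) summary for points


def minmax_union(summary: Optional[Interval], value: int) -> Interval:
    """Return a summary covering both the old summary and the new value."""
    if summary is None:
        return (value, value)
    lo, hi = summary
    return (min(lo, value), max(hi, value))


def minmax_consistent(summary: Interval, query: int) -> bool:
    """Could this range contain a value equal to `query`?"""
    lo, hi = summary
    return lo <= query <= hi


def box_union(summary: Optional[Box], point: Tuple[int, int]) -> Box:
    """Grow the bounding box so it also covers `point`."""
    x, y = point
    if summary is None:
        return (x, y, x, y)
    xmin, ymin, xmax, ymax = summary
    return (min(xmin, x), min(ymin, y), max(xmax, x), max(ymax, y))


def box_consistent(summary: Box, query: Box) -> bool:
    """Could this range contain a point inside `query`? True if boxes overlap."""
    xmin, ymin, xmax, ymax = summary
    qxmin, qymin, qxmax, qymax = query
    return xmin <= qxmax and qxmin <= xmax and ymin <= qymax and qymin <= ymax


def summarize(values, range_size, union):
    """Opclass-independent part: fold each range into one summary via union."""
    summaries = []
    for start in range(0, len(values), range_size):
        summary = None
        for v in values[start:start + range_size]:
            summary = union(summary, v)
        summaries.append(summary)
    return summaries


def candidate_ranges(summaries, query, consistent):
    """Opclass-independent part: keep ranges the opclass cannot rule out."""
    return [i for i, s in enumerate(summaries) if consistent(s, query)]


int_summaries = summarize([1, 5, 3, 40, 42, 41], 3, minmax_union)
box_summaries = summarize([(0, 0), (2, 1), (10, 10), (12, 15)], 2, box_union)
print(candidate_ranges(int_summaries, 41, minmax_consistent))
print(candidate_ranges(box_summaries, (1, 0, 3, 3), box_consistent))
```

The summarizing and scanning loops never look inside a summary, so in this sketch all type-specific behavior, including growing a bounding box on insert, lives in the two per-opclass functions.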

> In fact, even with regular b-tree operators, over integers for example, you
> don't necessarily want to store both min and max. If you only ever perform
> queries like "WHERE col > ?", there's no need to track the min value. So to
> make this really general, you should be able to create an index on only the
> minimum or maximum. Or if you want both, you can store them as separate
> index columns. Something like:
>
> CREATE INDEX minindex ON table (col ASC); -- For min
> CREATE INDEX minindex ON table (col DESC);  -- For max
> CREATE INDEX minindex ON table (col ASC, col DESC); -- For both

This doesn't seem very general, since you're relying on the fact that
ASC and DESC map to < and >.  It's not clear what you'd write here if
you wanted to optimize #$ and @!.  But something based on opclasses
will work, since each opclass can support an arbitrary set of
operators.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Alvaro Herrera

Date:

08 November 2013, 20:41:23

Robert Haas escribió:
> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Here's an updated version of this patch, with fixes to all the bugs
> > reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
> > Amit Kapila for the reports.
> 
> I'm not very happy with the use of a separate relation fork for
> storing this data.

I have been playing with having the revmap in the main fork of the index
rather than a separate one.  On the surface many things stay just what
they are; I only had to add a layer beneath the revmap that maps its
logical block numbers to physical block numbers.  The problem with this
is that it needs more disk access, because revmap block numbers cannot be
hardcoded.

After doing some quick math, what I ended up with was to keep an array
of BlockNumbers in the metapage.  Each element in this array points to
array pages; each array page is, in turn, filled with more BlockNumbers,
which this time correspond to the logical revmap pages we used to have
in the revmap fork.  (I initially feared that this design would not
allow me to address enough revmap pages for the largest of tables; but
fortunately this is sufficient unless you configure very small pages,
say BLCKSZ 2kB, use small page ranges, and use small datatypes, say
"char".  I have no problem with saying that that scenario is not
supported if you want to have minmax indexes on 32 TB tables.  I mean,
who uses BLCKSZ smaller than 8kB anyway?).

The advantage of this design is that in order to find any particular
logical revmap page, you always have to do a constant number of page
accesses.  You read the metapage, then read the array page, then read
the revmap page; done.  Another idea I considered was chaining revmap
pages (so each would have a pointer-to-next), or chaining array pages;
but this would have meant that to locate an individual page near the end
of the revmap, you might need to do many accesses.  Not good.
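
The three-step lookup described above (metapage, then array page, then revmap page) can be sketched as follows. This is only a toy model under stated assumptions: pages are Python lists keyed by block number, ENTRIES_PER_ARRAY_PAGE is an invented tiny constant, and find_revmap_page is a hypothetical name, not a function from the patch.

```python
from typing import Dict, List, Tuple

METAPAGE_BLKNO = 0
ENTRIES_PER_ARRAY_PAGE = 4  # hypothetical; tiny so the example is easy to follow


def find_revmap_page(pages: Dict[int, List[int]],
                     logical_page: int) -> Tuple[int, List[int]]:
    """Map a logical revmap page number to its physical block number.

    Returns the physical block number and the list of blocks that would be
    read, which always has exactly three entries.
    """
    reads = [METAPAGE_BLKNO]
    metapage = pages[METAPAGE_BLKNO]

    # The metapage slot selects the array page covering this logical page.
    array_blkno = metapage[logical_page // ENTRIES_PER_ARRAY_PAGE]
    reads.append(array_blkno)
    array_page = pages[array_blkno]

    # The array page slot gives the physical block of the revmap page itself.
    revmap_blkno = array_page[logical_page % ENTRIES_PER_ARRAY_PAGE]
    reads.append(revmap_blkno)
    return revmap_blkno, reads


# Assumed layout: block 0 is the metapage, blocks 2 and 7 are array pages.
pages = {
    0: [2, 7],
    2: [1, 3, 4, 5],
    7: [8, 9, 10, 11],
}
blkno, reads = find_revmap_page(pages, logical_page=5)
print(blkno, reads)
```

With chained pages, by contrast, the number of reads would grow with the position of the page in the chain, which is the drawback the message points out.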

As an optimization for relatively small indexes, we hardcode the page
number for the first revmap page: it's always the page right after the
metapage (so BlockNumber 1).  A revmap page can store, with the default
page size, about 1350 item pointers; so with an index built for page
ranges of 1000 pages per range, you can point to enough index entries
for a ~10 GB table without having the need to examine the first array
page.  This seems pretty acceptable; people with larger tables can
likely spare one extra page accessed every now and then.
(For comparison, each regular minmax page can store about 500 index
tuples, if it's built for a single 4-byte column; this means that the 10
GB table requires a 5-page index.)
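
The figures in the preceding paragraph can be checked with a few lines of arithmetic. The constants 1350, 1000 and 500 are the approximate values quoted in the message, and 8192 is PostgreSQL's default BLCKSZ. The breakdown of the 5-page total (metapage, one revmap page, three pages of index tuples) is an assumption about how that number was reached, not something the message states.

```python
import math

BLCKSZ = 8192                         # default PostgreSQL page size, in bytes
ITEM_POINTERS_PER_REVMAP_PAGE = 1350  # "about 1350", from the message
PAGES_PER_RANGE = 1000                # page range size used in the example
INDEX_TUPLES_PER_PAGE = 500           # "about 500", single 4-byte column

# Heap pages addressable through the first (hardcoded) revmap page.
heap_pages_covered = ITEM_POINTERS_PER_REVMAP_PAGE * PAGES_PER_RANGE
table_gib = heap_pages_covered * BLCKSZ / 2**30

# Pages needed to hold one index tuple per page range.
tuple_pages = math.ceil(ITEM_POINTERS_PER_REVMAP_PAGE / INDEX_TUPLES_PER_PAGE)

# Assumed breakdown: metapage + first revmap page + index tuple pages.
total_index_pages = 1 + 1 + tuple_pages

print(heap_pages_covered, round(table_gib, 1), tuple_pages, total_index_pages)
```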

This is not complete yet; although I have a proof-of-concept working, I
still need to write XLog support code and update the pageinspect code to
match.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Alvaro Herrera

Date:

08 November 2013, 20:42:32

Erik Rijkers wrote:
> On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
> >
> >> [minmax-5.patch]
> >
> > I have the impression it's not quite working correctly.

Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués.  It's also been rebased to apply
cleanly on top of today's master branch.

I have also added a selectivity function, but I'm not positive that it's
very useful yet.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-7.patch

Re: Minmax indexes

From

Alvaro Herrera

Date:

08 November 2013, 21:04:49

Alvaro Herrera escribió:

> I have been playing with having the revmap in the main fork of the index
> rather than a separate one.
...
> This is not complete yet; although I have a proof-of-concept working, I
> still need to write XLog support code and update the pageinspect code to
> match.

Just to be clear: the v7 published elsewhere in this thread does not
contain this revmap-in-main-fork code.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

"Erik Rijkers"

Date:

11 November 2013, 08:53:56

On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
>
> Here's a version 7 of the patch, which fixes these bugs and adds
> opclasses for a bunch more types (timestamp, timestamptz, date, time,
> timetz), courtesy of Martín Marqués.  It's also been rebased to apply
> cleanly on top of today's master branch.
>
> I have also added a selectivity function, but I'm not positive that it's
> very useful yet.
>
> [minmax-7.patch]

The earlier errors are indeed fixed; now, I've been trying with the attached test case but I'm unable to find a query
that improves with minmax index use.  (it gets used sometimes but speedup is negligible).

That probably means I'm doing something wrong; could you (or anyone) give some hints about what use-case would be expected to benefit?

(Or is it just the unfinished selectivity function?)

Thanks,

Erikjan Rijkers

Attachment

test.sh

Re: Minmax indexes

From

"Erik Rijkers"

Date:

11 November 2013, 09:35:20

On Mon, November 11, 2013 09:53, Erik Rijkers wrote:
> On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
>>
>> Here's a version 7 of the patch, which fixes these bugs and adds
>> opclasses for a bunch more types (timestamp, timestamptz, date, time,
>> timetz), courtesy of Martín Marqués.  It's also been rebased to apply
>> cleanly on top of today's master branch.
>>
>> I have also added a selectivity function, but I'm not positive that it's
>> very useful yet.
>>
>> [minmax-7.patch]
>
> The earlier errors are indeed fixed; now, I've been trying with the attached test case but I'm unable to find a query
> that improves with minmax index use.  (it gets used sometimes but speedup is negligible).
>

Another issue (I think):

Attached is a program (and output as a .txt file) that gives the following (repeatable) error:

$ ./casanova_test.sh
\timing on
                drop table if exists t1;
Time: 333.159 ms
                create table t1 (i int);
Time: 155.827 ms
                create index t1_i_idx on t1 using minmax(i);
Time: 204.031 ms
                insert into t1 select generate_series(1, 25000000);
Time: 126312.302 ms
        analyze t1;
ERROR:  could not truncate file base/21324/26339_vm to 41 blocks: it's only 1 blocks now
Time: 472.504 ms
[...]


Thanks,

Erik Rijkers

Attachment

Re: Minmax indexes

From

Jeff Janes

Date:

11 November 2013, 17:15:59

On Mon, Nov 11, 2013 at 12:53 AM, Erik Rijkers <er@xs4all.nl> wrote:

On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
>
> Here's a version 7 of the patch, which fixes these bugs and adds
> opclasses for a bunch more types (timestamp, timestamptz, date, time,
> timetz), courtesy of Martín Marqués. It's also been rebased to apply
> cleanly on top of today's master branch.
>
> I have also added a selectivity function, but I'm not positive that it's
> very useful yet.
>
> [minmax-7.patch]

The earlier errors are indeed fixed; now, I've been trying with the attached test case but I'm unable to find a query that
improves with minmax index use. (it gets used sometimes but speedup is negligible).

Your data set seems to be completely random. I believe that minmax indices would only be expected to be useful when the data is clustered. Perhaps you could try it on a table where it is populated something like i+random()/10*max_i.
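
To make the clustering argument concrete, here is a small simulation, offered as a sketch rather than a benchmark. It builds per-range (min, max) summaries over a fully random column and over one populated as i + random()/10*max_i, then counts how many ranges a narrow query cannot rule out. The table size, range size, seed and query bounds are arbitrary assumptions, and minmax_summaries and ranges_matching are hypothetical helper names.

```python
import random


def minmax_summaries(values, range_size):
    """One (min, max) pair per consecutive block of `range_size` values."""
    chunks = (values[i:i + range_size]
              for i in range(0, len(values), range_size))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def ranges_matching(summaries, lo, hi):
    """Count ranges whose [min, max] interval overlaps the query [lo, hi]."""
    return sum(1 for mn, mx in summaries if mn <= hi and lo <= mx)


rng = random.Random(42)   # fixed seed so runs are repeatable
n = 100_000               # number of rows (assumed)
range_size = 1_000        # rows per summarized range (assumed)

# Completely random values: every range spans nearly the whole domain.
random_data = [rng.randrange(n) for _ in range(n)]

# Loosely clustered values, following the suggested i + random()/10*max_i.
clustered_data = [i + rng.random() / 10 * n for i in range(n)]

query = (50_000, 50_100)
random_hits = ranges_matching(minmax_summaries(random_data, range_size), *query)
clustered_hits = ranges_matching(minmax_summaries(clustered_data, range_size), *query)
print(random_hits, clustered_hits)
```

On random data essentially every range overlaps the query, so the index excludes nothing; on the clustered data only the ranges near the queried values remain.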

Cheers,

Jeff

Re: Minmax indexes (timings)

From

"Erik Rijkers"

Date:

15 November 2013, 16:11:56

On Mon, November 11, 2013 09:53, Erik Rijkers wrote:
> On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
>>
>> Here's a version 7 of the patch, which fixes these bugs and adds
>>
>> [minmax-7.patch]
[...]
> some hints about use-case would be expected?
>

I've been messing with minmax indexes some more so here are some results of that.

Perhaps someone finds these timings useful.


Centos 5.7, 32 GB memory, 2 quadcores.

'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--enable-depend' '--enable-cassert'
'--enable-debug' '--with-perl' '--with-openssl' '--with-libxml' '--enable-dtrace'



Detail is in the attached files; the below is a grep through these.


-- rowcount (size_string):  10_000
     368,640 | size table
     245,760 | size btree index
      16,384 | size minmax index
 Total runtime: 0.167 ms   <-- btree (4x) ( last 2x disabled index-only )
 Total runtime: 0.046 ms
 Total runtime: 0.046 ms
 Total runtime: 0.049 ms
 Total runtime: 0.102 ms   <-- minmax  (4x)
 Total runtime: 0.047 ms
 Total runtime: 0.047 ms
 Total runtime: 0.047 ms
 Total runtime: 1.066 ms   <-- seqscan


-- rowcount (size_string):  100_000
    3,629,056 | size table
    2,260,992 | size btree index
       16,384 | size minmax index
 Total runtime: 0.090 ms   <-- btree (4x) ( last 2x disabled index-only )
 Total runtime: 0.046 ms
 Total runtime: 0.426 ms
 Total runtime: 0.287 ms
 Total runtime: 0.391 ms   <-- minmax (4x)
 Total runtime: 0.285 ms
 Total runtime: 0.285 ms
 Total runtime: 0.291 ms
 Total runtime: 14.065 ms  <-- seqscan


-- rowcount (size_string):  1_000_000
   36,249,600 | size table
   22,487,040 | size btree index
       57,344 | size minmax index
 Total runtime: 0.077 ms    <-- btree (4x) ( last 2x disabled index-only )
 Total runtime: 0.048 ms
 Total runtime: 0.044 ms
 Total runtime: 0.038 ms
 Total runtime: 2.284 ms    <-- minmax (4x)
 Total runtime: 1.812 ms
 Total runtime: 1.813 ms
 Total runtime: 1.809 ms
 Total runtime: 142.958 ms  <-- seqscan


-- rowcount (size_string):  100_000_000
 3,624,779,776 | size table
 2,246,197,248 | size btree index
     4,456,448 | size minmax index
 Total runtime: 0.091 ms      <-- btree (4x) ( last 2x disabled index-only )
 Total runtime: 0.047 ms
 Total runtime: 0.046 ms
 Total runtime: 0.038 ms
 Total runtime: 181.874 ms    <-- minmax (4x)
 Total runtime: 175.084 ms
 Total runtime: 175.104 ms
 Total runtime: 174.349 ms
 Total runtime: 14833.994 ms  <-- seqscan


-- rowcount (size_string):  1_000_000_000
 36,247,789,568 | size table
 22,461,628,416 | size btree index
     44,433,408 | size minmax index
 Total runtime: 14.735 ms     <-- btree (4x) ( last 2x disabled index-only )
 Total runtime: 0.046 ms
 Total runtime: 0.044 ms
 Total runtime: 0.041 ms
 Total runtime: 1790.591 ms   <-- minmax (4x)
 Total runtime: 1750.129 ms
 Total runtime: 1747.987 ms
 Total runtime: 1748.476 ms
 Total runtime: 169770.455 ms <-- seqscan


The messy "program" is attached too (although it still has Jaime's name, the mess is mine).

hth,

Erik Rijkers


PS.
The bug I reported earlier is (of course) still there; but I noticed that it only occurs on larger table sizes (e.g.
+1M rows).

Attachment

Re: Minmax indexes (timings)

From

Andres Freund

Date:

15 November 2013, 16:29:06

On 2013-11-15 17:11:46 +0100, Erik Rijkers wrote:
> I've been messing with minmax indexes some more so here are some results of that.
> 
> Perhaps someone finds these timings useful.
> 
> 
> Centos 5.7, 32 GB memory, 2 quadcores.
> 
> '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--enable-depend'
> '--enable-cassert'
> '--enable-debug' '--with-perl' '--with-openssl' '--with-libxml' '--enable-dtrace'

Just some general advice: doing timings with --enable-cassert isn't that
meaningful - it often can distort results significantly.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes (timings)

From

Kevin Grittner

Date:

15 November 2013, 16:34:02

Erik Rijkers <er@xs4all.nl> wrote:

> Perhaps someone finds these timings useful.

> '--enable-cassert'

Assertions can really distort the timings, and not always equally
for all code paths.  Any chance of re-running those tests without
that?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes (timings)

From

"Erik Rijkers"

Date:

15 November 2013, 18:42:22

On Fri, November 15, 2013 17:33, Kevin Grittner wrote:
> Erik Rijkers <er@xs4all.nl> wrote:
>
>> Perhaps someone finds these timings useful.
>
>> '--enable-cassert'
>
> Assertions can really distort the timings, and not always equally
> for all code paths.  Any chance of re-running those tests without
> that?
>

Fair enough.  It seems it doesn't make all that much difference for this case, here are the results:

'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--enable-depend' '--with-perl'
'--with-openssl' '--with-libxml'

-- rowcount (size_string):  10_000
     368640 | size table        | 360 kB
     245760 | size btree index  | 240 kB
      16384 | size minmax index | 16 kB
 Total runtime: 0.121 ms
 Total runtime: 0.041 ms
 Total runtime: 0.039 ms
 Total runtime: 0.040 ms
 Total runtime: 0.043 ms
 Total runtime: 0.041 ms
 Total runtime: 0.040 ms
 Total runtime: 0.040 ms
 Total runtime: 0.948 ms

-- rowcount (size_string):  100_000
    3629056 | size table        | 3544 kB
    2260992 | size btree index  | 2208 kB
      16384 | size minmax index | 16 kB
 Total runtime: 0.082 ms
 Total runtime: 0.039 ms
 Total runtime: 0.396 ms
 Total runtime: 0.252 ms
 Total runtime: 0.339 ms
 Total runtime: 0.245 ms
 Total runtime: 0.240 ms
 Total runtime: 0.241 ms
 Total runtime: 13.268 ms

-- rowcount (size_string):  1_000_000
   36249600 | size table        | 35 MB
   22487040 | size btree index  | 21 MB
      57344 | size minmax index | 56 kB
 Total runtime: 0.096 ms
 Total runtime: 0.039 ms
 Total runtime: 0.039 ms
 Total runtime: 0.034 ms
 Total runtime: 1.975 ms
 Total runtime: 1.527 ms
 Total runtime: 1.523 ms
 Total runtime: 1.519 ms
 Total runtime: 145.125 ms

-- rowcount (size_string):  100_000_000
 3624779776 | size table        | 3457 MB
 2246197248 | size btree index  | 2142 MB
    4456448 | size minmax index | 4352 kB
 Total runtime: 0.074 ms
 Total runtime: 0.039 ms
 Total runtime: 0.040 ms
 Total runtime: 0.033 ms
 Total runtime: 150.450 ms
 Total runtime: 147.039 ms
 Total runtime: 145.410 ms
 Total runtime: 145.142 ms
 Total runtime: 15068.171 ms

-- rowcount (size_string):  1_000_000_000
 36247789568 | size table        | 34 GB
 22461628416 | size btree index  | 21 GB
    44433408 | size minmax index | 42 MB
 Total runtime: 15.454 ms      <-- 4x btree
 Total runtime: 0.040 ms
 Total runtime: 0.040 ms
 Total runtime: 0.034 ms
 Total runtime: 1502.353 ms    <-- 4x minmax
 Total runtime: 1482.322 ms
 Total runtime: 1489.522 ms
 Total runtime: 1481.424 ms
 Total runtime: 162213.392 ms  <-- seqscan



I'd say minmax indexes give spectacular gains for a very small index size.
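
For a sense of scale, the ratios behind that remark can be worked out from the 1_000_000_000-row figures in the run without --enable-cassert. The numbers below are copied from the table above, and the assignment of timing lines to btree, minmax and seqscan follows the annotations in the message. The variable names are just for this sketch.

```python
# Sizes in bytes for the 1_000_000_000-row table (non-cassert run).
table_bytes = 36_247_789_568
btree_bytes = 22_461_628_416
minmax_bytes = 44_433_408

# Timings in milliseconds, as annotated in the message.
btree_ms = [15.454, 0.040, 0.040, 0.034]
minmax_ms = [1502.353, 1482.322, 1489.522, 1481.424]
seqscan_ms = 162213.392

# How much smaller the minmax index is than the btree index.
size_ratio = btree_bytes / minmax_bytes

# Minmax index size as a fraction of the table it covers.
index_fraction = minmax_bytes / table_bytes

# Speedup of the best minmax run over the sequential scan.
speedup_vs_seqscan = seqscan_ms / min(minmax_ms)

print(f"btree/minmax size ratio: {size_ratio:.1f}")
print(f"minmax size as share of table: {index_fraction:.4%}")
print(f"seqscan/minmax time ratio: {speedup_vs_seqscan:.1f}")
```

The btree lookups remain far faster in absolute terms for these queries, so the gain here is relative to a sequential scan and to the space a btree would occupy.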


Erik Rijkers

Attachment

minmax_sizes_times-20131115.zip

Re: Minmax indexes

From

Jeff Janes

Date:

15 November 2013, 20:06:13

On Fri, Nov 8, 2013 at 12:11 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Erik Rijkers wrote:
> On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
> >
> >> [minmax-5.patch]
> >
> > I have the impression it's not quite working correctly.

Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.

I have also added a selectivity function, but I'm not positive that it's
very useful yet.

I tested it with attached script, but broke out of the "for" loop after 5 iterations (when it had 300,000,005 rows inserted)

Then I did an analyze, and got an error message below:

jjanes=# analyze;

ERROR: could not truncate file "base/16384/16388_vm" to 488 blocks: it's only 82 blocks now

16388 is the index's relfilenode.

Here is the backtrace upon entry to the truncate that is going to fail:

#0 mdtruncate (reln=0x23c91b0, forknum=VISIBILITYMAP_FORKNUM, nblocks=488) at md.c:858
#1 0x000000000048eb4a in mmRevmapTruncate (rmAccess=0x26ad878, heapNumBlocks=1327434) at mmrevmap.c:360
#2 0x000000000048d37a in mmvacuumcleanup (fcinfo=<value optimized out>) at minmax.c:1264
#3 0x000000000072dcef in FunctionCall2Coll (flinfo=<value optimized out>, collation=<value optimized out>, arg1=<value optimized out>,
    arg2=<value optimized out>) at fmgr.c:1323
#4 0x000000000048c1e5 in index_vacuum_cleanup (info=<value optimized out>, stats=0x0) at indexam.c:715
#5 0x000000000052a7ce in do_analyze_rel (onerel=0x7f59798589e8, vacstmt=0x23b0bd8, acquirefunc=0x5298d0 <acquire_sample_rows>, relpages=1327434,
    inh=0 '\000', elevel=13) at analyze.c:634
#6 0x000000000052b320 in analyze_rel (relid=<value optimized out>, vacstmt=0x23b0bd8, bstrategy=<value optimized out>) at analyze.c:267
#7 0x000000000057cba7 in vacuum (vacstmt=0x23b0bd8, relid=<value optimized out>, do_toast=1 '\001', bstrategy=<value optimized out>,
    for_wraparound=0 '\000', isTopLevel=<value optimized out>) at vacuum.c:249
#8 0x0000000000663177 in standard_ProcessUtility (parsetree=0x23b0bd8, queryString=<value optimized out>, context=<value optimized out>, params=0x0,
    dest=<value optimized out>, completionTag=<value optimized out>) at utility.c:682
#9 0x00007f598290b791 in pgss_ProcessUtility (parsetree=0x23b0bd8, queryString=0x23b0220 "analyze \n;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    dest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pg_stat_statements.c:825
#10 0x000000000065fcf7 in PortalRunUtility (portal=0x24195e0, utilityStmt=0x23b0bd8, isTopLevel=1 '\001', dest=0x23b0f18, completionTag=0x7fffd3442f30 "")
    at pquery.c:1187
#11 0x0000000000660c6d in PortalRunMulti (portal=0x24195e0, isTopLevel=1 '\001', dest=0x23b0f18, altdest=0x23b0f18, completionTag=0x7fffd3442f30 "")
    at pquery.c:1318
#12 0x0000000000661323 in PortalRun (portal=0x24195e0, count=9223372036854775807, isTopLevel=1 '\001', dest=0x23b0f18, altdest=0x23b0f18,
    completionTag=0x7fffd3442f30 "") at pquery.c:816
#13 0x000000000065dbb4 in exec_simple_query (query_string=0x23b0220 "analyze \n;") at postgres.c:1048
#14 0x000000000065f259 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x2347be8 "jjanes", username=<value optimized out>)
    at postgres.c:3992
#15 0x000000000061b7d0 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4085
#16 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:3774
#17 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1585
#18 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1240
#19 0x00000000005b5e90 in main (argc=3, argv=0x2346cd0) at main.c:196

Cheers,

Jeff

Attachment

minmax_test3.sh

Re: Minmax indexes

From

Thom Brown

Date:

24 January 2014, 17:10:25

On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Erik Rijkers wrote:
>> On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
>> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
>> >
>> >> [minmax-5.patch]
>> >
>> > I have the impression it's not quite working correctly.
>
> Here's a version 7 of the patch, which fixes these bugs and adds
> opclasses for a bunch more types (timestamp, timestamptz, date, time,
> timetz), courtesy of Martín Marqués.  It's also been rebased to apply
> cleanly on top of today's master branch.
>
> I have also added a selectivity function, but I'm not positive that it's
> very useful yet.

This patch doesn't appear to have been submitted to any Commitfest.
Is this still a feature undergoing research then?

--
Thom

Re: Minmax indexes

From

Alvaro Herrera

Date:

24 January 2014, 17:53:19

Thom Brown wrote:
> On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Erik Rijkers wrote:
> >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
> >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
> >> >
> >> >> [minmax-5.patch]
> >> >
> >> > I have the impression it's not quite working correctly.
> >
> > Here's a version 7 of the patch, which fixes these bugs and adds
> > opclasses for a bunch more types (timestamp, timestamptz, date, time,
> > timetz), courtesy of Martín Marqués.  It's also been rebased to apply
> > cleanly on top of today's master branch.
> >
> > I have also added a selectivity function, but I'm not positive that it's
> > very useful yet.
> 
> This patch doesn't appear to have been submitted to any Commitfest.
> Is this still a feature undergoing research then?

It's still a planned feature, but I didn't have time to continue work
for 2014-01.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Thom Brown

Date:

24 January 2014, 17:55:01

On 24 January 2014 17:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thom Brown wrote:
>> On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > Erik Rijkers wrote:
>> >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
>> >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
>> >> >
>> >> >> [minmax-5.patch]
>> >> >
>> >> > I have the impression it's not quite working correctly.
>> >
>> > Here's a version 7 of the patch, which fixes these bugs and adds
>> > opclasses for a bunch more types (timestamp, timestamptz, date, time,
>> > timetz), courtesy of Martín Marqués.  It's also been rebased to apply
>> > cleanly on top of today's master branch.
>> >
>> > I have also added a selectivity function, but I'm not positive that it's
>> > very useful yet.
>>
>> This patch doesn't appear to have been submitted to any Commitfest.
>> Is this still a feature undergoing research then?
>
> It's still a planned feature, but I didn't have time to continue work
> for 2014-01.

Alles klar.

Thanks

--
Thom

Re: Minmax indexes

From

Claudio Freire

Date:

24 January 2014, 17:58:49

On Fri, Jan 24, 2014 at 2:54 PM, Thom Brown <thom@linux.com> wrote:
> On 24 January 2014 17:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Thom Brown wrote:
>>> On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> > Erik Rijkers wrote:
>>> >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
>>> >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
>>> >> >
>>> >> >> [minmax-5.patch]
>>> >> >
>>> >> > I have the impression it's not quite working correctly.
>>> >
>>> > Here's a version 7 of the patch, which fixes these bugs and adds
>>> > opclasses for a bunch more types (timestamp, timestamptz, date, time,
>>> > timetz), courtesy of Martín Marqués.  It's also been rebased to apply
>>> > cleanly on top of today's master branch.
>>> >
>>> > I have also added a selectivity function, but I'm not positive that it's
>>> > very useful yet.
>>>
>>> This patch doesn't appear to have been submitted to any Commitfest.
>>> Is this still a feature undergoing research then?
>>
>> It's still a planned feature, but I didn't have time to continue work
>> for 2014-01.

What's the status?

I believe I have more than one use for minmax indexes, and wouldn't mind
lending a hand if it's within my grasp.

Re: Minmax indexes

From

Greg Stark

Date:

25 January 2014, 01:57:05

On Fri, Jan 24, 2014 at 12:58 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>
> What's the status?
>
> I believe I have more than one use for minmax indexes, and wouldn't mind
> lending a hand if it's within my grasp.

I'm also interested in looking at this. Mostly because I have ideas
for other "summary" functions that would be interesting and could use
the same infrastructure otherwise.

-- 
greg

Re: Minmax indexes

From

Alvaro Herrera

Date:

15 June 2014, 02:34:32

Robert Haas wrote:
> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Here's an updated version of this patch, with fixes to all the bugs
> > reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
> > Amit Kapila for the reports.
>
> I'm not very happy with the use of a separate relation fork for
> storing this data.

Here's a new version of this patch.  Now the revmap is not stored in a
separate fork, but together with all the regular data, as explained
elsewhere in the thread.

I added a few pageinspect functions that let one explore the data in the
index.  With this you can start by reading the metapage, and from there
obtain the block numbers for the revmap array pages; and explore revmap
array pages to read regular revmap pages, which contain the TIDs to
index entries.  None of these pageinspect functions currently has any
documentation, but using them is as easy as

  with idxname as (select 'ti'::text as idxname)
  select *
    from idxname,
         generate_series(0, pg_relation_size(idxname) / 8192 - 1) i,
         minmax_page_type(get_raw_page(idxname, i::int));

  select *        -- data in metapage
    from minmax_metapage_info(get_raw_page('ti', 0));

  select *        -- data in revmap array pages
    from minmax_revmap_array_data(get_raw_page('ti', 6));

  select logblk, unnest(pages)    -- data in regular revmap pages
    from minmax_revmap_data(get_raw_page('ti', 15));

  select *        -- data in regular index pages
    from minmax_page_items(get_raw_page('ti', 2), 'ti'::regclass);

Note that in this last case you need to give it the OID of the index as
the second parameter, so that it can construct a tupledesc for decoding
the min/max data.

I have followed the suggestion by Amit to overwrite the index tuple when
a new heap tuple is inserted, instead of creating a separate index
tuple.  This saves a lot of index bloat.  This required a new entry
point in bufpage.c, PageOverwriteItemData().  bufpage.c also has a new
function PageIndexDeleteNoCompact which is similar in spirit to
PageIndexMultiDelete except that item pointers do not change.  This is
necessary because the revmap stores item pointers, and such references
would break if we were to renumber items in index pages.
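
The difference between compacting and non-compacting deletion can be sketched with a toy page model.  This is not the bufpage.c code: the list-of-slots "page" and both function names below are assumptions made purely for illustration.

```python
# Toy model: a page is a list of slots, and a stored pointer is a 1-based
# offset into that list (as with PostgreSQL item pointers).

def delete_with_compaction(items, offsets):
    """Remove the given 1-based offsets and close the gaps (renumbers later items)."""
    doomed = set(offsets)
    return [item for pos, item in enumerate(items, start=1) if pos not in doomed]


def delete_no_compact(items, offsets):
    """Remove the given 1-based offsets but leave an unused slot (None) behind."""
    doomed = set(offsets)
    return [None if pos in doomed else item
            for pos, item in enumerate(items, start=1)]


page = ["range 0", "range 1", "range 2", "range 3"]
pointer_to_range_3 = 4  # an offset remembered elsewhere, e.g. in the revmap

compacted = delete_with_compaction(page, [2])
kept = delete_no_compact(page, [2])

# With compaction the page shrinks, so offset 4 no longer exists.
print(len(compacted))
# Without compaction the remembered offset still finds the same item.
print(kept[pointer_to_range_3 - 1])
```

In the toy model the remembered offset survives only in the second variant, which is the property the revmap relies on.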

I have also added a reloption for the size of each page range, so you
can do
  create index ti on t using minmax (a) with (pages_per_range = 2);
The default is 128 pages per range, and I have an arbitrary maximum of
131072 (the number of 8kB pages in a default 1GB segment).  There doesn't seem to be much
point in having larger page ranges; intuitively I think page ranges
should be more or less the size of kernel readahead, but I haven't
tested this.
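
For a feel of the numbers, here is a small arithmetic sketch.  It assumes the default 8192-byte block size; the helper names are made up for illustration, and the 1327434-block heap is the size reported in the backtrace earlier in the thread.

```python
BLOCK_SIZE = 8192               # bytes per page (PostgreSQL default, assumed)
DEFAULT_PAGES_PER_RANGE = 128
MAX_PAGES_PER_RANGE = 131072


def range_bytes(pages_per_range, block_size=BLOCK_SIZE):
    """Heap bytes covered by one page range."""
    return pages_per_range * block_size


def ranges_needed(heap_blocks, pages_per_range):
    """Number of page ranges needed to cover a heap (ceiling division)."""
    return (heap_blocks + pages_per_range - 1) // pages_per_range


print(range_bytes(DEFAULT_PAGES_PER_RANGE))   # 1048576 bytes, i.e. 1 MiB
print(range_bytes(MAX_PAGES_PER_RANGE))       # 1073741824 bytes, i.e. 1 GiB
print(ranges_needed(1327434, DEFAULT_PAGES_PER_RANGE))  # 10371
```

So at the default setting each summary entry covers 1MB of heap, and a table of roughly 10GB needs on the order of ten thousand entries.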

I didn't want to rebase past 0ef0b6784 in a hurry.  I only know this
applies cleanly on top of fe7337f2dc, so please use that if you want to
play with it.  I will post a rebased version shortly.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-8.patch

Re: Minmax indexes

From

Robert Haas

Date:

17 June 2014, 14:26:28

On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>> > Here's an updated version of this patch, with fixes to all the bugs
>> > reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
>> > Amit Kapila for the reports.
>>
>> I'm not very happy with the use of a separate relation fork for
>> storing this data.
>
> Here's a new version of this patch.  Now the revmap is not stored in a
> separate fork, but together with all the regular data, as explained
> elsewhere in the thread.

Cool.

Have you thought more about this comment from Heikki?

http://www.postgresql.org/message-id/52495DD3.9010809@vmware.com

I'm concerned that we could end up with one index type of this general
nature for min/max type operations, and then another very similar
index type for geometric operators or text-search operators or what
have you.  Considering the overhead in adding and maintaining an index
AM, I think we should try to be sure that we've done a reasonably
solid job making each one as general as we reasonably can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Andres Freund

Date:

17 June 2014, 14:31:58

On 2014-06-17 10:26:11 -0400, Robert Haas wrote:
> On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Robert Haas wrote:
> >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
> >> <alvherre@2ndquadrant.com> wrote:
> >> > Here's an updated version of this patch, with fixes to all the bugs
> >> > reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
> >> > Amit Kapila for the reports.
> >>
> >> I'm not very happy with the use of a separate relation fork for
> >> storing this data.
> >
> > Here's a new version of this patch.  Now the revmap is not stored in a
> > separate fork, but together with all the regular data, as explained
> > elsewhere in the thread.
> 
> Cool.
> 
> Have you thought more about this comment from Heikki?
> 
> http://www.postgresql.org/message-id/52495DD3.9010809@vmware.com

Is there actually a significant usecase behind that wish or just a
general demand for being generic? To me it seems fairly unlikely you'd
end up with something useful by doing a minmax index over bounding
boxes.

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Greg Stark

Date:

17 June 2014, 15:26:40

On Tue, Jun 17, 2014 at 3:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Is there actually a significant usecase behind that wish or just a
> general demand for being generic? To me it seems fairly unlikely you'd
> end up with something useful by doing a minmax index over bounding
> boxes.

Isn't min/max just a 2d bounding box? If you do a bulk data load of
something like the census data then sure, every page will have data
points for some geometrically clustered set of data.

I had in mind to do a small bloom filter per block. In general any
kind of predicate like bounding box should work.
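
A rough sketch of what a per-block bloom filter summary might look like.  Nothing here comes from the patch: the class, the 256-bit filter size, the three hash functions and the SHA-256 based hashing are all assumptions chosen for illustration.

```python
import hashlib


class BlockBloom:
    """A tiny Bloom filter summarizing the values stored in one heap block."""

    def __init__(self, nbits=256, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0  # bitmap kept in a Python int

    def _positions(self, value):
        # Derive nhashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.nhashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filters(blocks):
    """Build one filter per block of rows."""
    filters = []
    for rows in blocks:
        bloom = BlockBloom()
        for value in rows:
            bloom.add(value)
        filters.append(bloom)
    return filters


def candidate_blocks(filters, value):
    """Blocks an equality search for value would still have to visit."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(value)]


blocks = [["apple", "pear"], ["kiwi"], ["plum", "fig"]]
filters = build_filters(blocks)
print(candidate_blocks(filters, "kiwi"))  # always includes block 1
```

As with min/max, the summary can only produce false positives, never false negatives, so it fits the same "skip blocks that cannot match" scan.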

-- 
greg

Re: Minmax indexes

From

Robert Haas

Date:

17 June 2014, 15:48:20

On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-17 10:26:11 -0400, Robert Haas wrote:
>> On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>> > Robert Haas wrote:
>> >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
>> >> <alvherre@2ndquadrant.com> wrote:
>> >> > Here's an updated version of this patch, with fixes to all the bugs
>> >> > reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
>> >> > Amit Kapila for the reports.
>> >>
>> >> I'm not very happy with the use of a separate relation fork for
>> >> storing this data.
>> >
>> > Here's a new version of this patch.  Now the revmap is not stored in a
>> > separate fork, but together with all the regular data, as explained
>> > elsewhere in the thread.
>>
>> Cool.
>>
>> Have you thought more about this comment from Heikki?
>>
>> http://www.postgresql.org/message-id/52495DD3.9010809@vmware.com
>
> Is there actually a significant usecase behind that wish or just a
> general demand for being generic? To me it seems fairly unlikely you'd
> end up with something useful by doing a minmax index over bounding
> boxes.

Well, I'm not the guy who does things with geometric data, but I don't
want to ignore the significant percentage of our users who are.  As
you must surely know, the GIST implementations for geometric data
types store bounding boxes on internal pages, and that seems to be
useful to people.  What is your reason for thinking that it would be
any less useful in this context?

I do also think that a general demand for being generic ought to carry
some weight.  We have gone to great lengths to make sure that our
indexing can handle more than just < and >, where a lot of other
products have not bothered.  I think we have gotten a lot of mileage
out of that decision and feel that we shouldn't casually back away
from it.  Obviously, we do already have some special-case
optimizations and will likely have more in the future, and there
can certainly be valid reasons for taking that approach. But it needs
to be justified in some way; we shouldn't accept a less-generic
approach blindly, without questioning whether it's possible to do
better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Andres Freund

Date:

17 June 2014, 16:04:43

On 2014-06-17 11:48:10 -0400, Robert Haas wrote:
> On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-06-17 10:26:11 -0400, Robert Haas wrote:
> >> On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera
> >> <alvherre@2ndquadrant.com> wrote:
> >> > Robert Haas wrote:
> >> >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
> >> >> <alvherre@2ndquadrant.com> wrote:
> >> >> > Here's an updated version of this patch, with fixes to all the bugs
> >> >> > reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
> >> >> > Amit Kapila for the reports.
> >> >>
> >> >> I'm not very happy with the use of a separate relation fork for
> >> >> storing this data.
> >> >
> >> > Here's a new version of this patch.  Now the revmap is not stored in a
> >> > separate fork, but together with all the regular data, as explained
> >> > elsewhere in the thread.
> >>
> >> Cool.
> >>
> >> Have you thought more about this comment from Heikki?
> >>
> >> http://www.postgresql.org/message-id/52495DD3.9010809@vmware.com
> >
> > Is there actually a significant usecase behind that wish or just a
> > general demand for being generic? To me it seems fairly unlikely you'd
> > end up with something useful by doing a minmax index over bounding
> > boxes.
> 
> Well, I'm not the guy who does things with geometric data, but I don't
> want to ignore the significant percentage of our users who are.  As
> you must surely know, the GIST implementations for geometric data
> types store bounding boxes on internal pages, and that seems to be
> useful to people.  What is your reason for thinking that it would be
> any less useful in this context?

For me minmax indexes are helpful because they allow generating *small*
'coarse' indexes over large volumes of data. From my pov that's
possible because they don't contain item pointers for every contained
row.
That'll imo work well if there are consecutive rows in the table that
can be summarized into one min/max range. That's quite likely to happen
for common applications of a number of scalar datatypes. But the
likelihood of placing sufficiently many rows with very similar bounding
boxes close together seems much lower in practice. And I think
that's generally the case for operations which can't be well represented
as btree opclasses - the substructure that implies inside a Datum will
make correlation between consecutive rows less likely.

Maybe I've a major intuition failure here though...

> I do also think that a general demand for being generic ought to carry
> some weight.

Agreed. It's always a balancing act. But it's not like this doesn't use a
datatype abstraction concept...

> We have gone to great lengths to make sure that our
> indexing can handle more than just < and >, where a lot of other
> products have not bothered.  I think we have gotten a lot of mileage
> out of that decision and feel that we shouldn't casually back away
> from it.

I don't see this as a case of backing away from that though?

> we shouldn't accept a less-generic
> approach blindly, without questioning whether it's possible to do
> better.

But the aim shouldn't be to add genericity that's not going to be used,
but to add it where it's somewhat likely to help...

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Robert Haas

Date:

17 June 2014, 16:14:12

On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Well, I'm not the guy who does things with geometric data, but I don't
>> want to ignore the significant percentage of our users who are.  As
>> you must surely know, the GIST implementations for geometric data
>> types store bounding boxes on internal pages, and that seems to be
>> useful to people.  What is your reason for thinking that it would be
>> any less useful in this context?
>
> For me minmax indexes are helpful because they allow to generate *small*
> 'coarse' indexes over large volumes of data. From my pov that's possible
> because they don't contain item pointers for every contained
> row.
> That'll imo work well if there are consecutive rows in the table that
> can be summarized into one min/max range. That's quite likely to happen
> for common applications of number of scalar datatypes. But the
> likelihood of placing sufficiently many rows with very similar bounding
> boxes close together seems much less relevant in practice. And I think
> that's generally likely for operations which can't be well represented
> as btree opclasses - the substructure that implies inside a Datum will
> make correlation between consecutive rows less likely.

Well, I don't know: suppose you're loading geospatial data showing the
location of every building in some country.  It might easily be the
case that the data is or can be loaded in an order that provides
pretty good spatial locality, leading to tight bounding boxes over
physically consecutive data ranges.

But I'm not trying to say that we absolutely have to support that kind
of thing; what I am trying to say is that there should be a README or
a mailing list post or some such that says: "We thought about how
generic to make this.   We considered A, B, and C.  We rejected C as
too narrow, and A because if we made it that general it would have
greatly enlarged the disk footprint for the following reasons.
Therefore we selected B."  Basically, I think Heikki asked a good
question - which was "could we abstract this more?" - and I can't
recall seeing a clear answer explaining why we could or couldn't and
what the trade-offs would be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Andres Freund

Date:

17 June 2014, 18:17:08

On 2014-06-17 12:14:00 -0400, Robert Haas wrote:
> On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Well, I'm not the guy who does things with geometric data, but I don't
> >> want to ignore the significant percentage of our users who are.  As
> >> you must surely know, the GIST implementations for geometric data
> >> types store bounding boxes on internal pages, and that seems to be
> >> useful to people.  What is your reason for thinking that it would be
> >> any less useful in this context?
> >
> > For me minmax indexes are helpful because they allow to generate *small*
> > 'coarse' indexes over large volumes of data. From my pov that's possible
> > because they don't contain item pointers for every contained
> > row.
> > That'll imo work well if there are consecutive rows in the table that
> > can be summarized into one min/max range. That's quite likely to happen
> > for common applications of number of scalar datatypes. But the
> > likelihood of placing sufficiently many rows with very similar bounding
> > boxes close together seems much less relevant in practice. And I think
> > that's generally likely for operations which can't be well represented
> > as btree opclasses - the substructure that implies inside a Datum will
> > make correlation between consecutive rows less likely.
> 
> Well, I don't know: suppose you're loading geospatial data showing the
> location of every building in some country.  It might easily be the
> case that the data is or can be loaded in an order that provides
> pretty good spatial locality, leading to tight bounding boxes over
> physically consecutive data ranges.

Well, it might be doable to correlate them along one axis, but along
both?  That's more complicated... And even alongside one axis you
already get into problems if your geometries are irregularly sized.
A single large polygon will completely destroy indexability for anything
stored physically close by because suddenly the minmax range will be
huge... So you'll need to cleverly sort for that as well.

I think hierarchical datastructures are so much better suited for this,
that there's little point trying to fit them into minmax. I can very
well imagine that there's benefit in gist support for only storing one
pointer per block instead of one pointer per item or such. But it seems
like a separate feature.

> But I'm not trying to say that we absolutely have to support that kind
> of thing; what I am trying to say is that there should be a README or
> a mailing list post or some such that says: "We thought about how
> generic to make this.  We considered A, B, and C.  We rejected C as
> too narrow, and A because if we made it that general it would have
> greatly enlarged the disk footprint for the following reasons.
> Therefore we selected B."

Isn't 'simpler implementation' a valid reason that's already been
discussed onlist? Obviously simpler implementation doesn't trump
everything, but it's one valid reason...
Note that I have nothing to do with the design of this feature. I work for
the same company as Alvaro, but that's pretty much it...

> Basically, I think Heikki asked a good
> question - which was "could we abstract this more?" - and I can't
> recall seeing a clear answer explaining why we could or couldn't and
> what the trade-offs would be.

'could we abstract more' imo is a pretty bad design guideline. The question
should be 'is there benefit in abstracting more'. Otherwise you end up with
way too complicated systems.

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Claudio Freire

Date:

17 June 2014, 18:22:58

On Tue, Jun 17, 2014 at 1:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> For me minmax indexes are helpful because they allow to generate *small*
> 'coarse' indexes over large volumes of data. From my pov that's possible
> because they don't contain item pointers for every contained
> row.

But minmax is just a specific form of bloom filter.

This could certainly be generalized to a bloom filter index with some
set of bloom&hashing operators (minmax being just one).

Re: Minmax indexes

From

Josh Berkus

Date:

17 June 2014, 21:35:05

On 06/17/2014 09:14 AM, Robert Haas wrote:
> Well, I don't know: suppose you're loading geospatial data showing the
> location of every building in some country.  It might easily be the
> case that the data is or can be loaded in an order that provides
> pretty good spatial locality, leading to tight bounding boxes over
> physically consecutive data ranges.

I admin a production application which has exactly this.  However, that
application doesn't have big enough data to benefit from minmax indexes;
it uses the basic spatial indexes.

So, my $0.02: bounding box minmax falls under the heading of "would be
nice to have, but not if it delays the feature".

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Minmax indexes

From

Greg Stark

Date:

17 June 2014, 23:48:54

On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Well, it might be doable to correlate them along one axis, but along
> both?  That's more complicated... And even alongside one axis you
> already get into problems if your geometries are irregularly sized.
> A single large polygon will completely destroy indexability for anything
> stored physically close by because suddenly the minmax range will be
> huge... So you'll need to cleverly sort for that as well.

I think there's a misunderstanding here, possibly mine. My
understanding is that a min/max index will always be exactly the same
size for a given size table. It stores the minimum and maximum value
of the key for each page. Then you can do a bitmap scan by comparing
the search key with each page's minimum and maximum to see if that
page needs to be included in the scan. The failure mode is not that
the index is large but that a page that has an outlier will be
included in every scan as a false positive incurring an extra iop.

I don't think it's implausible at all that geometric data would work
well. If you load geometric data it's very common to load data by
geographic area so that all objects in San Francisco are in one part of
the data load, probably even by zip code or census block.

What operations would an opclass for min/max need? I think it would be
pretty similar to the operators that GiST needs (thankfully minus the
baroque page split function):

An aggregate to generate a min/max "bounding box" from several values
A function which takes a "bounding box" and a new value and returns
the new "bounding box"
A function which tests if a value is in a "bounding box"
A function which tests if a "bounding box" overlaps a "bounding box"
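
As a sketch only, the four operations could look like this for a scalar interval and for a 2-D box.  The class and method names are hypothetical and are not the opclass API of the patch.

```python
from functools import reduce


class IntervalOps:
    """'Bounding box' over scalars: a (lo, hi) pair."""

    @staticmethod
    def extend(box, value):
        # Take a bounding box and a new value, return the new bounding box.
        if box is None:
            return (value, value)
        lo, hi = box
        return (min(lo, value), max(hi, value))

    @classmethod
    def summarize(cls, values):
        # Aggregate: build a bounding box from several values.
        return reduce(cls.extend, values, None)

    @staticmethod
    def contains(box, value):
        lo, hi = box
        return lo <= value <= hi

    @staticmethod
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]


class Box2DOps:
    """Bounding box over (x, y) points: (xmin, ymin, xmax, ymax)."""

    @staticmethod
    def extend(box, point):
        x, y = point
        if box is None:
            return (x, y, x, y)
        xmin, ymin, xmax, ymax = box
        return (min(xmin, x), min(ymin, y), max(xmax, x), max(ymax, y))

    @classmethod
    def summarize(cls, points):
        return reduce(cls.extend, points, None)

    @staticmethod
    def contains(box, point):
        x, y = point
        xmin, ymin, xmax, ymax = box
        return xmin <= x <= xmax and ymin <= y <= ymax

    @staticmethod
    def overlaps(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


print(IntervalOps.summarize([5, 1, 9]))              # (1, 9)
print(Box2DOps.summarize([(0, 0), (2, 3), (1, -1)]))  # (0, -1, 2, 3)
```

Both classes expose the same four entry points, so scan code written against them would not need to know whether it is handling integers or geometry.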

The nice thing is this would let us add things like range @> (contains
element) to the plain integer min/max case.

-- 
greg

Re: Minmax indexes

From

Heikki Linnakangas

Date:

18 June 2014, 09:18:46

On 06/17/2014 09:16 PM, Andres Freund wrote:
> On 2014-06-17 12:14:00 -0400, Robert Haas wrote:
>> On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> Well, I'm not the guy who does things with geometric data, but I don't
>>>> want to ignore the significant percentage of our users who are.  As
>>>> you must surely know, the GIST implementations for geometric data
>>>> types store bounding boxes on internal pages, and that seems to be
>>>> useful to people.  What is your reason for thinking that it would be
>>>> any less useful in this context?
>>>
>>> For me minmax indexes are helpful because they allow to generate *small*
>>> 'coarse' indexes over large volumes of data. From my pov that's possible
>>> because they don't contain item pointers for every contained
>>> row.
>>> That'll imo work well if there are consecutive rows in the table that
>>> can be summarized into one min/max range. That's quite likely to happen
>>> for common applications of number of scalar datatypes. But the
>>> likelihood of placing sufficiently many rows with very similar bounding
>>> boxes close together seems much less relevant in practice. And I think
>>> that's generally likely for operations which can't be well represented
>>> as btree opclasses - the substructure that implies inside a Datum will
>>> make correlation between consecutive rows less likely.
>>
>> Well, I don't know: suppose you're loading geospatial data showing the
>> location of every building in some country.  It might easily be the
>> case that the data is or can be loaded in an order that provides
>> pretty good spatial locality, leading to tight bounding boxes over
>> physically consecutive data ranges.
>
> Well, it might be doable to correlate them along one axis, but along
> both?  That's more complicated... And even alongside one axis you
> already get into problems if your geometries are irregularly sized.

Sure, there are cases where it would be useless. But it's easy to 
imagine scenarios where it would work well, where points are loaded in 
clusters and points that are close to each other also end up physically 
close to each other.

> A single large polygon will completely destroy indexability for anything
> stored physically close by because suddenly the minmax range will be
> huge... So you'll need to cleverly sort for that as well.

That's an inherent risk with minmax indexes: insert a few rows to the 
"wrong" locations in the heap, and the selectivity of the index degrades 
rapidly.

The main problem with using it for geometric types is that you can't 
easily CLUSTER the table to make the minmax index effective again. But 
there are ways around that.

>> But I'm not trying to say that we absolutely have to support that kind
>> of thing; what I am trying to say is that there should be a README or
>> a mailing list post or some such that says: "We thought about how
>> generic to make this.  We considered A, B, and C.  We rejected C as
>> too narrow, and A because if we made it that general it would have
>> greatly enlarged the disk footprint for the following reasons.
>> Therefore we selected B."
>
> Isn't 'simpler implementation' a valid reason that's already been
> discussed onlist? Obviously simpler implementation doesn't trump
> everything, but it's one valid reason...
> Note that I have nothing to do with the design of this feature. I work for
> the same company as Alvaro, but that's pretty much it...

Without some analysis (e.g. implementing it and comparing), I don't buy 
that it makes the implementation simpler to restrict it in this way. 
Maybe it does, but often it's actually simpler to solve the general case.

- Heikki

Re: Minmax indexes

From

Andres Freund

Date:

18 June 2014, 10:46:48

On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote:
> On 06/17/2014 09:16 PM, Andres Freund wrote:
> >Well, it might be doable to correlate them along one axis, but along
> >both?  That's more complicated... And even alongside one axis you
> >already get into problems if your geometries are irregularly sized.
> 
> Sure, there are cases where it would be useless. But it's easy to imagine
> scenarios where it would work well, where points are loaded in clusters and
> points that are close to each other also end up physically close to each
> other.

> >A single large polygon will completely destroy indexability for anything
> >stored physically close by because suddenly the minmax range will be
> >huge... So you'll need to cleverly sort for that as well.
> 
> That's an inherent risk with minmax indexes: insert a few rows to the
> "wrong" locations in the heap, and the selectivity of the index degrades
> rapidly.

Sure. But it's fairly normal to have natural clusteredness in many
columns (surrogate keys, date series type of data). Even if you insert
geometric types in geographic clusters you'll have worse results
because some bounding boxes will be big and such.

And:
> The main problem with using it for geometric types is that you can't easily
> CLUSTER the table to make the minmax index effective again. But there are
> ways around that.

Which are? Sure, you can try stuff like recreating the table, sorting
rows with a bounding box area above some threshold first, and then go on
to sort by the top left corner of the bounding box. But that'll be neither
built in, nor convenient, nor perfect. In contrast, a normal CLUSTER
for types with a btree opclass will yield the perfect order.
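
To make the ordering described above concrete, here is a rough sketch.
It is not part of any patch; the Box type, the load_order() function
and the area threshold are all made up for illustration, and "top left
corner" is taken to mean (x_min, y_max).

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned bounding box of one geometry."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def load_order(boxes: List[Box], area_threshold: float) -> List[Box]:
    """Order rows for a table rewrite: oversized boxes first, so they share
    page ranges with each other, then the rest sorted by top left corner."""
    big = [b for b in boxes if b.area() > area_threshold]
    small = [b for b in boxes if b.area() <= area_threshold]
    # Top left corner is (x_min, y_max): sort by x, then by descending y.
    small.sort(key=lambda b: (b.x_min, -b.y_max))
    return big + small


boxes = [Box(5, 5, 6, 6), Box(0, 0, 100, 100), Box(1, 1, 2, 2), Box(1, 3, 2, 4)]
ordered = load_order(boxes, area_threshold=50.0)
```

Sorting on one corner only correlates the data along one axis at a
time, which is why this stays an approximation rather than a perfect
order.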

> >Isn't 'simpler implementation' a valid reason that's already been
> >discussed onlist? Obviously simpler implementation doesn't trump
> >everything, but it's one valid reason...
> >Note that I have zap to do with the design of this feature. I work for
> >the same company as Alvaro, but that's pretty much it...
> 
> Without some analysis (e.g. implementing it and comparing), I don't buy that
> it makes the implementation simpler to restrict it in this way. Maybe it
> does, but often it's actually simpler to solve the general case.

So to implement a feature one now has to implement the most generic
variant as a prototype first? Really?

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Andres Freund

Date:

18 June 2014, 10:53:40

On 2014-06-17 16:48:07 -0700, Greg Stark wrote:
> On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Well, it might be doable to correlate them along one axis, but along
> > both?  That's more complicated... And even alongside one axis you
> > already get into problems if your geometries are irregularly sized.
> > A single large polygon will completely destroy indexability for anything
> > stored physically close by because suddenly the minmax range will be
> > huge... So you'll need to cleverly sort for that as well.
> 
> I think there's a misunderstanding here, possibly mine. My
> understanding is that a min/max index will always be exactly the same
> size for a given size table. It stores the minimum and maximum value
> of the key for each page. Then you can do a bitmap scan by comparing
> the search key with each page's minimum and maximum to see if that
> page needs to be included in the scan. The failure mode is not that
> the index is large but that a page that has an outlier will be
> included in every scan as a false positive incurring an extra iop.

I just rechecked, and no, it doesn't, by default, store a range for each
page. It's MINMAX_DEFAULT_PAGES_PER_RANGE=128 pages by
default... I haven't checked what the lowest value it can be set to is.
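
For illustration only, a toy model of the per-range summary and the
false-positive failure mode discussed above. This is not the patch's
code: summarize() and ranges_to_scan() are invented names, and heap
pages are modelled as plain lists of integers.

```python
from typing import List, Tuple

PAGES_PER_RANGE = 128  # mirrors MINMAX_DEFAULT_PAGES_PER_RANGE from the thread


def summarize(pages: List[List[int]],
              pages_per_range: int = PAGES_PER_RANGE) -> List[Tuple[int, int]]:
    """Return one (min, max) pair per group of pages_per_range heap pages."""
    summaries = []
    for start in range(0, len(pages), pages_per_range):
        values = [v for page in pages[start:start + pages_per_range] for v in page]
        summaries.append((min(values), max(values)))
    return summaries


def ranges_to_scan(summaries: List[Tuple[int, int]], lo: int, hi: int) -> List[int]:
    """Indexes of the page ranges whose [min, max] overlaps the query [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(summaries) if mn <= hi and lo <= mx]


# 512 pages of 10 values each, loaded in ascending order: 4 page ranges.
pages = [[p * 10 + i for i in range(10)] for p in range(512)]
clustered = summarize(pages)
hits_before = ranges_to_scan(clustered, 3000, 3010)

# One outlier row lands in the first page range and widens its summary.
pages[0].append(5000)
hits_after = ranges_to_scan(summarize(pages), 3000, 3010)
```

The index stays the same size either way; the outlier only makes the
first range match queries it has no real rows for.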


Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Heikki Linnakangas

Date:

18 June 2014, 11:51:44

On 06/18/2014 01:46 PM, Andres Freund wrote:
> On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote:
>> The main problem with using it for geometric types is that you can't easily
>> CLUSTER the table to make the minmax index effective again. But there are
>> ways around that.
>
> Which are? Sure, you can try stuff like recreating the table, sorting
> rows with a bounding box area above some threshold first, and then go on
> to sort by the top left corner of the bounding box.

Right, something like that. Or cluster using some other column that 
correlates with the geometry, like a zip code.

> But that'll be neither
> builtin, nor convenient, nor perfect. In contrast to a normal CLUSTER
> for types with a btree opclass which will yield the perfect order.

Sure.

BTW, CLUSTERing by a geometric type would be useful anyway, even without 
minmax indexes.

>>> Isn't 'simpler implementation' a valid reason that's already been
>>> discussed onlist? Obviously simpler implementation doesn't trump
>>> everything, but it's one valid reason...
>>> Note that I have zap to do with the design of this feature. I work for
>>> the same company as Alvaro, but that's pretty much it...
>>
>> Without some analysis (e.g. implementing it and comparing), I don't buy that
>> it makes the implementation simpler to restrict it in this way. Maybe it
>> does, but often it's actually simpler to solve the general case.
>
> So to implement a feature one now has to implement the most generic
> variant as a prototype first? Really?

Implementing something is a good way to demonstrate what it would look 
like. But no, I don't insist on implementing every possible design 
whenever a new feature is proposed.

I liked Greg's sketch of what the opclass support functions would be. It 
doesn't seem significantly more complicated than what's in the patch now.

- Heikki

Re: Minmax indexes

From

Robert Haas

Date:

18 June 2014, 12:03:55

On Tue, Jun 17, 2014 at 2:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> But I'm not trying to say that we absolutely have to support that kind
>> of thing; what I am trying to say is that there should be a README or
>> a mailing list post or some such that says: "We thought about how
>> generic to make this.  We considered A, B, and C.  We rejected C as
>> too narrow, and A because if we made it that general it would have
>> greatly enlarged the disk footprint for the following reasons.
>> Therefore we selected B."
>
> Isn't 'simpler implementation' a valid reason that's already been
> discussed onlist? Obviously simpler implementation doesn't trump
> everything, but it's one valid reason...
> Note that I have zap to do with the design of this feature. I work for
> the same company as Alvaro, but that's pretty much it...

It really *hasn't* been discussed on-list.  See these emails,
discussing the same ideas, from 8 months ago:

http://www.postgresql.org/message-id/5249B2D3.6030606@vmware.com
http://www.postgresql.org/message-id/CA+TgmoYSCbW-UC8LQV96sziGnXSuzAyQbfdQmK-FGu22HdKkaw@mail.gmail.com

Now, Alvaro did not respond to those emails, nor did anyone involved
in the development of the feature.  There may be an argument that
implementing that would be too complicated, but Heikki said he didn't
think it would be, and nobody's made a concrete argument as to why
he's wrong (and Heikki knows a lot about indexing).

>> Basically, I think Heikki asked a good
>> question - which was "could we abstract this more?" - and I can't
>> recall seeing a clear answer explaining why we could or couldn't and
>> what the trade-offs would be.
>
> 'could we abstract more' imo is a pretty bad design guideline. It's 'is
> there benefit in abstracting more'. Otherwise you end up with way too
> complicated systems.

On the flip side, if you don't abstract enough, you end up being able
to cover only a small set of the relevant use cases, or else you end
up with a bunch of poorly-coordinated tools to cover slightly
different use cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Vik Fearing

Date:

19 June 2014, 09:42:26

On 06/18/2014 12:46 PM, Andres Freund wrote:
>>> Isn't 'simpler implementation' a valid reason that's already been
>>> discussed onlist? Obviously simpler implementation doesn't trump
>>> everything, but it's one valid reason...
>>> Note that I have zap to do with the design of this feature. I work for
>>> the same company as Alvaro, but that's pretty much it...
>>
>> Without some analysis (e.g. implementing it and comparing), I don't buy that
>> it makes the implementation simpler to restrict it in this way. Maybe it
>> does, but often it's actually simpler to solve the general case.
>
> So to implement a feature one now has to implement the most generic
> variant as a prototype first? Really?

Well, there is the inventor's paradox to consider.
-- 
Vik

Re: Minmax indexes

From

Heikki Linnakangas

Date:

19 June 2014, 13:06:15

On 06/18/2014 06:09 PM, Claudio Freire wrote:
> On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark <stark@mit.edu> wrote:
>> An aggregate to generate a min/max "bounding box" from several values
>> A function which takes a "bounding box" and a new value and returns
>> the new "bounding box"
>> A function which tests if a value is in a "bounding box"
>> A function which tests if a "bounding box" overlaps a "bounding box"
>
> Which I'd generalize a bit further by renaming "bounding box" with
> "compressed set", and allow it to be parameterized.

What do you mean by parameterized?

> So, you have:
>
> An aggregate to generate a "compressed set" from several values
> A function which adds a new value to the "compressed set" and returns
> the new "compressed set"
> A function which tests if a value is in a "compressed set"
> A function which tests if a "compressed set" overlaps another
> "compressed set" of equal type

Yeah, something like that. I'm not sure I like the "compressed set" term 
any more than bounding box, though. GiST seems to have avoided naming 
the thing, and just talks about "index entries". But if we can come up 
with a good name, that would be more clear.

> One problem with such a generalized implementation would be that I'm
> not sure in-place modification of the "compressed set" on-disk can be
> assumed to be safe in all cases. Surely, for strictly-enlarging sets
> it would, but while min/max and bloom filters both fit the bill, it's
> not clear that one can assume this for all structures.

I don't understand what you mean. It's a fundamental property of minmax 
indexes that you can always replace the "min" or "max" or "compressing 
set" or "bounding box" or whatever with another datum that represents 
all the keys that the old one did, plus some.

- Heikki

Re: Minmax indexes

From

Tom Lane

Date:

19 June 2014, 13:43:38

Vik Fearing <vik.fearing@dalibo.com> writes:
> On 06/18/2014 12:46 PM, Andres Freund wrote:
>> So to implement a feature one now has to implement the most generic
>> variant as a prototype first? Really?

> Well, there is the inventor's paradox to consider.

I have not seen anyone demanding a different implementation in this
thread.  What *has* been asked for, and not supplied, is a concrete
defense of the particular level of generality that's been selected
in this implementation.  It's not at all clear to the rest of us
whether it was the right choice, and that is something that ought
to be asked now not later.
        regards, tom lane

Re: Minmax indexes

From

Greg Stark

Date:

19 June 2014, 16:33:03

On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Implementing something is a good way to demonstrate what it would look like.
> But no, I don't insist on implementing every possible design whenever a new
> feature is proposed.
>
> I liked Greg's sketch of what the opclass support functions would be. It
> doesn't seem significantly more complicated than what's in the patch now.

As a counter-point to my own point there will be nothing stopping us
in the future from generalizing things. Dealing with catalogs is
mostly book-keeping headaches and careful work. It's something that
might be well-suited for a GSOC or first patch from someone looking to
familiarize themselves with the system architecture. It's hard to
invent a whole new underlying infrastructure at the same time as
dealing with all that book-keeping and it's hard for someone
familiarizing themselves with the system to also have a great new
idea. Having tasks like this that are easy to explain and that a mentor
understands well can be easier to manage than tasks where the newcomer
has some radical new idea.

-- 
greg

Re: Minmax indexes

From

Claudio Freire

Date:

19 June 2014, 17:20:41

On Thu, Jun 19, 2014 at 10:06 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 06/18/2014 06:09 PM, Claudio Freire wrote:
>>
>> On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark <stark@mit.edu> wrote:
>>>
>>> An aggregate to generate a min/max "bounding box" from several values
>>> A function which takes a "bounding box" and a new value and returns
>>> the new "bounding box"
>>> A function which tests if a value is in a "bounding box"
>>> A function which tests if a "bounding box" overlaps a "bounding box"
>>
>>
>> Which I'd generalize a bit further by renaming "bounding box" with
>> "compressed set", and allow it to be parameterized.
>
>
> What do you mean by parameterized?

Bloom filters can be parameterized by the number of hashes, the number
of bit positions, and the hash function, so it's not simply a bloom
filter index, but a bloom filter index with N SHA-1-based hashes spread
over a K-length bitmap.
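
A minimal sketch of what "parameterized" could look like, purely as an
illustration. The BloomSummary class is hypothetical, and deriving the
N hash functions by salting SHA-1 with the hash index is an assumption
made here, not something proposed in the thread.

```python
import hashlib
from typing import Iterator


class BloomSummary:
    """Hypothetical per-range summary: n_hashes SHA-1-based hashes over an
    n_bits bitmap. The bitmap is kept in a Python int for simplicity."""

    def __init__(self, n_hashes: int, n_bits: int, bits: int = 0) -> None:
        self.n_hashes = n_hashes
        self.n_bits = n_bits
        self.bits = bits

    def _positions(self, value: str) -> Iterator[int]:
        # Assumption: the i-th hash function is SHA-1 salted with i.
        for i in range(self.n_hashes):
            digest = hashlib.sha1(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, value: str) -> "BloomSummary":
        """Return a new summary that also covers value; bits are only set."""
        bits = self.bits
        for pos in self._positions(value):
            bits |= 1 << pos
        return BloomSummary(self.n_hashes, self.n_bits, bits)

    def might_contain(self, value: str) -> bool:
        """False means definitely absent; True may be a false positive."""
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


empty = BloomSummary(n_hashes=3, n_bits=256)
summary = empty.add("alice").add("bob")
```

Two indexes built with different N or K would not have comparable
summaries, which is why the parameters have to be part of the index
definition.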

>> So, you have:
>>
>> An aggregate to generate a "compressed set" from several values
>> A function which adds a new value to the "compressed set" and returns
>> the new "compressed set"
>> A function which tests if a value is in a "compressed set"
>> A function which tests if a "compressed set" overlaps another
>> "compressed set" of equal type
>
>
> Yeah, something like that. I'm not sure I like the "compressed set" term any
> more than bounding box, though. GiST seems to have avoided naming the thing,
> and just talks about "index entries". But if we can come up with a good
> name, that would be more clear.

I don't want to use the term bloom filter since it's tied to one
specific technique, but it's basically that - an approximate set
without false negatives. I.e.: a compressed set.

It's not a bounding box either when using bloom filters. So...

>> One problem with such a generalized implementation would be that I'm
>> not sure in-place modification of the "compressed set" on-disk can be
>> assumed to be safe in all cases. Surely, for strictly-enlarging sets
>> it would, but while min/max and bloom filters both fit the bill, it's
>> not clear that one can assume this for all structures.
>
>
> I don't understand what you mean. It's a fundamental property of minmax
> indexes that you can always replace the "min" or "max" or "compressing set"
> or "bounding box" or whatever with another datum that represents all the
> keys that the old one did, plus some.

Yes, and bloom filters happen to fall on that category too.

Never mind what I said. I was thinking of some other potential,
imaginary implementation that supports removal or updates, which might
need care with transaction lifetimes, but that's easily fixed by
letting vacuum or some lazy process do the deleting, just as happens
with other indexes anyway.

So, I guess the interface must also include the invariant that
compressed sets only grow and never shrink, except through a rebuild or
a vacuum operation.

Re: Minmax indexes

From

Claudio Freire

Date:

19 June 2014, 19:01:26

On Wed, Jun 18, 2014 at 8:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>
> I liked Greg's sketch of what the opclass support functions would be. It
> doesn't seem significantly more complicated than what's in the patch now.

Which was

On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark <stark@mit.edu> wrote:
> An aggregate to generate a min/max "bounding box" from several values
> A function which takes a "bounding box" and a new value and returns
> the new "bounding box"
> A function which tests if a value is in a "bounding box"
> A function which tests if a "bounding box" overlaps a "bounding box"

Which I'd generalize a bit further by renaming "bounding box" with
"compressed set", and allow it to be parameterized.

So, you have:

An aggregate to generate a "compressed set" from several values
A function which adds a new value to the "compressed set" and returns
the new "compressed set"
A function which tests if a value is in a "compressed set"
A function which tests if a "compressed set" overlaps another
"compressed set" of equal type

If you can define different compressed sets, you can use this to
generate both min/max indexes as well as bloom filter indexes. Whether
we'd want to have both is perhaps questionable, but having the ability
to is probably desirable.
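
As a rough sketch of the four support functions listed above, with
min/max as one possible "compressed set". The class and method names
(CompressedSet, MinMax, aggregate, add, contains, overlaps) are
invented for this illustration and do not correspond to anything in
the patch or in the catalogs.

```python
from abc import ABC, abstractmethod
from typing import Iterable


class CompressedSet(ABC):
    """Hypothetical interface: an approximate set with no false negatives."""

    @classmethod
    @abstractmethod
    def aggregate(cls, values: Iterable[int]) -> "CompressedSet":
        """Build a summary from several values."""

    @abstractmethod
    def add(self, value: int) -> "CompressedSet":
        """Return a summary covering the old contents plus value."""

    @abstractmethod
    def contains(self, value: int) -> bool:
        """False only if value is definitely not in the set."""

    @abstractmethod
    def overlaps(self, other: "CompressedSet") -> bool:
        """False only if the two sets definitely share no value."""


class MinMax(CompressedSet):
    def __init__(self, lo: int, hi: int) -> None:
        self.lo, self.hi = lo, hi

    @classmethod
    def aggregate(cls, values: Iterable[int]) -> "MinMax":
        values = list(values)  # assumes at least one value
        return cls(min(values), max(values))

    def add(self, value: int) -> "MinMax":
        return MinMax(min(self.lo, value), max(self.hi, value))

    def contains(self, value: int) -> bool:
        return self.lo <= value <= self.hi

    def overlaps(self, other: "MinMax") -> bool:
        return self.lo <= other.hi and other.lo <= self.hi
```

A bloom-filter summary would supply the same four operations with a
different representation; code that only calls the interface would not
need to know which one it is holding.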

One problem with such a generalized implementation would be that I'm
not sure in-place modification of the "compressed set" on-disk can be
assumed to be safe in all cases. Surely, for strictly-enlarging sets
it would, but while min/max and bloom filters both fit the bill, it's
not clear that one can assume this for all structures.

Also adding an "in-place updateable" bit to the "type" would perhaps
inflate the complexity of the patch due to the need to provide both
code paths?

Re: Minmax indexes

From

Martijn van Oosterhout

Date:

21 June 2014, 18:09:19

I'm sorry if I missed something, but ISTM this is beginning to look a
lot like GiST. This was pointed out by Robert Haas last year.

On Wed, Jun 18, 2014 at 12:09:42PM -0300, Claudio Freire wrote:
> So, you have:
>
> An aggregate to generate a "compressed set" from several values

Which GiST does by calling 'compress' on each value, and then 'union's
the results together.

> A function which adds a new value to the "compressed set" and returns
> the new "compressed set"

Again, 'compress' + 'union'

> A function which tests if a value is in a "compressed set"

Which GiST does using 'compress' + 'consistent'

> A function which tests if a "compressed set" overlaps another
> "compressed set" of equal type

Which GiST calls 'consistent'

So I'm wondering why you can't just reuse the btree_gist functions we
already have in contrib.  It seems to me that these MinMax indexes are
in fact a variation on GiST that indexes the pages of a table based
upon the 'union' of all the elements in a page.  By reusing the GiST
operator class you get support for many datatypes for free.
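
The mapping above, written out as a sketch. These are simplified
Python stand-ins for integer ranges, not the real GiST support
function signatures (the real 'consistent', for instance, also takes
a strategy number); all names are chosen for illustration.

```python
from functools import reduce
from typing import Iterable, Tuple

Entry = Tuple[int, int]  # an index entry: a closed integer range


# Stand-ins for the three GiST-style primitives.
def compress(value: int) -> Entry:
    return (value, value)


def union(a: Entry, b: Entry) -> Entry:
    return (min(a[0], b[0]), max(a[1], b[1]))


def consistent(entry: Entry, query: Entry) -> bool:
    return entry[0] <= query[1] and query[0] <= entry[1]


# The four operations from the list, expressed only via the primitives.
def summarize(values: Iterable[int]) -> Entry:
    """'compress' each value, then 'union' the results (non-empty input)."""
    return reduce(union, map(compress, values))


def add_value(entry: Entry, value: int) -> Entry:
    return union(entry, compress(value))


def value_in(entry: Entry, value: int) -> bool:
    return consistent(entry, compress(value))


def sets_overlap(a: Entry, b: Entry) -> bool:
    return consistent(a, b)
```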

> If you can define different compressed sets, you can use this to
> generate both min/max indexes as well as bloom filter indexes. Whether
> we'd want to have both is perhaps questionable, but having the ability
> to is probably desirable.

You could implement bloom filter in GiST too. It's been discussed
before but I can't find any implementation. Probably because the filter
needs to be parameterised and if you store the bloom filter for each
element it gets expensive very quickly. However, hooked into a minmax
structure which only indexes whole pages it could be quite efficient.

> One problem with such a generalized implementation would be that I'm
> not sure in-place modification of the "compressed set" on-disk can be
> assumed to be safe in all cases. Surely, for strictly-enlarging sets
> it would, but while min/max and bloom filters both fit the bill, it's
> not clear that one can assume this for all structures.

I think GiST has already solved this problem.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: Minmax indexes

From

Heikki Linnakangas

Date:

23 June 2014, 15:22:06

Some comments, aside from the design wrt. bounding boxes etc. :

On 06/15/2014 05:34 AM, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>> Here's an updated version of this patch, with fixes to all the bugs
>>> reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
>>> Amit Kapila for the reports.
>>
>> I'm not very happy with the use of a separate relation fork for
>> storing this data.
>
> Here's a new version of this patch.  Now the revmap is not stored in a
> separate fork, but together with all the regular data, as explained
> elsewhere in the thread.

Thanks! Please update the README accordingly.

If I understand the code correctly, the revmap is a three-level deep 
structure. The bottom level consists of "regular revmap pages", and each 
regular revmap page is filled with ItemPointerDatas, which point to the 
index tuples. The middle level consists of "array revmap pages", and 
each array revmap page contains an array of BlockNumbers of the "regular 
revmap" pages. The top level is an array of BlockNumbers of the array 
revmap pages, and it is stored in the metapage.

With 8k block size, that's just enough to cover the full range of 2^32-1 
blocks that you'll need if you set mm_pages_per_range=1. Each regular 
revmap page can store about 8192/6 = 1365 item pointers, each array 
revmap page can store about 8192/4 = 2048 block references, and the size 
of the top array is 8192/4. That's just enough; to store the required 
number of array pages in the top array, the array needs to be 
(2^32/1365)/2048 = 1536 elements large.

But with 4k or smaller blocks, it's not enough.
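
The arithmetic above can be checked with a few lines. This is only a
back-of-the-envelope sketch using the same approximations as the text
(page headers are ignored, and the metapage array is taken to hold
block_size/4 entries); top_array_entries() is an invented helper.

```python
from typing import Tuple

ITEM_POINTER_SIZE = 6   # bytes per ItemPointerData
BLOCK_NUMBER_SIZE = 4   # bytes per BlockNumber
MAX_BLOCKS = 2**32      # heap blocks to cover, rounded up from 2^32-1


def top_array_entries(block_size: int, pages_per_range: int = 1) -> Tuple[int, int]:
    """Return (entries needed, entries available) for the metapage array."""
    tids_per_revmap_page = block_size // ITEM_POINTER_SIZE
    refs_per_array_page = block_size // BLOCK_NUMBER_SIZE
    # -(-a // b) is ceiling division: a partially filled page still counts.
    ranges = -(-MAX_BLOCKS // pages_per_range)
    revmap_pages = -(-ranges // tids_per_revmap_page)
    array_pages = -(-revmap_pages // refs_per_array_page)
    return array_pages, refs_per_array_page


needed_8k, available_8k = top_array_entries(8192)
needed_4k, available_4k = top_array_entries(4096)
```

With 8k blocks the needed count stays below the 2048 available
entries; with 4k blocks both per-page capacities halve, so the needed
count roughly quadruples while the available count halves.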

I wonder if it would be simpler to just always store the revmap pages in 
the beginning of the index, before any other pages. Finding the revmap 
page would then be just as easy as with a separate fork. When the 
table/index is extended so that a new revmap page is needed, move the 
existing page at that block out of the way. Locking needs some 
consideration, but I think it would be feasible and simpler than you 
have now.

> I have followed the suggestion by Amit to overwrite the index tuple when
> a new heap tuple is inserted, instead of creating a separate index
> tuple.  This saves a lot of index bloat.  This required a new entry
> point in bufpage.c, PageOverwriteItemData().  bufpage.c also has a new
> function PageIndexDeleteNoCompact which is similar in spirit to
> PageIndexMultiDelete except that item pointers do not change.  This is
> necessary because the revmap stores item pointers, and such reference
> would break if we were to renumber items in index pages.

ISTM that when the old tuple cannot be updated in-place, the new index 
tuple is inserted with mm_doinsert(), but the old tuple is never deleted.

- Heikki


Re: Minmax indexes

From

Alvaro Herrera

Date:

23 June 2014, 17:08:05

Heikki Linnakangas wrote:
> Some comments, aside from the design wrt. bounding boxes etc. :

Thanks.  I haven't commented on that sub-thread because I think it's
possible to come up with a reasonable design that solves the issue by
adding a couple of amprocs.  I need to do some more thinking to ensure
it is really workable, and then I'll post my ideas.

> On 06/15/2014 05:34 AM, Alvaro Herrera wrote:
> >Robert Haas wrote:

> If I understand the code correctly, the revmap is a three-level deep
> structure. The bottom level consists of "regular revmap pages", and
> each regular revmap page is filled with ItemPointerDatas, which
> point to the index tuples. The middle level consists of "array
> revmap pages", and each array revmap page contains an array of
> BlockNumbers of the "regular revmap" pages. The top level is an
> array of BlockNumbers of the array revmap pages, and it is stored in
> the metapage.

Yep, that's correct.  Essentially, we still have the revmap as a linear
space (containing TIDs); the other two levels on top of that are only
there to enable locating the physical page numbers for each revmap
logical page.  I make one exception that the first logical revmap page
is always stored in page 1, to optimize the case of a smallish table
(~1360 page ranges; approximately 1.3 gigabytes of data at 128 pages per
range, or 170 megabytes at 16 pages per range.)

Each page has a page header (24 bytes) and special space (4 bytes), so
it has 8192-28=8164 bytes available for data, so 1360 item pointers.
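
Spelled out as a quick calculation (a sketch only; the constants are
the ones quoted above and first_revmap_page_coverage() is a made-up
helper):

```python
from typing import Tuple

BLOCK_SIZE = 8192
PAGE_HEADER = 24        # bytes, as stated above
SPECIAL_SPACE = 4       # bytes, as stated above
ITEM_POINTER_SIZE = 6   # bytes per item pointer


def first_revmap_page_coverage(pages_per_range: int) -> Tuple[int, int]:
    """Return (item pointers on one revmap page, heap bytes they cover)."""
    usable = BLOCK_SIZE - PAGE_HEADER - SPECIAL_SPACE
    tids = usable // ITEM_POINTER_SIZE
    return tids, tids * pages_per_range * BLOCK_SIZE


tids, bytes_at_128 = first_revmap_page_coverage(128)
_, bytes_at_16 = first_revmap_page_coverage(16)
mib_at_128 = bytes_at_128 // 2**20
mib_at_16 = bytes_at_16 // 2**20
```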

> With 8k block size, that's just enough to cover the full range of
> 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each
> regular revmap page can store about 8192/6 = 1365 item pointers,
> each array revmap page can store about 8192/4 = 2048 block
> references, and the size of the top array is 8192/4. That's just
> enough; to store the required number of array pages in the top
> array, the array needs to be (2^32/1365)/2048 = 1536 elements large.
> 
> But with 4k or smaller blocks, it's not enough.

Yeah.  As I said elsewhere, actual useful values are likely to be close
to the read-ahead setting of the underlying disk; by default that'd be
16 pages (128kB), but I think it's common advice to increase the kernel
setting to improve performance.  I don't think we need to prevent
minmax indexes with pages_per_range=1, but I don't think we need to
ensure that that setting works with the largest tables, either, because
it doesn't make any sense to set it up like that.

Also, while there are some recommendations to set up a system with
larger page sizes (32kB), I have never seen any recommendation to set it
lower.  It wouldn't make sense to build a system that has very large
tables and use a smaller page size.

So in other words, yes, you're correct that the mechanism doesn't work
in some cases (small page size and index configured for highest level of
detail), but the conditions are such that I don't think it matters.

ISTM the thing to do here is to do the math at index creation time, and
if we find that we don't have enough space in the metapage for all array
revmap pointers we need, bail out and require the index to be created
with a larger pages_per_range setting.
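
A sketch of what such a creation-time check might look like. The
function names and the error text are hypothetical, and the capacity
formula uses the rough per-page counts from earlier in the thread
(block_size/4 block references, block_size/6 item pointers) rather
than exact page layouts.

```python
ITEM_POINTER_SIZE = 6
BLOCK_NUMBER_SIZE = 4
MAX_HEAP_BLOCKS = 2**32


def addressable_heap_blocks(block_size: int, pages_per_range: int) -> int:
    """Heap blocks reachable through metapage -> array pages -> revmap pages."""
    refs = block_size // BLOCK_NUMBER_SIZE   # entries in metapage / array page
    tids = block_size // ITEM_POINTER_SIZE   # entries in a regular revmap page
    return refs * refs * tids * pages_per_range


def check_index_options(block_size: int, pages_per_range: int) -> None:
    """Refuse settings under which the revmap could not cover a full table."""
    if addressable_heap_blocks(block_size, pages_per_range) < MAX_HEAP_BLOCKS:
        raise ValueError(
            f"pages_per_range={pages_per_range} is too small for "
            f"block size {block_size}; use a larger pages_per_range"
        )


check_index_options(8192, 1)  # accepted: 8k blocks cover 2^32 heap blocks
```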

> I wonder if it would be simpler to just always store the revmap
> pages in the beginning of the index, before any other pages. Finding
> the revmap page would then be just as easy as with a separate fork.
> When the table/index is extended so that a new revmap page is
> needed, move the existing page at that block out of the way. Locking
> needs some consideration, but I think it would be feasible and
> simpler than you have now.

Moving index items around is not easy, because you'd have to adjust the
revmap to rewrite the item pointers.

> >I have followed the suggestion by Amit to overwrite the index tuple when
> >a new heap tuple is inserted, instead of creating a separate index
> >tuple.  This saves a lot of index bloat.  This required a new entry
> >point in bufpage.c, PageOverwriteItemData().  bufpage.c also has a new
> >function PageIndexDeleteNoCompact which is similar in spirit to
> >PageIndexMultiDelete except that item pointers do not change.  This is
> >necessary because the revmap stores item pointers, and such reference
> >would break if we were to renumber items in index pages.
> 
> ISTM that when the old tuple cannot be updated in-place, the new
> index tuple is inserted with mm_doinsert(), but the old tuple is
> never deleted.

It's deleted by the next vacuum.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Heikki Linnakangas

Date:

23 June 2014, 19:11:15

On 06/23/2014 08:07 PM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> With 8k block size, that's just enough to cover the full range of
>> 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each
>> regular revmap page can store about 8192/6 = 1365 item pointers,
>> each array revmap page can store about 8192/4 = 2048 block
>> references, and the size of the top array is 8192/4. That's just
>> enough; to store the required number of array pages in the top
>> array, the array needs to be (2^32/1365)/2048 = 1536 elements large.
>>
>> But with 4k or smaller blocks, it's not enough.
>
> Yeah.  As I said elsewhere, actual useful values are likely to be close
> to the read-ahead setting of the underlying disk; by default that'd be
> 16 pages (128kB), but I think it's common advice to increase the kernel
> setting to improve performance.

My gut feeling is that it might well be best to set pages_per_range=1. 
Even if you do the same amount of I/O, thanks to kernel read-ahead, you 
might still avoid processing a lot of tuples. But I would need to see 
some benchmarks to know.

> I don't think we need to prevent
> minmax indexes with pages_per_range=1, but I don't think we need to
> ensure that that setting works with the largest tables, either, because
> it doesn't make any sense to set it up like that.
>
> Also, while there are some recommendations to set up a system with
> larger page sizes (32kB), I have never seen any recommendation to set it
> lower.  It wouldn't make sense to build a system that has very large
> tables and use a smaller page size.
>
> So in other words, yes, you're correct that the mechanism doesn't work
> in some cases (small page size and index configured for highest level of
> detail), but the conditions are such that I don't think it matters.
>
> ISTM the thing to do here is to do the math at index creation time, and
> if we find that we don't have enough space in the metapage for all array
> revmap pointers we need, bail out and require the index to be created
> with a larger pages_per_range setting.

Yeah, I agree that would be acceptable.

I feel that the below would nevertheless be simpler:

>> I wonder if it would be simpler to just always store the revmap
>> pages in the beginning of the index, before any other pages. Finding
>> the revmap page would then be just as easy as with a separate fork.
>> When the table/index is extended so that a new revmap page is
>> needed, move the existing page at that block out of the way. Locking
>> needs some consideration, but I think it would be feasible and
>> simpler than you have now.
>
> Moving index items around is not easy, because you'd have to adjust the
> revmap to rewrite the item pointers.

Hmm. Two alternative schemes come to mind:

1. Move each index tuple off the page individually, updating the revmap 
while you do it, until the page is empty. Updating the revmap for a 
single index tuple isn't difficult; you have to do it anyway when an 
index tuple is replaced. (MMTuples don't contain a heap block number 
ATM, but IMHO they should, see below)

2. Store the new block number of the page that you moved out of the way 
in the revmap page, and leave the revmap pointers unchanged. The revmap 
pointers can be updated later, lazily.

Both of those seem pretty straightforward.
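
A toy model of scheme 1, just to show the order of operations. It
ignores locking, WAL and free-space management entirely; the data
layout and evacuate_page() are invented, and it assumes each index
tuple records which page range it summarizes (which, as noted, the
patch's tuples do not do at the moment).

```python
from typing import Dict, Tuple

TID = Tuple[int, int]                  # (index block number, offset)
IndexTuple = Tuple[int, Tuple[int, int]]  # (page range number, (min, max))


def evacuate_page(pages: Dict[int, Dict[int, IndexTuple]],
                  revmap: Dict[int, TID],
                  victim: int, target: int) -> None:
    """Move every tuple off block `victim` onto block `target`, one at a
    time, repointing the revmap entry before the old copy is removed."""
    for offset in sorted(pages[victim]):
        range_no, summary = pages[victim][offset]
        new_offset = max(pages[target], default=0) + 1
        pages[target][new_offset] = (range_no, summary)
        revmap[range_no] = (target, new_offset)
        del pages[victim][offset]


pages = {5: {1: (0, (1, 9)), 2: (1, (10, 20))}, 9: {1: (2, (21, 30))}}
revmap = {0: (5, 1), 1: (5, 2), 2: (9, 1)}
evacuate_page(pages, revmap, victim=5, target=9)
# Block 5 is now empty and could be reused as a revmap page.
```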

>>> I have followed the suggestion by Amit to overwrite the index tuple when
>>> a new heap tuple is inserted, instead of creating a separate index
>>> tuple.  This saves a lot of index bloat.  This required a new entry
>>> point in bufpage.c, PageOverwriteItemData().  bufpage.c also has a new
>>> function PageIndexDeleteNoCompact which is similar in spirit to
>>> PageIndexMultiDelete except that item pointers do not change.  This is
>>> necessary because the revmap stores item pointers, and such reference
>>> would break if we were to renumber items in index pages.
>>
>> ISTM that when the old tuple cannot be updated in-place, the new
>> index tuple is inserted with mm_doinsert(), but the old tuple is
>> never deleted.
>
> It's deleted by the next vacuum.

Ah I see. Vacuum reads the whole index, and builds an in-memory hash 
table that contains an ItemPointerData for every tuple in the index. 
Doesn't that require a lot of memory, for a large index? That might be 
acceptable - you ought to have plenty of RAM if you're pushing around 
multi-terabyte tables - but it would nevertheless be nice to not have a 
hard requirement for something as essential as vacuum.

In addition to the hash table, remove_deletable_tuples() pallocs an 
array to hold an ItemPointer for every index tuple about to be removed. 
A single palloc is limited to 1GB, so that will fail outright if there 
are more than ~179 million dead index tuples. You're unlikely to hit 
that in practice, but if you do, you'll never be able to vacuum the 
index. So that's not very nice.
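
For reference, the ~179 million figure can be reproduced with a small piece of arithmetic. This is only a sketch in Python, not PostgreSQL code; it assumes the palloc limit is MaxAllocSize (1GB minus one byte) and that an ItemPointerData occupies 6 bytes (a 4-byte block number plus a 2-byte offset).

```python
# Toy arithmetic only; both constants are assumptions taken from the
# PostgreSQL sources, not values read from a running server.
MAX_ALLOC_SIZE = 0x3FFFFFFF   # assumed palloc limit: 1 GB - 1 byte
ITEM_POINTER_SIZE = 6         # assumed sizeof(ItemPointerData): 4 + 2 bytes


def max_item_pointers(alloc_limit=MAX_ALLOC_SIZE, item_size=ITEM_POINTER_SIZE):
    """Largest number of fixed-size items that fit in a single allocation."""
    return alloc_limit // item_size


print(max_item_pointers())
```

Anything beyond that many dead index tuples would need a second allocation, or a different data structure altogether.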

Wouldn't it be simpler to remove the old tuple atomically with inserting 
the new tuple and updating the revmap? Or at least mark the old tuple as 
deletable, so that vacuum can just delete it, without building the large 
hash table to determine that it's deletable.

As it is, remove_deletable_tuples looks racy:

1. Vacuum begins, and remove_deletable_tuples performs the first pass 
over the regular, non-revmap index pages, building the hash table of all 
items in the index.

2. Another process inserts a new row into the heap, which causes a new 
minmax tuple to be inserted and the revmap to be updated to point to the 
new tuple.

3. Vacuum proceeds to scan the revmap. It will find the updated revmap 
entry that points to the new index tuple. The new index tuple is not 
found in the hash table, so it throws an error: "reverse map references 
nonexistant (sic) index tuple".

I think to fix that you can just ignore tuples that are not found in the 
hash table. (Although as I said above I think it would be simpler to not 
leave behind any dead index tuples in the first place and get rid of the 
vacuum scans altogether)
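
The three steps above, and the "ignore what's not in the hash table" fix, can be modelled with a toy sketch. This is Python with made-up names (check_revmap, the strict flag, the dict-based revmap); none of it is taken from the patch, and real concurrency is replaced by simply mutating the structures between the two passes.

```python
def check_revmap(revmap, seen_tuples, strict):
    """Second pass of a toy vacuum.

    revmap:      dict mapping range number -> TID of its index tuple
    seen_tuples: set of TIDs collected during the first pass
    Returns the set of TIDs that no revmap entry references (deletable).
    """
    referenced = set()
    for tid in revmap.values():
        if tid not in seen_tuples:
            if strict:
                # models the error thrown in step 3
                raise RuntimeError("revmap entry not found in hash table")
            continue  # inserted after the first pass, so it is live: skip it
        referenced.add(tid)
    return seen_tuples - referenced


# Step 1: the first pass records every tuple currently in the index.
revmap = {0: (1, 1), 1: (1, 2)}
seen = {(1, 1), (1, 2)}

# Step 2: a concurrent insert creates a new tuple for range 1.
revmap[1] = (1, 3)

# Step 3: the strict variant fails, the lenient one does not.
try:
    check_revmap(revmap, seen, strict=True)
    strict_failed = False
except RuntimeError:
    strict_failed = True

deletable = check_revmap(revmap, seen, strict=False)
```

In the lenient run the old tuple for range 1 is correctly reported as deletable, and the new tuple is left alone because vacuum never saw it.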

Regarding locking, I think it would be good to mention explicitly the 
order in which pages must be locked if you need to lock multiple pages 
at the same time, to avoid deadlock. Based on the "Locking 
considerations" section in the README, I believe the order is that you 
always lock the regular index page first, and then the revmap page. 
There's no mention of the order of locking two regular or two revmap 
pages, but I guess you never do that ATM.

I'm quite surprised by the use of LockTuple on the index tuples. I think 
the main reason for needing that is the fact that MMTuple doesn't store 
the heap (range) block number that the tuple points to: LockTuple is 
required to ensure that the tuple doesn't go away while a scan is 
following a pointer from the revmap to it. If the MMTuple contained the 
BlockNumber, a scan could check that and go back to the revmap if it 
doesn't match. Alternatively, you could keep the revmap page locked when 
you follow a pointer to the regular index page.

The lack of a block number on index tuples also makes my idea of moving 
tuples out of the way of extending the revmap much more difficult; 
there's no way to find the revmap entry pointing to an index tuple, 
short of scanning the whole revmap. And also on general robustness 
grounds, and for debugging purposes, it would be nice to have the block 
number there.

- Heikki

Re: Minmax indexes

From

Robert Haas

Date:

23 June 2014, 19:34:33

On Thu, Jun 19, 2014 at 12:32 PM, Greg Stark <stark@mit.edu> wrote:
> On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Implementing something is a good way to demonstrate how it would look like.
>> But no, I don't insist on implementing every possible design whenever a new
>> feature is proposed.
>>
>> I liked Greg's sketch of what the opclass support functions would be. It
>> doesn't seem significantly more complicated than what's in the patch now.
>
> As a counter-point to my own point there will be nothing stopping us
> in the future from generalizing things. Dealing with catalogs is
> mostly book-keeping headaches and careful work. it's something that
> might be well-suited for a GSOC or first patch from someone looking to
> familiarize themselves with the system architecture. It's hard to
> invent a whole new underlying infrastructure at the same time as
> dealing with all that book-keeping and it's hard for someone
> familiarizing themselves with the system to also have a great new
> idea. Having tasks like this that are easy to explain and that mentor
> understands well can be easier to manage than tasks where the newcomer
> has some radical new idea.

Generalizing this in the future would be highly likely to change the
on-disk format for existing indexes, which would be a problem for
pg_upgrade.  I think we will likely be stuck with whatever the initial
on-disk format looks like for a very long time, which is why I think
we need to try rather hard to get this right the first time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Alvaro Herrera

Date:

09 July 2014, 21:16:38

Claudio Freire wrote:

> An aggregate to generate a "compressed set" from several values
> A function which adds a new value to the "compressed set" and returns
> the new "compressed set"
> A function which tests if a value is in a "compressed set"
> A function which tests if a "compressed set" overlaps another
> "compressed set" of equal type
>
> If you can define different compressed sets, you can use this to
> generate both min/max indexes as well as bloom filter indexes. Whether
> we'd want to have both is perhaps questionable, but having the ability
> to is probably desirable.

Here's a new version of this patch, which is more generic than the
original versions, and similar to what you describe.

The way it works now, each opclass needs to have three support
procedures; I've called them getOpers, maybeUpdateValues, and compare.
(I realize these names are pretty bad, and will be changing them.)
getOpers is used to obtain information about what is stored for that
data type; it says how many datum values are stored for a column of that
type (two for sortable: min and max), and how many operators need to be
set up.  Then, the generic code fills in a MinmaxDesc(riptor) and creates
an initial DeformedMMTuple (which is a rather ugly name for a minmax
tuple held in memory).  The maybeUpdateValues amproc can then be called
when there's a new heap tuple, which updates the DeformedMMTuple to
account for the new tuple (in essence, it's a union of the original
values and the new tuple).  This can be done repeatedly (when a new
index is being created) or only once (when a new heap tuple is inserted
into an existing index).  There is no need for an "aggregate".

This DeformedMMTuple can easily be turned into the on-disk
representation; there is no hardcoded assumption on the number of index
values stored per heap column, so it is possible to build an opclass
that stores a bounding box column for a geometry heap column, for
instance.

Then we have the "compare" amproc.  This is used during index scans;
after extracting an index tuple, it is turned into DeformedMMTuple, and
the "compare" amproc for each column is called with the values of scan
keys.  (Now that I think about this, it seems pretty much what
"consistent" is for GiST opclasses).  A true return value indicates that
the scan key matches the page range boundaries and thus all pages in the
range are added to the output TID bitmap.
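
To make the division of labour concrete, here is a toy model of the sortable case. It is a Python sketch, not the patch's C code: the class and function names are invented stand-ins for DeformedMMTuple, maybeUpdateValues and compare, and null handling is left out entirely.

```python
class SortableSummary:
    """Toy stand-in for the in-memory summary of one column over one page
    range, with two stored values: min and max."""

    def __init__(self):
        self.min = None
        self.max = None


def maybe_update_values(summary, value):
    """Union the summary with a new heap value.

    Returns True only if the summary changed, i.e. only then would the
    on-disk index tuple need to be rewritten.
    """
    changed = False
    if summary.min is None or value < summary.min:
        summary.min = value
        changed = True
    if summary.max is None or value > summary.max:
        summary.max = value
        changed = True
    return changed


def compare(summary, op, key):
    """Return True if some value in [min, max] could satisfy `column op key`."""
    if summary.min is None:
        return False  # empty summary: nothing in the range can match
    if op == "<":
        return summary.min < key
    if op == "<=":
        return summary.min <= key
    if op == "=":
        return summary.min <= key <= summary.max
    if op == ">=":
        return summary.max >= key
    if op == ">":
        return summary.max > key
    raise ValueError("unsupported operator: " + op)


# Building a summary calls maybe_update_values repeatedly; a later insert
# calls it once and may find there is nothing to do.
summary = SortableSummary()
changes = [maybe_update_values(summary, v) for v in (10, 42, 17)]
```

A True result from compare only means the range cannot be excluded; the heap pages still have to be rechecked for actual matches.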

Of course, you can have multicolumn indexes, and (as would be expected)
each column can have totally different opclasses; so for instance you
could have an integer column and a geometric column in the same index,
and it should work fine.  In a query that constrained both columns, only
those page ranges that satisfied the scan keys for both columns would be
returned.
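
The multicolumn behaviour can be sketched as follows. This is a simplified Python model with invented names: every column here is summarized as a (low, high) interval and every scan key is an inclusive interval, whereas in the patch each column can have its own opclass and its own "compare" procedure.

```python
def matching_pages(range_summaries, scan_keys, pages_per_range):
    """Return the heap page numbers whose range satisfies all scan keys.

    range_summaries: list indexed by range number; each item maps a column
                     name to a (low, high) summary for that range
    scan_keys:       maps a column name to the (low, high) interval searched
    """
    pages = []
    for range_no, summary in enumerate(range_summaries):
        # A range qualifies only if every constrained column overlaps its key.
        if all(summary[col][0] <= hi and lo <= summary[col][1]
               for col, (lo, hi) in scan_keys.items()):
            start = range_no * pages_per_range
            pages.extend(range(start, start + pages_per_range))
    return pages


summaries = [
    {"a": (1, 10), "b": (100, 200)},
    {"a": (11, 20), "b": (50, 150)},
    {"a": (5, 15), "b": (300, 400)},
]
result = matching_pages(summaries, {"a": (8, 12), "b": (120, 160)}, 2)
```

The third range overlaps on column a but not on column b, so its pages are left out of the result.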

I think this level of abstraction is good --- AFAICS it is easy to
implement opclasses for datatypes not suitable for btree opclasses such
as geometric ones, etc.  This answers the concerns of those who wanted
to see this support datatypes that don't have the concept of min/max at
all.  I'm not sure about bloom filters, as I've forgotten how those
work.  Of course, the implementation of min/max is there: that logic has
been abstracted out into what I've called "sortable opfamilies" for now
(name suggestions welcome) --- it can be used for any datatype with a
btree opclass.

I think design-wise it ended up making a lot of sense, because all the
opclass-specific assumptions about usage of "min" and "max" values and
comparisons using the less-than etc operators are contained in the
mmsortable.c file, and the basic minmax.c file only needs to know to
call the right opclass-specific procedures.  The basic code might need
some tweaks to ensure that we're not assuming anything about the
datatypes of the stored values (as opposed to the datatypes of the
indexed columns), but this is a matter of tweaking the MinmaxDesc and
the getOpers amprocs; it wouldn't require changing the on-disk
representation, and thus is upgrade-compatible.

There's a bit of boilerplate code in the amproc routines which would be
nice to be able to remove (mainly involving null values), but I think to
do that we would need to split those three amprocs into maybe four or
five, which is not as nice, so I'm not real sure about doing it.

All this being said, I'm sticking to the name "Minmax indexes".  There
was a poll in pgsql-advocacy
http://www.postgresql.org/message-id/53A0B4F8.8080803@agliodbs.com
about a new name, but there were no suggestions supported by more than
one person.  If a brilliant new name comes up, I'm open to changing it.

Another thing I noticed is that version 8 of the patch blindly believed
the "pages_per_range" declared in catalogs.  This meant that if somebody
did "alter index foo set pages_per_range=123" the index would
immediately break (i.e. return corrupted results when queried).  I have
fixed this by storing the pages_per_range value used to construct the
index in the metapage.  Now if you do the ALTER INDEX thing, the new
value is only used when the index is recreated by REINDEX.
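
A toy model of the fix, in Python with invented names (ToyIndex and its methods are not from the patch): the value used to map heap blocks to ranges is the one frozen at build time, and the catalog value only matters again at rebuild.

```python
class ToyIndex:
    """Models only the two copies of pages_per_range, nothing else."""

    def __init__(self, pages_per_range):
        self.catalog_pages_per_range = pages_per_range   # changed by ALTER INDEX
        self.metapage_pages_per_range = pages_per_range  # frozen at build time

    def alter_set_pages_per_range(self, value):
        # Only the catalog entry changes; the index contents are untouched.
        self.catalog_pages_per_range = value

    def reindex(self):
        # A rebuild picks up whatever the catalog says now.
        self.metapage_pages_per_range = self.catalog_pages_per_range

    def range_for_heap_block(self, heap_block):
        # Lookups always use the value the index was actually built with.
        return heap_block // self.metapage_pages_per_range


idx = ToyIndex(128)
before = idx.range_for_heap_block(1000)
idx.alter_set_pages_per_range(123)
after_alter = idx.range_for_heap_block(1000)
idx.reindex()
after_reindex = idx.range_for_heap_block(1000)
```

Had lookups used the catalog value instead, the same heap block would have mapped to a different range right after the ALTER, which is the breakage seen in version 8.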

There are still things to go over before this is committable,
particularly regarding vacuuming the index, but as far as index creation
and scanning it should be good to test.  (Vacuuming should work just
fine most of the time also, but there are a few wrinkles pointed out by
Heikki.)

One thing I've disabled for now is the pageinspect code that displays
index items.  I need to rewrite that.

Closing thought: thinking more about it, the bit about returning
function OIDs by getOpers when creating a MinmaxDesc seems unnecessary.
I think we could get by with just returning the number of values stored
in the column, and have the operators be part of an opaque struct that's
initialized and only touched by the opclass amprocs, not by the generic
code.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-9.patch

Re: Minmax indexes

From

Alvaro Herrera

Date:

09 July 2014, 21:41:46

Heikki Linnakangas wrote:
> On 06/23/2014 08:07 PM, Alvaro Herrera wrote:

> I feel that the below would nevertheless be simpler:
> 
> >>I wonder if it would be simpler to just always store the revmap
> >>pages in the beginning of the index, before any other pages. Finding
> >>the revmap page would then be just as easy as with a separate fork.
> >>When the table/index is extended so that a new revmap page is
> >>needed, move the existing page at that block out of the way. Locking
> >>needs some consideration, but I think it would be feasible and
> >>simpler than you have now.
> >
> >Moving index items around is not easy, because you'd have to adjust the
> >revmap to rewrite the item pointers.
> 
> Hmm. Two alternative schemes come to mind:
> 
> 1. Move each index tuple off the page individually, updating the
> revmap while you do it, until the page is empty. Updating the revmap
> for a single index tuple isn't difficult; you have to do it anyway
> when an index tuple is replaced. (MMTuples don't contain a heap
> block number ATM, but IMHO they should, see below)
> 
> 2. Store the new block number of the page that you moved out of the
> way in the revmap page, and leave the revmap pointers unchanged. The
> revmap pointers can be updated later, lazily.
> 
> Both of those seem pretty straightforward.

The trouble I have with moving blocks around to make space is that it
would cause the index to have periodic hiccups to make room for the new
revmap pages.  One nice property that these indexes are supposed to have
is that the effect on insertion times should be pretty minimal.  That
would cease to be the case if we have to do your proposed block moves.

> >>ISTM that when the old tuple cannot be updated in-place, the new
> >>index tuple is inserted with mm_doinsert(), but the old tuple is
> >>never deleted.
> >
> >It's deleted by the next vacuum.
> 
> Ah I see. Vacuum reads the whole index, and builds an in-memory hash
> table that contains an ItemPointerData for every tuple in the index.
> Doesn't that require a lot of memory, for a large index? That might
> be acceptable - you ought to have plenty of RAM if you're pushing
> around multi-terabyte tables - but it would nevertheless be nice to
> not have a hard requirement for something as essential as vacuum.

I guess if you're expecting that pages_per_range=1 is a common case,
yeah it might become an issue eventually.  One idea I just had is to
have a bit for each index tuple, which is set whenever the revmap no
longer points to it.  That way, vacuuming is much easier: just scan the
index and delete all tuples having that bit set.  No need for this hash
table stuff.  I am still concerned with adding more overhead whenever a
page range is modified, so that insertions in the table continue to be
fast.  If we're going to dirty the index every time, it might not be so
fast anymore.  But then maybe I'm worrying about nothing; I will have to
measure how much slower it is.
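
Roughly, the per-tuple bit idea would look like this. It is a toy Python sketch with invented names and dict-based structures, not patch code; it shows both the simpler vacuum and the extra write to the old tuple that I'm worried about.

```python
def replace_tuple(index, revmap, range_no, new_tid, new_summary):
    """Point the revmap at a new index tuple for the given range.

    The old tuple, if any, is flagged as dead. Setting that flag is the
    extra dirtying of the old tuple's page on every replacement.
    """
    old_tid = revmap.get(range_no)
    if old_tid is not None:
        index[old_tid]["dead"] = True
    index[new_tid] = {"summary": new_summary, "dead": False}
    revmap[range_no] = new_tid


def vacuum(index):
    """Delete every flagged tuple in one scan; no hash table is needed."""
    dead = [tid for tid, tup in index.items() if tup["dead"]]
    for tid in dead:
        del index[tid]
    return len(dead)


index, revmap = {}, {}
replace_tuple(index, revmap, 0, (1, 1), (1, 10))
replace_tuple(index, revmap, 0, (1, 2), (1, 20))
removed = vacuum(index)
```

Vacuum memory use no longer depends on the number of index tuples, at the cost of one more page touched per replacement.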

> Wouldn't it be simpler to remove the old tuple atomically with
> inserting the new tuple and updating the revmap? Or at least mark
> the old tuple as deletable, so that vacuum can just delete it,
> without building the large hash table to determine that it's
> deletable.

Yes, it might be simpler, but it'd require dirtying more pages on
insertions, and holding more page-level locks for longer, which is not
good for concurrent access.

> I'm quite surprised by the use of LockTuple on the index tuples. I
> think the main reason for needing that is the fact that MMTuple
> doesn't store the heap (range) block number that the tuple points
> to: LockTuple is required to ensure that the tuple doesn't go away
> while a scan is following a pointer from the revmap to it. If the
> MMTuple contained the BlockNumber, a scan could check that and go
> back to the revmap if it doesn't match. Alternatively, you could
> keep the revmap page locked when you follow a pointer to the regular
> index page.

There's the intention that these accesses be kept as concurrent as
possible; this is why we don't want to block the whole page.  Locking
individual TIDs is fine in this case (unlike in SELECT FOR UPDATE)
because we can only lock a single tuple in any one index scan, so
there's no unbounded growth of the lock table.

I prefer not to have BlockNumbers in index tuples, because that would
make them larger for not much gain.  That data would mostly be
redundant, and would be necessary only for vacuuming.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Josh Berkus

Date:

09 July 2014, 22:45:31

On 07/09/2014 02:16 PM, Alvaro Herrera wrote:
> The way it works now, each opclass needs to have three support
> procedures; I've called them getOpers, maybeUpdateValues, and compare.
> (I realize these names are pretty bad, and will be changing them.)

I kind of like "maybeUpdateValues".  Very ... NoSQL-ish.  "Maybe update
the values, maybe not."  ;-)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Minmax indexes

From

Peter Geoghegan

Date:

09 July 2014, 22:54:10

On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> All this being said, I'm sticking to the name "Minmax indexes".  There
> was a poll in pgsql-advocacy
> http://www.postgresql.org/message-id/53A0B4F8.8080803@agliodbs.com
> about a new name, but there were no suggestions supported by more than
> one person.  If a brilliant new name comes up, I'm open to changing it.

How about "summarizing indexes"? That seems reasonably descriptive.

-- 
Peter Geoghegan

Re: Minmax indexes

From

Alvaro Herrera

Date:

09 July 2014, 23:14:08

Josh Berkus wrote:
> On 07/09/2014 02:16 PM, Alvaro Herrera wrote:
> > The way it works now, each opclass needs to have three support
> > procedures; I've called them getOpers, maybeUpdateValues, and compare.
> > (I realize these names are pretty bad, and will be changing them.)
> 
> I kind of like "maybeUpdateValues".  Very ... NoSQL-ish.  "Maybe update
> the values, maybe not."  ;-)

:-)  Well, that's exactly what happens.  If we insert a new tuple into
the table, and the existing summarizing tuple (to use Peter's term)
already covers it, then we don't need to update the index tuple at all.
What this name doesn't say is what values are to be maybe-updated.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Greg Stark

Date:

09 July 2014, 23:36:34

On Wed, Jul 9, 2014 at 10:16 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> there is no hardcoded assumption on the number of index
> values stored per heap column, so it is possible to build an opclass
> that stores a bounding box column for a geometry heap column, for
> instance.

I think the more Postgresy thing to do is to store one datum per heap
column. It's up to the opclass to find or make a composite data type
that stores all the necessary state. So you could make a minmax_accum
data type like NumericAggState in numeric.c:numeric_accum() or the
array of floats in float8_accum. For a bounding box a 2d geometric
min/max index could use the "box" data type for example. The way
you've done it seems more convenient but there's something to be said
for using the same style for different areas. A single bounding box
accumulator function would probably suffice for both an aggregate and
index opclass for example.

But this sounds pretty great. I think it would let me do the bloom
filter index I had in mind fairly straightforwardly. The result would
be something very similar to a bitmap index. I'm not sure if there's a
generic term that includes bitmap indexes or other summary functions
like bounding boxes (which min/max is basically -- a 1D bounding box).

Thanks a lot for listening and being so open, I think what you
describe is a lot more flexible than what you had before and I can see
some pretty great things coming out of it (including min/max itself of
course).
-- 
greg

Re: Minmax indexes

From

Claudio Freire

Date:

10 July 2014, 17:40:17

On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Another thing I noticed is that version 8 of the patch blindly believed
> the "pages_per_range" declared in catalogs.  This meant that if somebody
> did "alter index foo set pages_per_range=123" the index would
> immediately break (i.e. return corrupted results when queried).  I have
> fixed this by storing the pages_per_range value used to construct the
> index in the metapage.  Now if you do the ALTER INDEX thing, the new
> value is only used when the index is recreated by REINDEX.

This seems a lot like parameterizing. So I guess the only thing left
is to issue a NOTICE when said alter takes place (I don't see that on
the patch, but maybe it's there?)

Re: Minmax indexes

From

Alvaro Herrera

Date:

10 July 2014, 19:21:10

Claudio Freire wrote:
> On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Another thing I noticed is that version 8 of the patch blindly believed
> > the "pages_per_range" declared in catalogs.  This meant that if somebody
> > did "alter index foo set pages_per_range=123" the index would
> > immediately break (i.e. return corrupted results when queried).  I have
> > fixed this by storing the pages_per_range value used to construct the
> > index in the metapage.  Now if you do the ALTER INDEX thing, the new
> > value is only used when the index is recreated by REINDEX.
> 
> This seems a lot like parameterizing.

I don't understand what that means -- care to elaborate?

> So I guess the only thing left is to issue a NOTICE when said alter
> takes place (I don't see that on the patch, but maybe it's there?)

That's not in the patch.  I don't think we have an appropriate place to
emit such a notice.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Josh Berkus

Date:

10 July 2014, 20:50:17

On 07/10/2014 12:20 PM, Alvaro Herrera wrote:
>> So I guess the only thing left is to issue a NOTICE when said alter
>> > takes place (I don't see that on the patch, but maybe it's there?)
> That's not in the patch.  I don't think we have an appropriate place to
> emit such a notice.

What do you mean by "don't have an appropriate place"?

The suggestion is that when a user does:

ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100

they should get a NOTICE:

"NOTICE: changes to pages per range will not take effect until the index
is REINDEXed"

otherwise, we're going to get a lot of "I Altered the pages per range,
but performance didn't change" emails.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Minmax indexes

From

Jaime Casanova

Date:

10 July 2014, 21:30:55

On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 07/10/2014 12:20 PM, Alvaro Herrera wrote:
>>> So I guess the only thing left is to issue a NOTICE when said alter
>>> > takes place (I don't see that on the patch, but maybe it's there?)
>> That's not in the patch.  I don't think we have an appropriate place to
>> emit such a notice.
>
> What do you mean by "don't have an appropriate place"?
>
> The suggestion is that when a user does:
>
> ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100
>
> they should get a NOTICE:
>
> "NOTICE: changes to pages per range will not take effect until the index
> is REINDEXed"
>
> otherwise, we're going to get a lot of "I Altered the pages per range,
> but performance didn't change" emails.
>

How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or
from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ?

we don't get a notice for these cases either

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566         Cell: +593 987171157

Re: Minmax indexes

From

Alvaro Herrera

Date:

10 July 2014, 21:34:11

Josh Berkus wrote:
> On 07/10/2014 12:20 PM, Alvaro Herrera wrote:
> >> So I guess the only thing left is to issue a NOTICE when said alter
> >> > takes place (I don't see that on the patch, but maybe it's there?)
> > That's not in the patch.  I don't think we have an appropriate place to
> > emit such a notice.
> 
> What do you mean by "don't have an appropriate place"?

What I think should happen is that if the value is changed, the index
should be rebuilt right there.  But there is no way to have this occur
from the generic tablecmds.c code.  Maybe we should extend the AM
interface so that they are notified of changes and can take action.
Inserting AM-specific code into tablecmds.c seems pretty wrong to me --
existing stuff for WITH CHECK OPTION views notwithstanding.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Josh Berkus

Date:

10 July 2014, 21:50:15

On 07/10/2014 02:30 PM, Jaime Casanova wrote:
> How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or
> from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ?
> 
> we don't get a notice for these cases either

Good idea.  We should also emit notices for those.  Well, maybe not for
fillfactor.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Minmax indexes

From

Jeff Janes

Date:

10 July 2014, 22:17:58

On Thu, Jul 10, 2014 at 2:30 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

> On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
> > On 07/10/2014 12:20 PM, Alvaro Herrera wrote:
> >>> So I guess the only thing left is to issue a NOTICE when said alter
> >>> > takes place (I don't see that on the patch, but maybe it's there?)
> >> That's not in the patch. I don't think we have an appropriate place to
> >> emit such a notice.
> >
> > What do you mean by "don't have an appropriate place"?
> >
> > The suggestion is that when a user does:
> >
> > ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100
> >
> > they should get a NOTICE:
> >
> > "NOTICE: changes to pages per range will not take effect until the index
> > is REINDEXed"
> >
> > otherwise, we're going to get a lot of "I Altered the pages per range,
> > but performance didn't change" emails.
> >
>
> How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or
> from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ?
>
> we don't get a notice for these cases either

I think those are different. They don't rewrite existing data in the table, but they are applied to new (and updated) data. My understanding is that changing PAGES_PER_RANGE will have no effect on future data until a re-index is done, even if the entire table eventually turns over.

Cheers,

Jeff

Re: Minmax indexes

From

Greg Stark

Date:

10 July 2014, 22:30:06

On Thu, Jul 10, 2014 at 10:29 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> What I think should happen is that if the value is changed, the index
> sholud be rebuilt right there.

I disagree. It would be a non-orthogonal interface if ALTER TABLE
sometimes causes the index to be rebuilt and sometimes just makes a
configuration change. I already see a lot of user confusion when some
ALTER TABLE commands rewrite the table and some are quick meta data
changes.

Especially in this case where the type of configuration being changed
is just an internal storage parameter and the user visible shape of
the index is unchanged it would be weird to rebuild the index.

IMHO the "right" thing to do is just to say this parameter is
read-only and have the AM throw an error when the user changes it. But
even that would require an AM callback for the AM to even know about
the change.

-- 
greg

Re: Minmax indexes

From

Fujii Masao

Date:

11 July 2014, 06:21:43

On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Claudio Freire wrote:
>
>> An aggregate to generate a "compressed set" from several values
>> A function which adds a new value to the "compressed set" and returns
>> the new "compressed set"
>> A function which tests if a value is in a "compressed set"
>> A function which tests if a "compressed set" overlaps another
>> "compressed set" of equal type
>>
>> If you can define different compressed sets, you can use this to
>> generate both min/max indexes as well as bloom filter indexes. Whether
>> we'd want to have both is perhaps questionable, but having the ability
>> to is probably desirable.
>
> Here's a new version of this patch, which is more generic the original
> versions, and similar to what you describe.

I've not read the discussion so far at all, but I found a problem
when I played with this patch. Sorry if this has already been discussed.

=# create table test as select num from generate_series(1,10) num;
SELECT 10
=# create index testidx on test using minmax (num);
CREATE INDEX
=# alter table test alter column num type text;
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.

Regards,

-- 
Fujii Masao

Re: Minmax indexes

From

Simon Riggs

Date:

11 July 2014, 07:00:09

On 9 July 2014 23:54, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> All this being said, I'm sticking to the name "Minmax indexes".  There
>> was a poll in pgsql-advocacy
>> http://www.postgresql.org/message-id/53A0B4F8.8080803@agliodbs.com
>> about a new name, but there were no suggestions supported by more than
>> one person.  If a brilliant new name comes up, I'm open to changing it.
>
> How about "summarizing indexes"? That seems reasonably descriptive.

-1 for another name change. That boat sailed some months back.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Simon Riggs

Date:

11 July 2014, 07:02:26

On 10 July 2014 00:13, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Josh Berkus wrote:
>> On 07/09/2014 02:16 PM, Alvaro Herrera wrote:
>> > The way it works now, each opclass needs to have three support
>> > procedures; I've called them getOpers, maybeUpdateValues, and compare.
>> > (I realize these names are pretty bad, and will be changing them.)
>>
>> I kind of like "maybeUpdateValues".  Very ... NoSQL-ish.  "Maybe update
>> the values, maybe not."  ;-)
>
> :-)  Well, that's exactly what happens.  If we insert a new tuple into
> the table, and the existing summarizing tuple (to use Peter's term)
> already covers it, then we don't need to update the index tuple at all.
> What this name doesn't say is what values are to be maybe-updated.

There are lots of functions that maybe-do-things, that's just modular
programming. Not sure we need to prefix things with maybe to explain
that, otherwise we'd have maybeXXX everywhere.

A more descriptive name would be MaintainIndexBounds() or similar.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Alvaro Herrera

Date:

11 July 2014, 13:44:59

Fujii Masao wrote:
> On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > Here's a new version of this patch, which is more generic the original
> > versions, and similar to what you describe.
> 
> I've not read the discussion so far at all, but I found the problem
> when I played with this patch. Sorry if this has already been discussed.
> 
> =# create table test as select num from generate_series(1,10) num;
> SELECT 10
> =# create index testidx on test using minmax (num);
> CREATE INDEX
> =# alter table test alter column num type text;
> ERROR:  could not determine which collation to use for string comparison
> HINT:  Use the COLLATE clause to set the collation explicitly.

Ah, yes, I need to pass down collation OIDs to comparison functions.
That's marked as XXX in various places in the code.  Sorry I forgot to
mention that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Claudio Freire

Date:

11 July 2014, 17:00:40

On Thu, Jul 10, 2014 at 4:20 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Claudio Freire wrote:
>> On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > Another thing I noticed is that version 8 of the patch blindly believed
>> > the "pages_per_range" declared in catalogs.  This meant that if somebody
>> > did "alter index foo set pages_per_range=123" the index would
>> > immediately break (i.e. return corrupted results when queried).  I have
>> > fixed this by storing the pages_per_range value used to construct the
>> > index in the metapage.  Now if you do the ALTER INDEX thing, the new
>> > value is only used when the index is recreated by REINDEX.
>>
>> This seems a lot like parameterizing.
>
> I don't understand what that means -- care to elaborate?

We've been talking about bloom filters, and how their shape differs
according to the parameters of the bloom filter (number of hashes,
hash type, etc).

But after seeing this case of pages_per_range, I noticed it's an
effective-enough mechanism. Like:

CREATE INDEX ix_blah ON some_table USING bloom (somecol)
    WITH (BLOOM_HASHES=15, BLOOM_BUCKETS=1024, PAGES_PER_RANGE=64);

Marking as read-only is ok, or emitting a NOTICE so that if anyone
changes those parameters that change the shape of the index, they know
it needs a rebuild would be OK too. Both mechanisms work for me.
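
A minimal sketch of the idea, as a toy Python model rather than anything from the patch: the parameters that shape the index are frozen into the metapage at build time, scans read only the metapage, and a drifted catalog value merely flags that a rebuild is needed. The names ToyIndex, Metapage and needs_rebuild are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metapage:
    """Shape parameters frozen into the index when it was built."""
    pages_per_range: int


@dataclass
class ToyIndex:
    catalog_pages_per_range: int   # what an ALTER INDEX ... SET would change
    metapage: Metapage             # what scans must actually use

    def effective_pages_per_range(self) -> int:
        # Scans trust the metapage, never the catalog value.
        return self.metapage.pages_per_range

    def needs_rebuild(self) -> bool:
        # True once the declared value has drifted from the built one;
        # this is the point where a NOTICE could be emitted.
        return self.catalog_pages_per_range != self.metapage.pages_per_range

    def reindex(self) -> None:
        # A rebuild is the only moment the catalog value takes effect.
        self.metapage = Metapage(self.catalog_pages_per_range)


idx = ToyIndex(catalog_pages_per_range=64, metapage=Metapage(64))
idx.catalog_pages_per_range = 123   # the ALTER INDEX case
```

The same pattern would cover any other shape parameter (number of hashes, number of buckets) by adding fields to the metapage record.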

Re: Minmax indexes

From

Greg Stark

Date:

11 July 2014, 18:48:11

On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Marking as read-only is ok, or emitting a NOTICE so that if anyone
> changes those parameters that change the shape of the index, they know
> it needs a rebuild would be OK too. Both mechanisms work for me.

We don't actually have any of these mechanisms. They wouldn't be bad
things to have but I don't think we should gate adding new types of
indexes on adding them. In particular, the index could just hard code
a value for these parameters and having them be parameterized is
clearly better even if that doesn't produce all the warnings or
rebuild things automatically or whatever.

-- 
greg

Re: Minmax indexes

From

Claudio Freire

Date:

11 July 2014, 19:07:52

On Fri, Jul 11, 2014 at 3:47 PM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Marking as read-only is ok, or emitting a NOTICE so that if anyone
>> changes those parameters that change the shape of the index, they know
>> it needs a rebuild would be OK too. Both mechanisms work for me.
>
> We don't actually have any of these mechanisms. They wouldn't be bad
> things to have but I don't think we should gate adding new types of
> indexes on adding them. In particular, the index could just hard code
> a value for these parameters and having them be parameterized is
> clearly better even if that doesn't produce all the warnings or
> rebuild things automatically or whatever.


No, I agree, it's just a nice to have.

But at least the docs should mention it.

Re: Minmax indexes

From

Robert Haas

Date:

14 July 2014, 19:56:38

On Wed, Jul 9, 2014 at 5:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> The way it works now, each opclass needs to have three support
> procedures; I've called them getOpers, maybeUpdateValues, and compare.
> (I realize these names are pretty bad, and will be changing them.)
> getOpers is used to obtain information about what is stored for that
> data type; it says how many datum values are stored for a column of that
> type (two for sortable: min and max), and how many operators it needs
> setup.  Then, the generic code fills in a MinmaxDesc(riptor) and creates
> an initial DeformedMMTuple (which is a rather ugly name for a minmax
> tuple held in memory).  The maybeUpdateValues amproc can then be called
> when there's a new heap tuple, which updates the DeformedMMTuple to
> account for the new tuple (in essence, it's a union of the original
> values and the new tuple).  This can be done repeatedly (when a new
> index is being created) or only once (when a new heap tuple is inserted
> into an existing index).  There is no need for an "aggregate".
>
> This DeformedMMTuple can easily be turned into the on-disk
> representation; there is no hardcoded assumption on the number of index
> values stored per heap column, so it is possible to build an opclass
> that stores a bounding box column for a geometry heap column, for
> instance.
>
> Then we have the "compare" amproc.  This is used during index scans;
> after extracting an index tuple, it is turned into DeformedMMTuple, and
> the "compare" amproc for each column is called with the values of scan
> keys.  (Now that I think about this, it seems pretty much what
> "consistent" is for GiST opclasses).  A true return value indicates that
> the scan key matches the page range boundaries and thus all pages in the
> range are added to the output TID bitmap.

This sounds really great.  I agree that it needs some renaming.  I
think renaming what you are calling "compare" to "consistent" would be
an excellent idea, to match GiST.  "maybeUpdateValues" sounds like it
does the equivalent of GIST's "compress" on the new value followed by
a "union" with the existing summary item.  I don't think it's
necessary to separate those out, though.  You could perhaps call it
something like "add_item".
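
To make the "compress followed by union" reading concrete, here is a small Python sketch for a bounding-box summary. It is only an illustration under assumed names (compress, union and add_item are not functions from the patch), and it models points as (x, y) pairs.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def compress(point: Point) -> Box:
    # Turn a single value into the summary representation:
    # a degenerate box containing just that point.
    x, y = point
    return (x, y, x, y)


def union(a: Box, b: Box) -> Box:
    # Smallest box that covers both inputs.
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def add_item(summary: Optional[Box], point: Point) -> Box:
    # The combined operation: compress the new value, then union it
    # with the existing summary (if there is one yet).
    item = compress(point)
    return item if summary is None else union(summary, item)


summary = None
for p in [(1, 2), (3, 0), (-1, 5)]:
    summary = add_item(summary, p)
```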

Also, FWIW, I liked Peter's idea of calling these "summarizing
indexes" or perhaps "summary" would be a bit shorter and mean the same
thing.  "minmax" wouldn't be the end of the world, but since you've
gone to the trouble of making this more generic I think giving it a
more generic name would be a very good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Heikki Linnakangas

Date:

29 July 2014, 11:33:52

On 07/10/2014 12:41 AM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> On 06/23/2014 08:07 PM, Alvaro Herrera wrote:
>
>> I feel that the below would nevertheless be simpler:
>>
>>>> I wonder if it would be simpler to just always store the revmap
>>>> pages in the beginning of the index, before any other pages. Finding
>>>> the revmap page would then be just as easy as with a separate fork.
>>>> When the table/index is extended so that a new revmap page is
>>>> needed, move the existing page at that block out of the way. Locking
>>>> needs some consideration, but I think it would be feasible and
>>>> simpler than you have now.
>>>
>>> Moving index items around is not easy, because you'd have to adjust the
>>> revmap to rewrite the item pointers.
>>
>> Hmm. Two alternative schemes come to mind:
>>
>> 1. Move each index tuple off the page individually, updating the
>> revmap while you do it, until the page is empty. Updating the revmap
>> for a single index tuple isn't difficult; you have to do it anyway
>> when an index tuple is replaced. (MMTuples don't contain a heap
>> block number ATM, but IMHO they should, see below)
>>
>> 2. Store the new block number of the page that you moved out of the
>> way in the revmap page, and leave the revmap pointers unchanged. The
>> revmap pointers can be updated later, lazily.
>>
>> Both of those seem pretty straightforward.
>
> The trouble I have with moving blocks around to make space, is that it
> would cause the index to have periodic hiccups to make room for the new
> revmap pages.  One nice property that these indexes are supposed to have
> is that the effect on insertion times should be pretty minimal.  That
> would cease to be the case if we have to do your proposed block moves.

Approach 2 above is fairly quick, quick enough that no-one would notice 
the "hiccup". Moving the tuples individually (approach 1) would be slower.

>>>> ISTM that when the old tuple cannot be updated in-place, the new
>>>> index tuple is inserted with mm_doinsert(), but the old tuple is
>>>> never deleted.
>>>
>>> It's deleted by the next vacuum.
>>
>> Ah I see. Vacuum reads the whole index, and builds an in-memory hash
>> table that contains an ItemPointerData for every tuple in the index.
>> Doesn't that require a lot of memory, for a large index? That might
>> be acceptable - you ought to have plenty of RAM if you're pushing
>> around multi-terabyte tables - but it would nevertheless be nice to
>> not have a hard requirement for something as essential as vacuum.
>
> I guess if you're expecting that pages_per_range=1 is a common case,
> yeah it might become an issue eventually.

Not sure, but I find it easier to think of the patch that way. In any 
case, it would be nice to avoid the problem, even if it's not common.

> One idea I just had is to
> have a bit for each index tuple, which is set whenever the revmap no
> longer points to it.  That way, vacuuming is much easier: just scan the
> index and delete all tuples having that bit set.

The bit needs to be set atomically with the insertion of the new tuple, 
so why not just remove the old tuple right away?

>> Wouldn't it be simpler to remove the old tuple atomically with
>> inserting the new tuple and updating the revmap? Or at least mark
>> the old tuple as deletable, so that vacuum can just delete it,
>> without building the large hash table to determine that it's
>> deletable.
>
> Yes, it might be simpler, but it'd require dirtying more pages on
> insertions (and holding more page-level locks, for longer.  Not good for
> concurrent access).

I wouldn't worry much about the performance and concurrency of this 
operation. Remember that the majority of updates are expected to not 
have to update the index, otherwise the minmax index will degenerate 
quickly and performance will suck anyway. And even when updating the 
index is needed, in most cases the new tuple fits on the same page, 
after removing the old one. So the case where you have to insert a new 
index tuple, remove old one (or mark it dead), and update the revmap to 
point to the new tuple, is rare.
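
The two cases can be sketched with a toy Python model. This is not the patch's code: ToyRevmapIndex, its byte-counting pages and the slot numbering are all assumptions made for illustration. The point is that when the old tuple is removed in the same step that repoints the revmap, no unreferenced tuples are left for vacuum to find.

```python
class ToyRevmapIndex:
    """Pages hold tuples; the revmap maps range number -> (page, slot)."""

    def __init__(self, page_capacity):
        self.page_capacity = page_capacity
        self.pages = [{}]      # list of {slot: (range_no, size)}
        self.revmap = {}       # range_no -> (page_no, slot)
        self.next_slot = 0

    def _free(self, page_no):
        used = sum(size for _, size in self.pages[page_no].values())
        return self.page_capacity - used

    def _place(self, range_no, size):
        # Use the first page with enough room, else add a new page.
        for page_no in range(len(self.pages)):
            if self._free(page_no) >= size:
                break
        else:
            self.pages.append({})
            page_no = len(self.pages) - 1
        slot = self.next_slot
        self.next_slot += 1
        self.pages[page_no][slot] = (range_no, size)
        return page_no, slot

    def upsert(self, range_no, size):
        old = self.revmap.get(range_no)
        if old is not None:
            page_no, slot = old
            old_size = self.pages[page_no][slot][1]
            if self._free(page_no) + old_size >= size:
                # Common case: the new tuple fits on the same page once
                # the old one is gone, so the revmap stays untouched.
                self.pages[page_no][slot] = (range_no, size)
                return "in-place"
            # Rare case: drop the old tuple, insert elsewhere and
            # repoint the revmap as one step.
            del self.pages[page_no][slot]
        self.revmap[range_no] = self._place(range_no, size)
        return "moved" if old is not None else "inserted"

    def unreferenced(self):
        # Tuples that no revmap entry points to (garbage for vacuum).
        live = set(self.revmap.values())
        return [(p, s) for p, page in enumerate(self.pages)
                for s in page if (p, s) not in live]


idx = ToyRevmapIndex(page_capacity=100)
steps = [idx.upsert(0, 40), idx.upsert(1, 40),
         idx.upsert(0, 50), idx.upsert(1, 60)]
```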

>> I'm quite surprised by the use of LockTuple on the index tuples. I
>> think the main reason for needing that is the fact that MMTuple
>> doesn't store the heap (range) block number that the tuple points
>> to: LockTuple is required to ensure that the tuple doesn't go away
>> while a scan is following a pointer from the revmap to it. If the
>> MMTuple contained the BlockNumber, a scan could check that and go
>> back to the revmap if it doesn't match. Alternatively, you could
>> keep the revmap page locked when you follow a pointer to the regular
>> index page.
>
> There's the intention that these accesses be kept as concurrent as
> possible; this is why we don't want to block the whole page.  Locking
> individual TIDs is fine in this case (which is not in SELECT FOR UPDATE)
> because we can only lock a single tuple in any one index scan, so
> there's no unbounded growth of the lock table.
>
> I prefer not to have BlockNumbers in index tuples, because that would
> make them larger for not much gain.  That data would mostly be
> redundant, and would be necessary only for vacuuming.

Don't underestimate the value of easier debugging. I wouldn't worry much 
about shaving four bytes from the tuple, these indexes are tiny in any 
case. Keep it simple at first, and optimize later if necessary.

In fact, I'd suggest just using normal IndexTuple instead of the custom 
MMTuple struct, store the block number in t_tid and leave offset number 
field of that unused. That wastes 2 more bytes per tuple, but that's 
insignificant too. I feel that it probably would be worth it just to 
keep things simple, and you'd e.g. be able to use index_deform_tuple() as is.
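
A rough Python sketch of the check-and-retry scan that a stored block number would allow, without locking the tuple. Everything here is a hypothetical model: SummaryTuple, fetch_summary and the read_revmap callback are illustrative names, and concurrency is only simulated by a revmap read that returns a stale pointer first.

```python
from dataclasses import dataclass


@dataclass
class SummaryTuple:
    heap_block: int   # first heap block of the summarized range
    lo: int
    hi: int


def fetch_summary(read_revmap, pages, heap_block, max_retries=3):
    """Follow revmap -> index tuple; go back to the revmap if the tuple
    found there turns out to describe a different range."""
    for _ in range(max_retries):
        page_no, slot = read_revmap(heap_block)
        tup = pages[page_no].get(slot)
        if tup is not None and tup.heap_block == heap_block:
            return tup
        # Stale pointer: the slot was emptied or reused. Retry.
    raise RuntimeError("revmap pointer kept changing")


pages = {
    0: {1: SummaryTuple(heap_block=16, lo=0, hi=9)},   # slot was reused
    1: {0: SummaryTuple(heap_block=0, lo=5, hi=7)},    # current tuple
}
# The first revmap read is stale, the second one is current.
answers = iter([(0, 1), (1, 0)])
found = fetch_summary(lambda blk: next(answers), pages, heap_block=0)
```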

- Heikki

Re: Minmax indexes

From

Alvaro Herrera

Date:

05 August 2014, 23:41:55

Thanks for all the feedback on version 9.  Here's version 13.  (The
intermediate versions are just tags in my private tree which I created
each time I rebased.  Please bear with me here.)

I have chosen to keep the name "minmax", even if the opclasses now let
one implement completely different things on top of it such as geometry
bounding boxes and bloom filters (aka bitmap indexes).  I don't see a
need for a rename: essentially, in PR we can just say "we have these
neat minmax indexes that other databases also have, but instead of just
being used for integer data, they can also be used for geometry, GIS and
bitmap indexes, so as always we're more powerful than everyone else when
implementing new database features".

This new version includes some changes per feedback.  Most notably,
the opclass definition is different now: instead of relying on the
"sortable" opclass implementation extracting the oprcode for each
operator strategy (i.e. the functions that underlie < <= >= >), I chose
to have catalog entries in pg_amproc for the underlying support
functions.  The new definition makes a lot of sense to me now, after
thinking long about this stuff and carefully reading the
"Catalog Entries for Indexes" chapter in docs.

The way it works now is that there are five pg_amop entries in an
opclass, just like previously (corresponding to the underlying < <= = >= >
operators).  This lets the optimizer choose the index when a query uses
those operators.  There are also seven pg_amproc entries.  The first
three are identical for all minmax opclasses: "opcinfo" (version 9 called
it "getopers"), "consistent" (v9 name "compare") and "add_value" (v9
name "maybeUpdateValues", not a loved name evidently).  A minmax opclass
on top of a sortable datatype has four additional support functions: one
for each function underlying the < <= >= > operators.  Other opclasses
would define their own support functions here, which would correspond to
functions used to implement the "consistent" and "add_value" functions
internally.
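
As an illustration only, the shape of a "sortable" opclass can be modelled in Python like this. The three common entry points and the four comparison support functions are taken from the description above; the class name, the Summary record and the string strategy names are assumptions, and NULL handling and collations are ignored.

```python
import operator
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Summary:
    lo: Optional[Any] = None
    hi: Optional[Any] = None


class SortableMinmaxOpclass:
    def __init__(self, lt=operator.lt, le=operator.le,
                 ge=operator.ge, gt=operator.gt):
        # The four per-datatype support functions (underlying < <= >= >).
        self.lt, self.le, self.ge, self.gt = lt, le, ge, gt

    def opcinfo(self):
        # How many values are stored per indexed column: min and max.
        return {"stored_values": 2}

    def add_value(self, s: Summary, v) -> bool:
        """Widen the summary to cover v; return True if it changed."""
        changed = False
        if s.lo is None or self.lt(v, s.lo):
            s.lo, changed = v, True
        if s.hi is None or self.gt(v, s.hi):
            s.hi, changed = v, True
        return changed

    def consistent(self, s: Summary, strategy: str, key) -> bool:
        """Could any value in the summarized range satisfy 'col <op> key'?"""
        if s.lo is None:
            return False   # toy model: nothing was ever added
        if strategy == "<":
            return self.lt(s.lo, key)
        if strategy == "<=":
            return self.le(s.lo, key)
        if strategy == "=":
            return self.le(s.lo, key) and self.ge(s.hi, key)
        if strategy == ">=":
            return self.ge(s.hi, key)
        if strategy == ">":
            return self.gt(s.hi, key)
        raise ValueError(f"unknown strategy {strategy!r}")


opc = SortableMinmaxOpclass()
s = Summary()
changes = [opc.add_value(s, v) for v in (5, 9, 7, 6)]
```

Passing different comparison callables to the constructor is the toy equivalent of pointing the pg_amproc entries at another datatype's functions.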

I don't claim this is 100% correct, but in particular I think it's now
possible to implement cross-datatype comparisons, so that a minmax index
defined on an int8 column works when the query uses an int4 operator,
for example.  (The current patch doesn't actually add such catalog
entries, though.  I think some minor code changes are required for this
to actually work.  However with the previous opclass definition it would
have been outright impossible.)

I fixed the bug reported by Masao-kun that collatable datatypes weren't
cleanly supported.  Collation OIDs are passed down now, although I don't
claim that it is bulletproof.  This could use some more testing.

I haven't yet updated the revmap definition per Heikki's review.  I am
not sure I want to do that right away.  I think we could live with what
we have now, and see about changing this later on in the 9.5 cycle if we
think a different definition is better.  I think what we have is pretty
solid even if there are some theoretical holes.

As a very quick test, I created a 10-million-tuple table with an int4
column on my laptop.  The table is ~346 MB.  Creating a btree index on
it takes 8 seconds.  A minmax index takes 1.6 seconds.  The btree index
is 214 MB.  The minmax index, with pages_per_range=1 is 1 MB.  With
pages_per_range=16 (default) it is 48kB.

Very unscientific results follow.  This is the btree doing an index-only
scan:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using bti2 on t  (cost=0.43..1692.75 rows=54416 width=4) (actual time=0.106..23.329 rows=54518 loops=1)
   Index Cond: ((a > 991243) AND (a < 1045762))
   Heap Fetches: 0
   Buffers: shared hit=1 read=152
 Planning time: 0.695 ms
 Execution time: 31.565 ms
(6 rows)

Time: 33.662 ms

Turn off index-only scan, do a regular index scan:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Scan using bti2 on t  (cost=0.43..1932.75 rows=54416 width=4) (actual time=0.066..31.027 rows=54518 loops=1)
   Index Cond: ((a > 991243) AND (a < 1045762))
   Buffers: shared hit=394
 Planning time: 0.250 ms
 Execution time: 39.218 ms
(5 rows)

Time: 40.385 ms

Use the 16-pages-per-range minmax index:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=16.60..47402.01 rows=54416 width=4) (actual time=4.266..43.948 rows=54518 loops=1)
   Recheck Cond: ((a > 991243) AND (a < 1045762))
   Rows Removed by Index Recheck: 32266
   Heap Blocks: lossy=384
   Buffers: shared hit=244 read=142
   ->  Bitmap Index Scan on ti2  (cost=0.00..3.00 rows=54416 width=0) (actual time=1.061..1.061 rows=3840 loops=1)
         Index Cond: ((a > 991243) AND (a < 1045762))
         Buffers: shared hit=2
 Planning time: 0.215 ms
 Execution time: 51.820 ms
(10 rows)

This is the 1-page-per-range minmax index:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=157.60..47543.01 rows=54416 width=4) (actual time=82.479..98.642 rows=54518 loops=1)
   Recheck Cond: ((a > 991243) AND (a < 1045762))
   Rows Removed by Index Recheck: 174
   Heap Blocks: lossy=242
   Buffers: shared hit=385
   ->  Bitmap Index Scan on ti  (cost=0.00..144.00 rows=54416 width=0) (actual time=82.448..82.448 rows=2420 loops=1)
         Index Cond: ((a > 991243) AND (a < 1045762))
         Buffers: shared hit=143
 Planning time: 0.280 ms
 Execution time: 103.542 ms
(10 rows)

Time: 104.952 ms

This is a seqscan.  Notice the high number of buffer accesses:

alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..194248.00 rows=54416 width=4) (actual time=161.338..1201.535 rows=54518 loops=1)
   Filter: ((a > 991243) AND (a < 1045762))
   Rows Removed by Filter: 9945482
   Buffers: shared hit=10672 read=33576
 Planning time: 0.189 ms
 Execution time: 1204.501 ms
(6 rows)

Time: 1205.304 ms

Of course, this isn't nearly a worst-case scenario for minmax, as the
data is perfectly correlated.  The pages_per_range=16 index benefits
particularly from that.
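
The numbers in the plans above can be cross-checked with a few lines of Python. All figures are copied from the EXPLAIN output; the only assumption is the 8192-byte block size (the PostgreSQL default), and the helper names are made up for this check.

```python
BLOCK_SIZE = 8192  # assumption: default PostgreSQL block size

# Seq scan: shared hit + read buffers gives the heap size in pages.
seq_buffers = 10672 + 33576
table_mib = seq_buffers * BLOCK_SIZE / 2**20

runs = {
    # pages_per_range: (lossy heap blocks, rows returned, rows removed by recheck)
    16: (384, 54518, 32266),
    1: (242, 54518, 174),
}


def rows_per_block(blocks, returned, removed):
    # Every row on a lossy block is either returned or rechecked away.
    return (returned + removed) / blocks


def ranges_visited(pages_per_range, blocks):
    # Whole page ranges are added to the bitmap at once.
    return blocks // pages_per_range


for ppr, (blocks, returned, removed) in runs.items():
    print(ppr, rows_per_block(blocks, returned, removed),
          ranges_visited(ppr, blocks))
```

Both runs come out at the same number of rows per heap block, which is what one would expect for fixed-width rows, and the page count agrees with the table size quoted above.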

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-13.patch

Re: Minmax indexes

From

Josh Berkus

Date:

05 August 2014, 23:55:53

On 08/05/2014 04:41 PM, Alvaro Herrera wrote:
> I have chosen to keep the name "minmax", even if the opclasses now let
> one implement completely different things on top of it such as geometry
> bounding boxes and bloom filters (aka bitmap indexes).  I don't see a
> need for a rename: essentially, in PR we can just say "we have these
> neat minmax indexes that other databases also have, but instead of just
> being used for integer data, they can also be used for geometry, GIS and
> bitmap indexes, so as always we're more powerful than everyone else when
> implementing new database features".

Plus we haven't come up with a better name ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Minmax indexes

From

Alvaro Herrera

Date:

06 August 2014, 02:35:11

FWIW I think I haven't responded appropriately to the points raised by
Heikki.  Basically, as I see it there are three main items:

1. the revmap physical-to-logical mapping is too complex; let's use
something else.

We had revmap originally in a separate fork.  The current approach grew
out of the necessity of putting it in the main fork while ensuring that
fast access to individual pages is possible.  There are of course many
ways to skin this cat; Heikki's proposal is to have it always occupy the
first few physical pages, rather than require a logical-to-physical
mapping table.  To implement this he proposes to move other pages out of
the way as the index grows.  I don't really have much love for this
idea.  We can change how this is implemented later in the cycle, if we
find that a different approach is better than my proposal.  I don't want
to spend endless time meddling with this (and I definitely don't want to
have this delay the eventual commit of the patch.)


2. vacuuming is not optimal

Right now, to eliminate garbage index tuples we need to first scan
the revmap to figure out which tuples are unreferenced.  There is a
concern that if there's an excess of dead tuples, the index becomes
unvacuumable because palloc() fails due to request size.  This is
largely theoretical because in order for this to happen there need to be
several million dead index tuples.  As a minimal fix to alleviate this
problem without requiring a complete rework of vacuuming, we can cap
that palloc request to maintenance_work_mem and remove dead tuples in a
loop instead of trying to remove all of them in a single pass.

Another thing proposed was to store range numbers (or just heap page
numbers) within each index tuple.  I felt that this would add more bloat
unnecessarily.  However, there is some padding space in index tuple that
maybe we can use to store range numbers.  I will think some more about
how we can use this to simplify vacuuming.
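
A sketch of the capped, looping removal in Python, under stated assumptions: index_tuples is a plain set of tuple identifiers, is_referenced is a hypothetical callback standing in for "the revmap points at this tuple", and max_batch plays the role of the cap derived from maintenance_work_mem.

```python
def vacuum_in_batches(index_tuples, is_referenced, max_batch):
    """Delete unreferenced tuples while remembering at most max_batch
    dead tuple ids at any one time. Returns (removed, passes)."""
    removed = passes = 0
    while True:
        batch = []
        for tid in index_tuples:
            if not is_referenced(tid):
                batch.append(tid)
                if len(batch) >= max_batch:
                    break          # the array is full; delete what we have
        if not batch:
            return removed, passes  # a full scan found nothing dead
        index_tuples.difference_update(batch)
        removed += len(batch)
        passes += 1


tuples = set(range(10))
live = {0, 2, 4, 6, 8}
result = vacuum_in_batches(tuples, lambda tid: tid in live, max_batch=2)
```

The cost of the cap is extra scans of the index: with five dead tuples and room for two at a time, three deleting passes are needed instead of one.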


3. avoid MMTuple as it is just unnecessary extra complexity.

The main thing that MMTuple adds is not the fact that we save 2 bytes
by storing BlockNumber as is instead of within a TID field.  Instead,
it's that we can construct and deconstruct using our own design, which
means we can use however many Datum entries we want and however many
"null" flags.  In normal heap and index tuples, there are always the
same number of datum/nulls.  In minmax, the number of nulls is twice the
number of indexed columns; the number of datum values is determined by
how many datum values are stored per opclass ("sortable" opclasses
store 2 columns, but geometry would store only one).  If we were to use
regular IndexTuples, we would lose that .. and I have no idea how it
would work.  In other words, MMTuples look fine to me.
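
The counting argument can be written down as a tiny Python sketch. It is purely illustrative: the function names are invented, and the per-column stored-value counts (2 for a "sortable" opclass, 1 for a geometry one) are the ones given above.

```python
def mmtuple_layout(stored_per_column):
    """stored_per_column[i] = datum values the opclass of column i stores."""
    n_columns = len(stored_per_column)
    return {
        "null_flags": 2 * n_columns,        # twice the indexed columns
        "datums": sum(stored_per_column),   # decided by each opclass
    }


def regular_tuple_layout(n_columns):
    # Regular heap and index tuples: one datum and one null per column.
    return {"null_flags": n_columns, "datums": n_columns}


# One sortable column (min, max) plus one geometry column (bounding box).
mixed = mmtuple_layout([2, 1])
```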

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Robert Haas

Date:

06 August 2014, 15:56:31

On Tue, Aug 5, 2014 at 7:55 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/05/2014 04:41 PM, Alvaro Herrera wrote:
>> I have chosen to keep the name "minmax", even if the opclasses now let
>> one implement completely different things on top of it such as geometry
>> bounding boxes and bloom filters (aka bitmap indexes).  I don't see a
>> need for a rename: essentially, in PR we can just say "we have these
>> neat minmax indexes that other databases also have, but instead of just
>> being used for integer data, they can also be used for geometry, GIS and
>> bitmap indexes, so as always we're more powerful than everyone else when
>> implementing new database features".
>
> Plus we haven't come up with a better name ...

Several good suggestions have been made, like "summarizing" or
"summary" indexes and "compressed range" indexes.  I still really
dislike the present name - you might think this is a type of index
that has something to do with optimizing "min" and "max", but what it
really is is a kind of small index for a big table.  The current name
couldn't make that less clear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Claudio Freire

Date:

06 August 2014, 16:29:13

On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> "Summary" seems good.  If I get enough votes I can change it to that.
>
> CREATE INDEX foo ON t USING summary (cols)
>
> "Summarizing" seems weird on that command.  Not sure about "compressed
> range", as you would have to use an abbreviation or run the words
> together.

Summarizing index sounds better to my ears, but both ideas based on
"summary" are quite succinct and to-the-point descriptions of what's
happening, so I vote for those.

Re: Minmax indexes

From

Claudio Freire

Date:

06 August 2014, 16:31:19

On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> CREATE INDEX foo ON t USING crange (cols)   -- misspelling of "cringe"?
> CREATE INDEX foo ON t USING comprange (cols)
> CREATE INDEX foo ON t USING compressedrng (cols)   -- ugh
> -- or use an identifier with whitespace:
> CREATE INDEX foo ON t USING "compressed range" (cols)

The word you'd use there is not necessarily the one you use on the
framework, since the framework applies to many such techniques, but
the index type there is one specific one.

The create command can still use minmax, or rangemap if you prefer
that, while the framework's code uses summary or summarizing.

Re: Minmax indexes

From

Bruce Momjian

Date:

06 August 2014, 16:35:25

On Wed, Aug  6, 2014 at 01:31:14PM -0300, Claudio Freire wrote:
> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > CREATE INDEX foo ON t USING crange (cols)   -- misspelling of "cringe"?
> > CREATE INDEX foo ON t USING comprange (cols)
> > CREATE INDEX foo ON t USING compressedrng (cols)   -- ugh
> > -- or use an identifier with whitespace:
> > CREATE INDEX foo ON t USING "compressed range" (cols)
> 
> 
> The word you'd use there is not necessarily the one you use on the
> framework, since the framework applies to many such techniques, but
> the index type there is one specific one.

"Block filter" indexes?

> The create command can still use minmax, or rangemap if you prefer
> that, while the framework's code uses summary or summarizing.

"Summary" sounds like materialized views to me.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: Minmax indexes

From

Claudio Freire

Date:

06 August 2014, 16:42:22

On Wed, Aug 6, 2014 at 1:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Aug  6, 2014 at 01:31:14PM -0300, Claudio Freire wrote:
>> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > CREATE INDEX foo ON t USING crange (cols)   -- misspelling of "cringe"?
>> > CREATE INDEX foo ON t USING comprange (cols)
>> > CREATE INDEX foo ON t USING compressedrng (cols)   -- ugh
>> > -- or use an identifier with whitespace:
>> > CREATE INDEX foo ON t USING "compressed range" (cols)
>>
>>
>> The word you'd use there is not necessarily the one you use on the
>> framework, since the framework applies to many such techniques, but
>> the index type there is one specific one.
>
> "Block filter" indexes?


Nice one

Re: Minmax indexes

From

Claudio Freire

Date:

06 August 2014, 17:08:55

On Wed, Aug 6, 2014 at 1:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Claudio Freire wrote:
>> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > CREATE INDEX foo ON t USING crange (cols)   -- misspelling of "cringe"?
>> > CREATE INDEX foo ON t USING comprange (cols)
>> > CREATE INDEX foo ON t USING compressedrng (cols)   -- ugh
>> > -- or use an identifier with whitespace:
>> > CREATE INDEX foo ON t USING "compressed range" (cols)
>>
>> The word you'd use there is not necessarily the one you use on the
>> framework, since the framework applies to many such techniques, but
>> the index type there is one specific one.
>>
>> The create command can still use minmax, or rangemap if you prefer
>> that, while the framework's code uses summary or summarizing.
>
> I think you're confusing the AM name with the opclass name.  The name
> you specify in that part of the command is the access method name.  You
> can specify the opclass together with each column, like so:
>
> CREATE INDEX foo ON t USING blockfilter
>         (order_date date_minmax_ops, geometry gis_bbox_ops);

Oh, uh... no, I'm not confusing them, but now I just realized how one
would implement other classes of block filtering indexes, and yeah...
you do it through the opclasses.

I'm sticking to bloom filters:

CREATE INDEX foo ON t USING blockfilter (order_date date_minmax_ops,
path character_bloom_ops);

Cool. Very cool.

So, I like blockfilter a lot. I change my vote to blockfilter ;)

Re: Minmax indexes

From

Nicolas Barbier

Date:

06 August 2014, 20:06:49

2014-08-06 Claudio Freire <klaussfreire@gmail.com>:

> So, I like blockfilter a lot. I change my vote to blockfilter ;)

+1 for blockfilter, because it stresses the fact that the "physical"
arrangement of rows in blocks matters for this index.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

Re: Minmax indexes

From

Robert Haas

Date:

07 August 2014, 13:53:07

On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>
>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>
> +1 for blockfilter, because it stresses the fact that the "physical"
> arrangement of rows in blocks matters for this index.

I don't like that quite as well as summary, but I'd prefer either to
the current naming.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Simon Riggs

Date:

07 August 2014, 14:16:45

On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
> <nicolas.barbier@gmail.com> wrote:
>> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>>
>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>
>> +1 for blockfilter, because it stresses the fact that the "physical"
>> arrangement of rows in blocks matters for this index.
>
> I don't like that quite as well as summary, but I'd prefer either to
> the current naming.

Yes, "summary index" isn't good. I'm not sure where the block or the
filter part comes in though, so -1 to "block filter", not least
because it doesn't have a good abbreviation (bfin??).

A better description would be "block range index" since we are
indexing a range of blocks (not just one block). Perhaps a better one
would be simply "range index", which we could abbreviate to RIN or
BRIN.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Claudio Freire

Date:

07 August 2014, 14:19:08

On Thu, Aug 7, 2014 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
>> <nicolas.barbier@gmail.com> wrote:
>>> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>>>
>>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>>
>>> +1 for blockfilter, because it stresses the fact that the "physical"
>>> arrangement of rows in blocks matters for this index.
>>
>> I don't like that quite as well as summary, but I'd prefer either to
>> the current naming.
>
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).

Block filter would refer to the index property that selects blocks,
not tuples, and it does so through a "filter function" (for min-max,
it's a range check, but for other opclasses it could be anything).
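
In toy Python form, with made-up names and data, the idea looks like this: one summary per range of heap blocks, a pluggable filter function, and a result expressed in blocks rather than tuples. The set-membership filter is only a stand-in for a bloom filter.

```python
def candidate_blocks(summaries, pages_per_range, matches):
    """summaries[i] covers heap blocks [i * ppr, (i + 1) * ppr).
    matches(summary) is the opclass-specific filter function."""
    blocks = []
    for range_no, summary in enumerate(summaries):
        if matches(summary):
            start = range_no * pages_per_range
            blocks.extend(range(start, start + pages_per_range))
    return blocks


# min-max opclass: the filter is a range-overlap check.
minmax_summaries = [(1, 100), (101, 200), (150, 300)]
between_120_and_160 = lambda s: s[0] <= 160 and s[1] >= 120
minmax_hits = candidate_blocks(minmax_summaries, 2, between_120_and_160)

# Some other opclass: here the filter is plain set membership.
set_summaries = [{"a", "b"}, {"c"}, {"a"}]
set_hits = candidate_blocks(set_summaries, 2, lambda s: "a" in s)
```

Either way the result is lossy: the rows on the selected blocks still have to be rechecked against the actual condition.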

Re: Minmax indexes

From

Alvaro Herrera

Date:

07 August 2014, 14:43:06

Simon Riggs wrote:
> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
> > <nicolas.barbier@gmail.com> wrote:
> >> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
> >>
> >>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
> >>
> >> +1 for blockfilter, because it stresses the fact that the "physical"
> >> arrangement of rows in blocks matters for this index.
> >
> > I don't like that quite as well as summary, but I'd prefer either to
> > the current naming.
> 
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).

I was thinking just "blockfilter" (I did show a sample command).
Claudio explained the name downthread; personally, of all the options
suggested thus far, it's the one I like the most (including minmax).

At this point, the naming issue is what is keeping me from committing
this patch, so the quicker we can solve it, the merrier.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Robert Haas

Date:

07 August 2014, 14:58:14

On Thu, Aug 7, 2014 at 10:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
>> <nicolas.barbier@gmail.com> wrote:
>>> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>>>
>>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>>
>>> +1 for blockfilter, because it stresses the fact that the "physical"
>>> arrangement of rows in blocks matters for this index.
>>
>> I don't like that quite as well as summary, but I'd prefer either to
>> the current naming.
>
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).
>
> A better description would be "block range index" since we are
> indexing a range of blocks (not just one block). Perhaps a better one
> would be simply "range index", which we could abbreviate to RIN or
> BRIN.

range index might get confused with range types; block range index
seems better.  I like summary, but I'm fine with block range index or
block filter index, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Minmax indexes

From

Oleg Bartunov

Date:

07 August 2014, 15:38:27

+1 for BRIN !

On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
>> <nicolas.barbier@gmail.com> wrote:
>>> 2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
>>>
>>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>>
>>> +1 for blockfilter, because it stresses the fact that the "physical"
>>> arrangement of rows in blocks matters for this index.
>>
>> I don't like that quite as well as summary, but I'd prefer either to
>> the current naming.
>
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).
>
> A better description would be "block range index" since we are
> indexing a range of blocks (not just one block). Perhaps a better one
> would be simply "range index", which we could abbreviate to RIN or
> BRIN.

Re: Minmax indexes

From

Nicolas Barbier

Date:

07 August 2014, 18:04:55

2014-08-07 Oleg Bartunov <obartunov@gmail.com>:

> +1 for BRIN !

+1, rolls off the tongue smoothly and captures the essence :-).

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

Re: Minmax indexes

From

Petr Jelinek

Date:

07 August 2014, 18:23:54

On 07/08/14 16:16, Simon Riggs wrote:
>
> A better description would be "block range index" since we are
> indexing a range of blocks (not just one block). Perhaps a better one
> would be simply "range index", which we could abbreviate to RIN or
> BRIN.
>

+1 for block range index (BRIN)

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Alvaro Herrera

Date:

07 August 2014, 22:40:38

Simon Riggs wrote:

> A better description would be "block range index" since we are
> indexing a range of blocks (not just one block). Perhaps a better one
> would be simply "range index", which we could abbreviate to RIN or
> BRIN.

Seems a lot of people liked BRIN.  I will be adopting that by renaming
files and directories soon.

Here's v14.  I fixed a few bugs; most notably, queries with IS NULL and
IS NOT NULL now work correctly.  Also I made the pageinspect extension
be able to display existing index tuples (I had disabled that when
generalizing the opclass stuff).  It only works with minmax opclasses
for now; it should be easy to fix if/when we add more stuff though.

I also added some docs.  These are not finished by any means.  They
talk about the index using the BRIN term.

All existing opclasses were renamed to "<type>_minmax_ops".

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-14.patch

Re: Minmax indexes

From

Josh Berkus

Date:

08 August 2014, 00:47:52

On 08/07/2014 08:38 AM, Oleg Bartunov wrote:
> +1 for BRIN !
> 
> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>> A better description would be "block range index" since we are
>> indexing a range of blocks (not just one block). Perhaps a better one
>> would be simply "range index", which we could abbreviate to RIN or
>> BRIN.

How about Block Range Dynamic indexes?

Or Range Usage Metadata indexes?

You see what I'm getting at:

BRanDy

RUM

... to keep with our "new indexes" naming scheme ...


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Minmax indexes

From

Michael Paquier

Date:

08 August 2014, 00:52:57

On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/07/2014 08:38 AM, Oleg Bartunov wrote:
>> +1 for BRIN !
>>
>> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>>> A better description would be "block range index" since we are
>>> indexing a range of blocks (not just one block). Perhaps a better one
>>> would be simply "range index", which we could abbreviate to RIN or
>>> BRIN.
>
> How about Block Range Dynamic indexes?
>
> Or Range Usage Metadata indexes?
>
> You see what I'm getting at:
>
> BRanDy
>
> RUM
>
> ... to keep with our "new indexes" naming scheme ...
Not the best fit for kids, fine for grad students.

BRIN seems to be a perfect consensus, so +1 for it.
-- 
Michael

Re: Minmax indexes

From

Peter Geoghegan

Date:

08 August 2014, 00:56:50

On Thu, Aug 7, 2014 at 7:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> range index might get confused with range types; block range index
> seems better.  I like summary, but I'm fine with block range index or
> block filter index, too.

+1


-- 
Peter Geoghegan

Re: Minmax indexes

From

Josh Berkus

Date:

08 August 2014, 02:04:28

On 08/07/2014 05:52 PM, Michael Paquier wrote:
> On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 08/07/2014 08:38 AM, Oleg Bartunov wrote:
>>> +1 for BRIN !
>>>
>>> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> A better description would be "block range index" since we are
>>>> indexing a range of blocks (not just one block). Perhaps a better one
>>>> would be simply "range index", which we could abbreviate to RIN or
>>>> BRIN.
>>
>> How about Block Range Dynamic indexes?
>>
>> Or Range Usage Metadata indexes?
>>
>> You see what I'm getting at:
>>
>> BRanDy
>>
>> RUM
>>
>> ... to keep with our "new indexes" naming scheme ...
> Not the best fit for kids, fine for grad students.

But, it goes perfectly with our GIN and VODKA indexes.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Minmax indexes

From

Heikki Linnakangas

Date:

08 August 2014, 08:30:02

On 08/06/2014 05:35 AM, Alvaro Herrera wrote:
> FWIW I think I haven't responded appropriately to the points raised by
> Heikki.  Basically, as I see it there are three main items:
>
> 1. the revmap physical-to-logical mapping is too complex; let's use
> something else.
>
> We had revmap originally in a separate fork.  The current approach grew
> out of the necessity of putting it in the main fork while ensuring that
> fast access to individual pages is possible.  There are of course many
> ways to skin this cat; Heikki's proposal is to have it always occupy the
> first few physical pages, rather than require a logical-to-physical
> mapping table.  To implement this he proposes to move other pages out of
> the way as the index grows.  I don't really have much love for this
> idea.  We can change how this is implemented later in the cycle, if we
> find that a different approach is better than my proposal.  I don't want
> to spend endless time meddling with this (and I definitely don't want to
> have this delay the eventual commit of the patch.)

Please also note that LockTuple is pretty expensive, compared to
lightweight locks.  Remember how Robert made hash indexes significantly
faster a couple of years ago (commit 76837c15) by removing the need for
heavy-weight locks during queries. To demonstrate that, I applied your
patch, and ran a very simple test:

create table numbers as select i*1000+j as n from generate_series(0,
19999) i, generate_series(1, 1000) j;
create index number_minmax on numbers using minmax (n) with
(pages_per_range=1);

I ran "explain analyze select * from numbers where n = 10;" a few times
under "perf" profiler. The full profile is attached, but here's the top 10:

Samples: 3K of event 'cycles', Event count (approx.): 2332550418

+  24.15%  postmaster  postgres           [.] hash_search_with_hash_value
+  10.55%  postmaster  postgres           [.] LWLockAcquireCommon
+   7.12%  postmaster  postgres           [.] hash_any
+   6.77%  postmaster  postgres           [.] minmax_deform_tuple
+   6.67%  postmaster  postgres           [.] LWLockRelease
+   5.55%  postmaster  postgres           [.] AllocSetAlloc
+   4.37%  postmaster  postgres           [.] SetupLockInTable.isra.2
+   2.79%  postmaster  postgres           [.] LockRelease
+   2.67%  postmaster  postgres           [.] LockAcquireExtended
+   2.54%  postmaster  postgres           [.] mmgetbitmap

If you drill into those functions, you'll see that most of the time
spent in hash_search_with_hash_value, LWLockAcquireCommon and hash_any
are coming from heavy-weight lock handling. At a rough estimate, about
1/3 of the CPU time is spent on LockTuple/UnlockTuple.

Maybe we don't care because it's fast enough anyway, but it just seems
like we're leaving a lot of money on the table. Because of that, and all
the other reasons already discussed, I strongly feel that this design
should be changed.

> 3. avoid MMTuple as it is just unnecessary extra complexity.
>
> The main thing that MMTuple adds is not the fact that we save 2 bytes
> by storing BlockNumber as is instead of within a TID field.  Instead,
> it's that we can construct and deconstruct using our own design, which
> means we can use however many Datum entries we want and however many
> "null" flags.  In normal heap and index tuples, there are always the
> same number of datum/nulls.  In minmax, the number of nulls is twice the
> number of indexed columns; the number of datum values is determined by
> how many datum values are stored per opclass ("sortable" opclasses
> store 2 columns, but geometry would store only one).

Hmm. Why is the number of null bits 2x the number of indexed columns? I
would expect there to be one null bit per stored Datum.

(/me looks at the patch):

>         /*
>          * We need a double-length bitmap on an on-disk minmax index tuple;
>          * the first half stores the "allnulls" bits, the second stores
>          * "hasnulls".
>          */

So, one bit means whether there are any heap tuples with a NULL in the
indexed column, and the other bit means if the value stored for that
column is a NULL. Does that mean that it's not possible to store a NULL
minimum, but non-NULL maximum, for a single column? I can't immediately
think of an example where you'd want to do that, but I'm also not
convinced that no opclass would ever want that. Individual bits are
cheap, so I'm inclined to rather have too many of them than regret later.
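
A minimal sketch of the bitmap layout described in that comment, assuming one
"allnulls" bit per indexed column in the first half and one "hasnulls" bit per
column in the second half. The function names and the bit ordering are
illustrative assumptions, not taken from the patch:

```python
def pack_null_bits(allnulls, hasnulls):
    """Pack per-column flags into one integer: allnulls first, then hasnulls."""
    if len(allnulls) != len(hasnulls):
        raise ValueError("need one allnulls and one hasnulls flag per column")
    bits = 0
    for position, flag in enumerate(list(allnulls) + list(hasnulls)):
        if flag:
            bits |= 1 << position
    return bits


def unpack_null_bits(bits, ncolumns):
    """Inverse of pack_null_bits: returns (allnulls, hasnulls) lists."""
    flags = [bool((bits >> position) & 1) for position in range(2 * ncolumns)]
    return flags[:ncolumns], flags[ncolumns:]


# Two indexed columns: column 0 is all NULLs, column 1 has some NULLs.
packed = pack_null_bits([True, False], [False, True])
print(packed, unpack_null_bits(packed, 2))
```

With this layout the bitmap length depends only on the number of indexed
columns, not on how many Datums the opclass stores, which is the distinction
raised above.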

In any case, it should be documented in minmax_tuple.h what those
null-bits are and how they're laid out in the bitmap. The comment there
currently just says that there are "two null bits for each value stored",
which is actually wrong: you're storing two bits per indexed column, not
two bits per value stored. (I just suggested changing that, after which
the comment would be correct.)

PS. Please add regression tests. It would also be good to implement at
least one other opclass than the b-tree based ones, to make sure that
the code actually works with something else too. I'd suggest
implementing the bounding box opclass for points, that seems simple.

- Heikki

Attachment

minmax-locktuple-is-expensive-profile

Re: Minmax indexes

From

Heikki Linnakangas

Date:

08 August 2014, 09:02:29

I think there's a race condition in mminsert, if two backends insert a 
tuple to the same heap page range concurrently. mminsert does this:

1. Fetch the MMtuple for the page range
2. Check if any of the stored datums need updating
3. Unlock the page.
4. Lock the page again in exclusive mode.
5. Update the tuple.

It's possible that two backends arrive at phase 3 at the same time, with 
different values. For example, backend A wants to update the minimum to 
contain 10, and and backend B wants to update it to 5. Now, if backend B 
gets to update the tuple first, to 5, backend A will update the tuple to 
10 when it gets the lock, which is wrong.

The simplest solution would be to get the buffer lock in exclusive mode 
to begin with, so that you don't need to release it between steps 2 and 
5. That might be a significant hit on concurrency, though, when most of 
the insertions don't in fact have to update the value. Another idea is 
to re-check the updated values after acquiring the lock in exclusive 
mode, to see if they match the previous values.
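
The interleaving above, and the effect of re-checking under the exclusive
lock, can be sketched as a small deterministic simulation. This is only an
illustration: RangeSummary and the update functions are hypothetical stand-ins
for the index tuple and mminsert, and the locks are represented by the order
of the calls rather than by real locking:

```python
from dataclasses import dataclass


@dataclass
class RangeSummary:
    """Stand-in for one index tuple: the min/max of a heap page range."""
    minimum: int
    maximum: int


def needs_update(summary, value):
    # Step 2: would inserting `value` widen the stored bounds?
    return value < summary.minimum or value > summary.maximum


def update_without_recheck(summary, value):
    # Racy variant: trusts the decision made before the lock was released
    # and overwrites the minimum unconditionally (only the minimum matters
    # in this example).
    summary.minimum = value


def update_with_recheck(summary, value):
    # Re-apply the test under the exclusive lock, so bounds only ever widen.
    summary.minimum = min(summary.minimum, value)
    summary.maximum = max(summary.maximum, value)


def run(update):
    summary = RangeSummary(minimum=20, maximum=30)
    # Both backends inspect the tuple under a share lock (steps 1-2) ...
    a_wants = needs_update(summary, 10)
    b_wants = needs_update(summary, 5)
    # ... then backend B gets the exclusive lock first, followed by A.
    if b_wants:
        update(summary, 5)
    if a_wants:
        update(summary, 10)
    return summary


racy = run(update_without_recheck)
safe = run(update_with_recheck)
print(racy.minimum, safe.minimum)  # prints "10 5"
```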

- Heikki

Re: Minmax indexes

From

Heikki Linnakangas

Date:

08 August 2014, 13:22:13

Another race condition:

If a new tuple is inserted to the range while summarization runs, it's 
possible that the new tuple isn't included in the tuple that the 
summarization calculated, nor does the insertion itself update it.

1. There is no index tuple for page range 1-10
2. Summarization begins. It scans pages 1-5.
3. A new insertion inserts a heap tuple to page 1.
4. The insertion sees that there is no index tuple covering range 1-10, 
so it doesn't update it.
5. The summarization finishes scanning pages 5-10, and inserts the new 
index tuple. The summarization didn't see the newly inserted heap tuple, 
and hence it's not included in the calculated index tuple.

One idea is to do the summarization in two stages. First, insert a 
placeholder tuple, with no real value in it. A query considers the 
placeholder tuple the same as a missing tuple, ie. always considers it a 
match. An insertion updates the placeholder tuple with the value 
inserted, as if it was a regular mmtuple. After summarization has 
finished scanning the page range, it turns the placeholder tuple into a 
regular tuple, by unioning the placeholder value with the value formed 
by scanning the heap.
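
A rough sketch of the two-stage idea, under the assumption that an insertion
treats a placeholder like any other index tuple. RangeEntry, heap_insert and
summarize are illustrative names, and the concurrent insertion is simulated by
a callback fired after the first heap page has been scanned:

```python
class RangeEntry:
    """Index entry for one page range; placeholder=True while summarizing."""

    def __init__(self):
        self.placeholder = True
        self.lo = None
        self.hi = None

    def add_value(self, value):
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def may_contain(self, value):
        if self.placeholder:
            return True  # same as a missing tuple: always a match
        return self.lo is not None and self.lo <= value <= self.hi


def heap_insert(index, range_no, heap_pages, page_no, value):
    heap_pages[page_no].append(value)
    entry = index.get(range_no)
    if entry is not None:
        entry.add_value(value)  # placeholder or regular tuple alike


def summarize(index, range_no, heap_pages, midway=None):
    entry = RangeEntry()
    index[range_no] = entry  # stage 1: publish the placeholder
    scanned = RangeEntry()
    for page_no, page in enumerate(heap_pages):
        for value in page:
            scanned.add_value(value)
        if midway is not None and page_no == 0:
            midway()  # simulated insertion into an already-scanned page
    # Stage 2: union the placeholder's value with the scanned value.
    if scanned.lo is not None:
        entry.add_value(scanned.lo)
        entry.add_value(scanned.hi)
    entry.placeholder = False


heap = [[50, 60], [70, 80]]
index = {}
summarize(index, 0, heap,
          midway=lambda: heap_insert(index, 0, heap, 0, 5))
print(index[0].lo, index[0].hi)  # prints "5 80"
```

The value 5 lands on a page the scan has already passed, yet it still ends up
in the final summary because the insertion updated the placeholder.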

- Heikki

Re: Minmax indexes

From

Heikki Linnakangas

Date:

08 August 2014, 15:04:17

I couldn't resist starting to hack on this, and implemented the scheme
I've been having in mind:

1. MMTuple contains the block number of the heap page (range) that the
tuple represents. Vacuum is no longer needed to clean up old tuples;
when an index tuple is updated, the old tuple is deleted atomically
with the insertion of a new tuple and updating the revmap, so no garbage
is left behind.

2. LockTuple is gone. When following the pointer from revmap to MMTuple,
the block number is used to check that you land on the right tuple. If
not, the search is started over, looking at the revmap again.

I'm sure this still needs some cleanup, but here's the patch, based on
your v14. Now that I know what this approach looks like, I still like it
much better. The insert and update code is somewhat more complicated,
because you have to be careful to lock the old page, new page, and
revmap page in the right order. But it's not too bad, and it gets rid of
all the complexity in vacuum.

- Heikki

Attachment

minmax-v14-heikki-2.patch

Re: Minmax indexes

From

Simon Riggs

Date:

10 August 2014, 09:23:05

On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> 1. MMTuple contains the block number of the heap page (range) that the tuple
> represents. Vacuum is no longer needed to clean up old tuples; when an index
> tuple is updated, the old tuple is deleted atomically with the insertion of
> a new tuple and updating the revmap, so no garbage is left behind.

What happens if the transaction that does this aborts? Surely that
means the new value is itself garbage? What cleans up that?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Simon Riggs

Date:

10 August 2014, 09:37:44

On 8 August 2014 10:01, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> It's possible that two backends arrive at phase 3 at the same time, with
> different values. For example, backend A wants to update the minimum to
> contain 10, and backend B wants to update it to 5. Now, if backend B
> gets to update the tuple first, to 5, backend A will update the tuple to 10
> when it gets the lock, which is wrong.
>
> The simplest solution would be to get the buffer lock in exclusive mode to
> begin with, so that you don't need to release it between steps 2 and 5. That
> might be a significant hit on concurrency, though, when most of the
> insertions don't in fact have to update the value. Another idea is to
> re-check the updated values after acquiring the lock in exclusive mode, to
> see if they match the previous values.

Simplest solution is to re-apply the test just before update, so in
the above example, if we think we want to lower the minimum to 10 and
when we get there it is already 5, we just don't update.

We don't need to do the re-check always, though. We can read the page
LSN while holding share lock, then re-read it once we acquire
exclusive lock. If LSN is the same, no need for datatype specific
re-checks at all.
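
A sketch of that LSN shortcut, assuming every modification of the page bumps
its LSN. The Page class and the returned strings are illustrative only, and
the share/exclusive locks are implied by the order of operations rather than
taken for real:

```python
class Page:
    """Stand-in for an index page holding one min/max summary."""

    def __init__(self, minimum, maximum):
        self.minimum = minimum
        self.maximum = maximum
        self.lsn = 0  # bumped on every modification

    def covers(self, value):
        return self.minimum <= value <= self.maximum

    def widen(self, value):
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.lsn += 1


def insert(page, value, concurrent=None):
    # Under the share lock: decide, and remember the page LSN.
    if page.covers(value):
        return "already covered"
    seen_lsn = page.lsn
    # Share lock released here; another backend may modify the page.
    if concurrent is not None:
        concurrent(page)
    # Under the exclusive lock: an unchanged LSN means the page is unchanged,
    # so the earlier decision still holds and no comparison is repeated.
    if page.lsn == seen_lsn:
        page.widen(value)
        return "updated, re-check skipped"
    if page.covers(value):
        return "re-checked, no update needed"
    page.widen(value)
    return "re-checked, updated"


quiet = Page(20, 30)
busy = Page(20, 30)
print(insert(quiet, 10))
print(insert(busy, 10, concurrent=lambda p: p.widen(5)))
```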

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Simon Riggs

Date:

10 August 2014, 09:42:58

On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> I couldn't resist starting to hack on this, and implemented the scheme I've
> been having in mind:
>
> 1. MMTuple contains the block number of the heap page (range) that the tuple
> represents. Vacuum is no longer needed to clean up old tuples; when an index
> tuple is updated, the old tuple is deleted atomically with the insertion of
> a new tuple and updating the revmap, so no garbage is left behind.
>
> 2. LockTuple is gone. When following the pointer from revmap to MMTuple, the
> block number is used to check that you land on the right tuple. If not, the
> search is started over, looking at the revmap again.

Part 2 sounds interesting, especially because of the reduction in CPU
that it might allow.

Part 1 doesn't sound good yet.
Are they connected?

More importantly, can't we tweak this after commit? Delaying commit
just means less time for other people to see, test, understand, tune
and fix. I see you (Heikki) doing lots of incremental development,
lots of small commits. Can't we do this one the same?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Heikki Linnakangas

Date:

10 August 2014, 10:20:33

On 08/10/2014 12:22 PM, Simon Riggs wrote:
> On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
>> 1. MMTuple contains the block number of the heap page (range) that the tuple
>> represents. Vacuum is no longer needed to clean up old tuples; when an index
>> tuple is updated, the old tuple is deleted atomically with the insertion of
>> a new tuple and updating the revmap, so no garbage is left behind.
>
> What happens if the transaction that does this aborts? Surely that
> means the new value is itself garbage? What cleans up that?

It's no different from Alvaro's patch. The updated MMTuple covers the 
aborted value, but that's OK from a correctness point of view.

- Heikki

Re: Minmax indexes

From

Heikki Linnakangas

Date:

10 August 2014, 10:27:46

On 08/10/2014 12:42 PM, Simon Riggs wrote:
> On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
>> I couldn't resist starting to hack on this, and implemented the scheme I've
>> been having in mind:
>>
>> 1. MMTuple contains the block number of the heap page (range) that the tuple
>> represents. Vacuum is no longer needed to clean up old tuples; when an index
>> tuple is updated, the old tuple is deleted atomically with the insertion of
>> a new tuple and updating the revmap, so no garbage is left behind.
>>
>> 2. LockTuple is gone. When following the pointer from revmap to MMTuple, the
>> block number is used to check that you land on the right tuple. If not, the
>> search is started over, looking at the revmap again.
>
> Part 2 sounds interesting, especially because of the reduction in CPU
> that it might allow.
>
> Part 1 doesn't sound good yet.
> Are they connected?

Yes. The optimistic locking in part 2 is based on checking that the 
block number on the MMTuple matches what you're searching for, and that 
there is never more than one MMTuple in the index with the same block 
number.

> More importantly, can't we tweak this after commit? Delaying commit
> just means less time for other people to see, test, understand, tune
> and fix. I see you (Heikki) doing lots of incremental development,
> lots of small commits. Can't we do this one the same?

Well, I wouldn't consider "let's redesign how locking and vacuuming 
works and change the on-disk format" as incremental development ;-). 
It's more like, well, redesigning the whole thing. Any testing and 
tuning would certainly need to be redone after such big changes.

If you agree that these changes make sense, let's do them now and not 
waste people's time testing and tuning a dead-end design. If you don't 
agree, then let's discuss that.

- Heikki

Re: Minmax indexes

From

Claudio Freire

Date:

10 August 2014, 17:43:24

On Fri, Aug 8, 2014 at 6:01 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> It's possible that two backends arrive at phase 3 at the same time, with
> different values. For example, backend A wants to update the minimum to
> contain 10, and backend B wants to update it to 5. Now, if backend B
> gets to update the tuple first, to 5, backend A will update the tuple to 10
> when it gets the lock, which is wrong.
>
> The simplest solution would be to get the buffer lock in exclusive mode to
> begin with, so that you don't need to release it between steps 2 and 5. That
> might be a significant hit on concurrency, though, when most of the
> insertions don't in fact have to update the value. Another idea is to
> re-check the updated values after acquiring the lock in exclusive mode, to
> see if they match the previous values.


No, the simplest solution is to re-check the bounds after acquiring
the exclusive lock. So instead of doing the addValue with share lock,
do a consistency check first, and if it's not consistent, do the
addValue with exclusive lock.

Re: Minmax indexes

From

Alvaro Herrera

Date:

12 August 2014, 22:53:03

Heikki Linnakangas wrote:
> I couldn't resist starting to hack on this, and implemented the
> scheme I've been having in mind:
>
> 1. MMTuple contains the block number of the heap page (range) that
> the tuple represents. Vacuum is no longer needed to clean up old
> tuples; when an index tuple is updated, the old tuple is deleted
> atomically with the insertion of a new tuple and updating the
> revmap, so no garbage is left behind.
>
> 2. LockTuple is gone. When following the pointer from revmap to
> MMTuple, the block number is used to check that you land on the
> right tuple. If not, the search is started over, looking at the
> revmap again.

Thanks, looks good, yeah.  Did you just forget to attach the
access/rmgrdesc/minmaxdesc.c file, or did you ignore it altogether?
Anyway I hacked one up, and cleaned up some other things.

> I'm sure this still needs some cleanup, but here's the patch, based
> on your v14. Now that I know what this approach looks like, I still
> like it much better. The insert and update code is somewhat more
> complicated, because you have to be careful to lock the old page,
> new page, and revmap page in the right order. But it's not too bad,
> and it gets rid of all the complexity in vacuum.

It seems there is some issue here, because pageinspect tells me the
index is not growing properly for some reason.  minmax_revmap_data gives
me this array of TIDs after a bunch of insert/vacuum/delete/ etc:


"(2,1)","(2,2)","(2,3)","(2,4)","(2,5)","(4,1)","(5,1)","(6,1)","(7,1)","(8,1)","(9,1)","(10,1)","(11,1)","(12,1)","(13,1)","(14,1)","(15,1)","(16,1)","(17,1)","(18,1)","(19,1)","(20,1)","(21,1)","(22,1)","(23,1)","(24,1)","(25,1)","(26,1)","(27,1)","(28,1)","(29,1)","(30,1)","(31,1)","(32,1)","(33,1)","(34,1)","(35,1)","(36,1)","(37,1)","(38,1)","(39,1)","(40,1)","(41,1)","(42,1)","(43,1)","(44,1)","(45,1)","(46,1)","(47,1)","(48,1)","(49,1)","(50,1)","(51,1)","(52,1)","(53,1)","(54,1)","(55,1)","(56,1)","(57,1)","(58,1)","(59,1)","(60,1)","(61,1)","(62,1)","(63,1)","(64,1)","(65,1)","(66,1)","(67,1)","(68,1)","(69,1)","(70,1)","(71,1)","(72,1)","(73,1)","(74,1)","(75,1)","(76,1)","(77,1)","(78,1)","(79,1)","(80,1)","(81,1)","(82,1)","(83,1)","(84,1)","(85,1)","(86,1)","(87,1)","(88,1)","(89,1)","(90,1)","(91,1)","(92,1)","(93,1)","(94,1)","(95,1)","(96,1)","(97,1)","(98,1)","(99,1)","(100,1)","(101,1)","(102,1)","(103,1)","(104,1)","(105,1)","(106,1)","(107,1)","(108,1)","(109,1)","(110,1)","(111,1)","(112,1)","(113,1)","(114,1)","(115,1)","(116,1)","(117,1)","(118,1)","(119,1)","(120,1)","(121,1)","(122,1)","(123,1)","(124,1)","(125,1)","(126,1)","(127,1)","(128,1)","(129,1)","(130,1)","(131,1)","(132,1)","(133,1)","(134,1)"

There are some who would think that getting one item per page is
suboptimal.  (Maybe it's just a missing FSM update somewhere.)


I've been hacking away a bit more at this; will post updated patch
probably tomorrow (was about to post but just found a memory stomp in
pageinspect.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Alvaro Herrera

Date:

14 August 2014, 23:03:11

Alvaro Herrera wrote:
> Heikki Linnakangas wrote:

> > I'm sure this still needs some cleanup, but here's the patch, based
> > on your v14. Now that I know what this approach looks like, I still
> > like it much better. The insert and update code is somewhat more
> > complicated, because you have to be careful to lock the old page,
> > new page, and revmap page in the right order. But it's not too bad,
> > and it gets rid of all the complexity in vacuum.
>
> It seems there is some issue here, because pageinspect tells me the
> index is not growing properly for some reason.  minmax_revmap_data gives
> me this array of TIDs after a bunch of insert/vacuum/delete/ etc:

I fixed this issue, and did a lot more rework and bugfixing.  Here's
v15, based on v14-heikki2.

I think remaining issues are mostly minimal (pageinspect should output
block number alongside each tuple, now that we have it, for example.)

I haven't tested the new xlog records yet.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-15.patch

Re: Minmax indexes

From

Heikki Linnakangas

Date:

15 August 2014, 07:27:15

On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>
>>> I'm sure this still needs some cleanup, but here's the patch, based
>>> on your v14. Now that I know what this approach looks like, I still
>>> like it much better. The insert and update code is somewhat more
>>> complicated, because you have to be careful to lock the old page,
>>> new page, and revmap page in the right order. But it's not too bad,
>>> and it gets rid of all the complexity in vacuum.
>>
>> It seems there is some issue here, because pageinspect tells me the
>> index is not growing properly for some reason.  minmax_revmap_data gives
>> me this array of TIDs after a bunch of insert/vacuum/delete/ etc:
>
> I fixed this issue, and did a lot more rework and bugfixing.  Here's
> v15, based on v14-heikki2.

Thanks!

> I think remaining issues are mostly minimal (pageinspect should output
> block number alongside each tuple, now that we have it, for example.)

There's this one issue I left in my patch version that I think we should 
do something about:

> +         /*
> +          * No luck. Assume that the revmap was updated concurrently.
> +          *
> +          * XXX: it would be nice to add some kind of a sanity check here to
> +          * avoid looping infinitely, if the revmap points to wrong tuple for
> +          * some reason.
> +          */

This happens when we follow the revmap to a tuple, but find that the 
tuple points to a different block than what the revmap claimed. 
Currently, we just assume that it's because the tuple was updated 
concurrently, but while hacking, I frequently had a broken index where 
the revmap pointed to bogus tuples or the tuples had a missing/wrong 
block number on them, and ran into infinite loop here. It's clearly a 
case of a corrupt index and shouldn't happen, but I would imagine that 
it's a fairly typical way this would fail in production too because of 
hardware issues or bugs. So I think we need to work a bit harder to stop 
the looping and throw an error instead.

Perhaps something as simple as keeping a loop counter and giving up 
after 1000 attempts would be good enough. The window between releasing 
the lock on the revmap, and acquiring the lock on the page containing 
the MMTuple is very narrow, so the chances of losing that race to a 
concurrent update more than 1-2 times in a row is vanishingly small.
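
The loop-counter idea might look roughly like the following. It is a sketch
only: the revmap is modelled as a dict from heap block number to TID, the
index tuples as a dict from TID to a record carrying its block number, and
IndexCorruption is a hypothetical error type:

```python
MAX_ATTEMPTS = 1000  # the limit floated above


class IndexCorruption(Exception):
    """Raised when the revmap keeps pointing at the wrong tuple."""


def lookup(revmap, tuples, heap_block, max_attempts=MAX_ATTEMPTS):
    for _ in range(max_attempts):
        tid = revmap[heap_block]      # read the revmap, then release it
        candidate = tuples.get(tid)   # then visit the tuple's page
        if candidate is not None and candidate["block"] == heap_block:
            return candidate
        # Mismatch: assume a concurrent update moved the tuple and retry.
    raise IndexCorruption(
        f"revmap entry for block {heap_block} never matched a tuple "
        f"after {max_attempts} attempts"
    )


revmap = {0: (2, 1)}
good_tuples = {(2, 1): {"block": 0, "min": 1, "max": 9}}
bad_tuples = {(2, 1): {"block": 7, "min": 1, "max": 9}}

print(lookup(revmap, good_tuples, 0)["max"])  # prints 9
try:
    lookup(revmap, bad_tuples, 0)
except IndexCorruption as err:
    print("error:", err)
```

In this single-threaded simulation a mismatch is always permanent, so the
corrupt case exhausts the counter and raises instead of spinning forever.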

- Heikki

Re: Minmax indexes

From

Heikki Linnakangas

Date:

15 August 2014, 07:57:13

On 08/15/2014 10:26 AM, Heikki Linnakangas wrote:
> On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
>> Alvaro Herrera wrote:
>>> Heikki Linnakangas wrote:
>>
>>>> I'm sure this still needs some cleanup, but here's the patch, based
>>>> on your v14. Now that I know what this approach looks like, I still
>>>> like it much better. The insert and update code is somewhat more
>>>> complicated, because you have to be careful to lock the old page,
>>>> new page, and revmap page in the right order. But it's not too bad,
>>>> and it gets rid of all the complexity in vacuum.
>>>
>>> It seems there is some issue here, because pageinspect tells me the
>>> index is not growing properly for some reason.  minmax_revmap_data gives
>>> me this array of TIDs after a bunch of insert/vacuum/delete/ etc:
>>
>> I fixed this issue, and did a lot more rework and bugfixing.  Here's
>> v15, based on v14-heikki2.
>
> Thanks!
>
>> I think remaining issues are mostly minimal (pageinspect should output
>> block number alongside each tuple, now that we have it, for example.)
>
> There's this one issue I left in my patch version that I think we should
> do something about:
>
>> +         /*
>> +          * No luck. Assume that the revmap was updated concurrently.
>> +          *
>> +          * XXX: it would be nice to add some kind of a sanity check here to
>> +          * avoid looping infinitely, if the revmap points to wrong tuple for
>> +          * some reason.
>> +          */
>
> This happens when we follow the revmap to a tuple, but find that the
> tuple points to a different block than what the revmap claimed.
> Currently, we just assume that it's because the tuple was updated
> concurrently, but while hacking, I frequently had a broken index where
> the revmap pointed to bogus tuples or the tuples had a missing/wrong
> block number on them, and ran into infinite loop here. It's clearly a
> case of a corrupt index and shouldn't happen, but I would imagine that
> it's a fairly typical way this would fail in production too because of
> hardware issues or bugs. So I think we need to work a bit harder to stop
> the looping and throw an error instead.
>
> Perhaps something as simple as keeping a loop counter and giving up
> after 1000 attempts would be good enough. The window between releasing
> the lock on the revmap, and acquiring the lock on the page containing
> the MMTuple is very narrow, so the chances of losing that race to a
> concurrent update more than 1-2 times in a row is vanishingly small.

Reading the patch more closely, I see that you added a check that when 
we loop, we throw an error if the new item pointer in the revmap is the 
same as before. In theory, it's possible that two concurrent updates 
happen: one that moves the tuple we're looking for elsewhere, and 
another that moves it back again. The probability of that is also 
vanishingly small, so maybe that's OK. Or we could check the LSN; if the 
revmap has been updated, its LSN must've changed.
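
To make the difference between the two checks concrete, here is a small standalone sketch (the RevmapRead struct and function names are hypothetical; it only models the decision, not the locking):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What we remember from one read of a revmap entry (hypothetical layout). */
typedef struct
{
    uint64_t    page_lsn;       /* LSN of the revmap page when it was read */
    uint32_t    tid_block;      /* item pointer stored in the revmap: block */
    uint16_t    tid_offset;     /* ... and offset within that block */
} RevmapRead;

static RevmapRead
make_read(uint64_t lsn, uint32_t blk, uint16_t off)
{
    RevmapRead  r;

    r.page_lsn = lsn;
    r.tid_block = blk;
    r.tid_offset = off;
    return r;
}

/* Pointer check: an unchanged item pointer on retry is treated as corruption. */
static bool
corrupt_by_pointer_check(RevmapRead before, RevmapRead after)
{
    return before.tid_block == after.tid_block &&
        before.tid_offset == after.tid_offset;
}

/* LSN check: only an unchanged page LSN is treated as corruption. */
static bool
corrupt_by_lsn_check(RevmapRead before, RevmapRead after)
{
    /* LSNs only move forward. */
    assert(after.page_lsn >= before.page_lsn);
    return before.page_lsn == after.page_lsn;
}
```

In the move-away-and-back case the item pointer is the same but the page LSN has advanced, so only the pointer check raises a false alarm.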

- Heikki

Re: Minmax indexes

From

Fujii Masao

Date:

15 August 2014, 12:57:20

On Fri, Aug 15, 2014 at 8:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>
>> > I'm sure this still needs some cleanup, but here's the patch, based
>> > on your v14. Now that I know what this approach looks like, I still
>> > like it much better. The insert and update code is somewhat more
>> > complicated, because you have to be careful to lock the old page,
>> > new page, and revmap page in the right order. But it's not too bad,
>> > and it gets rid of all the complexity in vacuum.
>>
>> It seems there is some issue here, because pageinspect tells me the
>> index is not growing properly for some reason.  minmax_revmap_data gives
>> me this array of TIDs after a bunch of insert/vacuum/delete/ etc:
>
> I fixed this issue, and did a lot more rework and bugfixing.  Here's
> v15, based on v14-heikki2.

I've not read the patch yet. But while testing the feature, I found that

* Brin index cannot be created on CHAR(n) column.  Maybe other data types have the same problem.

* FILLFACTOR cannot be set in brin index.

Are these intentional?

Regards,

-- 
Fujii Masao

Re: Minmax indexes

From

Heikki Linnakangas

Date:

15 August 2014, 17:16:51

On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>
>>> I'm sure this still needs some cleanup, but here's the patch, based
>>> on your v14. Now that I know what this approach looks like, I still
>>> like it much better. The insert and update code is somewhat more
>>> complicated, because you have to be careful to lock the old page,
>>> new page, and revmap page in the right order. But it's not too bad,
>>> and it gets rid of all the complexity in vacuum.
>>
>> It seems there is some issue here, because pageinspect tells me the
>> index is not growing properly for some reason.  minmax_revmap_data gives
>> me this array of TIDs after a bunch of insert/vacuum/delete/ etc:
>
> I fixed this issue, and did a lot more rework and bugfixing.  Here's
> v15, based on v14-heikki2.

So, the other design change I've been advocating is to store the revmap
in the first N blocks, instead of having the two-level structure with
array pages and revmap pages.

Attached is a patch for that, to be applied after v15. When the revmap
needs to be expanded, all the tuples on it are moved elsewhere
one-by-one. That adds some latency to the unfortunate guy who needs to
do that, but as the patch stands, the revmap is only ever extended by
VACUUM or CREATE INDEX, so I think that's fine. Like with my previous
patch, the point is to demonstrate how much simpler the code becomes
this way; I'm sure there are bugs and cleanup still necessary.
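
For illustration, with the revmap in the first N blocks the lookup reduces to plain arithmetic. The sketch below is standalone and makes assumptions that are not taken from the patch: a metapage at block 0, revmap pages starting at block 1, and a made-up number of entries per revmap page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number of item pointers that fit on one revmap page. */
#define SKETCH_ITEMS_PER_REVMAP_PAGE 1000u

/* Where the revmap entry for a heap block lives (assumed layout). */
typedef struct
{
    uint32_t    index_blk;      /* block of the index holding the entry */
    uint32_t    slot;           /* position of the entry on that block */
} RevmapSlot;

/*
 * Map a heap block to its revmap entry, assuming block 0 of the index is
 * the metapage and the revmap occupies the blocks right after it.
 */
static RevmapSlot
revmap_locate(uint32_t heap_blk, uint32_t pages_per_range)
{
    RevmapSlot  s;
    uint32_t    range;

    assert(pages_per_range > 0);

    range = heap_blk / pages_per_range;
    s.index_blk = 1 + range / SKETCH_ITEMS_PER_REVMAP_PAGE;
    s.slot = range % SKETCH_ITEMS_PER_REVMAP_PAGE;
    return s;
}

/*
 * True if the entry falls beyond the current last revmap block, i.e. the
 * next block would first have to be evacuated of regular tuples.
 */
static bool
revmap_must_grow(RevmapSlot s, uint32_t last_revmap_blk)
{
    return s.index_blk > last_revmap_blk;
}
```

With 128 heap pages per range, heap block 127999 still maps to the first revmap block, while block 128000 needs a second one.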

PS. Spotted one oversight in patch v15: callers of mm_doupdate must
check the return value, and retry the operation if it returns false.

- Heikki

Attachment

minmax-revmap-redesign-over-v15-1.patch

Re: Minmax indexes

From

Alvaro Herrera

Date:

15 August 2014, 18:16:24

Fujii Masao wrote:

> I've not read the patch yet. But while testing the feature, I found that
> 
> * Brin index cannot be created on CHAR(n) column.
>    Maybe other data types have the same problem.

Yeah, it's just a matter of adding an opclass for it -- pretty simple
stuff really, because you don't need to write any code, just add a bunch
of catalog entries and an OPCINFO line in mmsortable.c.

Right now there are opclasses for the following types:

int4
numeric
text
date
timestamp with time zone
timestamp
time with time zone
time
"char"

We can eventually extend to cover all types that have btree opclasses,
but we can do that in a separate commit.  I'm also considering removing
the opclass for time with time zone, as it's a pretty useless type.  I
mostly added the ones that are there as a way to test that it behaved
reasonably in the various cases (pass by val vs. not, variable width vs.
fixed, different alignment requirements)

Of course, the real interesting part is adding a completely different
opclass, such as one that stores bounding boxes.

> * FILLFACTOR cannot be set in brin index.

I hadn't added this one because I didn't think there was much point
previously, but I think it might now be useful to allow same-page
updates.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Minmax indexes

From

Alvaro Herrera

Date:

20 August 2014, 22:51:57

Heikki Linnakangas wrote:

> So, the other design change I've been advocating is to store the
> revmap in the first N blocks, instead of having the two-level
> structure with array pages and revmap pages.
>
> Attached is a patch for that, to be applied after v15. When the
> revmap needs to be expanded, all the tuples on it are moved
> elsewhere one-by-one. That adds some latency to the unfortunate guy
> who needs to do that, but as the patch stands, the revmap is only
> ever extended by VACUUM or CREATE INDEX, so I think that's fine.
> Like with my previous patch, the point is to demonstrate how much
> simpler the code becomes this way; I'm sure there are bugs and
> cleanup still necessary.

Thanks for the prodding.  I didn't like this too much initially, but
after going over it a few times I agree that having less code and a less
complex physical representation is better.  Your proposed approach is to
just call the update routine on every tuple in the page we're
evacuating.  There are optimizations possible (such as doing bulk
updates; and instead of updating the revmap, keep a redirection pointer
in the page we just evacuated, so that the revmap can be updated lazily
later), but I have already spent so long on this that I am fine
with keeping what we have here.  If somebody later wants to contribute
improvements to this, it'd be welcome.  But on the other hand the
operation is not that frequent and as you say it's not executed by
user-facing queries, so perhaps it's okay.

I cleaned it up some: mainly I created a separate file (mmpageops.c)
that now hosts the routines related to page operations: mm_doinsert,
mm_doupdate, mm_start_evacuating_page, mm_evacuate_page.  There are
other rather minor changes here and there; I also added
CHECK_FOR_INTERRUPTS in all relevant loops.

This bit in mm_doupdate I just couldn't understand:

   /* If both tuples are in fact equal, there is nothing to do */
   if (!minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
   {
       LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
       return false;
   }

Isn't the test exactly reversed?  I don't see how this would work.
I updated it to

   /*
    * If both tuples are identical, there is nothing to do; except that if we
    * were requested to move the tuple across pages, we do it even if they are
    * equal.
    */
   if (samepage && minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
   {
       LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
       return false;
   }
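
To spell out the four cases of the updated condition, here is a standalone sketch. It only restates the test quoted above and is not necessarily the logic that ends up committed; the byte-wise comparison is an assumption about what minmax_tuples_equal does, and the function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed semantics: two tuples are equal if size and bytes match. */
static bool
sketch_tuples_equal(const void *a, size_t alen, const void *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    return alen == blen && memcmp(a, b, alen) == 0;
}

/*
 * Mirrors the updated condition: the early "return false" path is taken
 * only when a same-page update was requested and the tuples are identical.
 * A cross-page move (samepage == false) always proceeds.
 */
static bool
update_is_skipped(bool samepage,
                  const void *oldtup, size_t oldsz,
                  const void *origtup, size_t origsz)
{
    return samepage && sketch_tuples_equal(oldtup, oldsz, origtup, origsz);
}
```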


> PS. Spotted one oversight in patch v15: callers of mm_doupdate must
> check the return value, and retry the operation if it returns false.

Right, thanks.  Fixed.

So here's v16, rebased on top of 9bac66020.  As far as I am concerned,
this is the last version before I start renaming everything to BRIN and
then commit.


 contrib/pageinspect/Makefile             |   2 +-
 contrib/pageinspect/mmfuncs.c            | 407 +++++++++++++
 contrib/pageinspect/pageinspect--1.2.sql |  36 ++
 contrib/pg_xlogdump/rmgrdesc.c           |   1 +
 doc/src/sgml/brin.sgml                   | 248 ++++++++
 doc/src/sgml/filelist.sgml               |   1 +
 doc/src/sgml/indices.sgml                |  36 +-
 doc/src/sgml/postgres.sgml               |   1 +
 minmax-proposal                          | 306 ++++++++++
 src/backend/access/Makefile              |   2 +-
 src/backend/access/common/reloptions.c   |   7 +
 src/backend/access/heap/heapam.c         |  22 +-
 src/backend/access/minmax/Makefile       |  17 +
 src/backend/access/minmax/minmax.c       | 942 +++++++++++++++++++++++++++++++
 src/backend/access/minmax/mmpageops.c    | 638 +++++++++++++++++++++
 src/backend/access/minmax/mmrevmap.c     | 451 +++++++++++++++
 src/backend/access/minmax/mmsortable.c   | 287 ++++++++++
 src/backend/access/minmax/mmtuple.c      | 478 ++++++++++++++++
 src/backend/access/minmax/mmxlog.c       | 323 +++++++++++
 src/backend/access/rmgrdesc/Makefile     |   3 +-
 src/backend/access/rmgrdesc/minmaxdesc.c |  89 +++
 src/backend/access/transam/rmgr.c        |   1 +
 src/backend/catalog/index.c              |  24 +
 src/backend/replication/logical/decode.c |   1 +
 src/backend/storage/page/bufpage.c       | 179 +++++-
 src/backend/utils/adt/selfuncs.c         |  24 +
 src/include/access/heapam.h              |   2 +
 src/include/access/minmax.h              |  52 ++
 src/include/access/minmax_internal.h     |  86 +++
 src/include/access/minmax_page.h         |  70 +++
 src/include/access/minmax_pageops.h      |  29 +
 src/include/access/minmax_revmap.h       |  36 ++
 src/include/access/minmax_tuple.h        |  90 +++
 src/include/access/minmax_xlog.h         | 106 ++++
 src/include/access/reloptions.h          |   3 +-
 src/include/access/relscan.h             |   4 +-
 src/include/access/rmgrlist.h            |   1 +
 src/include/catalog/index.h              |   8 +
 src/include/catalog/pg_am.h              |   2 +
 src/include/catalog/pg_amop.h            |  81 +++
 src/include/catalog/pg_amproc.h          |  73 +++
 src/include/catalog/pg_opclass.h         |   9 +
 src/include/catalog/pg_opfamily.h        |  10 +
 src/include/catalog/pg_proc.h            |  52 ++
 src/include/storage/bufpage.h            |   2 +
 src/include/utils/selfuncs.h             |   1 +
 src/test/regress/expected/opr_sanity.out |  14 +-
 src/test/regress/sql/opr_sanity.sql      |   7 +-
 48 files changed, 5248 insertions(+), 16 deletions(-)


--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-16.patch

Re: Minmax indexes

From

Alvaro Herrera

Date:

20 August 2014, 23:11:08

Alvaro Herrera wrote:

> So here's v16, rebased on top of 9bac66020.  As far as I am concerned,
> this is the last version before I start renaming everything to BRIN and
> then commit.

FWIW in case you or others have interest, here's the diff between your
patch and v16.  Also, for illustrative purposes, here's the diff between
your version and mine of the code that got moved to mmpageops.c,
because it's difficult to see it from the partial patch.  (There's
nothing to do with that partial diff other than read it directly.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

BRIN indexes (was Re: Minmax indexes)

From

Alvaro Herrera

Date:

08 September 2014, 16:02:46

Here's version 18.  I have renamed it: These are now BRIN indexes.

I have fixed numerous race conditions and deadlocks.  In particular I
fixed this problem you noted:

Heikki Linnakangas wrote:
> Another race condition:
>
> If a new tuple is inserted to the range while summarization runs,
> it's possible that the new tuple isn't included in the tuple that
> the summarization calculated, nor does the insertion itself update
> it.

I did it mostly in the way you outlined, i.e. by way of a placeholder
tuple that gets updated by concurrent inserters and then the tuple
resulting from the scan is unioned with the values in the updated
placeholder tuple.  This required the introduction of one extra support
proc for opclasses (pretty simple stuff anyhow).

There should be only minor items left now, such as silencing the

WARNING:  concurrent insert in progress within table "sales"

which is emitted by IndexBuildHeapScan (possibly thousands of times)
when doing a summarization of a range being inserted into or otherwise
modified.  Basically the issue here is that IBHS assumes it's being run
with ShareLock in the heap (which blocks inserts), but here we're using
it with ShareUpdateExclusive only, which lets inserts in.  There is no
harm AFAICS because of the placeholder tuple stuff I describe above.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-18.patch

Re: BRIN indexes - TRAP: BadArgument

From

"Erik Rijkers"

Date:

08 September 2014, 18:04:08

On Mon, September 8, 2014 18:02, Alvaro Herrera wrote:
> Here's version 18.  I have renamed it: These are now BRIN indexes.
>

I get into a BadArgument after:


$ cat crash.sql

-- drop table if exists t_100_000_000 cascade;
create table t_100_000_000 as select cast(i as integer) from generate_series(1,100000000) as f(i);

-- drop index if exists t_100_000_000_i_brin_idx;
create index t_100_000_000_i_brin_idx on t_100_000_000 using brin(i);

select pg_size_pretty(pg_relation_size('t_100_000_000_i_brin_idx'));

select i from t_100_000_000 where i between 10000 and 1009999; -- ( + 999999 )


Log file says:

TRAP: BadArgument("!(((context) != ((void *)0) && (((((const Node*)((context)))->type) == T_AllocSetContext))))", File: "mcxt.c", Line: 752)
2014-09-08 19:54:46.071 CEST 30151 LOG:  server process (PID 30336) was terminated by signal 6: Aborted
2014-09-08 19:54:46.071 CEST 30151 DETAIL:  Failed process was running: select i from t_100_000_000 where i between 10000 and 1009999;



The crash is caused by the last select statement; the table and index creation are OK.

It only happens with a largish table; small tables are OK.



Linux / Centos / 32 GB.
PostgreSQL 9.5devel_minmax_20140908_1809_0640c1bfc091 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.9.1, 64-bit
         setting          |              current_setting
--------------------------+--------------------------------------------
 autovacuum               | off
 port                     | 6444
 shared_buffers           | 100MB
 effective_cache_size     | 4GB
 work_mem                 | 10MB
 maintenance_work_mem     | 1GB
 checkpoint_segments      | 20
 data_checksums           | on
 server_version           | 9.5devel_minmax_20140908_1809_0640c1bfc091
 pg_postmaster_start_time | 2014-09-08 19:53 (uptime: 0d 0h 6m 54s)


'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444'
'--bindir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/bin'
'--libdir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/lib' '--enable-depend' '--enable-cassert' '--enable-debug'
'--with-perl' '--with-openssl' '--with-libxml' '--with-extra-version=_minmax_20140908_1809_0640c1bfc091'


pgpatches/0095/minmax/20140908/minmax-18.patch


thanks,


Erik Rijkers

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

08 September 2014, 18:39:15

Erik Rijkers wrote:

> Log file says:
>
> TRAP: BadArgument("!(((context) != ((void *)0) && (((((const Node*)((context)))->type) == T_AllocSetContext))))", File: "mcxt.c", Line: 752)
> 2014-09-08 19:54:46.071 CEST 30151 LOG:  server process (PID 30336) was terminated by signal 6: Aborted
> 2014-09-08 19:54:46.071 CEST 30151 DETAIL:  Failed process was running: select i from t_100_000_000 where i between 10000 and 1009999;

A double-free mistake -- here's a patch.  Thanks.


--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-18a.patch

Re: BRIN indexes (was Re: Minmax indexes)

From

Heikki Linnakangas

Date:

09 September 2014, 11:49:32

On 09/08/2014 07:02 PM, Alvaro Herrera wrote:
> Here's version 18.  I have renamed it: These are now BRIN indexes.
>
> I have fixed numerous race conditions and deadlocks.  In particular I
> fixed this problem you noted:
>
> Heikki Linnakangas wrote:
>> Another race condition:
>>
>> If a new tuple is inserted to the range while summarization runs,
>> it's possible that the new tuple isn't included in the tuple that
>> the summarization calculated, nor does the insertion itself update
>> it.
>
> I did it mostly in the way you outlined, i.e. by way of a placeholder
> tuple that gets updated by concurrent inserters and then the tuple
> resulting from the scan is unioned with the values in the updated
> placeholder tuple.  This required the introduction of one extra support
> proc for opclasses (pretty simple stuff anyhow).

Hmm. So the union support proc is only called if there is a race 
condition? That makes it very difficult to test, I'm afraid.

It would make more sense to pass BrinValues to the support functions, 
rather than DeformedBrTuple. An opclass's support function should never
need to access the values for other columns.

Does minmaxUnion handle NULLs correctly?

minmaxUnion pfrees the old values. Is that necessary? What memory 
context does the function run in? If the code runs in a short-lived 
memory context, you might as well just let them leak. If it runs in a 
long-lived context, well, perhaps it shouldn't. It's nicer to write 
functions that can leak freely. IIRC, GiST and GIN run the support
functions in a temporary context. In any case, it might be worth noting 
explicitly in the docs which functions may leak and which may not.
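
To illustrate what "leak freely" buys, here is a standalone sketch of a short-lived scratch area that is reset after each call. It is a toy bump allocator, not PostgreSQL's memory context API, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy scratch area standing in for a temporary memory context. */
typedef struct
{
    unsigned char *base;
    size_t      size;
    size_t      used;
    size_t      peak;           /* high-water mark, for the demo only */
} Arena;

static bool
arena_init(Arena *a, size_t size)
{
    assert(a != NULL);

    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    a->peak = 0;
    return a->base != NULL;
}

/* Bump allocation, rounded up to 16 bytes; returns NULL when full. */
static void *
arena_alloc(Arena *a, size_t n)
{
    size_t      rounded = (n + 15) & ~(size_t) 15;
    void       *p;

    if (rounded < n || rounded > a->size - a->used)
        return NULL;
    p = a->base + a->used;
    a->used += rounded;
    if (a->used > a->peak)
        a->peak = a->used;
    return p;
}

/* Throw away everything allocated since the last reset. */
static void
arena_reset(Arena *a)
{
    a->used = 0;
}

static void
arena_destroy(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}

/* A "support function" that allocates scratch space and never frees it. */
static bool
leaky_support_function(Arena *scratch)
{
    int        *tmp = arena_alloc(scratch, 8 * sizeof(int));

    if (tmp == NULL)
        return false;
    tmp[0] = 42;
    return tmp[0] == 42;
}

/* Call the leaky function ncalls times with a reset after each call. */
static size_t
demo_peak_usage(int ncalls)
{
    Arena       scratch;
    size_t      peak;
    int         i;

    if (!arena_init(&scratch, 256))
        return 0;
    for (i = 0; i < ncalls; i++)
    {
        if (!leaky_support_function(&scratch))
        {
            arena_destroy(&scratch);
            return 0;
        }
        arena_reset(&scratch);
    }
    peak = scratch.peak;
    arena_destroy(&scratch);
    return peak;
}
```

Peak usage stays at one call's worth no matter how many calls are made, which is why the callee does not need its own pfree bookkeeping; the cost is the per-call reset, which is the kind of overhead weighed against retail pfree in this thread.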

If you add a new datatype, and define b-tree operators for it, what is 
required to create a minmax opclass for it? Would it be possible to 
generalize the functions in brin_minmax.c so that they can be reused for 
any datatype (with b-tree operators) without writing any new C code? I 
think we're almost there; the only thing that differs between each data 
type is the opcinfo function. Let's pass the type OID as argument to the 
opcinfo function. You could then have just a single minmax_opcinfo 
function, instead of the macro to generate a separate function for each 
built-in datatype.
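
A standalone sketch of the "single opcinfo function keyed by type" idea. The type ids, the lookup table and all names are hypothetical; real code would look up the b-tree operators for the type OID in the catalogs rather than in a static table, and text comparison would be collation-aware rather than strcmp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for type OIDs. */
typedef enum
{
    SKETCH_INT4 = 1,
    SKETCH_TEXT = 2
} SketchTypeId;

typedef int (*SketchCompare) (const void *a, const void *b);

typedef struct
{
    SketchTypeId type;
    SketchCompare cmp;          /* ordering used for min and max */
} SketchOpcInfo;

static int
cmp_int4(const void *a, const void *b)
{
    int32_t     x = *(const int32_t *) a;
    int32_t     y = *(const int32_t *) b;

    return (x > y) - (x < y);
}

static int
cmp_text(const void *a, const void *b)
{
    return strcmp((const char *) a, (const char *) b);
}

/* One generic function for every type, instead of one function per type. */
static const SketchOpcInfo *
sketch_minmax_opcinfo(SketchTypeId type)
{
    static const SketchOpcInfo table[] = {
        {SKETCH_INT4, cmp_int4},
        {SKETCH_TEXT, cmp_text}
    };
    size_t      i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (table[i].type == type)
            return &table[i];
    return NULL;
}

/* Type-agnostic: widen [*min, *max] so that it covers v. */
static void
minmax_widen(const SketchOpcInfo *info, const void **min, const void **max,
             const void *v)
{
    assert(info != NULL);

    if (*min == NULL || info->cmp(v, *min) < 0)
        *min = v;
    if (*max == NULL || info->cmp(v, *max) > 0)
        *max = v;
}

/* Demo: max - min over a few integers. */
static int32_t
demo_int4_span(void)
{
    static const int32_t vals[] = {30, 10, 20};
    const void *min = NULL;
    const void *max = NULL;
    size_t      i;

    for (i = 0; i < sizeof(vals) / sizeof(vals[0]); i++)
        minmax_widen(sketch_minmax_opcinfo(SKETCH_INT4), &min, &max, &vals[i]);
    return *(const int32_t *) max - *(const int32_t *) min;
}

/* Demo: smallest of a few strings, using the same widening code. */
static const char *
demo_text_min(void)
{
    static const char *const vals[] = {"pear", "apple", "fig"};
    const void *min = NULL;
    const void *max = NULL;
    size_t      i;

    for (i = 0; i < sizeof(vals) / sizeof(vals[0]); i++)
        minmax_widen(sketch_minmax_opcinfo(SKETCH_TEXT), &min, &max, vals[i]);
    return (const char *) min;
}
```

The widening code is written once; only the table entry differs per type.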

In general, this patch is in pretty good shape now, thanks!

- Heikki

Re: BRIN indexes (was Re: Minmax indexes)

From

Emanuel Calvo

Date:

15 September 2014, 01:08:41

On 08/09/14 13:02, Alvaro Herrera wrote:
> Here's version 18.  I have renamed it: These are now BRIN indexes.
>
> I have fixed numerous race conditions and deadlocks.  In particular I
> fixed this problem you noted:
>
> Heikki Linnakangas wrote:
>> Another race condition:
>>
>> If a new tuple is inserted to the range while summarization runs,
>> it's possible that the new tuple isn't included in the tuple that
>> the summarization calculated, nor does the insertion itself update
>> it.
>
> I did it mostly in the way you outlined, i.e. by way of a placeholder
> tuple that gets updated by concurrent inserters and then the tuple
> resulting from the scan is unioned with the values in the updated
> placeholder tuple.  This required the introduction of one extra support
> proc for opclasses (pretty simple stuff anyhow).
>
> There should be only minor items left now, such as silencing the
>
> WARNING:  concurrent insert in progress within table "sales"
>
> which is emitted by IndexBuildHeapScan (possibly thousands of times)
> when doing a summarization of a range being inserted into or otherwise
> modified.  Basically the issue here is that IBHS assumes it's being run
> with ShareLock in the heap (which blocks inserts), but here we're using
> it with ShareUpdateExclusive only, which lets inserts in.  There is no
> harm AFAICS because of the placeholder tuple stuff I describe above.

Debugging VACUUM VERBOSE ANALYZE over a table being concurrently
updated/inserted into.

(gdb)
Breakpoint 1, errfinish (dummy=0) at elog.c:411
411        ErrorData  *edata = &errordata[errordata_stack_depth];

The complete backtrace is at http://pastebin.com/gkigSNm7

Also, I found pages with an unknown type (using default parameters for
the index creation):

 brin_page_type | array_agg
----------------+-----------
 unknown (00)   | {3,4}
 revmap         | {1}
 regular        | {2}
 meta           | {0}
(4 rows)

-- 
--
Emanuel Calvo
@3manuek

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

23 September 2014, 18:04:34

Here's an updated version, rebased to current master.

Erik Rijkers wrote:

> I get into a BadArgument after:

Fixed in the attached, thanks.

Emanuel Calvo wrote:

> Debugging VACUUM VERBOSE ANALYZE over a table being concurrently
> updated/inserted into.
>
> (gdb)
> Breakpoint 1, errfinish (dummy=0) at elog.c:411
> 411        ErrorData  *edata = &errordata[errordata_stack_depth];
>
> The complete backtrace is at http://pastebin.com/gkigSNm7

The file/line info in the backtrace says that this is reporting this
message:

    ereport(elevel,
            (errmsg("scanned index \"%s\" to remove %d row versions",
                    RelationGetRelationName(indrel),
                    vacrelstats->num_dead_tuples),
             errdetail("%s.", pg_rusage_show(&ru0))));

Not sure why you're reporting it, since this is expected.

There were thousands of WARNINGs being emitted by IndexBuildHeapScan
when concurrent insertions occurred; I fixed that by setting the
ii_Concurrent flag, which makes that function obtain a snapshot to use
for the scan.  This is okay because concurrent insertions will be
detected via the placeholder tuple mechanism as previously described.
(There is no danger of serializable transactions etc, because this only
runs in vacuum.  I added an Assert() nevertheless.)

> Also, I found pages with an unknown type (using default parameters for
> the index creation):
>
>  brin_page_type | array_agg
> ----------------+-----------
>  unknown (00)   | {3,4}
>  revmap         | {1}
>  regular        | {2}
>  meta           | {0}
> (4 rows)

Ah, we had an issue with the vacuuming of the FSM.  I had to make that
more aggressive; I was able to reproduce the problem and it is fixed
now.

Heikki Linnakangas wrote:

> Hmm. So the union support proc is only called if there is a race
> condition? That makes it very difficult to test, I'm afraid.

Yes.  I guess we can fix that by having an assert-only block that uses
the union support proc to verify consistency of generated tuples.  This
might be difficult for types involving floating point arithmetic.
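
One possible shape for such a check, as a standalone sketch: summarize a set of values in one pass, summarize it again in two halves joined by the union proc, and require both results to match. All names are hypothetical, and the exact-equality comparison is only adequate for types like integers, which is where the floating point caveat comes in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    bool        empty;          /* no value recorded yet */
    int32_t     min;
    int32_t     max;
} MinMaxSummary;

typedef MinMaxSummary (*UnionProc) (MinMaxSummary a, MinMaxSummary b);

/* Build a summary from n values in one pass. */
static MinMaxSummary
summarize(const int32_t *vals, size_t n)
{
    MinMaxSummary s = {true, 0, 0};
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (s.empty)
        {
            s.empty = false;
            s.min = s.max = vals[i];
        }
        else
        {
            if (vals[i] < s.min)
                s.min = vals[i];
            if (vals[i] > s.max)
                s.max = vals[i];
        }
    }
    return s;
}

static bool
summary_equal(MinMaxSummary a, MinMaxSummary b)
{
    if (a.empty || b.empty)
        return a.empty == b.empty;
    return a.min == b.min && a.max == b.max;
}

/* A union proc that handles empty operands. */
static MinMaxSummary
good_union(MinMaxSummary a, MinMaxSummary b)
{
    if (a.empty)
        return b;
    if (b.empty)
        return a;
    if (b.min < a.min)
        a.min = b.min;
    if (b.max > a.max)
        a.max = b.max;
    return a;
}

/* A union proc that forgets an operand may be empty. */
static MinMaxSummary
broken_union(MinMaxSummary a, MinMaxSummary b)
{
    if (b.min < a.min)
        a.min = b.min;
    if (b.max > a.max)
        a.max = b.max;
    return a;
}

/*
 * For every split point, union(summarize(left), summarize(right)) must
 * equal summarize(all).  In a real build this would live in an
 * assert-only block.
 */
static bool
union_is_consistent(UnionProc proc, const int32_t *vals, size_t n)
{
    MinMaxSummary whole = summarize(vals, n);
    size_t      split;

    assert(proc != NULL);

    for (split = 0; split <= n; split++)
    {
        MinMaxSummary merged = proc(summarize(vals, split),
                                    summarize(vals + split, n - split));

        if (!summary_equal(whole, merged))
            return false;
    }
    return true;
}

static bool
demo_check(UnionProc proc)
{
    static const int32_t vals[] = {7, -3, 42, 0, 15};

    return union_is_consistent(proc, vals, sizeof(vals) / sizeof(vals[0]));
}
```

This exercises the union proc on every summarization rather than only when the race actually happens, and the split with an empty left half catches the union that mishandles empty input.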

> It would make more sense to pass BrinValues to the support
> functions, rather than DeformedBrTuple. An opclass's support
> function should never need to access the values for other columns.

Agreed -- fixed.  I added attno to BrinValues, which makes this easier.

> Does minmaxUnion handle NULLs correctly?

Nope, fixed.

> minmaxUnion pfrees the old values. Is that necessary? What memory
> context does the function run in? If the code runs in a short-lived
> memory context, you might as well just let them leak. If it runs in
> a long-lived context, well, perhaps it shouldn't. It's nicer to
> write functions that can leak freely. IIRC, GiST and GIN run the
> support functions in a temporary context. In any case, it might be
> worth noting explicitly in the docs which functions may leak and
> which may not.

Yeah, I had tried playing with contexts in general previously but it
turned out that there was too much bureaucratic overhead (quite visible
in profiles), so I ripped it out and did careful retail pfree instead
(it's not *that* difficult).  Maybe I went overboard with it, and
with more careful planning we can do better; I don't think this is
critical ATM -- we can certainly stand later cleanup in this area.

> If you add a new datatype, and define b-tree operators for it, what
> is required to create a minmax opclass for it? Would it be possible
> to generalize the functions in brin_minmax.c so that they can be
> reused for any datatype (with b-tree operators) without writing any
> new C code? I think we're almost there; the only thing that differs
> between each data type is the opcinfo function. Let's pass the type
> OID as argument to the opcinfo function. You could then have just a
> single minmax_opcinfo function, instead of the macro to generate a
> separate function for each built-in datatype.

Yeah, that's how I had that initially.  I changed it to what it's now as
part of a plan to enable building cross-type opclasses, so you could
have "WHERE int8col=42" without requiring a cast of the constant to type
int8.  This might have been a thinko, because AFAICS it's possible to
build them with a constant opcinfo as well (I changed several other
things to support this, as described in a previous email.)  I will look
into this later.

Thanks for the review!

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-19.patch

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

23 September 2014, 19:04:59

Alvaro Herrera wrote:

> Heikki Linnakangas wrote:

> > If you add a new datatype, and define b-tree operators for it, what
> > is required to create a minmax opclass for it? Would it be possible
> > to generalize the functions in brin_minmax.c so that they can be
> > reused for any datatype (with b-tree operators) without writing any
> > new C code? I think we're almost there; the only thing that differs
> > between each data type is the opcinfo function. Let's pass the type
> > OID as argument to the opcinfo function. You could then have just a
> > single minmax_opcinfo function, instead of the macro to generate a
> > separate function for each built-in datatype.
>
> Yeah, that's how I had that initially.  I changed it to what it's now as
> part of a plan to enable building cross-type opclasses, so you could
> have "WHERE int8col=42" without requiring a cast of the constant to type
> int8.  This might have been a thinko, because AFAICS it's possible to
> build them with a constant opcinfo as well (I changed several other
> things to support this, as described in a previous email.)  I will look
> into this later.

I found out that we don't really throw errors in such cases anymore; we
insert casts instead.  Maybe there's a performance argument that it
might be better to use existing cross-type operators than casting, but
justifying this work just turned a lot harder.  Here's a patch that
reverts opcinfo into a generic function that receives the type OID.

I will look into adding some testing mechanism for the union support
proc; with that I will just consider the patch ready for commit and will
push.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-19a.patch

Re: BRIN indexes - TRAP: BadArgument

From

Robert Haas

Date:

23 September 2014, 23:23:18

On Tue, Sep 23, 2014 at 3:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>> > If you add a new datatype, and define b-tree operators for it, what
>> > is required to create a minmax opclass for it? Would it be possible
>> > to generalize the functions in brin_minmax.c so that they can be
>> > reused for any datatype (with b-tree operators) without writing any
>> > new C code? I think we're almost there; the only thing that differs
>> > between each data type is the opcinfo function. Let's pass the type
>> > OID as argument to the opcinfo function. You could then have just a
>> > single minmax_opcinfo function, instead of the macro to generate a
>> > separate function for each built-in datatype.
>>
>> Yeah, that's how I had that initially.  I changed it to what it's now as
>> part of a plan to enable building cross-type opclasses, so you could
>> have "WHERE int8col=42" without requiring a cast of the constant to type
>> int8.  This might have been a thinko, because AFAICS it's possible to
>> build them with a constant opcinfo as well (I changed several other
>> things to support this, as described in a previous email.)  I will look
>> into this later.
>
> I found out that we don't really throw errors in such cases anymore; we
> insert casts instead.  Maybe there's a performance argument that it
> might be better to use existing cross-type operators than casting, but
> justifying this work just turned a lot harder.  Here's a patch that
> reverts opcinfo into a generic function that receives the type OID.
>
> I will look into adding some testing mechanism for the union support
> proc; with that I will just consider the patch ready for commit and will
> push.

With all respect, I think this is a bad idea.  I know you've put a lot
of energy into this patch and I'm confident it's made a lot of
progress.  But as with Stephen's patch, the final form deserves a
thorough round of looking over by someone else before it goes in.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BRIN indexes - TRAP: BadArgument

From

Michael Paquier

Date:

23 September 2014, 23:36:05

On Wed, Sep 24, 2014 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 23, 2014 at 3:04 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Alvaro Herrera wrote:
>> I will look into adding some testing mechanism for the union support
>> proc; with that I will just consider the patch ready for commit and will
>> push.
>
> With all respect, I think this is a bad idea.  I know you've put a lot
> of energy into this patch and I'm confident it's made a lot of
> progress.  But as with Stephen's patch, the final form deserves a
> thorough round of looking over by someone else before it goes in.

Would this person be an extra committer or a simple reviewer? It
would give more insurance if such huge patches (couple of thousands of
lines) get an extra +1 from another committer, proving that the code
has been reviewed by people well-experienced with backend code. Now as
this would put more pressure in the hands of committers, an extra
external pair of eyes, be it non-committer but let's say a seasoned
reviewer would be fine IMO.
-- 
Michael

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

24 September 2014, 01:24:01

Robert Haas wrote:

> With all respect, I think this is a bad idea.  I know you've put a lot
> of energy into this patch and I'm confident it's made a lot of
> progress.  But as with Stephen's patch, the final form deserves a
> thorough round of looking over by someone else before it goes in.

As you can see in the thread, Heikki's put a lot of review effort into
it (including important code contributions); I don't feel I'm rushing it
at this point.  If you or somebody else want to give it a look, I have
no problem waiting a bit longer.  I don't want to delay indefinitely,
though, because I think it's better shipped early in the release cycle
than later, to allow for further refinements and easier testing by other
interested parties.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Robert Haas

Date:

24 September 2014, 01:43:49

On Tue, Sep 23, 2014 at 7:35 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Would this person be an extra committer or a simple reviewer? It
> would give more insurance if such huge patches (couple of thousands of
> lines) get an extra +1 from another committer, proving that the code
> has been reviewed by people well-experienced with backend code. Now as
> this would put more pressure in the hands of committers, an extra
> external pair of eyes, be it non-committer but let's say a seasoned
> reviewer would be fine IMO.

If you're volunteering, I certainly wouldn't say "no".  The more the
merrier.  Same with anyone else.  Since Heikki looked at it before, I
also think it would be appropriate to give him a bit of time to see if
he feels satisfied with it now - nobody on this project has more
experience with indexing than he does, but he may not have the time,
and even if he does, someone else might spot something he misses.

Alvaro's quite right to point out that there is no sense in waiting a
long time for a review that isn't coming.  That just backs everything
up against the end of the release cycle to no benefit.  But if there's
review available from experienced people within the community, taking
advantage of that now might find things that could be much harder to
fix later.  That's a win for everybody.  And it's not like we're
pressed up against the end of the cycle, nor is it as if this feature
has been through endless rounds of review already.  It's certainly had
some, and it's gotten better as a result.  But it's also changed a lot
in the process.

And much of the review to date has been high-level design review, like
"how should the opclasses look?" and "what should we call this thing
anyway?".  Going through it for logic errors, documentation
shortcomings, silly thinkos, etc. has not been done too much, I think,
and definitely not on the latest version.  So, some of that might not
be out of place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BRIN indexes - TRAP: BadArgument

From

Robert Haas

Date:

24 September 2014, 01:51:39

On Tue, Sep 23, 2014 at 9:23 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> With all respect, I think this is a bad idea.  I know you've put a lot
>> of energy into this patch and I'm confident it's made a lot of
>> progress.  But as with Stephen's patch, the final form deserves a
>> thorough round of looking over by someone else before it goes in.
>
> As you can see in the thread, Heikki's put a lot of review effort into
> it (including important code contributions); I don't feel I'm rushing it
> at this point.

Yeah, I was really glad Heikki looked at it.  That seemed good.

> If you or somebody else want to give it a look, I have
> no problem waiting a bit longer.  I don't want to delay indefinitely,
> though, because I think it's better shipped early in the release cycle
> than later, to allow for further refinements and easier testing by other
> interested parties.

I agree with that.  I'd like to look at it, and I will if I get time,
but as I said elsewhere, I also think it's appropriate to give a
little time around the final version of any big, complex patch just
because people may have thoughts, and they may not have time to
deliver those thoughts the minute the patch hits the list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

24 September 2014, 03:03:26

Robert Haas wrote:
> On Tue, Sep 23, 2014 at 9:23 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > If you or somebody else want to give it a look, I have
> > no problem waiting a bit longer.  I don't want to delay indefinitely,
> > though, because I think it's better shipped early in the release cycle
> > than later, to allow for further refinements and easier testing by other
> > interested parties.
> 
> I agree with that.  I'd like to look at it, and I will if I get time,
> but as I said elsewhere, I also think it's appropriate to give a
> little time around the final version of any big, complex patch just
> because people may have thoughts, and they may not have time to
> deliver those thoughts the minute the patch hits the list.

Fair enough -- I'll keep it open for the time being.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Heikki Linnakangas

Date:

24 September 2014, 06:59:18

On 09/23/2014 10:04 PM, Alvaro Herrera wrote:
> +  <para>
> +   The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname>
> +   is primarily maintained by Álvaro Herrera.
> +  </para>

We don't usually have such verbiage in the docs. The GIN and GiST pages 
do, but I think those are historic exceptions, not something we want 
to do going forward.

> +   <variablelist>
> +    <varlistentry>
> +     <term><function>BrinOpcInfo *opcInfo(void)</></term>
> +     <listitem>
> +      <para>
> +       Returns internal information about the indexed columns' summary data.
> +      </para>
> +     </listitem>
> +    </varlistentry>

I think you should explain what that internal information is. The 
minmax-19a.patch adds the type OID argument to this; remember to update 
the docs.

In SP-GiST, the similar function is called "config". It might be good to 
use the same name here, for consistency across indexams (although I 
actually like the "opcInfo" name better than "config")

The docs for the other support functions need to be updated, now that 
you changed the arguments from DeformedBrTuple to BrinValues.

> + <!-- this needs improvement ... -->
> +   To implement these methods in a generic ways, normally the opclass
> +   defines its own internal support functions.  For instance, minmax
> +   opclasses add the support functions for the four inequality operators
> +   for the datatype.
> +   Additionally, the operator class must supply appropriate
> +   operator entries,
> +   to enable the optimizer to use the index when those operators are
> +   used in queries.

The above needs improvement ;-)

> +    BRIN indexes (a shorthand for Block Range indexes)
> +    store summaries about the values stored in consecutive table physical block ranges.

"consecutive table physical block ranges" is quite a mouthful.

> +    For datatypes that have a linear sort order, the indexed data
> +    corresponds to the minimum and maximum values of the
> +    values in the column for each block range,
> +    which support indexed queries using these operators:
> +
> +    <simplelist>
> +     <member><literal><</literal></member>
> +     <member><literal><=</literal></member>
> +     <member><literal>=</literal></member>
> +     <member><literal>>=</literal></member>
> +     <member><literal>></literal></member>
> +    </simplelist>

That's the built-in minmax indexing strategy, yes, but you could have 
others, even for datatypes with a linear sort order.

> + To find out the index tuple for a particular page range, we have an internal

s/find out/find/

> + new heap tuple contains null values but the index tuple indicate there are no

s/indicate/indicates/

> + Open questions
> + --------------
> +
> + * Same-size page ranges?
> +   Current related literature seems to consider that each "index entry" in a
> +   BRIN index must cover the same number of pages.  There doesn't seem to be a

What is the related literature? Is there an academic paper or something 
that should be cited as a reference for BRIN?

> +  * TODO
> +  *        * ScalarArrayOpExpr (amsearcharray -> SK_SEARCHARRAY)
> +  *        * add support for unlogged indexes
> +  *        * ditto expressional indexes

We don't have unlogged indexes in general, so no need to list that here. 
What would be needed to implement ScalarArrayOpExprs?

I didn't realize that expression indexes are still not supported. And I 
see that partial indexes are not supported either. Why not? I wouldn't 
expect BRIN to need to care about those things in particular; the 
expressions for an expressional or partial index are handled in the 
executor, no?

> + /*
> +  * A tuple in the heap is being inserted.  To keep a brin index up to date,
> +  * we need to obtain the relevant index tuple, compare its stored values with
> +  * those of the new tuple; if the tuple values are consistent with the summary
> +  * tuple, there's nothing to do; otherwise we need to update the index.

s/compare/and compare/. Perhaps replace one of the semicolons with a 
full stop.

> +  * If the range is not currently summarized (i.e. the revmap returns InvalidTid
> +  * for it), there's nothing to do either.
> +  */
> + Datum
> + brininsert(PG_FUNCTION_ARGS)

There is no InvalidTid, as a constant or a #define. Perhaps replace with 
"invalid item pointer".

> +     /*
> +      * XXX We need to know the size of the table so that we know how long to
> +      * iterate on the revmap.  There's room for improvement here, in that we
> +      * could have the revmap tell us when to stop iterating.
> +      */

The revmap doesn't know how large the table is. Remember that you have 
to return all blocks that are not in the revmap, so you can't just stop 
when you reach the end of the revmap. I think the current design is fine.
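
A minimal sketch of why the scan has to be driven by the table size rather than by the revmap, assuming integer keys and simplified stand-in structures. RangeSummary and mark_candidate_pages are hypothetical names for illustration only and are not taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one revmap slot; not the real BRIN structs. */
typedef struct
{
    bool summarized;    /* false: no summary tuple exists for this range */
    int  min;
    int  max;
} RangeSummary;

/*
 * Mark every heap page that might contain "key".
 *
 * heap_pages comes from the table itself, not from the revmap.  Ranges
 * beyond revmap_len, or ranges that are not summarized, must be returned
 * in full because nothing is known about their contents.
 *
 * "out" must have room for heap_pages entries.  Returns the number of
 * pages marked as candidates.
 */
size_t
mark_candidate_pages(const RangeSummary *revmap, size_t revmap_len,
                     size_t heap_pages, size_t pages_per_range,
                     int key, bool *out)
{
    size_t marked = 0;

    assert(pages_per_range > 0);

    for (size_t start = 0; start < heap_pages; start += pages_per_range)
    {
        size_t range_no = start / pages_per_range;
        bool   candidate = true;    /* unknown ranges are always candidates */

        if (range_no < revmap_len && revmap[range_no].summarized)
            candidate = (key >= revmap[range_no].min &&
                         key <= revmap[range_no].max);

        for (size_t page = start;
             page < start + pages_per_range && page < heap_pages;
             page++)
        {
            out[page] = candidate;
            if (candidate)
                marked++;
        }
    }

    return marked;
}
```

With a 10-page table, 4 pages per range, and only two summarized ranges, the last two pages are always returned, whatever the key is.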

I have to stop now to do some other stuff. Overall, this is in pretty 
good shape. In addition to little cleanup of things I listed above, and 
similar stuff elsewhere that I didn't read through right now, there are 
a few medium-sized items I'd still like to see addressed before you 
commit this:

* expressional/partial index support
* the difficulty of testing the union support function that we discussed 
earlier
* clarify the memory context stuff of support functions that we also 
discussed earlier

- Heikki

Re: BRIN indexes - TRAP: BadArgument

From

"Erik Rijkers"

Date:

26 September 2014, 16:12:56

On Tue, September 23, 2014 21:04, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>
> [minmax-19.patch]
> [minmax-19a.patch]

Although it is admittedly unlikely that we will need it ourselves, and
although I see that a BRIN Extensibility chapter has been added (good!), I
am still a bit surprised by the absence of a built-in BRIN operator class
for bigint, as the BRIN index type is specifically useful for huge tables
(where, after all, huge values are more likely to occur).

Will a BRIN int8 operator class be added for 9.5? (I know, quite some time left...)

(btw, so far the patch proves quite stable under my abusive testing...)

thanks,

Erik Rijkers

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

06 October 2014, 22:34:29

Heikki Linnakangas wrote:
> On 09/23/2014 10:04 PM, Alvaro Herrera wrote:
> >+  <para>
> >+   The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname>
> >+   is primarily maintained by Álvaro Herrera.
> >+  </para>
>
> We don't usually have such verbiage in the docs. The GIN and GiST
> pages do, but I think those are historic exceptions, not something
> we want to do going forward.

Removed.

> >+   <variablelist>
> >+    <varlistentry>
> >+     <term><function>BrinOpcInfo *opcInfo(void)</></term>
> >+     <listitem>
> >+      <para>
> >+       Returns internal information about the indexed columns' summary data.
> >+      </para>
> >+     </listitem>
> >+    </varlistentry>
>
> I think you should explain what that internal information is. The
> minmax-19a.patch adds the type OID argument to this; remember to
> update the docs.

Updated.

> In SP-GiST, the similar function is called "config". It might be
> good to use the same name here, for consistency across indexams
> (although I actually like the "opcInfo" name better than "config")

Well, I'm not sure that there's any value in being consistent if the new
name is better than the old one.  Most likely, a person trying to
implement an spgist opclass wouldn't try to do a brin opclass at the
same time, so it's not like there's a lot of value in being consistent
there, anyway.

> The docs for the other support functions need to be updated, now
> that you changed the arguments from DeformedBrTuple to BrinValues.

Updated.

> >+ <!-- this needs improvement ... -->
> >+   To implement these methods in a generic ways, normally the opclass
> >+   defines its own internal support functions.  For instance, minmax
> >+   opclasses add the support functions for the four inequality operators
> >+   for the datatype.
> >+   Additionally, the operator class must supply appropriate
> >+   operator entries,
> >+   to enable the optimizer to use the index when those operators are
> >+   used in queries.
>
> The above needs improvement ;-)

I rechecked and while I tweaked it here and there, I wasn't able to add
much more to it.

> >+    BRIN indexes (a shorthand for Block Range indexes)
> >+    store summaries about the values stored in consecutive table physical block ranges.
>
> "consecutive table physical block ranges" is quite a mouthful.

I reworded this introduction.  I hope it makes more sense now.

> >+    For datatypes that have a linear sort order, the indexed data
> >+    corresponds to the minimum and maximum values of the
> >+    values in the column for each block range,
> >+    which support indexed queries using these operators:
> >+
> >+    <simplelist>
> >+     <member><literal><</literal></member>
> >+     <member><literal><=</literal></member>
> >+     <member><literal>=</literal></member>
> >+     <member><literal>>=</literal></member>
> >+     <member><literal>></literal></member>
> >+    </simplelist>
>
> That's the built-in minmax indexing strategy, yes, but you could
> have others, even for datatypes with a linear sort order.

I "fixed" this by removing this list.  It's not possible to be
comprehensive here, I think, and anyway I don't think there's much
point.

> >+ To find out the index tuple for a particular page range, we have an internal
>
> s/find out/find/
>
> >+ new heap tuple contains null values but the index tuple indicate there are no
>
> s/indicate/indicates/

Both fixed.

> >+ Open questions
> >+ --------------
> >+
> >+ * Same-size page ranges?
> >+   Current related literature seems to consider that each "index entry" in a
> >+   BRIN index must cover the same number of pages.  There doesn't seem to be a
>
> What is the related literature? Is there an academic paper or
> something that should be cited as a reference for BRIN?

In the original "minmax-proposal" file, I had these four URLs:

: Other database systems already have similar features. Some examples:
:
: * Oracle Exadata calls this "storage indexes"
:   http://richardfoote.wordpress.com/category/storage-indexes/
:
: * Netezza has "zone maps"
:   http://nztips.com/2010/11/netezza-integer-join-keys/
:
: * Infobright has this automatically within their "data packs" according to a
:   May 3rd, 2009 blog post
:   http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/
:
: * MonetDB also uses this technique, according to a published paper
:   http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
:   "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"

I gave them all a quick look and none of them touches the approach in
detail; in fact other than the Oracle Exadata one, they are all talking
about something else and mention the "minmax" stuff only in passing.  I
don't think any of them is worth citing.

> >+  * TODO
> >+  *        * ScalarArrayOpExpr (amsearcharray -> SK_SEARCHARRAY)
> >+  *        * add support for unlogged indexes
> >+  *        * ditto expressional indexes
>
> We don't have unlogged indexes in general, so no need to list that
> here. What would be needed to implement ScalarArrayOpExprs?

Well, it requires a different way to handle ScanKeys.  Anyway the
queries that it is supposed to serve can already be served in some other
ways for AMs that don't have amsearcharray, so I don't think it's a huge
loss if we don't implement it.  We can add that later.

> I didn't realize that expression indexes are still not supported.
> And I see that partial indexes are not supported either. Why not? I
> wouldn't expect BRIN to need to care about those things in
> particular; the expressions for an expressional or partial index are
> handled in the executor, no?

Yeah; those restrictions were leftovers from back when I didn't really
know how they were supposed to be implemented.  I took out the
restrictions and there wasn't anything else required to support both
these features.

> >+ /*
> >+  * A tuple in the heap is being inserted.  To keep a brin index up to date,
> >+  * we need to obtain the relevant index tuple, compare its stored values with
> >+  * those of the new tuple; if the tuple values are consistent with the summary
> >+  * tuple, there's nothing to do; otherwise we need to update the index.
>
> s/compare/and compare/. Perhaps replace one of the semicolons with a
> full stop.

Fixed.

> >+  * If the range is not currently summarized (i.e. the revmap returns InvalidTid
> >+  * for it), there's nothing to do either.
> >+  */
> >+ Datum
> >+ brininsert(PG_FUNCTION_ARGS)
>
> There is no InvalidTid, as a constant or a #define. Perhaps replace
> with "invalid item pointer".

Fixed -- actually it doesn't return invalid TID anymore, only NULL.

> >+     /*
> >+      * XXX We need to know the size of the table so that we know how long to
> >+      * iterate on the revmap.  There's room for improvement here, in that we
> >+      * could have the revmap tell us when to stop iterating.
> >+      */
>
> The revmap doesn't know how large the table is. Remember that you
> have to return all blocks that are not in the revmap, so you can't
> just stop when you reach the end of the revmap. I think the current
> design is fine.

Yeah, I was leaning towards the same conclusion myself.  I have removed
the comment.  (We could think about having brininsert update the
metapage so that the index keeps track of what's the last heap page,
which could help us support this, but I'm not sure there's much point.
Anyway we can tweak this later.)

> I have to stop now to do some other stuff. Overall, this is in
> pretty good shape. In addition to little cleanup of things I listed
> above, and similar stuff elsewhere that I didn't read through right
> now, there are a few medium-sized items I'd still like to see
> addressed before you commit this:
>
> * expressional/partial index support
> * the difficulty of testing the union support function that we
> discussed earlier

I added an USE_ASSERTION-only block in brininsert that runs the union
support proc and compares the output with the one from regular addValue.
I haven't tested this too much yet.
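
As a rough illustration of the kind of cross-check described here, the sketch below runs both paths on a toy integer minmax summary and asserts that they agree. MinMaxSummary and the minmax_* functions are hypothetical stand-ins, not the patch's addValue or union support procs:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy summary for one block range; not the real BrinValues layout. */
typedef struct
{
    bool has_values;    /* false until the first value is summarized */
    int  min;
    int  max;
} MinMaxSummary;

/* Widen the summary to cover "value"; returns true if it changed. */
bool
minmax_add_value(MinMaxSummary *s, int value)
{
    bool changed = false;

    if (!s->has_values)
    {
        s->has_values = true;
        s->min = s->max = value;
        return true;
    }
    if (value < s->min)
    {
        s->min = value;
        changed = true;
    }
    if (value > s->max)
    {
        s->max = value;
        changed = true;
    }
    return changed;
}

/* Merge summary "b" into summary "a". */
void
minmax_union(MinMaxSummary *a, const MinMaxSummary *b)
{
    if (!b->has_values)
        return;
    if (!a->has_values)
    {
        *a = *b;
        return;
    }
    if (b->min < a->min)
        a->min = b->min;
    if (b->max > a->max)
        a->max = b->max;
}

/* Field-by-field comparison of two summaries. */
bool
minmax_equal(const MinMaxSummary *a, const MinMaxSummary *b)
{
    if (a->has_values != b->has_values)
        return false;
    if (!a->has_values)
        return true;
    return a->min == b->min && a->max == b->max;
}

/*
 * Insert "value" via the add-value path, and cross-check the result
 * against the union of the old summary with a one-value summary.
 */
bool
minmax_insert_checked(MinMaxSummary *s, int value)
{
    MinMaxSummary via_union = *s;
    MinMaxSummary single = { false, 0, 0 };
    bool          changed;

    minmax_add_value(&single, value);
    minmax_union(&via_union, &single);

    changed = minmax_add_value(s, value);
    assert(minmax_equal(s, &via_union));

    return changed;
}
```

The point of the design is that the union path gets exercised on every insert in assert-enabled builds, even though normal operation rarely calls it.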

> * clarify the memory context stuff of support functions that we also
> discussed earlier

I re-checked this stuff.  Turns out that the support functions don't
palloc/pfree memory too much, except to update the stuff stored in
BrinValues, by using datumCopy().  This memory is only freed when we
need to update a previous Datum.  There's no way for the brin.c code to
know when the Datum is going to be released by the support proc, and
thus no way for a temp context to be used.

The memory context experiments I alluded to earlier are related to
pallocs done in brininsert / bringetbitmap themselves, not in the
opclass-provided support procs.  All in all, I don't think there's much
room for improvement, other than perhaps doing so in brininsert/
bringetbitmap.  Don't really care too much about this either way.

Once again, many thanks for the review.  Here's a new version.  I have
added operator classes for int8, text, and actually everything that btree
supports except:
    bool
    record
    oidvector
    anyarray
    tsvector
    tsquery
    jsonb
    range

since I'm not sure that it makes sense to have opclasses for any of
these -- at least not regular minmax opclasses.  There are some
interesting possibilities, for example for range types, whereby we store
in the index tuple the union of all the ranges in the block range.

(I had an opclass for anyenum too, but on further thought I removed it
because it is going to be pointless in nearly all cases.)

 contrib/pageinspect/Makefile             |    2 +-
 contrib/pageinspect/brinfuncs.c          |  410 +++++++++++
 contrib/pageinspect/pageinspect--1.2.sql |   37 +
 contrib/pg_xlogdump/rmgrdesc.c           |    1 +
 doc/src/sgml/brin.sgml                   |  498 +++++++++++++
 doc/src/sgml/filelist.sgml               |    1 +
 doc/src/sgml/indices.sgml                |   36 +-
 doc/src/sgml/postgres.sgml               |    1 +
 src/backend/access/Makefile              |    2 +-
 src/backend/access/brin/Makefile         |   18 +
 src/backend/access/brin/README           |  179 +++++
 src/backend/access/brin/brin.c           | 1116 ++++++++++++++++++++++++++++++
 src/backend/access/brin/brin_minmax.c    |  320 +++++++++
 src/backend/access/brin/brin_pageops.c   |  712 +++++++++++++++++++
 src/backend/access/brin/brin_revmap.c    |  473 +++++++++++++
 src/backend/access/brin/brin_tuple.c     |  553 +++++++++++++++
 src/backend/access/brin/brin_xlog.c      |  319 +++++++++
 src/backend/access/common/reloptions.c   |    7 +
 src/backend/access/heap/heapam.c         |   22 +-
 src/backend/access/rmgrdesc/Makefile     |    3 +-
 src/backend/access/rmgrdesc/brindesc.c   |  112 +++
 src/backend/access/transam/rmgr.c        |    1 +
 src/backend/catalog/index.c              |   24 +
 src/backend/replication/logical/decode.c |    1 +
 src/backend/storage/page/bufpage.c       |  179 ++++-
 src/backend/utils/adt/selfuncs.c         |   74 +-
 src/include/access/brin.h                |   52 ++
 src/include/access/brin_internal.h       |   87 +++
 src/include/access/brin_page.h           |   70 ++
 src/include/access/brin_pageops.h        |   36 +
 src/include/access/brin_revmap.h         |   39 ++
 src/include/access/brin_tuple.h          |   97 +++
 src/include/access/brin_xlog.h           |  107 +++
 src/include/access/heapam.h              |    2 +
 src/include/access/reloptions.h          |    3 +-
 src/include/access/relscan.h             |    4 +-
 src/include/access/rmgrlist.h            |    1 +
 src/include/catalog/index.h              |    8 +
 src/include/catalog/pg_am.h              |    2 +
 src/include/catalog/pg_amop.h            |  164 +++++
 src/include/catalog/pg_amproc.h          |  245 +++++++
 src/include/catalog/pg_opclass.h         |   32 +
 src/include/catalog/pg_opfamily.h        |   28 +
 src/include/catalog/pg_proc.h            |   38 +
 src/include/storage/bufpage.h            |    2 +
 src/include/utils/selfuncs.h             |    1 +
 src/test/regress/expected/opr_sanity.out |   14 +-
 src/test/regress/sql/opr_sanity.sql      |    7 +-
 48 files changed, 6122 insertions(+), 18 deletions(-)

(I keep naming the patch file "minmax", but nothing in the code is
actually called that way anymore, except the opclasses).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

minmax-20.patch

Re: BRIN indexes - TRAP: BadArgument

From

Heikki Linnakangas

Date:

07 October 2014, 14:00:14

On 10/07/2014 01:33 AM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> On 09/23/2014 10:04 PM, Alvaro Herrera wrote:
>>> + Open questions
>>> + --------------
>>> +
>>> + * Same-size page ranges?
>>> +   Current related literature seems to consider that each "index entry" in a
>>> +   BRIN index must cover the same number of pages.  There doesn't seem to be a
>>
>> What is the related literature? Is there an academic paper or
>> something that should be cited as a reference for BRIN?
>
> In the original "minmax-proposal" file, I had these four URLs:
>
> : Other database systems already have similar features. Some examples:
> :
> : * Oracle Exadata calls this "storage indexes"
> :   http://richardfoote.wordpress.com/category/storage-indexes/
> :
> : * Netezza has "zone maps"
> :   http://nztips.com/2010/11/netezza-integer-join-keys/
> :
> : * Infobright has this automatically within their "data packs" according to a
> :   May 3rd, 2009 blog post
> :   http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/
> :
> : * MonetDB also uses this technique, according to a published paper
> :   http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
> :   "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
>
> I gave them all a quick look and none of them touches the approach in
> detail; in fact other than the Oracle Exadata one, they are all talking
> about something else and mention the "minmax" stuff only in passing.  I
> don't think any of them is worth citing.

I think the "current related literature" phrase should be removed, if 
there isn't in fact any literature on this. If there's any literature 
worth referencing, we should add a proper citation.

> I added an USE_ASSERTION-only block in brininsert that runs the union
> support proc and compares the output with the one from regular addValue.
> I haven't tested this too much yet.

Ok, that's better than nothing. I wonder if it's too strict, though. It 
uses brin_tuple_equal(), which does a memcmp() on the tuples. That will 
trip for any non-meaningful differences, like the scale in a numeric.
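
To make the concern concrete, here is a small sketch using a toy decimal type in which 1.0 and 1.00 have different byte images but the same value. ToyDecimal and its helpers are hypothetical and do not reflect the actual numeric representation; the sketch assumes non-negative scales no larger than 9 so the 64-bit arithmetic cannot overflow:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy decimal: value = unscaled / 10^scale.  Two int32 fields, no padding. */
typedef struct
{
    int32_t unscaled;   /* digits with the decimal point removed */
    int32_t scale;      /* number of digits after the decimal point */
} ToyDecimal;

/* 10^n for small non-negative n. */
static int64_t
pow10_i64(int32_t n)
{
    int64_t r = 1;

    while (n-- > 0)
        r *= 10;
    return r;
}

/* Bytewise comparison, in the style of a memcmp()-based check. */
bool
toy_decimal_bytes_equal(const ToyDecimal *a, const ToyDecimal *b)
{
    return memcmp(a, b, sizeof(ToyDecimal)) == 0;
}

/* Value comparison: bring both operands to a common scale first. */
bool
toy_decimal_value_equal(const ToyDecimal *a, const ToyDecimal *b)
{
    return (int64_t) a->unscaled * pow10_i64(b->scale) ==
           (int64_t) b->unscaled * pow10_i64(a->scale);
}
```

Here {10, 1} and {100, 2} both represent one, so the value comparison succeeds while the bytewise comparison fails, which is the kind of false alarm a strict memcmp() check can raise.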

>> * clarify the memory context stuff of support functions that we also
>> discussed earlier
>
> I re-checked this stuff.  Turns out that the support functions don't
> palloc/pfree memory too much, except to update the stuff stored in
> BrinValues, by using datumCopy().  This memory is only freed when we
> need to update a previous Datum.  There's no way for the brin.c code to
> know when the Datum is going to be released by the support proc, and
> thus no way for a temp context to be used.
>
> The memory context experiments I alluded to earlier are related to
> pallocs done in brininsert / bringetbitmap themselves, not in the
> opclass-provided support procs.

At the very least, it needs to be documented.

> All in all, I don't think there's much
> room for improvement, other than perhaps doing so in brininsert/
> bringetbitmap.  Don't really care too much about this either way.

Doing it in brininsert/bringetbitmap seems like the right approach. 
GiST, GIN, and SP-GiST all use a temporary memory context like that.

It would be wise to reserve some more support procedure numbers, for 
future expansion. Currently, support procs 1-4 are used by BRIN itself, 
and higher numbers can be used by the opclass. minmax opclasses use 5-8 
for the <, <=, >= and > operators. If we ever want to add a new, 
optional, support function to BRIN, we're out of luck. Let's document 
that e.g. support procs < 10 are reserved for BRIN.
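
One way to picture the proposed numbering, as a sketch only: 1-4 are the procs BRIN itself calls, 5-9 would be held back for future BRIN use, and opclass-specific procs would start at 10. The type and function names below are made up for illustration and do not appear in the patch:

```c
#include <assert.h>

/* Hypothetical classification of a support procedure number. */
typedef enum
{
    PROC_INVALID,       /* numbers below 1 are not valid */
    PROC_CORE,          /* 1-4: used by BRIN itself today */
    PROC_RESERVED,      /* 5-9: kept free for future BRIN procs */
    PROC_OPCLASS        /* 10 and up: free for the opclass to use */
} ProcClass;

ProcClass
classify_support_proc(int procnum)
{
    if (procnum < 1)
        return PROC_INVALID;
    if (procnum <= 4)
        return PROC_CORE;
    if (procnum < 10)
        return PROC_RESERVED;
    return PROC_OPCLASS;
}
```

Under such a scheme the minmax opclasses, which currently use 5-8, would have to move their operator procs to 10 or above.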

The redo routines should be updated to follow the new 
XLogReadBufferForRedo idiom (commit 
f8f4227976a2cdb8ac7c611e49da03aa9e65e0d2).

- Heikki

BRIN range operator class

From

Emre Hasegeli

Date:

19 October 2014, 17:04:31

> Once again, many thanks for the review.  Here's a new version.  I have
> added operator classes for int8, text, and actually everything that btree
> supports except:
>     bool
>     record
>     oidvector
>     anyarray
>     tsvector
>     tsquery
>     jsonb
>     range
>
> since I'm not sure that it makes sense to have opclasses for any of
> these -- at least not regular minmax opclasses.  There are some
> interesting possibilities, for example for range types, whereby we store
> in the index tuple the union of all the ranges in the block range.

I thought we could do better than minmax for the inet data type,
and ended up with a generalized opclass supporting both inet and range
types.  Patch based on minmax-v20 attached.  It works well except for
a few small problems.  I will improve the patch and add it to
a commitfest after the BRIN framework is committed.

To support more operators I needed to change amstrategies and
amsupport in the catalog.  It would be nice if amsupport could be set
to 0 like amstrategies.

The inet data type accepts both IP version 4 and version 6 addresses.
It is not possible to represent the union of addresses from different
versions with a valid inet value, so I made the union function return
NULL in this case.  Then I tried to store whether the returned value is
NULL in column->values[] as a boolean, but it failed on the pfree()
inside brin_dtuple_initilize().  It doesn't seem right to free the
values based on attr->attbyval.

I think the same opclass can be used for geometric types.  I can
rename it to inclusion_ops instead of range_ops.  The GiST opclasses
for the geometric types use bounding boxes.  It wouldn't be possible
to use a different data type in a generic opclass.  Maybe the STORAGE
parameter can be used for that purpose.

> (I had an opclass for anyenum too, but on further thought I removed it
> because it is going to be pointless in nearly all cases.)

It can be useful in some circumstances.  We wouldn't lose anything
by supporting more types.  I think we should even add an operator
class for boolean.

Attachment

brin-range-v01.patch

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

29 October 2014, 20:12:00

Heikki Linnakangas wrote:
> On 10/07/2014 01:33 AM, Alvaro Herrera wrote:

> >I added an USE_ASSERTION-only block in brininsert that runs the union
> >support proc and compares the output with the one from regular addValue.
> >I haven't tested this too much yet.
> 
> Ok, that's better than nothing. I wonder if it's too strict, though. It uses
> brin_tuple_equal(), which does a memcmp() on the tuples. That will trip for
> any non-meaningful differences, like the scale in a numeric.

True.  I'm not real sure how to do better, though.  For types that have
a btree opclass it's easy, because we can just use the btree equality
function to compare the values.  But most interesting cases would not
have btree opclasses; those are covered by the minmax family of
opclasses.
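
Heikki's concern about memcmp() can be illustrated with a small standalone sketch.  This is not PostgreSQL code: the ToyNumeric struct and both helper functions are hypothetical, and the toy representation is only loosely modelled on a decimal with a display scale.  It shows two values that are equal as numbers yet differ bytewise:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical toy decimal: value = digits / 10^scale.
 * Both fields are 64-bit so the struct has no padding bytes.
 */
typedef struct ToyNumeric
{
    int64_t digits;   /* unscaled digits */
    int64_t scale;    /* number of decimal places */
} ToyNumeric;

/* Bytewise comparison, similar in spirit to a memcmp()-based tuple check. */
static bool
toy_bytes_equal(const ToyNumeric *a, const ToyNumeric *b)
{
    return memcmp(a, b, sizeof(ToyNumeric)) == 0;
}

/*
 * Semantic comparison: rescale both values to the larger scale, then
 * compare digits.  Overflow is ignored in this sketch.
 */
static bool
toy_value_equal(const ToyNumeric *a, const ToyNumeric *b)
{
    int64_t da = a->digits, db = b->digits;
    int64_t sa = a->scale, sb = b->scale;

    while (sa < sb) { da *= 10; sa++; }
    while (sb < sa) { db *= 10; sb++; }
    return da == db;
}
```

With a = {10, 1} (that is, 1.0) and b = {100, 2} (that is, 1.00), the bytewise check reports a difference while the semantic check reports equality, which is the kind of false alarm a memcmp()-based assertion could raise.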

> It would be wise to reserve some more support procedure numbers, for future
> expansion. Currently, support procs 1-4 are used by BRIN itself, and higher
> numbers can be used by the opclass. minmax opclasses uses 5-8 for the <, <=,
> >= and > operators. If we ever want to add a new, optional, support function
> to BRIN, we're out of luck. Let's document that e.g. support procs < 10 are
> reserved for BRIN.

Sure.  I hope we never need to add a seventh optional support function ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Robert Haas

Date:

30 October 2014, 17:56:27

+   <acronym>BRIN</acronym> indexes can satisfy queries via the bitmap
+   scanning facility, and will return all tuples in all pages within

"The bitmap scanning facility?"  Does this mean a bitmap index scan?
Or something novel to BRIN?  I think this could be clearer.

+   This enables them to work as very fast sequential scan helpers to avoid
+   scanning blocks that are known not to contain matching tuples.

Hmm, but they don't actually do anything about sequential scans per
se, right?  I'd say something like: "Because a BRIN index is very
small, scanning the index adds little overhead compared to a
sequential scan, but may avoid scanning large parts of the table that
are known not to contain matching tuples."

+   depend on the operator class selected for the data type.

The operator class is selected for the index, not the data type.

+   The size of the block range is determined at index creation time with
+   the <literal>pages_per_range</> storage parameter.
+   The smaller the number, the larger the index becomes (because of the need to
+   store more index entries), but at the same time the summary data stored can
+   be more precise and more data blocks can be skipped during an index scan.

I would insert a sentence something like this: "The number of index
entries will be equal to the size of the relation in pages divided by
the selected value for pages_per_range.  Therefore, the smaller the
number...."  At least, I would insert that if it's actually true.  My
point is that I think the effect of pages_per_range could be made more
clear.
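
As a worked example of that relationship, here is a minimal sketch.  It assumes one summary entry per block range and that a partial final range still needs its own entry (hence the rounding up); the function name is hypothetical and this is not taken from the patch:

```c
#include <stdint.h>

/*
 * Number of block ranges needed to cover rel_pages heap pages,
 * rounding up so that a partial last range is counted.
 */
static uint32_t
toy_brin_range_count(uint32_t rel_pages, uint32_t pages_per_range)
{
    if (pages_per_range == 0)
        return 0;               /* invalid setting; this sketch just returns 0 */

    /* Widen to 64 bits so the addition cannot overflow. */
    return (uint32_t) (((uint64_t) rel_pages + pages_per_range - 1)
                       / pages_per_range);
}
```

Under those assumptions, a 1000-page table with pages_per_range = 128 needs 8 entries, while pages_per_range = 1 needs 1000, which is the size/precision trade-off the paragraph describes.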

+   The core <productname>PostgreSQL</productname> distribution includes
+   includes the <acronym>BRIN</acronym> operator classes shown in
+   <xref linkend="gin-builtin-opclasses-table">.

Shouldn't that say brin, not gin?

+   requiring the access method implementer only to implement the semantics

The naming of the reverse range map seems a little weird.  It seems
like most operations go through it, so it feels more like the forward
direction.  Maybe I'm misunderstanding.  (I doubt it's worth renaming
it at this point either way, but I thought I'd mention it.)

+              errmsg("unlogged BRIN indexes are not supported")));

Why not?  Shouldn't be particularly hard, I wouldn't think.

I'm pretty sure you need to create a pageinspect--1.3.sql, not just
update the 1.2 file.  Because that's in 9.4, and this won't be.

I'm pretty excited about this feature.  I think it's going to be very
good for PostgreSQL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

03 November 2014, 22:19:50

Robert Haas wrote:
> [lots]

I have fixed all these items in the attached, thanks -- most
user-visible change was the pageinspect 1.3 thingy.  pg_upgrade from 1.2
works fine now.  I also fixed some things Heikki noted, mainly avoid
retail pfree where possible, and renumber the support procs to leave
room for future expansion of the framework.  XLog replay code is updated
too.

Also, I made the summarization step callable directly from SQL without
having to invoke VACUUM.

So here's v21.  I also attach a partial diff from v20, just in case
anyone wants to give it a look.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: BRIN indexes - TRAP: BadArgument

From

Jeff Janes

Date:

04 November 2014, 00:16:22

On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Robert Haas wrote:
> > [lots]
>
> I have fixed all these items in the attached, thanks -- most
> user-visible change was the pageinspect 1.3 thingy. pg_upgrade from 1.2
> works fine now. I also fixed some things Heikki noted, mainly avoid
> retail pfree where possible, and renumber the support procs to leave
> room for future expansion of the framework. XLog replay code is updated
> too.
>
> Also, I made the summarization step callable directly from SQL without
> having to invoke VACUUM.
>
> So here's v21. I also attach a partial diff from v20, just in case
> anyone wants to give it a look.

I get a couple compiler warnings with this:

brin.c: In function 'brininsert':
brin.c:97: warning: 'tupcxt' may be used uninitialized in this function
brin.c:98: warning: 'oldcxt' may be used uninitialized in this function

Also, I think it is missing a cat version bump. It let me start the patched server against an unpatched initdb run, but once started it didn't find the index method.

What would it take to make CLUSTER work on a brin index? Now I just added a btree index on the same column, clustered on that, then dropped that index.

Thanks,

Jeff

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

04 November 2014, 00:59:30

Jeff Janes wrote:
> On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:

> I get a couple compiler warnings with this:
> 
> brin.c: In function 'brininsert':
> brin.c:97: warning: 'tupcxt' may be used uninitialized in this function
> brin.c:98: warning: 'oldcxt' may be used uninitialized in this function

Ah, that's easily fixed.  My compiler (gcc 4.9 from Debian Jessie
nowadays) doesn't complain, but I can see that it's not entirely
trivial.

> Also, I think it is missing a cat version bump.  It let me start the
> patched server against an unpatched initdb run, but once started it didn't
> find the index method.

Sure, that's expected (by me at least).  I'm too lazy to maintain
catversion bumps in the patch before pushing, since that generates
constant conflicts as I rebase.

> What would it take to make CLUSTER work on a brin index?  Now I just added
> a btree index on the same column, clustered on that, then dropped that
> index.

Interesting question.  What's the most efficient way to pack a table to
minimize the intervals covered by each index entry?  One thing that
makes this project a bit easier, I think, is that CLUSTER has already
been generalized so that it supports either an indexscan or a
seqscan+sort.  If anyone wants to work on this, be my guest; I'm
certainly not going to add it to the initial commit.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Simon Riggs

Date:

04 November 2014, 08:42:40

On 3 November 2014 22:18, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> So here's v21.  I also attach a partial diff from v20, just in case
> anyone wants to give it a look.

Looks really good.

I'd like to reword this sentence in the readme, since one of the main
use cases would be tables without btrees:

  It's unlikely that BRIN would be the only
+ indexes in a table, though, because primary keys can be btrees only, and so
+ we don't implement this optimization.

I don't see a regression test. Create, use, VACUUM, just so we know it
hasn't regressed after commit.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Jeff Janes

Date:

04 November 2014, 10:07:25

On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> So here's v21. I also attach a partial diff from v20, just in case
> anyone wants to give it a look.

This needs a bump to 1.3, or the extension won't install:

contrib/pageinspect/pageinspect.control

During crash recovery, I am getting a segfault:

#0 0x000000000067ee35 in LWLockRelease (lock=0x0) at lwlock.c:1161
#1 0x0000000000664f4a in UnlockReleaseBuffer (buffer=0) at bufmgr.c:2888
#2 0x0000000000465a88 in brin_xlog_revmap_extend (lsn=<value optimized out>, record=<value optimized out>) at brin_xlog.c:261
#3 brin_redo (lsn=<value optimized out>, record=<value optimized out>) at brin_xlog.c:284
#4 0x00000000004ce505 in StartupXLOG () at xlog.c:6795

I failed to preserve the data directory, I'll try to repeat this later this week if needed.

Cheers,

Jeff

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

04 November 2014, 22:30:11

Jeff Janes wrote:
> On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>
> >
> > So here's v21.  I also attach a partial diff from v20, just in case
> > anyone wants to give it a look.
> >
>
> This needs a bump to 1.3, or the extension won't install:

Missed that, thanks.

> #0  0x000000000067ee35 in LWLockRelease (lock=0x0) at lwlock.c:1161
> #1  0x0000000000664f4a in UnlockReleaseBuffer (buffer=0) at bufmgr.c:2888
> #2  0x0000000000465a88 in brin_xlog_revmap_extend (lsn=<value optimized
> out>, record=<value optimized out>) at brin_xlog.c:261
> #3  brin_redo (lsn=<value optimized out>, record=<value optimized out>) at
> brin_xlog.c:284
> #4  0x00000000004ce505 in StartupXLOG () at xlog.c:6795
>
> I failed to preserve the data directory, I'll try to repeat this later this
> week if needed.

I was clearly too careless about testing the xlog code --- it had
numerous bugs.  This version should be a lot better, but there might be
problems lurking still as I don't think I covered it all.  Let me know
if you see anything wrong.

I also added pageinspect docs, which I had neglected and only realized
due to a comment in another thread (thanks Amit).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

brin-22.patch

Re: BRIN indexes - TRAP: BadArgument

From

Jeff Janes

Date:

05 November 2014, 17:14:28

On Tue, Nov 4, 2014 at 2:28 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I was clearly too careless about testing the xlog code --- it had
> numerous bugs. This version should be a lot better, but there might be
> problems lurking still as I don't think I covered it all. Let me know
> if you see anything wrong.

At line 252 of brin_xlog.c, should the UnlockReleaseBuffer(metabuf) be protected by a BufferIsValid?

XLogReadBufferForRedo says that it might return an invalid buffer under some situations. Perhaps it is known that those situations can't apply here?

Now I am getting segfaults during normal (i.e. no intentional crashes) operations. I think I was seeing them sometimes before as well, I just wasn't looking for them.

The attached script invokes the segfault within a few minutes. A lot of the stuff in the script is probably not necessary, I just didn't spend the time to pare it down to the essentials. It does not need to be in parallel, I get the crash when invoked with only one job (perl ~/brin_crash.pl 1).

I think this is related to having block ranges which have no tuples in them when they are first summarized. If I take out the "with t as (delete from foo returning *) insert into foo select * from t", then I don't see the crashes.

#0 0x000000000089ed3e in pg_detoast_datum_packed (datum=0x0) at fmgr.c:2270
#1 0x0000000000869be9 in text_le (fcinfo=0x7fff1bf6b9f0) at varlena.c:1661
#2 0x000000000089cfc7 in FunctionCall2Coll (flinfo=0x297e640, collation=100, arg1=0, arg2=43488216) at fmgr.c:1324
#3 0x00000000004678f8 in minmaxConsistent (fcinfo=0x7fff1bf6be40) at brin_minmax.c:213
#4 0x000000000089d0c9 in FunctionCall3Coll (flinfo=0x297b830, collation=100, arg1=43509512, arg2=43510296, arg3=43495856) at fmgr.c:1349
#5 0x0000000000462484 in bringetbitmap (fcinfo=0x7fff1bf6c310) at brin.c:469
#6 0x000000000089cfc7 in FunctionCall2Coll (flinfo=0x28f2440, collation=0, arg1=43495712, arg2=43497376) at fmgr.c:1324
#7 0x00000000004b3fc9 in index_getbitmap (scan=0x297b120, bitmap=0x297b7a0) at indexam.c:651
#8 0x000000000062ece0 in MultiExecBitmapIndexScan (node=0x297af30) at nodeBitmapIndexscan.c:89
#9 0x0000000000619783 in MultiExecProcNode (node=0x297af30) at execProcnode.c:550
#10 0x000000000062dea2 in BitmapHeapNext (node=0x2974750) at nodeBitmapHeapscan.c:104

Cheers,

Jeff

Attachment

brin_crash.pl

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

05 November 2014, 20:54:44

Jeff Janes wrote:

> At line 252 of brin_xlog.c, should the UnlockReleaseBuffer(metabuf) be
> protected by a BufferIsValid?

Yes, that was just me being careless.  Fixed.

> Now I am getting segfaults during normal (i.e. no intentional crashes)
> operations.  I think I was seeing them sometimes before as well, I just
> wasn't looking for them.

Interesting.  I was neglecting to test for empty index tuples in the
Consistent support function.  Should be fixed now, and I verified that
the other support functions check for this condition (AFAICS this was
the only straggler -- I had fixed all the others already).
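
The general shape of such a guard can be sketched in standalone C.  This is only an illustration under stated assumptions, not the code from brin_minmax.c: the ToyRangeSummary struct, its all_nulls flag, and toy_consistent_le() are hypothetical stand-ins for a per-range min/max summary and a consistency check for "col <= key".

```c
#include <stdbool.h>

/* Hypothetical per-range summary for an int column. */
typedef struct ToyRangeSummary
{
    bool all_nulls;   /* true if the range holds no non-null values,
                       * e.g. because it was empty when summarized */
    int  min;         /* only meaningful when all_nulls is false */
    int  max;         /* only meaningful when all_nulls is false */
} ToyRangeSummary;

/* Could any row in the range satisfy "col <= key"? */
static bool
toy_consistent_le(const ToyRangeSummary *s, int key)
{
    /*
     * Guard first: for an empty summary there is no valid min/max to
     * hand to the comparison, and no row in the range can match.
     */
    if (s->all_nulls)
        return false;

    return s->min <= key;
}
```

In the backtrace above the comparison function received a null datum as its first argument; checking for the empty summary before comparing avoids that kind of call altogether.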

> I think this is related to having block ranges which have no tuples in them
> when they are first summarized.  If I take out the "with t as (delete from
> foo returning *) insert into foo select * from t", then I don't see the
> crashes

Exactly.

After fixing that I noticed that there was an assertion (about
collations) failing under certain conditions with your script.  I also
fixed that.  I also added a test for regress.  I didn't have time to
distill a standalone test case for your crash, though.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

brin-23.patch

Re: BRIN indexes - TRAP: BadArgument

From

Jeff Janes

Date:

05 November 2014, 22:58:39

On Wed, Nov 5, 2014 at 12:54 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Thanks for the updated patch.

Now when I run the test program (version with better error reporting attached), it runs fine until I open a psql session and issue:

reindex table foo;

Then it immediately falls over with some rows no longer being findable through the index.

-- use index
select count(*) from foo where text_array = md5(4611::text);

-- use seq scan
select count(*) from foo where text_array||'' = md5(4611::text);

Where the number '4611' was taken from the error message of the test program.

Attachment

brin_crash.pl

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

06 November 2014, 21:54:23

Jeff Janes wrote:
> On Wed, Nov 5, 2014 at 12:54 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>
> Thanks for the updated patch.
>
> Now when I run the test program (version with better error reporting
> attached), it runs fine until I open a psql session and issue:
>
> reindex table foo;

Interesting.  This was a more general issue actually -- if you dropped
the index at that point and created it again, the resulting index would
also be corrupt in the same way.  Inspecting with the supplied
pageinspect functions made the situation pretty obvious.  The old code
was skipping page ranges in which it could not find any tuples, but
that's bogus and inefficient.  I changed an "if" into a loop that
inserts intermediary tuples, if any are needed.  I cannot reproduce that
problem anymore.
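
A rough standalone sketch of the "if" versus loop difference follows.  The ToyBuildState struct and toy_build_callback() are hypothetical, and incrementing a counter stands in for inserting an index tuple; the real build code is not shown in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical build state: next_range_start is the first heap block of
 * the range currently being accumulated.
 */
typedef struct ToyBuildState
{
    uint32_t pages_per_range;
    uint32_t next_range_start;
    uint32_t ranges_emitted;
} ToyBuildState;

/*
 * Called once per heap tuple, in block order.  Emits a summary for
 * every range that ends before tuple_block, including ranges that
 * contained no tuples at all, so that no range is skipped.
 */
static void
toy_build_callback(ToyBuildState *st, uint32_t tuple_block)
{
    /* A loop, not an "if": several ranges may have been passed over. */
    while (tuple_block >= st->next_range_start + st->pages_per_range)
    {
        st->ranges_emitted++;   /* stand-in for inserting one index tuple */
        st->next_range_start += st->pages_per_range;
    }

    /* ... accumulate the tuple's value into the current range here ... */
}
```

With pages_per_range = 4, a tuple in block 1 followed by a tuple in block 13 makes the loop emit three summaries (for blocks 0-3, 4-7 and 8-11).  A single "if" would emit only the first and leave the current range start pointing at block 4.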

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

brin-24.patch.gz

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

07 November 2014, 19:56:14

I just pushed this, after some more minor tweaks.  Thanks, and please do
continue testing!

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

David Rowley

Date:

07 November 2014, 23:33:07

On Sat, Nov 8, 2014 at 8:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I just pushed this, after some more minor tweaks. Thanks, and please do
> continue testing!

I'm having problems getting this to compile on MSVC. Attached is a patch which fixes the problem.

There also seems to be a bit of a problem with:

brin.c(250): warning C4700: uninitialized local variable 'newsz' used

    /*
     * Before releasing the lock, check if we can attempt a same-page
     * update. Another process could insert a tuple concurrently in
     * the same page though, so downstream we must be prepared to cope
     * if this turns out to not be possible after all.
     */
    samepage = brin_can_do_samepage_update(buf, origsz, newsz);

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    newtup = brin_form_tuple(bdesc, heapBlk, dtup, &newsz);

Here newsz is passed to brin_can_do_samepage_update before being initialised. I'm not quite sure of the solution here as I've not spent much time looking at it, but perhaps brin_form_tuple needs to happen before brin_can_do_samepage_update, then the lock should be released? I didn't change this in the patch as I'm not sure if that's the proper fix or not.

The attached should fix the build problem that anole is having: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2014-11-07%2022%3A04%3A03

Regards

David Rowley

Attachment

brin_elog_fix.diff

Re: BRIN indexes - TRAP: BadArgument

From

Tom Lane

Date:

08 November 2014, 02:20:50

David Rowley <dgrowleyml@gmail.com> writes:
> I'm having problems getting this to compile on MSVC. Attached is a patch
> which fixes the problem.

The committed code is completely broken on compilers that don't accept
varargs macros, and this patch will not make them happier.

Probably what needs to happen is to put extra parentheses into the call
sites, along the lines of
      #ifdef BRIN_DEBUG
      #define BRIN_elog(args) elog args
      #else
      #define BRIN_elog(args) ((void) 0)
      #endif

      BRIN_elog((LOG, "fmt", ...));


Or we could decide we don't need this debugging crud anymore and just
nuke it all.

        regards, tom lane

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

08 November 2014, 03:27:24

Tom Lane wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
> > I'm having problems getting this to compile on MSVC. Attached is a patch
> > which fixes the problem.
> 
> The committed code is completely broken on compilers that don't accept
> varargs macros, and this patch will not make them happier.

I tried to make it fire only on GCC, which is known to support variadic
macros, but I evidently failed.

> Probably what needs to happen is to put extra parentheses into the call
> sites, along the lines of
> 
>        #ifdef BRIN_DEBUG
>        #define BRIN_elog(args) elog args
>        #else
>        #define BRIN_elog(args) ((void) 0)
>        #endif
> 
> 
>        BRIN_elog((LOG, "fmt", ...));

That works for me, thanks for the suggestion.
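
For anyone unfamiliar with the trick, here is a self-contained sketch of the same pattern outside PostgreSQL.  The names toy_elog, TOY_elog and TOY_DEBUG are hypothetical stand-ins, and toy_elog is an ordinary variadic function rather than the real elog:

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for elog(): an ordinary variadic *function*. */
static int
toy_elog(int level, const char *fmt, ...)
{
    va_list ap;
    int     n;

    fprintf(stderr, "[%d] ", level);
    va_start(ap, fmt);
    n = vfprintf(stderr, fmt, ap);   /* characters in the formatted message */
    va_end(ap);
    fputc('\n', stderr);
    return n;
}

#define TOY_DEBUG 1

/*
 * The macro takes ONE argument: the whole parenthesized argument list.
 * The preprocessor therefore needs no variadic-macro support.
 */
#ifdef TOY_DEBUG
#define TOY_elog(args) toy_elog args
#else
#define TOY_elog(args) ((void) 0)
#endif
```

A call is written with doubled parentheses, e.g. TOY_elog((1, "x=%d", 7)).  With TOY_DEBUG defined it expands to toy_elog (1, "x=%d", 7); without it, the whole call collapses to a no-op expression.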

> Or we could decide we don't need this debugging crud anymore and just
> nuke it all.

I'm removing one which seems pointless, but keeping the others for now.
We can always remove them later.  (I also left BRIN_DEBUG turned on by
default; I'm turning it off.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

David Rowley

Date:

08 November 2014, 06:41:45

On Sat, Nov 8, 2014 at 8:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I just pushed this, after some more minor tweaks. Thanks, and please do
> continue testing!

Here's another small fix for some unused variable warnings. Unfortunately this Microsoft compiler that I'm using does not know about __attribute__((unused)), so some warnings are generated for these:

    BrinTuple *tmptup PG_USED_FOR_ASSERTS_ONLY;
    BrinMemTuple *tmpdtup PG_USED_FOR_ASSERTS_ONLY;
    Size tmpsiz PG_USED_FOR_ASSERTS_ONLY;

The attached patch moves these into within the #ifdef USE_ASSERT_CHECKING section.

I know someone will ask, so let me explain: The reason I don't see a bunch of other warnings for PG_USED_FOR_ASSERTS_ONLY vars when compiling without assert checks is that this Microsoft compiler seems to be ok with variables being assigned values and the values never being used, but if the variable is never assigned a value, then it'll warn you of that.

Regards

David Rowley

Attachment

brin_unused_variables_fix.diff

Re: BRIN indexes - TRAP: BadArgument

From

David Rowley

Date:

08 November 2014, 08:40:46

On Sat, Nov 8, 2014 at 8:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I just pushed this, after some more minor tweaks. Thanks, and please do
> continue testing!

Please find attached another small fix. This time it's just a small typo in the README, and just some updates to some, now outdated docs.

Kind Regards

David Rowley

Attachment

brin_docs_fix.diff

Re: BRIN indexes - TRAP: BadArgument

From

Michael Paquier

Date:

09 November 2014, 06:06:41

On Sat, Nov 8, 2014 at 5:40 PM, David Rowley <dgrowleyml@gmail.com> wrote:
> Please find attached another small fix. This time it's just a small typo in
> the README, and just some updates to some, now outdated docs.
Speaking about the feature... The index operators are still named with
"minmax", wouldn't it be better to switch to "brin"?
Regards,
-- 
Michael

Re: BRIN indexes - TRAP: BadArgument

From

Heikki Linnakangas

Date:

09 November 2014, 09:18:38

On 11/09/2014 08:06 AM, Michael Paquier wrote:
> On Sat, Nov 8, 2014 at 5:40 PM, David Rowley <dgrowleyml@gmail.com> wrote:
>> Please find attached another small fix. This time it's just a small typo in
>> the README, and just some updates to some, now outdated docs.
> Speaking about the feature... The index operators are still named with
> "minmax", wouldn't it be better to switch to "brin"?

All the built-in opclasses still implement the min-max policy - they 
store the min and max values. BRIN supports other kinds of opclasses, 
like storing a containing box for points, but no such opclasses have 
been implemented yet.

Speaking of which, Alvaro, any chance we could get such an opclass still 
included into 9.5? It would be nice to have one, just to be sure that 
nothing minmax-specific has crept into the BRIN code.

- Heikki

Re: BRIN indexes - TRAP: BadArgument

From

Fujii Masao

Date:

09 November 2014, 13:31:05

On Sat, Nov 8, 2014 at 4:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> I just pushed this, after some more minor tweaks.

Nice!

> Thanks, and please do continue testing!

I got the following PANIC error in the standby server when I set up
the replication servers and ran "make installcheck". Note that I was
repeating the manual CHECKPOINT every second while "installcheck"
was running. Without the checkpoints, I could not reproduce the
problem. I'm not sure if CHECKPOINT really triggers this problem, though.
Anyway BRIN seems to have a problem around its WAL replay.

2014-11-09 22:19:42 JST sby1 WARNING:  page 547 of relation
base/16384/30878 does not exist
2014-11-09 22:19:42 JST sby1 CONTEXT:  xlog redo BRIN/UPDATE: rel
1663/16384/30878 heapBlk 6 revmapBlk 1 pagesPerRange 1 old TID (3,2)
TID (547,2)
2014-11-09 22:19:42 JST sby1 PANIC:  WAL contains references to invalid pages
2014-11-09 22:19:42 JST sby1 CONTEXT:  xlog redo BRIN/UPDATE: rel
1663/16384/30878 heapBlk 6 revmapBlk 1 pagesPerRange 1 old TID (3,2)
TID (547,2)
2014-11-09 22:19:47 JST sby1 LOG:  startup process (PID 15230) was
terminated by signal 6: Abort trap
2014-11-09 22:19:47 JST sby1 LOG:  terminating any other active server processes

Regards,

-- 
Fujii Masao

Re: BRIN indexes - TRAP: BadArgument

From

Greg Stark

Date:

09 November 2014, 17:06:50

On Sun, Nov 9, 2014 at 9:18 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Speaking of which, Alvaro, any chance we could get such an opclass still
> included into 9.5? It would be nice to have one, just to be sure that
> nothing minmax-specific has crept into the BRIN code.

I'm trying to do a bloom filter Brin index. I'm already a bit puzzled
by a few things but I've just started so maybe it'll become clear.

From what I've seen so far it feels more likely there's the opposite.
There's some boilerplate that I'm doing that feels like it could be
pushed down into general Brin code since it'll be the same for every
access method.

-- 
greg

Re: BRIN indexes - TRAP: BadArgument

From

Greg Stark

Date:

09 November 2014, 17:58:18

On Sun, Nov 9, 2014 at 5:06 PM, Greg Stark <stark@mit.edu> wrote:
> I'm trying to do a bloom filter Brin index. I'm already a bit puzzled
> by a few things but I've just started so maybe it'll become clear.

So some quick comments from pretty early goings -- partly because I'm
afraid once I get past them I'll forget what it was I was confused
by....

1) The manual describes the extensibility API including the BrinOpcInfo
struct -- but it doesn't define the BrinDesc struct that every API
method takes. It's not clear what exactly that argument is for or how
to make use of it.

2) The mention about additional opclass operators and to number them
from 11 up is fine -- but there's no explanation of how to decide what
operators need to be explicitly added like that. Specifically I gather
from reading minmax that = is handled internally by Brin and you only
need to add any other operators aside from = ? Is that right?

3) It's not entirely clear in the docs when each method will be
invoked. Specifically it's not clear whether opcInfo is invoked once
when the index is defined or every time the definition is loaded to be
used. I gather it's the latter? Perhaps there needs to be a method
that's invoked specifically when the index is defined? I'm wondering
where I'm going to hook in the logic to determine the size and number
of hash functions to use for the bloom filter which needs to be
decided once when the index is created and then static for the index
in the future.

4) It doesn't look like BRIN handles cross-type operators at all. For
example this query with btree indexes can use the index just fine
because it looks up the operator based on both the left and right
operands:

::***# explain select * from data where i = 1::smallint;
┌─────────────────────────────────────────────────────────────────────┐
│                             QUERY PLAN                              │
├─────────────────────────────────────────────────────────────────────┤
│ Index Scan using btree_i on data  (cost=0.42..8.44 rows=1 width=14) │
│   Index Cond: (i = 1::smallint)                                     │
└─────────────────────────────────────────────────────────────────────┘
(2 rows)

But Minmax opclasses don't contain the cross-type operators and in
fact looking at the code I don't think minmax would be able to cope
(minmax_get_procinfo doesn't even get passed the type in the qual,
only the type of the column).

::***# explain select * from data2 where i = 1::smallint;
┌──────────────────────────────────────────────────────────┐
│                        QUERY PLAN                        │
├──────────────────────────────────────────────────────────┤
│ Seq Scan on data2  (cost=0.00..18179.00 rows=1 width=14) │
│   Filter: (i = 1::smallint)                              │
└──────────────────────────────────────────────────────────┘
(2 rows)

Time: 0.544 ms

-- 
greg

Re: BRIN indexes - TRAP: BadArgument

From

Greg Stark

Date:

10 November 2014, 00:10:03

On Sun, Nov 9, 2014 at 5:57 PM, Greg Stark <stark@mit.edu> wrote:
> 2) The mention about additional opclass operators and to number them
> from 11 up is fine -- but there's no explanation of how to decide what
> operators need to be explicitly added like that. Specifically I gather
> from reading minmax that = is handled internally by Brin and you only
> need to add any other operators aside from = ? Is that right?

I see I totally misunderstood the use of the opclass procedure
functions. I think I understand now but just to be sure -- If I can
only handle BTEqualStrategyNumber keys then is it adequate to just
define the opclass containing only the equality operator?

Somehow I got confused between the amprocs that minmax uses to
implement the consistency function and the amops that the brin index
supports.

-- 
greg

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

10 November 2014, 01:46:55

Heikki Linnakangas wrote:

> Speaking of which, Alvaro, any chance we could get such an opclass still
> included into 9.5? It would be nice to have one, just to be sure that
> nothing minmax-specific has crept into the BRIN code.

Emre Hasegeli contributed a patch for range types.  I am hoping he will
post a rebased version that we can consider including.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Amit Langote

Date:

10 November 2014, 01:54:20

On Sun, Nov 9, 2014 at 10:30 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, Nov 8, 2014 at 4:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> I just pushed this, after some more minor tweaks.
>
> Nice!
>
>> Thanks, and please do continue testing!
>
> I got the following PANIC error in the standby server when I set up
> the replication servers and ran "make installcheck". Note that I was
> repeating the manual CHECKPOINT every second while "installcheck"
> was running. Without the checkpoints, I could not reproduce the
> problem. I'm not sure if CHECKPOINT really triggers this problem, though.
> Anyway BRIN seems to have a problem around its WAL replay.
>
> 2014-11-09 22:19:42 JST sby1 WARNING:  page 547 of relation
> base/16384/30878 does not exist
> 2014-11-09 22:19:42 JST sby1 CONTEXT:  xlog redo BRIN/UPDATE: rel
> 1663/16384/30878 heapBlk 6 revmapBlk 1 pagesPerRange 1 old TID (3,2)
> TID (547,2)
> 2014-11-09 22:19:42 JST sby1 PANIC:  WAL contains references to invalid pages
> 2014-11-09 22:19:42 JST sby1 CONTEXT:  xlog redo BRIN/UPDATE: rel
> 1663/16384/30878 heapBlk 6 revmapBlk 1 pagesPerRange 1 old TID (3,2)
> TID (547,2)
> 2014-11-09 22:19:47 JST sby1 LOG:  startup process (PID 15230) was
> terminated by signal 6: Abort trap
> 2014-11-09 22:19:47 JST sby1 LOG:  terminating any other active server processes
>

I could reproduce this using the same steps. It's the same page 547
here too, if that's of any help.

Thanks,
Amit

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

10 November 2014, 20:40:36

Fujii Masao wrote:

> I got the following PANIC error in the standby server when I set up
> the replication servers and ran "make installcheck". Note that I was
> repeating the manual CHECKPOINT every second while "installcheck"
> was running. Without the checkpoints, I could not reproduce the
> problem. I'm not sure if CHECKPOINT really triggers this problem, though.
> Anyway BRIN seems to have a problem around its WAL replay.

Hm, I think I see what's happening.  The xl_brin_update record
references two buffers, one which is target for the updated tuple and
another which is the revmap buffer.  When the update target buffer is
being first used we set the INIT bit which removes the buffer reference
from the xlog record; in that case, if the revmap buffer is first being
modified after the prior checkpoint, that revmap buffer receives backup
block number 0; but the code hardcodes it as 1 on the expectation that
the buffer that's target for the update will receive 0.  The attached
patch should fix this.
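
Reduced to its core, the slot selection described above might look like the sketch below.  The helper name is hypothetical and this is not the attached patch, only an illustration of why a hardcoded 1 is wrong when the target page is being initialized:

```c
#include <stdbool.h>

/*
 * Hypothetical helper: which backup-block slot the revmap buffer
 * occupies in an update record.  If the target page is freshly
 * initialized (INIT bit set), its buffer reference is dropped from the
 * record, so the revmap buffer moves from slot 1 to slot 0.
 */
static int
toy_revmap_backup_slot(bool target_page_init)
{
    return target_page_init ? 0 : 1;
}
```

The replay side then has to ask for the slot computed this way rather than assuming the revmap buffer is always second.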

I cannot reproduce the issue after applying this patch, can you please
confirm that it fixes the issue for you as well?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

brinxlog.patch

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

10 November 2014, 21:16:05

Alvaro Herrera wrote:

> Hm, I think I see what's happening.  The xl_brin_update record
> references two buffers, one which is target for the updated tuple and
> another which is the revmap buffer.  When the update target buffer is
> being first used we set the INIT bit which removes the buffer reference
> from the xlog record; in that case, if the revmap buffer is first being
> modified after the prior checkpoint, that revmap buffer receives backup
> block number 0; but the code hardcodes it as 1 on the expectation that
> the buffer that's target for the update will receive 0.  The attached
> patch should fix this.

Pushed, thanks for the report.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

10 November 2014, 21:41:53

Greg Stark wrote:
> On Sun, Nov 9, 2014 at 5:57 PM, Greg Stark <stark@mit.edu> wrote:
> > 2) The mention about additional opclass operators and to number them
> > from 11 up is fine -- but there's no explanation of how to decide what
> > operators need to be explicitly added like that. Specifically I gather
> > from reading minmax that = is handled internally by Brin and you only
> > need to add any other operators aside from = ? Is that right?
> 
> I see I totally misunderstood the use of the opclass procedure
> functions. I think I understand now but just to be sure -- If I can
> only handle BTEqualStrategyNumber keys then is it adequate to just
> define the opclass containing only the equality operator?

Yes.

I agree that this deserves some more documentation.  In a nutshell, the
opclass must provide three separate groups of items:

1. the mandatory support functions, opcInfo, addValue, Union,
Consistent.  opcInfo is invoked each time the index is accessed
(including during index creation).

2. the additional support functions; normally these are called from
within addValue, Consistent, Union.  For minmax, what we provide is the
functions that implement the inequality operators for the type, that is
< <= >= and >.  Since minmax tries to be generic and support a whole lot
of types, this is the way that the mandatory support functions know what
functions to call to compare two given values.  If the opclass is
specific to one data type, you might not need anything here; or perhaps
you have other ways to figure out a hash function to call, etc.

3. the operators.  We only use these so that the optimizer picks up the
index for queries.
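
To make the split between groups 1 and 2 concrete, here is a minimal sketch in C of how a generic Consistent function might reach the type-specific comparison functions. Everything in it is an assumption made for illustration: the struct, the function names and the use of plain int values are hypothetical, and only the strategy numbers follow the usual btree numbering (1 for <, 2 for <=, 3 for =, 4 for >=, 5 for >).

```c
#include <stdbool.h>

/* Group 2: per-type comparison functions looked up by the opclass */
typedef bool (*cmp_fn) (int a, int b);

typedef struct MinMaxCompare    /* hypothetical container */
{
    cmp_fn lt, le, ge, gt;
} MinMaxCompare;

enum { STRAT_LT = 1, STRAT_LE, STRAT_EQ, STRAT_GE, STRAT_GT };

bool int_lt(int a, int b) { return a < b; }
bool int_le(int a, int b) { return a <= b; }
bool int_ge(int a, int b) { return a >= b; }
bool int_gt(int a, int b) { return a > b; }

const MinMaxCompare int_ops = { int_lt, int_le, int_ge, int_gt };

/*
 * Group 1: a generic Consistent check.  Returns true if a page range
 * summarized as [min, max] might contain a row satisfying "col <op> key".
 */
bool
minmax_consistent(const MinMaxCompare *cmp, int min, int max,
                  int strategy, int key)
{
    switch (strategy)
    {
        case STRAT_LT: return cmp->lt(min, key);
        case STRAT_LE: return cmp->le(min, key);
        case STRAT_EQ: return cmp->le(min, key) && cmp->ge(max, key);
        case STRAT_GE: return cmp->ge(max, key);
        case STRAT_GT: return cmp->gt(max, key);
        default:       return true;    /* unknown strategy: cannot exclude */
    }
}
```

Note that equality needs no function of its own here; it is answered with the <= and >= functions, which is consistent with an opclass that handles only equality keys not needing extra entries.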

> Somehow I got confused between the amprocs that minmax uses to
> implement the consistency function and the amops that the brin index
> supports.

I think it is somewhat confusing, yeah.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

10 November 2014, 21:41:54

Greg Stark wrote:

> 1) The manual describes the extensibility API including the BrinOpcInfo
> struct -- but it doesn't define the BrinDesc struct that every API
> method takes. It's not clear what exactly that argument is for or how
> to make use of it.

Hm, I guess this could use some expansion.

> 2) The mention about additional opclass operators and to number them
> from 11 up is fine -- but there's no explanation of how to decide what
> operators need to be explicitly added like that. Specifically I gather
> from reading minmax that = is handled internally by Brin and you only
> need to add any other operators aside from = ? Is that right?

I think I already replied to this in the other email.

> 3) It's not entirely clear in the docs when each method will be
> invoked. Specifically it's not clear whether opcInfo is invoked once
> when the index is defined or every time the definition is loaded to be
> used. I gather it's the latter? Perhaps there needs to be a method
> that's invoked specifically when the index is defined? I'm wondering
> where I'm going to hook in the logic to determine the size and number
> of hash functions to use for the bloom filter which needs to be
> decided once when the index is created and then static for the index
> in the future.

Every time the index is accessed, yeah.  I'm not sure about figuring the
initial creation details.  Do you think we need another support
procedure to help with that?  We can add it if needed; minmax would just
define it to InvalidOid.

> 4) It doesn't look like BRIN handles cross-type operators at all.

The idea here is that there is a separate opclass to handle cross-type
operators, which would be together in the same opfamily as the opclass
used to create the index.  I haven't actually tried this yet, mind you.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Greg Stark

Date:

10 November 2014, 22:44:33

On Mon, Nov 10, 2014 at 9:31 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Every time the index is accessed, yeah.  I'm not sure about figuring the
> initial creation details.  Do you think we need another support
> procedure to help with that?  We can add it if needed; minmax would just
> define it to InvalidOid.

I have a working bloom filter with hard coded filter size and hard
coded number of hash functions. I need to think about how I'm going to
make it more general now. I think the answer is that I should have an
index option that specifies the false positive rate and calculates the
optimal filter size and number of hash functions. It might possibly
need to peek at the table statistics to determine the population size
though. Or perhaps I should bite the bullet and size the bloom filters
based on the actual number of rows in a chunk since the BRIN
infrastructure does allow each summary to be a different size.
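
For reference, deriving the filter size and number of hash functions from a false positive rate could look like the sketch below. It uses the standard Bloom filter formulas (m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions), which are not taken from this thread. To avoid floating-point logarithms it assumes the target rate is given as a power of two, p = 2^-b, in which case the formulas reduce to k = b and m = n * b / ln 2. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* 1 / ln(2): bits needed per element for each halving of the error rate */
#define INV_LN2 1.4426950408889634

typedef struct BloomParams      /* hypothetical name */
{
    uint64_t nbits;             /* filter size m, in bits */
    unsigned nhashes;           /* number of hash functions k */
} BloomParams;

/*
 * Size a Bloom filter for nvalues expected distinct values and a target
 * false positive rate of 2^-fpr_log2.
 */
BloomParams
bloom_size(uint64_t nvalues, unsigned fpr_log2)
{
    BloomParams p;
    double bits = (double) nvalues * (double) fpr_log2 * INV_LN2;

    p.nbits = (uint64_t) bits;
    if ((double) p.nbits < bits)
        p.nbits++;              /* round up to a whole bit */
    p.nhashes = fpr_log2;
    return p;
}
```

With 1000 expected values and fpr_log2 = 7 (a rate of roughly 0.8%), this yields 10099 bits and 7 hash functions. Sizing per chunk, as suggested above, would mean calling it with the row count of each page range rather than a table-wide estimate.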

There's another API question I have. To implement Consistent I need to
call the hash function which in the case of functions like hashtext
could be fairly expensive and I even need to generate multiple hash
values (though currently I'm slicing them all from the integer hash
value so that's not too bad) and then test each of those bits. It
would be natural to call hashtext once at the start of the scan and
possibly build a bitmap and compare all of them in a single &
operation. But afaict there's no way to hook the beginning of the scan
and opaque is not associated with the specific scan so I don't think I
can cache the hash value of the scan key there safely. Is there a good
way to do it with the current API?
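
The "slice several hash values from one integer hash, then test with a single &" idea can be sketched as follows. This is only an illustration under stated assumptions: the filter is a single 64-bit word, each probe position uses 6 bits of a 32-bit hash (so at most 5 slices), and all names are hypothetical rather than part of any BRIN API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS     64      /* assumed fixed filter size */
#define BITS_PER_SLICE  6       /* log2(FILTER_BITS) */
#define MAX_SLICES      5       /* 5 * 6 = 30 bits fit in a 32-bit hash */

/*
 * Build the probe mask for one key.  This is the part that would ideally
 * run once at the start of a scan, since it needs the expensive hash.
 */
uint64_t
bloom_probe_mask(uint32_t hash, int nhashes)
{
    uint64_t mask = 0;

    assert(nhashes >= 1 && nhashes <= MAX_SLICES);

    for (int i = 0; i < nhashes; i++)
    {
        /* take the i-th 6-bit slice of the hash as a bit position */
        unsigned bit = (hash >> (i * BITS_PER_SLICE)) & (FILTER_BITS - 1);

        mask |= UINT64_C(1) << bit;
    }
    return mask;
}

/* Per-range test: every probed bit must be set in the stored filter */
bool
bloom_maybe_contains(uint64_t filter, uint64_t mask)
{
    return (filter & mask) == mask;
}
```

Adding a value to a summary is then `filter |= mask`, and checking a page range costs one AND and one comparison regardless of the number of hash functions.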

On a side note, I'm curious about something: I was stepping through
my code in gdb and discovered that a single row insert appeared to
construct a new summary then union it into the existing summary
instead of just calling AddValue on the existing summary. Is that
intentional? What led to that?

-- 
greg

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

11 November 2014, 02:14:12

Greg Stark wrote:

> There's another API question I have. To implement Consistent I need to
> call the hash function which in the case of functions like hashtext
> could be fairly expensive and I even need to generate multiple hash
> values (though currently I'm slicing them all from the integer hash
> value so that's not too bad) and then test each of those bits. It
> would be natural to call hashtext once at the start of the scan and
> possibly build a bitmap and compare all of them in a single &
> operation. But afaict there's no way to hook the beginning of the scan
> and opaque is not associated with the specific scan so I don't think I
> can cache the hash value of the scan key there safely. Is there a good
> way to do it with the current API?

I'm not sure why you say opaque is not associated with the specific
scan.  Are you thinking we could reuse opaque for a future scan?  I
think we could consider that opaque *is* the place to cache things such
as the hashed value of the qual constants or whatever.

> On a side note, I'm curious about something: I was stepping through
> my code in gdb and discovered that a single row insert appeared to
> construct a new summary then union it into the existing summary
> instead of just calling AddValue on the existing summary. Is that
> intentional? What led to that?

That's to test the Union procedure; if you look at the code, it's just
used in assert-enabled builds.  Now that I think about it, perhaps this
can turn out to be problematic for your bloom filter opclass.  I
considered the idea of allowing the opclass to disable this testing
procedure, but it isn't done (yet.)
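
The cross-check described above can be pictured with a toy minmax summary over ints. This is a sketch of the idea only, not the code in brin.c: the struct and function names are hypothetical, and NULL handling is left out.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct MinMaxSummary    /* hypothetical toy summary */
{
    bool empty;                 /* no values summarized yet */
    int  min;
    int  max;
} MinMaxSummary;

/* addValue: fold one value into a summary */
void
minmax_add_value(MinMaxSummary *s, int v)
{
    if (s->empty)
    {
        s->min = s->max = v;
        s->empty = false;
        return;
    }
    if (v < s->min) s->min = v;
    if (v > s->max) s->max = v;
}

/* Union: fold summary b into summary a */
void
minmax_union(MinMaxSummary *a, const MinMaxSummary *b)
{
    if (b->empty)
        return;
    if (a->empty)
    {
        *a = *b;
        return;
    }
    if (b->min < a->min) a->min = b->min;
    if (b->max > a->max) a->max = b->max;
}

/*
 * Insert with the assert-only cross-check: build a one-value summary,
 * union it into a copy of the existing summary, and verify that the
 * result matches what addValue produces directly.
 */
void
minmax_insert_checked(MinMaxSummary *s, int v)
{
    MinMaxSummary via_union = *s;
    MinMaxSummary single = { true, 0, 0 };

    minmax_add_value(&single, v);
    minmax_union(&via_union, &single);

    minmax_add_value(s, v);

    assert(s->empty == via_union.empty);
    assert(s->min == via_union.min && s->max == via_union.max);
}
```

A bug in either function makes the two paths diverge, which is why a plain insert can surface Union bugs in an assert-enabled build.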

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Greg Stark

Date:

11 November 2014, 08:15:06

On Tue, Nov 11, 2014 at 2:14 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I'm not sure why you say opaque is not associated with the specific
> scan.  Are you thinking we could reuse opaque for a future scan?  I
> think we could consider that opaque *is* the place to cache things such
> as the hashed value of the qual constants or whatever.

Oh. I guess this goes back to my original suggestion that the API docs
need to explain some sense of when OpcInfo is called. I didn't realize
it was tied to a specific scan. This does raise the question of why
the scan information isn't available in OpcInfo though. That would let
me build the hash value in a natural place instead of having to do it
lazily which I find significantly more awkward.

Is it possible for scan keys to change between calls for nested loop
joins or quirky SQL with volatile functions in the scan or anything? I
guess that would prevent the index scan from being used at all. But can
I be reassured that OpcInfo will be called again when a cached
plan is reexecuted? Stable functions might have new values in a
subsequent execution even if the plan hasn't changed at all for
example.

> That's to test the Union procedure; if you look at the code, it's just
> used in assert-enabled builds.  Now that I think about it, perhaps this
> can turn out to be problematic for your bloom filter opclass.  I
> considered the idea of allowing the opclass to disable this testing
> procedure, but it isn't done (yet.)

No, it isn't a problem for my opclass other than performance, it was
quite helpful in turning up bugs early in fact. It was just a bit
confusing because I was trying to test things one by one and it turned
out the assertion checks meant a simple insert turned up bugs in Union
which I hadn't expected. But it seems perfectly sensible in an
assertion check.

-- 
greg

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

11 November 2014, 12:12:47

Greg Stark wrote:
> On Tue, Nov 11, 2014 at 2:14 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > I'm not sure why you say opaque is not associated with the specific
> > scan.  Are you thinking we could reuse opaque for a future scan?  I
> > think we could consider that opaque *is* the place to cache things such
> > as the hashed value of the qual constants or whatever.
> 
> Oh. I guess this goes back to my original suggestion that the API docs
> need to explain some sense of when OpcInfo is called. I didn't realize
> it was tied to a specific scan. This does raise the question of why
> the scan information isn't available in OpcInfo though. That would let
> me build the hash value in a natural place instead of having to do it
> lazily which I find significantly more awkward.

Hmm.  OpcInfo is also called in contexts other than scans, though, so
passing down scan keys into it seems wrong.  Maybe we do need another
amproc that "initializes" the scan for the opclass, which would get
whatever got returned from opcinfo as well as scankeys.  There you would
have the opportunity to run the hash and store it into the opaque.

> Is it possible for scan keys to change between calls for nested loop
> joins or quirky SQL with volatile functions in the scan or anything? I
> guess that would prevent the index scan from being used at all. But I
> can be reassured the Opcinfo call will be called again when a cached
> plan is reexecuted? Stable functions might have new values in a
> subsequent execution even if the plan hasn't changed at all for
> example.

As far as I understand, the scan keys don't change within any given
scan; if they do, the rescan AM method is called, at which point we
should reset whatever is cached about the previous scan.
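
A minimal sketch of that caching discipline, assuming the opclass keeps a small struct in its opaque area: compute the hash of the scan key lazily on first use, reuse it for every page range, and invalidate it when a rescan arrives. The struct, the function names and the FNV-1a hash (standing in for something like hashtext) are all hypothetical and not part of the BRIN API.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct ScanKeyCache     /* hypothetical contents of "opaque" */
{
    bool     valid;             /* is hash current for this scan? */
    uint32_t hash;              /* cached hash of the scan key */
    unsigned ncomputed;         /* instrumentation: hash computations */
} ScanKeyCache;

/* Stand-in for an expensive type-specific hash function (FNV-1a) */
static uint32_t
fnv1a(const char *s)
{
    uint32_t h = 2166136261u;

    for (; *s != '\0'; s++)
    {
        h ^= (unsigned char) *s;
        h *= 16777619u;
    }
    return h;
}

/* Called from Consistent: hash the key once, then reuse the result */
uint32_t
scan_key_hash(ScanKeyCache *cache, const char *key)
{
    if (!cache->valid)
    {
        cache->hash = fnv1a(key);
        cache->valid = true;
        cache->ncomputed++;
    }
    return cache->hash;
}

/* Called on rescan: the keys may have changed, so drop the cached hash */
void
scan_key_cache_reset(ScanKeyCache *cache)
{
    cache->valid = false;
}
```

The lazy check in scan_key_hash is the part a dedicated scan-initialization procedure would make unnecessary.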

> > That's to test the Union procedure; if you look at the code, it's just
> > used in assert-enabled builds.  Now that I think about it, perhaps this
> > can turn out to be problematic for your bloom filter opclass.  I
> > considered the idea of allowing the opclass to disable this testing
> > procedure, but it isn't done (yet.)
> 
> No, it isn't a problem for my opclass other than performance, it was
> quite helpful in turning up bugs early in fact. It was just a bit
> confusing because I was trying to test things one by one and it turned
> out the assertion checks meant a simple insert turned up bugs in Union
> which I hadn't expected. But it seems perfectly sensible in an
> assertion check.

Great, thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Greg Stark

Date:

11 November 2014, 12:39:08

On Tue, Nov 11, 2014 at 12:12 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> As far as I understand, the scan keys don't change within any given
> scan; if they do, the rescan AM method is called, at which point we
> should reset whatever is cached about the previous scan.

But am I guaranteed that rescan will throw away the opcinfo struct and
its opaque element? I guess that's the heart of the uncertainty I had.

-- 
greg

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

11 November 2014, 12:52:45

Greg Stark wrote:
> On Tue, Nov 11, 2014 at 12:12 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > As far as I understand, the scan keys don't change within any given
> > scan; if they do, the rescan AM method is called, at which point we
> > should reset whatever is cached about the previous scan.
>
> But am I guaranteed that rescan will throw away the opcinfo struct and
> its opaque element? I guess that's the heart of the uncertainty I had.

Well, it should, and if not that's a bug, which should be fixed by the
attached (untested) patch.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

reinit-opaque.patch

Re: BRIN indexes - TRAP: BadArgument

From

Greg Stark

Date:

11 November 2014, 13:00:24

It might be clearer to have an opclassinfo and a scaninfo which can
store information in separate opc_opaque and scan_opaque fields with
distinct lifetimes.

In the bloom filter case the longlived info is the (initial?) size of
the bloom filter and the number of hash functions. But I still haven't
determined how much it will cost to recalculate them. Right now
they're just hard coded so it doesn't hurt to do it on every rescan
but if it involves peeking at the index reloptions or stats that might
be impractical.

Re: BRIN indexes - TRAP: BadArgument

From

Alvaro Herrera

Date:

11 November 2014, 13:04:12

Greg Stark wrote:
> It might be clearer to have an opclassinfo and a scaninfo which can
> store information in separate opc_opaque and scan_opaque fields with
> distinct lifetimes.
> 
> In the bloom filter case the longlived info is the (initial?) size of
> the bloom filter and the number of hash functions. But I still haven't
> determined how much it will cost to recalculate them. Right now
> they're just hard coded so it doesn't hurt to do it on every rescan
> but if it involves peeking at the index reloptions or stats that might
> be impractical.

Patches welcome :-)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BRIN indexes - TRAP: BadArgument

From

Greg Stark

Date:

11 November 2014, 13:14:31

On Tue, Nov 11, 2014 at 1:04 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> It might be clearer ...
>
> Patches welcome :-)

Or perhaps there could still be a single opaque field but have two
optional opclass methods "scaninit" and "rescan" which allow the op
class to set or reset whichever fields inside opaque that need to be
reset.

-- 
greg

Re: BRIN indexes - TRAP: BadArgument

From

Amit Kapila

Date:

17 November 2014, 04:13:33

On Sat, Nov 8, 2014 at 1:26 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>
> I just pushed this, after some more minor tweaks. Thanks, and please do
> continue testing!
>

A few typos and a few questions:

1. * range. Need to an extra flag in mmtuples for that.

Datum
brinbulkdelete(PG_FUNCTION_ARGS)

Doesn't the part of the comment referring to *mmtuples* require some change?
I think mmtuples was used in the initial version of the patch.

/* ---------------
 * mt_info is laid out in the following fashion:
 * 7th (high) bit: has nulls
 * 6th bit: is placeholder tuple
 * 5th bit: unused
 * 4-0 bit: offset of data
 * ---------------
uint8 bt_info;
} BrinTuple;

Here in the comments, bt_info is referred to as mt_info.

 * t_info manipulation macros
#define BRIN_OFFSET_MASK 0x1F

I think in the above comment it should be bt_info, rather than t_info.
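
For readers following the bit layout quoted above, here is a small sketch of decoding bt_info. BRIN_OFFSET_MASK is quoted from the patch; the other two mask values are derived here from the comment's layout (6th bit = 0x40, 7th bit = 0x80), and the helper function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define BRIN_OFFSET_MASK        0x1F    /* bits 4-0: offset of data */
#define BRIN_PLACEHOLDER_MASK   0x40    /* 6th bit: is placeholder tuple */
#define BRIN_NULLS_MASK         0x80    /* 7th (high) bit: has nulls */

/* Offset of the data area within the tuple */
unsigned
bt_info_data_offset(uint8_t bt_info)
{
    return bt_info & BRIN_OFFSET_MASK;
}

/* Is this a placeholder tuple? */
bool
bt_info_is_placeholder(uint8_t bt_info)
{
    return (bt_info & BRIN_PLACEHOLDER_MASK) != 0;
}

/* Does the tuple carry a nulls bitmask? */
bool
bt_info_has_nulls(uint8_t bt_info)
{
    return (bt_info & BRIN_NULLS_MASK) != 0;
}
```

Because the offset occupies only five bits, the data area can start at most 31 bytes into the tuple.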

static void
revmap_physical_extend(BrinRevmap *revmap)
{
    START_CRIT_SECTION();

    /* the rm_tids array is initialized to all invalid by PageInit */
    brin_page_init(page, BRIN_PAGETYPE_REVMAP);
    MarkBufferDirty(buf);

    metadata->lastRevmapPage = mapBlk;
    MarkBufferDirty(revmap->rm_metaBuf);
}

Can't we update revmap->rm_lastRevmapPage along with metadata->lastRevmapPage?

typedef struct BrinMemTuple
{
    bool bt_placeholder; /* this is a placeholder tuple */
    BlockNumber bt_blkno; /* heap blkno that the tuple is for */
    MemoryContext bt_context; /* memcxt holding the dt_column values */
}

How is this memory context getting used?

I can see that it is used in brin_deform_tuple(), which gets called from
three other places in core code: bringetbitmap(), brininsert() and
union_tuples(). In all three places there is already another temporary
memory context used to avoid any form of memory leaks.

Is there any way to force BRIN index usage to be off? If not, do we need
one, as is present for other types of scans, like
set enable_indexscan=off;

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: BRIN range operator class

From

Emre Hasegeli

Date:

14 December 2014, 20:05:43

> I thought we can do better than minmax for the inet data type,
> and ended up with a generalized opclass supporting both inet and range
> types.  Patch based on minmax-v20 attached.  It works well except
> a few small problems.  I will improve the patch and add into
> a commitfest after BRIN framework is committed.

I wanted to send a new version before the commitfest to get some
feedback, but it is still work in progress.  Patch attached rebased
to the current HEAD.  This version supports more operators and
box from geometric data types.  Opclasses are renamed to inclusion_ops
to be more generic.  The problems I mentioned remain because I
couldn't solve them without touching the BRIN framework.

> To support more operators I needed to change amstrategies and
> amsupport on the catalog.  It would be nice if amsupport can be set
> to 0 like am strategies.

I think it would be nicer to get the functions from the operators
using the strategy numbers instead of adding them directly as
support functions.  I looked around a bit but couldn't find
a sensible way to support it.  Is it possible without adding them
to the RelationData struct?

> Inet data types accept IP version 4 and version 6.  It isn't possible
> to represent union of addresses from different versions with a valid
> inet type.  So, I made the union function return NULL in this case.
> Then, I tried to store if returned value is NULL or not, in
> column->values[] as boolean, but it failed on the pfree() inside
> brin_dtuple_initilize().  It doesn't seem right to free the values
> based on attr->attbyval.

This problem remains.  There is also a similar problem with the
range types, namely empty ranges.  There should be special cases
for them on some of the strategies.  I tried to solve the problems
in several different ways, but got a segfault on one line or another.
This makes me think that the BRIN framework doesn't support storing
types other than that of the indexed column in the values array.
For example, brin_deform_tuple() iterates over the values array and
copies them using the length of the attr on the index, not the length
of the type defined by the OpcInfo function.  If storing other types
isn't supported, why is it required to return OIDs from the OpcInfo
function?  I am confused.

I didn't try to support other geometric types than box as I couldn't
manage to store a different type on the values array, but it would
be nice to get some feedback about the overall design.  I was
thinking to add a STORAGE parameter to the index to support other
geometric types.  I am not sure that adding the STORAGE parameter
to be used by the opclass implementation is the right way.  It
wouldn't be the actual thing that is stored by the index, it will be
an element in the values array.  Maybe, data type specific opclasses
is the way to go, not a generic one as I am trying.

Attachment

brin-inclusion-v02.patch

Re: BRIN range operator class

From

Andreas Karlsson

Date:

11 January 2015, 00:36:35

Hi,

I made a quick review of your patch, but I would like to see someone
who was involved in the BRIN work comment on Emre's design issues. I 
will try to answer them as best as I can below.

I think minmax indexes on range types seem very useful, and inet/cidr
too. I have no idea about geometric types. But we need to fix the issues 
with empty ranges and IPv4/IPv6 for these indexes to be useful.

= Review

The current code compiles but the brin test suite fails.

I tested the indexes a bit and they seem to work fine, except for cases 
where we know it to be broken like IPv4/IPv6.

The new code is generally clean and readable.

I think some things should be broken out in separate patches since they 
are unrelated to this patch.

- The addition of &< and >& on inet types.

- The fix in brin_minmax.c.

Your brin tests seem to forget &< and >& for inet types.

The tests should preferably be extended to support ipv6 and empty ranges 
once we have fixed support for those cases.

The /* If the it is all nulls, it cannot possibly be consistent. */ 
comment is different from the equivalent comment in brin_minmax.c. I do 
not see why they should be different.

In brin_inclusion_union() the "if (col_b->bv_allnulls)" is done after
handling has_nulls, which is unlike what is done in brin_minmax_union().
Which code is right? I am leaning towards the code in
brin_inclusion_union() since you can have all_nulls without has_nulls.

On 12/14/2014 09:04 PM, Emre Hasegeli wrote:
>> To support more operators I needed to change amstrategies and
>> amsupport on the catalog.  It would be nice if amsupport can be set
>> to 0 like am strategies.
>
> I think it would be nicer to get the functions from the operators
> with using the strategy numbers instead of adding them directly as
> support functions.  I looked around a bit but couldn't find
> a sensible way to support it.  Is it possible without adding them
> to the RelationData struct?

Yes that would be nice, but I do not think the current solution is terrible.

> This problem remains.  There is also a similar problem with the
> range types, namely empty ranges.  There should be special cases
> for them on some of the strategies.  I tried to solve the problems
> in several different ways, but got a segfault one line or another.
> This makes me think that BRIN framework doesn't support to store
> different types than the indexed column in the values array.
> For example, brin_deform_tuple() iterates over the values array and
> copies them using the length of the attr on the index, not the length
> of the type defined by OpcInfo function.  If storing another types
> aren't supported, why is it required to return oid's on the OpcInfo
> function.  I am confused.

I leave this to someone more knowledgeable about BRIN to answer.

> I didn't try to support other geometric types than box as I couldn't
> managed to store a different type on the values array, but it would
> be nice to get some feedback about the overall design.  I was
> thinking to add a STORAGE parameter to the index to support other
> geometric types.  I am not sure that adding the STORAGE parameter
> to be used by the opclass implementation is the right way.  It
> wouldn't be the actual thing that is stored by the index, it will be
> an element in the values array.  Maybe, data type specific opclasses
> is the way to go, not a generic one as I am trying.

I think a STORAGE parameter sounds like a good idea. Could it also be 
used to solve the issue with IPv4/IPv6 by setting the storage type to 
custom? Or is that the wrong way to fix things?

-- 
Andreas Karlsson

Re: BRIN range operator class

From

Alvaro Herrera

Date:

22 January 2015, 20:18:42

Can you please break up this patch?  I think I see three patches,

1. add sql-callable functions such as inet_merge, network_overright, etc
etc.  These need documentation and a trivial regression test
somewhere.

2. necessary changes to header files (skey.h etc)

3. the inclusion opclass itself

Thanks

BTW the main idea behind having opcinfo return the type oid was to tell
the index what was stored in the index.  If that doesn't work right now,
maybe it needs some tweak to the brin framework code.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Emre Hasegeli

Date:

11 February 2015, 18:34:51

Thank you for looking at my patch again.  New version is attached
with a lot of changes and point data type support.

> I think minmax indexes on range types seem very useful, and inet/cidr too.
> I have no idea about geometric types. But we need to fix the issues with
> empty ranges and IPv4/IPv6 for these indexes to be useful.

Both of the cases are fixed on the new version.

> The current code compiles but the brin test suite fails.

Now, only a test in .

> I tested the indexes a bit and they seem to work fine, except for cases
> where we know it to be broken like IPv4/IPv6.
>
> The new code is generally clean and readable.
>
> I think some things should be broken out in separate patches since they are
> unrelated to this patch.

Yes but they were also required by this patch.  This version adds more
functions and operators.  I can split them appropriately after your
review.

> - The addition of &< and >& on inet types.

I haven't actually added the operators, just the underlying procedures
for them to support basic comparison operators with the BRIN opclass.
I left them out on the new version because of its new design.
We can add the operators later with documentation, tests and index
support.

> - The fix in brin_minmax.c.

It is already committed by Alvaro Herrera.  I can send another patch
to use pg_amop instead of pg_amproc on brin_minmax.c, if it is
acceptable.

> The tests should preferably be extended to support ipv6 and empty ranges
> once we have fixed support for those cases.

Done.

> The /* If the it is all nulls, it cannot possibly be consistent. */ comment
> is different from the equivalent comment in brin_minmax.c. I do not see why
> they should be different.

This is to avoid confusion with empty ranges.  Also, the inclusion
opclass supports types other than ranges, such as box.

> In brin_inclusion_union() the "if (col_b->bv_allnulls)" is done after
> handling has_nulls, which is unlike what is done in brin_minmax_union(),
> which code is right? I am leaning towards the code in brin_inclusion_union()
> since you can have all_nulls without has_nulls.

>> I think it would be nicer to get the functions from the operators
>> with using the strategy numbers instead of adding them directly as
>> support functions.  I looked around a bit but couldn't find
>> a sensible way to support it.  Is it possible without adding them
>> to the RelationData struct?
>
>
> Yes that would be nice, but I do not think the current solution is terrible.

The new version does it this way.  It was required to support
strategies between different types.

>> This problem remains.  There is also a similar problem with the
>> range types, namely empty ranges.  There should be special cases
>> for them on some of the strategies.  I tried to solve the problems
>> in several different ways, but got a segfault one line or another.
>> This makes me think that BRIN framework doesn't support to store
>> different types than the indexed column in the values array.
>> For example, brin_deform_tuple() iterates over the values array and
>> copies them using the length of the attr on the index, not the length
>> of the type defined by OpcInfo function.  If storing another types
>> aren't supported, why is it required to return oid's on the OpcInfo
>> function.  I am confused.
>
>
> I leave this to someone more knowledgeable about BRIN to answer.

I think I have fixed them.

>> I didn't try to support other geometric types than box as I couldn't
>> managed to store a different type on the values array, but it would
>> be nice to get some feedback about the overall design.  I was
>> thinking to add a STORAGE parameter to the index to support other
>> geometric types.  I am not sure that adding the STORAGE parameter
>> to be used by the opclass implementation is the right way.  It
>> wouldn't be the actual thing that is stored by the index, it will be
>> an element in the values array.  Maybe, data type specific opclasses
>> is the way to go, not a generic one as I am trying.
>
>
> I think a STORAGE parameter sounds like a good idea. Could it also be used
> to solve the issue with IPv4/IPv6 by setting the storage type to custom? Or
> is that the wrong way to fix things?

I have fixed different address families by adding another support
function.

I used STORAGE parameter to support the point data type.  To make it
work I added some operators between box and point data type.  We can
support all geometric types with this method.

Attachment

brin-inclusion-v03.patch

Re: BRIN range operator class

From

Michael Paquier

Date:

13 February 2015, 12:57:30

On Thu, Feb 12, 2015 at 3:34 AM, Emre Hasegeli <emre@hasegeli.com> wrote:

> Thank you for looking at my patch again. New version is attached
> with a lot of changes and point data type support.

Patch is moved to next CF 2015-02 as work is still going on.
--

Michael

Re: BRIN range operator class

From

Andreas Karlsson

Date:

08 March 2015, 04:44:41

On 02/11/2015 07:34 PM, Emre Hasegeli wrote:
>> The current code compiles but the brin test suite fails.
>
> Now, only a test in .

Yeah, there is still a test which fails in opr_sanity.

> Yes but they were also required by this patch.  This version adds more
> functions and operators.  I can split them appropriately after your
> review.

Ok, sounds fine to me.

>>> This problem remains.  There is also a similar problem with the
>>> range types, namely empty ranges.  There should be special cases
>>> for them on some of the strategies.  I tried to solve the problems
>>> in several different ways, but got a segfault one line or another.
>>> This makes me think that BRIN framework doesn't support to store
>>> different types than the indexed column in the values array.
>>> For example, brin_deform_tuple() iterates over the values array and
>>> copies them using the length of the attr on the index, not the length
>>> of the type defined by OpcInfo function.  If storing another types
>>> aren't supported, why is it required to return oid's on the OpcInfo
>>> function.  I am confused.
>>
>>
>> I leave this to someone more knowledgeable about BRIN to answer.
>
> I think I have fixed them.

Looks good as far as I can tell.

> I have fixed the handling of different address families by adding another
> support function.
>
> I used the STORAGE parameter to support the point data type.  To make it
> work I added some operators between the box and point data types.  We can
> support all geometric types with this method.

Looks to me like this should work.

= New comments

- Searching for the empty range is slow since the empty range matches 
all brin ranges.

EXPLAIN ANALYZE SELECT * FROM foo WHERE r = '[1,1)';
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=12.01..16.02 rows=1 width=14) (actual time=47.603..47.605 rows=1 loops=1)
   Recheck Cond: (r = 'empty'::int4range)
   Rows Removed by Index Recheck: 200000
   Heap Blocks: lossy=1082
   ->  Bitmap Index Scan on foo_r_idx  (cost=0.00..12.01 rows=1 width=0) (actual time=0.169..0.169 rows=11000 loops=1)
         Index Cond: (r = 'empty'::int4range)
 Planning time: 0.062 ms
 Execution time: 47.647 ms
(8 rows)

- Found a typo in the docs: "withing the range"

- Why have you removed the USE_ASSERT_CHECKING code from brin.c?

- Remove redundant "or not" from "/* includes empty element or not */".

- Minor grammar gripe: Change "Check that" to "Check if" in the comments 
in brin_inclusion_add_value().

- Won't the code incorrectly return false if the first added element to 
an index page is empty?

- Would it be worth optimizing the code by checking for empty ranges 
after checking for overlap in brin_inclusion_add_value()? I would 
imagine that empty ranges are rare in most use cases.

- Typo in comment: "If the it" -> "If it"

- Typo in comment: "Note that this strategies" -> "Note that these 
strategies"

- Typo in comment: "inequality strategies does not" -> "inequality 
strategies do not"

- Typo in comment: "geometric types which uses" -> "geometric types 
which use"

- I get 'ERROR:  missing strategy 7 for attribute 1 of index 
"bar_i_idx"' when running the query below. Why does this not fail in the 
test suite? The overlap operator works just fine. If I read your code 
correctly other strategies are also missing.

SELECT * FROM bar WHERE i = '::1';

- I do not think this comment is true "Used to determine the addresses 
have a common union or not". It actually checks if we can create a range 
which contains both ranges.

- Compact random spaces in "select numrange(1.0, 2.0) + numrange(2.5, 
3.0);        -- should fail"

-- 
Andreas Karlsson

Re: BRIN range operator class

From

Emre Hasegeli

Date:

06 April 2015, 19:36:42

> Yeah, there is still a test which fails in opr_sanity.

I attached an additional patch to remove extra pg_amproc entries from
minmax operator classes.  It fixes the test as a side effect.

>> Yes but they were also required by this patch.  This version adds more
>> functions and operators.  I can split them appropriately after your
>> review.
>
>
> Ok, sounds fine to me.

It is now split.

> = New comments
>
> - Searching for the empty range is slow since the empty range matches all
> brin ranges.
>
> EXPLAIN ANALYZE SELECT * FROM foo WHERE r = '[1,1)';
>                                                       QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foo  (cost=12.01..16.02 rows=1 width=14) (actual
> time=47.603..47.605 rows=1 loops=1)
>    Recheck Cond: (r = 'empty'::int4range)
>    Rows Removed by Index Recheck: 200000
>    Heap Blocks: lossy=1082
>    ->  Bitmap Index Scan on foo_r_idx  (cost=0.00..12.01 rows=1 width=0)
> (actual time=0.169..0.169 rows=11000 loops=1)
>          Index Cond: (r = 'empty'::int4range)
>  Planning time: 0.062 ms
>  Execution time: 47.647 ms
> (8 rows)

There is not much we can do about it.  It looks like the problem
here is the selectivity estimation.

> - Found a typo in the docs: "withing the range"

Fixed.

> - Why have you removed the USE_ASSERT_CHECKING code from brin.c?

Because it doesn't work with the new operator class.  We don't set the
union field when there are elements that are not mergeable.

> - Remove redundant "or not" from "/* includes empty element or not */".

Fixed.

> - Minor grammar gripe: Change "Check that" to "Check if" in the comments in
> brin_inclusion_add_value().

Fixed.

> - Won't the code incorrectly return false if the first added element to an
> index page is empty?

No, column->bv_values[2] is set to true for the first empty element.

> - Would it be worth optimizing the code by checking for empty ranges after
> checking for overlap in brin_inclusion_add_value()? I would imagine that
> empty ranges are rare in most use cases.

I changed it for all empty range checks.

> - Typo in comment: "If the it" -> "If it"
>
> - Typo in comment: "Note that this strategies" -> "Note that these
> strategies"
>
> - Typo in comment: "inequality strategies does not" -> "inequality
> strategies do not"
>
> - Typo in comment: "geometric types which uses" -> "geometric types which
> use"

All of them are fixed.

> - I get 'ERROR:  missing strategy 7 for attribute 1 of index "bar_i_idx"'
> when running the query below. Why does this not fail in the test suite? The
> overlap operator works just fine. If I read your code correctly other
> strategies are also missing.
>
> SELECT * FROM bar WHERE i = '::1';

I fixed it in the new version.  The tests weren't failing because they were
using the minmax operator class for equality.

> - I do not think this comment is true "Used to determine the addresses have
> a common union or not". It actually checks if we can create a range which
> contains both ranges.

Changed as you suggested.

> - Compact random spaces in "select numrange(1.0, 2.0) + numrange(2.5, 3.0);       -- should fail"

There was a tab in there.  Now it is replaced with a space.

Attachment

Re: BRIN range operator class

From

Alvaro Herrera

Date:

06 April 2015, 21:17:02

Thanks for the updated patch; I will look at it as soon as time allows.  (Not
really all that soon, regrettably.)

Judging from a quick look, I think patches 1 and 5 can be committed
quickly; they imply no changes to other parts of BRIN.  (Not sure why 1
and 5 are separate.  Any reason for this?)  Also patch 2.

Patch 4 looks like a simple bugfix (or maybe a generalization) of BRIN
framework code; should also be committable right away.  Needs a closer
look of course.

Patch 3 is a problem.  That code is there because the union proc is only
used in a corner case in Minmax, so if we remove it, user-written Union
procs are very likely to remain buggy for long.  If you have a better
idea to test Union in Minmax, or some other way to turn that stuff off
for the range stuff, I'm all ears.  Let's just make sure the support
procs are tested to avoid stupid bugs.  Before I introduced that, my
Minmax Union proc was all wrong.

Patch 7 I don't understand.  Will have to look closer.  Are you saying
Minmax will depend on Btree opclasses?  I remember thinking of doing it
that way at some point, but wasn't convinced for some reason.

Patch 6 seems the real meat of your own stuff.  I think there should be
a patch 8 also but it's not attached ... ??

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Emre Hasegeli

Date:

14 April 2015, 14:45:38

> Judging from a quick look, I think patches 1 and 5 can be committed
> quickly; they imply no changes to other parts of BRIN.  (Not sure why 1
> and 5 are separate.  Any reason for this?)  Also patch 2.

Not much reason except that 1 includes only functions, but 5 includes operators.

> Patch 4 looks like a simple bugfix (or maybe a generalization) of BRIN
> framework code; should also be committable right away.  Needs a closer
> look of course.
>
> Patch 3 is a problem.  That code is there because the union proc is only
> used in a corner case in Minmax, so if we remove it, user-written Union
> procs are very likely to remain buggy for long.  If you have a better
> idea to test Union in Minmax, or some other way to turn that stuff off
> for the range stuff, I'm all ears.  Let's just make sure the support
> procs are tested to avoid stupid bugs.  Before I introduced that, my
> Minmax Union proc was all wrong.

I removed this test because I don't see a way to support it.  I
believe any other implementation that is more complicated than minmax
will fail there.  It is better to catch such bugs with the regression
tests, so I tried to improve them.  GiST, SP-GiST and GIN don't have
similar checks, but they have more complicated user defined functions.

> Patch 7 I don't understand.  Will have to look closer.  Are you saying
> Minmax will depend on Btree opclasses?  I remember thinking of doing it
> that way at some point, but wasn't convinced for some reason.

No, there isn't any additional dependency.  It makes minmax operator
classes use the procedures from the pg_amop instead of adding them to
pg_amproc.

It also makes the operator class safer for cross data type usage.
Actually, I just checked and found out that we get wrong answers from
the index on the current master without this patch.  You can reproduce it
with this query on the regression database:

select * from brintest where timestampcol = '1979-01-29 11:05:09'::timestamptz;

The inclusion-opclasses patch makes it possible to add cross-type BRIN
regression tests.  I will add more of them in the next version.

> Patch 6 seems the real meat of your own stuff.  I think there should be
> a patch 8 also but it's not attached ... ??

I had another commit that was not intended to be sent.  Sorry about that.

Re: BRIN range operator class

From

Robert Haas

Date:

30 April 2015, 12:41:35

On Mon, Apr 6, 2015 at 5:17 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thanks for the updated patch; I will look at it as soon as time allows.  (Not
> really all that soon, regrettably.)
>
> Judging from a quick look, I think patches 1 and 5 can be committed
> quickly; they imply no changes to other parts of BRIN.  (Not sure why 1
> and 5 are separate.  Any reason for this?)  Also patch 2.
>
> Patch 4 looks like a simple bugfix (or maybe a generalization) of BRIN
> framework code; should also be committable right away.  Needs a closer
> look of course.

Is this still pending?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BRIN range operator class

From

Alvaro Herrera

Date:

30 April 2015, 16:49:17

Robert Haas wrote:
> On Mon, Apr 6, 2015 at 5:17 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Thanks for the updated patch; I will look at it as soon as time allows.  (Not
> > really all that soon, regrettably.)
> >
> > Judging from a quick look, I think patches 1 and 5 can be committed
> > quickly; they imply no changes to other parts of BRIN.  (Not sure why 1
> > and 5 are separate.  Any reason for this?)  Also patch 2.
> >
> > Patch 4 looks like a simple bugfix (or maybe a generalization) of BRIN
> > framework code; should also be committable right away.  Needs a closer
> > look of course.
> 
> Is this still pending?

Yeah.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Andreas Karlsson

Date:

03 May 2015, 01:16:39

On 04/06/2015 09:36 PM, Emre Hasegeli wrote:
>>> Yes but they were also required by this patch.  This version adds more
>>> functions and operators.  I can split them appropriately after your
>>> review.
>>
>>
>> Ok, sounds fine to me.
>
> It is now split.

In which order should I apply the patches?

I also agree with Alvaro's comments.

-- 
Andreas Karlsson

Re: BRIN range operator class

From

Emre Hasegeli

Date:

03 May 2015, 16:11:02

> In which order should I apply the patches?

I rebased and renamed them with numbers.

Attachment

Re: BRIN range operator class

From

Andreas Karlsson

Date:

05 May 2015, 00:51:27

From my point of view as a reviewer this patch set is very close to 
being committable.

= brin-inclusion-v06-01-sql-level-support-functions.patch

This patch looks good.

= brin-inclusion-v06-02-strategy-numbers.patch

This patch looks good, but shouldn't it be merged with 07?

= brin-inclusion-v06-03-remove-assert-checking.patch

As you wrote earlier this is needed because the new range indexes would 
violate the asserts. I think it is fine to remove the assertion.

= brin-inclusion-v06-04-fix-brin-deform-tuple.patch

This patch looks good and can be committed separately.

= brin-inclusion-v06-05-box-vs-point-operators.patch

This patch looks good and can be committed separately.

= brin-inclusion-v06-06-inclusion-opclasses.patch

- "operator classes store the union of the values in the indexed column" 
is not technically true. It stores something which covers all of the values.
- Missing space in "except box and point*/".
- Otherwise looks good.

= brin-inclusion-v06-07-remove-minmax-amprocs.patch

Shouldn't this be merged with 02? Otherwise it looks good.

-- 
Andreas Karlsson

Re: BRIN range operator class

From

Stefan Keller

Date:

05 May 2015, 01:40:23

Hi,

2015-05-05 2:51 GMT+02:00 Andreas Karlsson <andreas@proxel.se>:
> From my point of view as a reviewer this patch set is very close to being
> committable.

I'd like to thank all committers and reviewers already, and I hope
BRIN makes it into PG 9.5.
As a database instructor, conference organiser and geospatial
specialist, I'm looking forward to this clever new index.
I'm keen to see if a PostGIS specialist jumps in and adds PostGIS
geometry support.

Yours, S.



Re: BRIN range operator class

From

Alvaro Herrera

Date:

05 May 2015, 03:16:36

Stefan Keller wrote:
> Hi,
> 
> 2015-05-05 2:51 GMT+02:00 Andreas Karlsson <andreas@proxel.se>:
> > From my point of view as a reviewer this patch set is very close to being
> > committable.
> 
> I'd like to thank all committers and reviewers already, and I hope
> BRIN makes it into PG 9.5.
> As a database instructor, conference organiser and geospatial
> specialist, I'm looking forward to this clever new index.

Appreciated.  The base BRIN code is already in 9.5, so barring
significant issues you should see it in the next major release.
Support for geometry types and the like is still pending, but I hope to
get to it shortly.

> I'm keen to see if a PostGIS specialist jumps in and adds PostGIS
> geometry support.

Did you test the patch proposed here already?  It could be a very good
contribution.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Andreas Karlsson

Date:

05 May 2015, 09:34:28

On 05/05/2015 04:24 AM, Alvaro Herrera wrote:
> Stefan Keller wrote:
>> I'm keen to see if a PostGIS specialist jumps in and adds PostGIS
>> geometry support.
>
> Did you test the patch proposed here already?  It could be a very good
> contribution.

Indeed, I have done some testing of the patch but more people testing 
would be nice.

Andreas

Re: BRIN range operator class

From

Emre Hasegeli

Date:

05 May 2015, 09:57:50

> From my point of view as a reviewer this patch set is very close to being
> committable.

Thank you.  The new versions are attached.

> - "operator classes store the union of the values in the indexed column" is
> not technically true. It stores something which covers all of the values.

I rephrased it as " operator classes store a value which includes the
values in the indexed column".

> - Missing space in "except box and point*/".

Fixed.

> = brin-inclusion-v06-07-remove-minmax-amprocs.patch
>
> Shouldn't this be merged with 02? Otherwise it looks good.

It isn't related to 02-strategy-numbers.patch.
Maybe you mean that 01-sql-level-support-functions.patch and
05-box-vs-point-operators.patch should be merged.  They can always be
committed together.

Attachment

Re: BRIN range operator class

From

Emre Hasegeli

Date:

05 May 2015, 10:06:41

> Indeed, I have done some testing of the patch but more people testing would
> be nice.

The inclusion opclass should work for other data types as long as the
required operators and SQL-level support functions are supplied.
Maybe it would work for PostGIS, too.

Re: BRIN range operator class

From

Andreas Karlsson

Date:

05 May 2015, 10:10:14

On 05/05/2015 11:57 AM, Emre Hasegeli wrote:
>> From my point of view as a reviewer this patch set is very close to being
>> committable.
>
> Thank you.  The new versions are attached.

Nice, I think it is ready now other than the issues Alvaro raised in his 
review[1]. Have you given those any thought?

Notes

1. http://www.postgresql.org/message-id/20150406211724.GH4369@alvh.no-ip.org

Andreas

Re: BRIN range operator class

From

Emre Hasegeli

Date:

05 May 2015, 11:10:39

> Nice, I think it is ready now other than the issues Alvaro raised in his
> review[1]. Have you given those any thought?

I already replied to his email [1].  Which issues do you mean?

[1] http://www.postgresql.org/message-id/CAE2gYzxQ-Gk3q3jYWT=1eNLEbSgCgU28+1axML4oMCwjBkPuqw@mail.gmail.com

Re: BRIN range operator class

From

Alvaro Herrera

Date:

05 May 2015, 18:48:32

After looking at 05 again, I don't like the "same as %" business.
Creating a whole new class of exceptions is not my thing, particularly
not in a regression test whose sole purpose is to look for exceptional
(a.k.a. "wrong") cases.  I would much rather define the opclasses for
those two datatypes using the existing @> operators rather than create
&& operators for this purpose.  We can add a note to the docs, "for
historical reasons the brin opclass for datatype box/point uses the <@
operator instead of &&", or something like that.

AFAICS this is just some pretty small changes to patches 05 and 06.
Will you please resubmit?

I just pushed patch 01, and I'm looking at 04 next.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Alvaro Herrera

Date:

05 May 2015, 19:07:56

Looking at patch 04, it seems to me that it would be better to have
the OpcInfo struct carry the typecache struct rather than the type OID,
so that we can avoid repeated typecache lookups in brin_deform_tuple;
something like

/* struct returned by "OpcInfo" amproc */
typedef struct BrinOpcInfo
{
    /* Number of columns stored in an index column of this opclass */
    uint16      oi_nstored;

    /* Opaque pointer for the opclass' private use */
    void       *oi_opaque;

    /* Typecache entries of the stored columns */
    TypeCacheEntry oi_typcache[FLEXIBLE_ARRAY_MEMBER];
} BrinOpcInfo;

Looking into it now.
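
To illustrate why carrying the type information in the OpcInfo struct helps,
here is a standalone sketch. It is not code from the patch: DemoTypeEntry,
DemoOpcInfo, demo_lookup_type() and the type lengths are hypothetical
stand-ins, and the "expensive" lookup is only simulated with a call counter.
The point is that the lookup runs once per stored column when the struct is
built, and later per-tuple work reads the cached entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a typecache entry: only what deforming needs. */
typedef struct DemoTypeEntry
{
    uint32_t    type_id;
    int16_t     typlen;         /* fixed length in bytes */
} DemoTypeEntry;

/* Counts how often the simulated "expensive" lookup runs. */
static int  demo_lookup_calls = 0;

static DemoTypeEntry
demo_lookup_type(uint32_t type_id)
{
    DemoTypeEntry e;

    demo_lookup_calls++;
    e.type_id = type_id;
    /* Made-up mapping for the demo: 16 -> 1 byte, 23 -> 4 bytes, else 8. */
    e.typlen = (type_id == 16) ? 1 : (type_id == 23) ? 4 : 8;
    return e;
}

/* Shaped like the proposed struct, with the entries stored inline. */
typedef struct DemoOpcInfo
{
    uint16_t    oi_nstored;
    void       *oi_opaque;
    DemoTypeEntry oi_typcache[];    /* C99 flexible array member */
} DemoOpcInfo;

/* Build the struct once; this is the only place the lookup is called. */
static DemoOpcInfo *
demo_make_opcinfo(const uint32_t *type_ids, uint16_t nstored)
{
    DemoOpcInfo *info;

    assert(type_ids != NULL);
    info = malloc(sizeof(DemoOpcInfo) + nstored * sizeof(DemoTypeEntry));
    if (info == NULL)
        return NULL;
    info->oi_nstored = nstored;
    info->oi_opaque = NULL;
    for (uint16_t i = 0; i < nstored; i++)
        info->oi_typcache[i] = demo_lookup_type(type_ids[i]);
    return info;
}

/* Per-tuple work: sums the stored value lengths without any lookup. */
static size_t
demo_stored_size(const DemoOpcInfo *info)
{
    size_t      total = 0;

    for (uint16_t i = 0; i < info->oi_nstored; i++)
        total += (size_t) info->oi_typcache[i].typlen;
    return total;
}
```

Calling demo_stored_size() any number of times leaves demo_lookup_calls
unchanged, which is the effect the proposal is after for brin_deform_tuple.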

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Alvaro Herrera

Date:

05 May 2015, 19:21:12

Alvaro Herrera wrote:
> Looking at patch 04, it seems to me that it would be better to have
> the OpcInfo struct carry the typecache struct rather than the type OID,
> so that we can avoid repeated typecache lookups in brin_deform_tuple;

Here's the patch.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

brin-inclusion-v07-04.patch

Re: BRIN range operator class

From

Alvaro Herrera

Date:

05 May 2015, 21:21:55

Can you please explain what is the purpose of patch 07?  I'm not sure I
understand; are we trying to avoid having to add pg_amproc entries for
these operators and instead piggy-back on btree opclass definitions?
Not too much in love with that idea; I see that there is less tedium in
that the brin opclass definition is simpler.  One disadvantage is a 3x
increase in the number of syscache lookups to get the function you need,
unless I'm reading things wrong.  Maybe this is not performance critical.

Anyway I tried applying it in isolation, and found that it fails the
assertion that tests the "union" support proc in brininsert.  That
doesn't seem okay.  I mean, it's okay not to run the test for the
inclusion opclasses, but why does it now fail in minmax which was
previously passing?  Couldn't figure it out.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Andreas Karlsson

Date:

05 May 2015, 22:53:28

On 05/05/2015 01:10 PM, Emre Hasegeli wrote:
> I already replied his email [1].  Which issues do you mean?

Sorry, my bad; please ignore the previous email.

-- 
Andreas Karlsson

Re: BRIN range operator class

From

Emre Hasegeli

Date:

06 May 2015, 08:54:19

> Can you please explain what is the purpose of patch 07?  I'm not sure I
> understand; are we trying to avoid having to add pg_amproc entries for
> these operators and instead piggy-back on btree opclass definitions?
> Not too much in love with that idea; I see that there is less tedium in
> that the brin opclass definition is simpler.  One disadvantage is a 3x
> increase in the number of syscache lookups to get the function you need,
> unless I'm reading things wrong.  Maybe this is not performance critical.

It doesn't use btree opclass definitions.  It uses brin opclass
pg_amop entries instead of duplicating them in pg_amproc.
The pg_amproc.h header says:

> * The amproc table identifies support procedures associated with index
> * operator families and classes.  These procedures can't be listed in pg_amop
> * since they are not the implementation of any indexable operator.

In our case, these procedures can be listed in pg_amop as they
are implementations of indexable operators.

The more important change on this patch is to request procedures for
the right data types.  Minmax opclasses return wrong results without
this patch.  You can reproduce it with this query on
the regression database:

select * from brintest where timestampcol = '1979-01-29 11:05:09'::timestamptz;

> Anyway I tried applying it in isolation, and found that it fails the
> assertion that tests the "union" support proc in brininsert.  That
> doesn't seem okay.  I mean, it's okay not to run the test for the
> inclusion opclasses, but why does it now fail in minmax which was
> previously passing?  Couldn't figure it out.

The regression tests passed when I tried it on the current master.

Re: BRIN range operator class

From

Emre Hasegeli

Date:

06 May 2015, 09:31:22

>> Looking at patch 04, it seems to me that it would be better to have
>> the OpcInfo struct carry the typecache struct rather than the type OID,
>> so that we can avoid repeated typecache lookups in brin_deform_tuple;
>
> Here's the patch.

Looks better to me.  I will incorporate this patch.

Re: BRIN range operator class

From

Emre Hasegeli

Date:

06 May 2015, 09:49:45

> After looking at 05 again, I don't like the "same as %" business.
> Creating a whole new class of exceptions is not my thing, particularly
> not in a regression test whose sole purpose is to look for exceptional
> (a.k.a. "wrong") cases.  I would much rather define the opclasses for
> those two datatypes using the existing @> operators rather than create
> && operators for this purpose.  We can add a note to the docs, "for
> historical reasons the brin opclass for datatype box/point uses the <@
> operator instead of &&", or something like that.

I worked around this by adding the point <@ box operator as the overlap
strategy and removing the additional && operators.

> AFAICS this is just some pretty small changes to patches 05 and 06.
> Will you please resubmit?

A new series of patches is attached.  Note that
brin-inclusion-v08-04-fix-brin-deform-tuple.patch is the one from you.

Attachment

Re: BRIN range operator class

From

Alvaro Herrera

Date:

06 May 2015, 21:48:32

I again have to refuse the notion that removing the assert-only block
without any replacement is acceptable.  I just spent a lot of time
tracking down what turned out to be a bug in your patch 07:
    /* Adjust maximum, if B's max is greater than A's max */
-    needsadj = FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
-                                                     PROCNUM_GREATER),
-                          colloid, col_b->bv_values[1], col_a->bv_values[1]);
+    frmg = minmax_get_strategy_procinfo(bdesc, attno, attr->atttypid,
+                                        BTGreaterStrategyNumber);
+    needsadj = FunctionCall2Coll(frmg, colloid, col_b->bv_values[0],
+                                 col_a->bv_values[0]);

Note the removed lines use array index 1, while the added lines use
array index 0.  The only reason I noticed this is because I applied this
patch without the others and saw the assertion fire; how would I have
noticed the problem had I just removed it?
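
A standalone sketch of the mistake, for readers following along. This is not
the BRIN code: DemoSummary and both functions are hypothetical, plain ints
stand in for Datums, and the slot layout (0 = minimum, 1 = maximum) is taken
from the diff above. The second function repeats the wrong-slot comparison
so the effect can be seen next to the intended behaviour.

```c
#include <assert.h>

/* Slot layout assumed from the quoted diff: [0] = minimum, [1] = maximum. */
enum
{
    DEMO_MIN = 0,
    DEMO_MAX = 1
};

typedef struct DemoSummary
{
    int         values[2];
} DemoSummary;

/* Intended behaviour: widen a so that it also covers b. */
static void
demo_minmax_union(DemoSummary *a, const DemoSummary *b)
{
    assert(a->values[DEMO_MIN] <= a->values[DEMO_MAX]);

    /* Adjust minimum, if b's min is less than a's min */
    if (b->values[DEMO_MIN] < a->values[DEMO_MIN])
        a->values[DEMO_MIN] = b->values[DEMO_MIN];

    /* Adjust maximum, if b's max is greater than a's max */
    if (b->values[DEMO_MAX] > a->values[DEMO_MAX])
        a->values[DEMO_MAX] = b->values[DEMO_MAX];
}

/* Same shape, but the "adjust maximum" test compares the minimum slots. */
static void
demo_minmax_union_buggy(DemoSummary *a, const DemoSummary *b)
{
    if (b->values[DEMO_MIN] < a->values[DEMO_MIN])
        a->values[DEMO_MIN] = b->values[DEMO_MIN];

    if (b->values[DEMO_MIN] > a->values[DEMO_MIN])  /* wrong slots */
        a->values[DEMO_MAX] = b->values[DEMO_MAX];
}
```

With a = [10, 20] and b = [5, 30], the first function produces [5, 30] while
the second leaves the maximum at 20, so a later search for 25 would skip a
block range that may contain it. Nothing fails loudly, which is why a check
that exercises the union procedure regularly is valuable.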

Let's think together and try to find a reasonable way to get the union
procedures tested regularly.  It is pretty clear that having them run
only when the race condition occurs is not acceptable; bugs go
unnoticed.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Tom Lane

Date:

06 May 2015, 22:00:04

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Let's think together and try to find a reasonable way to get the union
> procedures tested regularly.  It is pretty clear that having them run
> only when the race condition occurs is not acceptable; bugs go
> unnoticed.

[ just a drive-by comment... ]  Maybe you could set up a testing mode
that forces the race condition to occur?  Then you could test the calling
code paths, not only the union procedures per se.
        regards, tom lane

Re: BRIN range operator class

From

Alvaro Herrera

Date:

07 May 2015, 16:05:38

Emre Hasegeli wrote:
> > After looking at 05 again, I don't like the "same as %" business.
> > Creating a whole new class of exceptions is not my thing, particularly
> > not in a regression test whose sole purpose is to look for exceptional
> > (a.k.a. "wrong") cases.  I would much rather define the opclasses for
> > those two datatypes using the existing @> operators rather than create
> > && operators for this purpose.  We can add a note to the docs, "for
> > historical reasons the brin opclass for datatype box/point uses the <@
> > operator instead of &&", or something like that.
> 
> > I worked around this by adding the point <@ box operator as the overlap
> > strategy and removing the additional && operators.

That works for me.

I pushed patches 04 and 07, as well as adopting some of the changes to
the regression test in 06.  I'm afraid I caused a bit of merge pain for
you -- sorry about that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Emre Hasegeli

Date:

10 May 2015, 14:57:21

> I pushed patches 04 and 07, as well as adopting some of the changes to
> the regression test in 06.  I'm afraid I caused a bit of merge pain for
> you -- sorry about that.

No problem.  I rebased the remaining ones.

Attachment

Re: BRIN range operator class

From

Alvaro Herrera

Date:

12 May 2015, 17:37:58

Emre Hasegeli wrote:
> > I pushed patches 04 and 07, as well as adopting some of the changes to
> > the regression test in 06.  I'm afraid I caused a bit of merge pain for
> > you -- sorry about that.
> 
> No problem.  I rebased the remaining ones.

In patch 05, you use straight > etc comparisons of point/box values.
All the other code in that file AFAICS uses FPlt() macros and others; I
assume we should do likewise.
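
For context, a standalone sketch of what the difference looks like. The
DEMO_ macros are modeled on the FPlt()-style macros in the geometry code, but
they are re-declared here from memory, so treat the tolerance value and the
exact macro bodies as assumptions. DemoPoint, DemoBox and both functions are
hypothetical and are not the operators from patch 05.

```c
#include <stdbool.h>

/* Assumed tolerance; the real value lives in the geometry headers. */
#define DEMO_EPSILON    1.0E-06

/* Fuzzy comparisons: values closer than the tolerance count as equal. */
#define DEMO_FPlt(A, B) ((B) - (A) > DEMO_EPSILON)
#define DEMO_FPle(A, B) ((A) - (B) <= DEMO_EPSILON)
#define DEMO_FPgt(A, B) ((A) - (B) > DEMO_EPSILON)
#define DEMO_FPge(A, B) ((B) - (A) <= DEMO_EPSILON)

typedef struct DemoPoint
{
    double      x;
    double      y;
} DemoPoint;

typedef struct DemoBox
{
    DemoPoint   high;           /* upper right corner */
    DemoPoint   low;            /* lower left corner */
} DemoBox;

/* "point <@ box" using the fuzzy macros. */
static bool
demo_point_in_box_fuzzy(const DemoPoint *pt, const DemoBox *box)
{
    return DEMO_FPle(pt->x, box->high.x) && DEMO_FPge(pt->x, box->low.x) &&
        DEMO_FPle(pt->y, box->high.y) && DEMO_FPge(pt->y, box->low.y);
}

/* The same test with straight comparisons. */
static bool
demo_point_in_box_exact(const DemoPoint *pt, const DemoBox *box)
{
    return pt->x <= box->high.x && pt->x >= box->low.x &&
        pt->y <= box->high.y && pt->y >= box->low.y;
}
```

A point that lies just outside the box edge by less than the tolerance is
inside according to the fuzzy version and outside according to the exact
one. Mixing the two styles in one file would therefore make the new
operators disagree with the existing ones near box edges.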

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Alvaro Herrera

Date:

12 May 2015, 17:52:09

Alvaro Herrera wrote:

> In patch 05, you use straight > etc comparisons of point/box values.
> All the other code in that file AFAICS uses FPlt() macros and others; I
> assume we should do likewise.

Oooh, looking at the history of this I just realized that the comments
signed "tgl" are actually Thomas G. Lockhart, not Tom G. Lane!  See
commit 9e2a87b62db87fc4175b00dabfd26293a2d072fa

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Alvaro Herrera

Date:

12 May 2015, 19:49:00

So, in reading these patches, it came to me that we might want to have
pg_upgrade mark indexes invalid if we in the future change the
implementation of some opclass.  For instance, the inclusion opclass
submitted here uses three columns: the indexed value itself, plus two
booleans; each of these booleans is a workaround for some nasty design
decision in the underlying datatypes.

One boolean is "unmergeable": if a block range contains both IPv4 and
IPv6 addresses, we mark it as 'unmergeable' and then every query needs
to visit that block range always.  The other boolean is "contains empty"
and is used for range types: it is set if the empty value is present
somewhere in the block range.
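
A toy model of that three-column layout, to make the role of the two
booleans concrete. Everything here is hypothetical: DemoValue folds an
"address family" and an integer range into one type purely for illustration,
and the functions are not the ones in the submitted opclass.

```c
#include <stdbool.h>

typedef struct DemoValue
{
    int         family;         /* stands in for IPv4 vs IPv6 */
    int         lo;             /* half-open range [lo, hi) */
    int         hi;             /* lo >= hi means "empty" */
} DemoValue;

typedef struct DemoInclusionSummary
{
    bool        has_union;      /* has any non-empty value been seen? */
    DemoValue   union_val;      /* column 0: value covering all others */
    bool        unmergeable;    /* column 1: families were mixed */
    bool        contains_empty; /* column 2: an empty value is present */
} DemoInclusionSummary;

static bool
demo_is_empty(const DemoValue *v)
{
    return v->lo >= v->hi;
}

/* Fold one heap value into the summary of its block range. */
static void
demo_add_value(DemoInclusionSummary *s, const DemoValue *v)
{
    if (demo_is_empty(v))
    {
        s->contains_empty = true;   /* empty values never widen the union */
        return;
    }
    if (!s->has_union)
    {
        s->union_val = *v;
        s->has_union = true;
        return;
    }
    if (s->unmergeable)
        return;                 /* nothing more to learn about this range */
    if (v->family != s->union_val.family)
    {
        s->unmergeable = true;  /* cannot build a covering value */
        return;
    }
    if (v->lo < s->union_val.lo)
        s->union_val.lo = v->lo;
    if (v->hi > s->union_val.hi)
        s->union_val.hi = v->hi;
}

/* Could this block range hold a value equal to q?  False means "skip it". */
static bool
demo_may_contain_equal(const DemoInclusionSummary *s, const DemoValue *q)
{
    if (demo_is_empty(q))
        return s->contains_empty;
    if (s->unmergeable)
        return true;            /* must always visit this block range */
    if (!s->has_union)
        return false;
    return q->family == s->union_val.family &&
        q->lo >= s->union_val.lo && q->hi <= s->union_val.hi;
}
```

In this model an unmergeable block range can never be skipped, which matches
the cost described above. Changing how mixed families are stored would change
what column 0 means for existing summaries, and that is the kind of change a
version check would need to detect.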

If in the future, for instance, we come up with a way to store the ipv4
plus ipv6 info, we will want to change the page format.  If we add a
page version to the metapage, we can detect the change at pg_upgrade
time and force a reindex of the index.

Thoughts?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Heikki Linnakangas

Date:

12 May 2015, 19:54:38

On 05/12/2015 10:49 PM, Alvaro Herrera wrote:
> If in the future, for instance, we come up with a way to store the ipv4
> plus ipv6 info, we will want to change the page format.  If we add a
> page version to the metapage, we can detect the change at pg_upgrade
> time and force a reindex of the index.

A version number in the metapage is certainly a good idea. But we
already have that, don't we?

> /* Metapage definitions */
> typedef struct BrinMetaPageData
> {
>     uint32        brinMagic;
>     uint32        brinVersion;
>     BlockNumber pagesPerRange;
>     BlockNumber lastRevmapPage;
> } BrinMetaPageData;
>
> #define BRIN_CURRENT_VERSION        1
> #define BRIN_META_MAGIC            0xA8109CFA

Did you have something else in mind?

- Heikki

Re: BRIN range operator class

From

Alvaro Herrera

Date:

12 May 2015, 20:02:19

Heikki Linnakangas wrote:
> On 05/12/2015 10:49 PM, Alvaro Herrera wrote:
> >If in the future, for instance, we come up with a way to store the ipv4
> >plus ipv6 info, we will want to change the page format.  If we add a
> >page version to the metapage, we can detect the change at pg_upgrade
> >time and force a reindex of the index.
> 
> A version number in the metapage is certainly a good idea. But we already
> have that, don't we?
> 
> >/* Metapage definitions */
> >typedef struct BrinMetaPageData
> >{
> >    uint32        brinMagic;
> >    uint32        brinVersion;
> >    BlockNumber pagesPerRange;
> >    BlockNumber lastRevmapPage;
> >} BrinMetaPageData;
> >
> >#define BRIN_CURRENT_VERSION        1
> >#define BRIN_META_MAGIC            0xA8109CFA
> 
> Did you have something else in mind?

Yeah, I was thinking we could have a separate version number for the
opclass code as well.  An external extension could change that, for
instance.  Also, we could change the 'inclusion' version and leave
minmax alone.
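
A rough sketch of how such a check could look at pg_upgrade time follows. This is only an illustration under assumptions: the opclassVersion field does not exist in BrinMetaPageData, and ToyBrinMeta, ToyUpgradeVerdict and toy_check_index are hypothetical names. The magic and version constants are the ones quoted above; uint32_t stands in for PostgreSQL's uint32.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BRIN_CURRENT_VERSION    1
#define BRIN_META_MAGIC         0xA8109CFA

/* Cut-down metapage with one extra, hypothetical field. */
typedef struct ToyBrinMeta
{
    uint32_t brinMagic;
    uint32_t brinVersion;      /* version of the AM page format */
    uint32_t opclassVersion;   /* ASSUMPTION: per-opclass version */
} ToyBrinMeta;

typedef enum ToyUpgradeVerdict
{
    TOY_INDEX_OK,
    TOY_INDEX_NEEDS_REINDEX,
    TOY_INDEX_NOT_BRIN
} ToyUpgradeVerdict;

/*
 * Decide what to do with one index, given the opclass version that the
 * new server's opclass code expects.
 */
ToyUpgradeVerdict
toy_check_index(const ToyBrinMeta *meta, uint32_t current_opclass_version)
{
    assert(meta != NULL);
    if (meta->brinMagic != BRIN_META_MAGIC)
        return TOY_INDEX_NOT_BRIN;
    /* The AM format changed: every BRIN index is affected. */
    if (meta->brinVersion != BRIN_CURRENT_VERSION)
        return TOY_INDEX_NEEDS_REINDEX;
    /* Only this opclass changed: other opclasses are left alone. */
    if (meta->opclassVersion != current_opclass_version)
        return TOY_INDEX_NEEDS_REINDEX;
    return TOY_INDEX_OK;
}
```

With two separate numbers, bumping the 'inclusion' version would invalidate only inclusion indexes, while minmax indexes would keep passing the check.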

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BRIN range operator class

From

Alvaro Herrera

Date:

13 May 2015, 23:07:36

Emre Hasegeli wrote:
> > I pushed patches 04 and 07, as well as adopting some of the changes to
> > the regression test in 06.  I'm afraid I caused a bit of merge pain for
> > you -- sorry about that.
>
> No problem.  I rebased the remaining ones.

Thanks!

After some back-and-forth between Emre and me, here's an updated patch.
My changes are cosmetic; for a detailed rundown, see
https://github.com/alvherre/postgres/commits/brin-inclusion

Note that datatype point was removed: it turns out that unless we get
box_contain_pt changed to use FPlt() et al, indexes created with this
opclass would be corrupt.  And we cannot simply change box_contain_pt,
because that would break existing GiST and SP-GiST indexes that use it
today and pg_upgrade to 9.5!  So that needs to be considered separately.
Also, removing point support means removing the CAST support procedure,
because there is no use for it in the supported types.  Also, patch 05
in the previous submissions goes away completely because there's no need
for those (box,point) operators anymore.
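
The kind of disagreement at stake can be shown with a small standalone sketch. The EPSILON value and the FPle()/FPge() definitions below are written from memory of geo_decls.h and should be treated as assumptions; ToyPoint, ToyBox and both functions are hypothetical stand-ins, not the real box_contain_pt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASSUMPTION: mirrors the fuzzy comparison macros in geo_decls.h. */
#define EPSILON     1.0E-06
#define FPle(A,B)   ((A) - (B) <= EPSILON)
#define FPge(A,B)   ((B) - (A) <= EPSILON)

typedef struct ToyPoint { double x, y; } ToyPoint;
typedef struct ToyBox { ToyPoint high, low; } ToyBox;

/* Containment with plain comparisons. */
bool
toy_contain_pt_exact(const ToyBox *box, const ToyPoint *pt)
{
    assert(box != NULL && pt != NULL);
    return box->high.x >= pt->x && box->low.x <= pt->x &&
           box->high.y >= pt->y && box->low.y <= pt->y;
}

/* The same test with the fuzzy macros. */
bool
toy_contain_pt_fuzzy(const ToyBox *box, const ToyPoint *pt)
{
    assert(box != NULL && pt != NULL);
    return FPge(box->high.x, pt->x) && FPle(box->low.x, pt->x) &&
           FPge(box->high.y, pt->y) && FPle(box->low.y, pt->y);
}
```

For the unit box and the point (1.0000005, 0.5), the fuzzy test reports containment and the exact test does not. If the code that maintains the summary and the operator that answers the query disagree like this, a block range holding a matching row could be skipped.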

There's nothing Earth-shattering here that hasn't been seen in previous
submissions by Emre.

One item of note is that this patch is blindly removing the assert-only
blocks as previously discussed, without any replacement.  Need to think
more on how to put something back ...

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

brin-inclusion-v11.patch

Re: BRIN range operator class

From

Alvaro Herrera

Date:

15 May 2015, 21:20:04

Emre Hasegeli wrote:
> > I pushed patches 04 and 07, as well as adopting some of the changes to
> > the regression test in 06.  I'm afraid I caused a bit of merge pain for
> > you -- sorry about that.
> 
> No problem.  I rebased the remaining ones.

Thanks, pushed.

There was a proposed change by Emre to renumber operator -|- to 17 for
range types (from 6 I think).  I didn't include that as I think it
should be a separate commit.  Also, we still owe a test strategy for
the union procedure.  I will work with Emre in the coming
days to get that sorted out.  I'm now thinking that something in
src/test/modules is the most appropriate.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services