[RFC] Minmax indexes

From: Alvaro Herrera
Hi,

This is a preliminary proposal for Minmax indexes.  I'm experimenting
with the code, but it's too crude to post yet, so here's a document
explaining what they are and how they work, so that reviewers can poke
holes to have the design improved.  My intention is to have a patch to
show for CF2, so please do have a look at this and comment.

This is part of the AXLE project http://www.axleproject.eu and the
intention is to support tables of very large size.  In a quick
experiment, I have a table of ~12 GB and its corresponding index is 65
kB in size, making the time to do the equivalent of a seqscan a small
fraction of that taken by a real seqscan.  This technique sits between a
bitmap scan of a normal btree, and a seqscan: the minmax index tells the
bitmap heap scan what pages to seqscan, allowing it to skip a large
fraction of pages that are known not to contain tuples matching the
query quals.  This is a huge win for large data warehouses.  

Without further ado, here's what I propose.


Minmax Range Indexes
====================

Minmax indexes are a new access method intended to enable very fast scanning of
extremely large tables.

The essential idea of a minmax index is to keep track of the min() and max()
values in consecutive groups of heap pages (page ranges).  These values can be
used by constraint exclusion to avoid scanning such pages, depending on query
quals.

The main drawback of this is having to update the stored min/max values of each
page range as tuples are inserted into them.

Other database systems already have this feature. Some examples:

* Oracle Exadata calls this "storage indexes" http://richardfoote.wordpress.com/category/storage-indexes/

* Netezza has "zone maps" http://nztips.com/2010/11/netezza-integer-join-keys/

* Infobright has this automatically within their "data packs"
http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/

* MonetDB seems to have it http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662 "Cooperative Scans: Dynamic
Bandwidth Sharing in a DBMS"
 

Grammar
-------

To create a minmax index, we use
 CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);

Partial indexes are not supported; since an index is concerned with minimum and
maximum values of the involved columns across all the pages in the table, it
doesn't make sense to exclude values.  Another way to see "partial" indexes
here would be those that only considered some pages in the table instead of all
of them; but this would be difficult to implement and manage and, most likely,
pointless.

Expressional indexes can probably be supported in the future, but we disallow
them initially for conceptual simplicity.

Having multiple minmax indexes in the same table is acceptable, though most of
the time it would make more sense to have a single index covering all the
interesting columns.  Multiple indexes might be useful for columns added later.

Access Method Design
--------------------

Since item pointers are not stored inside indexes of this type, it is not
possible to support the amgettuple interface.  Instead, we only provide
amgetbitmap support; scanning a relation using this index requires a recheck
node on top.  The amgetbitmap routine would return a TIDBitmap comprising all
the pages in those page groups that comply with the query quals; the recheck node
prunes tuples that are not visible per snapshot and those that are not visible
per query quals.

For each supported datatype, we need an opclass with the following catalog
entries:

- support functions (pg_amproc):
  * pg_proc entry for min()
  * pg_proc entry for max()
- support operators (pg_amop): same as btree (<, <=, =, >=, >)

The min() and max() support functions are used during index construction.
The support operators are used in the optimizer, so that the index is chosen
when queries on the indexed table are planned.  (Also, we use them in the
amgetbitmap routine, to compare ScanKeys and decide whether to emit a certain
block or not).

In each index tuple (corresponding to one page range), we store:
- first block this tuple applies to
- last block this tuple applies to
- for each indexed column:
  * min() value across all tuples in the range
  * max() value across all tuples in the range
  * nulls present in any tuple?
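
In rough C terms, the index tuple layout could look like this (a sketch only;
none of these names are taken from the actual patch):

    #include "postgres.h"       /* BlockNumber, Datum, FLEXIBLE_ARRAY_MEMBER */

    /* Per-column summary; names are illustrative, not from the patch. */
    typedef struct MinmaxColumnSummary
    {
        Datum       min;        /* min() across all tuples in the range */
        Datum       max;        /* max() across all tuples in the range */
        bool        hasnulls;   /* nulls present in any tuple? */
    } MinmaxColumnSummary;

    /* One index tuple summarizes one contiguous range of heap blocks. */
    typedef struct MinmaxTuple
    {
        BlockNumber first;      /* first heap block this tuple applies to */
        BlockNumber last;       /* last heap block this tuple applies to */
        MinmaxColumnSummary cols[FLEXIBLE_ARRAY_MEMBER];
    } MinmaxTuple;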
 

With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
types (timestamptz, bigint), each tuple would be 524 bytes in length, which
seems reasonable.  Of course, larger columns are possible, such as varchar, but
creating minmax indexes on such columns seems of little practical usefulness.

This maximum index tuple size is calculated as:
  BlockNumber (4 bytes) * 2 + data value (8 bytes) * 32 * 2 + null bitmap (4 bytes)
  = 8 + 512 + 4 = 524 bytes


Block ranges mentioned in index entries shouldn't overlap. However, there can
be gaps where some pages have no covering index entry. (In particular, the last
few pages of the table would commonly not be summarized.)

In order to scan a table, a sequential scan of the index is done; for each
index tuple whose range limits (min/max) show, per the ScanKeys, that the
range must be scanned, we obtain the block range covered by that tuple, and
return a lossy TIDBitmap of the pages in that range; any unsummarized pages
and invalid page ranges (see below) must also be returned.
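
As a sketch, the core of the amgetbitmap loop could look roughly like the
following; tbm_add_page() is the existing TIDBitmap call, while all the
minmax_* helpers are hypothetical names used for illustration:

    /* Sketch of the index scan; minmax_* helpers are hypothetical. */
    for (tup = minmax_first_tuple(index); tup != NULL;
         tup = minmax_next_tuple(index, tup))
    {
        BlockNumber blk;

        /* a valid range whose min/max excludes the quals can be skipped */
        if (minmax_range_is_valid(index, tup) &&
            !minmax_range_matches(tup, scankeys, nscankeys))
            continue;

        /* lossy: whole pages only; the recheck node prunes tuples later */
        for (blk = tup->first; blk <= tup->last; blk++)
            tbm_add_page(tbm, blk);
    }
    /* pages not covered by any index tuple must also be returned */
    minmax_add_unsummarized_pages(index, tbm);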


Note that we store pg_proc entries for aggregate functions.  It is the
responsibility of the access method to obtain the pg_aggregate rows starting
from the pg_proc rows; we avoid the parser code that deals with aggregates,
because the mechanism it uses to resolve from parser to executor starts from
aggregate names; this is not useful for our purposes.

Also, running the aggregate is not done through the executor support for
aggregates either, because that code is very specialized to running inside a
node with a child node feeding tuples, which IndexBuildHeapScan cannot do.
Instead, the AM is in charge of initializing state and running the transition
functions for each new tuple read by the index-build heap scan.
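
As a sketch, the per-tuple work during the build scan might look like this;
FunctionCall2Coll() is the existing fmgr call, and int4smaller/int4larger are
the existing min()/max() transition functions for int4, but the build-state
layout here is hypothetical:

    /* Sketch of per-tuple summarization during index build. */
    for (i = 0; i < state->natts; i++)
    {
        if (isnull[i])
        {
            state->cols[i].hasnulls = true;
            continue;
        }
        if (!state->cols[i].initialized)
        {
            /* first non-null value seen in this range */
            state->cols[i].min = state->cols[i].max = values[i];
            state->cols[i].initialized = true;
            continue;
        }
        /* run the transition functions, e.g. int4smaller/int4larger */
        state->cols[i].min = FunctionCall2Coll(&state->mintransfn, collation,
                                               state->cols[i].min, values[i]);
        state->cols[i].max = FunctionCall2Coll(&state->maxtransfn, collation,
                                               state->cols[i].max, values[i]);
    }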


Data changes in the heap
------------------------

Value changes in columns that are part of a minmax index, and tuple insertion
in summarized pages, would invalidate the stored min/max values.  To support
this, each minmax index has a validity map; a range can only be considered in a
scan if it hasn't been invalidated by such changes (A range "not considered" in
the scan needs to be returned in whole regardless of the stored min/max values,
that is, it cannot be pruned per query quals).  The validity map is very
similar to the visibility map in terms of performance characteristics: quick
enough that it's not contentious, allowing updates and insertions to proceed
even when data values violate the minmax index conditions.  An invalidated
range can be made valid by re-summarization (see below).
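
In other words, the write path only ever clears a bit.  A sketch (all names
here are hypothetical):

    /*
     * Called when a heap insert or update touches a summarized page;
     * one validity bit per page range, as in the visibility map.
     */
    void
    minmax_invalidate(Relation index, BlockNumber heapblk)
    {
        BlockNumber rangeno = heapblk / PagesPerRange(index);

        if (validity_map_test(index, rangeno))
            validity_map_clear(index, rangeno);
        /* scans must return the whole range until it is re-summarized */
    }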

Re-summarization is relatively expensive, because the complete page range has
to be scanned.  To avoid this, a table having a minmax index would be
configured so that inserts only go to the page(s) at the end of the table; this
avoids frequent invalidation of ranges in the middle of the table.  We provide
a table reloption that tweaks the FSM behavior, so that summarized pages are
not candidates for insertion.

A possible optimization is that whenever an update does not modify the values
covered by minmax indexes, or whenever the value changes are not outside the
min()/max() interval for the corresponding page range, the page range does not
need to be marked invalid.  This is particularly useful when considering
same-page updates (i.e. updates which find enough empty space in the same page
that the new tuple fits in it), and HOT (which frees space used by previous
updates).  This suggests we ought to recommend lower-than-100 fillfactor values
to permit these features to kick in.

When tuples are added to unsummarized pages, nothing needs to happen to the
index.

Tuples can be removed from anywhere without restriction.  See below for more on
vacuuming.

Summarization
-------------

At index creation time, the whole table is scanned; for each page range the
min() and max() values and nulls bitmap are collected, and stored in the index.
The possibly-incomplete range at the end of the table is also included.

Once in a while, it is necessary to summarize a bunch of unsummarized pages
(because the table has grown since the index was created), or re-summarize a
range that has been marked invalid.  This is simple: scan the page range
calculating the min() and max() for each indexed column, then insert the new
index entry at the end of the index.  The main interesting questions are:

a) when to do it
   The perfect time to do it is as soon as a complete page range of the
   configured range size has been filled (assuming page ranges are constant
   size).

b) who does it (what process)
   It doesn't seem a good idea to have a client-connected process do it; it
   would incur unwanted latency.  Three other options are (i) to spawn a
   specialized process to do it, which perhaps can be signalled by a
   client-connected process that executes a scan and notices the need to run
   summarization; or (ii) to let autovacuum do it, as a separate new
   maintenance task.  This seems simple enough to bolt on top of already
   existing autovacuum infrastructure.  The timing constraints of autovacuum
   might be undesirable, though.  (iii) wait for user command.

The easiest way around this seems to be to have vacuum do it.  That way we
can simply do re-summarization in the amvacuumcleanup routine.  Other answers
would mean we need a separate AM routine, which appears unwarranted at this
stage.
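
For concreteness, a sketch of the routine that amvacuumcleanup would invoke
(all names here are hypothetical):

    static void
    minmax_summarize_range(Relation heap, Relation index, BlockNumber rangeno)
    {
        MinmaxBuildState state;

        minmax_init_state(index, &state);
        minmax_scan_range(heap, rangeno, &state);    /* min()/max() per column */
        minmax_insert_entry(index, rangeno, &state); /* appended at index end */
        validity_map_set(index, rangeno);            /* range is valid again */
    }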

Vacuuming
---------

Vacuuming a table that has a minmax index does not represent a significant
challenge.  Since no CTIDs are stored, it's not necessary to scan the index
when heap tuples are removed.  It might be that some min() value could now
be raised, or some max() value lowered; but this would represent
an optimization opportunity only, not a correctness issue.  Perhaps it's
simpler to represent this as the need to re-run summarization on the affected
page range.

Note that if there are no indexes on the table other than the minmax index,
usage of maintenance_work_mem by vacuum can be decreased significantly, because
no detailed index scan needs to take place (and thus it's not necessary for
vacuum to save TIDs to remove).  This optimization opportunity is best left for
future improvement.

Optimizer
---------

In order to make this all work, the only thing we need to do is ensure we have a
good enough opclass and amcostestimate.  With this, the optimizer is able to pick
up the index on its own.


Open questions
--------------

* Same-size page ranges?
Current related literature seems to consider that each "index entry" in a
minmax index must cover the same number of pages.  There doesn't seem to be a
hard reason for this to be so; it might make sense to allow the index to
self-tune so that some index entries cover smaller page ranges, if this
allows the min()/max() values to be more compact.  This would incur larger
space overhead for the index itself, but might allow better pruning of page
ranges during scan.  In the limit of one index tuple per page, the index
itself would occupy too much space, even though we would be able to skip
reading most heap pages, because the min()/max() ranges are tight; in the
opposite limit of a single tuple that summarizes the whole table, we wouldn't
be able to prune anything from the seqscan even though the index is very
small.
 

* More compact representation for TIDBitmap?
TIDBitmap is the structure currently used to represent bitmap scans.  The
representation of lossy page ranges is not optimal for our purposes, because
it uses a Bitmapset to represent pages in the range; since we're going to
return all pages in a large range, it might be more convenient to allow for a
struct that instead uses start and end page numbers to represent the range.
This is mentioned in passing in tidbitmap.c comments, so perhaps the time has
come for its implementation.  (In experimentation, tidbitmap has no trouble
with ranges of a thousand pages.)
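
Such a struct could be as simple as this sketch (not an existing tidbitmap.c
type):

    /* Sketch of a range-based lossy entry: two block numbers per range. */
    typedef struct PageRangeEntry
    {
        BlockNumber start;      /* first page of the lossy range */
        BlockNumber end;        /* last page of the lossy range */
    } PageRangeEntry;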
 


References:

Email thread on pgsql-hackers
http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
From: Simon Riggs
To: pgsql-hackers
Subject: Dynamic Partitioning using Segment Visibility Map

http://wiki.postgresql.org/wiki/Segment_Exclusion
http://wiki.postgresql.org/wiki/Segment_Visibility_Map


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Josh Berkus
Alvaro,

This sounds really interesting, and I can see the possibilities.
However ...

> Value changes in columns that are part of a minmax index, and tuple insertion
> in summarized pages, would invalidate the stored min/max values.  To support
> this, each minmax index has a validity map; a range can only be considered in a
> scan if it hasn't been invalidated by such changes (A range "not considered" in
> the scan needs to be returned in whole regardless of the stored min/max values,
> that is, it cannot be pruned per query quals).  The validity map is very
> similar to the visibility map in terms of performance characteristics: quick
> enough that it's not contentious, allowing updates and insertions to proceed
> even when data values violate the minmax index conditions.  An invalidated
> range can be made valid by re-summarization (see below).

This begins to sound like these indexes are only useful on append-only
tables.  Not that there aren't plenty of those, but ...

> Re-summarization is relatively expensive, because the complete page range has
> to be scanned.

Why?  Why can't we just update the affected pages in the index?

>  To avoid this, a table having a minmax index would be
> configured so that inserts only go to the page(s) at the end of the table; this
> avoids frequent invalidation of ranges in the middle of the table.  We provide
> a table reloption that tweaks the FSM behavior, so that summarized pages are
> not candidates for insertion.

We haven't had an index type which modifies table insertion behavior
before, and I'm not keen to start now; imagine having two indexes on the
same table each with their own, conflicting, requirements.  This is
sounding a lot more like a candidate for our prospective pluggable
storage manager.  Also, the above doesn't help us at all with UPDATEs.

If we're going to start adding reloptions for specific table behavior,
I'd rather think of all of the optimizations we might have for a
prospective "append-only table" and bundle those, rather than tying it
to whether a certain index exists or not.

Also, I hate the name ... if this feature goes ahead, I'm going to be
lobbying to change it.  But that's pretty minor compared to the update
issues.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [RFC] Minmax indexes

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
>> To avoid this, a table having a minmax index would be
>> configured so that inserts only go to the page(s) at the end of the table; this
>> avoids frequent invalidation of ranges in the middle of the table.  We provide
>> a table reloption that tweaks the FSM behavior, so that summarized pages are
>> not candidates for insertion.

> We haven't had an index type which modifies table insertion behavior
> before, and I'm not keen to start now; imagine having two indexes on the
> same table each with their own, conflicting, requirements.

I agree; such a restriction is a nonstarter for a secondary index.  I
don't believe that hacking the FSM would be sufficient to guarantee the
required behavior, either.

We've talked a lot about index-organized tables in the past.  How much
of the use case for this would be subsumed by a feature like that?
        regards, tom lane



Re: [RFC] Minmax indexes

From: Greg Stark
On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Re-summarization is relatively expensive, because the complete page range has
> to be scanned.

That doesn't sound too bad to me. It just means there's a downside to
having larger page ranges. I would expect the page ranges to be
something in the ballpark of 32 pages --  scanning 32 pages to
resummarize doesn't sound that painful but sounds like it's large
enough that the resulting index would be a reasonable size.

But I don't understand why an insert would invalidate a tuple. An insert
can just update the min and max incrementally. It's a delete that
invalidates the range, but as you note it doesn't really invalidate it,
just marks it as needing a refresh -- and even then only if the value
being deleted is equal to either the min or max.
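
Something along these lines, say (a sketch; datum_lt/datum_gt stand in for
the opclass comparisons and every name here is invented):

    /* Incremental maintenance on insert: widen only when needed. */
    if (datum_lt(value, summary->min))
        summary->min = value;           /* widen the interval downward */
    else if (datum_gt(value, summary->max))
        summary->max = value;           /* widen the interval upward */
    /* otherwise the summary already covers the value: nothing to do */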

> Same-size page ranges?
> Current related literature seems to consider that each "index entry" in a
> minmax index must cover the same number of pages.  There doesn't seem to be a

I assume the reason for this in the literature is the need to quickly
find the summary for a given page when you're handling an insert or
delete. If you have some kind of meta data structure that lets you
find it (which I gather is what the validity map is?) then you
wouldn't need it. But that seems like a difficult cost to justify
compared to just having a 1:1 mapping from block to bitmap tuple.

-- 
greg



Re: [RFC] Minmax indexes

From: Simon Riggs
On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
> Alvaro,
>
> This sounds really interesting, and I can see the possibilities.
> However ...
>
>> Value changes in columns that are part of a minmax index, and tuple insertion
>> in summarized pages, would invalidate the stored min/max values.  To support
>> this, each minmax index has a validity map; a range can only be considered in a
>> scan if it hasn't been invalidated by such changes (A range "not considered" in
>> the scan needs to be returned in whole regardless of the stored min/max values,
>> that is, it cannot be pruned per query quals).  The validity map is very
>> similar to the visibility map in terms of performance characteristics: quick
>> enough that it's not contentious, allowing updates and insertions to proceed
>> even when data values violate the minmax index conditions.  An invalidated
>> range can be made valid by re-summarization (see below).
>
> This begins to sound like these indexes are only useful on append-only
> tables.  Not that there aren't plenty of those, but ...

The index is basically using the "index only scan" mechanism. The
"only useful on append-only tables" comment would/should apply also to
index only scans. I can't see a reason to raise that specifically for
this index type.


>> Re-summarization is relatively expensive, because the complete page range has
>> to be scanned.
>
> Why?  Why can't we just update the affected pages in the index?

Again, same thing as index-only scans. For IOS, we reset the
visibility info at vacuum. The route proposed here follows exactly the
same timing, same mechanism. I can't see a reason for any difference
between the two.


>>  To avoid this, a table having a minmax index would be
>> configured so that inserts only go to the page(s) at the end of the table; this
>> avoids frequent invalidation of ranges in the middle of the table.  We provide
>> a table reloption that tweaks the FSM behavior, so that summarized pages are
>> not candidates for insertion.
>
> We haven't had an index type which modifies table insertion behavior
> before, and I'm not keen to start now; imagine having two indexes on the
> same table each with their own, conflicting, requirements.  This is
> sounding a lot more like a candidate for our prospective pluggable
> storage manager.  Also, the above doesn't help us at all with UPDATEs.
>
> If we're going to start adding reloptions for specific table behavior,
> I'd rather think of all of the optimizations we might have for a
> prospective "append-only table" and bundle those, rather than tying it
> to whether a certain index exists or not.

I agree that the FSM behaviour shouldn't be linked to index existence.

IMHO that should be a separate table parameter, WITH (fsm_mode = append)

Index only scans would also benefit from that.


> Also, I hate the name ... if this feature goes ahead, I'm going to be
> lobbying to change it.  But that's pretty minor compared to the update
> issues.

This feature has already had 3 different names. I don't think the name
is crucial, but it makes sense to give it a name up front. So if you
want to lobby for that then you'd need to come up with a name soon, so
poor Alvaro can cope with name #4.

(There's no consistency in naming from any other implementation either).

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Simon Riggs
On 15 June 2013 00:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> We've talked a lot about index-organized tables in the past.  How much
> of the use case for this would be subsumed by a feature like that?

IOTs would only work for one specific organisation, acting like a
multi-column btree. So you could use IOTs for this but only in a
limited way.

MinMax indexes cover multiple columns, possibly even all columns if
desired. So MMIs have much greater usefulness, as well as not
requiring a full reload of the data when you decide you want one.

Personally, I think IOTs are not worth pursuing. IOTs for Postgres
would require major changes to the definition of the heap, so we're a
long way from implementing them directly and coping with all of the
detailed implementation horrors. Grouped Item Tuple indexes are
theoretically equivalent in terms of size and number of block accesses
required to access data. Heikki's earlier work on that could easily be
revived. IOTs would be a multi-year effort with conflicting use cases.

Note that SQLServer and Sybase use block range indexes for primary
indexes, neatly avoiding the hassle Oracle walked into when they chose
to do IOTs. I think we should learn that same lesson ourselves.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
>> If we're going to start adding reloptions for specific table behavior,
>> I'd rather think of all of the optimizations we might have for a
>> prospective "append-only table" and bundle those, rather than tying it
>> to whether a certain index exists or not.

> I agree that the FSM behaviour shouldn't be linked to index existence.
> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
> Index only scans would also benefit from that.

-1 ... I cannot believe that such a parameter would ever get turned on
in production by anyone.  If your table has a significant update rate,
the resulting table bloat would make such behavior completely
infeasible.  If you have few enough updates to make such a behavior
practical, then you can live with the expensive index updates instead.

I also find the analogy to index-only scans to be bogus, because those
didn't require any user tuning.

There's a nearby thread complaining bitterly about our willingness to
create hard-to-use, hard-to-tune "features".  In its current form,
this will be another one of those.
        regards, tom lane



Re: [RFC] Minmax indexes

From: Simon Riggs
On 15 June 2013 16:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
>>> If we're going to start adding reloptions for specific table behavior,
>>> I'd rather think of all of the optimizations we might have for a
>>> prospective "append-only table" and bundle those, rather than tying it
>>> to whether a certain index exists or not.
>
>> I agree that the FSM behaviour shouldn't be linked to index existence.
>> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>> Index only scans would also benefit from that.
>
> -1 ... I cannot believe that such a parameter would ever get turned on
> in production by anyone.

It's an extremely common use case, especially with larger databases.

> If your table has a significant update rate,
> the resulting table bloat would make such behavior completely
> infeasible.  If you have few enough updates to make such a behavior
> practical, then you can live with the expensive index updates instead.

Depends. What I had in mind was that "append" mode would allow updates
normally within the last 1 GB of a table, but relocate updates should
they occur in earlier segments. That would cover the majority of use
cases I see.

> I also find the analogy to index-only scans to be bogus, because those
> didn't require any user tuning.

I don't think anybody is suggesting there is a requirement for user
tuning on this type of index.

But I think having one parameter would make it the same as say, GIN
indexes which also have a single tuning parameter.

I think the idea that index only scans "just work" is itself bogus. A
mild random update pattern will reduce the benefit of IOS
considerably, which is hard to understand or control, especially
without any way of measuring it. We could eliminate that problem if we
chose, but it's not even documented as a potential issue for some
reason, so it's hard to discuss. Not being able to see or control a
known problem is not the same thing as "doesn't require user tuning".

> There's a nearby thread complaining bitterly about our willingness to
> create hard-to-use, hard-to-tune "features".  In its current form,
> this will be another one of those.

Not based upon anything mentioned here so far, or that I'm aware of.

I think it would be interesting to see some anonymous voting on
feature complaints, so we can see what people really think without
needing to risk offending the main authors of each of the various
parts of our software.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Craig Ringer
On 06/15/2013 11:39 PM, Tom Lane wrote:
> There's a nearby thread complaining bitterly about our willingness to
> create hard-to-use, hard-to-tune "features".  In its current form,
> this will be another one of those.
For what it's worth, I was going for "usability concerns", not
"complaint"... and I'd much rather have most of those features, even
with poor usability, than not have them at all. I'm not convinced anyone
can introduce something that's beautiful design and implementation
perfection first time, every time. Learning where the pain points are
and improving them is what I'm going for.

There's also, IMO, nothing wrong with somewhat obscure options for
obscure use cases and workloads. The problem is when difficult to
understand or weird and obscure parameters need adjustment for perfectly
ordinary middle-of-the-road workloads.

I'm curious about using minmax indexes to guide insert/update placement,
so new tuples are written in regions of the table where they already
match the indexed range - essentially providing coarse-grained
auto-clustering. Otherwise with updates a table is likely to see
scattering of values such that the minmax ranges all keep on expanding
and the index becomes less and less selective. I realise that this may
not be even remotely feasible, but if it were then it might help reduce
the need for explicit tuning. I suspect we'd also land up wanting table
optimisation down the track, where tuples were shuffled around during
autovacuum to narrow the minmax index ranges.

I'm not expressing an opinion on the option discussed here as I haven't
worked with enough DBs that'd benefit from it. Personally I'm just
seriously impressed that this feature is coming together.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [RFC] Minmax indexes

From: Josh Berkus
>> I agree that the FSM behaviour shouldn't be linked to index existence.
>> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>> Index only scans would also benefit from that.
> 
> -1 ... I cannot believe that such a parameter would ever get turned on
> in production by anyone.  If your table has a significant update rate,
> the resulting table bloat would make such behavior completely
> infeasible.  If you have few enough updates to make such a behavior
> practical, then you can live with the expensive index updates instead.

I'm also thinking that if a table is really append-only, then there are
never any middle-of-the-table pages in the FSM, no?

Where this falls down is the table which is
mostly-append-but-occasionally-needs-an-update-or-delete.  I think the
answer there is to look for a way to make updating the index block range
faster, not ways to modify how we append to the heap.  If we told users
"tables with Minmax indexes will be very expensive to update" then I
think they'd live with it; dropping and re-adding an index to enable fast
updates is something which is already familiar.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [RFC] Minmax indexes

From: Simon Riggs
On 17 June 2013 02:05, Josh Berkus <josh@agliodbs.com> wrote:
>
>>> I agree that the FSM behaviour shouldn't be linked to index existence.
>>> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>>> Index only scans would also benefit from that.
>>
>> -1 ... I cannot believe that such a parameter would ever get turned on
>> in production by anyone.  If your table has a significant update rate,
>> the resulting table bloat would make such behavior completely
>> infeasible.  If you have few enough updates to make such a behavior
>> practical, then you can live with the expensive index updates instead.
>
> I'm also thinking that if a table is really append-only, then there are
> never any middle-of-the-table pages in the FSM, no?
>
> Where this falls down is the table which is
> mostly-append-but-occasionally-needs-an-update-or-delete.  I think the
> answer there is to look for a way to make updating the index block range
> faster, not ways to modify how we append to the heap.  If we told users
> "tables with Minmax indexes will be very expensive to update" then I
> think they'd live with it; dropping and readding an index to enable fast
> updates is something which is already familiar.

This feature uses a technique to enhance SeqScans similar to the one we
already use for VACUUM. We don't really care about whether we have 100%
scan avoidance, because we are likely to be using a WHERE clause that
doesn't give perfect constraint elimination anyway. So on a large
table we don't care about the fact that we still have to scan 1-5% of
the table - we are still 20 times faster than a full seqscan.

So there isn't a "fall down" thing here. We expect the recently
loaded/updated data to be scanned and that's OK.

Having the minmax index updated greedily is just adding extra work for
fast diminishing returns. We can always add that later if really
needed, but I doubt it will be needed - in just the same way as mat
views aren't greedily updated.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Josh Berkus
> So there isn't a "fall down" thing here. We expect the recently
> loaded/updated data to be scanned and that's OK.
> 
> Having the minmax index updated greedily is just adding extra work for
> fast diminishing returns. We can always add that later if really
> needed, but I doubt it will be needed - in just the same way as mat
> views aren't greedily updated.

Ok, in that case, can we add the patch without messing with the FSM
logic?  It'll work out-of-the-box for append-only tables, and that's a
pretty solid use case.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [RFC] Minmax indexes

From: Alvaro Herrera
Greg Stark wrote:
> On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Re-summarization is relatively expensive, because the complete page range has
> > to be scanned.
> 
> That doesn't sound too bad to me. It just means there's a downside to
> having larger page ranges. I would expect the page ranges to be
> something in the ballpark of 32 pages --  scanning 32 pages to
> resummarize doesn't sound that painful but sounds like it's large
> enough that the resulting index would be a reasonable size.

Actually, I'm thinking that a range is more like, say, 1280 pages, or 10
MB.  My goal is to consider tables in the 10 TB magnitude; if I store
one index tuple for every 32 pages, I would end up with too large an
index.  With 10 MB per range I can index the whole table with an index
of 50 MB, which seems reasonable to scan.  But anyway my intention is
that page range size is configurable.

> But I don't understand why an insert would invalidate a tuple. An insert
> can just update the min and max incrementally. It's a delete that
> invalidates the range, but as you note it doesn't really invalidate it,
> just marks it as needing a refresh -- and even then only if the value
> being deleted is equal to either the min or max.

No, I don't intend to update the index tuple with each heap insert.  I
think this will end up being too slow.  The validity map is there to
hold a single bit for each page range saying whether the range is known
valid or not; one insert into any page in the range invalidates the
range (and any scan of the table needs to scan that range as a whole).
This way I can wait until the storm of inserts has passed from a range,
and only then do the summarization for that range.  This avoids having
to summarize the range over and over.

Alternatively, I could consider the index tuple always valid, and update
it online as soon as we do an insert or update (i.e. examine the min/max
values in the index tuple, and update it to match if the new value is
out of bounds).  This seems too slow, so I won't try.

Also, a delete does not invalidate a range either.  As Simon said
elsewhere in a thread, if the range is not minimal, this is not a
problem; we might have to scan some more ranges than absolutely
necessary, but it's not a correctness problem.  The heap scan recheck
node will get rid of the unwanted tuples anyway.

> > Same-size page ranges?
> > Current related literature seems to consider that each "index entry" in a
> > minmax index must cover the same number of pages.  There doesn't seem to be a
> 
> I assume the reason for this in the literature is the need to quickly
> find the summary for a given page when you're handling an insert or
> delete.

Yeah, that's one thing to keep in mind.  I haven't gotten too much into
this; I only added these two entries at the end for my own future
reference, because I will need to consider them at some point.

Right now my intention is to have each index have a fixed page range
size, which is defined at index creation time.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Alvaro Herrera
Josh Berkus wrote:

> > Value changes in columns that are part of a minmax index, and tuple insertion
> > in summarized pages, would invalidate the stored min/max values.  To support
> > this, each minmax index has a validity map; a range can only be considered in a
> > scan if it hasn't been invalidated by such changes (A range "not considered" in
> > the scan needs to be returned in whole regardless of the stored min/max values,
> > that is, it cannot be pruned per query quals).  The validity map is very
> > similar to the visibility map in terms of performance characteristics: quick
> > enough that it's not contentious, allowing updates and insertions to proceed
> > even when data values violate the minmax index conditions.  An invalidated
> > range can be made valid by re-summarization (see below).
> 
> This begins to sound like these indexes are only useful on append-only
> tables.  Not that there aren't plenty of those, but ...

But what?  This is a useful use-case; one that's not served by any other
type of index.  Sure, you can have btrees over append-only tables, but
they are really large and have huge maintenance costs.  A smaller lossy
index is useful if it's low-maintenance and keeps the cost of scanning
the table low enough.

These indexes can be considered a sort of partitioning of a large table.
Only instead of having to define CHECK (insert_date >= 'a month') for
each partition, you just create the index on the insert_date column, and
the index will allow a seqscan of the table to skip pages that don't
match the query's WHERE clause on that column.

> > Re-summarization is relatively expensive, because the complete page range has
> > to be scanned.
> 
> Why?  Why can't we just update the affected pages in the index?

The page range has to be scanned in order to find out the min/max values
for the indexed columns on the range; and then, with these data, update
the index.

> >  To avoid this, a table having a minmax index would be
> > configured so that inserts only go to the page(s) at the end of the table; this
> > avoids frequent invalidation of ranges in the middle of the table.  We provide
> > a table reloption that tweaks the FSM behavior, so that summarized pages are
> > not candidates for insertion.
> 
> We haven't had an index type which modifies table insertion behavior
> before, and I'm not keen to start now; imagine having two indexes on the
> same table each with their own, conflicting, requirements.

This is not a requirement.  It merely makes the index more effective.

> If we're going to start adding reloptions for specific table behavior,
> I'd rather think of all of the optimizations we might have for a
> prospective "append-only table" and bundle those, rather than tying it
> to whether a certain index exists or not.

Eh?  Sure, my intention for this reloption is for the user to be able to
state their intention for the table, and each feature that has an
append-only table optimization does its thing.  I wasn't thinking of
anything automatic.

> Also, I hate the name ...

Feel free to propose other names; that way I can hate your proposals
too (or maybe not).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Andres Freund
On 2013-06-17 16:23:40 -0400, Alvaro Herrera wrote:
> > > Re-summarization is relatively expensive, because the complete page range has
> > > to be scanned.
> > 
> > Why?  Why can't we just update the affected pages in the index?
> 
> The page range has to be scanned in order to find out the min/max values
> for the indexed columns on the range; and then, with these data, update
> the index.

Why? Assuming the initial summarization has been performed you can check
the current min/max value, check whether it's still smaller/bigger than
the value you're inserting and if not update the index accordingly with
the new value. You don't even need to wait for the new value to become
visible since the ranges don't need to be minimal.

I think the contention this possibly causes may be a better argument, but I
am not sure how much of a problem that really becomes if we find a
deadlock-free way to only lock the minmax pages exclusively if the range
is violated.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Alvaro Herrera
Tom Lane wrote:

> We've talked a lot about index-organized tables in the past.  How much
> of the use case for this would be subsumed by a feature like that?

IOTs are not flexible enough.  You can only have one index that you
index-organize the table on; and you can search only based on a prefix
of the index key.  If you want to change the key, ... um. I don't even
know what you'd do.

With minmax indexes, on the other hand, you can create one or several,
and they let you scan the table based on any of the indexed columns.  So
you can create a minmax index on creation_date, insertion_date, ship_date;
and have it serve queries that use any of these columns.  (You probably
don't add key column(s) to the minmax index because you already have
btrees on them.)

On the other hand, IOTs are expensive to insert into.  For each tuple to
insert you need to start from the root and descend the tree, insert your
tuple, then propagate splits upwards.  If you have a 10 TB table, you
cannot afford to have to do all that for each and every tuple.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Josh Berkus
>> This begins to sound like these indexes are only useful on append-only
>> tables.  Not that there aren't plenty of those, but ...
> 
> But what?  

... "but" the other comments further down in my email.  Also, my
successive comments in other emails.

>> Why?  Why can't we just update the affected pages in the index?
> 
> The page range has to be scanned in order to find out the min/max values
> for the indexed columns on the range; and then, with these data, update
> the index.

Seems like you could incrementally update the range, at least for
inserts.  If you insert a row which doesn't decrease the min or increase
the max, you can ignore it, and if it does increase/decrease, you can
change the min/max.  No?

For updates, things are more complicated.  If the row you're updating
was the min/max, in theory you should update it to adjust that, but you
can't verify that it was the ONLY min/max row without doing a full scan.
My suggestion would be to add a "dirty" flag which would indicate that
that block could use a rescan next VACUUM, and otherwise ignore changing
the min/max.  After all, the only defect of having min too low or max too
high for a block would be scanning too many blocks.  Which you'd do
anyway with it marked "invalid".

> This is not a requirement.  It merely makes the index more effective.

Right.  So I'm saying let's do this index without the FSM modifications,
and then consider those as their own, separate patch, if we even do them.

> Eh?  Sure, my intention for this reloption is for the user to be able to
> state their intention for the table, and each feature that has
> append-only table optimization does its thing.  I wasn't thinking in
> anything automatic.

99.7% of our users have no idea what to do with reloptions.  We'd have
to expose it with an ALTER TABLE SET append_only=true.

>> Also, I hate the name ...
> 
> Feel free to propose other names; that way I can hate your proposals
> too (or maybe not).

Well, my first thought was "block-range indexing", which I think is the
best description, but that isn't exactly an exciting name for a feature
which will likely be worthy of short-listing for 9.4.  I'd prefer it
over minmax, which users will think only works on aggregates, but it's
still not a great name.  "Summary Index" also comes to mind, but really
isn't a lot more exciting.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [RFC] Minmax indexes

From: Tomas Vondra
Hi!

This sounds really interesting, so a few quick comments.

On 15.6.2013 00:28, Alvaro Herrera wrote:

> In each index tuple (corresponding to one page range), we store: -
> first block this tuple applies to - last block this tuple applies to 
> - for each indexed column: * min() value across all tuples in the
> range * max() value across all tuples in the range * nulls present in
> any tuple?

What about adding a counter how many times the min/max value is present
in the range? The other messages in this thread suggest that the refresh
after UPDATE/DELETE is one of the concerns - as Greg Stark mentioned the
range invalidation may only happen when running DELETE on a row matching
the min/max value. I believe having a counter would be an improvement -
a refresh would be needed only if it reaches 0.
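
Roughly this (a sketch; every name here is invented):

    /*
     * A DELETE forces a refresh only when the last row carrying the
     * min (or max) value disappears from the range.
     */
    void
    minmax_note_delete(RangeSummary *s, Datum value)
    {
        if (datum_eq(value, s->min) && --s->min_count == 0)
            s->needs_refresh = true;
        if (datum_eq(value, s->max) && --s->max_count == 0)
            s->needs_refresh = true;
    }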

> Summarization -------------
> 
> At index creation time, the whole table is scanned; for each page
> range the min() and max() values and nulls bitmap are collected, and
> stored in the index. The possibly-incomplete range at the end of the
> table is also included.

Would it make sense to do this using an existing (b-tree) index for very
large tables? Clearly it doesn't make sense to create a b-tree index and
then minmax on the same column, but for very large databases upgraded
using pg_upgrade this might be interesting.

> Once in a while, it is necessary to summarize a bunch of unsummarized
> pages (because the table has grown since the index was created), or
> re-summarize a range that has been marked invalid.  This is simple:
> scan the page range calculating the min() and max() for each indexed
> column, then insert the new index entry at the end of the index.  The
> main interesting questions are:
> 
> a) when to do it The perfect time to do it is as soon as a complete
> page range of the configured range size has been filled (assuming
> page ranges are constant size).
> 
> b) who does it (what process) It doesn't seem a good idea to have a
> client-connected process do it; it would incur unwanted latency.
> Three other options are (i) to spawn a specialized process to do it,
> which perhaps can be signalled by a client-connected process that
> executes a scan and notices the need to run summarization; or (ii) to
> let autovacuum do it, as a separate new maintenance task.  This seems
> simple enough to bolt on top of already existing autovacuum
> infrastructure.  The timing constraints of autovacuum might be
> undesirable, though.  (iii) wait for user command.
>
> 
> The easiest way to go around this seems to have vacuum do it.  That
> way we can simply do re-summarization on the amvacuumcleanup routine.
> Other answers would mean we need a separate AM routine, which appears
> unwarranted at this stage.

I don't think this should be added to the autovacuum daemon. It's quite
tricky to tune autovacuum to be aggressive just enough, i.e. not to run
too frequently / not to lag. I'm afraid this would add a task consuming
unpredictable amounts of time, which would make the autovacuum tuning
even trickier.

I can live with non-summarized minmax indexes, but I can't live with
non-vacuumed tables.

> Open questions --------------
> 
> * Same-size page ranges? Current related literature seems to consider
> that each "index entry" in a minmax index must cover the same number
> of pages.  There doesn't seem to be a hard reason for this to be so;
> it might make sense to allow the index to self-tune so that some
> index entries cover smaller page ranges, if this allows the
> min()/max() values to be more compact.  This would incur larger space
> overhead for the index itself, but might allow better pruning of page
> ranges during scan.  In the limit of one index tuple per page, the
> index itself would occupy too much space, even though we would be 
> able to skip reading the most heap pages, because the min()/max() 
> ranges are tight; in the opposite limit of a single tuple that 
> summarizes the whole table, we wouldn't be able to prune anything
> from the seqscan even though the index is very small.

I see no particular reason not to allow that, and having variable range
size would be a very nice feature IMHO.

Do you have an idea on how the self-tuning might work?

I was thinking about something like this:

(1) estimate the initial range size, trying to keep good "selectivity"
while not having very small ranges, using
  (a) average row length
  (b) number of distinct values in the column
  (c) MCV/histogram/correlation

(2) once the index is built, running a "merge" on the ranges - merging
the adjacent pages if the resulting selectivity is not significantly
worse (again, this might be evaluated using MCV, histogram)
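
For step (2), I picture something like this sketch (all names invented):

    /* Greedily merge adjacent ranges while selectivity stays acceptable. */
    i = 0;
    while (i + 1 < nranges)
    {
        if (merged_selectivity(&ranges[i], &ranges[i + 1]) >=
            MERGE_SELECTIVITY_THRESHOLD)
        {
            merge_into(&ranges[i], &ranges[i + 1]);  /* widen min/max */
            delete_range(ranges, &nranges, i + 1);
            /* stay at i: the merged range may absorb the next one too */
        }
        else
            i++;
    }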

But it should be possible to override this (initial range size, max
range size) or disable the heuristics completely, as having large ranges
probably makes the resummarizing more expensive. Although having the
counter might fix that, as it's more likely there's another row with
the min/max value.

BTW could this be used to pre-sort the table, so that the actual sort
algorithm would start with a reasonably sorted set of tuples? I.e.
sorting the index by (min,max) vector and then returning the table pages
in this order? I think it might be even possible to efficiently split
the table into multiple parts T(0), T(1), T(2), ..., T(n) where all rows
in T(i) are smaller than T(i+1).

This in turn would make it possible to "partition" the table for
aggregates - either running them in parallel or in sequence (i.e. not
running into OOM with HashAggregate). Yes, I know supporting this is far
in the future, I'm just "brainstorming" here ;-)

regards
Tomas



Re: [RFC] Minmax indexes

From: Bruce Momjian
On Sat, Jun 15, 2013 at 11:39:23AM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
> >> If we're going to start adding reloptions for specific table behavior,
> >> I'd rather think of all of the optimizations we might have for a
> >> prospective "append-only table" and bundle those, rather than tying it
> >> to whether a certain index exists or not.
> 
> > I agree that the FSM behaviour shouldn't be linked to index existence.
> > IMHO that should be a separate table parameter, WITH (fsm_mode = append)
> > Index only scans would also benefit from that.
> 
> -1 ... I cannot believe that such a parameter would ever get turned on
> in production by anyone.  If your table has a significant update rate,
> the resulting table bloat would make such behavior completely
> infeasible.  If you have few enough updates to make such a behavior
> practical, then you can live with the expensive index updates instead.

Can you have pages that are receiving updates _not_ track min/max, until
the page is nearly full?  This would require full scans of such pages,
but there might be few of them.  The amount of free space on the page
as reported by FSM might be useful here.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com
+ It's impossible for everything to be true. +



Re: [RFC] Minmax indexes

From: Simon Riggs
On 25 June 2013 00:51, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Jun 15, 2013 at 11:39:23AM -0400, Tom Lane wrote:
> > Simon Riggs <simon@2ndQuadrant.com> writes:
> > > On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
> > >> If we're going to start adding reloptions for specific table behavior,
> > >> I'd rather think of all of the optimizations we might have for a
> > >> prospective "append-only table" and bundle those, rather than tying it
> > >> to whether a certain index exists or not.
> >
> > > I agree that the FSM behaviour shouldn't be linked to index existence.
> > > IMHO that should be a separate table parameter, WITH (fsm_mode = append)
> > > Index only scans would also benefit from that.
> >
> > -1 ... I cannot believe that such a parameter would ever get turned on
> > in production by anyone.  If your table has a significant update rate,
> > the resulting table bloat would make such behavior completely
> > infeasible.  If you have few enough updates to make such a behavior
> > practical, then you can live with the expensive index updates instead.
>
> Can you have pages that are receiving updates _not_ track min/max, until
> the page is nearly full?  This would require full scans of such pages,
> but there might be few of them.  The amount of free space on the page
> as reported by FSM might be useful here.

Yes, that is the proposal. Just like index only scans. 

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [RFC] Minmax indexes

From: Alvaro Herrera
I just noticed that this patch was closed with "returned with feedback"
in the commitfest app.  This is good, IMV -- it's saying that the
opinion of the various people commenting on the thread is positive, and
therefore no more discussion is currently needed.

I will post an actual patch to CF2, at which point more detailed
discussion can be had.

Thanks everyone for their feedback.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From: Jim Nasby
On 6/17/13 3:38 PM, Josh Berkus wrote:
>>> Why?  Why can't we just update the affected pages in the index?
>> >
>> >The page range has to be scanned in order to find out the min/max values
>> >for the indexed columns on the range; and then, with these data, update
>> >the index.
> Seems like you could incrementally update the range, at least for
> inserts.  If you insert a row which doesn't decrease the min or increase
> the max, you can ignore it, and if it does increase/decrease, you can
> change the min/max.  No?
>
> For updates, things are more complicated.  If the row you're updating
> was the min/max, in theory you should update it to adjust that, but you
> can't verify that it was the ONLY min/max row without doing a full scan.
>   My suggestion would be to add a "dirty" flag which would indicate that
> that block could use a rescan next VACUUM, and otherwise ignore changing
> the min/max.  After all, the only defect to having min to low or max too
> high for a block would be scanning too many blocks.  Which you'd do
> anyway with it marked "invalid".

If we add a dirty flag it would probably be wise to allow for more than one
value so we can do a clock-sweep. That would allow for detecting a range that
is getting dirtied repeatedly and not bother to try and re-summarize it until
later.

Something else I don't think was mentioned... re-summarization should be
somehow tied to access activity: if a query will need to seqscan a segment
that needs to be summarized, we should take that opportunity to summarize at
the same time while pages are in cache. Maybe that can be done in the backend
itself; maybe we'd want a separate process.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: [RFC] Minmax indexes

From: Claudio Freire
On Fri, Jun 28, 2013 at 2:18 PM, Jim Nasby <jim@nasby.net> wrote:
> On 6/17/13 3:38 PM, Josh Berkus wrote:
>>>>
>>>> Why?  Why can't we just update the affected pages in the index?
>>>
>>> >
>>> >The page range has to be scanned in order to find out the min/max values
>>> >for the indexed columns on the range; and then, with these data, update
>>> >the index.
>>
>> Seems like you could incrementally update the range, at least for
>> inserts.  If you insert a row which doesn't decrease the min or increase
>> the max, you can ignore it, and if it does increase/decrease, you can
>> change the min/max.  No?
>>
>> For updates, things are more complicated.  If the row you're updating
>> was the min/max, in theory you should update it to adjust that, but you
>> can't verify that it was the ONLY min/max row without doing a full scan.
>>   My suggestion would be to add a "dirty" flag which would indicate that
>> that block could use a rescan next VACUUM, and otherwise ignore changing
>> the min/max.  After all, the only defect to having min to low or max too
>> high for a block would be scanning too many blocks.  Which you'd do
>> anyway with it marked "invalid".
>
>
> If we add a dirty flag it would probably be wise to allow for more than one
> value so we can do a clock-sweep. That would allow for detecting a range
> that is getting dirtied repeatedly and not bother to try and re-summarize it
> until later.
>
> Something else I don't think was mentioned... re-summarization should be
> somehow tied to access activity: if a query will need to seqscan a segment
> that needs to be summarized, we should take that opportunity to summarize at
> the same time while pages are in cache. Maybe that can be done in the
> backend itself; maybe we'd want a separate process.


This smells a lot like hint bits and all the trouble they bring.



Re: [RFC] Minmax indexes

From: Jim Nasby
On 6/28/13 12:26 PM, Claudio Freire wrote:
> On Fri, Jun 28, 2013 at 2:18 PM, Jim Nasby <jim@nasby.net> wrote:
>> On 6/17/13 3:38 PM, Josh Berkus wrote:
>>>>> Why?  Why can't we just update the affected pages in the index?
>>>>
>>>> The page range has to be scanned in order to find out the min/max values
>>>> for the indexed columns on the range; and then, with these data, update
>>>> the index.
>>>
>>> Seems like you could incrementally update the range, at least for
>>> inserts.  If you insert a row which doesn't decrease the min or increase
>>> the max, you can ignore it, and if it does increase/decrease, you can
>>> change the min/max.  No?
>>>
>>> For updates, things are more complicated.  If the row you're updating
>>> was the min/max, in theory you should update it to adjust that, but you
>>> can't verify that it was the ONLY min/max row without doing a full scan.
>>>    My suggestion would be to add a "dirty" flag which would indicate that
>>> that block could use a rescan next VACUUM, and otherwise ignore changing
>>> the min/max.  After all, the only defect to having min too low or max too
>>> high for a block would be scanning too many blocks.  Which you'd do
>>> anyway with it marked "invalid".
>>
>>
>> If we add a dirty flag it would probably be wise to allow for more than one
>> value so we can do a clock-sweep. That would allow for detecting a range
>> that is getting dirtied repeatedly and not bother to try and re-summarize it
>> until later.
>>
>> Something else I don't think was mentioned... re-summarization should be
>> somehow tied to access activity: if a query will need to seqscan a segment
>> that needs to be summarized, we should take that opportunity to summarize at
>> the same time while pages are in cache. Maybe that can be done in the
>> backend itself; maybe we'd want a separate process.
>
>
> This smells a lot like hint bits and all the trouble they bring.

Possibly; though if that's the case, just having a second process do the work
would probably eliminate most of that problem.
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: [RFC] Minmax indexes

From
Josh Berkus
Date:
>> On Fri, Jun 28, 2013 at 2:18 PM, Jim Nasby <jim@nasby.net> wrote:
>>> If we add a dirty flag it would probably be wise to allow for more
>>> than one
>>> value so we can do a clock-sweep. That would allow for detecting a range
>>> that is getting dirtied repeatedly and not bother to try and
>>> re-summarize it
>>> until later.

Given that I'm talking about doing the resummarization at VACUUM time, I
don't think that's justified.  That only makes sense if you're doing the
below ...

>>> Something else I don't think was mentioned... re-summarization should be
>>> somehow tied to access activity: if a query will need to seqscan a
>>> segment
>>> that needs to be summarized, we should take that opportunity to
>>> summarize at
>>> the same time while pages are in cache. Maybe that can be done in the
>>> backend itself; maybe we'd want a separate process.

Writing at SELECT time is something our users hate, with significant
justification.  This suggestion is possibly a useful optimization, but
I'd like to see the simplest possible version of minmax indexes go into
9.4, so that we can see how they work in practice before we start adding
complicated performance optimizations involving extra write overhead and
background workers.  We're more likely to engineer an expensive
de-optimization that way.

I wouldn't be unhappy to have the very basic minmax indexes -- the one
where we just flag pages as invalid if they get updated -- go into 9.4.
That would still work for large, WORM tables, which are a significant
and pervasive use case.  If Alvaro gets to it, I think the "updating the
range, rescan at VACUUM time" approach isn't much more complex, but if
he doesn't I'd still vote to accept the feature.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [RFC] Minmax indexes

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> This is a preliminary proposal for Minmax indexes.  I'm experimenting
> with the code, but it's too crude to post yet, so here's a document
> explaining what they are and how they work, so that reviewers can poke
> holes to have the design improved.

After going further with the implementation, I have added a new concept,
the reverse range map.

Reverse Range Map
-----------------

The reverse range map is a separate fork that each Minmax index has.  Its
purpose is to provide a way to quickly find the index tuple corresponding to
a page range; for a given heap page number, there's an O(1) way to obtain the
TID of the corresponding index tuple.

To scan the index, we first obtain the TID of the index tuple for page 0.  If
this returns a valid TID, we read that tuple to determine the min/max bounds
for this page range.  If it returns invalid, then the range is unsummarized,
and the scan must return the whole range as needing scan.  After this
index entry has been processed, we obtain the TID of the index tuple for
page 0+pagesPerRange (currently this is a compile-time constant, but
there's no reason this cannot be a per-index property).  Continue adding
pagesPerRange until we reach the end of the heap.
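
For illustration, assuming the revmap is a simple array of TID slots packed
consecutively across its pages (the on-disk layout isn't final; the constants
and names below are made up), both the O(1) lookup and the scan's stepping
reduce to simple arithmetic:

#include <stdint.h>

/* Illustrative only: the real code uses BlockNumber/ItemPointerData, and
 * pagesPerRange may eventually become a per-index property. */
typedef uint32_t BlockNumber;

#define PAGES_PER_RANGE         128   /* heap pages summarized per index tuple */
#define REVMAP_ITEMS_PER_PAGE  1300   /* TID slots that fit on one revmap page */

/* O(1): find the revmap page and slot holding the TID for heapBlk's range. */
static void
revmap_locate(BlockNumber heapBlk,
              BlockNumber *revmapBlk, uint32_t *revmapIdx)
{
    BlockNumber rangeNo = heapBlk / PAGES_PER_RANGE;

    *revmapBlk = rangeNo / REVMAP_ITEMS_PER_PAGE;
    *revmapIdx = rangeNo % REVMAP_ITEMS_PER_PAGE;
}

/* An index scan just walks the ranges: 0, pagesPerRange, 2*pagesPerRange... */
static void
scan_all_ranges(BlockNumber heapNblocks)
{
    BlockNumber heapBlk;

    for (heapBlk = 0; heapBlk < heapNblocks; heapBlk += PAGES_PER_RANGE)
    {
        BlockNumber revmapBlk;
        uint32_t    revmapIdx;

        revmap_locate(heapBlk, &revmapBlk, &revmapIdx);
        /* read the TID at that slot; invalid => range is unsummarized */
    }
}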

To read the TID during an index scan, we follow this protocol (a sketch in C
follows the list):

* read revmap page
* obtain share lock on the buffer
* read the TID
* LockTuple that TID (using the index as relation).  A shared lock is
  sufficient.  We need the LockTuple to prevent VACUUM from recycling the
  index tuple; see below.
* release buffer lock
* read the index tuple
* release the tuple lock
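
In rough C terms, the reader side could look like the following; the buffer
and lock calls are existing backend primitives, while revmap_get_slot() is a
hypothetical accessor standing in for code that doesn't exist yet:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* Hypothetical accessor: returns the TID slot for heapBlk's range. */
extern ItemPointer revmap_get_slot(Page page, BlockNumber heapBlk);

static bool
minmax_read_summary_tid(Relation idxrel, BlockNumber revmapBlk,
                        BlockNumber heapBlk, ItemPointer out)
{
    Buffer      buf = ReadBuffer(idxrel, revmapBlk);

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    *out = *revmap_get_slot(BufferGetPage(buf), heapBlk);

    /*
     * Lock the tuple before dropping the buffer lock, so that VACUUM cannot
     * recycle the index tuple underneath us.  A shared lock is enough.
     */
    if (ItemPointerIsValid(out))
        LockTuple(idxrel, out, ShareLock);

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);

    /*
     * The caller now reads the index tuple at *out, and afterwards releases
     * it with UnlockTuple(idxrel, out, ShareLock).
     */
    return ItemPointerIsValid(out);
}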

On heap insert/update, it is normally cheaper to update the summary tuple
(grab the old tuple, expand the min/max range according to the new value
being inserted, insert the new index tuple, update the reverse range map)
than to let the range become unsummarized, which would require scanning the
pages in the range.

[However, when many updates on a range are going to be done, it'd be
better to mark it as unsummarized (i.e. set the revmap TID to Invalid)
and do a resummarization later, to prevent the index from bloating.
Also, it'd be sensible to allow postponing of the index update, if many
tuples are inserted; something like GIN's "pending list".  We would need
to keep track of the TIDs of the inserted heap tuples, so that we can
figure out the new min/max values, without having to scan the whole
range.]

To update the summary tuple for a page range, we use this protocol (sketched
below):

* insert a new index tuple anywhere; note its TID
* read revmap page
* obtain exclusive lock on buffer
* write the TID
* release lock

This ensures no concurrent reader can obtain a partially-written TID.
Note we don't need a tuple lock here.  Concurrent scans don't have to
worry about whether they got the old or new index tuple: if they get the
old one, the tighter values are okay from a correctness standpoint: due to
MVCC they can't possibly see the just-inserted heap tuples anyway.
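
The writer side, again as a rough sketch: the locking calls are existing
infrastructure, while minmax_insert_index_tuple() and revmap_set_slot() are
hypothetical placeholders for code yet to be written:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

/* Hypothetical helpers standing in for the real insertion/slot-write code. */
extern ItemPointerData minmax_insert_index_tuple(Relation idxrel,
                                                 Datum *min, Datum *max);
extern void revmap_set_slot(Page page, BlockNumber heapBlk,
                            ItemPointerData tid);

static void
minmax_update_summary(Relation idxrel, BlockNumber revmapBlk,
                      BlockNumber heapBlk, Datum *min, Datum *max)
{
    /* Insert the new index tuple anywhere; note its TID. */
    ItemPointerData newtid = minmax_insert_index_tuple(idxrel, min, max);
    Buffer      buf = ReadBuffer(idxrel, revmapBlk);

    /*
     * Swap the revmap pointer under an exclusive buffer lock, so that no
     * concurrent reader can observe a partially-written TID.  No tuple
     * lock is needed on the write side.
     */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    revmap_set_slot(BufferGetPage(buf), heapBlk, newtid);
    MarkBufferDirty(buf);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);

    /* The old index tuple is now unreferenced; a later VACUUM reclaims it. */
}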

For vacuuming, we need to figure out which index tuples are no longer
referenced from the reverse range map.  This requires some brute force,
but is simple (a sketch follows the list):

1) Scan the complete index, storing each existing TID in a dynahash.  The
   hash key is the TID; the hash value is a boolean initially set to false.

2) Scan the complete revmap sequentially, reading the TIDs on each page.  A
   share lock on each page is sufficient.  For each TID so obtained, grab
   the element from the hash and update the boolean to true.

3) Scan the index again; for each tuple found, search the hash table.  If
   the tuple is not present, it must have been added after our initial scan;
   ignore it.  If the hash flag is true, then the tuple is referenced;
   ignore it.  If the hash flag is false, then the index tuple is no longer
   referenced by the revmap; but it could be about to be accessed by a
   concurrent scan.  Do ConditionalLockTuple.  If this fails, ignore the
   tuple; it will be deleted by a future vacuum.  If the lock is acquired,
   then we can safely remove the index tuple.

4) Index pages with free space can be detected by this second scan.
   Register those with the FSM.
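
Putting the passes together as a sketch over the backend's dynahash;
collect_index_tids(), collect_revmap_tids() and remove_index_tuple() are
hypothetical stand-ins for the actual scanning code (the returned arrays are
simply leaked into the vacuum memory context here):

#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/lmgr.h"
#include "utils/hsearch.h"
#include "utils/rel.h"

/* Hypothetical scanning helpers; each returns a palloc'd TID array. */
extern int  collect_index_tids(Relation idxrel, ItemPointerData **tids);
extern int  collect_revmap_tids(Relation idxrel, ItemPointerData **tids);
extern void remove_index_tuple(Relation idxrel, ItemPointer tid);

typedef struct TidHashEntry
{
    ItemPointerData tid;        /* hash key: the index tuple's TID */
    bool            referenced; /* true if the revmap still points at it */
} TidHashEntry;

static void
minmax_vacuum_index(Relation idxrel)
{
    HASHCTL     ctl;
    HTAB       *seen;
    ItemPointerData *tids;
    int         ntids,
                i;
    bool        found;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(ItemPointerData);
    ctl.entrysize = sizeof(TidHashEntry);
    ctl.hash = tag_hash;
    seen = hash_create("minmax vacuum TIDs", 1024, &ctl,
                       HASH_ELEM | HASH_FUNCTION);

    /* 1) First index scan: remember every tuple, flag initially false. */
    ntids = collect_index_tids(idxrel, &tids);
    for (i = 0; i < ntids; i++)
    {
        TidHashEntry *e = hash_search(seen, &tids[i], HASH_ENTER, &found);

        e->referenced = false;
    }

    /* 2) Revmap scan (share-locked page by page): mark referenced TIDs. */
    ntids = collect_revmap_tids(idxrel, &tids);
    for (i = 0; i < ntids; i++)
    {
        TidHashEntry *e = hash_search(seen, &tids[i], HASH_FIND, &found);

        if (found)
            e->referenced = true;
    }

    /* 3) Second index scan: reap unreferenced tuples, if lockable. */
    ntids = collect_index_tids(idxrel, &tids);
    for (i = 0; i < ntids; i++)
    {
        TidHashEntry *e = hash_search(seen, &tids[i], HASH_FIND, &found);

        if (!found || e->referenced)
            continue;           /* new since pass 1, or still referenced */
        if (!ConditionalLockTuple(idxrel, &tids[i], ExclusiveLock))
            continue;           /* in use by a concurrent scan; next time */
        remove_index_tuple(idxrel, &tids[i]);
        UnlockTuple(idxrel, &tids[i], ExclusiveLock);
    }

    hash_destroy(seen);
}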

Note this doesn't require scanning the heap at all, or being involved in
the heap's cleanup procedure.  (In particular, tables with only minmax
indexes do not require the removed tuples' TIDs to be collected during
the heap cleanup pass.)   XXX I think this is wrong in detail; we
probably need to be using LockBufferForCleanup.

With the reverse range map it is very easy to see which page ranges in
the heap need resummarization; it's the ones marked with InvalidTid.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [RFC] Minmax indexes

From
Josh Berkus
Date:
On 07/18/2013 10:39 PM, Alvaro Herrera wrote:
> To scan the index, we first obtain the TID of the index tuple for page 0.  If
> this returns a valid TID, we read that tuple to determine the min/max bounds
> for this page range.  If it returns invalid, then the range is unsummarized,
> and the scan must return the whole range as needing scan.  After this
> index entry has been processed, we obtain the TID of the index tuple for
> page 0+pagesPerRange (currently this is a compile-time constant, but
> there's no reason this cannot be a per-index property).  Continue adding
> pagesPerRange until we reach the end of the heap.

Conceptually, this sounds like a good initial solution to the update
problem.

I still think we could do incremental updates to the minmax indexes per
the idea I discussed, but that could be a later version.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [RFC] Minmax indexes

From
Noah Misch
Date:
On Fri, Jun 14, 2013 at 06:28:06PM -0400, Alvaro Herrera wrote:
> Partial indexes are not supported; since an index is concerned with minimum and
> maximum values of the involved columns across all the pages in the table, it
> doesn't make sense to exclude values.

It can make sense if the predicate references a column other than the indexed
column(s).  Unlike a partial btree index, a partial minmax index would be no
smaller.  It could have narrower min-max spreads, reducing the recheck work
done by queries.

> Another way to see "partial" indexes
> here would be those that only considered some pages in the table instead of all
> of them; but this would be difficult to implement and manage and, most likely,
> pointless.

That's a distinct feature from the AM-independent partial index mechanism, in
any event.

> Expressional indexes can probably be supported in the future, but we disallow
> them initially for conceptual simplicity.

Whether an index column uses an expression is irrelevant to each existing core
AM.  How does minmax differ in this respect?

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com