Thread: [RFC] Minmax indexes
Hi,

This is a preliminary proposal for Minmax indexes.  I'm experimenting
with the code, but it's too crude to post yet, so here's a document
explaining what they are and how they work, so that reviewers can poke
holes to have the design improved.  My intention is to have a patch to
show for CF2, so please do have a look at this and comment.

This is part of the AXLE project http://www.axleproject.eu and the
intention is to support tables of very large size.  In a quick
experiment, I have a table of ~12 GB and its corresponding index is
65 kB in size, making the time to do the equivalent of a seqscan a
small fraction of that taken by a real seqscan.  This technique sits
between a bitmap scan of a normal btree, and a seqscan: the minmax
index tells the bitmap heap scan what pages to seqscan, allowing it to
skip a large fraction of pages that are known not to contain tuples
matching the query quals.  This is a huge win for large data
warehouses.

Without further ado, here's what I propose.

Minmax Range Indexes
====================

Minmax indexes are a new access method intended to enable very fast
scanning of extremely large tables.

The essential idea of a minmax index is to keep track of the min() and
max() values in consecutive groups of heap pages (page ranges).  These
values can be used by constraint exclusion to avoid scanning such
pages, depending on query quals.

The main drawback of this is having to update the stored min/max values
of each page range as tuples are inserted into them.

Other database systems already have this feature.  Some examples:

* Oracle Exadata calls this "storage indexes"
  http://richardfoote.wordpress.com/category/storage-indexes/

* Netezza has "zone maps"
  http://nztips.com/2010/11/netezza-integer-join-keys/

* Infobright has this automatically within their "data packs"
  http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/

* MonetDB seems to have it
  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
  "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"

Grammar
-------

To create a minmax index, we use

  CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);

Partial indexes are not supported; since an index is concerned with
minimum and maximum values of the involved columns across all the pages
in the table, it doesn't make sense to exclude values.  Another way to
see "partial" indexes here would be those that only considered some
pages in the table instead of all of them; but this would be difficult
to implement and manage and, most likely, pointless.

Expressional indexes can probably be supported in the future, but we
disallow them initially for conceptual simplicity.

Having multiple minmax indexes in the same table is acceptable, though
most of the time it would make more sense to have a single index
covering all the interesting columns.  Multiple indexes might be useful
for columns added later.

Access Method Design
--------------------

Since item pointers are not stored inside indexes of this type, it is
not possible to support the amgettuple interface.  Instead, we only
provide amgetbitmap support; scanning a relation using this index
requires a recheck node on top.  The amgetbitmap routine would return a
TIDBitmap comprising all the pages in those page groups that comply
with the query quals; the recheck node prunes tuples that are not
visible per snapshot and those that are not visible per query quals.
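To make the pruning decision concrete, here is a minimal C sketch of
the per-range consistency check the amgetbitmap routine implies.  It
assumes a single int64 key column and btree-style strategies; all names
are hypothetical, not the patch's actual code:

    /*
     * Illustrative sketch only -- not the patch's actual AM code.  Given a
     * query qual (btree-style strategy plus comparison value) and the
     * stored min/max of a page range, decide whether the range might
     * contain matching tuples.  Assumes a single int64 key column.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { STRAT_LT, STRAT_LE, STRAT_EQ, STRAT_GE, STRAT_GT } Strategy;

    typedef struct RangeSummary
    {
        int64_t min;            /* smallest value in the page range */
        int64_t max;            /* largest value in the page range */
    } RangeSummary;

    /* Return true if the range must be included in the TIDBitmap. */
    static bool
    range_matches_qual(const RangeSummary *r, Strategy strat, int64_t arg)
    {
        switch (strat)
        {
            case STRAT_LT: return r->min <  arg;
            case STRAT_LE: return r->min <= arg;
            case STRAT_EQ: return r->min <= arg && arg <= r->max;
            case STRAT_GE: return r->max >= arg;
            case STRAT_GT: return r->max >  arg;
        }
        return true;            /* unknown strategy: never prune */
    }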
For each supported datatype, we need an opclass with the following
catalog entries:

- support functions (pg_amproc)
  * pg_proc entry for min()
  * pg_proc entry for max()
- support operators (pg_amop): same as btree (<, <=, =, >=, >)

The min() and max() support functions are used during index
construction.  The support operators are used in the optimizer, so that
the index is chosen when queries on the indexed table are planned.
(Also, we use them in the amgetbitmap routine, to compare ScanKeys and
decide whether to emit a certain block or not).

In each index tuple (corresponding to one page range), we store:
- first block this tuple applies to
- last block this tuple applies to
- for each indexed column:
  * min() value across all tuples in the range
  * max() value across all tuples in the range
  * nulls present in any tuple?

With the default INDEX_MAX_KEYS of 32, and considering columns of
8-byte length types (timestamptz, bigint), each tuple would be 524
bytes in length, which seems reasonable.  Of course, larger columns are
possible, such as varchar, but creating minmax indexes on such columns
seems of little practical usefulness.  This maximum index tuple size is
calculated as:

  BlockNumber (4 bytes) * 2 + data value (8 bytes) * 32 * 2 + null bitmap (4 bytes)

Block ranges mentioned in index entries shouldn't overlap.  However,
there can be gaps where some pages have no covering index entry.  (In
particular, the last few pages of the table would commonly not be
summarized.)

In order to scan a table, a sequential scan of the index is done; for
each tuple for which the range limits (min/max) are proven to be
required to be scanned by the ScanKeys, we obtain the block range
covered by that tuple, and return a lossy TIDBitmap of pages in the
page range; any unsummarized pages and invalid page ranges (see below)
must also be returned.

Note that we store pg_proc entries for aggregate functions.  It is the
responsibility of the access method to obtain the pg_aggregate rows
starting from the pg_proc rows; we avoid the parser code that deals
with aggregates, because the mechanism it uses to resolve from parser
to executor starts from aggregate names; this is not useful for our
purposes.  Also, running the aggregate is not done through the executor
support for aggregates either, because that code is very specialized to
running inside a node with a child node feeding tuples, which
IndexBuildHeapScan cannot do.  Instead, the AM is in charge of
initializing state and running the transition functions for each new
tuple read by the index-build heap scan.

Data changes in the heap
------------------------

Value changes in columns that are part of a minmax index, and tuple
insertion in summarized pages, would invalidate the stored min/max
values.  To support this, each minmax index has a validity map; a range
can only be considered in a scan if it hasn't been invalidated by such
changes (A range "not considered" in the scan needs to be returned in
whole regardless of the stored min/max values, that is, it cannot be
pruned per query quals).  The validity map is very similar to the
visibility map in terms of performance characteristics: quick enough
that it's not contentious, allowing updates and insertions to proceed
even when data values violate the minmax index conditions.  An
invalidated range can be made valid by re-summarization (see below).

Re-summarization is relatively expensive, because the complete page
range has to be scanned.
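As a rough illustration of the validity map's role, here is a
hypothetical sketch of the one-bit-per-range bookkeeping; real code
would live in a relation fork with buffer locking, which is elided
here, and all names are invented:

    /*
     * Hypothetical sketch of the validity map: one bit per page range,
     * where 1 means the stored min/max summary may be trusted.  Buffer
     * locking and WAL are omitted; not the proposed on-disk format.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct ValidityMap
    {
        uint8_t *bits;          /* one bit per page range */
        uint32_t nranges;
    } ValidityMap;

    /* An insert/update that violates the summary invalidates the range. */
    static void
    vm_invalidate(ValidityMap *vm, uint32_t range)
    {
        vm->bits[range / 8] &= ~(1 << (range % 8));
    }

    /* Re-summarization makes the range usable for pruning again. */
    static void
    vm_set_valid(ValidityMap *vm, uint32_t range)
    {
        vm->bits[range / 8] |= (1 << (range % 8));
    }

    /* Scans may prune a range only if its bit is set. */
    static bool
    vm_is_valid(const ValidityMap *vm, uint32_t range)
    {
        return (vm->bits[range / 8] & (1 << (range % 8))) != 0;
    }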
To avoid this re-summarization cost, a table having a minmax index
would be configured so that inserts only go to the page(s) at the end
of the table; this avoids frequent invalidation of ranges in the middle
of the table.  We provide a table reloption that tweaks the FSM
behavior, so that summarized pages are not candidates for insertion.

A possible optimization is that whenever an update does not modify the
values covered by minmax indexes, or whenever the value changes are not
outside the min()/max() interval for the corresponding page range, the
page range does not need to be marked invalid.  This is particularly
useful when considering same-page updates (i.e. updates which find
enough empty space in the same page that the new tuple fits in it), and
HOT (which frees space used by previous updates).  This suggests we
ought to recommend lower-than-100 fillfactor values to permit these
features to kick in.

When tuples are added to unsummarized pages, nothing needs to happen to
the index.

Tuples can be removed from anywhere without restriction.  See below for
more on vacuuming.

Summarization
-------------

At index creation time, the whole table is scanned; for each page range
the min() and max() values and nulls bitmap are collected, and stored
in the index.  The possibly-incomplete range at the end of the table is
also included.

Once in a while, it is necessary to summarize a bunch of unsummarized
pages (because the table has grown since the index was created), or
re-summarize a range that has been marked invalid.  This is simple:
scan the page range calculating the min() and max() for each indexed
column, then insert the new index entry at the end of the index.  The
main interesting questions are:

a) when to do it
   The perfect time to do it is as soon as a complete page range of the
   configured range size has been filled (assuming page ranges are
   constant size).

b) who does it (what process)
   It doesn't seem a good idea to have a client-connected process do
   it; it would incur unwanted latency.  Three other options are (i) to
   spawn a specialized process to do it, which perhaps can be signalled
   by a client-connected process that executes a scan and notices the
   need to run summarization; or (ii) to let autovacuum do it, as a
   separate new maintenance task.  This seems simple enough to bolt on
   top of already existing autovacuum infrastructure.  The timing
   constraints of autovacuum might be undesirable, though.  (iii) wait
   for user command.

The easiest way to go around this seems to be to have vacuum do it.
That way we can simply do re-summarization on the amvacuumcleanup
routine.  Other answers would mean we need a separate AM routine, which
appears unwarranted at this stage.

Vacuuming
---------

Vacuuming a table that has a minmax index does not represent a
significant challenge.  Since no CTIDs are stored, it's not necessary
to scan the index when heap tuples are removed.  It might be that some
min() value can be incremented, or some max() value can be decremented;
but this would represent an optimization opportunity only, not a
correctness issue.  Perhaps it's simpler to represent this as the need
to re-run summarization on the affected page range.

Note that if there are no indexes on the table other than the minmax
index, usage of maintenance_work_mem by vacuum can be decreased
significantly, because no detailed index scan needs to take place (and
thus it's not necessary for vacuum to save TIDs to remove).  This
optimization opportunity is best left for future improvement.
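Both the initial summarization and re-summarization described above
reduce to folding every heap value of a range into a running min/max.
A minimal sketch, assuming a single int64 column and hypothetical
names (the real AM would run the opclass support functions instead):

    /*
     * Sketch of the per-range fold performed by (re)summarization and by
     * the index-build heap scan.  Assumes one int64 key column; names
     * and layout are illustrative, not the on-disk index tuple format.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef struct MinmaxTuple
    {
        BlockNumber first_block;    /* first heap block of the range */
        BlockNumber last_block;     /* last heap block of the range */
        int64_t     min;            /* running minimum for the column */
        int64_t     max;            /* running maximum for the column */
        bool        has_nulls;      /* any NULL seen in the range? */
        bool        empty;          /* no non-NULL value seen yet */
    } MinmaxTuple;

    /* Transition step: fold one heap value into the range summary. */
    static void
    minmax_transition(MinmaxTuple *t, int64_t value, bool isnull)
    {
        if (isnull)
        {
            t->has_nulls = true;
            return;
        }
        if (t->empty)
        {
            t->min = t->max = value;
            t->empty = false;
            return;
        }
        if (value < t->min)
            t->min = value;
        if (value > t->max)
            t->max = value;
    }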
Optimizer
---------

In order to make this all work, the only thing we need to do is ensure
we have a good enough opclass and amcostestimate.  With this, the
optimizer is able to pick up the index on its own.

Open questions
--------------

* Same-size page ranges?
  Current related literature seems to consider that each "index entry"
  in a minmax index must cover the same number of pages.  There doesn't
  seem to be a hard reason for this to be so; it might make sense to
  allow the index to self-tune so that some index entries cover smaller
  page ranges, if this allows the min()/max() values to be more
  compact.  This would incur larger space overhead for the index
  itself, but might allow better pruning of page ranges during scan.
  In the limit of one index tuple per page, the index itself would
  occupy too much space, even though we would be able to skip reading
  most of the heap pages, because the min()/max() ranges are tight; in
  the opposite limit of a single tuple that summarizes the whole table,
  we wouldn't be able to prune anything from the seqscan even though
  the index is very small.

* More compact representation for TIDBitmap?
  TIDBitmap is the structure currently used to represent bitmap scans.
  The representation of lossy page ranges is not optimal for our
  purposes, because it uses a Bitmapset to represent pages in the
  range; since we're going to return all pages in a large range, it
  might be more convenient to allow for a struct that instead uses
  start and end page numbers to represent the range.  This is mentioned
  in passing in tidbitmap.c comments, so perhaps the time has come for
  its implementation.  (In experimentation, tidbitmap has no trouble
  with ranges of a thousand pages.)

References:

Email thread on pgsql-hackers
  http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
  From: Simon Riggs
  To: pgsql-hackers
  Subject: Dynamic Partitioning using Segment Visibility Map

http://wiki.postgresql.org/wiki/Segment_Exclusion
http://wiki.postgresql.org/wiki/Segment_Visibility_Map

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
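As a footnote to the "More compact representation for TIDBitmap?" item
above: the start/end representation suggested there could look roughly
like this.  This is a hypothetical sketch, not the tidbitmap.c API:

    /*
     * Hypothetical start/end range representation for lossy bitmap
     * output: an entire page range costs 8 bytes rather than one bit
     * per page plus per-chunk overhead in a Bitmapset.
     */
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef struct LossyPageRange
    {
        BlockNumber start;      /* first heap page to scan, inclusive */
        BlockNumber end;        /* last heap page to scan, inclusive */
    } LossyPageRange;

    /* Visit every page in the range (the scan must recheck all of them). */
    static void
    range_foreach_page(const LossyPageRange *r, void (*visit)(BlockNumber))
    {
        for (BlockNumber blk = r->start; ; blk++)
        {
            visit(blk);
            if (blk == r->end)  /* avoids overflow when end == UINT32_MAX */
                break;
        }
    }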
Alvaro,

This sounds really interesting, and I can see the possibilities.
However ...

> Value changes in columns that are part of a minmax index, and tuple insertion
> in summarized pages, would invalidate the stored min/max values.  To support
> this, each minmax index has a validity map; a range can only be considered in a
> scan if it hasn't been invalidated by such changes (A range "not considered" in
> the scan needs to be returned in whole regardless of the stored min/max values,
> that is, it cannot be pruned per query quals).  The validity map is very
> similar to the visibility map in terms of performance characteristics: quick
> enough that it's not contentious, allowing updates and insertions to proceed
> even when data values violate the minmax index conditions.  An invalidated
> range can be made valid by re-summarization (see below).

This begins to sound like these indexes are only useful on append-only
tables.  Not that there aren't plenty of those, but ...

> Re-summarization is relatively expensive, because the complete page range has
> to be scanned.

Why?  Why can't we just update the affected pages in the index?

> To avoid this, a table having a minmax index would be
> configured so that inserts only go to the page(s) at the end of the table; this
> avoids frequent invalidation of ranges in the middle of the table.  We provide
> a table reloption that tweaks the FSM behavior, so that summarized pages are
> not candidates for insertion.

We haven't had an index type which modifies table insertion behavior
before, and I'm not keen to start now; imagine having two indexes on
the same table each with their own, conflicting, requirements.  This is
sounding a lot more like a candidate for our prospective pluggable
storage manager.  Also, the above doesn't help us at all with UPDATEs.

If we're going to start adding reloptions for specific table behavior,
I'd rather think of all of the optimizations we might have for a
prospective "append-only table" and bundle those, rather than tying it
to whether a certain index exists or not.

Also, I hate the name ... if this feature goes ahead, I'm going to be
lobbying to change it.  But that's pretty minor compared to the update
issues.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
>> To avoid this, a table having a minmax index would be
>> configured so that inserts only go to the page(s) at the end of the table; this
>> avoids frequent invalidation of ranges in the middle of the table.  We provide
>> a table reloption that tweaks the FSM behavior, so that summarized pages are
>> not candidates for insertion.

> We haven't had an index type which modifies table insertion behavior
> before, and I'm not keen to start now; imagine having two indexes on the
> same table each with their own, conflicting, requirements.

I agree; such a restriction is a nonstarter for a secondary index.
I don't believe that hacking the FSM would be sufficient to guarantee
the required behavior, either.

We've talked a lot about index-organized tables in the past.  How much
of the use case for this would be subsumed by a feature like that?

			regards, tom lane
On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Re-summarization is relatively expensive, because the complete page range has
> to be scanned.

That doesn't sound too bad to me.  It just means there's a downside to
having larger page ranges.  I would expect the page ranges to be
something in the ballpark of 32 pages -- scanning 32 pages to
resummarize doesn't sound that painful, but sounds like it's large
enough that the resulting index would be a reasonable size.

But I don't understand why an insert would invalidate a tuple.  An
insert can just update the min and max incrementally.  It's a delete
that invalidates the range, but as you note it doesn't really
invalidate it, just marks it as needing a refresh -- and even then only
if the value being deleted is equal to either the min or max.

> Same-size page ranges?
> Current related literature seems to consider that each "index entry" in a
> minmax index must cover the same number of pages. There doesn't seem to be a

I assume the reason for this in the literature is the need to quickly
find the summary for a given page when you're handling an insert or
delete.  If you have some kind of metadata structure that lets you find
it (which I gather is what the validity map is?) then you wouldn't need
it.  But that seems like a difficult cost to justify compared to just
having a 1:1 mapping from block to bitmap tuple.

--
greg
On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
> Alvaro,
>
> This sounds really interesting, and I can see the possibilities.
> However ...
>
>> Value changes in columns that are part of a minmax index, and tuple insertion
>> in summarized pages, would invalidate the stored min/max values.  To support
>> this, each minmax index has a validity map; a range can only be considered in a
>> scan if it hasn't been invalidated by such changes (A range "not considered" in
>> the scan needs to be returned in whole regardless of the stored min/max values,
>> that is, it cannot be pruned per query quals).  The validity map is very
>> similar to the visibility map in terms of performance characteristics: quick
>> enough that it's not contentious, allowing updates and insertions to proceed
>> even when data values violate the minmax index conditions.  An invalidated
>> range can be made valid by re-summarization (see below).
>
> This begins to sound like these indexes are only useful on append-only
> tables.  Not that there aren't plenty of those, but ...

The index is basically using the "index only scan" mechanism.  The
"only useful on append-only tables" comment would/should apply also to
index only scans.  I can't see a reason to raise that specifically for
this index type.

>> Re-summarization is relatively expensive, because the complete page range has
>> to be scanned.
>
> Why?  Why can't we just update the affected pages in the index?

Again, same thing as index-only scans.  For IOS, we reset the
visibility info at vacuum.  The route proposed here follows exactly the
same timing, same mechanism.  I can't see a reason for any difference
between the two.

>> To avoid this, a table having a minmax index would be
>> configured so that inserts only go to the page(s) at the end of the table; this
>> avoids frequent invalidation of ranges in the middle of the table.  We provide
>> a table reloption that tweaks the FSM behavior, so that summarized pages are
>> not candidates for insertion.
>
> We haven't had an index type which modifies table insertion behavior
> before, and I'm not keen to start now; imagine having two indexes on the
> same table each with their own, conflicting, requirements.  This is
> sounding a lot more like a candidate for our prospective pluggable
> storage manager.  Also, the above doesn't help us at all with UPDATEs.
>
> If we're going to start adding reloptions for specific table behavior,
> I'd rather think of all of the optimizations we might have for a
> prospective "append-only table" and bundle those, rather than tying it
> to whether a certain index exists or not.

I agree that the FSM behaviour shouldn't be linked to index existence.
IMHO that should be a separate table parameter, WITH (fsm_mode = append)

Index only scans would also benefit from that.

> Also, I hate the name ... if this feature goes ahead, I'm going to be
> lobbying to change it.  But that's pretty minor compared to the update
> issues.

This feature has already had 3 different names.  I don't think the name
is crucial, but it makes sense to give it a name up front.  So if you
want to lobby for that then you'd need to come up with a name soon, so
poor Alvaro can cope with name #4.  (There's no consistency in naming
from any other implementation either).

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15 June 2013 00:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> We've talked a lot about index-organized tables in the past.  How much
> of the use case for this would be subsumed by a feature like that?

IOTs would only work for one specific organisation, acting like a
multi-column btree.  So you could use IOTs for this, but only in a
limited way.

MinMax indexes cover multiple columns, possibly even all columns if
desired.  So MMIs have much greater usefulness, as well as not
requiring a full reload of the data when you decide you want one.

Personally, I think IOTs are not worth pursuing.  IOTs for Postgres
would require major changes to the definition of the heap, so we're a
long way from implementing them directly and coping with all of the
detailed implementation horrors.  Grouped Item Tuple indexes are
theoretically equivalent in terms of size and number of block accesses
required to access data.  Heikki's earlier work on that could easily be
revived.  IOTs would be a multi-year effort with conflicting use cases.

Note that SQLServer and Sybase use block range indexes for primary
indexes, neatly avoiding the hassle Oracle walked into when they chose
to do IOTs.  I think we should learn that same lesson ourselves.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
>> If we're going to start adding reloptions for specific table behavior,
>> I'd rather think of all of the optimizations we might have for a
>> prospective "append-only table" and bundle those, rather than tying it
>> to whether a certain index exists or not.

> I agree that the FSM behaviour shouldn't be linked to index existence.
> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
> Index only scans would also benefit from that.

-1 ... I cannot believe that such a parameter would ever get turned on
in production by anyone.  If your table has a significant update rate,
the resulting table bloat would make such behavior completely
infeasible.  If you have few enough updates to make such a behavior
practical, then you can live with the expensive index updates instead.

I also find the analogy to index-only scans to be bogus, because those
didn't require any user tuning.

There's a nearby thread complaining bitterly about our willingness to
create hard-to-use, hard-to-tune "features".  In its current form,
this will be another one of those.

			regards, tom lane
On 15 June 2013 16:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
>>> If we're going to start adding reloptions for specific table behavior,
>>> I'd rather think of all of the optimizations we might have for a
>>> prospective "append-only table" and bundle those, rather than tying it
>>> to whether a certain index exists or not.
>
>> I agree that the FSM behaviour shouldn't be linked to index existence.
>> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>> Index only scans would also benefit from that.
>
> -1 ... I cannot believe that such a parameter would ever get turned on
> in production by anyone.

It's an extremely common use case, especially with larger databases.

> If your table has a significant update rate,
> the resulting table bloat would make such behavior completely
> infeasible.  If you have few enough updates to make such a behavior
> practical, then you can live with the expensive index updates instead.

Depends.  What I had in mind was that "append" mode would allow updates
normally within the last 1 GB of a table, but relocate updates should
they occur in earlier segments.  That would cover the majority of use
cases I see.

> I also find the analogy to index-only scans to be bogus, because those
> didn't require any user tuning.

I don't think anybody is suggesting there is a requirement for user
tuning on this type of index.  But I think having one parameter would
make it the same as, say, GIN indexes, which also have a single tuning
parameter.

I think the idea that index only scans "just work" is itself bogus.  A
mild random update pattern will reduce the benefit of IOS scans
considerably, which is hard to understand or control, especially
without any way of measuring it.  We could eliminate that problem if we
chose, but it's not even documented as a potential issue for some
reason, so it's hard to discuss.  Not being able to see or control a
known problem is not the same thing as "doesn't require user tuning".

> There's a nearby thread complaining bitterly about our willingness to
> create hard-to-use, hard-to-tune "features".  In its current form,
> this will be another one of those.

Not based upon anything mentioned here so far, or that I'm aware of.

I think it would be interesting to see some anonymous voting on feature
complaints, so we can see what people really think without needing to
risk offending the main authors of each of the various parts of our
software.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06/15/2013 11:39 PM, Tom Lane wrote:
> There's a nearby thread complaining bitterly about our willingness to
> create hard-to-use, hard-to-tune "features".  In its current form,
> this will be another one of those.

For what it's worth, I was going for "usability concerns", not
"complaint"... and I'd much rather have most of those features, even
with poor usability, than not have them at all.  I'm not convinced
anyone can introduce something that's beautiful design and
implementation perfection first time, every time.  Learning where the
pain points are and improving them is what I'm going for.

There's also, IMO, nothing wrong with somewhat obscure options for
obscure use cases and workloads.  The problem is when difficult to
understand or weird and obscure parameters need adjustment for
perfectly ordinary middle-of-the-road workloads.

I'm curious about using minmax indexes to guide insert/update
placement, so new tuples are written in regions of the table where they
already match the indexed range - essentially providing coarse-grained
auto-clustering.  Otherwise with updates a table is likely to see
scattering of values such that the minmax ranges all keep on expanding
and the index becomes less and less selective.  I realise that this may
not be even remotely feasible, but if it were then it might help reduce
the need for explicit tuning.  I suspect we'd also land up wanting
table optimisation down the track, where tuples were shuffled around
during autovacuum to narrow the minmax index ranges.

I'm not expressing an opinion on the option discussed here as I haven't
worked with enough DBs that'd benefit from it.  Personally I'm just
seriously impressed that this feature is coming together.

--
Craig Ringer                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>> I agree that the FSM behaviour shouldn't be linked to index existence.
>> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>> Index only scans would also benefit from that.
>
> -1 ... I cannot believe that such a parameter would ever get turned on
> in production by anyone.  If your table has a significant update rate,
> the resulting table bloat would make such behavior completely
> infeasible.  If you have few enough updates to make such a behavior
> practical, then you can live with the expensive index updates instead.

I'm also thinking that if a table is really append-only, then there are
never any middle-of-the-table pages in the FSM, no?

Where this falls down is the table which is
mostly-append-but-occasionally-needs-an-update-or-delete.  I think the
answer there is to look for a way to make updating the index block
range faster, not ways to modify how we append to the heap.  If we told
users "tables with Minmax indexes will be very expensive to update"
then I think they'd live with it; dropping and re-adding an index to
enable fast updates is something which is already familiar.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 17 June 2013 02:05, Josh Berkus <josh@agliodbs.com> wrote:
>
>>> I agree that the FSM behaviour shouldn't be linked to index existence.
>>> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>>> Index only scans would also benefit from that.
>>
>> -1 ... I cannot believe that such a parameter would ever get turned on
>> in production by anyone.  If your table has a significant update rate,
>> the resulting table bloat would make such behavior completely
>> infeasible.  If you have few enough updates to make such a behavior
>> practical, then you can live with the expensive index updates instead.
>
> I'm also thinking that if a table is really append-only, then there are
> never any middle-of-the-table pages in the FSM, no?
>
> Where this falls down is the table which is
> mostly-append-but-occasionally-needs-an-update-or-delete.  I think the
> answer there is to look for a way to make updating the index block range
> faster, not ways to modify how we append to the heap.  If we told users
> "tables with Minmax indexes will be very expensive to update" then I
> think they'd live with it; dropping and re-adding an index to enable fast
> updates is something which is already familiar.

This feature enhances SeqScans using a technique similar to the one we
already use for VACUUM.  We don't really care about whether we have
100% scan avoidance because we were likely to be using a WHERE clause
that doesn't give perfect constraint elimination anyway.  So on a large
table we don't care about the fact that we still have to scan 1-5% of
the table - we are still 20 times faster than a full seqscan.

So there isn't a "fall down" thing here.  We expect the recently
loaded/updated data to be scanned and that's OK.

Having the minmax index updated greedily is just adding extra work for
fast diminishing returns.  We can always add that later if really
needed, but I doubt it will be needed - in just the same way as mat
views aren't greedily updated.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> So there isn't a "fall down" thing here.  We expect the recently
> loaded/updated data to be scanned and that's OK.
>
> Having the minmax index updated greedily is just adding extra work for
> fast diminishing returns.  We can always add that later if really
> needed, but I doubt it will be needed - in just the same way as mat
> views aren't greedily updated.

Ok, in that case, can we add the patch without messing with the FSM
logic?  It'll work out-of-the-box for append-only tables, and that's a
pretty solid use case.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Greg Stark wrote: > On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Re-summarization is relatively expensive, because the complete page range has > > to be scanned. > > That doesn't sound too bad to me. It just means there's a downside to > having larger page ranges. I would expect the page ranges to be > something in the ballpark of 32 pages -- scanning 32 pages to > resummarize doesn't sound that painful but sounds like it's large > enough that the resulting index would be a reasonable size. Actually, I'm thinking that a range is more like, say, 1280 pages, or 10 MB. My goal is to consider tables in the 10 TB magnitude; if I store one index tuple for every 32 pages, I would end up with too large an index. With 10 MBs per range I can index the whole table with an index of 50 MB, which seems reasonable to scan. But anyway my intention is that page range size is configurable. > But I don't understand why an insert would invalid a tuple. An insert > can just update the min and max incrementally. It's a delete that > invalidates the range but as you note it doesn't really invalidate it, > just mark it as needing a refresh -- and even then only if the value > being deleted is equal to either the min or max. No, I don't intend to update the index tuple with each heap insert. I think this will end up being too slow. The validity map is there to hold a single bit for each page saying whether the page range is known valid or not; one insert into any page in the range invalidates the range (and any scan of the table needs to scan that range as a whole). This way I can wait until the storm of inserts has passed from a range, and only then do the summarization for that range. This avoids having to summarize the range over and over. Alternatively, I could consider the index tuple always valid, and update it online as soon as we do an insert or update (i.e. examine the min/max values in the index tuple, and update it to match if the new value is out of bounds). This seems too slow, so I won't try. Also, a delete does not invalidate a range either. As Simon said elsewhere in a thread, if the range is not minimal, this is not a problem; we might have to scan some more ranges than absolutely necessary, but it's not a correctness problem. The heap scan recheck node will get rid of the unwanted tuples anyway. > > Same-size page ranges? > > Current related literature seems to consider that each "index entry" in a > > minmax index must cover the same number of pages. There doesn't seem to be a > > I assume the reason for this in the literature is the need to quickly > find the summary for a given page when you're handling an insert or > delete. Yeah, that's one thing to keep in mind. I haven't gotten too much into this; I only added these two entries at the end for my own future reference, because I will need to consider them at some point. Right now my intention is to have each index have a fixed page range size, which is defined at index creation time. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
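Back-of-the-envelope arithmetic for the sizing claim above, assuming
8 kB pages and roughly 50 bytes per index tuple including header
overhead (the 50-byte figure is an assumption, not from the patch):

    /*
     * Rough check of the "50 MB index for a 10 TB table" claim.
     * Assumes 8 kB heap pages, 1280-page (10 MB) ranges, and ~50 bytes
     * per index tuple including header overhead (a guess).
     */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const uint64_t table_bytes     = 10ULL << 40;   /* 10 TB */
        const uint64_t page_bytes      = 8192;
        const uint64_t pages_per_range = 1280;          /* 10 MB per range */
        const uint64_t tuple_bytes     = 50;

        uint64_t heap_pages = table_bytes / page_bytes;
        uint64_t ranges     = heap_pages / pages_per_range;
        uint64_t index_mb   = ranges * tuple_bytes / (1024 * 1024);

        /* ~1,048,576 ranges -> an index of roughly 50 MB */
        printf("%llu ranges, ~%llu MB index\n",
               (unsigned long long) ranges, (unsigned long long) index_mb);
        return 0;
    }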
Josh Berkus wrote:

> > Value changes in columns that are part of a minmax index, and tuple insertion
> > in summarized pages, would invalidate the stored min/max values.  To support
> > this, each minmax index has a validity map; a range can only be considered in a
> > scan if it hasn't been invalidated by such changes (A range "not considered" in
> > the scan needs to be returned in whole regardless of the stored min/max values,
> > that is, it cannot be pruned per query quals).  The validity map is very
> > similar to the visibility map in terms of performance characteristics: quick
> > enough that it's not contentious, allowing updates and insertions to proceed
> > even when data values violate the minmax index conditions.  An invalidated
> > range can be made valid by re-summarization (see below).
>
> This begins to sound like these indexes are only useful on append-only
> tables.  Not that there aren't plenty of those, but ...

But what?  This is a useful use-case; one that's not served by any
other type of index.  Sure, you can have btrees over append-only
tables, but they are really large and have huge maintenance costs.  A
smaller, lossy index is useful if it's low-maintenance and keeps the
cost of scanning the table low enough.

These indexes can be considered a sort of partitioning of a large
table.  Only instead of having to define CHECK (insert_date >= 'a
month') for each partition, you just create the index on the
insert_date column, and the index will allow a seqscan of the table to
skip pages that don't match the query's WHERE clause on that column.

> > Re-summarization is relatively expensive, because the complete page range has
> > to be scanned.
>
> Why?  Why can't we just update the affected pages in the index?

The page range has to be scanned in order to find out the min/max
values for the indexed columns on the range; and then, with these data,
update the index.

> > To avoid this, a table having a minmax index would be
> > configured so that inserts only go to the page(s) at the end of the table; this
> > avoids frequent invalidation of ranges in the middle of the table.  We provide
> > a table reloption that tweaks the FSM behavior, so that summarized pages are
> > not candidates for insertion.
>
> We haven't had an index type which modifies table insertion behavior
> before, and I'm not keen to start now; imagine having two indexes on the
> same table each with their own, conflicting, requirements.

This is not a requirement.  It merely makes the index more effective.

> If we're going to start adding reloptions for specific table behavior,
> I'd rather think of all of the optimizations we might have for a
> prospective "append-only table" and bundle those, rather than tying it
> to whether a certain index exists or not.

Eh?  Sure, my intention for this reloption is for the user to be able
to state their intention for the table, and each feature that has
append-only table optimization does its thing.  I wasn't thinking of
anything automatic.

> Also, I hate the name ...

Feel free to propose other names; that way I can hate your proposals
too (or maybe not).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-17 16:23:40 -0400, Alvaro Herrera wrote:
> > > Re-summarization is relatively expensive, because the complete page range has
> > > to be scanned.
> >
> > Why?  Why can't we just update the affected pages in the index?
>
> The page range has to be scanned in order to find out the min/max values
> for the indexed columns on the range; and then, with these data, update
> the index.

Why?  Assuming the initial summarization has been performed, you can
check the current min/max value, check whether it's still
smaller/bigger than the value you're inserting, and if not, update the
index accordingly with the new value.  You don't even need to wait for
the new value to become visible since the ranges don't need to be
minimal.

I think the contention this possibly causes may be a better argument,
but I am not sure how much of a problem that really becomes if we find
a deadlock-free way to only lock the minmax pages exclusively if the
range is violated.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
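A minimal sketch of the incremental update Andres describes, assuming a
single int64 key; the locking he mentions is reduced to a comment and
all names are hypothetical:

    /*
     * Sketch of incremental summary maintenance on insert: widen the
     * stored min/max only if the new value falls outside the current
     * bounds.  Assumes a single int64 key; buffer locking and WAL are
     * elided.  Names are hypothetical.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct RangeBounds
    {
        int64_t min;
        int64_t max;
    } RangeBounds;

    /*
     * Returns true if the summary had to be widened; callers would take
     * the exclusive lock on the minmax page only in that case, keeping
     * the common insert path cheap.
     */
    static bool
    maybe_widen_summary(RangeBounds *b, int64_t newval)
    {
        if (newval >= b->min && newval <= b->max)
            return false;       /* fast path: no index update needed */

        if (newval < b->min)
            b->min = newval;
        if (newval > b->max)
            b->max = newval;
        return true;
    }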
Tom Lane wrote:

> We've talked a lot about index-organized tables in the past.  How much
> of the use case for this would be subsumed by a feature like that?

IOTs are not flexible enough.  You can only have one index that you
index-organize the table on; and you can search only based on a prefix
of the index key.  If you want to change the key, ... um.  I don't even
know what you'd do.

With minmax indexes, on the other hand, you can create one or several,
and they let you scan the table based on any of the indexed columns.
So you can create a minmax index on creation_date, insertion_date,
ship_date; and have it serve queries that use any of these columns.
(You probably don't add key column(s) to the minmax index because you
already have btrees on them.)

On the other hand, IOTs are expensive to insert into.  For each tuple
to insert you need to start from the root and descend the tree, insert
your tuple, then propagate splits upwards.  If you have a 10 TB table,
you cannot afford to have to do all that for each and every tuple.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>> This begins to sound like these indexes are only useful on append-only
>> tables.  Not that there aren't plenty of those, but ...
>
> But what?

... "but" the other comments further down in my email.  Also, my
successive comments in other emails.

>> Why?  Why can't we just update the affected pages in the index?
>
> The page range has to be scanned in order to find out the min/max values
> for the indexed columns on the range; and then, with these data, update
> the index.

Seems like you could incrementally update the range, at least for
inserts.  If you insert a row which doesn't decrease the min or
increase the max, you can ignore it, and if it does increase/decrease,
you can change the min/max.  No?

For updates, things are more complicated.  If the row you're updating
was the min/max, in theory you should update it to adjust that, but you
can't verify that it was the ONLY min/max row without doing a full
scan.  My suggestion would be to add a "dirty" flag which would
indicate that that block could use a rescan next VACUUM, and otherwise
ignore changing the min/max.  After all, the only defect of having min
too low or max too high for a block would be scanning too many blocks.
Which you'd do anyway with it marked "invalid".

> This is not a requirement.  It merely makes the index more effective.

Right.  So I'm saying let's do this index without the FSM
modifications, and then consider those as their own, separate patch, if
we even do them.

> Eh?  Sure, my intention for this reloption is for the user to be able to
> state their intention for the table, and each feature that has
> append-only table optimization does its thing.  I wasn't thinking of
> anything automatic.

99.7% of our users have no idea what to do with reloptions.  We'd have
to expose it with an ALTER TABLE SET append_only=true.

>> Also, I hate the name ...
>
> Feel free to propose other names; that way I can hate your proposals
> too (or maybe not).

Well, my first thought was "block-range indexing", which I think is the
best description, but that isn't exactly an exciting name for a feature
which will likely be worthy of short-listing for 9.4.  I'd prefer it
over minmax, which users will think only works on aggregates, but it's
still not a great name.  "Summary Index" also comes to mind, but really
isn't a lot more exciting.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Hi!

This sounds really interesting, so a few quick comments.

On 15.6.2013 00:28, Alvaro Herrera wrote:
> In each index tuple (corresponding to one page range), we store:
> - first block this tuple applies to
> - last block this tuple applies to
> - for each indexed column:
>   * min() value across all tuples in the range
>   * max() value across all tuples in the range
>   * nulls present in any tuple?

What about adding a counter of how many times the min/max value is
present in the range?  The other messages in this thread suggest that
the refresh after UPDATE/DELETE is one of the concerns - as Greg Stark
mentioned, the range invalidation may only happen when running DELETE
on a row matching the min/max value.  I believe having a counter would
be an improvement - a refresh would be needed only if it reaches 0.

> Summarization
> -------------
>
> At index creation time, the whole table is scanned; for each page
> range the min() and max() values and nulls bitmap are collected, and
> stored in the index.  The possibly-incomplete range at the end of the
> table is also included.

Would it make sense to do this using an existing (b-tree) index for
very large tables?  Clearly it doesn't make sense to create a b-tree
index and then minmax on the same column, but for very large databases
upgraded using pg_upgrade this might be interesting.

> Once in a while, it is necessary to summarize a bunch of unsummarized
> pages (because the table has grown since the index was created), or
> re-summarize a range that has been marked invalid.  This is simple:
> scan the page range calculating the min() and max() for each indexed
> column, then insert the new index entry at the end of the index.  The
> main interesting questions are:
>
> a) when to do it  The perfect time to do it is as soon as a complete
> page range of the configured range size has been filled (assuming
> page ranges are constant size).
>
> b) who does it (what process)  It doesn't seem a good idea to have a
> client-connected process do it; it would incur unwanted latency.
> Three other options are (i) to spawn a specialized process to do it,
> which perhaps can be signalled by a client-connected process that
> executes a scan and notices the need to run summarization; or (ii) to
> let autovacuum do it, as a separate new maintenance task.  This seems
> simple enough to bolt on top of already existing autovacuum
> infrastructure.  The timing constraints of autovacuum might be
> undesirable, though.  (iii) wait for user command.
>
> The easiest way to go around this seems to be to have vacuum do it.
> That way we can simply do re-summarization on the amvacuumcleanup
> routine.  Other answers would mean we need a separate AM routine,
> which appears unwarranted at this stage.

I don't think this should be added to the autovacuum daemon.  It's
quite tricky to tune autovacuum to be aggressive just enough, i.e. not
to run too frequently / not to lag.  I'm afraid this would add a task
consuming unpredictable amounts of time, which would make the
autovacuum tuning even trickier.  I can live with non-summarized minmax
indexes, but I can't live with non-vacuumed tables.

> Open questions
> --------------
>
> * Same-size page ranges?  Current related literature seems to
> consider that each "index entry" in a minmax index must cover the
> same number of pages.  There doesn't seem to be a hard reason for
> this to be so; it might make sense to allow the index to self-tune so
> that some index entries cover smaller page ranges, if this allows the
> min()/max() values to be more compact.
> This would incur larger space overhead for the index itself, but
> might allow better pruning of page ranges during scan.  In the limit
> of one index tuple per page, the index itself would occupy too much
> space, even though we would be able to skip reading most of the heap
> pages, because the min()/max() ranges are tight; in the opposite
> limit of a single tuple that summarizes the whole table, we wouldn't
> be able to prune anything from the seqscan even though the index is
> very small.

I see no particular reason not to allow that, and having variable range
size would be a very nice feature IMHO.  Do you have an idea on how the
self-tuning might work?

I was thinking about something like this:

(1) estimate the initial range size, trying to keep good "selectivity"
    while not having very small ranges, using

    (a) average row length
    (b) number of distinct values in the column
    (c) MCV/histogram/correlation

(2) once the index is built, running a "merge" on the ranges - merging
    the adjacent pages if the resulting selectivity is not
    significantly worse (again, this might be evaluated using MCV,
    histogram)

But it should be possible to override this (initial range size, max
range size) or disable heuristics completely, as having large ranges
probably makes the resummarizing more expensive.  Although having the
counter might fix that, as it's more likely there's another row with
the min/max value.

BTW could this be used to pre-sort the table, so that the actual sort
algorithm would start with a reasonably sorted set of tuples?  I.e.
sorting the index by (min,max) vector and then returning the table
pages in this order?  I think it might be even possible to efficiently
split the table into multiple parts T(0), T(1), T(2), ..., T(n) where
all rows in T(i) are smaller than T(i+1).  This in turn would make it
possible to "partition" the table for aggregates - either running them
in parallel or in sequence (i.e. not running into OOM with
HashAggregate).

Yes, I know supporting this is far in the future, I'm just
"brainstorming" here ;-)

regards
Tomas
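Tomas's counter idea from the top of this mail could look roughly like
the following.  This is a hypothetical sketch for a single int64 key,
not anything from the patch:

    /*
     * Hypothetical sketch of a counter-augmented summary: track how
     * many rows in the range carry the current min/max, so a DELETE of
     * a boundary value forces re-summarization only when its counter
     * drops to zero.  Assumes a single int64 key column.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct CountedBounds
    {
        int64_t  min;
        uint32_t min_count;     /* rows equal to min */
        int64_t  max;
        uint32_t max_count;     /* rows equal to max */
    } CountedBounds;

    /* Returns true if the range must be marked for re-summarization. */
    static bool
    note_delete(CountedBounds *b, int64_t deleted)
    {
        /* a single-value range may hit both branches; first wins */
        if (deleted == b->min && --b->min_count == 0)
            return true;
        if (deleted == b->max && --b->max_count == 0)
            return true;
        return false;
    }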
On Sat, Jun 15, 2013 at 11:39:23AM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
> >> If we're going to start adding reloptions for specific table behavior,
> >> I'd rather think of all of the optimizations we might have for a
> >> prospective "append-only table" and bundle those, rather than tying it
> >> to whether a certain index exists or not.
>
> > I agree that the FSM behaviour shouldn't be linked to index existence.
> > IMHO that should be a separate table parameter, WITH (fsm_mode = append)
> > Index only scans would also benefit from that.
>
> -1 ... I cannot believe that such a parameter would ever get turned on
> in production by anyone.  If your table has a significant update rate,
> the resulting table bloat would make such behavior completely
> infeasible.  If you have few enough updates to make such a behavior
> practical, then you can live with the expensive index updates instead.

Can you have pages that are receiving updates _not_ track min/max,
until the page is nearly full?  This would require full scans of such
pages, but there might be few of them.  The amount of free space on the
page as reported by FSM might be useful here.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
On 25 June 2013 00:51, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Jun 15, 2013 at 11:39:23AM -0400, Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>> > On 15 June 2013 00:01, Josh Berkus <josh@agliodbs.com> wrote:
>> >> If we're going to start adding reloptions for specific table behavior,
>> >> I'd rather think of all of the optimizations we might have for a
>> >> prospective "append-only table" and bundle those, rather than tying it
>> >> to whether a certain index exists or not.
>>
>> > I agree that the FSM behaviour shouldn't be linked to index existence.
>> > IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>> > Index only scans would also benefit from that.
>>
>> -1 ... I cannot believe that such a parameter would ever get turned on
>> in production by anyone.  If your table has a significant update rate,
>> the resulting table bloat would make such behavior completely
>> infeasible.  If you have few enough updates to make such a behavior
>> practical, then you can live with the expensive index updates instead.
>
> Can you have pages that are receiving updates _not_ track min/max,
> until the page is nearly full?  This would require full scans of such
> pages, but there might be few of them.  The amount of free space on
> the page as reported by FSM might be useful here.

Yes, that is the proposal. Just like index only scans.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I just noticed that this patch was closed with "returned with feedback"
in the commitfest app.  This is good, IMV -- it's saying that the
opinion of the various people commenting on the thread is positive, and
therefore no more discussion is currently needed.  I will post an
actual patch to CF2, at which point more detailed discussion can be
had.

Thanks to everyone for the feedback.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6/17/13 3:38 PM, Josh Berkus wrote:
>>> Why?  Why can't we just update the affected pages in the index?
>>
>> The page range has to be scanned in order to find out the min/max values
>> for the indexed columns on the range; and then, with these data, update
>> the index.
>
> Seems like you could incrementally update the range, at least for
> inserts.  If you insert a row which doesn't decrease the min or increase
> the max, you can ignore it, and if it does increase/decrease, you can
> change the min/max.  No?
>
> For updates, things are more complicated.  If the row you're updating
> was the min/max, in theory you should update it to adjust that, but you
> can't verify that it was the ONLY min/max row without doing a full scan.
> My suggestion would be to add a "dirty" flag which would indicate that
> that block could use a rescan next VACUUM, and otherwise ignore changing
> the min/max.  After all, the only defect of having min too low or max too
> high for a block would be scanning too many blocks.  Which you'd do
> anyway with it marked "invalid".

If we add a dirty flag it would probably be wise to allow for more than
one value so we can do a clock-sweep.  That would allow for detecting a
range that is getting dirtied repeatedly and not bother to try and
re-summarize it until later.

Something else I don't think was mentioned... re-summarization should
be somehow tied to access activity: if a query will need to seqscan a
segment that needs to be summarized, we should take that opportunity to
summarize at the same time while pages are in cache.  Maybe that can be
done in the backend itself; maybe we'd want a separate process.

--
Jim C. Nasby, Data Architect       jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
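Jim's multi-valued dirty flag might look roughly like this: a
hypothetical two-bit counter with clock-sweep decay, not tied to any
existing PostgreSQL code:

    /*
     * Hypothetical sketch of a clock-swept dirty counter per page
     * range: each dirtying bumps the counter, each sweep decays it, so
     * repeatedly-dirtied ("hot") ranges are skipped until they cool.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define DIRTY_MAX 3             /* two bits per range */

    static inline void
    range_dirtied(uint8_t *dirty)
    {
        if (*dirty < DIRTY_MAX)
            (*dirty)++;
    }

    /* Clock sweep: decay the counter; resummarize only once it cools. */
    static inline bool
    should_resummarize(uint8_t *dirty)
    {
        if (*dirty == 0)
            return false;           /* clean: nothing to do */
        (*dirty)--;
        return *dirty == 0;         /* hot ranges get skipped this sweep */
    }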
On Fri, Jun 28, 2013 at 2:18 PM, Jim Nasby <jim@nasby.net> wrote:
> On 6/17/13 3:38 PM, Josh Berkus wrote:
>>>> Why?  Why can't we just update the affected pages in the index?
>>>
>>> The page range has to be scanned in order to find out the min/max values
>>> for the indexed columns on the range; and then, with these data, update
>>> the index.
>>
>> Seems like you could incrementally update the range, at least for
>> inserts.  If you insert a row which doesn't decrease the min or increase
>> the max, you can ignore it, and if it does increase/decrease, you can
>> change the min/max.  No?
>>
>> For updates, things are more complicated.  If the row you're updating
>> was the min/max, in theory you should update it to adjust that, but you
>> can't verify that it was the ONLY min/max row without doing a full scan.
>> My suggestion would be to add a "dirty" flag which would indicate that
>> that block could use a rescan next VACUUM, and otherwise ignore changing
>> the min/max.  After all, the only defect of having min too low or max too
>> high for a block would be scanning too many blocks.  Which you'd do
>> anyway with it marked "invalid".
>
> If we add a dirty flag it would probably be wise to allow for more than one
> value so we can do a clock-sweep.  That would allow for detecting a range
> that is getting dirtied repeatedly and not bother to try and re-summarize it
> until later.
>
> Something else I don't think was mentioned... re-summarization should be
> somehow tied to access activity: if a query will need to seqscan a segment
> that needs to be summarized, we should take that opportunity to summarize at
> the same time while pages are in cache.  Maybe that can be done in the
> backend itself; maybe we'd want a separate process.

This smells a lot like hint bits and all the trouble they bring.
On 6/28/13 12:26 PM, Claudio Freire wrote:
> On Fri, Jun 28, 2013 at 2:18 PM, Jim Nasby <jim@nasby.net> wrote:
>> On 6/17/13 3:38 PM, Josh Berkus wrote:
>>>>> Why?  Why can't we just update the affected pages in the index?
>>>>
>>>> The page range has to be scanned in order to find out the min/max values
>>>> for the indexed columns on the range; and then, with these data, update
>>>> the index.
>>>
>>> Seems like you could incrementally update the range, at least for
>>> inserts.  If you insert a row which doesn't decrease the min or increase
>>> the max, you can ignore it, and if it does increase/decrease, you can
>>> change the min/max.  No?
>>>
>>> For updates, things are more complicated.  If the row you're updating
>>> was the min/max, in theory you should update it to adjust that, but you
>>> can't verify that it was the ONLY min/max row without doing a full scan.
>>> My suggestion would be to add a "dirty" flag which would indicate that
>>> that block could use a rescan next VACUUM, and otherwise ignore changing
>>> the min/max.  After all, the only defect of having min too low or max too
>>> high for a block would be scanning too many blocks.  Which you'd do
>>> anyway with it marked "invalid".
>>
>> If we add a dirty flag it would probably be wise to allow for more than one
>> value so we can do a clock-sweep.  That would allow for detecting a range
>> that is getting dirtied repeatedly and not bother to try and re-summarize it
>> until later.
>>
>> Something else I don't think was mentioned... re-summarization should be
>> somehow tied to access activity: if a query will need to seqscan a segment
>> that needs to be summarized, we should take that opportunity to summarize at
>> the same time while pages are in cache.  Maybe that can be done in the
>> backend itself; maybe we'd want a separate process.
>
> This smells a lot like hint bits and all the trouble they bring.

Possibly; though if that's the case just having a second process do the
work would probably eliminate most of that problem.

--
Jim C. Nasby, Data Architect       jim@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
>> On Fri, Jun 28, 2013 at 2:18 PM, Jim Nasby <jim@nasby.net> wrote:
>>> If we add a dirty flag it would probably be wise to allow for more
>>> than one value so we can do a clock-sweep. That would allow for
>>> detecting a range that is getting dirtied repeatedly and not bother
>>> to try and re-summarize it until later.

Given that I'm talking about doing the resummarization at VACUUM time,
I don't think that's justified.  That only makes sense if you're doing
the below ...

>>> Something else I don't think was mentioned... re-summarization
>>> should be somehow tied to access activity: if a query will need to
>>> seqscan a segment that needs to be summarized, we should take that
>>> opportunity to summarize at the same time while pages are in cache.
>>> Maybe that can be done in the backend itself; maybe we'd want a
>>> separate process.

Writing at SELECT time is something our users hate, with significant
justification.  This suggestion is possibly a useful optimization, but
I'd like to see the simplest possible version of minmax indexes go into
9.4, so that we can see how they work in practice before we start
adding complicated performance optimizations involving extra write
overhead and background workers.  We're more likely to engineer an
expensive de-optimization that way.

I wouldn't be unhappy to have the very basic minmax indexes -- the one
where we just flag page ranges as invalid if they get updated -- go
into 9.4.  That would still work for large, WORM tables, which are a
significant and pervasive use case.  If Alvaro gets to it, I think the
"update the range, rescan at VACUUM time" approach isn't much more
complex, but if he doesn't I'd still vote to accept the feature.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
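The insert/update asymmetry Josh describes can be sketched for a single
bigint column as follows. The struct and function names are
hypothetical, and the real design would keep the summary in index
tuples rather than a bare struct; this is only to pin down the
reasoning.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct RangeSummary
    {
        int64_t min;
        int64_t max;
        bool    dirty;      /* bounds may be wider than necessary */
    } RangeSummary;

    /*
     * INSERT: widening the stored range incrementally is always safe.
     * Bounds that are too wide merely make queries scan pages they
     * could have skipped; they can never produce wrong results.
     */
    void
    summary_absorb_insert(RangeSummary *s, int64_t value)
    {
        if (value < s->min)
            s->min = value;
        if (value > s->max)
            s->max = value;
    }

    /*
     * UPDATE/DELETE touching a boundary value: without scanning the
     * whole range we cannot know whether another tuple still carries
     * this value, so just mark the range; VACUUM rescans dirty ranges
     * and tightens their bounds.
     */
    void
    summary_note_removal(RangeSummary *s, int64_t old_value)
    {
        if (old_value == s->min || old_value == s->max)
            s->dirty = true;
    }

Readers always trust the stored min/max; the dirty flag exists only to
tell VACUUM which ranges are worth rescanning to tighten.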
Alvaro Herrera wrote:
> This is a preliminary proposal for Minmax indexes.  I'm experimenting
> with the code, but it's too crude to post yet, so here's a document
> explaining what they are and how they work, so that reviewers can poke
> holes to have the design improved.

After going further with the implementation, I have added a new
concept, the reverse range map.

Reverse Range Map
-----------------

The reverse range map is a separate fork that each Minmax index has.
Its purpose is to provide a quick way to find the index tuple
corresponding to a page range; for a given heap page number, there's an
O(1) way to obtain the TID of the corresponding index tuple.

To scan the index, we first obtain the TID of the index tuple for page
0.  If this returns a valid TID, we read that tuple to determine the
min/max bounds for this page range.  If it returns invalid, then the
range is unsummarized, and the scan must return the whole range as
needing scan.  After this index entry has been processed, we obtain the
TID of the index tuple for page 0+pagesPerRange (currently this is a
compile-time constant, but there's no reason it cannot become a
per-index property).  Continue adding pagesPerRange until we reach the
end of the heap.

To read the TID during an index scan, we follow this protocol (a sketch
appears further below):

* read revmap page
* obtain share lock on the buffer
* read the TID
* LockTuple that TID (using the index as relation).  A shared lock is
  sufficient.  We need the LockTuple to prevent VACUUM from recycling
  the index tuple; see below.
* release buffer lock
* read the index tuple
* release the tuple lock

On heap insert/update, it is normally cheaper to update the summary
tuple (grab the old tuple, expand the min/max range according to the
new value being inserted, insert the new index tuple, update the
reverse range map) than to let the range become unsummarized, which
would require scanning the pages in the range.

[However, when many updates on a range are going to be done, it'd be
better to mark it as unsummarized (i.e. set the revmap TID to Invalid)
and do a resummarization later, to prevent the index from bloating.
Also, it'd be sensible to allow postponing of the index update if many
tuples are inserted; something like GIN's "pending list".  We would
need to keep track of the TIDs of the inserted heap tuples, so that we
can figure out the new min/max values without having to scan the whole
range.]

To update the summary tuple for a page range, we use this protocol:

* insert a new index tuple anywhere; note its TID
* read revmap page
* obtain exclusive lock on buffer
* write the TID
* release lock

This ensures no concurrent reader can obtain a partially-written TID.
Note we don't need a tuple lock here.  Concurrent scans don't have to
worry about whether they got the old or the new index tuple: if they
get the old one, its tighter values are still okay from a correctness
standpoint, because due to MVCC they can't possibly see the
just-inserted heap tuples anyway.
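The revmap lookup and the read protocol above might look roughly like
the following against the core buffer-manager and lock-manager APIs.
The revmap page layout (a dense array of ItemPointerData after the page
header), the 128-page range size, and the function names are
assumptions for illustration only, not the actual patch.

    #include "postgres.h"

    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/itemptr.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"

    #define PAGES_PER_RANGE 128     /* the compile-time constant mentioned */
    #define REVMAP_TIDS_PER_PAGE \
        ((BLCKSZ - MAXALIGN(SizeOfPageHeaderData)) / sizeof(ItemPointerData))

    /* O(1): which revmap page and slot hold the TID for this heap block? */
    static void
    revmap_locate(BlockNumber heapBlk, BlockNumber *revmapBlk, int *slot)
    {
        BlockNumber rangeNo = heapBlk / PAGES_PER_RANGE;

        *revmapBlk = rangeNo / REVMAP_TIDS_PER_PAGE;
        *slot = rangeNo % REVMAP_TIDS_PER_PAGE;
    }

    /*
     * Read protocol: share-lock the revmap buffer, copy the TID, and
     * take the tuple lock (which fences off VACUUM) before letting go
     * of the buffer lock.  The caller reads the index tuple and then
     * does UnlockTuple(idxrel, &tid, ShareLock).
     */
    static ItemPointerData
    revmap_get_tid(Relation idxrel, BlockNumber heapBlk)
    {
        BlockNumber     revmapBlk;
        int             slot;
        Buffer          buf;
        ItemPointerData tid;

        revmap_locate(heapBlk, &revmapBlk, &slot);

        buf = ReadBuffer(idxrel, revmapBlk);
        LockBuffer(buf, BUFFER_LOCK_SHARE);
        tid = ((ItemPointerData *) PageGetContents(BufferGetPage(buf)))[slot];
        if (ItemPointerIsValid(&tid))
            LockTuple(idxrel, &tid, ShareLock); /* keep VACUUM away */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buf);

        return tid;     /* invalid TID means the range is unsummarized */
    }

Taking the tuple lock while still holding the buffer lock is what
closes the race: VACUUM cannot recycle the index tuple between our
reading its TID and our reading the tuple itself.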
For vacuuming, we need to figure out which index tuples are no longer
referenced from the reverse range map.  This requires some brute force,
but is simple:

1) Scan the complete index, storing each existing TID in a dynahash.
   Hash key is the TID; hash value is a boolean, initially set to
   false.

2) Scan the complete revmap sequentially, reading the TIDs on each
   page.  A share lock on each page is sufficient.  For each TID so
   obtained, grab the element from the hash and set the boolean to
   true.

3) Scan the index again; for each tuple found, search the hash table.
   If the tuple is not present, it must have been added after our
   initial scan; ignore it.  If the hash flag is true, then the tuple
   is referenced; ignore it.  If the hash flag is false, then the index
   tuple is no longer referenced by the revmap; but it could be about
   to be accessed by a concurrent scan.  Do ConditionalLockTuple.  If
   this fails, ignore the tuple; it will be deleted by a future vacuum.
   If the lock is acquired, then we can safely remove the index tuple.

4) Index pages with free space can be detected by this second index
   scan.  Register those with the FSM.

Note this doesn't require scanning the heap at all, or being involved
in the heap's cleanup procedure.  (In particular, tables with only
minmax indexes do not require the removed tuples' TIDs to be collected
during the heap cleanup pass.)

XXX I think this is wrong in detail; we probably need to be using
LockBufferForCleanup.

With the reverse range map it is very easy to see which page ranges in
the heap need resummarization; they are the ones marked with
InvalidTid.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
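The three scans above map naturally onto a dynahash table keyed by TID.
In the following sketch the iteration helpers (mm_next_index_tid,
mm_next_revmap_tid, mm_delete_index_tuple) are hypothetical stand-ins
for the AM's real scan machinery; the hash calls are the standard
utils/hsearch.h API.

    #include "postgres.h"

    #include "storage/itemptr.h"
    #include "storage/lmgr.h"
    #include "utils/hsearch.h"
    #include "utils/rel.h"

    /* Hypothetical helpers standing in for the AM's real scan code. */
    extern bool mm_next_index_tid(Relation idxrel, ItemPointer tid);
    extern bool mm_next_revmap_tid(Relation idxrel, ItemPointer tid);
    extern void mm_delete_index_tuple(Relation idxrel, ItemPointer tid);

    typedef struct RevmapRefEntry
    {
        ItemPointerData tid;        /* hash key: an index tuple's TID */
        bool            referenced; /* seen in the revmap? */
    } RevmapRefEntry;

    void
    minmax_vacuum(Relation idxrel)
    {
        HASHCTL         ctl;
        HTAB           *tids;
        ItemPointerData tid;
        RevmapRefEntry *entry;
        bool            found;

        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(ItemPointerData);
        ctl.entrysize = sizeof(RevmapRefEntry);
        ctl.hash = tag_hash;
        tids = hash_create("minmax vacuum TIDs", 1024, &ctl,
                           HASH_ELEM | HASH_FUNCTION);

        /* 1) remember every index tuple that currently exists */
        while (mm_next_index_tid(idxrel, &tid))
        {
            entry = (RevmapRefEntry *)
                hash_search(tids, &tid, HASH_ENTER, NULL);
            entry->referenced = false;
        }

        /* 2) mark the tuples the revmap still points to */
        while (mm_next_revmap_tid(idxrel, &tid))
        {
            entry = (RevmapRefEntry *)
                hash_search(tids, &tid, HASH_FIND, &found);
            if (found)
                entry->referenced = true;
        }

        /*
         * 3) delete unreferenced tuples, unless a concurrent scan may
         * be about to read one.  ExclusiveLock conflicts with the
         * readers' ShareLock; the message doesn't spell out the lock
         * mode, so this is a guess.
         */
        while (mm_next_index_tid(idxrel, &tid))
        {
            entry = (RevmapRefEntry *)
                hash_search(tids, &tid, HASH_FIND, &found);
            if (!found || entry->referenced)
                continue;       /* added since pass 1, or still in use */
            if (ConditionalLockTuple(idxrel, &tid, ExclusiveLock))
            {
                mm_delete_index_tuple(idxrel, &tid);
                UnlockTuple(idxrel, &tid, ExclusiveLock);
            }
            /* on lock failure: leave the tuple for a future vacuum */
        }

        hash_destroy(tids);
    }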
On 07/18/2013 10:39 PM, Alvaro Herrera wrote:
> To scan the index, we first obtain the TID of the index tuple for
> page 0.  If this returns a valid TID, we read that tuple to determine
> the min/max bounds for this page range.  If it returns invalid, then
> the range is unsummarized, and the scan must return the whole range
> as needing scan.  After this index entry has been processed, we
> obtain the TID of the index tuple for page 0+pagesPerRange (currently
> this is a compile-time constant, but there's no reason it cannot
> become a per-index property).  Continue adding pagesPerRange until we
> reach the end of the heap.

Conceptually, this sounds like a good initial solution to the update
problem.  I still think we could do incremental updates to the minmax
indexes per the idea I discussed, but that could be a later version.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Jun 14, 2013 at 06:28:06PM -0400, Alvaro Herrera wrote:
> Partial indexes are not supported; since an index is concerned with
> minimum and maximum values of the involved columns across all the
> pages in the table, it doesn't make sense to exclude values.

It can make sense if the predicate references a column other than the
indexed column(s).  Unlike a partial btree index, a partial minmax
index would be no smaller.  It could have narrower min/max spreads,
reducing the recheck work done by queries.

> Another way to see "partial" indexes here would be those that only
> considered some pages in the table instead of all of them; but this
> would be difficult to implement and manage and, most likely,
> pointless.

That's a distinct feature from the AM-independent partial index
mechanism, in any event.

> Expressional indexes can probably be supported in the future, but we
> disallow them initially for conceptual simplicity.

Whether an index column uses an expression is irrelevant to each
existing core AM.  How does minmax differ in this respect?

Thanks,
nm

--
Noah Misch
EnterpriseDB                       http://www.enterprisedb.com