Re: [RFC] Minmax indexes - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: [RFC] Minmax indexes
Date:
Msg-id: 51BF82AB.1020208@fuzzy.cz
In response to: [RFC] Minmax indexes (Alvaro Herrera <alvherre@2ndquadrant.com>)
List: pgsql-hackers
Hi!

This sounds really interesting, so a few quick comments.

On 15.6.2013 00:28, Alvaro Herrera wrote:
> In each index tuple (corresponding to one page range), we store:
> - first block this tuple applies to
> - last block this tuple applies to
> - for each indexed column:
>   * min() value across all tuples in the range
>   * max() value across all tuples in the range
>   * nulls present in any tuple?

What about adding a counter of how many times the min/max value is
present in the range? The other messages in this thread suggest that the
refresh after UPDATE/DELETE is one of the concerns - as Greg Stark
mentioned, the range only needs to be invalidated when a DELETE removes
a row matching the min/max value. I believe having a counter would be an
improvement - a refresh would be needed only when it reaches 0.
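
To be a bit more concrete about the counter, something like this is what
I have in mind (just a PostgreSQL-flavoured sketch - all the struct and
function names are made up, and it ignores NULLs, per-column fmgr setup
and so on):

    #include "postgres.h"
    #include "fmgr.h"

    /* summary of one indexed column within one page range (made-up layout) */
    typedef struct MinmaxColumnSummary
    {
        Datum   min_value;   /* smallest value in the page range */
        Datum   max_value;   /* largest value in the page range */
        bool    has_nulls;   /* any NULLs in the range? */
        uint32  min_count;   /* number of heap tuples equal to min_value */
        uint32  max_count;   /* number of heap tuples equal to max_value */
    } MinmaxColumnSummary;

    /*
     * Called when a heap tuple in the range is deleted.  Returns true if
     * the summary is still valid, false if the range needs to be
     * re-summarized.
     */
    static bool
    summary_on_delete(MinmaxColumnSummary *s, Datum deleted, FmgrInfo *eqproc)
    {
        if (DatumGetBool(FunctionCall2(eqproc, deleted, s->min_value)) &&
            --s->min_count == 0)
            return false;        /* lost the last copy of the minimum */

        if (DatumGetBool(FunctionCall2(eqproc, deleted, s->max_value)) &&
            --s->max_count == 0)
            return false;        /* lost the last copy of the maximum */

        return true;             /* min/max still present, no refresh needed */
    }

Maintaining the counters on insert seems cheap - compare the new value
with min/max and bump (or reset) the matching counter - so the overhead
should be just a few extra bytes per indexed column per range.
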
> Summarization
> -------------
>
> At index creation time, the whole table is scanned; for each page range
> the min() and max() values and nulls bitmap are collected, and stored
> in the index. The possibly-incomplete range at the end of the table is
> also included.

Would it make sense to do this using an existing (b-tree) index for very
large tables? Clearly it doesn't make sense to create a b-tree index and
then a minmax one on the same column, but for very large databases
upgraded using pg_upgrade this might be interesting.

> Once in a while, it is necessary to summarize a bunch of unsummarized
> pages (because the table has grown since the index was created), or
> re-summarize a range that has been marked invalid. This is simple: scan
> the page range calculating the min() and max() for each indexed column,
> then insert the new index entry at the end of the index. The main
> interesting questions are:
>
> a) when to do it
>    The perfect time to do it is as soon as a complete page range of the
>    configured range size has been filled (assuming page ranges are
>    constant size).
>
> b) who does it (what process)
>    It doesn't seem a good idea to have a client-connected process do
>    it; it would incur unwanted latency. Three other options are (i) to
>    spawn a specialized process to do it, which perhaps can be signalled
>    by a client-connected process that executes a scan and notices the
>    need to run summarization; or (ii) to let autovacuum do it, as a
>    separate new maintenance task. This seems simple enough to bolt on
>    top of already existing autovacuum infrastructure. The timing
>    constraints of autovacuum might be undesirable, though. (iii) wait
>    for user command.
>
> The easiest way to go around this seems to have vacuum do it. That way
> we can simply do re-summarization on the amvacuumcleanup routine. Other
> answers would mean we need a separate AM routine, which appears
> unwarranted at this stage.

I don't think this should be added to the autovacuum daemon. It's quite
tricky to tune autovacuum to be aggressive just enough, i.e. not to run
too frequently / not to lag. I'm afraid this would add a task consuming
unpredictable amounts of time, which would make autovacuum tuning even
trickier. I can live with non-summarized minmax indexes, but I can't
live with non-vacuumed tables.

> Open questions
> --------------
>
> * Same-size page ranges?
>   Current related literature seems to consider that each "index entry"
>   in a minmax index must cover the same number of pages. There doesn't
>   seem to be a hard reason for this to be so; it might make sense to
>   allow the index to self-tune so that some index entries cover smaller
>   page ranges, if this allows the min()/max() values to be more
>   compact. This would incur larger space overhead for the index itself,
>   but might allow better pruning of page ranges during scan. In the
>   limit of one index tuple per page, the index itself would occupy too
>   much space, even though we would be able to skip reading most heap
>   pages, because the min()/max() ranges are tight; in the opposite
>   limit of a single tuple that summarizes the whole table, we wouldn't
>   be able to prune anything from the seqscan even though the index is
>   very small.

I see no particular reason not to allow that, and having variable range
size would be a very nice feature IMHO. Do you have an idea of how the
self-tuning might work?

I was thinking about something like this:

(1) estimate the initial range size, trying to keep good "selectivity"
    while not having very small ranges, using

    (a) average row length
    (b) number of distinct values in the column
    (c) MCV/histogram/correlation

(2) once the index is built, run a "merge" pass on the ranges - merging
    adjacent ranges if the resulting selectivity is not significantly
    worse (again, this might be evaluated using the MCV / histogram)

But it should be possible to override this (initial range size, max
range size) or disable the heuristics completely, as having large ranges
probably makes the resummarizing more expensive. Although having the
counter might fix that, as it's more likely there's another row with the
min/max value.
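
To make the merge step (2) a bit more concrete, one pass might look
roughly like this (again just a sketch with made-up structs - a single
float8 column, and the plain [min,max] width used as a crude stand-in
for a proper MCV/histogram-based selectivity estimate):

    #include "postgres.h"
    #include "storage/block.h"

    typedef struct RangeSummary
    {
        BlockNumber first_block;    /* first heap block of the range */
        BlockNumber last_block;     /* last heap block of the range */
        double      min_value;      /* min of the indexed column */
        double      max_value;      /* max of the indexed column */
    } RangeSummary;

    /*
     * Merge adjacent ranges in place when the merged [min,max] interval is
     * not "significantly worse" - i.e. at most (1 + merge_slack) times
     * wider than the two intervals combined - and the merged range does
     * not exceed max_blocks.  Returns the new number of ranges.
     */
    static int
    merge_ranges(RangeSummary *r, int nranges,
                 double merge_slack, int max_blocks)
    {
        int out = 0;

        if (nranges == 0)
            return 0;

        for (int i = 1; i < nranges; i++)
        {
            RangeSummary *prev = &r[out];
            RangeSummary *cur = &r[i];
            double  merged_width = Max(prev->max_value, cur->max_value) -
                                   Min(prev->min_value, cur->min_value);
            double  separate_width = (prev->max_value - prev->min_value) +
                                     (cur->max_value - cur->min_value);
            int     merged_blocks = cur->last_block - prev->first_block + 1;

            if (merged_blocks <= max_blocks &&
                merged_width <= separate_width * (1.0 + merge_slack))
            {
                /* merge cur into prev */
                prev->last_block = cur->last_block;
                prev->min_value = Min(prev->min_value, cur->min_value);
                prev->max_value = Max(prev->max_value, cur->max_value);
            }
            else
                r[++out] = *cur;    /* keep cur as a separate range */
        }

        return out + 1;
    }

The pass could simply be repeated until nothing gets merged anymore, and
merge_slack / max_blocks are exactly the knobs I'd like to be able to
override (setting merge_slack to 0 would effectively disable the
merging).
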
BTW could this be used to pre-sort the table, so that the actual sort
algorithm would start with a reasonably sorted set of tuples? I.e.
sorting the index by the (min,max) vector and then returning the table
pages in that order?

I think it might even be possible to efficiently split the table into
multiple parts T(0), T(1), T(2), ..., T(n) where all rows in T(i) are
smaller than those in T(i+1). This in turn would make it possible to
"partition" the table for aggregates - either running them in parallel
or in sequence (i.e. not running into OOM with HashAggregate).

Yes, I know supporting this is far in the future, I'm just
"brainstorming" here ;-)

regards
Tomas