Thread: RFC: planner statistics in 7.2
Request for comments:

Overview
--------

The planner is currently badly handicapped by inadequate statistical
information.  The stats that VACUUM ANALYZE currently computes are:

per-table:
        number of disk pages
        number of tuples

per-column:
        dispersion
        minimum and maximum values (if datatype has a suitable '<' operator)
        most common value
        fraction of entries that are the most common value
        fraction of entries that are NULL

Not only is this an impoverished dataset, but the dispersion and
most-common-value are calculated in a very unreliable way, and cannot be
trusted.

Even though the stats are meager and untrustworthy, they are expensive to
get: the per-column data is updated only by VACUUM ANALYZE, which is quite
unpleasant to do regularly on large tables.

It is possible to get better statistics with less work.  Here is a
proposal for revising the stats mechanisms for 7.2.

I believe that it's still a good idea for the stats to be updated by an
explicit housekeeping command.  Although we could think of updating them
on-the-fly, that seems expensive and slow.  Another objection is that
planning results will become difficult to reproduce if the underlying
statistics change constantly.  But if we stick to the explicit-command
approach then we need to make the command much faster than VACUUM ANALYZE.
We can address that in two ways:

(1) The statistics-gathering process should be available as a standalone
command, ANALYZE [ tablename ], not only as part of VACUUM.  (This was
already discussed and agreed to for 7.1, but it never got done.)  Note
that a pure ANALYZE command needs only a read lock on the target table,
not an exclusive lock as VACUUM needs, so it's much more friendly to
concurrent transactions.

(2) Statistics should be computed on the basis of a random sample of the
target table, rather than a complete scan.  According to the literature
I've looked at, sampling a few thousand tuples is sufficient to give good
statistics even for extremely large tables; so it should be possible to
run ANALYZE in a short amount of time regardless of the table size.

If we do both of these then I think that ANALYZE will be painless enough
to do frequently, so there's no need to try to fold stats-updating into
regular operations.

More extensive statistics
-------------------------

The min, max, and most common value are not enough data, particularly not
for tables with heavily skewed data distributions.  Instead, pg_statistic
should store three small arrays for each column:

1. The K most common values and their frequencies, for some (adjustable)
parameter K.  This will be omitted if the column appears unique (no
duplicate values found).

2. The M boundary values of an equi-depth histogram, ie, the values that
divide the data distribution into equal-population sets.  For example, if
M is 3 this would consist of the min, median, and max values.

K and M might both be about 10 for a typical column.  In principle at
least, these numbers could be user-settable parameters, to allow trading
off estimation accuracy against amount of space used in pg_statistics.

A further refinement is that the histogram should describe the
distribution of the data *after excluding the given most-common values*
(that is, it's a "compressed histogram").  This allows more accurate
representation of the data distribution when there are just a few
very-common values.  For a column with not many distinct values (consider
a boolean column), the most-common-value list might describe the complete
dataset, in which case the histogram collapses to nothing.  (We'd still
have it store the min and max, just so that it's not necessary to scan
the most-common-value array to determine those values.)
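To make the intended use of these arrays concrete, here is a minimal
sketch (in C, with invented names, treating the column as a plain double)
of how the selectivity of "col <= X" could be estimated from the
most-common-value list plus the compressed histogram.  Linear
interpolation within a histogram bin is just one plausible choice;
nothing here is meant as the definitive planner implementation.

    /*
     * Illustrative sketch only: estimate the fraction of rows satisfying
     * "col <= x" from the proposed per-column statistics.
     *
     *   mcv_vals[i], mcv_freqs[i] -- the K most common values and the
     *                                fraction of all rows each accounts for
     *   hist[0..m-1]              -- M boundary values of the compressed
     *                                histogram (built after the MCVs were
     *                                removed from the sample)
     *   null_frac                 -- fraction of entries that are NULL
     */
    double
    estimate_le_selectivity(double x,
                            const double *mcv_vals, const double *mcv_freqs,
                            int k,
                            const double *hist, int m,
                            double null_frac)
    {
        double  sel = 0.0;
        double  mcv_total = 0.0;
        int     i;

        /* the most-common values are handled exactly */
        for (i = 0; i < k; i++)
        {
            mcv_total += mcv_freqs[i];
            if (mcv_vals[i] <= x)
                sel += mcv_freqs[i];
        }

        /* the remaining non-null rows are assumed to follow the histogram */
        if (m >= 2)
        {
            double  rest = 1.0 - null_frac - mcv_total;
            double  frac_below;

            if (x < hist[0])
                frac_below = 0.0;
            else if (x >= hist[m - 1])
                frac_below = 1.0;
            else
            {
                int     bin = 0;

                /* find the bin holding x; each bin holds 1/(m-1) of "rest" */
                while (bin < m - 2 && hist[bin + 1] <= x)
                    bin++;
                /* interpolate linearly within that bin */
                frac_below = (bin + (hist[bin + 1] > hist[bin] ?
                                     (x - hist[bin]) / (hist[bin + 1] - hist[bin]) :
                                     0.5)) / (double) (m - 1);
            }
            sel += rest * frac_below;
        }

        return sel;
    }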
I am also inclined to remove the attdispersion parameter in favor of
storing an estimate of the number of distinct values in the column.  We
can compute more accurate selectivities using the most-common-value
frequencies and the distinct-values estimate than we could get from
dispersion.

Another useful idea is to try to estimate the correlation between physical
table order and logical order of any given column --- this would let us
account for the effect of clustering when estimating indexscan costs.  I
don't yet know the appropriate formula to use, but this seems quite doable.

Finally, it would be a good idea to add an estimate of average width of
varlena fields to pg_statistic.  This would allow the estimated tuple
width computed by the planner to have something to do with reality, which
in turn would improve the cost estimation for hash joins (which need to
estimate the size of the in-memory hash table).

Computing the statistics
------------------------

The best way to obtain these stats seems to be:

1. Scan the target table and obtain a random sample of R tuples.  R is
chosen in advance based on target histogram size (M) --- in practice R
will be about 3000 or so.  If the table contains fewer than that many
tuples then the "sample" will be the whole table.  The sample can be
obtained in one pass using Vitter's algorithm or similar.  Note that for
expected values of R we shouldn't have any trouble storing all the sampled
tuples in memory.

2. For each column in the table, extract the column value from each
sampled tuple, and sort these values into order.  A simple scan of the
values then suffices to find the most common values --- we only need to
count adjacent duplicates and remember the K highest counts.  After we
have those, simple arithmetic will let us find the positions that contain
the histogram boundary elements.  Again, all this can be done in-memory
and so should be pretty fast.

The sort step will require detoasting any toasted values.  To avoid risk
of memory overflow, we may exclude extremely wide toasted values from the
sort --- this shouldn't affect the stats much, since such values are
unlikely to represent duplicates.  Such an exclusion will also prevent
wide values from taking up lots of space in pg_statistic, if they happened
to be chosen as histogram entries.

Issue: for datatypes that have no '<' operator, we of course cannot
compute a histogram --- but if there is an '=' then the most-common-value
and number-of-distinct-values stats still make sense.  Is there a way to
derive these without O(R^2) brute-force comparisons?  We could still do a
scan of the R sample values using something like the existing comparison
algorithm (but using S > K work slots); this would cost about S*R rather
than R^2 comparisons.
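As an illustration of the one-pass sampling in step 1 above, here is a
sketch of the classic reservoir method (often called Algorithm R);
Vitter's algorithm is a refinement that computes how many tuples to skip
rather than drawing a random number for every tuple.  The names are
invented, it samples row numbers rather than actual heap tuples, and the
slight modulo bias of random() is ignored.

    #include <stdlib.h>

    /*
     * Illustrative sketch only: fill reservoir[0..r-1] with a uniform
     * random sample of the row numbers 0..ntuples-1 in a single pass.
     */
    void
    sample_rows(long *reservoir, int r, long ntuples)
    {
        long    t;

        for (t = 0; t < ntuples; t++)
        {
            if (t < r)
            {
                /* the first r rows always enter the reservoir */
                reservoir[t] = t;
            }
            else
            {
                /* row t then replaces a random slot with probability r/(t+1) */
                long    j = random() % (t + 1);

                if (j < r)
                    reservoir[j] = t;
            }
        }
    }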
A different approach that's been discussed on pghackers is to make use
of btree indexes for columns that have such indexes: we could scan the
indexes to visit all the column values in sorted order.  I have rejected
that approach because (a) it doesn't help for columns without a suitable
index; (b) our indexes don't distinguish deleted and live tuples, which
would skew the statistics --- in particular, we couldn't tell a
frequently-updated single tuple from a commonly repeated value; (c)
scanning multiple indexes would likely require more total I/O than just
grabbing sample tuples from the main table --- especially if we have to
do that anyway to handle columns without indexes.

User-visible changes
--------------------

Aside from implementing the stand-alone ANALYZE statement, it seems that
it would be a good idea to allow user control of the target numbers of
statistical entries (K and M above).  A simple approach would be a SET
variable or explicit parameter for ANALYZE.  But I am inclined to think
that it'd be better to create a persistent per-column state for this,
set by say

	ALTER TABLE tab SET COLUMN col STATS COUNT n

(better names welcome).  The target count for each column could be stored
in pg_attribute.  Doing it this way would let the dbadmin establish higher
or lower targets for especially irregular or simple columns, and then
forget about the numbers --- he wouldn't have to tweak his periodic cron
script to apply the right parameters to the right tables/columns.

			regards, tom lane
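To make step 2 of the proposal above a bit more concrete, here is a rough
sketch of collecting the K most common values and the M compressed-histogram
boundaries from a sample that has already been extracted and sorted.  It is
only an illustration: the names are invented, the values are plain doubles
rather than datums compared through a type-specific '<' operator, and error
handling is omitted.  In a real implementation the stored frequencies would
be each count divided by the sample size.

    #include <stdlib.h>
    #include <string.h>

    typedef struct
    {
        double  value;
        int     count;
    } TrackItem;

    /* keep track[0..k-1] sorted by descending count; drop the smallest */
    static void
    track_insert(TrackItem *track, int k, double value, int count)
    {
        int     i, j;

        for (i = 0; i < k; i++)
        {
            if (count > track[i].count)
            {
                for (j = k - 1; j > i; j--)
                    track[j] = track[j - 1];
                track[i].value = value;
                track[i].count = count;
                return;
            }
        }
    }

    /*
     * Illustrative sketch only: given a sorted sample of n values, fill in
     * the K most common values (mcv) and the M boundaries (hist) of an
     * equi-depth histogram over the sample with those common values excluded.
     */
    void
    compute_scalar_stats(const double *sorted, int n,
                         TrackItem *mcv, int k,
                         double *hist, int m)
    {
        double *rest = malloc(n * sizeof(double));
        int     nrest = 0;
        int     i, j;

        memset(mcv, 0, k * sizeof(TrackItem));

        /* one scan: count adjacent duplicates, remember the K highest counts */
        for (i = 0; i < n; )
        {
            int     run = 1;

            while (i + run < n && sorted[i + run] == sorted[i])
                run++;
            if (run > 1)        /* ignore values that look unique in the sample */
                track_insert(mcv, k, sorted[i], run);
            i += run;
        }

        /* copy the sample minus the most-common values (still sorted) */
        for (i = 0; i < n; i++)
        {
            int     is_mcv = 0;

            for (j = 0; j < k; j++)
                if (mcv[j].count > 0 && sorted[i] == mcv[j].value)
                    is_mcv = 1;
            if (!is_mcv)
                rest[nrest++] = sorted[i];
        }

        /* simple arithmetic picks the equi-depth boundary positions */
        if (nrest > 0 && m >= 2)
            for (j = 0; j < m; j++)
                hist[j] = rest[(long) j * (nrest - 1) / (m - 1)];

        free(rest);
    }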
> (1) The statistics-gathering process should be available as a standalone
> command, ANALYZE [ tablename ], not only as part of VACUUM.  (This was
> already discussed and agreed to for 7.1, but it never got done.)  Note
> that a pure ANALYZE command needs only a read lock on the target table,
> not an exclusive lock as VACUUM needs, so it's much more friendly to
> concurrent transactions.

7.1 already does the ANALYZE part of VACUUM ANALYZE with lighter locking.
I just never split out the command to be separate, partly out of fear of
user confusion and partly because I ran out of time.

--
Bruce Momjian  <pgman@candle.pha.pa.us>  http://candle.pha.pa.us
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> that a pure ANALYZE command needs only a read lock on the target table,
>> not an exclusive lock as VACUUM needs, so it's much more friendly to
>> concurrent transactions.

> 7.1 already does the ANALYZE part of VACUUM ANALYZE with lighter
> locking.

Right, but you still have to run through the VACUUM part, which will hold
down an exclusive lock for a considerable amount of time (if the table is
big).

			regards, tom lane
Philip Warner <pjw@rhyme.com.au> writes:
> At 18:37 19/04/01 -0400, Tom Lane wrote:
>> (2) Statistics should be computed on the basis of a random sample of the
>> target table, rather than a complete scan.  According to the literature
>> I've looked at, sampling a few thousand tuples is sufficient to give good
>> statistics even for extremely large tables; so it should be possible to
>> run ANALYZE in a short amount of time regardless of the table size.

> This sounds great; can the same be done for clustering, ie. pick a random
> sample of index nodes, look at the record pointers and so determine how
> well clustered the table is?

My intention was to use the same tuples sampled for the data histograms
to estimate how well sorted the data is.  However it's not immediately
clear that that'll give a trustworthy estimate; I'm still studying it ...

>> ALTER TABLE tab SET COLUMN col STATS COUNT n

> Sounds fine - user-selectability at the column level seems a good idea.
> Would there be any value in not making it part of a normal SQLxx
> statement, and adding an 'ALTER STATISTICS' command? eg.
> ALTER STATISTICS FOR tab[.column] COLLECT n
> ALTER STATISTICS FOR tab SAMPLE m

Is that more standard than the other syntax?

			regards, tom lane
At 18:37 19/04/01 -0400, Tom Lane wrote:
>(2) Statistics should be computed on the basis of a random sample of the
>target table, rather than a complete scan.  According to the literature
>I've looked at, sampling a few thousand tuples is sufficient to give good
>statistics even for extremely large tables; so it should be possible to
>run ANALYZE in a short amount of time regardless of the table size.

This sounds great; can the same be done for clustering, ie. pick a random
sample of index nodes, look at the record pointers and so determine how
well clustered the table is?

>A simple approach would be a SET
>variable or explicit parameter for ANALYZE.  But I am inclined to think
>that it'd be better to create a persistent per-column state for this,
>set by say
>	ALTER TABLE tab SET COLUMN col STATS COUNT n

Sounds fine - user-selectability at the column level seems a good idea.
Would there be any value in not making it part of a normal SQLxx statement,
and adding an 'ALTER STATISTICS' command? eg.

	ALTER STATISTICS FOR tab[.column] COLLECT n
	ALTER STATISTICS FOR tab SAMPLE m

etc.

----------------------------------------------------------------
Philip Warner  <pjw@rhyme.com.au>
Albatross Consulting Pty. Ltd.
At 20:48 19/04/01 -0400, Tom Lane wrote:
>> This sounds great; can the same be done for clustering, ie. pick a random
>> sample of index nodes, look at the record pointers and so determine how
>> well clustered the table is?
>
>My intention was to use the same tuples sampled for the data histograms
>to estimate how well sorted the data is.  However it's not immediately
>clear that that'll give a trustworthy estimate; I'm still studying it ...

I'm not sure you want to know how well sorted it is in general, but you do
want to know the expected cost in IOs of reading all records from a given
index node, so you can more accurately estimate indexscan costs.  AFAICS it
does not require that the entire table be sorted.  So checking the pointers
on the index nodes gives an idea of clustering.

>> ALTER STATISTICS FOR tab[.column] COLLECT n
>> ALTER STATISTICS FOR tab SAMPLE m
>
>Is that more standard than the other syntax?

Not at all.  It just avoids messing with one of the standard statements.

----------------------------------------------------------------
Philip Warner  <pjw@rhyme.com.au>
Albatross Consulting Pty. Ltd.
Philip Warner <pjw@rhyme.com.au> writes:
> I'm not sure you want to know how well sorted it is in general, but you do
> want to know the expected cost in IOs of reading all records from a given
> index node, so you can more accurately estimate indexscan costs.  AFAICS it
> does not require that the entire table be sorted.  So checking the pointers
> on the index nodes gives an idea of clustering.

But you don't really need to look at the index (if it even exists at the
time you do the ANALYZE).  The extent to which the data is ordered in the
table is a property of the table, not the index.  I'd prefer to get the
stat just from the table and not have to do additional I/O to examine the
indexes.

But, as I said, I'm still reading the literature about estimation methods
for indexscans.  It may turn out that a statistic calculated this way
isn't the right thing to use, or isn't trustworthy when taken over a small
sample.

>> Is that more standard than the other syntax?

> Not at all.  It just avoids messing with one of the standard statements.

Oh, so you're deliberately not being standard.  I see.  Probably a
reasonable point.

			regards, tom lane
At 21:14 19/04/01 -0400, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>> I'm not sure you want to know how well sorted it is in general, but you do
>> want to know the expected cost in IOs of reading all records from a given
>> index node, so you can more accurately estimate indexscan costs.  AFAICS it
>> does not require that the entire table be sorted.  So checking the pointers
>> on the index nodes gives an idea of clustering.
>
>But you don't really need to look at the index (if it even exists
>at the time you do the ANALYZE).  The extent to which the data is
>ordered in the table is a property of the table, not the index.

But the value (and cost) of using a specific index in an indexscan depends
on that index (or am I missing something?).

----------------------------------------------------------------
Philip Warner  <pjw@rhyme.com.au>
Albatross Consulting Pty. Ltd.
Philip Warner <pjw@rhyme.com.au> writes:
> At 21:14 19/04/01 -0400, Tom Lane wrote:
>> But you don't really need to look at the index (if it even exists
>> at the time you do the ANALYZE).  The extent to which the data is
>> ordered in the table is a property of the table, not the index.

> But the value (and cost) of using a specific index in an indexscan depends
> on that index (or am I missing something?).

All that we're discussing here is one specific parameter in the cost
estimation for an indexscan, viz, the extent to which the table ordering
agrees with the index ordering.  As long as they both agree about the
ordering operator, this number doesn't depend on the index --- the index
is by definition in perfect agreement with the ordering operator.  There
are other parameters in the total cost estimate that will depend on the
index, but this one doesn't AFAICS.

			regards, tom lane
Tom Lane writes:

> 2. The M boundary values of an equi-depth histogram, ie, the values that
> divide the data distribution into equal-population sets.  For example, if
> M is 3 this would consist of the min, median, and max values.

Why not model that data with a normal distribution (or some other such
model)?  This should give you fairly good estimates for <, > and between
type queries.  That way the user wouldn't have to tweak M.  (I couldn't
even imagine a way to explain to the user how to pick good M's.)

> Issue: for datatypes that have no '<' operator, we of course cannot
> compute a histogram --- but if there is an '=' then the most-common-value
> and number-of-distinct-values stats still make sense.

I think the statistics calculation should be data type (or operator, or
opclass) specific, like the selectivity estimation functions.  For
geometric data types you need completely different kinds of statistics,
and this would allow users to plug in different methods.  The
pg_statistics table could just have an array of floats where each data
type can store statistics as it chooses, and the selectivity estimation
routines can interpret the values in a different way per data type.

That way you can also make some common-sense optimizations without hacks,
e.g., for boolean columns you possibly only need to calculate 1 value
(the number of trues).

Of course, how to calculate up to N different sets of statistics for N
different columns that require up to N different numbers of passes over
the table is left as a challenge. ;-)

This brings up a question I have:  Are statistics calculated for every
column?  Should they be?  Is there a possibility to speed up ANALYZE by
controlling this?

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
Peter Eisentraut <peter_e@gmx.net> writes:
> Tom Lane writes:
>> 2. The M boundary values of an equi-depth histogram, ie, the values that
>> divide the data distribution into equal-population sets.  For example, if
>> M is 3 this would consist of the min, median, and max values.

> Why not model that data with a normal distribution (or some other such
> model)?

If we knew an appropriate distribution model, we could do that.  There is
no distribution model that is appropriate for everything... certainly the
normal distribution is not.  A compressed histogram is more flexible than
any other simple model I know of.

> (I couldn't even imagine a way to explain to the user how to pick good
> M's.)

I was just planning to say "if you get bad estimates on a column with a
highly irregular distribution, try increasing the default M".  Eventually
we might have enough experience with it to offer more detailed advice.

>> Issue: for datatypes that have no '<' operator, we of course cannot
>> compute a histogram --- but if there is an '=' then the most-common-value
>> and number-of-distinct-values stats still make sense.

> I think the statistics calculation should be data type (or operator, or
> opclass) specific, like the selectivity estimation functions.

Yeah, that's probably what we should do, but so far no one has even
suggested what stats we might want for geometric datatypes.  Tell you
what: let's throw in a field that indicates "kind of statistics gathered",
and have the meaning of the array fields depend on that.  The proposal I
gave describes just the stats to gather for scalar types.

> pg_statistics table could just have an array of floats where each data
> type can store statistics as it chooses and the selectivity estimation
> routines can interpret the values in a different way per data type.

Man does not live by floats alone --- in particular we need to be able to
store specific values of the target datatype.  Probably what we want is
three or four instances of the pattern

	stats_kind	int,		-- identifies type of info
	stats_numbers	float[],	-- numerical info such as occurrence fraction
	stats_values	text[],		-- array of values of column datatype
					-- (in external representation)

Depending on the value of stats_kind, either stats_numbers or stats_values
might be unused, in which case it could be set to NULL so it doesn't waste
space.  For scalar datatypes, an instance of this pattern could hold the
most common values and their frequencies, and another one could hold the
histogram boundary points (with stats_numbers unused).

> Of course, how to calculate up to N different sets of statistics for N
> different columns that require up to N different numbers of passes over
> the table is left as a challenge. ;-)

As long as it's OK to gather the stats from a random sample of the table,
I think the structure I proposed would work fine.  Remember the second
part of ANALYZE is looping over each column separately anyway --- so it
could easily apply a datatype-dependent algorithm for computing statistics.

> This brings up a question I have: Are statistics calculated for every
> column?  Should they be?  Is there a possibility to speed up ANALYZE by
> controlling this?

Presently, frequency stats will be calculated for every column that has
an '=' operator, and min/max stats will be calculated if there's a '<'
operator for the type.  I was planning to retain that convention.
However if we are going to allow adjustable M, we could also add the
stipulation that the DBA could set M = 0 for columns that he doesn't want
stats gathered for.  (This would make sense for a column that is never
used in a WHERE clause.)

			regards, tom lane
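Purely as an illustration of the repeated "kind of statistics" slot
pattern described above, the per-column statistics entry could be pictured
as C data structures along these lines.  The names, the number of slots,
and the use of ordinary pointers for the arrays are all invented for the
example; in an actual system catalog the arrays would be variable-length,
NULLable attributes rather than fixed struct fields.

    /* one "slot": what kind of stats it holds, plus its two optional arrays */
    typedef struct StatsSlot
    {
        int      stats_kind;     /* e.g. 1 = most-common values + frequencies,
                                  * 2 = histogram boundary values, 0 = unused */
        float   *stats_numbers;  /* numerical info such as occurrence fractions,
                                  * or NULL if this kind doesn't need numbers */
        int      n_numbers;
        char   **stats_values;   /* column values in external (text) form,
                                  * or NULL if this kind doesn't need values */
        int      n_values;
    } StatsSlot;

    #define STATS_NUM_SLOTS 4    /* "three or four instances of the pattern" */

    /* per-column statistics, with the datatype-independent fields alongside */
    typedef struct ColumnStats
    {
        double      null_frac;   /* fraction of entries that are NULL */
        int         avg_width;   /* average stored width, for varlena types */
        double      n_distinct;  /* estimated number of distinct values */
        StatsSlot   slot[STATS_NUM_SLOTS];
    } ColumnStats;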
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Remember one idea is for index scans to automatically update the expired
> flag in the index bitfields when they check the heap tuple.

There is no such flag, and I'm not planning to add one before I fix
statistics ;-)

There are a number of things that'd have to be thought about before such
a thing could happen.  For one thing, trying to obtain write locks instead
of read locks during a btree scan would blow us out of the water for
supporting concurrent btree operations.

			regards, tom lane
> A different approach that's been discussed on pghackers is to make use
> of btree indexes for columns that have such indexes: we could scan the
> indexes to visit all the column values in sorted order.  I have rejected
> that approach because (a) it doesn't help for columns without a suitable
> index; (b) our indexes don't distinguish deleted and live tuples,
> which would skew the statistics --- in particular, we couldn't tell a
> frequently-updated single tuple from a commonly repeated value; (c)
> scanning multiple indexes would likely require more total I/O than just
> grabbing sample tuples from the main table --- especially if we have to
> do that anyway to handle columns without indexes.

Remember one idea is for index scans to automatically update the expired
flag in the index bitfields when they check the heap tuple.

--
Bruce Momjian  <pgman@candle.pha.pa.us>  http://candle.pha.pa.us
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Remember one idea is for index scans to automatically update the expired
> > flag in the index bitfields when they check the heap tuple.
>
> There is no such flag, and I'm not planning to add one before I fix
> statistics ;-)
>
> There are a number of things that'd have to be thought about before
> such a thing could happen.  For one thing, trying to obtain write locks
> instead of read locks during a btree scan would blow us out of the water
> for supporting concurrent btree operations.

Yes, it was just an idea.  Someone may add it for 7.2.

--
Bruce Momjian  <pgman@candle.pha.pa.us>  http://candle.pha.pa.us
At 22:27 19/04/01 -0400, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>> At 21:14 19/04/01 -0400, Tom Lane wrote:
>>> But you don't really need to look at the index (if it even exists
>>> at the time you do the ANALYZE).  The extent to which the data is
>>> ordered in the table is a property of the table, not the index.
>
>> But the value (and cost) of using a specific index in an indexscan depends
>> on that index (or am I missing something?).
>
>All that we're discussing here is one specific parameter in the cost
>estimation for an indexscan, viz, the extent to which the table ordering
>agrees with the index ordering.

This does not necessarily follow.  A table ordering need not follow the
sort order of an index for the index to have a low indexscan cost.  All
that is required is that most of the rows referred to by an index node
reside in a page or pages that will be read by one IO.  For example, a
table that has a sequence-based ID with, say, 20% of rows updated will
work nicely with an indexscan on the ID, even though it has never been
clustered.

What I'm suggesting is that if you look at a random sample of index nodes,
you should be able to get a statistically valid estimate of the 'clumping'
of the data pointed to by the index.

Am I still missing the point?

----------------------------------------------------------------
Philip Warner  <pjw@rhyme.com.au>
Albatross Consulting Pty. Ltd.
Philip Warner <pjw@rhyme.com.au> writes:
>> All that we're discussing here is one specific parameter in the cost
>> estimation for an indexscan, viz, the extent to which the table ordering
>> agrees with the index ordering.

> This does not necessarily follow.  A table ordering need not follow the
> sort order of an index for the index to have a low indexscan cost.  All
> that is required is that most of the rows referred to by an index node
> reside in a page or pages that will be read by one IO.  For example, a
> table that has a sequence-based ID with, say, 20% of rows updated will
> work nicely with an indexscan on the ID, even though it has never been
> clustered.

Right, what matters is the extent of correlation between table ordering
and index ordering, not how it got to be that way.

> What I'm suggesting is that if you look at a random sample of index nodes,
> you should be able to get a statistically valid estimate of the 'clumping'
> of the data pointed to by the index.

And I'm saying that you don't actually have to look at the index in order
to compute the very same estimate.  The only property of the index that
matters is its sort order; if you assume you know the right sort order
(and in practice there's usually only one interesting possibility for a
column) then you can compute the correlation just by looking at the table.

Andreas correctly points out that this approach doesn't extend very well
to multi-column or functional indexes, but I'm willing to punt on those
for the time being ...

			regards, tom lane
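For what it's worth, here is one plausible reading of "compute the
correlation just by looking at the table", sketched only as an
illustration since the thread deliberately leaves the exact formula open:
sort the sampled values, then correlate each sample's physical (heap)
position with its rank in the sorted order.  A result near 1 suggests the
table is nearly in index order, near 0 essentially random, near -1
reverse-ordered.  The names are invented and the values are plain doubles.

    #include <math.h>
    #include <stdlib.h>

    typedef struct
    {
        double  value;      /* sampled column value */
        int     heap_pos;   /* 0..n-1: this sample's position in physical order */
    } SampleItem;

    static int
    cmp_by_value(const void *a, const void *b)
    {
        double  va = ((const SampleItem *) a)->value;
        double  vb = ((const SampleItem *) b)->value;

        return (va < vb) ? -1 : (va > vb) ? 1 : 0;
    }

    /* items[] must arrive in physical table order; returns a value in [-1, 1] */
    double
    estimate_correlation(SampleItem *items, int n)
    {
        double  sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
        int     rank;

        if (n < 2)
            return 0.0;

        qsort(items, n, sizeof(SampleItem), cmp_by_value);

        for (rank = 0; rank < n; rank++)
        {
            double  x = items[rank].heap_pos;   /* physical position */
            double  y = rank;                   /* logical (sorted) position */

            sx += x;
            sy += y;
            sxy += x * y;
            sxx += x * x;
            syy += y * y;
        }

        /* ordinary (Pearson) correlation of physical position vs. sorted rank */
        return (n * sxy - sx * sy) /
               sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    }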
Tom Lane wrote:
> Philip Warner <pjw@rhyme.com.au> writes:
>> What I'm suggesting is that if you look at a random sample of index nodes,
>> you should be able to get a statistically valid estimate of the 'clumping'
>> of the data pointed to by the index.
>
> And I'm saying that you don't actually have to look at the index in
> order to compute the very same estimate.  The only property of the index
> that matters is its sort order; if you assume you know the right sort
> order (and in practice there's usually only one interesting possibility
> for a column) then you can compute the correlation just by looking at
> the table.

This is more true for unique indexes than for non-unique ones, unless our
non-unique indexes are smart enough to insert equal index nodes in table
order.

> Andreas correctly points out that this approach doesn't extend very well
> to multi-column or functional indexes, but I'm willing to punt on those
> for the time being ...

----------
Hannu
At 10:10 23/04/01 -0400, Tom Lane wrote:
>>> All that we're discussing here is one specific parameter in the cost
>>> estimation for an indexscan, viz, the extent to which the table ordering
>>> agrees with the index ordering.
>
>> This does not necessarily follow.  A table ordering need not follow the
>> sort order of an index for the index to have a low indexscan cost.  All
>> that is required is that most of the rows referred to by an index node
>> reside in a page or pages that will be read by one IO.  For example, a
>> table that has a sequence-based ID with, say, 20% of rows updated will
>> work nicely with an indexscan on the ID, even though it has never been
>> clustered.
>
>Right, what matters is the extent of correlation between table ordering
>and index ordering, not how it got to be that way.

No; *local* ordering needs to *roughly* match.  Global ordering and
record-by-record ordering don't matter.  For example, for a table with an
ID field, the rows may be stored as (where ----- indicates a mythical page)

	-----
	5 9 6 7
	-----
	1 10 2 3
	-----
	4 8 11 12
	-----

A sorted index may have node pointers to (1,2,3), (4,5,6), (7,8,9) and
(10,11,12).  The first node would take 1 IO, then each of the others would
take 2.  This would give a much more reasonable estimate for the indexscan
costs (assuming a random sample is adequate).

>> What I'm suggesting is that if you look at a random sample of index nodes,
>> you should be able to get a statistically valid estimate of the 'clumping'
>> of the data pointed to by the index.
>
>And I'm saying that you don't actually have to look at the index in
>order to compute the very same estimate.

No.  Not given the above.

>The only property of the index
>that matters is its sort order; if you assume you know the right sort
>order (and in practice there's usually only one interesting possibility
>for a column) then you can compute the correlation just by looking at
>the table.

This is true, but only if you are strictly interested in sort order, which
I don't think we are.

>Andreas correctly points out that this approach doesn't extend very well
>to multi-column or functional indexes, but I'm willing to punt on those
>for the time being ...

My approach should work with arbitrary indexes.

----------------------------------------------------------------
Philip Warner  <pjw@rhyme.com.au>
Albatross Consulting Pty. Ltd.
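A very rough sketch of the kind of 'clumping' statistic being suggested
here, offered only as an illustration (a real version would read item
pointers from sampled btree leaf pages, and none of these names exist
anywhere): walk sampled index entries in index order and see how often the
heap block changes from one entry to the next.  Fewer changes per entry
means fewer IOs per index node, regardless of whether the global orderings
agree.

    /*
     * Illustrative sketch only: heap_blkno[] holds, in index order, the
     * heap block number each sampled index entry points at.  The result is
     * the expected number of heap-page switches per entry; lower means
     * "clumpier" and hence a cheaper indexscan.
     */
    double
    estimate_page_changes(const unsigned int *heap_blkno, int nentries)
    {
        int     changes = 0;
        int     i;

        for (i = 1; i < nentries; i++)
            if (heap_blkno[i] != heap_blkno[i - 1])
                changes++;

        return (nentries > 1) ? (double) changes / (nentries - 1) : 0.0;
    }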
Bruce Momjian wrote:
> > A different approach that's been discussed on pghackers is to make use
> > of btree indexes for columns that have such indexes: we could scan the
> > indexes to visit all the column values in sorted order.  I have rejected
> > that approach because (a) it doesn't help for columns without a suitable
> > index; (b) our indexes don't distinguish deleted and live tuples,
> > which would skew the statistics --- in particular, we couldn't tell a
> > frequently-updated single tuple from a commonly repeated value; (c)
> > scanning multiple indexes would likely require more total I/O than just
> > grabbing sample tuples from the main table --- especially if we have to
> > do that anyway to handle columns without indexes.
>
> Remember one idea is for index scans to automatically update the expired
> flag in the index bitfields when they check the heap tuple.

And we should really do that.  While playing around with my access
statistics stuff (intended for 7.2) I found that when running pg_bench, a
couple of thousand index scans cause millions and millions of buffer
fetches, because pg_bench updates one and the same row over and over again
and it has a PKEY.

Jan

--
Jan Wieck  <JanWieck@Yahoo.com>