Thread: estimating # of distinct values
Hi everyone, about two weeks ago I started a thread about cross-column stats. One of the proposed solutions is based on the number of distinct values, and the truth is the current distinct estimator has some problems. I've done some small research - I have read about 20 papers on this, and I'd like to present a short summary here, possible solutions etc. The simple truth is

1) sampling-based estimators are a dead-end
2) there are very interesting stream-based estimators
3) the stream-based estimators have some disadvantages

I wrote a more thorough summary on a wiki (http://wiki.postgresql.org/wiki/Estimating_Distinct) with a list of the most interesting papers etc., so just a very short explanation here.

1) sampling-based estimators are a dead-end

The paper from Charikar & Chaudhuri states (and proves) that for each sampling-based estimator, there's a data set where the estimator produces an arbitrarily large error (with non-negligible probability). So replacing one sampling-based estimator with another generally does not improve the situation - it fixes one dataset, breaks another one.

The error is directly related to the size of the sample, so the truth is that to get a good distinct estimate you need to scan a significant portion of the table (almost all of it). So the estimators based on tiny samples are a dead end.

2) there are very interesting stream-based estimators

A very interesting idea comes from the field of data streams. These estimates are based on a single pass through the data, and then incremental updates. A very nice thing is that these algorithms don't need that much space, as they don't store the actual values.

The idea behind this is similar to the Bloom filter, i.e. set bits of a bitmap using a hash of the value. But in this case it's not required to be able to answer questions like 'is this an element of the set X?' so it's actually even more space efficient (a toy sketch of this idea is at the end of this mail). For an introduction see the first paper published in 1985 by Flajolet (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.7100).

One of the recent articles (published in June 2010) actually presents an optimal algorithm with O(e^-2 + log(n)) space complexity and O(1) update complexity. The "e" means precision, i.e. that the estimate will be within [(1-e)F, (1+e)F] where F is the real value.

The paper on the self-learning bitmap states that to track 10^9 distinct values with 1% relative error you need about 61 kilobits of space (which is absolutely awesome). The optimal algorithm needs a bit more space, I think. A very interesting solution is "distinct sampling", which somehow extends Wegman's adaptive sampling approach.

3) the stream-based estimators have some disadvantages

Not everything is perfect, though - the most serious disadvantage is that those algorithms (usually) don't handle removal of elements. Inserts are easy to handle, but deletes are hard (just as in the case of Bloom filters). So using this stream-based approach might lead to degradation in the case of heavily updated tables :-(

I see two possible solutions here:

(a) use counters instead of bits, and track actual counts - but this is tricky, especially with 'probabilistic' algorithms, and needs much more space (but still, 32x 61kb is just 250kB)

(b) count how many deletes/updates were performed, and if needed rebuild the whole bitmap

But even with these disadvantages, there really is no other way to enhance the estimates. I don't think this should be a default behavior - just as in case of cross-column stats this should be optional when the current estimator does not work well.
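To make the bitmap-and-hash idea a bit more concrete, here is a toy, self-contained sketch of the basic probabilistic counting estimator from the 1985 Flajolet & Martin paper (single bitmap, correction constant 0.77351). The hash function, sizes and test data are arbitrary choices of mine, and the real PCSA variant averages many such bitmaps to reduce the variance, so don't expect this toy to be particularly accurate - it just shows how little state is needed.

    #include <stdint.h>
    #include <math.h>
    #include <stdio.h>

    #define FM_BITS 32

    static uint32_t fm_bitmap = 0;

    /* toy 32-bit mixing hash (MurmurHash3 finalizer); just an assumption,
     * not what the paper or pg_estimator use */
    static uint32_t
    hash_value(uint32_t v)
    {
        v ^= v >> 16;
        v *= 0x85ebca6bu;
        v ^= v >> 13;
        v *= 0xc2b2ae35u;
        v ^= v >> 16;
        return v;
    }

    /* rho(h) = position of the least significant 1-bit */
    static int
    rho(uint32_t h)
    {
        int i = 0;

        if (h == 0)
            return FM_BITS - 1;
        while ((h & 1) == 0)
        {
            h >>= 1;
            i++;
        }
        return i;
    }

    /* adding a value just sets one bit - this is the whole per-row cost */
    static void
    fm_add(uint32_t v)
    {
        fm_bitmap |= ((uint32_t) 1 << rho(hash_value(v)));
    }

    /* R = index of the lowest zero bit; E[R] ~= log2(0.77351 * D) */
    static double
    fm_estimate(void)
    {
        int r = 0;

        while (r < FM_BITS && (fm_bitmap & ((uint32_t) 1 << r)))
            r++;
        return pow(2.0, r) / 0.77351;
    }

    int
    main(void)
    {
        uint32_t i;

        for (i = 0; i < 100000; i++)
            fm_add(i % 20000);      /* stream with 20000 distinct values */
        printf("estimated ndistinct: %.0f\n", fm_estimate());
        return 0;
    }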
regards Tomas
2010/12/27 Tomas Vondra <tv@fuzzy.cz>: > But even with these disadvantages, there really is no other > way to enhance the estimates. I don't think this should be a > default behavior - just as in case of cross-column stats this should > be optional when the current estimator does not work well.

This is going to be a lot of work to implement, so before you do it, we should try to reach a consensus that (a) it's part of an overall strategy that the community generally supports and (b) we have consensus on the design for this part.

With respect to (a), I have to admit I've found the discussion on cross-column stats to be quite difficult to follow. I'd like to see a very simple description of exactly what information we're going to store, under what circumstances we'll store it, and how we'll use it to compute selectivity estimates.

With respect to (b), I think I'd need to see a much more detailed design for how you intend to make this work. Off the top of my head there seems to be some pretty serious feasibility problems.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > With respect to (b), I think I'd need to see a much more detailed > design for how you intend to make this work. Off the top of my > head there seems to be some pretty serious feasibility problems. I had one random thought on that -- it seemed like a large concern was that there would need to be at least an occasional scan of the entire table to rebuild the distinct value information. Don't we already require an occasional scan of the entire table for freezing transaction IDs? Could this be part of any vacuum of the entire table? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Robert Haas <robertmhaas@gmail.com> wrote: >> With respect to (b), I think I'd need to see a much more detailed >> design for how you intend to make this work. Off the top of my >> head there seems to be some pretty serious feasibility problems. > I had one random thought on that -- it seemed like a large concern > was that there would need to be at least an occasional scan of the > entire table to rebuild the distinct value information. Don't we > already require an occasional scan of the entire table for freezing > transaction IDs? Could this be part of any vacuum of the entire > table? Well, first, those scans occur only once every few hundred million transactions, which is not likely a suitable timescale for maintaining statistics. And second, we keep on having discussions about rejiggering the whole tuple-freezing strategy. Even if piggybacking on those scans looked useful, it'd be unwise to assume it'll continue to work the same way it does now. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, first, those scans occur only once every few hundred million > transactions, which is not likely a suitable timescale for > maintaining statistics. I was assuming that the pass of the entire table was priming for the incremental updates described at the start of this thread. I'm not clear on how often the base needs to be updated for the incremental updates to keep the numbers "close enough". > And second, we keep on having discussions about rejiggering > the whole tuple-freezing strategy. Even if piggybacking on those > scans looked useful, it'd be unwise to assume it'll continue to > work the same way it does now. Sure, it might need to trigger its own scan in the face of heavy deletes anyway, since the original post points out that the algorithm handles inserts better than deletes, but as long as we currently have some sequential pass of the data, it seemed sane to piggyback on it when possible. And maybe we should be considering things like this when we weigh the pros and cons of rejiggering. This issue of correlated values comes up pretty often.... -Kevin
Dne 27.12.2010 22:46, Robert Haas napsal(a): > 2010/12/27 Tomas Vondra <tv@fuzzy.cz>: >> But even with these disadvantages, there really is no other >> way to enhance the estimates. I don't think this should be a >> default behavior - just as in case of cross-column stats this should >> be optional when the current estimator does not work well. > > This is going to be a lot of work to implement, so before you do it, > we should try to reach a consensus that (a) it's part of an overall > strategy that the community generally supports and (b) we have > consensus on the design for this part.

Yes, it is going to be a lot of work. But I don't think we need a consensus of the whole community prior to building something that works. I plan to build a very simple prototype soon, so let's talk about this later. I've started these two threads mainly as a 'brainstorming' - to present what I've learned from the various papers, propose a possible solution, highlight issues I see, and get ideas on how to solve them.

> With respect to (a), I have to admit I've found the discussion on > cross-column stats to be quite difficult to follow. I'd like to see a > very simple description of exactly what information we're going to > store, under what circumstances we'll store it, and how we'll use it > to compute selectivity estimates.

Yes, I know it was difficult to follow that discussion. That's why I created a 'summary page' on the wiki: http://wiki.postgresql.org/wiki/Cross_Columns_Stats

Generally we need to gather two types of stats:

(a) multi-dimensional histogram - this can be done about the same way we create single-dimensional histograms, that's not a big problem (although more data might be necessary to get accurate histograms)

(b) # of distinct values - this is the tricky part, as described in this thread

> With respect to (b), I think I'd need to see a much more detailed > design for how you intend to make this work. Off the top of my head > there seems to be some pretty serious feasibility problems.

Well, it's not going to be easy, that's for sure. My "big plan" is something like this:

(a) Build a simple 'contrib-like' module that allows manual collection of stats and computing of estimates (not integrated with the optimizer in any way). This should help us better understand what stats we need etc.

(b) Minimal integration with the core - still manual collection of stats, the optimizer uses the stats if available.

(c) Enhancements - automatic updates of the stats, etc.

And all this should be implemented so that you don't have to pay unless you want to use the new stats.

regards Tomas
Dne 28.12.2010 00:04, Kevin Grittner napsal(a): > Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Well, first, those scans occur only once every few hundred million >> transactions, which is not likely a suitable timescale for >> maintaining statistics. > > I was assuming that the pass of the entire table was priming for the > incremental updates described at the start of this thread. I'm not > clear on how often the base needs to be updated for the incremental > updates to keep the numbers "close enough".

Well, that really depends on the workload. If you never remove all occurrences of a given value (e.g. a ZIP code), you don't need to rescan the table at all. All you need is to build the stats once and then update them incrementally - and I believe this could be handled by autovacuum.

>> And second, we keep on having discussions about rejiggering >> the whole tuple-freezing strategy. Even if piggybacking on those >> scans looked useful, it'd be unwise to assume it'll continue to >> work the same way it does now. > > Sure, it might need to trigger its own scan in the face of heavy > deletes anyway, since the original post points out that the > algorithm handles inserts better than deletes, but as long as we > currently have some sequential pass of the data, it seemed sane to > piggyback on it when possible. And maybe we should be considering > things like this when we weigh the pros and cons of rejiggering. > This issue of correlated values comes up pretty often....

Again, there are two types of stats - one of them needs to scan the whole table (estimate of distinct values), the other one does not (multi-dimensional histograms). These two cases are independent - you don't necessarily need both. Better ndistinct estimates would actually solve some of the current issues, it's not just a matter of cross-column stats.

regards Tomas
> The simple truth is > > 1) sampling-based estimators are a dead-end

While I don't want to discourage you from working on stream-based estimators ... I'd love to see you implement a proof-of-concept for PostgreSQL, and test it ... the above is a non-argument. It requires us to accept that sample-based estimates cannot ever be made to work, simply because you say so.

The Charikar and Chaudhuri paper does not, in fact, say that it is impossible to improve sampling-based estimators as you claim it does. In fact, the authors offer several ways to improve sampling-based estimators. Further, 2000 was hardly the end of sampling-estimation paper publication; there are later papers with newer ideas.

For example, I still think we could tremendously improve our current sampling-based estimator without increasing I/O by moving to block-based estimation*. The accuracy statistics for block-based samples of 5% of the table look quite good.

I would agree that it's impossible to get a decent estimate of n-distinct from a 1% sample. But there's a huge difference between 5% or 10% and "a majority of the table".

Again, don't let this discourage you from attempting to write a stream-based estimator. But do realize that you'll need to *prove* its superiority, head-to-head, against sampling-based estimators.

[* http://www.jstor.org/pss/1391058 (unfortunately, no longer public-access)]

-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Tue, Dec 28, 2010 at 1:39 AM, Josh Berkus <josh@agliodbs.com> wrote: > While I don't want to discourage you from working on stream-based > estimators ... I'd love to see you implement a proof-of-concept for > PostgreSQL, and test it ... the above is a non-argument. It requires us > to accept that sample-based estimates cannot ever be made to work, > simply because you say so.

This argument has been made on this mailing list and on pgsql-performance many times, so I have been assuming that this is settled mathematics. Admittedly, I don't have a citation for this...

> I would agree that it's impossible to get a decent estimate of > n-distinct from a 1% sample. But there's a huge difference between 5% > or 10% and "a majority of the table".

Considering the way that disks work, it's not that huge. If the table is small enough to fit in memory conveniently, it probably wouldn't be unacceptably costly even to read the whole thing. But if it's a terabyte on disk, reading every tenth block is probably going to carry most of the I/O cost of reading every block. Even reading every twentieth or hundredth block is going to be significantly more expensive than what we do now.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> >> The simple truth is >> >> 1) sampling-based estimators are a dead-end > > The Charikar and Chaudhuri paper does not, in fact, say that it is > impossible to improve sampling-based estimators as you claim it does. In > fact, the authors offer several ways to improve sampling-based > estimators. Further, 2000 was hardly the end of sampling-estimation > paper publication; there are later papers with newer ideas.

Well, the paper states that there is a lower bound on the possible error of a sampling-based estimator, depending on the sample size. The actual inequality is

  error(d) >= sqrt((n-r)/(2r) * 1/q)

where error(d) is the "ratio error" (d = estimated number of distinct values, D = actual number of distinct values):

  error(d) = max{ D/d, d/D }

And all this holds with probability q >= e^{-1000} (you can choose this).

Say you have a table with 1,000,000 rows and you use a sample of 1,000 rows to do an estimate. In that case you get

  error(d) >= 99 with q = 0.05
  error(d) >= 70 with q = 0.1
  error(d) >= 31 with q = 0.5

If you scan 10% of the table, you get this

  error(d) >= 9.5 with q = 0.05
  error(d) >= 6.7 with q = 0.1
  error(d) >= 3   with q = 0.5

So even with 10% of the table, there's a 10% probability to get an estimate that's 7x overestimated or underestimated. With lower probability the interval is much wider.

> For example, I still think we could tremendously improve our current > sampling-based estimator without increasing I/O by moving to block-based > estimation*. The accuracy statistics for block-based samples of 5% of > the table look quite good.

Well, that's certainly possible. But at best you can achieve the error(d) lower bound consistently (for all datasets); you can't do better. And they've already presented an estimator that does exactly this (called AE - Adaptive Estimator in the paper).

> I would agree that it's impossible to get a decent estimate of > n-distinct from a 1% sample. But there's a huge difference between 5% > or 10% and "a majority of the table".

Sure we can do better. But there's a limit we can't cross no matter what estimator we choose and how large the sample will be. I've seen several post-2000 papers on sample-based estimators, but those are mostly hybrid estimators, i.e. estimators composed of several simple estimators. So in the end it's pretty complicated, you need to gather a lot of different stats, and still you can't get a better error than the lower bound :-( So yes, we can improve the current estimator (making it more complex), but it does not actually solve the problem.

> Again, don't let this discourage you from attempting to write a > stream-based estimator. But do realize that you'll need to *prove* its > superiority, head-to-head, against sampling-based estimators. > > [* http://www.jstor.org/pss/1391058 (unfortunately, no longer > public-access)]

It's available here http://www.stat.washington.edu/research/reports/1990s/

But it does not present an estimator contradicting the Charikar and Chaudhuri paper - it's still 'just' a sample-based estimator, although they draw the sample at the block level. But yes, that's a good idea and I've already mentioned it in the cross-column stats thread I think.

The question is if a sample obtained in this way will be as good as the current samples. This way you could get quite separate 'clusters' of values, one cluster for each block, although in reality the values are uniformly distributed. And the resulting histogram would be crippled by this I guess too.
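Just as a sanity check on the arithmetic, here is a trivial program that plugs the numbers above into the inequality exactly as quoted (nothing beyond sqrt((n-r)/(2r) * 1/q); it is only a restatement of the bound, not a new result):

    #include <math.h>
    #include <stdio.h>

    /* lower bound on the ratio error for a sample of r rows out of n,
     * holding with probability q */
    static double
    min_ratio_error(double n, double r, double q)
    {
        return sqrt((n - r) / (2.0 * r) / q);
    }

    int
    main(void)
    {
        double  n = 1000000.0;
        double  q[3] = {0.05, 0.1, 0.5};
        int     i;

        for (i = 0; i < 3; i++)
            printf("r = 1000   q = %.2f  error(d) >= %.1f\n",
                   q[i], min_ratio_error(n, 1000.0, q[i]));

        for (i = 0; i < 3; i++)
            printf("r = 100000 q = %.2f  error(d) >= %.1f\n",
                   q[i], min_ratio_error(n, 100000.0, q[i]));

        return 0;
    }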
But if you know about interesting papers on sample-based estimators (especially post-2000), let me know. I've searched for them, but those I found were not very interesting IMHO. Tomas
<tv@fuzzy.cz> wrote: > So even with 10% of the table, there's a 10% probability to get an > estimate that's 7x overestimated or underestimated. With lower > probability the interval is much wider. Hmmm... Currently I generally feel I'm doing OK when the estimated rows for a step are in the right order of magnitude -- a 7% error would be a big improvement in most cases. Let's not lose track of the fact that these estimates are useful even when they are not dead-on accurate. -Kevin
> <tv@fuzzy.cz> wrote: > >> So even with 10% of the table, there's a 10% probability to get an >> estimate that's 7x overestimated or underestimated. With lower >> probability the interval is much wider. > > Hmmm... Currently I generally feel I'm doing OK when the estimated > rows for a step are in the right order of magnitude -- a 7% error > would be a big improvement in most cases. Let's not lose track of > the fact that these estimates are useful even when they are not > dead-on accurate.

Well, but that's not 7%, that's 7x! And the theorem says 'greater or equal', so this is actually the minimum - you can get a much bigger difference with lower probability. So you can easily get an estimate that is a few orders of magnitude off.

Anyway I really don't want precise values, just a reasonable estimate. As I said, we could use the AE estimator they proposed in the paper. It has the nice feature that it actually reaches the lower bound (thus the inequality changes to equality). The downside is that there are estimators with better behavior on some datasets.

Tomas
> Well, but that's not 7%, that's 7x! And the theorem says 'greater or equal', > so this is actually the minimum - you can get a much bigger difference > with lower probability. So you can easily get an estimate that is a few > orders of magnitude off.

FWIW, based on query performance, estimates which are up to 5X off are tolerable, and anything within 3X is considered "accurate". Above 5X the probability of bad query plans becomes problematically high. Of course, if you're doing cross-column stats, the accuracy of each individual column becomes critical since estimation error could be combinational in the worst case (i.e. if colA is 3X and colB is 0.3X then colA<->colB will be 9X off).

Anyway, I look forward to your experiments with stream-based estimators.

-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Dec 27, 2010, at 23:49 , Kevin Grittner wrote: > Robert Haas <robertmhaas@gmail.com> wrote: > >> With respect to (b), I think I'd need to see a much more detailed >> design for how you intend to make this work. Off the top of my >> head there seems to be some pretty serious feasibility problems. > > I had one random thought on that -- it seemed like a large concern > was that there would need to be at least an occasional scan of the > entire table to rebuild the distinct value information.

I believe we could actually avoid that.

First, the paper "An Optimal Algorithm for the Distinct Elements Problem" actually contains an algorithm which *does* handle deletes - it's called "L_0" estimate there.

Second, as Tomas pointed out, the stream-based estimator is essentially a simplified version of a Bloom filter. It starts out with a field of N zero bits, and sets K of them to 1 for each value v in the stream. Which bits are set to 1 depends on some hash function(s) H_i(v). It's then easy to compute how many 1-bits you'd expect to find in the bit field after seeing D distinct values, and by reversing that you can estimate D from the number of 1-bits in the bit field.

To avoid having to rescan large tables, instead of storing one such bit field, we'd store one per B pages of data. We'd then only need to scan a range of B pages around every updated or deleted tuple, and could afterwards compute a new global estimate of D by combining the individual bit fields with bitwise and.

Since the need to regularly VACUUM tables hit by updated or deleted won't go away any time soon, we could piggy-back the bit field rebuilding onto VACUUM to avoid a second scan.

A good value for B would probably be around 1000*<size of bitfield>/<page size>. If the bitfield needs ~100k, that'd make B ~= 12000 pages ~= 100MB.

best regards, Florian Pflug
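PS: For the special case of a single hash function per value, the "how many 1-bits do you expect after D distinct values" inversion above is the classic linear counting estimate, D ~= -N * ln(z/N), where z is the number of zero bits. A toy sketch follows; the hash function, the bit field size and the test data are arbitrary assumptions of mine, not part of the proposal above (and the estimate degrades badly once the field gets close to full):

    #include <math.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define NBITS   65536                   /* 64k-bit field = 8 kB */

    static unsigned char field[NBITS / 8];

    /* toy 32-bit mixing hash (MurmurHash3 finalizer) */
    static uint32_t
    hash32(uint32_t v)
    {
        v ^= v >> 16;
        v *= 0x85ebca6bu;
        v ^= v >> 13;
        v *= 0xc2b2ae35u;
        v ^= v >> 16;
        return v;
    }

    static void
    add_value(uint32_t v)
    {
        uint32_t bit = hash32(v) % NBITS;

        field[bit / 8] |= (1 << (bit % 8));
    }

    static double
    estimate_ndistinct(void)
    {
        int     zero_bits = 0;
        int     i;

        for (i = 0; i < NBITS; i++)
            if (!(field[i / 8] & (1 << (i % 8))))
                zero_bits++;

        /* after D distinct values the expected fraction of zero bits is
         * (1 - 1/N)^D ~= e^(-D/N), hence D ~= -N * ln(zero_bits/N) */
        return -(double) NBITS * log((double) zero_bits / NBITS);
    }

    int
    main(void)
    {
        uint32_t i;

        memset(field, 0, sizeof(field));
        for (i = 0; i < 200000; i++)
            add_value(i % 30000);           /* 30000 distinct values */
        printf("estimated ndistinct: %.0f\n", estimate_ndistinct());
        return 0;
    }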
Dne 30.12.2010 15:43, Florian Pflug napsal(a): > On Dec 27, 2010, at 23:49 , Kevin Grittner wrote: >> Robert Haas <robertmhaas@gmail.com> wrote: >> >>> With respect to (b), I think I'd need to see a much more detailed >>> design for how you intend to make this work. Off the top of my >>> head there seems to be some pretty serious feasibility problems. >> >> I had one random thought on that -- it seemed like a large concern >> was that there would need to be at least an occasional scan of the >> entire table to rebuild the distinct value information. > > I believe we could actually avoid that. > > First, the paper "An Optimal Algorithm for the Distinct Elements Problem" > actually contains an algorithm which *does* handle deletes - it's called > "L_0" estimate there.

Hmmm, that's interesting. I know there's a part about L_0 estimation, but that's about estimating the Hamming norm of a vector - so I've ignored it as I thought we can't use it to estimate the number of distinct values. But if it really handles deletions and if we can use it, then it's really interesting.

> Second, as Tomas pointed out, the stream-based estimator is essentially a > simplified version of a Bloom filter. It starts out with a field of > N zero bits, and sets K of them to 1 for each value v in the stream. > Which bits are set to 1 depends on some hash function(s) H_i(v). It's > then easy to compute how many 1-bits you'd expect to find in the bit > field after seeing D distinct values, and by reversing that you can > estimate D from the number of 1-bits in the bit field.

No, I haven't said the stream-based estimators are simplified versions of a Bloom filter. I said the approach is very similar - all the algorithms use bitmaps and hash functions, but the algorithms (Bloom filter vs. probabilistic counting and adaptive sampling) are very different. The Bloom filter is much more straightforward. The other algorithms are much more sophisticated, which allows them to use less space.

> To avoid having to rescan large tables, instead of storing one such > bit field, we'd store one per B pages of data. We'd then only need > to scan a range of B pages around every updated or deleted tuple, > and could afterwards compute a new global estimate of D by combining > the individual bit fields with bitwise and.

I don't think this could help.

1) This works just with the Bloom filters, not with the other algorithms (you can't combine the segments using bitmap OR).

2) With heavily modified tables the updates are usually 'spread' through the whole table, so you'll have to rebuild all the segments anyway.

> Since the need to regularly VACUUM tables hit by updated or deleted > won't go away any time soon, we could piggy-back the bit field > rebuilding onto VACUUM to avoid a second scan.

Well, I guess it's a bit more complicated. First of all, there's a local VACUUM when doing HOT updates. Second, you need to handle inserts too (what if the table just grows?). But I'm not a VACUUM expert, so maybe I'm wrong and this is the right place to handle rebuilds of distinct stats.

I was thinking about something else - we could 'attach' the rebuild to an actual seq scan if the amount of changes reaches some threshold (since the last rebuild). Only if the amount of changes reaches a higher threshold would we rebuild the stats on our own.
Something like

  IF (# of updates + deletes > 5%) THEN wait for seq scan
  IF (# of updates + deletes > 10%) THEN rebuild the stats

I've found a nice ppt describing how Oracle does this: http://www.oraclegeek.net/downloads/One_Pass_Distinct_Sampling.ppt and there's a PDF version: http://www.oraclegeek.net/downloads/OnePassDistinctSampling.pdf

According to this, Oracle is using the probabilistic counting approach (see slide 26). And once they find out there were too many changes in the table, they rebuild the whole thing.

I'm not saying we should do exactly the same, just that this might be a good direction.

regards Tomas
Excerpts from Tomas Vondra's message of jue dic 30 16:38:03 -0300 2010: > > Since the need to regularly VACUUM tables hit by updated or deleted > > won't go away any time soon, we could piggy-back the bit field > > rebuilding onto VACUUM to avoid a second scan. > > Well, I guess it's a bit more complicated. First of all, there's a local > VACUUM when doing HOT updates. Second, you need to handle inserts too > (what if the table just grows?). > > But I'm not a VACUUM expert, so maybe I'm wrong and this is the right > place to handle rebuilds of distinct stats. I was thinking that we could have two different ANALYZE modes, one "full" and one "incremental"; autovacuum could be modified to use one or the other depending on how many changes there are (of course, the user could request one or the other, too; not sure what should be the default behavior). So the incremental one wouldn't worry about deletes, only inserts, and could be called very frequently. The other one would trigger a full table scan (or nearly so) to produce a better estimate in the face of many deletions. I haven't followed this discussion closely so I'm not sure that this would be workable. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > I was thinking that we could have two different ANALYZE modes, one > "full" and one "incremental"; autovacuum could be modified to use one or > the other depending on how many changes there are (of course, the user > could request one or the other, too; not sure what should be the default > behavior). How is an incremental ANALYZE going to work at all? It has no way to find out the recent changes in the table, for *either* inserts or deletes. Unless you want to seqscan the whole table looking for tuples with xmin later than something-or-other ... which more or less defeats the purpose. regards, tom lane
Excerpts from Tom Lane's message of jue dic 30 23:02:04 -0300 2010: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > I was thinking that we could have two different ANALYZE modes, one > > "full" and one "incremental"; autovacuum could be modified to use one or > > the other depending on how many changes there are (of course, the user > > could request one or the other, too; not sure what should be the default > > behavior). > > How is an incremental ANALYZE going to work at all? It has no way to > find out the recent changes in the table, for *either* inserts or > deletes. Unless you want to seqscan the whole table looking for tuples > with xmin later than something-or-other ... which more or less defeats > the purpose. Yeah, I was thinking that this incremental ANALYZE would be the stream in the "stream-based estimator" but evidently it doesn't work that way. The stream that needs to be passed to the estimator consists of new tuples as they are being inserted into the table, so this would need to be done by the inserter process ... or it'd need to transmit the CTIDs for someone else to stream them ... not an easy thing, in itself. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Dec 31, 2010, at 7:34 AM, Alvaro Herrera wrote: > Excerpts from Tom Lane's message of jue dic 30 23:02:04 -0300 2010: >> Alvaro Herrera <alvherre@commandprompt.com> writes: >>> I was thinking that we could have two different ANALYZE modes, one >>> "full" and one "incremental"; autovacuum could be modified to use one or >>> the other depending on how many changes there are (of course, the user >>> could request one or the other, too; not sure what should be the default >>> behavior). >> >> How is an incremental ANALYZE going to work at all? It has no way to >> find out the recent changes in the table, for *either* inserts or >> deletes. Unless you want to seqscan the whole table looking for tuples >> with xmin later than something-or-other ... which more or less defeats >> the purpose. > > Yeah, I was thinking that this incremental ANALYZE would be the stream > in the "stream-based estimator" but evidently it doesn't work that way. > The stream that needs to be passed to the estimator consists of new > tuples as they are being inserted into the table, so this would need to > be done by the inserter process ... or it'd need to transmit the CTIDs > for someone else to stream them ... not an easy thing, in itself.

Perhaps listen/notify could be used for this, now that it allows passing a payload.

BTW, if we reduce the frequency at which full scans of large tables are needed then presumably the cost of the scans could be largely ignored. If we don't need to scan frequently then we shouldn't care very much about how long a scan takes, which means we could throttle it heavily. Presumably even a heavily used system can spare 500kB/s of IO to perform background scanning...

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote: > How is an incremental ANALYZE going to work at all? How about a kind of continuous analyze ? Instead of analyzing just once and then drop the intermediate results, keep them on disk for all tables and then piggyback the background writer (or have a dedicated process if that's not algorithmically feasible) and before writing out stuff update the statistics based on the values found in modified buffers. Probably it could take a random sample of buffers to minimize overhead, but if it is done by a background thread the overhead could be minimal anyway on multi-core systems. Not sure this makes sense at all, but if yes it would deliver the most up to date statistics you can think of. Cheers, Csaba.
> On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote: >> How is an incremental ANALYZE going to work at all? > > How about a kind of continuous analyze ? > > Instead of analyzing just once and then drop the intermediate results, > keep them on disk for all tables and then piggyback the background > writer (or have a dedicated process if that's not algorithmically > feasible) and before writing out stuff update the statistics based on > the values found in modified buffers. Probably it could take a random > sample of buffers to minimize overhead, but if it is done by a > background thread the overhead could be minimal anyway on multi-core > systems.

Hi, the problem is you will eventually need to drop the results and rebuild it, as the algorithms do not handle deletes (ok, Florian mentioned an algorithm L_0 described in one of the papers, but I'm not sure we can use it).

I'm not sure a constantly running background process is a good idea. I'd prefer storing an info about the modified tuples somewhere, and starting analyze only when a given threshold is reached. I'm not sure how to do that, though.

Another thing I'm not sure about is where to store those intermediate stats (used to get the current estimate, updated incrementally). I was thinking about pg_stats but I'm not sure it's the right place - depending on the algorithm, this may be a few kilobytes up to several megabytes. And it's not needed except when updating it. Any ideas?

regards Tomas
On Jan 7, 2011, at 5:32 AM, tv@fuzzy.cz wrote: > Another thing I'm not sure about is where to store those intermediate > stats (used to get the current estimate, updated incrementally). I was > thinking about pg_stats but I'm not sure it's the right place - depending > on the algorithm, this may be a few kilobytes up to several megabytes. And > it's not needed except when updating it. Any ideas?

If you're using it essentially as a queue, maybe a resource fork makes sense?

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Dne 7.1.2011 20:56, Jim Nasby napsal(a): > On Jan 7, 2011, at 5:32 AM, tv@fuzzy.cz wrote: >> Another thing I'm not sure about is where to store those intermediate >> stats (used to get the current estimate, updated incrementally). I was >> thinking about pg_stats but I'm not sure it's the right place - depending >> on the algorithm, this may be a few kilobytes up to several megabytes. And >> it's not needed except when updating it. Any ideas? > > If you're using it essentially as a queue, maybe a resource fork makes sense?

A resource fork? Not sure what you mean, could you describe it in more detail?

regards Tomas
> A resource fork? Not sure what you mean, could you describe it in more > detail? Ooops, resource forks are a filesystem thing; we call them relation forks. From src/backend/storage/smgr/README: Relation Forks ============== Since 8.4, a single smgr relation can be comprised of multiple physical files, called relation forks. This allows storing additional metadata like Free Space information in additional forks, which can be grown and truncated independently of the main data file, while still treating it all as a single physical relation in system catalogs. It is assumed that the main fork, fork number 0 or MAIN_FORKNUM, always exists. Fork numbers are assigned in src/include/storage/relfilenode.h. Functions in smgr.c and md.c take an extra fork number argument, in addition to relfilenode and block number, to identify which relation fork you want to access. Since most code wants to access the main fork, a shortcut version of ReadBuffer that accesses MAIN_FORKNUM is provided in the buffer manager for convenience. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
> On Fri, 2011-01-07 at 12:32 +0100, tv@fuzzy.cz wrote: >> the problem is you will eventually need to drop the results and rebuild >> it, as the algorithms do not handle deletes (ok, Florian mentioned an >> algorithm L_0 described in one of the papers, but I'm not sure we can >> use >> it). > > Yes, but even then you can start with much better cards if you already > have an estimate of what it looks like, based on the fact that you did > continuous updating of it. For example you'll have a pretty good > estimate of the bounds of the number of distinct values, while if you > really start from scratch you have nothing to start with but assume that > you must cope with the complete range between all values are distinct -> > there's only a few of them.

Sure, using the previous estimate is a good idea. I just wanted to point out there is no reasonable way to handle deletes, so you have to drop the stats and rebuild them from scratch.

The biggest problem is not choosing reasonable parameters (some of the estimators can handle a few billion distinct values with something like 128kB of memory and less than 5% error). The really serious concern is the I/O generated by rebuilding the stats.

>> Another thing I'm not sure about is where to store those intermediate >> stats (used to get the current estimate, updated incrementally). > > The forks implementation proposed in other responses is probably the > best idea if usable. It will also solve you the problem of memory > limitations, at the expense of more resources used if the structure > grows too big, but it will still be probably fast enough if it stays > small and in cache.

Hm, the forks seem to be an interesting option. It's probably much better than storing that directly in memory (not a good idea if there is a lot of data). And you don't really need the data to get the estimate, it's needed just when updating the stats.

So the last thing I'm not sure about is how to store the changed rows, so that the update process can get a list of new values. Someone already offered LISTEN/NOTIFY, but I guess we could just write the ctids into a file (maybe another fork)?

regards Tomas
On Fri, 2011-01-07 at 12:32 +0100, tv@fuzzy.cz wrote: > the problem is you will eventually need to drop the results and rebuild > it, as the algorithms do not handle deletes (ok, Florian mentioned an > algorithm L_0 described in one of the papers, but I'm not sure we can use > it). Yes, but even then you can start with much better cards if you already have an estimate of what it looks like, based on the fact that you did continuous updating of it. For example you'll have a pretty good estimate of the bounds of the number of distinct values, while if you really start from scratch you have nothing to start with but assume that you must cope with the complete range between all values are distinct -> there's only a few of them. > I'm not sure a constantly running background process is a good idea. I'd > prefer storing an info about the modified tuples somewhere, and starting > analyze only when a given threshold is reached. I'm not sure how to do > that, though. > > Another thing I'm not sure about is where to store those intermediate > stats (used to get the current estimate, updated incrementally). The forks implementation proposed in other responses is probably the best idea if usable. It will also solve you the problem of memory limitations, at the expense of more resources used if the structure grows too big, but it will still be probably fast enough if it stays small and in cache. Cheers, Csaba.
Hi, a short update regarding the ndistinct stream estimators - I've implemented the estimators described in the papers I've mentioned in my previous posts. If you want to try it, the sources are available at github, at http://tvondra.github.com/pg_estimator (ignore the page, I have to update it, just download the sources). You can build the estimators as standalone benchmarks (build.sh) or as aggregate functions (build-agg.sh) - I guess the latter is much easier to play with. The installation is described on my blog, along with two simple benchmarks: http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

Feel free to play with it with different data, and if you notice something strange just let me know.

Several comments regarding the implemented estimators:

1) PCSA, Probabilistic Counting (both described in the 1985 paper)

- quite good performance, especially with respect to the memory requirements (95 and 495 B)

2) Adaptive Sampling (1990), Self-Learning Bitmap (2009)

- very different algorithms, almost the same performance (especially with a large number of distinct values)
- my personal favourites, as the memory requirements are still very reasonable (compared to the Bloom Counter)

3) Rough Estimator (2010)

- this algorithm is described as an "optimal algorithm" (not exactly, it's a building block of the optimal algorithm), but the results are not as good as with (1) and (2)
- one of the problems is it needs k-wise independent hash functions, and the MD5 hashing scheme used with the other algorithms probably does not conform to this condition :-(
- I've tried to implement such hash functions on my own (using a prime field and polynomials in modulo arithmetic), but the results were quite bad - if you know how to get such hash functions, let me know
- this algorithm is much more complex than the other algorithms, so if someone could do a review, that would be nice (maybe there is a bug resulting in the big estimate error)

4) Bloom Counter

- basically a Bloom filter + a counter (incremented when a new element is added to the filter)
- due to the design (no false negatives, possible false positives) it's significantly biased (gives underestimates), but I think that can be easily fixed with a coefficient (depending on the number of items / false positive error rate)
- needs significantly more memory to get similar performance as the other algorithms (a Bloom counter that can track 1 million distinct values needs about 1.2MB of memory, while an S-Bitmap that tracks 10 million items needs just 65kB)
- so unless we can use the special feature of the Bloom filter (i.e. the ability to detect items that are in the data), this is a waste of memory

I know Google uses Bloom filters in BigTable to limit I/O, i.e. before doing the I/O they ask the Bloom filter if there are such data - if the answer is no, they don't have to do the I/O (false negatives not allowed), if the answer is yes they do the I/O. Not sure if we can / want to do something like this in PostgreSQL, as it significantly depends on the workload (and in most cases we know the data are there, so we have to do the I/O anyway). But I guess it might be a good idea for a contrib module (to maintain the Bloom filter on your own and make the decision in the application). The tricky part here is to keep the Bloom filter up to date.

I'll try to fix the optimal estimator (not sure how to do that right now), and I'll investigate the proposed 'relation fork' as a way to store the estimator.

regards Tomas
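PS: In case it helps to see the idea from point 4 in isolation, here is a stripped-down sketch of a Bloom counter - a plain Bloom filter plus a counter that is incremented only when the inserted value does not already appear to be in the filter. The size, number of hash functions and hashing scheme below are arbitrary choices for illustration, not the parameters used in pg_estimator:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define M   1048576             /* filter size in bits (128 kB) */
    #define K   4                   /* hash functions per value */

    static unsigned char filter[M / 8];
    static uint64_t ndistinct = 0;

    /* toy seeded 32-bit mixing hash (MurmurHash3 finalizer) */
    static uint32_t
    hash32(uint32_t v, uint32_t seed)
    {
        v ^= seed;
        v ^= v >> 16;
        v *= 0x85ebca6bu;
        v ^= v >> 13;
        v *= 0xc2b2ae35u;
        v ^= v >> 16;
        return v;
    }

    static void
    bloom_add(uint32_t v)
    {
        int     i;
        int     already = 1;

        for (i = 0; i < K; i++)
        {
            uint32_t bit = hash32(v, i) % M;

            if (!(filter[bit / 8] & (1 << (bit % 8))))
            {
                already = 0;
                filter[bit / 8] |= (1 << (bit % 8));
            }
        }

        /* no false negatives, so a genuinely new value is missed only when
         * all K of its bits collide with earlier values - that's the source
         * of the underestimate mentioned above */
        if (!already)
            ndistinct++;
    }

    int
    main(void)
    {
        uint32_t i;

        memset(filter, 0, sizeof(filter));
        for (i = 0; i < 500000; i++)
            bloom_add(i % 100000);  /* 100000 distinct values */
        printf("bloom counter estimate: %llu\n",
               (unsigned long long) ndistinct);
        return 0;
    }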
Dne 9.1.2011 13:58, Jim Nasby napsal(a): >> A resource fork? Not sure what you mean, could you describe it in more >> detail? > > Ooops, resource forks are a filesystem thing; we call them relation forks. > From src/backend/storage/smgr/README:

OK, I think using relation forks seems like a good solution. I've done some basic research and I think these are the basic steps when adding a new fork:

1) define a new item in the ForkNum enum (relfilenode.h) - this should be something like DISTINCT_FORK I guess

2) modify the ForkNames (catalog.c) and the relpathbackend so that the proper filename is assigned to the fork

And then it will be accessed through smgr (smgrcreate etc.). Am I right or is there something else I need to do?

There are a few open questions though:

1) Forks are 'per relation' but the distinct estimators are 'per column' (or 'per group of columns') so I'm not sure whether the file should contain all the estimators for the table, or if there should be one fork for each estimator. The former is a bit difficult to manage, the latter somehow breaks the current fork naming convention.

2) Where to keep the info that there is an estimator for a column? I guess we could put this into pg_attribute (it's one boolean). But what about the estimators for groups of columns? Because that's why I'm building this - to get distinct estimates for groups of columns. I guess we'll need a new system catalog to track this? (The same will be true for multi-dimensional histograms anyway.)

3) I still am not sure how to manage the updates, i.e. how to track the new values.

One possibility might be to do that synchronously - whenever a new item is inserted into the table, check if there's an estimator and update it. Updating the estimators is quite efficient (e.g. the bitmap manages to do 10,000,000 inserts in 9 seconds on my ancient workstation), although there might be issues with locking etc.

The other possibility is to update the estimator asynchronously, i.e. store the new values somewhere (or just the ctid of the row), and then process them periodically. I'm not sure how to intercept the new rows and where to store them. In another fork? Somewhere else?

regards Tomas
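PS: For illustration, steps 1) and 2) might look roughly like the sketch below. The existing enum members and fork names are reproduced from the current sources as far as I remember them (so please verify against relfilenode.h and catalog.c), while DISTINCT_FORKNUM and the "dst" file suffix are of course just made-up placeholders:

    /* src/include/storage/relfilenode.h (sketch, comments trimmed) */
    typedef enum ForkNumber
    {
        InvalidForkNumber = -1,
        MAIN_FORKNUM = 0,
        FSM_FORKNUM,
        VISIBILITYMAP_FORKNUM,
        DISTINCT_FORKNUM            /* hypothetical: ndistinct estimators */
    } ForkNumber;

    #define MAX_FORKNUM     DISTINCT_FORKNUM

    /* src/backend/catalog/catalog.c - the name array has to match the enum */
    const char *forkNames[] = {
        "main",                     /* MAIN_FORKNUM */
        "fsm",                      /* FSM_FORKNUM */
        "vm",                       /* VISIBILITYMAP_FORKNUM */
        "dst"                       /* DISTINCT_FORKNUM (hypothetical) */
    };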
On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote: > 1) Forks are 'per relation' but the distinct estimators are 'per > column' (or 'per group of columns') so I'm not sure whether the file > should contain all the estimators for the table, or if there should > be one fork for each estimator. The former is a bit difficult to > manage, the latter somehow breaks the current fork naming convention.

Yeah, when I looked at the fork stuff I was disappointed to find out there's essentially no support for dynamically adding forks. There's two other possible uses for that I can think of:

- Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair amount of overhead we pay for the current setup.
- Dynamic forks would make it possible to do a column-store database, or at least something approximating one.

Without some research, there's no way to know if either of the above makes sense; but without dynamic forks we're pretty much dead in the water. So I wonder what it would take to support dynamically adding forks...

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote: > - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair amount of overhead we pay for the current setup.

That seems like an interesting idea, but I actually don't see why it would be any more efficient, and it seems like you'd end up reinventing things like vacuum and free space map management.

> - Dynamic forks would make it possible to do a column-store database, or at least something approximating one.

I've been wondering whether we could do something like this by treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two tables t1 and t2, one with columns pk, a1, a2, a3 and the other with columns pk, b1, b2, b3. SELECT * FROM t would be translated into SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote: >> 1) Forks are 'per relation' but the distinct estimators are 'per >> column' (or 'per group of columns') so I'm not sure whether the file >> should contain all the estimators for the table, or if there should >> be one fork for each estimator. The former is a bit difficult to >> manage, the latter somehow breaks the current fork naming convention. > > Yeah, when I looked at the fork stuff I was disappointed to find out > there's essentially no support for dynamically adding forks. There's two > other possible uses for that I can think of: > > - Forks are very possibly a more efficient way to deal with TOAST than > having separate tables. There's a fair amount of overhead we pay for the > current setup. > - Dynamic forks would make it possible to do a column-store database, or > at least something approximating one. > > Without some research, there's no way to know if either of the above makes > sense; but without dynamic forks we're pretty much dead in the water. > > So I wonder what it would take to support dynamically adding forks...

Interesting ideas, but a bit out of scope. I think I'll go with one fork containing all the estimators for now, although it might be inconvenient in some cases. I was thinking about rebuilding a single estimator with increased precision - in that case the size changes so that all the other data has to be shifted. But this won't be very common (usually all the estimators will be rebuilt at the same time), and it's actually doable.

So the most important question is how to intercept the new/updated rows, and where to store them. I think each backend should maintain it's own private list of new records and forward them only in case of commit. Does that sound reasonable?

regards Tomas
On Tue, Jan 18, 2011 at 4:53 AM, <tv@fuzzy.cz> wrote: > So the most important question is how to intercept the new/updated rows, > and where to store them. I think each backend should maintain it's own > private list of new records and forward them only in case of commit. Does > that sound reasonable? At the risk of sounding demoralizing, nothing about this proposal sounds very promising to me, and that sounds like a particularly bad idea. What happens if the transaction updates a billion records? Or even a million records? Are you going to store all of those in backend-local memory until commit time? Or spool them in a potentially-gigantic disk file somewhere? That memory allocation - or file - could grow to be larger than the size of the entire database in the worst case. And COMMIT could take an awfully long time if it has to spool megabytes or gigabytes of data off to some other process. And what happens if there's a crash after the COMMIT but before all that data is sent? The estimates become permanently wrong? And are we doing all of this just to get a more accurate estimate of ndistinct? For the amount of effort that it will probably take to get this working at all, you could probably implement index-only scans and have enough bandwidth left over to tackle global temporary tables. And unless I'm missing something, the chances that the performance consequences will be tolerable are pretty much zero. And it would only benefit the tiny fraction of users for whom bad n_distinct estimates cause bad plans, and then the even tinier percentage of those who can't conveniently fix it by using the manual override that we already have in place - which presumably means people who have gigantic tables that are regularly rewritten with massive changes in the data distribution that affect plan choice. Is that more than the empty set? Maybe the idea here is that this wouldn't fix just ndistinct estimates but would also help with multi-column statistics. Even if that's the case, I think it's almost certainly a dead end from a performance standpoint. Some kind of manual ANALYZE process that can be invoked when needed to scan the entire table would be painful but maybe acceptable for certain people with a problem they can't fix any other way, but doing this automatically for everyone seems like a really bad idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: > On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote: >> - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair amount of overhead we pay for the current setup. > > That seems like an interesting idea, but I actually don't see why it > would be any more efficient, and it seems like you'd end up > reinventing things like vacuum and free space map management.

The FSM would take some effort, but I don't think vacuum would be that hard to deal with; you'd just have to free up the space in any referenced toast forks at the same time that you vacuumed the heap.

>> - Dynamic forks would make it possible to do a column-store database, or at least something approximating one. > > I've been wondering whether we could do something like this by > treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two > tables t1 and t2, one with columns pk, a1, a2, a3 and the other with > columns pk, b1, b2, b3. SELECT * FROM t would be translated into > SELECT * FROM t1, t2 WHERE t1.pk = t2.pk.

Possibly, but you'd be paying tuple overhead twice, which is what I was looking to avoid with forks.

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby <jim@nasby.net> wrote: > On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: >> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote: >>> - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair amount of overhead we pay for the current setup. >> >> That seems like an interesting idea, but I actually don't see why it >> would be any more efficient, and it seems like you'd end up >> reinventing things like vacuum and free space map management. > > The FSM would take some effort, but I don't think vacuum would be that hard to deal with; you'd just have to free up the space in any referenced toast forks at the same time that you vacuumed the heap.

How's that different from what vacuum does on a TOAST table now?

>>> - Dynamic forks would make it possible to do a column-store database, or at least something approximating one. >> >> I've been wondering whether we could do something like this by >> treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two >> tables t1 and t2, one with columns pk, a1, a2, a3 and the other with >> columns pk, b1, b2, b3. SELECT * FROM t would be translated into >> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk. > > Possibly, but you'd be paying tuple overhead twice, which is what I was looking to avoid with forks.

What exactly do you mean by "tuple overhead"?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby <jim@nasby.net> wrote: >> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: > On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote: >>> - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair amount of overhead we pay for the current setup. > > That seems like an interesting idea, but I actually don't see why it > would be any more efficient, and it seems like you'd end up > reinventing things like vacuum and free space map management. >> >> The FSM would take some effort, but I don't think vacuum would be that hard to deal with; you'd just have to free up the space in any referenced toast forks at the same time that you vacuumed the heap. > How's that different from what vacuum does on a TOAST table now?

Even more to the point: Jim hasn't provided one single reason to suppose that this would be better-performing than the existing approach. It looks to me like a large amount of work, and loss of on-disk compatibility, for nothing at all except the sake of change.

regards, tom lane
On Jan 18, 2011, at 11:24 AM, Robert Haas wrote: > On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby <jim@nasby.net> wrote: >> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: >>> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote: >>>> - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair amount of overhead we pay for the current setup. >>> >>> That seems like an interesting idea, but I actually don't see why it >>> would be any more efficient, and it seems like you'd end up >>> reinventing things like vacuum and free space map management. >> >> The FSM would take some effort, but I don't think vacuum would be that hard to deal with; you'd just have to free up the space in any referenced toast forks at the same time that you vacuumed the heap. > > How's that different from what vacuum does on a TOAST table now?

TOAST vacuum is currently an entirely separate vacuum. It might run at the same time as the main table vacuum, but it still has all the work that would be associated with vacuuming a table with the definition of a toast table. In fact, at one point vacuuming toast took two passes: the first deleted the toast rows that were no longer needed, then you had to go back and vacuum those deleted rows.

>>>> - Dynamic forks would make it possible to do a column-store database, or at least something approximating one. >>> >>> I've been wondering whether we could do something like this by >>> treating a table t with columns pk, a1, a2, a3, b1, b2, b3 as two >>> tables t1 and t2, one with columns pk, a1, a2, a3 and the other with >>> columns pk, b1, b2, b3. SELECT * FROM t would be translated into >>> SELECT * FROM t1, t2 WHERE t1.pk = t2.pk. >> >> Possibly, but you'd be paying tuple overhead twice, which is what I was looking to avoid with forks. > > What exactly do you mean by "tuple overhead"?

typedef struct HeapTupleHeaderData. With only two tables it might not be that bad, depending on the fields. Beyond two tables it's almost certainly a loser.

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Jan 18, 2011, at 11:32 AM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby <jim@nasby.net> wrote: >>> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: >> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby <jim@nasby.net> wrote: >>>> - Forks are very possibly a more efficient way to deal with TOAST than having separate tables. There's a fair amount of overhead we pay for the current setup. >> >> That seems like an interesting idea, but I actually don't see why it >> would be any more efficient, and it seems like you'd end up >> reinventing things like vacuum and free space map management. >>> >>> The FSM would take some effort, but I don't think vacuum would be that hard to deal with; you'd just have to free up the space in any referenced toast forks at the same time that you vacuumed the heap. > >> How's that different from what vacuum does on a TOAST table now? > > Even more to the point: Jim hasn't provided one single reason to suppose > that this would be better-performing than the existing approach. It > looks to me like a large amount of work, and loss of on-disk > compatibility, for nothing at all except the sake of change. Yes, I was only pointing out that there were possible uses for allowing a variable number of forks per relation if Tomas felt the need to go that direction. Changing toast would certainly require some very convincing arguments as to the benefits. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Tue, Jan 18, 2011 at 12:32 PM, Jim Nasby <jim@nasby.net> wrote: >> How's that different from what vacuum does on a TOAST table now? > > TOAST vacuum is currently an entirely separate vacuum. It might run at the same time as the main table vacuum, but it still has all the work that would be associated with vacuuming a table with the definition of a toast table. In fact, at one point vacuuming toast took two passes: the first deleted the toast rows that were no longer needed, then you had to go back and vacuum those deleted rows. I don't know whether it still works that way or not, but if it does, fixing it could perhaps be done with far less drastic surgery than what you're proposing here. >>> Possibly, but you'd be paying tuple overhead twice, which is what I was looking to avoid with forks. >> >> What exactly do you mean by "tuple overhead"? > > typedef struct HeapTupleHeaderData. With only two tables it might not be that bad, depending on the fields. Beyond two tables it's almost certainly a loser. A struct definition by itself doesn't cause overhead. Are you talking about storage on disk? CPU usage for MVCC visibility testing? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 18.1.2011 18:12, Robert Haas wrote: > On Tue, Jan 18, 2011 at 4:53 AM, <tv@fuzzy.cz> wrote: >> So the most important question is how to intercept the new/updated rows, >> and where to store them. I think each backend should maintain it's own >> private list of new records and forward them only in case of commit. Does >> that sound reasonable? > > At the risk of sounding demoralizing, nothing about this proposal > sounds very promising to me, and that sounds like a particularly bad > idea. What happens if the transaction updates a billion records? Or > even a million records? Are you going to store all of those in > backend-local memory until commit time? Or spool them in a > potentially-gigantic disk file somewhere? That memory allocation - or > file - could grow to be larger than the size of the entire database in > the worst case. And COMMIT could take an awfully long time if it has > to spool megabytes or gigabytes of data off to some other process. > And what happens if there's a crash after the COMMIT but before all > that data is sent? The estimates become permanently wrong? Yes, I was aware of this problem (amount of memory consumed with lots of updates), and I kind of hoped someone will come up with a reasonable solution. I was thinking about it and I believe at least one of the algorithms (the "adaptive sampling") might handle "merging" i.e. the backends would not need to store the list of values, just a private copy of the estimator and update it. And at the end (after commit), the estimator would be merged back into the actual one. So the algorithm would be something like this: 1. create copy for all distinct estimators influenced by the INSERT 2. update the local copy 3. commit and merge the local copies back into the original estimator The amount of copied data is very small, depending on the number of distinct items it can handle and the precision (see adaptive.c for details). Some examples: 2^48 items + 1% error => 86 kB; 2^48 items + 5% error => 3.4 kB; 2^64 items + 1% error => 115 kB; 2^64 items + 5% error => 4.6 kB. So it's much better than storing all the data. But I really need to check that this is possible (right now I'm quite busy with organizing a local conference, so maybe next week). Regarding the crash scenario - if the commit fails, just throw away the local estimator copy, it's not needed. I'm not sure how to take care of the case when commit succeeds and the write of the merged estimator fails, but I think it might be possible to write the estimator to xlog or something like that. So it would be replayed during recovery etc. Or is it a stupid idea? > And are we doing all of this just to get a more accurate estimate of > ndistinct? For the amount of effort that it will probably take to get > this working at all, you could probably implement index-only scans and > have enough bandwidth left over to tackle global temporary tables. > And unless I'm missing something, the chances that the performance > consequences will be tolerable are pretty much zero. And it would > only benefit the tiny fraction of users for whom bad n_distinct > estimates cause bad plans, and then the even tinier percentage of > those who can't conveniently fix it by using the manual override that > we already have in place - which presumably means people who have > gigantic tables that are regularly rewritten with massive changes in > the data distribution that affect plan choice. Is that more than the > empty set? First of all, I've never proposed to use this as a default behaviour.
Actually, I've strongly argued against that idea on several occasions. So the user would have to enable it explicitly for some columns; I really am not an idiot who would force everyone to pay for this. So the goal is to help the "small fraction" of users and keep the current solution for the rest of them. And yes, the main reason why I am working on this is the multi-column stats (in case of discrete columns). You can fix the ndistinct estimate by manually overriding the estimate, but for the multi-column stats you need an estimate of ndistinct for the combination of columns, and that's a bit difficult to do manually. The other reason is that I've studied statistics (and I enjoy it, which seems weird to most people). And it's a good way to learn the internals. > Maybe the idea here is that this wouldn't fix just ndistinct estimates > but would also help with multi-column statistics. Even if that's the > case, I think it's almost certainly a dead end from a performance > standpoint. Some kind of manual ANALYZE process that can be invoked > when needed to scan the entire table would be painful but maybe > acceptable for certain people with a problem they can't fix any other > way, but doing this automatically for everyone seems like a really bad > idea. Yes, as I've mentioned above, the multi-column stats are the reason why I started working on this. And yes, it really should work like this: 1. user realizes there's something wrong as the plans are bad 2. after analysis, the user realizes the reason is an imprecise estimate due to dependence between columns 3. the user enables cross-column stats on the columns and checks if it actually solved the issue 4. if the cost outweighs the gains, the user drops the stats Does that sound reasonable? regards Tomas
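To make the copy / update-locally / merge-at-commit flow described above concrete, here is a toy, self-contained C sketch. It is not PostgreSQL code: a simple linear-counting bitmap stands in for the adaptive-sampling estimator, and the bitmap size, hash function and test data are arbitrary. The point is only that the per-transaction state is a small fixed-size structure and that the commit-time merge is a cheap bitwise OR - which is also why a ROLLBACK can simply discard the local copy.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBITS 8192                      /* bitmap size; real sizes depend on the
                                         * required precision (cf. adaptive.c) */

typedef struct { uint8_t bits[NBITS / 8]; } Estimator;

static uint64_t hash64(uint64_t x)      /* splitmix64 finalizer as a stand-in hash */
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static void est_add(Estimator *e, uint64_t value)
{
    uint64_t bit = hash64(value) % NBITS;
    e->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
}

static double est_ndistinct(const Estimator *e)   /* linear-counting estimate */
{
    int zeros = 0;
    for (int i = 0; i < NBITS; i++)
        if (!(e->bits[i / 8] & (1 << (i % 8))))
            zeros++;
    return -(double) NBITS * log((double) zeros / NBITS);
}

/* Merging is a bitwise OR, so it also preserves whatever other backends
 * may have added to the shared estimator in the meantime. */
static void est_merge(Estimator *into, const Estimator *from)
{
    for (int i = 0; i < NBITS / 8; i++)
        into->bits[i] |= from->bits[i];
}

int main(void)
{
    Estimator shared, local;
    memset(&shared, 0, sizeof(shared));

    for (uint64_t v = 0; v < 5000; v++)      /* values already in the table */
        est_add(&shared, v);

    local = shared;                          /* 1. private copy for the transaction */
    for (uint64_t v = 4000; v < 7000; v++)   /* 2. the transaction's inserts */
        est_add(&local, v);

    est_merge(&shared, &local);              /* 3. merge back at COMMIT */
    printf("estimated ndistinct: %.0f (true value: 7000)\n",
           est_ndistinct(&shared));
    return 0;
}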
On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > Yes, I was aware of this problem (amount of memory consumed with lots of > updates), and I kind of hoped someone will come up with a reasonable > solution. As far as I can see, periodically sampling some or all of the table is really the only thing that has a chance of working OK. You could try to track changes only when they're not too big, but I think that's still going to have nasty performance consequences. > I was thinking about it and I believe at least one of the algorithms > (the "adaptive sampling") might handle "merging" i.e. the backends would > not need to store the list of values, just a private copy of the > estimator and update it. And at the end (after commit), the estimator > would be merged back into the actual one. > > So the algorithm would be something like this: > > 1. create copy for all distinct estimators influenced by the INSERT > 2. update the local copy > 3. commit and merge the local copies back into the original estimator Well, maybe. But DELETEs seem like a problem. And it's still adding foreground work to every query, which I just have a hard time believing is ever going to work out > Regarding the crash scenario - if the commit fails, just throw away the > local estimator copy, it's not needed. I'm not sure how to take care of > the case when commit succeeds and the write of the merged estimator > fails, but I think it might be possible to write the estimator to xlog > or something like that. So it would be replayed during recovery etc. Or > is it a stupid idea? It's not stupid, in the sense that that is what you'd need to do if you want to avoid ever having to rescan the table. But it is another thing that I think is going to be way too expensive. > Yes, as I've mentioned above, the multi-column stats are the reason why > I started working on this. And yes, it really should work like this: > > 1. user realizes there's something wrong as the plans are bad > 2. after analysis, the user realizes the reason is an imprecise > estimate due to dependence between columns > 3. the user enables cross-column stats on the columns and checks > if it actually solved the issue > 4. if the cost outweights the gains, the user drops the stats > > Does that sound reasonable? Yes. The only caveat I'm trying to insert is that it's hard for me to see how the proposed approach could ever be cheap enough to be a win. If we need some kind of more expensive kind of ANALYZE that scans the whole table, and it's optional, sure, why not? But that's a one-time hit. You run it, it sucks, it's over, and now your queries work. Odds are good you may never need to touch it again. Now, if we can convince ourselves that multi-column stats are likely to require constant updating, then maybe there would be more to recommend the stream-based approach, although even then it seems dreadfully complex and expensive to me. But I bet these things are pretty static. If the city and state tend to imply the zip code when there are 10K rows in the table, they probably still will when there are 100K rows in the table. If users with org_id = 1 have a higher karma score on average than users with org_id != 1, that'll probably still be true when there are more users in both classes. Whether the database can understand that without rescanning the table depends on the data representation, of course, but I guess I'm just saying I'd think really, really hard before abandoning the idea of periodic sampling. 
You have to get an awful lot of benefit out of those cross-column stats to make it worth paying a run-time cost for them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... I guess I'm just saying I'd think really, really hard > before abandoning the idea of periodic sampling. You have to get an > awful lot of benefit out of those cross-column stats to make it worth > paying a run-time cost for them. I've been trying to not discourage Tomas from blue-sky speculation, but I have to say I agree with Robert that the chance of any usable result from this approach is very very small. Feel free to do some experimentation to prove us wrong --- but don't put a lot of effort into it before that. regards, tom lane
On 19.1.2011 20:25, Robert Haas wrote: > On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> Yes, I was aware of this problem (amount of memory consumed with lots of >> updates), and I kind of hoped someone will come up with a reasonable >> solution. > > As far as I can see, periodically sampling some or all of the table is > really the only thing that has a chance of working OK. You could try > to track changes only when they're not too big, but I think that's > still going to have nasty performance consequences. Yes, sampling all the table is the only way to get reliable ndistinct estimates. I'm not sure what you mean by 'tracking changes' - as I've mentioned in the previous post, this might be solved by updating a local copy, which requires constant space (a few kB, see the previous post). Is that acceptable? I don't know, that's up to the user if he wants to pay this price. >> So the algorithm would be something like this: >> >> 1. create copy for all distinct estimators influenced by the INSERT >> 2. update the local copy >> 3. commit and merge the local copies back into the original estimator > > Well, maybe. But DELETEs seem like a problem. And it's still adding > foreground work to every query, which I just have a hard time > believing is ever going to work out Yes, deletes are difficult to handle. My idea was to compute something like ((tuples changed + tuples deleted) / tuples total), and indicate that a rebuild of the estimator is needed if this coefficient changes too much - e.g. log a message or something like that. >> Regarding the crash scenario - if the commit fails, just throw away the >> local estimator copy, it's not needed. I'm not sure how to take care of >> the case when commit succeeds and the write of the merged estimator >> fails, but I think it might be possible to write the estimator to xlog >> or something like that. So it would be replayed during recovery etc. Or >> is it a stupid idea? > > It's not stupid, in the sense that that is what you'd need to do if > you want to avoid ever having to rescan the table. But it is another > thing that I think is going to be way too expensive. Way too expensive? All you need to put into the logfile is a copy of the estimator, which is a few kBs. How is that 'way too expensive'? Sure, it might be expensive when the user makes a lot of small changes in separate transactions, that's true, but I guess we could minimize the amount of data written to the xlog by doing a diff of the estimators or something like that. >> Yes, as I've mentioned above, the multi-column stats are the reason why >> I started working on this. And yes, it really should work like this: >> >> 1. user realizes there's something wrong as the plans are bad >> 2. after analysis, the user realizes the reason is an imprecise >> estimate due to dependence between columns >> 3. the user enables cross-column stats on the columns and checks >> if it actually solved the issue >> 4. if the cost outweighs the gains, the user drops the stats >> >> Does that sound reasonable? > > Yes. The only caveat I'm trying to insert is that it's hard for me to > see how the proposed approach could ever be cheap enough to be a win. IMHO the question is not 'How expensive is that?' but rather 'How expensive is it compared to the gains?' If the user gets much better estimates and then a much better plan, then this may be a completely acceptable price.
> If we need some kind of more expensive kind of ANALYZE that scans the > whole table, and it's optional, sure, why not? But that's a one-time > hit. You run it, it sucks, it's over, and now your queries work. > Odds are good you may never need to touch it again. Now, if we can > convince ourselves that multi-column stats are likely to require > constant updating, then maybe there would be more to recommend the > stream-based approach, although even then it seems dreadfully complex > and expensive to me. But I bet these things are pretty static. If No, the multi-column statistics do not require constant updating. There are cases where a sampling is perfectly fine, although you may need a bit larger sample. Generally if you can use a multi-dimensional histogram, you don't need to scan the whole table. But the multi-dimensional histograms are not applicable to some cases - especially the ZIP code fail case that was repeatedly discussed. So in case of discrete data, we need a different approach - and the solution I've proposed is based on using ndistinct estimates to get the estimates (actually it's based on one of the papers listed on the wiki). > and expensive to me. But I bet these things are pretty static. If > the city and state tend to imply the zip code when there are 10K rows > in the table, they probably still will when there are 100K rows in the > table. If users with org_id = 1 have a higher karma score on average OK, how will you measure the "implicativeness"? We have discussed this in another thread and there are a lot of gotchas, although it seems like a really simple problem. The solution based on ndistinct estimates turned out to be a reasonable approach that's worth a try. regards Tomas
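A minimal sketch of the rebuild trigger described in the message above - track the fraction of deleted and updated tuples and flag the estimator for a full-scan rebuild once it crosses a threshold. The struct, the field names and the 20% threshold are all made up for illustration; none of this is an existing PostgreSQL API.

#include <stdbool.h>

typedef struct EstimatorStats
{
    double tuples_total;     /* live tuples when the estimator was last (re)built */
    double tuples_changed;   /* UPDATEs seen since then */
    double tuples_deleted;   /* DELETEs seen since then */
} EstimatorStats;

static bool
estimator_needs_rebuild(const EstimatorStats *s, double threshold)
{
    if (s->tuples_total <= 0)
        return true;         /* nothing built yet - scan the table */
    return (s->tuples_changed + s->tuples_deleted) / s->tuples_total > threshold;
}

/* e.g. estimator_needs_rebuild(&stats, 0.20) could be checked periodically
 * (say, by autovacuum) and would merely log a message or schedule a rebuild,
 * matching the "log a message or something like that" idea above. */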
On 19.1.2011 20:46, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> ... I guess I'm just saying I'd think really, really hard >> before abandoning the idea of periodic sampling. You have to get an >> awful lot of benefit out of those cross-column stats to make it worth >> paying a run-time cost for them. > > I've been trying to not discourage Tomas from blue-sky speculation, > but I have to say I agree with Robert that the chance of any usable > result from this approach is very very small. Feel free to do some > experimentation to prove us wrong --- but don't put a lot of effort > into it before that. > > regards, tom lane Discourage? Not really - I did not expect this to be a simple thing to implement. And yes, it might turn out to be a dead end eventually. OTOH the approach "it seems really expensive so let's not try to build it" is a certain dead end, so I'm not going to surrender due to such speculations (although those are valid concerns and I'll have to address them in the future). regards Tomas
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > Dne 19.1.2011 20:25, Robert Haas napsal(a): >> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> Yes, I was aware of this problem (amount of memory consumed with lots of >>> updates), and I kind of hoped someone will come up with a reasonable >>> solution. >> >> As far as I can see, periodically sampling some or all of the table is >> really the only thing that has a chance of working OK. You could try >> to track changes only when they're not too big, but I think that's >> still going to have nasty performance consequences. > > Yes, sampling all the table is the only way to get reliable ndistinct > estimates. IMHO continuing to focus on worst case results is a dead end. Yes, for some distributions, ndistinct is very hard to estimate well. However, let us not forget that the current ndistinct estimator is very bad but the postgres planner is, on the whole, very good. As far as I can see this is due to two facts: 1) The distribution of values in a table is rarely pathological, and usually follows one of several common patterns. ( IOW, we have good heuristics ) 2) The choice of plan is fairly robust to mis-estimation of ndistinct, because there are only a few things the executor can choose. It doesn't usually matter if a value composes 5% or 20% ( or 99% ) of the table, we still usually need to scan the entire table. Again, in my *very* humble opinion, if the ultimate goal is to improve the planner, we should try to cut down on the number of cases in which a poor estimate of ndistinct gives a really bad plan instead of chasing after marginal gains from a better ndistinct estimator. Maybe ( as I've advocated for before ) this means parameterizing the distribution of values ( ie, find that the values are roughly uniform ), maybe this means estimating the error of our statistics and using the most robust rather than the best plan, or maybe it means something totally different. But: ( and the literature seems to support me ) pounding away at the ndistinct estimator seems like a dead end. If you think about it, it's a bit ridiculous to look at the whole table *just* to "estimate" ndistinct - if we go that far why don't we just store the full distribution and be done with it? Best, Nathan
On 19.1.2011 23:44, Nathan Boley wrote: > 1) The distribution of values in a table is rarely pathological, and > usually follows one of several common patterns. ( IOW, we have good > heuristics ) Not true. You're missing the goal of this effort - to get ndistinct estimate for combination of multiple columns. Which is usually pathological in case of dependent columns. Which is exactly the case when the user will explicitly enable this feature to get better estimates (and thus better plans). > 2) The choice of plan is fairly robust to mis-estimation of ndistinct, > because there are only a few things the executer can choose. It > doesn't usually matter if a value composes 5% or 20% ( or 99% ) of > the table, we still usually need to scan the entire table. Again, don't think about a single column (although even in that case there are known fail cases). Think about multiple columns and the number of distinct combinations. With attribute value independence assumption, this is usually significantly underestimated. And that's a problem as it often leads to an index scan instead of sequential scan (or something like that). > Again, in my *very* humble opinion, If the ultimate goal is to improve > the planner, we should try to cut down on the number of cases in which > a poor estimate of ndistinct gives a really bad plan instead of > chasing after marginal gains from a better ndistinct estimator. Maybe What is a marginal gain? The ultimate goal is to build cross-column stats, because the case of dependent columns usually results in poor plans. How is fixing that a marginal gain? Yes, it may turn out it's not worth it, but it's a bit early to say that without an implementation and some hard data. > ( as I've advocated for before ) this means parameterizing the > distribution of values ( ie, find that the values are roughly uniform > ), maybe this means estimating the error of our statistics and using > the most robust rather than the best plan, or maybe it means something > totally different. But: ( and the literature seems to support me ) Which literature? I've read an awful lot of papers on this topic lately, and I don't remember anything like that. If there's an interesting paper, I'd like to read it. Yes, all the papers state estimating a ndistinct is a particularly tricky task, but that's somehow expected? I've even checked how other databases do this - e.g. Oracle does it very similarly to what I proposed (the user has to enable the feature, it has to be rebuilt from time to time, etc.). I'm not saying we should do everything like Oracle, but they are not clueless idiots. > pounding away at the ndistinct estimator seems like a dead end. If you > think about it, it's a bit ridiculous to look at the whole table > *just* to "estimate" ndistinct - if we go that far why dont we just > store the full distribution and be done with it? No, I'm not trying to do this just to improve the ndistinct estimate. The goal is to improve the cardinality estimate in case of dependent columns, and one of the papers depends on good ndistinct estimates. And actually it does not depend on ndistinct for the columns only, it depends on ndistinct estimates for the combination of columns. So improving the ndistinct estimates for columns is just a necessary first step (and only if it works reasonably well, we can do the next step). regards Tomas
On Jan19, 2011, at 23:44 , Nathan Boley wrote: > If you think about it, it's a bit ridiculous to look at the whole table > *just* to "estimate" ndistinct - if we go that far why dont we just > store the full distribution and be done with it? The crucial point that you're missing here is that ndistinct provides an estimate even if you *don't* have a specific value to search for at hand. This is way more common than you may think; it happens e.g. every time you PREPARE a statement with parameters. Even knowing the full distribution has no advantage in this case - the best you could do is to average the individual probabilities which gives ... well, 1/ndistinct. best regards, Florian Pflug
>> If you think about it, it's a bit ridiculous to look at the whole table >> *just* to "estimate" ndistinct - if we go that far why dont we just >> store the full distribution and be done with it? > > - the best you could do is to average the > individual probabilities which gives ... well, 1/ndistinct. > That is certainly *not* the best you could do in every case. The mean is only the best estimate in L2, which is definitely not the case here. Consider a table with 10K values, 9,990 of which are 1 and the rest of which are 2, 3, ..., 10, versus a table that has the same 10 distinct values evenly distributed. For a simple equality query, in the first case, a bitmap scan might be best. In the second case, a sequential scan would always be best. This is precisely the point I was trying to make in my email: the loss function is very complicated. Improving the ndistinct estimator could significantly improve the estimates of ndistinct ( in the quadratic loss sense ) while only marginally improving the plans. -Nathan
> > Not true. You're missing the goal of this effort - to get ndistinct > estimate for combination of multiple columns. Which is usually > pathological in case of dependent columns. <snip> > Again, don't think about a single column (although even in that case > there are known fail cases). Think about multiple columns and the number > of distinct combinations. With attribute value independence assumption, > this is usually significantly underestimated. And that's a problem as it > often leads to an index scan instead of sequential scan (or something > like that). My point was that, in the case of single columns, we've done pretty well despite the poor ndistinct estimates. I suspect the same will be true in the case of multiple columns; good heuristics will trump minimax estimators. >> ( as I've advocated for before ) this means parameterizing the >> distribution of values ( ie, find that the values are roughly uniform >> ), maybe this means estimating the error of our statistics and using >> the most robust rather than the best plan, or maybe it means something >> totally different. But: ( and the literature seems to support me ) > > Which literature? I've read an awful lot of papers on this topic lately, > and I don't remember anything like that. If there's an interesting > paper, I'd like to read it. Yes, all the papers state estimating a > ndistinct is a particularly tricky task, but that's somehow expected? It is definitely expected - non-parametric minimax results are always *very* weak. The papers that you've cited seem to confirm this; bounding ndistinct estimation error is tricky in the general case ( even with very simple loss functions, which we do not have ). The same is true for other non-parametric methods. Think about kernel density estimation: how many data points do you need to estimate the density of a normal distribution well? What about if you *know* that the data is normal ( or even that it comes from a big family? ). > No, I'm not trying to do this just to improve the ndistinct estimate. > The goal is to improve the cardinality estimate in case of dependent > columns, and one of the papers depends on good ndistinct estimates. > > And actually it does not depend on ndistinct for the columns only, it > depends on ndistinct estimates for the combination of columns. So > improving the ndistinct estimates for columns is just a necessary first > step (and only if it works reasonably well, we can do the next step). I think that any approach which depends on precise estimates of ndistinct is not practical. I am very happy that you've spent so much time on this, and I'm sorry if my previous email came off as combative. My point was only that simple heuristics have served us well in the past and, before we go to the effort of new, complicated schemes, we should see how well similar heuristics work in the multiple column case. I am worried that if the initial plan is too complicated then nothing will happen and, even if something does happen, it will be tough to get it committed ( check the archives for cross column stat threads - there are a lot ). Best, Nathan Boley
On Jan20, 2011, at 02:41 , Nathan Boley wrote: >>> If you think about it, it's a bit ridiculous to look at the whole table >>> *just* to "estimate" ndistinct - if we go that far why dont we just >>> store the full distribution and be done with it? >> >> - the best you could do is to average the >> individual probabilities which gives ... well, 1/ndistinct. > > That is certainly *not* the best you could do in every case. The mean > is only the best estimate in L2, which is definitely not the case > here. No, not in every case. But on average it comes out best, no? > Consider a table with 10K values, 9,990 of which are 1 and the rest of > which are 2, 3, ..., 10, versus a table that has the same 10 distinct > values evenly distributed. For a simple equality query, in the first > case, a bitmap scan might be best. In the second case, a sequential > scan would always be best. True. But selectivity estimates alone don't show the difference there. Also, in my personal experience this isn't really the area we've got problems now. The problem cases for me always were queries with a rather large number of joins (7 to 10 or so) plus rather complex search conditions. The join order, not the access strategy, then becomes the deciding factor in plan performance. And the join order *is* driven purely off the selectivity estimates, unlike the access strategy which even today takes other factors into account (like clusteredness, I believe) > This is precisely the point I was trying to make in my email: the loss > function is very complicated. Improving the ndistinct estimator could > significantly improve the estimates of ndistinct ( in the quadratic > loss sense ) while only marginally improving the plans. The point of this exercise isn't really to improve the ndistinct estimates for single columns. Rather, we'd like to get ndistinct estimates for groups of columns because that allows us to use the uniform bayesian approach to multi-column selectivity estimation that Tomas found literature on. Which on the whole seems like it *does* have a chance of greatly improving the plans in some cases. We could, of course, estimate multi-column ndistinct the same way we estimate single-column ndistincts, but quite a few people raised concerns that this wouldn't work due to the large error our current ndistinct estimates have. best regards, Florian Pflug
On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>> Regarding the crash scenario - if the commit fails, just throw away the >>> local estimator copy, it's not needed. I'm not sure how to take care of >>> the case when commit succeeds and the write of the merged estimator >>> fails, but I think it might be possible to write the estimator to xlog >>> or something like that. So it would be replayed during recovery etc. Or >>> is it a stupid idea? >> >> It's not stupid, in the sense that that is what you'd need to do if >> you want to avoid ever having to rescan the table. But it is another >> thing that I think is going to be way too expensive. > > Way too expensive? All you need to put into the logfile is a copy of the > estimator, which is a few kBs. How is that 'way too expensive'? At this point, this is all a matter of religion, right? Neither of us has a working implementation we've benchmarked. But yes, I believe you're going to find that implementing some kind of streaming estimator is going to impose a... <pulls number out of rear end> 6% performance penalty, even after you've optimized the living daylights out of it. Now you might say... big deal, it improves my problem queries by 100x. OK, but if you could get the same benefit by doing an occasional full table scan during off hours, you could have the same performance with a *0%* performance penalty. Even better, the code changes would be confined to ANALYZE rather than spread out all over the system, which has positive implications for robustness and likelihood of commit. I'm not trying to argue you out of working on this. It's obviously your time to spend, and if it works better than I think it will, great! I'm merely offering you an opinion on what will probably happen if you go this route - namely, it'll carry an unpalatable run-time penalty. That opinion may be worth no more than what you paid for it, but there you have it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 19, 2011 at 9:35 PM, Florian Pflug <fgp@phlo.org> wrote: > Also, in my personal experience this isn't really the area we've got > problems now. The problem cases for me always were queries with a rather > large number of joins (7 to 10 or so) plus rather complex search > conditions. The join order, not the access strategy, then becomes the > deciding factor in plan performance. This certainly matches my experience (except sometimes with more joins). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jan20, 2011, at 02:41 , Nathan Boley wrote: >>>> If you think about it, it's a bit ridiculous to look at the whole table >>>> *just* to "estimate" ndistinct - if we go that far why dont we just >>>> store the full distribution and be done with it? >>> >>> - the best you could do is to average the >>> individual probabilities which gives ... well, 1/ndistinct. >> >> That is certainly *not* the best you could do in every case. The mean >> is only the best estimate in L2, which is definitely not the case >> here. > > No, not in every case. But on average it comes out best, no? In the sense of minimizing the average plan cost over all values? Definitely not. Consider the trivial case where a bitmap scan is the cost of a sequential scan + epsilon. > >> Consider a table with 10K values, 9,990 of which are 1 and the rest of >> which are 2, 3, ..., 10, versus a table that has the same 10 distinct >> values evenly distributed. For a simple equality query, in the first >> case, a bitmap scan might be best. In the second case, a sequential >> scan would always be best. > > True. But selectivity estimates alone don't show the difference there. Yet the full distribution would - supporting my argument that even in cases where we dont know a specific value, the full distribution is more informative. > > Also, in my personal experience this isn't really the area we've got > problems now. The problem cases for me always were queries with a rather > large number of joins (7 to 10 or so) plus rather complex search > conditions. The join order, not the access strategy, then becomes the > deciding factor in plan performance. And the join order *is* driven purely > off the selectivity estimates, unlike the access strategy which even today > takes other factors into account (like clusteredness, I believe) > I think I'm losing you. Why would ndistinct be complete sufficient for estimating the optimal join order? >> This is precisely the point I was trying to make in my email: the loss >> function is very complicated. Improving the ndistinct estimator could >> significantly improve the estimates of ndistinct ( in the quadratic >> loss sense ) while only marginally improving the plans. > > > The point of this exercise isn't really to improve the ndistinct estimates > for single columns. Rather, we'd like to get ndistinct estimates for > groups of columns because that allows us to use the uniform bayesian > approach to multi-column selectivity estimation that Tomas found literature > on. Which on the whole seems like it *does* have a chance of greatly > improving the plans in some cases. We could, of course, estimate multi-column > ndistinct the same way we estimate single-column ndistincts, but quite a few > people raised concerns that this wouldn't work due to the large error our > current ndistinct estimates have. Sure. But estimating multi-column ndistinct is surely not the only way to approach this problem. The comment I made, which you objected to, was that it's silly to look at the whole table and then throw away all information *except* ndistinct. If we really think that looking at the whole table is the best approach, then shouldn't we be able to get better statistics? Is this really an open question? -Nathan
On 20.01.2011 04:36, Robert Haas wrote: > ... Even better, the > code changes would be confined to ANALYZE rather than spread out all > over the system, which has positive implications for robustness and > likelihood of commit. Keep in mind that the administrator can already override the ndistinct estimate with ALTER TABLE. If he needs to manually run a special ANALYZE command to make it scan the whole table, he might as well just use ALTER TABLE to tell the system what the real (or good enough) value is. A DBA should have a pretty good feeling of what the distribution of his data is like. And how good does the estimate need to be? For a single-column, it's usually not that critical, because if the column has only a few distinct values then we'll already estimate that pretty well, and OTOH if ndistinct is large, it doesn't usually affect the plans much if it's 10% of the number of rows or 90%. It seems that the suggested multi-column selectivity estimator would be more sensitive to ndistinct of the individual columns. Is that correct? How is it biased? If we routinely under-estimate ndistinct of individual columns, for example, does the bias accumulate or cancel itself in the multi-column estimate? I'd like to see some testing of the suggested selectivity estimator with the ndistinct estimates we have. Who knows, maybe it works fine in practice. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi Tomas, On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote: > No, the multi-column statistics do not require constant updating. There > are cases where a sampling is perfectly fine, although you may need a > bit larger sample. Generally if you can use a multi-dimensional > histogram, you don't need to scan the whole table. In the cases where sampling is enough, you can do that to the updates too: do a sampling on the changes, in that you only process every Nth change to make it to the estimator. If you can also dynamically tune the N to grow it as the statistics stabilize, and lower it if you detect high variance, even better. If the analyze process could be decoupled from the backends, and maybe just get the data passed over to be processed asynchronously, then that could be a feasible way to have always up to date statistics when the bottleneck is IO and CPU power is in excess. If that then leads to better plans, it could really be a win exceeding the overhead. If this analyze process (or more of them) could also just get the data from the modified buffers in a cyclic way, so that backends need nothing extra to do, then I don't see any performance disadvantage other than possible extra locking contention on the buffers and non-determinism of the actual time when a change makes it to the statistics. Then you just need to get more CPU power and higher memory bandwidth to pay for the accurate statistics. Cheers, Csaba.
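A rough sketch of the change-stream sampling suggested above: only every Nth change is forwarded to the asynchronous estimator, and N is adjusted as the estimates stabilize. All of the names, the thresholds and the adaptation rule are hypothetical - this is only meant to make the idea concrete.

#include <math.h>

typedef struct ChangeSampler
{
    int every_nth;   /* forward only every Nth change to the estimator */
    int skipped;     /* changes seen since the last forwarded one */
} ChangeSampler;

static int
sample_this_change(ChangeSampler *cs)
{
    if (++cs->skipped >= cs->every_nth)
    {
        cs->skipped = 0;
        return 1;    /* hand this change over for asynchronous processing */
    }
    return 0;        /* drop it - it never reaches the estimator */
}

/* One possible (made-up) adaptation rule: sample less often while successive
 * estimates agree, more often when they start to drift. */
static void
adapt_sampling_rate(ChangeSampler *cs, double prev_estimate, double new_estimate)
{
    double drift = fabs(new_estimate - prev_estimate) / (prev_estimate + 1.0);

    if (drift < 0.01 && cs->every_nth < 1024)
        cs->every_nth *= 2;
    else if (drift > 0.05 && cs->every_nth > 1)
        cs->every_nth /= 2;
}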
On 20.1.2011 03:06, Nathan Boley wrote: >> And actually it does not depend on ndistinct for the columns only, it >> depends on ndistinct estimates for the combination of columns. So >> improving the ndistinct estimates for columns is just a necessary first >> step (and only if it works reasonably well, we can do the next step). > > I think that any approach which depends on precise estimates of > ndistinct is not practical. I'm not aware of any other approach to the 'discrete fail case' (where the multi-dimensional histograms are not applicable). If someone finds a better solution, I'll be the first one to throw away this stuff. > I am very happy that you've spent so much time on this, and I'm sorry > if my previous email came off as combative. My point was only that > simple heuristics have served us well in the past and, before we go to > the effort of new, complicated schemes, we should see how well similar > heuristics work in the multiple column case. I am worried that if the > initial plan is too complicated then nothing will happen and, even if > something does happen, it will be tough to get it committed ( check > the archives for cross column stat threads - there are a lot ). If I've learned one thing over the years in IT, it's not to take critique personally. All the problems mentioned in this thread are valid concerns, pointing out weak points of the approach. And I'm quite happy to receive this feedback - that's why I started it. On the other hand - Jara Cimrman (a famous Czech fictional character, depicted as the best scientist/poet/teacher/traveller/... - see [1]) once said that you can't be really sure you don't get gold by blowing cigarette smoke into a basin drain, until you actually try it. So I'm blowing cigarette smoke into the drain ... It may very well happen that this will be a dead end, but I'll do my best to fix all the issues or to prove that the pros outweigh the cons. And even if it is eventually rejected, I hope to get -1 from TL to be eligible for that t-shirt ... [1] http://en.wikipedia.org/wiki/Jara_Cimrman regards Tomas
On 20.1.2011 03:36, Robert Haas wrote: > On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>>> Regarding the crash scenario - if the commit fails, just throw away the >>>> local estimator copy, it's not needed. I'm not sure how to take care of >>>> the case when commit succeeds and the write of the merged estimator >>>> fails, but I think it might be possible to write the estimator to xlog >>>> or something like that. So it would be replayed during recovery etc. Or >>>> is it a stupid idea? >>> >>> It's not stupid, in the sense that that is what you'd need to do if >>> you want to avoid ever having to rescan the table. But it is another >>> thing that I think is going to be way too expensive. >> >> Way too expensive? All you need to put into the logfile is a copy of the >> estimator, which is a few kBs. How is that 'way too expensive'? > > At this point, this is all a matter of religion, right? Neither of us > has a working implementation we've benchmarked. But yes, I believe > you're going to find that implementing some kind of streaming > estimator is going to impose a... <pulls number out of rear end> 6% > performance penalty, even after you've optimized the living daylights > out of it. Now you might say... big deal, it improves my problem > queries by 100x. OK, but if you could get the same benefit by doing > an occasional full table scan during off hours, you could have the > same performance with a *0%* performance penalty. Even better, the > code changes would be confined to ANALYZE rather than spread out all > over the system, which has positive implications for robustness and > likelihood of commit. Good point. What I was trying to do was to continuously update the estimator with new data - that was the whole idea behind collecting the new values (which might lead to problems with memory etc. as you've pointed out) and updating a local copy of the estimator (which is a good idea I think). But this might be another option - let the user decide if he wants to continuously update the estimates (and pay the price) or do that during off hours (and pay almost nothing). That sounds like a very good solution to me. > I'm not trying to argue you out of working on this. It's obviously > your time to spend, and if works better than I think it will, great! > I'm merely offering you an opinion on what will probably happen if you > go this route - namely, it'll carry an unpalatable run-time penalty. > That opinion may be worth no more than what you paid for it, but there > you have it. Yes, and I appreciate all feedback. But I still believe this can be done so that users that don't need the feature don't pay for it. regards Tomas
On 20.1.2011 09:10, Heikki Linnakangas wrote: > It seems that the suggested multi-column selectivity estimator would be > more sensitive to ndistinct of the individual columns. Is that correct? > How is it biased? If we routinely under-estimate ndistinct of individual > columns, for example, does the bias accumulate or cancel itself in the > multi-column estimate? > > I'd like to see some testing of the suggested selectivity estimator with > the ndistinct estimates we have. Who knows, maybe it works fine in > practice. The estimator for two columns and query 'A=a AND B=b' is about 0.5 * (dist(A)/dist(A,B) * Prob(A=a) + dist(B)/dist(A,B) * Prob(B=b)), so it's quite simple. It's not that sensitive to errors of the ndistinct estimates for individual columns, but the problem is in the multi-column ndistinct estimates. It's very likely that with dependent columns (e.g. with the ZIP codes / cities) the distribution is so pathological that the sampling-based estimate will be very off. I guess this was way too short an analysis, but if you can provide more details of the expected tests etc., I'll be happy to provide them. regards Tomas
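For concreteness, here is the estimator quoted above as a small function, together with an invented ZIP-code/city example; the numbers are made up purely to show how the formula behaves compared to the attribute value independence assumption.

#include <stdio.h>

/* P(A=a AND B=b) ~ 0.5 * (dist(A)/dist(A,B) * P(A=a) + dist(B)/dist(A,B) * P(B=b)) */
static double
two_column_selectivity(double dist_a, double dist_b, double dist_ab,
                       double prob_a, double prob_b)
{
    return 0.5 * (dist_a / dist_ab * prob_a + dist_b / dist_ab * prob_b);
}

int main(void)
{
    /* Invented example: 100 cities, 1000 ZIP codes, and because a ZIP code
     * implies the city there are only 1000 distinct (city, zip) combinations. */
    double sel = two_column_selectivity(100, 1000, 1000, 0.01, 0.001);

    printf("estimated selectivity:   %g\n", sel);           /* 0.001 */
    printf("independence assumption: %g\n", 0.01 * 0.001);  /* 1e-05 */
    return 0;
}

With these numbers the formula gives 0.001 while the independence assumption gives 0.00001, a 100x underestimate - exactly the kind of misestimate that leads to a bad plan for dependent columns.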
On 20.1.2011 11:05, Csaba Nagy wrote: > Hi Tomas, > > On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote: >> No, the multi-column statistics do not require constant updating. There >> are cases where a sampling is perfectly fine, although you may need a >> bit larger sample. Generally if you can use a multi-dimensional >> histogram, you don't need to scan the whole table. > > In the cases where sampling is enough, you can do that to the updates > too: do a sampling on the changes, in that you only process every Nth > change to make it to the estimator. If you can also dynamically tune the > N to grow it as the statistics stabilize, and lower it if you detect > high variance, even better. > > If the analyze process could be decoupled from the backends, and maybe > just get the data passed over to be processed asynchronously, then that > could be a feasible way to have always up to date statistics when the > bottleneck is IO and CPU power is in excess. If that then leads to > better plans, it could really be a win exceeding the overhead. OK, this sounds interesting. I'm not sure how to do that but it might be a good solution. What about transactions? If the client inserts data (and it will be sent asynchronously to update the estimator) and then rolls back, is the estimator 'rolled back' or what happens? This was exactly the reason why I initially wanted to collect all the data at the backend (and send them to the estimator at commit time) - which was then replaced by the idea to keep a local estimator copy and merge it back into the original estimator at commit time. > If this analyze process (or more of them) could also just get the data > from the modified buffers in a cyclic way, so that backends need nothing > extra to do, then I don't see any performance disadvantage other than > possible extra locking contention on the buffers and non-determinism of > the actual time when a change makes it to the statistics. Then you just > need to get more CPU power and higher memory bandwidth to pay for the > accurate statistics. Well, the possible locking contention sounds like a quite significant problem to me :-( The lag between an update and a change to the stats is not that big a deal I guess - we have the same behaviour with the rest of the stats (updated by autovacuum every once in a while). Tomas