Thread: Thoughts on statistics for continuously advancing columns
All,

One of our clients is having query plan issues with a table with a continuously advancing timestamp column (i.e. one with default now()). The newest rows, which are the most in demand, are always estimated to be fewer than they are, or even non-existent. As a result, the user has to analyze the table every hour ... and it's a very large table. I've seen this in a lot of other databases, both with timestamp columns and with SERIALs -- both of which are very common table structures.

From my reading of the planner code, the problem seems to be the histogram bounds ... if a requested value is above the high bound, it's assumed to be extremely uncommon or not to exist. This leads to bad plans if analyze hasn't been run very recently.

My thoughts on dealing with this intelligently, without a major change to statistics gathering, went along these lines:

1. add columns to pg_statistic to hold estimates of upper and lower bounds growth between analyzes.

2. every time analyze is run, populate these columns with 1/2 of the proportion of values above or below the previously stored bounds, averaged with the existing value for the new columns.

3. use this factor instead of the existing algorithm to calculate the row estimate for out-of-bounds values.

This is obviously a very rough idea, but I wanted to get feedback on the general problem and my approach before going further with it.

Thanks!

--Josh Berkus
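For illustration only, a minimal standalone sketch of how steps 1-3 might fit together. Every name here is hypothetical (pg_statistic has no such columns today), and real code would live in ANALYZE and the selectivity functions rather than in a toy program:

/*
 * Toy sketch of the proposal above.  "frac_above_growth" stands in for the
 * hypothetical new pg_statistic column.
 */
#include <stdio.h>

typedef struct
{
    double  hist_high_bound;    /* upper histogram bound stored at last ANALYZE */
    double  frac_above_growth;  /* running estimate of the fraction of the table
                                 * found above the previously stored bound */
} ColumnStats;

/* Step 2: at ANALYZE time, average the newly observed fraction of sampled
 * values that fell above the old upper bound with the stored estimate. */
static void
update_growth_estimate(ColumnStats *stats, double observed_frac_above_old_bound)
{
    stats->frac_above_growth =
        (stats->frac_above_growth + observed_frac_above_old_bound) / 2.0;
}

/* Step 3: selectivity of "col > value" for a value above the stored bound.
 * Instead of assuming such rows are rare or absent, assume the stored growth
 * fraction of the table now lies above the old bound. */
static double
out_of_bounds_selectivity(const ColumnStats *stats, double value)
{
    if (value <= stats->hist_high_bound)
        return -1.0;            /* in bounds: use the normal histogram path */
    return stats->frac_above_growth;
}

int
main(void)
{
    ColumnStats stats = {1000.0, 0.02};

    /* 6% of the latest ANALYZE sample was above the old upper bound */
    update_growth_estimate(&stats, 0.06);
    printf("estimated selectivity of col > 1500: %.3f\n",
           out_of_bounds_selectivity(&stats, 1500.0));
    return 0;
}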
Josh Berkus <josh@agliodbs.com> writes:
> My thoughts on dealing with this intelligently without a major change to
> statistics gathering went along these lines:

> 1. add columns to pg_statistic to hold estimates of upper and lower
> bounds growth between analyzes.

This seems like a fundamentally broken approach, first because "time between analyzes" is not even approximately a constant, and second because it assumes that we have a distance metric for all datatypes. (Note that convert_to_scalar does not assume that it can measure arbitrary distances, but only fractions *within* a histogram bucket; and even that is pretty shaky.)

I don't have a better idea at the moment :-(

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> My thoughts on dealing with this intelligently without a major
>> change to statistics gathering went along these lines:
>
>> 1. add columns to pg_statistic to hold estimates of upper and
>> lower bounds growth between analyzes.
>
> This seems like a fundamentally broken approach
> I don't have a better idea at the moment :-(

It's been a while since I've been bitten by this issue -- the last time was under Sybase. The Sybase suggestion was to either add "dummy rows" [YUCK!] to set the extreme bounds or to "lie to the optimizer" by fudging the statistics after each generation. Perhaps we could do better by adding columns for high and low bounds to pg_statistic. These would not be set by ANALYZE, but user-modifiable to cover exactly this problem? NULL would mean current behavior?

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I don't have a better idea at the moment :-( > It's been a while since I've been bitten by this issue -- the last > time was under Sybase. The Sybase suggestion was to either add > "dummy rows" [YUCK!] to set the extreme bounds or to "lie to the > optimizer" by fudging the statistics after each generation. Perhaps > we could do better by adding columns for high and low bounds to > pg_statistic. These would not be set by ANALYZE, but > user-modifiable to cover exactly this problem? NULL would mean > current behavior? Well, the problem Josh has got is exactly that a constant high bound doesn't work. What I'm wondering about is why he finds that re-running ANALYZE isn't an acceptable solution. It's supposed to be a reasonably cheap thing to do. I think the cleanest solution to this would be to make ANALYZE cheaper, perhaps by finding some way for it to work incrementally. regards, tom lane
On Wed, 30 Dec 2009 11:16:45 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't have a better idea at the moment :-(
>
>> It's been a while since I've been bitten by this issue -- the last
>> time was under Sybase. The Sybase suggestion was to either add
>> "dummy rows" [YUCK!] to set the extreme bounds or to "lie to the
>> optimizer" by fudging the statistics after each generation. Perhaps
>> we could do better by adding columns for high and low bounds to
>> pg_statistic. These would not be set by ANALYZE, but
>> user-modifiable to cover exactly this problem? NULL would mean
>> current behavior?
>
> Well, the problem Josh has got is exactly that a constant high bound
> doesn't work.
>
> What I'm wondering about is why he finds that re-running ANALYZE
> isn't an acceptable solution. It's supposed to be a reasonably
> cheap thing to do.

What makes ANALYZE cheap is two things:

1. It uses read-only bandwidth (for the most part), which is the most bandwidth we have
2. It doesn't take a lock that bothers anything

On the other hand ANALYZE also:

1. Uses lots of memory
2. Lots of processor
3. Can take a long time

We normally don't notice because most sets won't incur a penalty. We got a customer who has a single table that is over 1TB in size... We notice. Granted that is the extreme but it would only take a quarter of that size (which is common) to start seeing issues.

> I think the cleanest solution to this would be to make ANALYZE
> cheaper, perhaps by finding some way for it to work incrementally.

That could be interesting. What about a running statistics set that has some kind of threshold? What I mean is, we run our normal analyze but we can mark a table "HOT" (yeah, bad term). If we mark the table HOT, statistics are generated on the fly for the planner and updated every X interval. Perhaps then written out at a checkpoint? This is just off the top of my head.

JD

> regards, tom lane

--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, the problem Josh has got is exactly that a constant high
> bound doesn't work.

I thought the problem was that the high bound in the statistics fell too far below the actual high end in the data. This tends (in my experience) to be much more painful than an artificially extended high end in the statistics. (YMMV, of course.)

> What I'm wondering about is why he finds that re-running ANALYZE
> isn't an acceptable solution. It's supposed to be a reasonably
> cheap thing to do.

Good point. We haven't hit this problem in PostgreSQL precisely because we can run ANALYZE often enough to prevent the skew from becoming pathological.

> I think the cleanest solution to this would be to make ANALYZE
> cheaper, perhaps by finding some way for it to work incrementally.

Yeah, though as you say above, it'd be good to know why frequent ANALYZE is a problem as it stands.

-Kevin
Joshua D. Drake wrote:
> We normally don't notice because most sets won't incur a penalty. We got a customer who
> has a single table that is over 1TB in size... We notice. Granted that is the extreme
> but it would only take a quarter of that size (which is common) to start seeing issues.

Right, and the only thing that makes this case less painful is that you don't really need the stats to be updated quite as often in situations with that much data. If, say, your stats say there's 2B rows in the table but there's actually 2.5B, that's a big error, but unlikely to change the types of plans you get. Once there are millions of distinct values, it takes a big change for plans to shift, etc.

--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
Greg Smith <greg@2ndquadrant.com> writes:
> Right, and the only thing that makes this case less painful is that you
> don't really need the stats to be updated quite as often in situations
> with that much data. If, say, your stats say there's 2B rows in the
> table but there's actually 2.5B, that's a big error, but unlikely to
> change the types of plans you get. Once there are millions of distinct
> values, it takes a big change for plans to shift, etc.

Normally, yeah. I think Josh's problem is that he's got performance-critical queries that are touching the "moving edge" of the data set, and so the part of the stats that are relevant to them is changing fast, even though in an overall sense the table contents might not be changing much.

			regards, tom lane
Greg Smith <greg@2ndquadrant.com> wrote:
> If, say, your stats say there's 2B rows in the table but there's
> actually 2.5B, that's a big error, but unlikely to change the
> types of plans you get. Once there are millions of distinct values,
> it takes a big change for plans to shift, etc.

Well, the exception to that is if the stats say that your highest value is x, and there are actually 500 million rows with values greater than x, you can get some very bad plans for queries requiring a range of values above x.

-Kevin
Tom Lane wrote:
> Greg Smith <greg@2ndquadrant.com> writes:
>> Right, and the only thing that makes this case less painful is that you
>> don't really need the stats to be updated quite as often in situations
>> with that much data. If, say, your stats say there's 2B rows in the
>> table but there's actually 2.5B, that's a big error, but unlikely to
>> change the types of plans you get. Once there are millions of distinct
>> values, it takes a big change for plans to shift, etc.
>
> Normally, yeah. I think Josh's problem is that he's got
> performance-critical queries that are touching the "moving edge" of the
> data set, and so the part of the stats that are relevant to them is
> changing fast, even though in an overall sense the table contents might
> not be changing much.

Maybe only tangentially related: if this was a setup partitioned by a timestamp, it would be very useful to be able to analyze only the current partition and have updated stats for the parent relation as well. However, AFAICT with your proposed changes in this area this would not work, right? You'd need an analyze on the parent relation, which is painful.

--
Alvaro Herrera        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Normally, yeah. I think Josh's problem is that he's got
>> performance-critical queries that are touching the "moving edge" of the
>> data set, and so the part of the stats that are relevant to them is
>> changing fast, even though in an overall sense the table contents might
>> not be changing much.

> Maybe only tangentially related: if this was a setup partitioned by a
> timestamp, it would be very useful to be able to analyze only the
> current partition and have updated stats for the parent relation as
> well. However, AFAICT with your proposed changes in this area this would
> not work, right? You'd need an analyze on the parent relation, which is
> painful.

Yeah, I was just thinking about that myself. The parent-level ANALYZE would approximately double the work involved, assuming that your total data set is large enough to max out the number of blocks sampled. So it'd be painful but not catastrophic.

Maybe the way to think about the "incremental update" problem is to find a way to let ANALYZE calculate parent-relation stats from the stats of the individual partitions. Not that I know how to do that either, but at least it's a clearly stated task.

			regards, tom lane
On Wed, Dec 30, 2009 at 4:31 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On the other hand ANALYZE also:
>
> 1. Uses lots of memory
> 2. Lots of processor
> 3. Can take a long time
>
> We normally don't notice because most sets won't incur a penalty. We got a
> customer who has a single table that is over 1TB in size... We notice.
> Granted that is the extreme but it would only take a quarter of that size
> (which is common) to start seeing issues.

I'm a bit puzzled by people's repeated suggestion here that large tables take a long time to analyze. The sample that ANALYZE takes to generate statistics is not heavily influenced by the size of the table. Your 1TB table should take basically the same amount of time as a 1GB table or a 1MB table (if it wasn't already in cache).

Unless the reason why it's 1TB is that the columns are extremely wide rather than that it has a lot of rows? Or unless you've raised the statistics target in (a misguided*) belief that larger tables require larger statistics targets to achieve the same level of accuracy. Or unless when you say "ANALYZE" you're really running "VACUUM ANALYZE".

[*] except for ndistinct estimates :(

--
greg
On Wed, 30 Dec 2009 18:42:38 +0000, Greg Stark <gsstark@mit.edu> wrote:
> I'm a bit puzzled by people's repeated suggestion here that large
> tables take a long time to analyze. The sample that ANALYZE takes to
> generate statistics is not heavily influenced by the size of the
> table. Your 1TB table should take basically the same amount of time as
> a 1GB table or a 1MB table (if it wasn't already in cache).

No.

postgres=# analyze verbose test_one_million;
INFO: analyzing "public.test_one_million"
INFO: "test_one_million": scanned 3000 of 4425 pages, containing 677950 live rows and 0 dead rows; 3000 rows in sample, 999976 estimated total rows
ANALYZE
Time: 168.009 ms

postgres=# analyze verbose test_one_million;
INFO: analyzing "public.test_one_million"
INFO: "test_one_million": scanned 3000 of 4425 pages, containing 677950 live rows and 0 dead rows; 3000 rows in sample, 999976 estimated total rows
ANALYZE
Time: 104.006 ms

postgres=# analyze verbose test_ten_million;
INFO: analyzing "public.test_ten_million"
INFO: "test_ten_million": scanned 3000 of 44248 pages, containing 678000 live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total rows
ANALYZE
Time: 20145.148 ms

postgres=# analyze verbose test_ten_million;
INFO: analyzing "public.test_ten_million"
INFO: "test_ten_million": scanned 3000 of 44248 pages, containing 678000 live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total rows
ANALYZE
Time: 18481.053 ms

postgres=# analyze verbose test_ten_million;
INFO: analyzing "public.test_ten_million"
INFO: "test_ten_million": scanned 3000 of 44248 pages, containing 678000 live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total rows
ANALYZE
Time: 17653.006 ms

The test_one_million table is very quick whether it is in cache or not. I don't think the ten-million-row table is actually able to get into cache (small box), but either way, if you compare the on-disk number for the one million (168 ms) against the on-disk numbers for the ten million, they are vastly different.

postgres=# select pg_size_pretty(pg_total_relation_size('test_one_million'));
 pg_size_pretty
----------------
 35 MB
(1 row)

Time: 108.006 ms

postgres=# select pg_size_pretty(pg_total_relation_size('test_ten_million'));
 pg_size_pretty
----------------
 346 MB
(1 row)

> Unless the reason why it's 1TB is that the columns are extremely wide
> rather than that it has a lot of rows?

I should have qualified: yes, they are very wide.

JD

--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
On Tue, 2009-12-29 at 22:08 -0500, Tom Lane wrote:
> This seems like a fundamentally broken approach, first because "time
> between analyzes" is not even approximately a constant, and second
> because it assumes that we have a distance metric for all datatypes.

Maybe you could compute a correlation between the column values and the transaction numbers to recognize a continuously advancing column. It wouldn't tell you much about how fast they are advancing, but at least the typical use cases of serial and current timestamp columns should clearly stick out. And then instead of assuming that a value beyond the histogram bound doesn't exist, you assume for example the average frequency, which should be pretty good for the serial and timestamp cases. (Next step: Fourier analysis ;-) )
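Just to make the detection step concrete, a standalone sketch (using row insertion order as a stand-in for transaction numbers, which is an assumption on my part): compute the correlation over the sampled rows, and if it is close to 1, treat the column as continuously advancing and fall back to the average frequency (1/n_distinct) for an equality probe beyond the histogram bound. None of this is existing backend code:

#include <math.h>
#include <stdio.h>

/* Plain Pearson correlation between two equal-length arrays. */
static double
pearson(const double *x, const double *y, int n)
{
    double  sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

    for (int i = 0; i < n; i++)
    {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    return (sxy - sx * sy / n) /
           sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
}

int
main(void)
{
    /* insertion order vs. a now()-style column: nearly perfectly correlated */
    double  order[] = {1, 2, 3, 4, 5, 6, 7, 8};
    double  value[] = {10, 11, 11, 13, 14, 16, 17, 19};
    double  n_distinct = 7;
    double  r = pearson(order, value, 8);

    if (r > 0.95)               /* arbitrary "continuously advancing" threshold */
        printf("advancing column; out-of-bounds '=' estimate: %.4f\n",
               1.0 / n_distinct);
    return 0;
}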
Peter Eisentraut <peter_e@gmx.net> writes:
> On Tue, 2009-12-29 at 22:08 -0500, Tom Lane wrote:
>> This seems like a fundamentally broken approach, first because "time
>> between analyzes" is not even approximately a constant, and second
>> because it assumes that we have a distance metric for all datatypes.

> Maybe you could compute a correlation between the column values and the
> transaction numbers to recognize a continuously advancing column. It
> wouldn't tell you much about how fast they are advancing, but at least
> the typical use cases of serial and current timestamp columns should
> clearly stick out. And then instead of assuming that a value beyond the
> histogram bound doesn't exist, you assume for example the average
> frequency, which should be pretty good for the serial and timestamp
> cases. (Next step: Fourier analysis ;-) )

Actually, the histogram hasn't got much of anything to do with estimates of the number of occurrences of a single value. Josh hasn't shown us his specific problem query, but I would bet that it's roughly like WHERE update_time > now() - interval 'something', that is, the estimate that's problematic is an inequality not an equality.

When the range being asked for is outside the histogram bounds, it really is rather difficult to come up with a reasonable estimate --- you'd need a specific idea of how far outside the upper bound it is, how fast the upper bound has been advancing, and how long it's been since the last analyze. (I find the last bit particularly nasty, because it will mean that plans change even when "nothing is changing" in the database.)

[ thinks for awhile ... ]

Actually, in the problematic cases, it's interesting to consider the following strategy: when scalarineqsel notices that it's being asked for a range estimate that's outside the current histogram bounds, first try to obtain the actual current max() or min() of the column value --- this is something we can get fairly cheaply if there's a btree index on the column. If we can get it, plug it into the histogram, replacing the high or low bin boundary. Then estimate as we currently do. This would work reasonably well as long as re-analyzes happen at a time scale such that the histogram doesn't move much overall, ie, the number of insertions between analyzes isn't a lot compared to the number of rows per bin. We'd have some linear-in-the-bin-size estimation error because the modified last or first bin actually contains more rows than other bins, but it would certainly work a lot better than it does now.

			regards, tom lane
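A toy, self-contained version of that strategy, with an equi-depth histogram over doubles and the index-probed max passed in as a parameter (the real thing would go through convert_to_scalar and the btree machinery, and only when the constant falls outside the stored bounds):

#include <stdio.h>

/*
 * Selectivity of "col > constant" from an equi-depth histogram, after
 * plugging a freshly probed max() into the top boundary if it has moved.
 */
static double
greater_than_selectivity(double *bounds, int nbounds, double constant,
                         double actual_max)
{
    int     nbins = nbounds - 1;

    if (actual_max > bounds[nbounds - 1])
        bounds[nbounds - 1] = actual_max;

    if (constant >= bounds[nbounds - 1])
        return 0.0;
    if (constant <= bounds[0])
        return 1.0;

    /* each bin holds ~1/nbins of the rows; interpolate within the bin that
     * contains the constant and add the whole bins above it */
    for (int i = 0; i < nbins; i++)
    {
        if (constant < bounds[i + 1])
        {
            double  frac_of_bin =
                (bounds[i + 1] - constant) / (bounds[i + 1] - bounds[i]);

            return (frac_of_bin + (nbins - 1 - i)) / nbins;
        }
    }
    return 0.0;
}

int
main(void)
{
    /* histogram gathered at the last ANALYZE; the column has since advanced */
    double  stale[] = {0, 250, 500, 750, 1000};
    double  fresh[] = {0, 250, 500, 750, 1000};

    printf("col > 1100 with stale bound: %.3f\n",
           greater_than_selectivity(stale, 5, 1100, 1000));     /* 0.000 */
    printf("col > 1100 with probed max:  %.3f\n",
           greater_than_selectivity(fresh, 5, 1100, 1300));     /* ~0.091 */
    return 0;
}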
jd@commandprompt.com ("Joshua D. Drake") writes:
> On the other hand ANALYZE also:
>
> 1. Uses lots of memory
> 2. Lots of processor
> 3. Can take a long time
>
> We normally don't notice because most sets won't incur a penalty. We got a
> customer who has a single table that is over 1TB in size... We notice.
> Granted that is the extreme but it would only take a quarter of that size
> (which is common) to start seeing issues.

I find it curious that ANALYZE *would* take a long time to run. After all, its sampling strategy means that, barring having SET STATISTICS to some ghastly high number, it shouldn't need to do materially more work to analyze a 1TB table than is required to analyze a 1GB table.

With the out-of-the-box (which may have changed without my notice ;-)) default of 10 bars in the histogram, it should search for 30K rows, which, while not "free," doesn't get enormously more expensive as tables grow.

--
"cbbrowne","@","gmail.com"
http://linuxfinances.info/info/linuxdistributions.html
Rules of the Evil Overlord #179. "I will not outsource core functions."
<http://www.eviloverlord.com/>
Well, that's interesting, because they claim to be doing exactly the same amount of I/O in terms of pages. In the first case it's reading 3/4 of the table, so it's effectively doing a sequential scan. In the second case it's only scanning 7.5%, so you would expect it to be slower but not that much slower.

If, as you say, the rows are very wide, then the other part of the equation will be TOAST table I/O though. I'm not sure what it would look like, but I bet analyze isn't optimized to handle it well -- not much of postgres really knows about TOAST. It'll be accessing the same number of TOAST records, but out of a much bigger TOAST table.

--
greg
Chris Browne <cbbrowne@acm.org> writes:
> I find it curious that ANALYZE *would* take a long time to run.
> After all, its sampling strategy means that, barring having SET
> STATISTICS to some ghastly high number, it shouldn't need to do
> materially more work to analyze a 1TB table than is required to analyze
> a 1GB table.

Right. The example JD quotes in this thread compares a 35MB table to a 350MB one, and the difference is all about having crossed the threshold of what would fit in his available RAM. There isn't going to be much difference in the ANALYZE time for "big" versus "very big" tables. (There might, however, be a difference in the quality of the resulting stats :-()

			regards, tom lane
Joshua D. Drake wrote:
> postgres=# analyze verbose test_ten_million;
> INFO: analyzing "public.test_ten_million"
> INFO: "test_ten_million": scanned 3000 of 44248 pages, containing 678000
> live rows and 0 dead rows; 3000 rows in sample, 10000048 estimated total
> rows
> ANALYZE
> Time: 20145.148 ms

At ever larger table sizes, this would turn into 3000 random seeks all over the drive, one at a time, because there's no async I/O here to queue requests better than that for this access pattern. Let's say they take 10ms each, not an unrealistic amount of time on current hardware. That's 30 seconds, best case, which is similar to what JD's example is showing even on a pretty small data set. Under load it could easily take over a minute, hammering the disks the whole time, and in a TOAST situation you're doing even more work. It's not outrageous and it doesn't scale linearly with table size, but it's not something you want to happen any more than you have to either -- consider the poor client who is trying to get their work done while that is going on.

On smaller tables, you're both more likely to grab a useful next page via readahead, and to just have the data you need cached in RAM already. There's a couple of "shelves" in the response time to finish ANALYZE as you exceed L1/L2 CPU cache size and RAM size, then it trails downward as the seeks get longer and longer once the data you need is spread further across the disk(s). That the logical beginning of a drive is much faster than the logical end doesn't help either. I should generate that graph again one day somewhere I can release it at...

--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
On 31/12/2009 12:33 AM, Kevin Grittner wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Well, the problem Josh has got is exactly that a constant high
>> bound doesn't work.
>
> I thought the problem was that the high bound in the statistics fell
> too far below the actual high end in the data. This tends (in my
> experience) to be much more painful than an artificially extended
> high end in the statistics. (YMMV, of course.)
>
>> What I'm wondering about is why he finds that re-running ANALYZE
>> isn't an acceptable solution. It's supposed to be a reasonably
>> cheap thing to do.
>
> Good point. We haven't hit this problem in PostgreSQL precisely
> because we can run ANALYZE often enough to prevent the skew from
> becoming pathological.

While regular ANALYZE seems to be pretty good ... is it insane to suggest determining the min/max bounds of problem columns by looking at a btree index on the column in ANALYZE, instead of relying on random data sampling? An ANALYZE that didn't even have to scan the indexes but just look at the ends might be something that could be run much more frequently with less I/O and memory cost than a normal ANALYZE, just to selectively update key stats that are an issue for such continuously advancing columns.

--
Craig Ringer
> While regular ANALYZE seems to be pretty good ... is it insane to
> suggest determining the min/max bounds of problem columns by looking at
> a btree index on the column in ANALYZE, instead of relying on random
> data sampling? An ANALYZE that didn't even have to scan the indexes but
> just look at the ends might be something that could be run much more
> frequently with less I/O and memory cost than a normal ANALYZE, just to
> selectively update key stats that are an issue for such continuously
> advancing columns.

... which Tom Lane already raised in a smarter way in a message I hadn't read yet. Sorry for the noise.

--
Craig Ringer
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value --- this
> is something we can get fairly cheaply if there's a btree index on the
> column. If we can get it, plug it into the histogram, replacing the
> high or low bin boundary. Then estimate as we currently do. This would
> work reasonably well as long as re-analyzes happen at a time scale such
> that the histogram doesn't move much overall, ie, the number of
> insertions between analyzes isn't a lot compared to the number of rows
> per bin. We'd have some linear-in-the-bin-size estimation error because
> the modified last or first bin actually contains more rows than other
> bins, but it would certainly work a lot better than it does now.

I know very little about statistics in general, but your proposal seems straightforward enough for me to understand it, and looks good: +1.

Regards,
--
dim
On Wed, 2009-12-30 at 14:55 -0500, Tom Lane wrote:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value --- this
> is something we can get fairly cheaply if there's a btree index on the
> column. If we can get it, plug it into the histogram, replacing the
> high or low bin boundary. Then estimate as we currently do. This would
> work reasonably well as long as re-analyzes happen at a time scale such
> that the histogram doesn't move much overall, ie, the number of
> insertions between analyzes isn't a lot compared to the number of rows
> per bin. We'd have some linear-in-the-bin-size estimation error because
> the modified last or first bin actually contains more rows than other
> bins, but it would certainly work a lot better than it does now.

Histograms often move quickly, but they seldom change shape. Why not get both max() and min(), then rebase the histogram according to those values. That way the histogram can still move significantly and the technique will still work.

--
Simon Riggs
www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes:
> Why not get both max() and min(), then rebase the histogram according to
> those values. That way the histogram can still move significantly and
> the technique will still work.

Define "rebase", keeping in mind that this has to work on datatypes that we don't have a distance metric for.

			regards, tom lane
On Thu, 2009-12-31 at 15:18 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Why not get both max() and min(), then rebase the histogram according to
>> those values. That way the histogram can still move significantly and
>> the technique will still work.
>
> Define "rebase", keeping in mind that this has to work on datatypes that
> we don't have a distance metric for.

Make it work differently according to whether we have, or not, just as we do elsewhere with stats. No point in limiting ourselves to the lowest common denominator, especially when the common case is integer keys and time datatypes.

--
Simon Riggs
www.2ndQuadrant.com
On Thu, 2009-12-31 at 21:29 +0000, Simon Riggs wrote:
> On Thu, 2009-12-31 at 15:18 -0500, Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> Why not get both max() and min(), then rebase the histogram according to
>>> those values. That way the histogram can still move significantly and
>>> the technique will still work.
>>
>> Define "rebase", keeping in mind that this has to work on datatypes that
>> we don't have a distance metric for.
>
> Make it work differently according to whether we have, or not, just as
> we do elsewhere with stats. No point in limiting ourselves to the lowest
> common denominator, especially when the common case is integer keys and
> time datatypes.

This seemed obvious but I understand now that you meant we don't know that from the datatype definition, so we can't do as I suggested, yet.

--
Simon Riggs
www.2ndQuadrant.com
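For datatypes that do have a usable distance metric -- the integer-key and timestamp cases Simon points at -- one way to read "rebase" is a linear rescale of the stored boundaries onto a freshly probed min()/max(), keeping the histogram's shape. A sketch over doubles, purely to pin the idea down:

#include <stdio.h>

/* Stretch the histogram linearly so its end points land on the current
 * min()/max() obtained from a btree index; interior boundaries keep their
 * relative positions (i.e. the shape is assumed unchanged). */
static void
rebase_histogram(double *bounds, int nbounds, double cur_min, double cur_max)
{
    double  old_min = bounds[0];
    double  old_max = bounds[nbounds - 1];
    double  scale = (cur_max - cur_min) / (old_max - old_min);

    for (int i = 0; i < nbounds; i++)
        bounds[i] = cur_min + (bounds[i] - old_min) * scale;
}

int
main(void)
{
    /* boundaries from the last ANALYZE; the column has since advanced to 1500 */
    double  bounds[] = {0, 400, 700, 900, 1000};

    rebase_histogram(bounds, 5, 0, 1500);
    for (int i = 0; i < 5; i++)
        printf("%g ", bounds[i]);       /* 0 600 1050 1350 1500 */
    printf("\n");
    return 0;
}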
I wrote:
> Actually, in the problematic cases, it's interesting to consider the
> following strategy: when scalarineqsel notices that it's being asked for
> a range estimate that's outside the current histogram bounds, first try
> to obtain the actual current max() or min() of the column value --- this
> is something we can get fairly cheaply if there's a btree index on the
> column. If we can get it, plug it into the histogram, replacing the
> high or low bin boundary. Then estimate as we currently do. This would
> work reasonably well as long as re-analyzes happen at a time scale such
> that the histogram doesn't move much overall, ie, the number of
> insertions between analyzes isn't a lot compared to the number of rows
> per bin. We'd have some linear-in-the-bin-size estimation error because
> the modified last or first bin actually contains more rows than other
> bins, but it would certainly work a lot better than it does now.

I've applied a patch to HEAD that does the above. Can you test it to see how well it fixes your problem?

Looking at the current uses of the histogram stats, there is another place that could possibly benefit from this type of on-demand index search: mergejoinscansel. That code attempts to estimate how much of a column actually needs to be scanned by a merge join, recognizing that a merge will stop reading either input once the other is exhausted. Having an accurate idea of the join keys' upper bounds is fairly important here, since the supposed cost reduction from not scanning all of a table disappears if there's even one large-valued key on the other side of the join. On the other hand, making use of index searches in mergejoinscansel would put these index searches into the standard, very heavily used join planning path, not into a corner case which is all they are right now. So I'm fairly nervous about the potential hit on join planning time.

Is there anyone around who can do planner timing measurements on complex join queries involving large tables? If so, please try CVS HEAD as-is and after enabling this bit in selfuncs.c:

    /*
     * XXX It's very tempting to try to use the actual column min and max,
     * if we can get them relatively cheaply with an index probe. However,
     * since this function is called many times during join planning,
     * that could have unpleasant effects on planning speed. Need more
     * investigation before enabling this.
     */
#ifdef NOT_USED
    if (get_actual_variable_range(root, vardata, sortop, min, max))
        return true;
#endif

			regards, tom lane
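To see why the join keys' upper bounds matter so much there, a simplified standalone model (this is not the mergejoinscansel code, only the effect it estimates): the fraction of one merge input that has to be scanned is roughly the fraction of its keys at or below the other input's maximum key, so an understated inner-side max makes the planner expect an early stop that never happens:

#include <stdio.h>

/* Fraction of rows with key <= limit, from an equi-depth histogram. */
static double
fraction_below(const double *bounds, int nbounds, double limit)
{
    int     nbins = nbounds - 1;

    if (limit >= bounds[nbounds - 1])
        return 1.0;
    if (limit <= bounds[0])
        return 0.0;
    for (int i = 0; i < nbins; i++)
        if (limit < bounds[i + 1])
            return (i + (limit - bounds[i]) / (bounds[i + 1] - bounds[i]))
                   / nbins;
    return 1.0;
}

int
main(void)
{
    double  outer_hist[] = {0, 2500, 5000, 7500, 10000};

    /* believed inner-side max vs. the true current inner-side max */
    printf("scan fraction with stale inner max 1000:  %.2f\n",
           fraction_below(outer_hist, 5, 1000));    /* 0.10 */
    printf("scan fraction with actual inner max 9900: %.2f\n",
           fraction_below(outer_hist, 5, 9900));    /* 0.99 */
    return 0;
}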
On Wed, 2009-12-30 at 17:16 +0100, Tom Lane wrote:
> I think the cleanest solution to this would be to make ANALYZE
> cheaper, perhaps by finding some way for it to work incrementally.

What if, when inserting/deleting a tuple, some random sample of them were passed into an auto-analyze buffer? Then a special process (the auto-analyze daemon) would process them and update the statistics incrementally based on the new values found (which might or might not be mathematically feasible).

The overhead for each backend process would be kept within limits by the rate at which you randomly send (or don't send) the change to the analyze buffer. The processing overhead would be kept within limits by the processing rate of the auto-analyze process, which could be made to sleep periodically or to span multiple processes (on multiprocessor systems). If the buffer is full, then you skip putting entries in it... so it also could autotune itself to a sustainable rate.

Of course, as with all my other posts on hackers, this is all mostly hand-waving; I have no clue about the feasibility of all this with regard to the current state of the code (which I haven't read -- I unfortunately find myself hating reading C code beyond reason, and my writing of it has so far amounted to copy-paste-modify).

Cheers,
Csaba.

--
Csaba Nagy
Software Engineer, eCircle
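Csaba's skip-when-full buffer is close in spirit to reservoir sampling. A sketch of Vitter's Algorithm R, which keeps a fixed-size, uniformly random sample of the values offered to it; a background auto-analyze process could periodically turn such a reservoir into fresh statistics. Hypothetical code only -- a real mechanism would need shared memory, locking, and per-column handling:

#include <stdio.h>
#include <stdlib.h>

#define RESERVOIR_SIZE 1000

typedef struct
{
    double  sample[RESERVOIR_SIZE];
    long    seen;               /* total values offered so far */
} Reservoir;

/* Offer one value; it survives in the sample with probability SIZE/seen. */
static void
reservoir_offer(Reservoir *r, double value)
{
    r->seen++;
    if (r->seen <= RESERVOIR_SIZE)
        r->sample[r->seen - 1] = value;
    else
    {
        long    pos = random() % r->seen;

        if (pos < RESERVOIR_SIZE)
            r->sample[pos] = value;
    }
}

int
main(void)
{
    Reservoir   r = {{0}, 0};

    /* pretend these are values from INSERTed tuples */
    for (long i = 0; i < 1000000; i++)
        reservoir_offer(&r, (double) i);

    printf("kept a %d-value sample out of %ld inserted values\n",
           RESERVOIR_SIZE, r.seen);
    return 0;
}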
Josh Berkus <josh@agliodbs.com> writes:
>> I've applied a patch to HEAD that does the above. Can you test it to
>> see how well it fixes your problem?

> Sure. It'll take us a while to set up a test environment; the database
> is 1TB in size so converting it to 8.5 isn't quick.

Great. When you have it set up, you might want to play with enabling the mergejoinscansel change as well, and see if that is a net plus or minus for you.

			regards, tom lane
> I've applied a patch to HEAD that does the above. Can you test it to
> see how well it fixes your problem?

Sure. It'll take us a while to set up a test environment; the database is 1TB in size so converting it to 8.5 isn't quick.

Will report back.

--Josh
> Great. When you have it set up, you might want to play with enabling
> the mergejoinscansel change as well, and see if that is a net plus or
> minus for you.

How would I *disable* it?

--Josh
Josh Berkus <josh@agliodbs.com> writes:
>> Great. When you have it set up, you might want to play with enabling
>> the mergejoinscansel change as well, and see if that is a net plus or
>> minus for you.

> How would I *disable* it?

It's #ifdef NOT_USED in CVS at the moment.

			regards, tom lane
Hi,
My suggestion is to keep two sets of histograms: one which is generated by running ANALYZE, and another which is dynamically generated using the entries from logging (that is done in insert/update/delete operations). I am not sure how difficult it is to read such record details from the logs.

Basically, from the details mentioned here, what I understand is that the table data (timestamp) is added in an incremental way, i.e. existing data is not modified to a great extent and the new data is merely appended to the old data. In this case, the only work for analyze/statistics generation is to merge the histograms of the newly added records into the old histograms.

If we can treat this case as similar to that of merging histograms in the case of joins involving 2 tables and generating the histograms for the Cartesian (result) node, then all we need to do is somehow generate a temporary histogram for the new set of records and merge it with the old histogram. The information about the interesting columns from the new records can be read from the logging section.

We must already have the part that merges histograms, and I hope that it won't be very costly to make these calls so as to affect the planner. (Further, my opinion is to calculate this cost of histogram generation and use it in costing in some way.)

Further, we can put some threshold limit to make this merge happen automatically: say, if the temporary histograms reach some set threshold, only then will they be merged with the older histograms.
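For illustration, a standalone sketch of the proposed merge step, over doubles only: each histogram is expanded back into representative points weighted by its row count, the points are pooled and sorted, and new equi-depth boundaries are read off. Nothing like this exists in the backend today; it is only meant to make the proposal concrete:

#include <stdio.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
    double  da = *(const double *) a, db = *(const double *) b;

    return (da > db) - (da < db);
}

/* Emit npoints representative values for an equi-depth histogram, uniformly
 * spread within each bin; npoints is chosen proportional to the row count. */
static int
expand(const double *bounds, int nbounds, int npoints, double *out)
{
    int     nbins = nbounds - 1;

    for (int p = 0; p < npoints; p++)
    {
        double  q = (p + 0.5) / npoints;        /* quantile in (0,1) */
        int     bin = (int) (q * nbins);
        double  frac = q * nbins - bin;

        out[p] = bounds[bin] + frac * (bounds[bin + 1] - bounds[bin]);
    }
    return npoints;
}

int
main(void)
{
    /* old histogram covering 9000 rows, new-rows histogram covering 1000 rows */
    double  old_bounds[] = {0, 300, 600, 800, 1000};
    double  new_bounds[] = {1000, 1050, 1100, 1150, 1200};
    double  pooled[10000];      /* sized for 9000 + 1000 representative points */
    double  merged[5];
    int     n = 0;

    n += expand(old_bounds, 5, 9000, pooled + n);
    n += expand(new_bounds, 5, 1000, pooled + n);
    qsort(pooled, n, sizeof(double), cmp_double);

    for (int i = 0; i < 5; i++)                 /* new equi-depth boundaries */
        merged[i] = pooled[(int) ((double) i / 4 * (n - 1))];

    for (int i = 0; i < 5; i++)
        printf("%.0f ", merged[i]);
    printf("\n");
    return 0;
}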
Please pass on your inputs.
Regards,
Chetan
On Wed, Dec 30, 2009 at 8:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> My thoughts on dealing with this intelligently without a major change to
>> statistics gathering went along these lines:
>> 1. add columns to pg_statistic to hold estimates of upper and lower
>> bounds growth between analyzes.
>
> This seems like a fundamentally broken approach, first because "time
> between analyzes" is not even approximately a constant, and second
> because it assumes that we have a distance metric for all datatypes.
> (Note that convert_to_scalar does not assume that it can measure
> arbitrary distances, but only fractions *within* a histogram bucket;
> and even that is pretty shaky.)
>
> I don't have a better idea at the moment :-(
>
>			regards, tom lane
On Tue, Jan 5, 2010 at 7:49 AM, Chetan Suttraway <chetan.suttraway@enterprisedb.com> wrote:
> if we can treat this case as similar to that of merging of histograms in
> case of joins involving 2 tables and generating the histograms for the
> cartesian (result) node,

...which you can't, because it's totally different, so I think the rest of this is a dead end.

...Robert