Thread: Re: [COMMITTERS] pgsql: Implement width_bucket() for the float8 data
From: Stefan Kaltenbrunner

Neil Conway wrote:
> Log Message:
> -----------
> Implement width_bucket() for the float8 data type.

this seems to require an alternative regression output file on Windows:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-17%2006:30:00
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bandicoot&dt=2007-01-17%2002:15:02

curiously it seems that these boxes fail the stats test too - maybe fallout from the autovacuum changes?

Stefan

On Wed, 2007-01-17 at 08:51 +0100, Stefan Kaltenbrunner wrote:
> this seems to require an alternative regression output file on windows

Hmm, right. Easiest fix seems to be just removing the platform-dependent output from the regression test, since it wasn't necessary -- committed to CVS HEAD. (I placed the regression tests for the float8 version of width_bucket() in numeric.sql to keep them close to the tests for the numeric version of width_bucket(). Push come to shove, I can always move the former over to float8.sql and then add platform-dependent output as necessary.)

> curiously it seems that these boxes fail the stats test too

Seems an unrelated problem, I think.

-Neil

Neil Conway wrote:
> On Wed, 2007-01-17 at 08:51 +0100, Stefan Kaltenbrunner wrote:
>> this seems to require an alternative regression output file on windows
>
> Hmm, right. Easiest fix seems to be just removing the platform-dependent
> output from the regression test, since it wasn't necessary -- committed
> to CVS HEAD. (I placed the regression tests for the float8 version of
> width_bucket() in numeric.sql to keep them close to the tests for the
> numeric version of width_bucket(). Push come to shove, I can always move
> the former over to float8.sql and then add platform-dependent output as
> necessary.)
>
>> curiously it seems that these boxes fail the stats test too
>
> Seems an unrelated problem, I think.

yeah - looks like it's the autovacuum change - snake is now passing the numeric test but still fails the stats one ...

Stefan

Stefan Kaltenbrunner wrote:
> Neil Conway wrote:
> > On Wed, 2007-01-17 at 08:51 +0100, Stefan Kaltenbrunner wrote:
> >> this seems to require an alternative regression output file on windows
> >
> > Hmm, right. Easiest fix seems to be just removing the platform-dependent
> > output from the regression test, since it wasn't necessary -- committed
> > to CVS HEAD. (I placed the regression tests for the float8 version of
> > width_bucket() in numeric.sql to keep them close to the tests for the
> > numeric version of width_bucket(). Push come to shove, I can always move
> > the former over to float8.sql and then add platform-dependent output as
> > necessary.)
> >
> >> curiously it seems that these boxes fail the stats test too
> >
> > Seems an unrelated problem, I think.
>
> yeah - looks like it's the autovacuum change - snake is now passing the
> numeric-test but still fails the stats one ...

Interesting -- both yak and snake are failing in a very similar way.
I'll investigate it tomorrow if no one beats me to it.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:
> > yeah - looks like it's the autovacuum change - snake is now passing the
> > numeric-test but still fails the stats one ...
>
> Interesting -- both yak and snake are failing in a very similar way.
> I'll investigate it tomorrow if no one beats me to it.

All our Windows buildfarm machines are failing. AFAICT, the first
failure was on Yak,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20

and the last successful run just before that seems to come from Snake,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00

The only changes that went in in that period are the patch that enabled
autovacuum by default, an information_schema fix and a TODO file change.
The only one that could cause this problem seems to be the autovacuum enable
bit.

The failures are all exactly alike:

*** ./expected/stats.out  Thu Jan 18 08:48:12 2007
--- ./results/stats.out  Thu Jan 18 09:02:53 2007
***************
*** 51,57 ****
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column? | ?column? | ?column?
  ----------+----------+----------+----------
!  t        | t        | t        | t
  (1 row)

  SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
--- 51,57 ----
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column? | ?column? | ?column?
  ----------+----------+----------+----------
!  f        | f        | f        | f
  (1 row)

  SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
***************
*** 60,66 ****
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column?
  ----------+----------
!  t        | t
  (1 row)

  -- End of Stats Test
--- 60,66 ----
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column?
  ----------+----------
!  f        | f
  (1 row)

  -- End of Stats Test

The full failing queries are these:

-- check effects
SELECT st.seq_scan >= pr.seq_scan + 1,
       st.seq_tup_read >= pr.seq_tup_read + cl.reltuples,
       st.idx_scan >= pr.idx_scan + 1,
       st.idx_tup_fetch >= pr.idx_tup_fetch + 1
  FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';
 ?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
 t        | t        | t        | t
(1 row)

SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
       st.idx_blks_read + st.idx_blks_hit >= pr.idx_blks + 1
  FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';
 ?column? | ?column?
----------+----------
 t        | t
(1 row)

The six booleans are false on Windows.

What could be the reason for this change? The only thing that occurs to me is that autovacuum is firing just when running that test, it processes that table and increments the counters before the final SQL is run.

Now, if some Windows-enabled person could step forward so that we can suggest some tests to run, that would be great. Perhaps the solution to the problem is to relax the conditions a little, so that two scans are accepted on that table instead of only one; but it would be good to confirm whether the stat system is really working and it's really still counting stuff as it's supposed to do.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Now, if some Windows-enabled person could step forward so that we can
> suggest some tests to run, that would be great. Perhaps the solution to
> the problem is to relax the conditions a little, so that two scans are
> accepted on that table instead of only one; but it would be good to
> confirm whether the stat system is really working and it's really still
> counting stuff as it's supposed to do.

No, you misread it: the check is for at least one new event, not exactly
one.

We've been seeing this intermittently for a long time, but it sure seems
that autovac has raised the probability greatly. That's pretty odd.
If it's a timing thing, why are all and only the Windows machines
affected? Could it be that autovac is sucking all the spare cycles
and keeping the stats collector from running?

(Does autovac use vacuum_cost_delay by default? It probably should if
not.)

I noticed today on my own machine several strange pauses while running
the serial regression tests --- the machine didn't seem to be hitting
the disk nor sucking lots of CPU, it just sat there for several seconds
and then picked up again. I wonder if that's related. It sure seems it
must be due to autovac being on now.

			regards, tom lane

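To make the "at least one new event, not exactly one" point concrete, here is a minimal Python sketch of the comparison the first stats query performs. The names `TableStats` and `check_effects` and all the counter values are hypothetical, chosen only for illustration; the real test is the SQL quoted earlier in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Hypothetical snapshot of the counters compared by the stats test."""
    seq_scan: int
    seq_tup_read: int
    idx_scan: int
    idx_tup_fetch: int


def check_effects(prev: TableStats, cur: TableStats, reltuples: int):
    """Mirror the four '>=' comparisons of the first stats query.

    Each check only requires that the counter grew by AT LEAST the
    expected amount, so additional activity cannot make it fail.
    """
    return (
        cur.seq_scan >= prev.seq_scan + 1,
        cur.seq_tup_read >= prev.seq_tup_read + reltuples,
        cur.idx_scan >= prev.idx_scan + 1,
        cur.idx_tup_fetch >= prev.idx_tup_fetch + 1,
    )


# Made-up numbers: a snapshot taken before the test activity ...
prev = TableStats(seq_scan=3, seq_tup_read=30000, idx_scan=5, idx_tup_fetch=5)
# ... counters that grew by more than the minimum (extra activity) ...
extra = TableStats(seq_scan=5, seq_tup_read=50000, idx_scan=7, idx_tup_fetch=9)
# ... and counters that have not been updated yet when the query runs.
stale = prev

print(check_effects(prev, extra, reltuples=10000))
print(check_effects(prev, stale, reltuples=10000))
```

Under these assumptions, extra scans still pass every check, while counters that have not yet been refreshed fail all of them, which matches the all-false pattern in the buildfarm diff and points towards a timing problem rather than over-counting.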
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Now, if some Windows-enabled person could step forward so that we can
> > suggest some tests to run, that would be great. Perhaps the solution to
> > the problem is to relax the conditions a little, so that two scans are
> > accepted on that table instead of only one; but it would be good to
> > confirm whether the stat system is really working and it's really still
> > counting stuff as it's supposed to do.
>
> No, you misread it: the check is for at least one new event, not exactly
> one.

Doh :-(

> We've been seeing this intermittently for a long time, but it sure seems
> that autovac has raised the probability greatly. That's pretty odd.
> If it's a timing thing, why are all and only the Windows machines
> affected? Could it be that autovac is sucking all the spare cycles
> and keeping the stats collector from running?

Hmm, that could explain it, but it's strange that only Windows machines
are affected. Maybe it's a scheduler issue, and the Unix machines are
able to let pgstat do some work but Windows are not.

> (Does autovac use vacuum_cost_delay by default? It probably should if
> not.)

The default autovacuum_vacuum_cost_delay is -1, which means "use the system default", which in turn is 0. So it's off by default.

> I noticed today on my own machine several strange pauses while running
> the serial regression tests --- the machine didn't seem to be hitting
> the disk nor sucking lots of CPU, it just sat there for several seconds
> and then picked up again. I wonder if that's related. It sure seems it
> must be due to autovac being on now.

Hmm, strange; I ran the tests several times today testing Magnus's changes, and I didn't notice any pause. It was mostly the parallel tests though; I'll try serial.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

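The fallback described above (-1 means "use the system default", which is 0) can be sketched as follows. This is an illustration of the rule as stated in the message, not the server's code; the function name is hypothetical and the parameters simply reuse the setting names.

```python
def effective_autovacuum_cost_delay(autovacuum_vacuum_cost_delay: int = -1,
                                    vacuum_cost_delay: int = 0) -> int:
    """Return the cost delay autovacuum would use under the stated rule.

    -1 is treated as "inherit vacuum_cost_delay"; any other value is used
    as given. The defaults (-1 and 0) are the ones quoted in the thread.
    """
    if autovacuum_vacuum_cost_delay == -1:
        return vacuum_cost_delay
    return autovacuum_vacuum_cost_delay


# With both defaults, the delay resolves to 0, i.e. no throttling.
print(effective_autovacuum_cost_delay())
# An explicit autovacuum setting takes precedence (20 is a made-up value).
print(effective_autovacuum_cost_delay(autovacuum_vacuum_cost_delay=20))
```

With both settings left at their defaults the result is 0, which is why autovacuum runs unthrottled unless one of the two is changed.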
Tom Lane wrote:
> I noticed today on my own machine several strange pauses while running
> the serial regression tests --- the machine didn't seem to be hitting
> the disk nor sucking lots of CPU, it just sat there for several seconds
> and then picked up again. I wonder if that's related. It sure seems it
> must be due to autovac being on now.

The only pauses I see are in the "stats" and the "prepared_xacts" tests. The latter is due to a test that uses statement_timeout to detect a lock, and the stats test does a pg_sleep(2.0) call. Do those explain what you are seeing?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Stefan Kaltenbrunner wrote:
>
>>> yeah - looks like it's the autovacuum change - snake is now passing the
>>> numeric-test but still fails the stats one ...
>> Interesting -- both yak and snake are failing in a very similar way.
>> I'll investigate it tomorrow if no one beats me to it.
>
> All our Windows buildfarm machines are failing. AFAICT, the first
> failure was on Yak,
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20
>
> and the last successful run just before that seems to come from Snake,
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00
>
> The only changes that went in in that period are the patch that enabled
> autovacuum by default, an information_schema fix and a TODO file change.
> The only one that could cause this problem seems to be the autovacuum enable
> bit.

I think this one:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bear&dt=2007-01-19%2006:06:02

is fallout from the autovacuum changes too - it seems that initdb is picking a low value (20) for max_connections on that box and autovacuum is acting as an additional client that will cause the maximum number of allowed connections to be exceeded during the parallel tests, therefore resulting in the failure.

Stefan

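The arithmetic behind this diagnosis can be sketched in a few lines of Python. It follows Stefan's reading that autovacuum occupies one connection slot alongside the test sessions; the function name is hypothetical, the single-slot figure is an assumption, and reserved superuser slots are ignored for simplicity.

```python
def exceeds_connection_limit(max_connections: int,
                             test_sessions: int,
                             autovacuum_slots: int = 1) -> bool:
    """True if the test sessions plus autovacuum need more slots than exist.

    autovacuum_slots=1 is an assumption taken from the message above
    ("autovacuum is acting as an additional client").
    """
    return test_sessions + autovacuum_slots > max_connections


# initdb picked max_connections = 20 and the parallel tests use 20 sessions.
print(exceeds_connection_limit(max_connections=20, test_sessions=20))
# Without autovacuum, the same configuration just fits.
print(exceeds_connection_limit(max_connections=20, test_sessions=20,
                               autovacuum_slots=0))
```

On this reading, a configuration that exactly fitted the 20 parallel sessions before autovacuum was enabled is now one slot short.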
Stefan Kaltenbrunner wrote:
> Alvaro Herrera wrote:
>
>> Alvaro Herrera wrote:
>>
>>> Stefan Kaltenbrunner wrote:
>>>
>>>> yeah - looks like it's the autovacuum change - snake is now passing the
>>>> numeric-test but still fails the stats one ...
>>>>
>>> Interesting -- both yak and snake are failing in a very similar way.
>>> I'll investigate it tomorrow if no one beats me to it.
>>>
>> All our Windows buildfarm machines are failing. AFAICT, the first
>> failure was on Yak,
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20
>>
>> and the last successful run just before that seems to come from Snake,
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00
>>
>> The only changes that went in in that period are the patch that enabled
>> autovacuum by default, an information_schema fix and a TODO file change.
>> The only one that could cause this problem seems to be the autovacuum enable
>> bit.
>>
>
> I think this one:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bear&dt=2007-01-19%2006:06:02
>
> is fallout from the autovacuum changes too - it seems that initdb is
> picking a low value (20) for max_connections on that box and autovacuum
> is acting as an additional client that will cause the maximum of allowed
> connections to exceed during the parallel tests and therefore resulting
> in the failure.
>

If so, that's a case of driver error, I think. The buildfarm member should set MAX_CONNECTIONS => '10' or similar in the build_env stanza of the config file.

cheers

andrew

Stefan Kaltenbrunner wrote:
> Alvaro Herrera wrote:
> > All our Windows buildfarm machines are failing. AFAICT, the first
> > failure was on Yak,
> > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20
> >
> > and the last successful run just before that seems to come from Snake,
> > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00
>
> I think this one:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bear&dt=2007-01-19%2006:06:02
>
> is fallout from the autovacuum changes too - it seems that initdb is
> picking a low value (20) for max_connections on that box and autovacuum
> is acting as an additional client that will cause the maximum of allowed
> connections to exceed during the parallel tests and therefore resulting
> in the failure.

Sorry, I forgot to mention that I specifically skipped those errors not directly related to the problem at hand. This problem is clearly something else (as well as the Mac OS X failures due to readline misconfiguration, the ECPG-check failures, etc).

I concur with Andrew's suggestion that it's really pilot error. Maybe what we really ought to do is pick an internal max_connections value that exceeds what the max_connections GUC parameter says, adjusting per autovacuum configuration.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I noticed today on my own machine several strange pauses while running
>> the serial regression tests ---

> Do those explain what you are seeing?

No, those are expected. I'm having a hard time reproducing the behavior right now, but IIRC the delays were in the vacuum and/or sanity_check tests. It's not unlikely that the foreground VACUUM was blocking on a lock while autovac did the same work, except that that doesn't explain the length of the pause, nor the lack of disk activity. But I can't make it happen right now, so never mind until I figure out how to reproduce it ...

			regards, tom lane

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Maybe what we really ought to do is pick an internal max_connections
> value that exceeds what the max_connections GUC parameter says, adjusting
> per autovacuum configuration.

That's just cosmetic; it doesn't address the real issue, which is that if SHMMAX or other kernel settings are too small, initdb will pick a max_connections too low to allow the parallel regression tests to run.

The fact that the regression tests try to exercise 20 concurrent sessions by default isn't just an accident; the thought was that if you had a configuration too small to allow a reasonable number of concurrent sessions, the tests ought to point it out to you. (Indeed, these days we probably oughta try to exercise more than 20 sessions.) But this is somewhat in conflict with our desire that buildfarm members not fall over for random reasons --- and we've seen it happen more than once that a test run's initdb picks a smaller-than-normal max_connections because of transient system loads.

Perhaps we could extend pg_regress to allow "--max-connections=auto"
which would instruct it to set its connection limit to the server's
actual max_connections minus superuser reserved slots (and probably
minus a couple more to allow for backend shutdown time etc). Then the
buildfarm could use that, while we'd leave the behavior alone for normal
manual regression tests.

			regards, tom lane

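As a rough illustration of the proposed "auto" rule, the limit could be computed as below. This is only a sketch of the idea floated in this message, not an existing pg_regress feature; the function name, the slack of 2 ("a couple more") and the reserved-slot count used in the example are all assumptions.

```python
def auto_connection_limit(max_connections: int,
                          superuser_reserved: int,
                          shutdown_slack: int = 2) -> int:
    """Connection limit under the proposed 'auto' rule.

    limit = max_connections - superuser reserved slots - a small slack
    for backends that are still shutting down. shutdown_slack=2 is an
    assumed reading of "a couple more". The result is clamped to at
    least 1 so that the tests can always run serially.
    """
    return max(1, max_connections - superuser_reserved - shutdown_slack)


# Example: initdb settled on 20 connections; 2 reserved slots is an
# illustrative value, not one taken from the thread.
print(auto_connection_limit(max_connections=20, superuser_reserved=2))
```

With these illustrative inputs the limit comes out at 16 parallel sessions, which leaves room for a background client without changing what a manual test run does by default.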
Tom Lane wrote:
>
> Perhaps we could extend pg_regress to allow "--max-connections=auto"
> which would instruct it to set its connection limit to the server's
> actual max_connections minus superuser reserved slots (and probably
> minus a couple more to allow for backend shutdown time etc). Then the
> buildfarm could use that, while we'd leave the behavior alone for normal
> manual regression tests.
>

This seems needlessly complex. We can tolerate occasional intermittent failures on buildfarm, and if they are persistent there is already a configurable rate limiting mechanism available.

cheers

andrew

Alvaro Herrera wrote:
> Tom Lane wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> Now, if some Windows-enabled person could step forward so that we can
>>> suggest some tests to run, that would be great. Perhaps the solution to
>>> the problem is to relax the conditions a little, so that two scans are
>>> accepted on that table instead of only one; but it would be good to
>>> confirm whether the stat system is really working and it's really still
>>> counting stuff as it's supposed to do.
>> No, you misread it: the check is for at least one new event, not exactly
>> one.
>
> Doh :-(
>
>> We've been seeing this intermittently for a long time, but it sure seems
>> that autovac has raised the probability greatly. That's pretty odd.
>> If it's a timing thing, why are all and only the Windows machines
>> affected? Could it be that autovac is sucking all the spare cycles
>> and keeping the stats collector from running?
>
> Hmm, that could explain it, but it's strange that only Windows machines
> are affected. Maybe it's a scheduler issue, and the Unix machines are
> able to let pgstat do some work but Windows are not.

maybe not only windows boxes:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=zebra&dt=2007-01-20%2015:25:05

Stefan

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Alvaro Herrera wrote:
>> Hmm, that could explain it, but it's strange that only Windows machines
>> are affected. Maybe it's a scheduler issue, and the Unix machines are
>> able to let pgstat do some work but Windows are not.

> maybe not only windows boxes:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=zebra&dt=2007-01-20%2015:25:05

That one's interesting because only the first of the two queries failed. I suppose that must mean that the stats file did update, but between those two queries. Maybe we just need to lengthen the sleep() even more?

			regards, tom lane

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> maybe not only windows boxes:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=zebra&dt=2007-01-20%2015:25:05

Wow, I just saw the stats failure on my own machine, for the first time ever. Conclusions:

1. Enabling autovac has definitely raised the probability of failure.

2. It's not Windows-only, but the probability of failure is much higher on Windows.

Not sure what that tells us, though ...

			regards, tom lane