Thread: Re: [COMMITTERS] pgsql: Implement width_bucket() for the float8 data

Re: [COMMITTERS] pgsql: Implement width_bucket() for the float8 data

From
Stefan Kaltenbrunner
Date:
Neil Conway wrote:
> Log Message:
> -----------
> Implement width_bucket() for the float8 data type.

this seems to require an alternative regression output file on windows:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-17%2006:30:00
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bandicoot&dt=2007-01-17%2002:15:02

curiously it seems that these boxes fail the stats test too -
maybe fallout from the autovacuum changes?


Stefan


Re: [COMMITTERS] pgsql: Implement width_bucket() for the

From
Neil Conway
Date:
On Wed, 2007-01-17 at 08:51 +0100, Stefan Kaltenbrunner wrote:
> this seems to require an alternative regression output file on windows

Hmm, right. Easiest fix seems to be just removing the platform-dependent
output from the regression test, since it wasn't necessary -- committed
to CVS HEAD. (I placed the regression tests for the float8 version of
width_bucket() in numeric.sql to keep them close to the tests for the
numeric version of width_bucket(). If push comes to shove, I can always move
the former over to float8.sql and then add platform-dependent output as
necessary.)

> curiously it seems that these boxes fail the stats test too

Seems an unrelated problem, I think.

-Neil




Re: [COMMITTERS] pgsql: Implement width_bucket() for the

From
Stefan Kaltenbrunner
Date:
Neil Conway wrote:
> On Wed, 2007-01-17 at 08:51 +0100, Stefan Kaltenbrunner wrote:
>> this seems to require an alternative regression output file on windows
> 
> Hmm, right. Easiest fix seems to be just removing the platform-dependent
> output from the regression test, since it wasn't necessary -- committed
> to CVS HEAD. (I placed the regression tests for the float8 version of
> width_bucket() in numeric.sql to keep them close to the tests for the
> numeric version of width_bucket(). If push comes to shove, I can always move
> the former over to float8.sql and then add platform-dependent output as
> necessary.)
> 
>> curiously it seems that these boxes fail the stats test too
> 
> Seems an unrelated problem, I think.

yeah - looks like it's the autovacuum change - snake is now passing the
numeric-test but still fails the stats one ...


Stefan


Re: [COMMITTERS] pgsql: Implement width_bucket() for the

From
Alvaro Herrera
Date:
Stefan Kaltenbrunner wrote:
> Neil Conway wrote:
> > On Wed, 2007-01-17 at 08:51 +0100, Stefan Kaltenbrunner wrote:
> >> this seems to require an alternative regression output file on windows
> > 
> > Hmm, right. Easiest fix seems to be just removing the platform-dependent
> > output from the regression test, since it wasn't necessary -- committed
> > to CVS HEAD. (I placed the regression tests for the float8 version of
> > width_bucket() in numeric.sql to keep them close to the tests for the
> > numeric version of width_bucket(). If push comes to shove, I can always move
> > the former over to float8.sql and then add platform-dependent output as
> > necessary.)
> > 
> >> curiously it seems that these boxes fail the stats test too
> > 
> > Seems an unrelated problem, I think.
> 
> yeah - looks like it's the autovacuum change - snake is now passing the
> numeric-test but still fails the stats one ...

Interesting -- both yak and snake are failing in a very similar way.
I'll investigate it tomorrow if no one beats me to it.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Windows buildfarm failures

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Stefan Kaltenbrunner wrote:

> > yeah - looks like it's the autovacuum change - snake is now passing the
> > numeric-test but still fails the stats one ...
> 
> Interesting -- both yak and snake are failing in a very similar way.
> I'll investigate it tomorrow if no one beats me to it.

All our Windows buildfarm machines are failing.  AFAICT, the first
failure was on Yak, 
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20

and the last successful run just before that seems to come from Snake,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00

The only changes that went in in that period are the patch that enabled
autovacuum by default, an information_schema fix and a TODO file change.
The only one that could cause this problem seems to be the autovacuum
enable bit.

The failures are all exactly alike:

*** ./expected/stats.out    Thu Jan 18 08:48:12 2007
--- ./results/stats.out    Thu Jan 18 09:02:53 2007
***************
*** 51,57 ****
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column? | ?column? | ?column?
  ----------+----------+----------+----------
!  t        | t        | t        | t
  (1 row)
  
  SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
--- 51,57 ----
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column? | ?column? | ?column?
  ----------+----------+----------+----------
!  f        | f        | f        | f
  (1 row)
  
  SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
***************
*** 60,66 ****
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column?
  ----------+----------
!  t        | t
  (1 row)
  
  -- End of Stats Test
--- 60,66 ----
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column?
  ----------+----------
!  f        | f
  (1 row)
  
  -- End of Stats Test


The full failing queries are these:

-- check effects
SELECT st.seq_scan >= pr.seq_scan + 1,
       st.seq_tup_read >= pr.seq_tup_read + cl.reltuples,
       st.idx_scan >= pr.idx_scan + 1,
       st.idx_tup_fetch >= pr.idx_tup_fetch + 1
  FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';
 ?column? | ?column? | ?column? | ?column?
----------+----------+----------+----------
 t        | t        | t        | t
(1 row)

SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
       st.idx_blks_read + st.idx_blks_hit >= pr.idx_blks + 1
  FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';
 ?column? | ?column?
----------+----------
 t        | t
(1 row)

The six booleans are false on Windows.

What could be the reason for this change?  The only thing that occurs to
me is that autovacuum fires just as that test is running, processes
that table, and increments the counters before the final SQL is run.

Now, if some Windows-enabled person could step forward so that we can
suggest some tests to run, that would be great.  Perhaps the solution to
the problem is to relax the conditions a little, so that two scans are
accepted on that table instead of only one; but it would be good to
confirm whether the stat system is really working and it's really still
counting stuff as it's supposed to do.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Windows buildfarm failures

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Now, if some Windows-enabled person could step forward so that we can
> suggest some tests to run, that would be great.  Perhaps the solution to
> the problem is to relax the conditions a little, so that two scans are
> accepted on that table instead of only one; but it would be good to
> confirm whether the stat system is really working and it's really still
> counting stuff as it's supposed to do.

No, you misread it: the check is for at least one new event, not exactly
one.
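
For illustration only, the "at least one new event" semantics can be
sketched in Python.  The helper name counter_advanced and the sample
numbers are made up for this sketch; the real check is the SQL
comparison in the stats regression test.

```python
def counter_advanced(before, after, minimum_delta=1):
    """Return True when a counter grew by at least minimum_delta.

    This mirrors a comparison of the form "st.seq_scan >= pr.seq_scan + 1":
    extra events never make the check fail, only missing ones do.
    """
    return after >= before + minimum_delta


# Hypothetical snapshots, not taken from any buildfarm log.
previous_seq_scan = 10

print(counter_advanced(previous_seq_scan, 11))  # one new scan
print(counter_advanced(previous_seq_scan, 12))  # an additional scan from any source
print(counter_advanced(previous_seq_scan, 10))  # counter never moved
```

Under this reading, an all-false result means the counters visible to
the test had not advanced at all by the time the queries ran, not that
they advanced too far.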

We've been seeing this intermittently for a long time, but it sure seems
that autovac has raised the probability greatly.  That's pretty odd.
If it's a timing thing, why are all and only the Windows machines
affected?  Could it be that autovac is sucking all the spare cycles
and keeping the stats collector from running?  (Does autovac use
vacuum_cost_delay by default?  It probably should if not.)

I noticed today on my own machine several strange pauses while running
the serial regression tests --- the machine didn't seem to be hitting
the disk nor sucking lots of CPU, it just sat there for several seconds
and then picked up again.  I wonder if that's related.  It sure seems it
must be due to autovac being on now.
        regards, tom lane


Re: Windows buildfarm failures

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Now, if some Windows-enabled person could step forward so that we can
> > suggest some tests to run, that would be great.  Perhaps the solution to
> > the problem is to relax the conditions a little, so that two scans are
> > accepted on that table instead of only one; but it would be good to
> > confirm whether the stat system is really working and it's really still
> > counting stuff as it's supposed to do.
> 
> No, you misread it: the check is for at least one new event, not exactly
> one.

Doh :-(

> We've been seeing this intermittently for a long time, but it sure seems
> that autovac has raised the probability greatly.  That's pretty odd.
> If it's a timing thing, why are all and only the Windows machines
> affected?  Could it be that autovac is sucking all the spare cycles
> and keeping the stats collector from running?

Hmm, that could explain it, but it's strange that only Windows machines
are affected.  Maybe it's a scheduler issue, and the Unix machines are
able to let pgstat do some work but the Windows ones are not.

> (Does autovac use vacuum_cost_delay by default?  It probably should if
> not.)

The default autovacuum_vacuum_cost_delay is -1, which means "use the
system default", which in turn is 0.  So it's off by default.
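
As a rough sketch of that fallback rule, assuming only the behaviour
described in this message (-1 defers to vacuum_cost_delay, and 0 means
no throttling), the effective value could be resolved like this.  The
function name is invented for the example and is not a server API.

```python
def effective_autovacuum_cost_delay(autovacuum_vacuum_cost_delay, vacuum_cost_delay):
    """Resolve the cost delay autovacuum would end up using.

    A setting of -1 means "use the system default", i.e. fall back to
    vacuum_cost_delay; any other value is used as given.
    """
    if autovacuum_vacuum_cost_delay == -1:
        return vacuum_cost_delay
    return autovacuum_vacuum_cost_delay


# Defaults as stated above: -1 for autovacuum, 0 for the system default.
delay = effective_autovacuum_cost_delay(-1, 0)
print(delay)  # 0, so cost-based delay is off
```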

> I noticed today on my own machine several strange pauses while running
> the serial regression tests --- the machine didn't seem to be hitting
> the disk nor sucking lots of CPU, it just sat there for several seconds
> and then picked up again.  I wonder if that's related.  It sure seems it
> must be due to autovac being on now.

Hmm, strange; I ran the tests several times today testing Magnus'
changes, and I didn't notice any pause.  It was mostly the parallel
tests though; I'll try serial.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Windows buildfarm failures

From
Alvaro Herrera
Date:
Tom Lane wrote:

> I noticed today on my own machine several strange pauses while running
> the serial regression tests --- the machine didn't seem to be hitting
> the disk nor sucking lots of CPU, it just sat there for several seconds
> and then picked up again.  I wonder if that's related.  It sure seems it
> must be due to autovac being on now.

The only pauses I see are in the "stats" and the "prepared_xacts"
tests.  The latter is due to a test that uses statement_timeout to
detect a lock, and the stats test does a pg_sleep(2.0) call.

Do those explain what you are seeing?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Windows buildfarm failures

From
Stefan Kaltenbrunner
Date:
Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Stefan Kaltenbrunner wrote:
> 
>>> yeah - looks like it's the autovacuum change - snake is now passing the
>>> numeric-test but still fails the stats one ...
>> Interesting -- both yak and snake are failing in a very similar way.
>> I'll investigate it tomorrow if no one beats me to it.
> 
> All our Windows buildfarm machines are failing.  AFAICT, the first
> failure was on Yak, 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20
> 
> and the last successful run just before that seems to come from Snake,
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00
> 
> The only changes that went in in that period are the patch that enabled
> autovacuum by default, an information_schema fix and a TODO file change.
> The only one that could cause this problem seems to be the autovacuum
> enable bit.

I think this one:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bear&dt=2007-01-19%2006:06:02

is fallout from the autovacuum changes too - it seems that initdb is
picking a low value (20) for max_connections on that box and autovacuum
is acting as an additional client, which pushes the number of connections
past the allowed maximum during the parallel tests and therefore results
in the failure.


Stefan


Re: Windows buildfarm failures

From
Andrew Dunstan
Date:
Stefan Kaltenbrunner wrote:
> Alvaro Herrera wrote:
>   
>> Alvaro Herrera wrote:
>>     
>>> Stefan Kaltenbrunner wrote:
>>>       
>>>> yeah - looks like it's the autovacuum change - snake is now passing the
>>>> numeric-test but still fails the stats one ...
>>>>         
>>> Interesting -- both yak and snake are failing in a very similar way.
>>> I'll investigate it tomorrow if no one beats me to it.
>>>       
>> All our Windows buildfarm machines are failing.  AFAICT, the first
>> failure was on Yak, 
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20
>>
>> and the last successful run just before that seems to come from Snake,
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00
>>
>> The only changes that went in in that period are the patch that enabled
>> autovacuum by default, an information_schema fix and a TODO file change.
>> The only one that could cause this problem seems to be the autovacuum
>> enable bit.
>>     
>
> I think this one:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bear&dt=2007-01-19%2006:06:02
>
> is fallout from the autovacuum changes too - it seems that initdb is
> picking a low value (20) for max_connections on that box and autovacuum
> is acting as an additional client, which pushes the number of connections
> past the allowed maximum during the parallel tests and therefore results
> in the failure.
>
>
>
>   

If so, that's a case of driver error, I think. The buildfarm member 
should set MAX_CONNECTIONS => '10' or similar in the build_env stanza of 
the config file.

cheers

andrew



Re: Windows buildfarm failures

From
Alvaro Herrera
Date:
Stefan Kaltenbrunner wrote:
> Alvaro Herrera wrote:

> > All our Windows buildfarm machines are failing.  AFAICT, the first
> > failure was on Yak, 
> > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20
> > 
> > and the last successful run just before that seems to come from Snake,
> > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00
> 
> I think this one:
> 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bear&dt=2007-01-19%2006:06:02
> 
> is fallout from the autovacuum changes too - it seems that initdb is
> picking a low value (20) for max_connections on that box and autovacuum
> is acting as an additional client, which pushes the number of connections
> past the allowed maximum during the parallel tests and therefore results
> in the failure.

Sorry, I forgot to mention that I specifically skipped those errors not
directly related to the problem at hand.  This problem is clearly
something else (as well as the Mac OS X failures due to readline
misconfiguration, the ECPG-check failures, etc).  I concur with Andrew's
suggestion that it's really pilot error.

Maybe what we really ought to do is pick an internal max_connections
value that exceeds what the max_connections GUC parameter says, adjusting
per autovacuum configuration.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Windows buildfarm failures

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I noticed today on my own machine several strange pauses while running
>> the serial regression tests ---

> Do those explain what you are seeing?

No, those are expected.  I'm having a hard time reproducing the behavior
right now, but IIRC the delays were in the vacuum and/or sanity_check
tests.  It's not unlikely that the foreground VACUUM was blocking on a
lock while autovac did the same work, except that that doesn't explain
the length of the pause, nor the lack of disk activity.

But I can't make it happen right now, so never mind until I figure out
how to reproduce it ...
        regards, tom lane


Re: Windows buildfarm failures

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Maybe what we really ought to do is pick an internal max_connections
> value that exceeds what the max_connections GUC parameter says, adjusting
> per autovacuum configuration.

That's just cosmetic; it doesn't address the real issue, which is that
if SHMMAX or other kernel settings are too small, initdb will pick a
max_connections too low to allow the parallel regression tests to run.

The fact that the regression tests try to exercise 20 concurrent
sessions by default isn't just an accident; the thought was that if you
had a configuration too small to allow a reasonable number of concurrent
sessions, the tests ought to point it out to you.  (Indeed, these days
we probably oughta try to exercise more than 20 sessions.)

But this is somewhat in conflict with our desire that buildfarm members
not fall over for random reasons --- and we've seen it happen more than
once that a test run's initdb picks a smaller-than-normal
max_connections because of transient system loads.

Perhaps we could extend pg_regress to allow "--max-connections=auto"
which would instruct it to set its connection limit to the server's
actual max_connections minus superuser reserved slots (and probably
minus a couple more to allow for backend shutdown time etc).  Then the
buildfarm could use that, while we'd leave the behavior alone for normal
manual regression tests.
        regards, tom lane
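
The arithmetic behind the proposed "--max-connections=auto" could look
roughly like the Python sketch below.  This is not pg_regress code: the
function name, the slack of two slots, and the sample settings are
assumptions chosen only to illustrate the idea of subtracting the
superuser-reserved slots plus a small margin from the server's limit.

```python
def auto_connection_limit(max_connections, superuser_reserved, shutdown_slack=2):
    """Estimate how many concurrent test sessions a server can accept.

    max_connections     -- the server's actual limit (e.g. as chosen by initdb)
    superuser_reserved  -- slots held back for superusers
    shutdown_slack      -- hypothetical margin for backends still shutting down
    """
    limit = max_connections - superuser_reserved - shutdown_slack
    # Never report less than one usable connection.
    return max(limit, 1)


# Hypothetical example: initdb settled on 20 connections, 3 are reserved.
print(auto_connection_limit(20, 3))  # 15
```

With numbers like these the parallel schedule's 20 concurrent sessions
would not fit, which is the kind of mismatch the "auto" setting is meant
to absorb on buildfarm members.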


Re: Windows buildfarm failures

From
Andrew Dunstan
Date:
Tom Lane wrote:
>
> Perhaps we could extend pg_regress to allow "--max-connections=auto"
> which would instruct it to set its connection limit to the server's
> actual max_connections minus superuser reserved slots (and probably
> minus a couple more to allow for backend shutdown time etc).  Then the
> buildfarm could use that, while we'd leave the behavior alone for normal
> manual regression tests.
>
>   

This seems needlessly complex. We can tolerate occasional intermittent 
failures on buildfarm, and if they are persistent there is already a 
configurable rate limiting mechanism available.

cheers

andrew



Re: Windows buildfarm failures

From
Stefan Kaltenbrunner
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
>> Alvaro Herrera <alvherre@commandprompt.com> writes:
>>> Now, if some Windows-enabled person could step forward so that we can
>>> suggest some tests to run, that would be great.  Perhaps the solution to
>>> the problem is to relax the conditions a little, so that two scans are
>>> accepted on that table instead of only one; but it would be good to
>>> confirm whether the stat system is really working and it's really still
>>> counting stuff as it's supposed to do.
>> No, you misread it: the check is for at least one new event, not exactly
>> one.
> 
> Doh :-(
> 
>> We've been seeing this intermittently for a long time, but it sure seems
>> that autovac has raised the probability greatly.  That's pretty odd.
>> If it's a timing thing, why are all and only the Windows machines
>> affected?  Could it be that autovac is sucking all the spare cycles
>> and keeping the stats collector from running?
> 
> Hmm, that could explain it, but it's strange that only Windows machines
> are affected.  Maybe it's a scheduler issue, and the Unix machines are
> able to let pgstat do some work but the Windows ones are not.

maybe not only windows boxes:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=zebra&dt=2007-01-20%2015:25:05


Stefan


Re: Windows buildfarm failures

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Alvaro Herrera wrote:
>> Hmm, that could explain it, but it's strange that only Windows machines
>> are affected.  Maybe it's a scheduler issue, and the Unix machines are
>> able to let pgstat do some work but the Windows ones are not.

> maybe not only windows boxes:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=zebra&dt=2007-01-20%2015:25:05

That one's interesting because only the first of the two queries failed.
I suppose that must mean that the stats file did update, but between
those two queries.

Maybe we just need to lengthen the sleep() even more?
        regards, tom lane


Re: Windows buildfarm failures

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> maybe not only windows boxes:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=zebra&dt=2007-01-20%2015:25:05

Wow, I just saw the stats failure on my own machine, for the first time
ever.  Conclusions:
1. Enabling autovac has definitely raised the probability of failure.
2. It's not Windows-only, but the probability of failure is much higher
on Windows.

Not sure what that tells us, though ...
        regards, tom lane