Thread: Stats collector frozen?

Stats collector frozen?

From
"Jeremy Haile"
Date:
I've noticed that my tables are not being auto vacuumed or analyzed
regularly, even though I have very aggressive autovacuum settings.

The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol.  However, I never notice it using
I/O or CPU usage.  Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.

I see error such as these in the log every now and then - not sure if
they are related.  These have been discussed at length in other posts
and seems to have something to do with PG holding onto old file handles
(Windows specific):
2007-01-24 06:24:16 ERROR:  could not open relation 1663/16404/333779:
Permission denied
2007-01-24 06:24:17 ERROR:  could not open relation 1663/16404/333779:
Permission denied
2007-01-24 06:24:18 ERROR:  could not open relation 1663/16404/333779:
Permission denied
2007-01-24 06:24:19 ERROR:  could not open relation 1663/16404/333779:
Permission denied

I'm running PG 8.2.1 on Windows.  Here is some of the output from "show
all":
"autovacuum";"on"
"autovacuum_analyze_scale_factor";"0.02"
"autovacuum_analyze_threshold";"250"
"autovacuum_freeze_max_age";"200000000"
"autovacuum_naptime";"1min"
"autovacuum_vacuum_cost_delay";"-1"
"autovacuum_vacuum_cost_limit";"-1"
"autovacuum_vacuum_scale_factor";"0.08"
"autovacuum_vacuum_threshold";"500"
"stats_block_level";"on"
"stats_command_string";"on"
"stats_reset_on_server_start";"off"
"stats_row_level";"on"
"stats_start_collector";"on"
"vacuum_cost_delay";"20ms"
"vacuum_cost_limit";"200"
"vacuum_cost_page_dirty";"20"
"vacuum_cost_page_hit";"1"
"vacuum_cost_page_miss";"10"
"vacuum_freeze_min_age";"100000000"

Re: Stats collector frozen?

From
Tom Lane
Date:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
> The stats collector appears to still be running, since I can see a
> postgres.exe process with -forkcol.  However, I never notice it using
> I/O or CPU usage.  Also, querying the pg_stat_user_tables view shows no
> change in table stats even for tables that change very frequently.

Is $PGDATA/global/pgstat.stat getting updated?  (watch the file mod time
at least)  Do you see any pgstat.tmp file in there?

            regards, tom lane

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp.

Coincidentally (I think not) - the last auto-analyze was performed at
2007-01-22 12:24:11.424-05.

The logs for 1/22 are empty - so no errors or anything like that to give
clues...

Thanks!
Jeremy Haile


On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
> "Jeremy Haile" <jhaile@fastmail.fm> writes:
> > The stats collector appears to still be running, since I can see a
> > postgres.exe process with -forkcol.  However, I never notice it using
> > I/O or CPU usage.  Also, querying the pg_stat_user_tables view shows no
> > change in table stats even for tables that change very frequently.
>
> Is $PGDATA/global/pgstat.stat getting updated?  (watch the file mod time
> at least)  Do you see any pgstat.tmp file in there?
>
>             regards, tom lane

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
Searching the archives, I found a couple of 2006 posts that seem
somewhat related to my problem (although I don't see any solutions
listed...):
http://archives.postgresql.org/pgsql-hackers/2006-09/msg00415.php
http://archives.postgresql.org/pgsql-hackers/2006-04/msg00127.php

Tom, since you were involved in these - did you ever figure out how to
resolve the issues of the stats collector getting stuck in Windows?

Thanks, Jeremy Haile

On Wed, 24 Jan 2007 14:19:05 -0500, "Jeremy Haile" <jhaile@fastmail.fm>
said:
> pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp.
>
> Coincidentally (I think not) - the last auto-analyze was performed at
> 2007-01-22 12:24:11.424-05.
>
> The logs for 1/22 are empty - so no errors or anything like that to give
> clues...
>
> Thanks!
> Jeremy Haile
>
>
> On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
> > "Jeremy Haile" <jhaile@fastmail.fm> writes:
> > > The stats collector appears to still be running, since I can see a
> > > postgres.exe process with -forkcol.  However, I never notice it using
> > > I/O or CPU usage.  Also, querying the pg_stat_user_tables view shows no
> > > change in table stats even for tables that change very frequently.
> >
> > Is $PGDATA/global/pgstat.stat getting updated?  (watch the file mod time
> > at least)  Do you see any pgstat.tmp file in there?
> >
> >             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
Tom,

Did this information shed any light on what the problem might be?  Any
solution or workaround?

Thanks!
Jeremy Haile

On Wed, 24 Jan 2007 14:19:05 -0500, "Jeremy Haile" <jhaile@fastmail.fm>
said:
> pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp.
>
> Coincidentally (I think not) - the last auto-analyze was performed at
> 2007-01-22 12:24:11.424-05.
>
> The logs for 1/22 are empty - so no errors or anything like that to give
> clues...
>
> Thanks!
> Jeremy Haile
>
>
> On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
> > "Jeremy Haile" <jhaile@fastmail.fm> writes:
> > > The stats collector appears to still be running, since I can see a
> > > postgres.exe process with -forkcol.  However, I never notice it using
> > > I/O or CPU usage.  Also, querying the pg_stat_user_tables view shows no
> > > change in table stats even for tables that change very frequently.
> >
> > Is $PGDATA/global/pgstat.stat getting updated?  (watch the file mod time
> > at least)  Do you see any pgstat.tmp file in there?
> >
> >             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

Re: Stats collector frozen?

From
Tom Lane
Date:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
> Did this information shed any light on what the problem might be?

It seems to buttress Magnus' theory that the intermittent (or not so
intermittent) stats-test buildfarm failures we've been seeing have to
do with the stats collector actually freezing up, rather than just
not reacting fast enough as most of us (or me anyway) thought.  But
why that is happening remains anyone's guess.  I don't suppose you
have debugging tools that would let you get a stack trace from the
collector process?

            regards, tom lane

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
Unfortunately I don't have any debugging tools installed that would work
against postgres - although I'd be glad to do something if you could
tell me the steps involved.  I can reproduce the issue quite easily on
two different Windows machines (one is XP, the other is 2003).

Please let me know if there is anything else I can do to help debug this
problem.

Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?

Thanks,
Jeremy Haile


On Thu, 25 Jan 2007 12:42:11 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
> "Jeremy Haile" <jhaile@fastmail.fm> writes:
> > Did this information shed any light on what the problem might be?
>
> It seems to buttress Magnus' theory that the intermittent (or not so
> intermittent) stats-test buildfarm failures we've been seeing have to
> do with the stats collector actually freezing up, rather than just
> not reacting fast enough as most of us (or me anyway) thought.  But
> why that is happening remains anyone's guess.  I don't suppose you
> have debugging tools that would let you get a stack trace from the
> collector process?
>
>             regards, tom lane

Re: Stats collector frozen?

From
Tom Lane
Date:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
> Unfortunately I don't have any debugging tools installed that would work
> against postgres - although I'd be glad to do something if you could
> tell me the steps involved.  I can reproduce the issue quite easily on
> two different Windows machines (one is XP, the other is 2003).

Sorry, I don't know anything about Windows debugging either.  Can you
put together a test case that would let one of the Windows-using hackers
reproduce it?

> Do you know of any workaround other than restarting the whole server?
> Can the collector be restarted individually?

On Unix you can just 'kill -TERM' the collector process and the postmaster
will start a new one without engaging in a database panic cycle.  Dunno
what the equivalent is on Windows but it's probably possible.

            regards, tom lane

Re: Stats collector frozen?

From
Magnus Hagander
Date:
Tom Lane wrote:
> "Jeremy Haile" <jhaile@fastmail.fm> writes:
>> Unfortunately I don't have any debugging tools installed that would work
>> against postgres - although I'd be glad to do something if you could
>> tell me the steps involved.  I can reproduce the issue quite easily on
>> two different Windows machines (one is XP, the other is 2003).
>
> Sorry, I don't know anything about Windows debugging either.  Can you
> put together a test case that would let one of the Windows-using hackers
> reproduce it?

That would help a lot.


>> Do you know of any workaround other than restarting the whole server?
>> Can the collector be restarted individually?
>
> On Unix you can just 'kill -TERM' the collector process and the postmaster
> will start a new one without engaging in a database panic cycle.  Dunno
> what the equivalent is on Windows but it's probably possible.

You can use pg_ctl to send the int signal. If it's completely hung, that
may not work. In that case you can kill it from task manager, but that's
equiv of a kill -9, which means that the postmaster will restart all
backends.

//Magnus

Re: Stats collector frozen?

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
>> "Jeremy Haile" <jhaile@fastmail.fm> writes:
>>> Do you know of any workaround other than restarting the whole server?
>>> Can the collector be restarted individually?

> You can use pg_ctl to send the int signal. If it's completely hung, that
> may not work. In that case you can kill it from task manager, but that's
> equiv of a kill -9, which means that the postmaster will restart all
> backends.

No, the postmaster does not care how badly the stats collector croaks,
because the collector's not connected to shared memory, so there's no
risk of collateral damage.  It'll just start a new one without bothering
the backends.  So you can do whatever you have to do to kill the stuck
collector --- I was just not sure how to do that under Windows.

            regards, tom lane

Re: Stats collector frozen?

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>>> "Jeremy Haile" <jhaile@fastmail.fm> writes:
>>>> Do you know of any workaround other than restarting the whole server?
>>>> Can the collector be restarted individually?
>
>> You can use pg_ctl to send the int signal. If it's completely hung, that
>> may not work. In that case you can kill it from task manager, but that's
>> equiv of a kill -9, which means that the postmaster will restart all
>> backends.
>
> No, the postmaster does not care how badly the stats collector croaks,
> because the collector's not connected to shared memory, so there's no
> risk of collateral damage.  It'll just start a new one without bothering
> the backends.  So you can do whatever you have to do to kill the stuck
> collector --- I was just not sure how to do that under Windows.

Oh, ok.
Then just pick it up in Task Manager or Process Explorer or whatever and
kill it off. Just make sure you pick the right process.

//Magnus

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
> Then just pick it up in Task Manager or Process Explorer or whatever and
> kill it off. Just make sure you pick the right process.

I mentioned earlier that killing off the collector didn't work - however
I was wrong.  I just wasn't giving it enough time.  If I kill the
"postgres.exe -forkcol" process, it does gets restarted, although
sometimes it takes a minute.

Since it only seems to update pgstat.stat once after restarting, I'd
need to kill it once-a-minute to keep my statistics up to date =)  So,
unfortunately it's not a great workaround to my problem.


Jeremy Haile

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
I'll try to put together a test case for hackers, although I'm not sure
what exactly causes it.

Basically, when I fire up PostgreSQL - after about a minute the stats
collector runs once (pgstat.stat is updated, autovacuum fires up, etc.)
- and then the collector seems to hang.  If I watch it's performance
information, it does not read or write to disk again and pgstat.stat is
never updated again.  It never updates pgstat.stat more than once after
restart.  There are no errors in the log

I tried killing the collector a variety of ways on Windows, but it seems
to terminate indefinitely.  I don't see a kill program for windows that
lets me specify the signal to use.  So other than restarting PostgreSQL,
I'm not sure how to workaround this problem.

If anyone else is experiencing similar problems, please post your
situation.

On Thu, 25 Jan 2007 12:51:31 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
> "Jeremy Haile" <jhaile@fastmail.fm> writes:
> > Unfortunately I don't have any debugging tools installed that would work
> > against postgres - although I'd be glad to do something if you could
> > tell me the steps involved.  I can reproduce the issue quite easily on
> > two different Windows machines (one is XP, the other is 2003).
>
> Sorry, I don't know anything about Windows debugging either.  Can you
> put together a test case that would let one of the Windows-using hackers
> reproduce it?
>
> > Do you know of any workaround other than restarting the whole server?
> > Can the collector be restarted individually?
>
> On Unix you can just 'kill -TERM' the collector process and the
> postmaster
> will start a new one without engaging in a database panic cycle.  Dunno
> what the equivalent is on Windows but it's probably possible.
>
>             regards, tom lane

Re: Stats collector frozen?

From
Alvaro Herrera
Date:
Jeremy Haile wrote:
> I'll try to put together a test case for hackers, although I'm not sure
> what exactly causes it.
>
> Basically, when I fire up PostgreSQL - after about a minute the stats
> collector runs once (pgstat.stat is updated, autovacuum fires up, etc.)
> - and then the collector seems to hang.  If I watch it's performance
> information, it does not read or write to disk again and pgstat.stat is
> never updated again.  It never updates pgstat.stat more than once after
> restart.  There are no errors in the log
>
> I tried killing the collector a variety of ways on Windows, but it seems
> to terminate indefinitely.  I don't see a kill program for windows that
> lets me specify the signal to use.  So other than restarting PostgreSQL,
> I'm not sure how to workaround this problem.
>
> If anyone else is experiencing similar problems, please post your
> situation.

All the Windows buildfarm machines are, apparently.

We verified this with Magnus.  He found that the tenk2 table does not
seem to get stat updates -- the numbers are all zero, at all times.  I
thought I had blogged about this ... oh yeah, it's here

http://www.advogato.org/person/alvherre/diary.html?start=11

AFAIR (Magnus can surely confirm) there were some other tables that
weren't showing stats as all zeros -- but there's no way to know whether
those numbers were put there before the collector had "frozen" (if
that's really what's happening).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Stats collector frozen?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Jeremy Haile wrote:
>> If anyone else is experiencing similar problems, please post your
>> situation.

> All the Windows buildfarm machines are, apparently.

Can't anyone with a debugger duplicate this and get a stack trace for
us?  If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.

            regards, tom lane

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
> AFAIR (Magnus can surely confirm) there were some other tables that
> weren't showing stats as all zeros -- but there's no way to know whether
> those numbers were put there before the collector had "frozen" (if
> that's really what's happening).

Yeah - I have numbers that updated before the stats collector started
freezing.  Do you know which version of PG this started with?  I have
upgraded 8.1.3, 8.1.4, 8.2, and 8.2.1 in the past months and I didn't
have the collector enabled until 8.2.1 - so I'm not sure how long this
has been a problem.

I might try rolling back to a previous version - it's either that or
setup a scheduled vacuum analyze until we figure out this problem.  I'm
having to manually run it every day now... =)  I think this is a pretty
critical problem since it cripples autovacuum on Windows.

Are you guys in a position to debug the collector process and see where
it is freezing (ala Tom's earlier comment)?  Anything I can do to help
debug this problem faster?

Re: Stats collector frozen?

From
Magnus Hagander
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Jeremy Haile wrote:
>>> If anyone else is experiencing similar problems, please post your
>>> situation.
>
>> All the Windows buildfarm machines are, apparently.
>
> Can't anyone with a debugger duplicate this and get a stack trace for
> us?  If the stats collector is indeed freezing up, a stack trace showing
> where it's stuck would be exceedingly helpful.

Must've been asleep when reading and writing in this thread. Didn't
realize it was the same issue as the  buildfarm-killer. Will do the
debugger+stacktrace tomorrow on my VC install.

//Magnus

Re: Stats collector frozen?

From
Magnus Hagander
Date:
On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Jeremy Haile wrote:
> >> If anyone else is experiencing similar problems, please post your
> >> situation.
>
> > All the Windows buildfarm machines are, apparently.
>
> Can't anyone with a debugger duplicate this and get a stack trace for
> us?  If the stats collector is indeed freezing up, a stack trace showing
> where it's stuck would be exceedingly helpful.

Done some checking here. What happens is that suddenly the pgstats
socket stops receiving data. select() (pgstat.c line 1802) returns after
timeout, so got_data is always zero.

Interesting note: I just ran the serial regression tests, and they pass
fine. With the parallel tests, it always stops receiving data somewhere
during the first parallel group.

//Magnus

Re: Stats collector frozen?

From
Magnus Hagander
Date:
On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Jeremy Haile wrote:
> >> If anyone else is experiencing similar problems, please post your
> >> situation.
>
> > All the Windows buildfarm machines are, apparently.
>
> Can't anyone with a debugger duplicate this and get a stack trace for
> us?  If the stats collector is indeed freezing up, a stack trace showing
> where it's stuck would be exceedingly helpful.

Another update - what broke it was not the enabling of autovacuum, it
was the enabling of row level stats. If I disable stats_row_level,
parallel tests pass again.

That doesn't actually tell us *why* it's broken, I think, but it does
tell us why the autovac patch caused it.

I don't have 8.1 or 8.2 around on win32, but it'd be interesting to see if the
same issue happens if you run the tests on that with stats_row_level
enabled. Most likely the same thing happens.

//Magnus

Re: Stats collector frozen?

From
Magnus Hagander
Date:
On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Jeremy Haile wrote:
> >> If anyone else is experiencing similar problems, please post your
> >> situation.
>
> > All the Windows buildfarm machines are, apparently.
>
> Can't anyone with a debugger duplicate this and get a stack trace for
> us?  If the stats collector is indeed freezing up, a stack trace showing
> where it's stuck would be exceedingly helpful.

Maybe I should finish testing before I send my emails.

Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
if I put a #undef select right before the select in pgstat.c, the
regression tests pass.

I guess the bug is shown because with row level stats we simply have
more data to process. And it appears only to happen on UDP sockets from
what I can tell.

Now, what Iwould *like* to do is to re-implement that part of the code
using the Win32 APIs instead of going through select(). Since it's very
isolated code. I'm going to try that and see how invasive it is, then
see if it'll get accepted :-)

(This would of course give us better performance in general in that
codepath, since all the emulation stuff wouldn't be needed, so there's a
point to doing that other than finding the obscure UDP-related bug in
pgwin23_select)

//Magnus

Re: Stats collector frozen?

From
Teodor Sigaev
Date:
> Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
> if I put a #undef select right before the select in pgstat.c, the
> regression tests pass.

May be, problem is related to fixed bug in pgwin32_waitforsinglesocket() ?
WaitForMultipleObjectsEx might sleep indefinitely while waiting socket to write,
so, may be there is symmetrical problem with read? Or pgwin32_select() is used
for waiting write too?



--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Stats collector frozen?

From
Magnus Hagander
Date:
On Fri, Jan 26, 2007 at 01:11:00PM +0300, Teodor Sigaev wrote:
> >Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
> >if I put a #undef select right before the select in pgstat.c, the
> >regression tests pass.
>
> May be, problem is related to fixed bug in pgwin32_waitforsinglesocket() ?
> WaitForMultipleObjectsEx might sleep indefinitely while waiting socket to
> write, so, may be there is symmetrical problem with read? Or
> pgwin32_select() is used for waiting write too?


pgwin32_waitforsinglesocket() appears to work fine. And we only use
FD_READ, so it's not affected by your patch from what I can tell.

I've got it passnig tests with select replaced with waitforsinglesocket
- now I just need to implement timeout in that one :-)

//Magnus

Re: Stats collector frozen?

From
Magnus Hagander
Date:
On Fri, Jan 26, 2007 at 12:26:58PM +0100, Magnus Hagander wrote:
> On Fri, Jan 26, 2007 at 01:11:00PM +0300, Teodor Sigaev wrote:
> > >Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
> > >if I put a #undef select right before the select in pgstat.c, the
> > >regression tests pass.
> >
> > May be, problem is related to fixed bug in pgwin32_waitforsinglesocket() ?
> > WaitForMultipleObjectsEx might sleep indefinitely while waiting socket to
> > write, so, may be there is symmetrical problem with read? Or
> > pgwin32_select() is used for waiting write too?
>
>
> pgwin32_waitforsinglesocket() appears to work fine. And we only use
> FD_READ, so it's not affected by your patch from what I can tell.
>
> I've got it passnig tests with select replaced with waitforsinglesocket
> - now I just need to implement timeout in that one :-)

Attached patch seems to solve the problem on my machine at least. Uses
pgwin32_waitforsinglesocket() instead of pgwin32_select(). Changes
pgwin32_waitforsinglesocket() to accept the timeout as a parameter (this
is why it touches files outside of the stats area).

//Magnus

Attachment

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
> > > >Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
> > > >if I put a #undef select right before the select in pgstat.c, the
> > > >regression tests pass.
> > >
> > > May be, problem is related to fixed bug in pgwin32_waitforsinglesocket() ?
> > > WaitForMultipleObjectsEx might sleep indefinitely while waiting socket to
> > > write, so, may be there is symmetrical problem with read? Or
> > > pgwin32_select() is used for waiting write too?
> >
> >
> > pgwin32_waitforsinglesocket() appears to work fine. And we only use
> > FD_READ, so it's not affected by your patch from what I can tell.
> >
> > I've got it passnig tests with select replaced with waitforsinglesocket
> > - now I just need to implement timeout in that one :-)
>
> Attached patch seems to solve the problem on my machine at least. Uses
> pgwin32_waitforsinglesocket() instead of pgwin32_select(). Changes
> pgwin32_waitforsinglesocket() to accept the timeout as a parameter (this
> is why it touches files outside of the stats area).

Magnus - thanks for your investigation and work!  Any chance I could get
a patched exe for win32 and test on my servers?

Re: Stats collector frozen?

From
Alvaro Herrera
Date:
Jeremy Haile wrote:

> Magnus - thanks for your investigation and work!  Any chance I could get
> a patched exe for win32 and test on my servers?

In the meantime, could you please confirm that turning row_level_stats
off makes the regression test pass?  And whatever tests you were doing?
(Note that by doing that, autovacuum won't work at all, because it needs
those stats.)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Stats collector frozen?

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
> if I put a #undef select right before the select in pgstat.c, the
> regression tests pass.
> I guess the bug is shown because with row level stats we simply have
> more data to process. And it appears only to happen on UDP sockets from
> what I can tell.

Hmm ... if this theory is correct, then statistics collection has
never worked at all on Windows, at least not under more than the most
marginal load; and hence neither has autovacuum.

Does that conclusion agree with reality?  You'd think we'd have heard
a whole lot of complaints about it, not just Jeremy's; and I don't
remember it being a sore point.  (But then again I just woke up.)
What seems somewhat more likely is that we broke pgwin32_select
recently, in which case we oughta find out why.  Or else remove it
entirely (does your patch make that possible?).

Keep in mind also that we have seen the stats-test failure on
non-Windows machines, so we still need to explain that ...

            regards, tom lane

Re: Stats collector frozen?

From
Magnus Hagander
Date:
On Fri, Jan 26, 2007 at 09:55:39AM -0500, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
> > if I put a #undef select right before the select in pgstat.c, the
> > regression tests pass.
> > I guess the bug is shown because with row level stats we simply have
> > more data to process. And it appears only to happen on UDP sockets from
> > what I can tell.
>
> Hmm ... if this theory is correct, then statistics collection has
> never worked at all on Windows, at least not under more than the most
> marginal load; and hence neither has autovacuum.

We have had lots of reports of issues with the stats collector on
Windows. Some were definitly fixed by the patch by O&T, but I don't
think all.
The thing is, since it didn't give any error messages at all, most users
wouldn't notice. Other than their tables getting bloated, in which case
they would do a manual vacuum and conlcude autovacuum wasn't good
enough. Or something.


> Does that conclusion agree with reality?  You'd think we'd have heard
> a whole lot of complaints about it, not just Jeremy's; and I don't
> remember it being a sore point.  (But then again I just woke up.)
> What seems somewhat more likely is that we broke pgwin32_select
> recently, in which case we oughta find out why.  Or else remove it
> entirely (does your patch make that possible?).

AFAIK, it only affects UDP connections, and this patch takes
pgwin32_select out of the loop for all UDP stuff.
But if we get this in, pgwin32_select is only used in the postmaster
accept-new-connections loop (from what I can tell by a quick look), so
I'd definitly want to rewrite that one as well to use a better way than
select-emulation. Then it could go away completely.


> Keep in mind also that we have seen the stats-test failure on
> non-Windows machines, so we still need to explain that ...

Yeah. But it *could* be two different stats issues lurking. Perhaps the
issue we've seen on non-windows can be fixed by the settings Alvaro had
me try (increasing autovacuum_vacuum_cost_delay or the delay in the
regression test).

//Magnus

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
> We have had lots of reports of issues with the stats collector on
> Windows. Some were definitly fixed by the patch by O&T, but I don't
> think all.

Here were a couple of other reports I found:
http://archives.postgresql.org/pgsql-hackers/2006-09/msg00415.php
http://archives.postgresql.org/pgsql-hackers/2006-04/msg00127.php


> The thing is, since it didn't give any error messages at all, most users
> wouldn't notice. Other than their tables getting bloated, in which case
> they would do a manual vacuum and conlcude autovacuum wasn't good
> enough. Or something.

This is indeed what I assumed at first.  I started running "vacuum
analyze" hourly and turned off autovacuum.  Later, I decided to try the
autovacuum route again and investigated why it wasn't working well.

Magnus - could you send me a patched exe to try in my environment?
Would it be compatible with 8.2.1?

Re: Stats collector frozen?

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Attached patch seems to solve the problem on my machine at least. Uses
> pgwin32_waitforsinglesocket() instead of pgwin32_select(). Changes
> pgwin32_waitforsinglesocket() to accept the timeout as a parameter (this
> is why it touches files outside of the stats area).

Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go
green, we should probably consider back-porting this to 8.1 and 8.0.

            regards, tom lane

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
> Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go
> green, we should probably consider back-porting this to 8.1 and 8.0.

Not trying to be a nuisance, but I'd really like to try this out in my
environment and see if my problems disappear.  Is there anyone out there
who could provide me with a patched exe for Win32?  If not, I could try
to get my system setup to build for Windows, but I'm not sure what all
that would involve.

Thanks,
Jeremy Haile

Re: Stats collector frozen?

From
Magnus Hagander
Date:
Jeremy Haile wrote:
>> Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go
>> green, we should probably consider back-porting this to 8.1 and 8.0.
>
> Not trying to be a nuisance, but I'd really like to try this out in my
> environment and see if my problems disappear.  Is there anyone out there
> who could provide me with a patched exe for Win32?  If not, I could try
> to get my system setup to build for Windows, but I'm not sure what all
> that would involve.

I'll see if I can build you something tomorrow. You're on 8.2, right? Do
you use any features like Kerberos, SSL or NLS? I don't think I have
them set up properly in my mingw build env, so it'd be easier if I could
build without them.

Unless beaten by someone who has a complete env ;-)

//Magnus

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
Using standard build (none of the things you mentioned) on 8.2.1
currently.

I really appreciate it!


On Fri, 26 Jan 2007 21:24:09 +0100, "Magnus Hagander"
<magnus@hagander.net> said:
> Jeremy Haile wrote:
> >> Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go
> >> green, we should probably consider back-porting this to 8.1 and 8.0.
> >
> > Not trying to be a nuisance, but I'd really like to try this out in my
> > environment and see if my problems disappear.  Is there anyone out there
> > who could provide me with a patched exe for Win32?  If not, I could try
> > to get my system setup to build for Windows, but I'm not sure what all
> > that would involve.
>
> I'll see if I can build you something tomorrow. You're on 8.2, right? Do
> you use any features like Kerberos, SSL or NLS? I don't think I have
> them set up properly in my mingw build env, so it'd be easier if I could
> build without them.
>
> Unless beaten by someone who has a complete env ;-)
>
> //Magnus

Re: Stats collector frozen?

From
Magnus Hagander
Date:
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Attached patch seems to solve the problem on my machine at least. Uses
>> pgwin32_waitforsinglesocket() instead of pgwin32_select(). Changes
>> pgwin32_waitforsinglesocket() to accept the timeout as a parameter (this
>> is why it touches files outside of the stats area).
>
> Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go
> green, we should probably consider back-porting this to 8.1 and 8.0.

Seems HEAD now passes thie check stage on at least snake and bandicoot.
They're now exposing another problem with ECPG though...

//Magnus

Re: Stats collector frozen?

From
Magnus Hagander
Date:
Jeremy Haile wrote:
> Using standard build (none of the things you mentioned) on 8.2.1
> currently.
>
> I really appreciate it!

Ok. I've built a binary for you at
http://www.hagander.net/download/pgsql_8.2.1_win32select.zip

Note that while I named the file 8.2.1 it really isn't - it's current
head of the REL_8_2_STABLE branch. Which means it has a few other fixes
as well.

I think it *should* work to just replace the postgres.exe file and
ignore the rest, so I didn't put them in the ZIP. But as always, make
sure you take a backup first :-)

//Magnus

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
Will do - thanks Magnus!  I'll test it for a while and post the results
here.

Jeremy Haile

On Sat, 27 Jan 2007 17:21:23 +0100, "Magnus Hagander"
<magnus@hagander.net> said:
> Jeremy Haile wrote:
> > Using standard build (none of the things you mentioned) on 8.2.1
> > currently.
> >
> > I really appreciate it!
>
> Ok. I've built a binary for you at
> http://www.hagander.net/download/pgsql_8.2.1_win32select.zip
>
> Note that while I named the file 8.2.1 it really isn't - it's current
> head of the REL_8_2_STABLE branch. Which means it has a few other fixes
> as well.
>
> I think it *should* work to just replace the postgres.exe file and
> ignore the rest, so I didn't put them in the ZIP. But as always, make
> sure you take a backup first :-)
>
> //Magnus

Re: Stats collector frozen?

From
"Jeremy Haile"
Date:
So far, the patched binary seems to have fixed the statistics issue in
my environment.  pgstat.stat is now updating constantly,
pg_stat_user_tables is updating correctly, and autovacuum is running!

Thanks for your quick response to this issue!  Let me know if there is
anything else I can do to assist.

Jeremy Haile


On Sat, 27 Jan 2007 11:23:39 -0500, "Jeremy Haile" <jhaile@fastmail.fm>
said:
> Will do - thanks Magnus!  I'll test it for a while and post the results
> here.
>
> Jeremy Haile
>
> On Sat, 27 Jan 2007 17:21:23 +0100, "Magnus Hagander"
> <magnus@hagander.net> said:
> > Jeremy Haile wrote:
> > > Using standard build (none of the things you mentioned) on 8.2.1
> > > currently.
> > >
> > > I really appreciate it!
> >
> > Ok. I've built a binary for you at
> > http://www.hagander.net/download/pgsql_8.2.1_win32select.zip
> >
> > Note that while I named the file 8.2.1 it really isn't - it's current
> > head of the REL_8_2_STABLE branch. Which means it has a few other fixes
> > as well.
> >
> > I think it *should* work to just replace the postgres.exe file and
> > ignore the rest, so I didn't put them in the ZIP. But as always, make
> > sure you take a backup first :-)
> >
> > //Magnus
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

Re: Stats collector frozen?

From
Michael Meskes
Date:
On Sat, Jan 27, 2007 at 03:17:14PM +0100, Magnus Hagander wrote:
> Seems HEAD now passes thie check stage on at least snake and bandicoot.
> They're now exposing another problem with ECPG though...

This should be fixed now. I just missed to change MinGW expected files
when I changed the tests. That's why I prefer to have no platform
specific expected files.

Michael
--
Michael Meskes
Email: Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
ICQ: 179140304, AIM/Yahoo: michaelmeskes, Jabber: meskes@jabber.org
Go SF 49ers! Go Rhein Fire! Use Debian GNU/Linux! Use PostgreSQL!

Re: Stats collector frozen?

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Fri, Jan 26, 2007 at 09:55:39AM -0500, Tom Lane wrote:
>> Keep in mind also that we have seen the stats-test failure on
>> non-Windows machines, so we still need to explain that ...

> Yeah. But it *could* be two different stats issues lurking. Perhaps the
> issue we've seen on non-windows can be fixed by the settings Alvaro had
> me try (increasing autovacuum_vacuum_cost_delay or the delay in the
> regression test).

I had a sudden thought about that: the stats machinery is designed to be
non-reliable, ie, drop messages under load.  Maybe the occasional stats
failures we see are just an artifact of that happening.  It would be
pretty unfortunate if the stats test and autovacuum together were
sufficient load to cause message drops, but I doubt that's the
explanation.  I think the important change here has been the default
enablement of stats_row_level.  That means that some of the tests
terminating just before the stats test starts may still be trying to
dump statistics out to the collector at the same time the stats test is.
(Keep in mind that psql does not wait around for the backend to be
actually gone before it exits, hence backend-exit cleanup is very likely
to happen in parallel with the start of the next test.)  This idea
explains why we mostly see the failure in parallel tests not serial:
in the serial schedule there's no opportunity to have a gang of backends
all exiting at the critical time.

If this theory is correct, then we can improve the reliability of the
stats test a good deal if we put a sleep() at the *start* of the test,
to let any old backends get out of the way.  It seems worth a try
anyway.  I'll add this to HEAD and if the stats failure noise seems to
go down, we can back-port it.

            regards, tom lane