Thread: Stats collector frozen?
I've noticed that my tables are not being auto vacuumed or analyzed regularly, even though I have very aggressive autovacuum settings. The stats collector appears to still be running, since I can see a postgres.exe process with -forkcol. However, I never notice it using I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no change in table stats even for tables that change very frequently. I see error such as these in the log every now and then - not sure if they are related. These have been discussed at length in other posts and seems to have something to do with PG holding onto old file handles (Windows specific): 2007-01-24 06:24:16 ERROR: could not open relation 1663/16404/333779: Permission denied 2007-01-24 06:24:17 ERROR: could not open relation 1663/16404/333779: Permission denied 2007-01-24 06:24:18 ERROR: could not open relation 1663/16404/333779: Permission denied 2007-01-24 06:24:19 ERROR: could not open relation 1663/16404/333779: Permission denied I'm running PG 8.2.1 on Windows. Here is some of the output from "show all": "autovacuum";"on" "autovacuum_analyze_scale_factor";"0.02" "autovacuum_analyze_threshold";"250" "autovacuum_freeze_max_age";"200000000" "autovacuum_naptime";"1min" "autovacuum_vacuum_cost_delay";"-1" "autovacuum_vacuum_cost_limit";"-1" "autovacuum_vacuum_scale_factor";"0.08" "autovacuum_vacuum_threshold";"500" "stats_block_level";"on" "stats_command_string";"on" "stats_reset_on_server_start";"off" "stats_row_level";"on" "stats_start_collector";"on" "vacuum_cost_delay";"20ms" "vacuum_cost_limit";"200" "vacuum_cost_page_dirty";"20" "vacuum_cost_page_hit";"1" "vacuum_cost_page_miss";"10" "vacuum_freeze_min_age";"100000000"
"Jeremy Haile" <jhaile@fastmail.fm> writes: > The stats collector appears to still be running, since I can see a > postgres.exe process with -forkcol. However, I never notice it using > I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no > change in table stats even for tables that change very frequently. Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time at least) Do you see any pgstat.tmp file in there? regards, tom lane
pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp. Coincidentally (I think not) - the last auto-analyze was performed at 2007-01-22 12:24:11.424-05. The logs for 1/22 are empty - so no errors or anything like that to give clues... Thanks! Jeremy Haile On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said: > "Jeremy Haile" <jhaile@fastmail.fm> writes: > > The stats collector appears to still be running, since I can see a > > postgres.exe process with -forkcol. However, I never notice it using > > I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no > > change in table stats even for tables that change very frequently. > > Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time > at least) Do you see any pgstat.tmp file in there? > > regards, tom lane
Searching the archives, I found a couple of 2006 posts that seem somewhat related to my problem (although I don't see any solutions listed...): http://archives.postgresql.org/pgsql-hackers/2006-09/msg00415.php http://archives.postgresql.org/pgsql-hackers/2006-04/msg00127.php Tom, since you were involved in these - did you ever figure out how to resolve the issues of the stats collector getting stuck in Windows? Thanks, Jeremy Haile On Wed, 24 Jan 2007 14:19:05 -0500, "Jeremy Haile" <jhaile@fastmail.fm> said: > pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp. > > Coincidentally (I think not) - the last auto-analyze was performed at > 2007-01-22 12:24:11.424-05. > > The logs for 1/22 are empty - so no errors or anything like that to give > clues... > > Thanks! > Jeremy Haile > > > On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said: > > "Jeremy Haile" <jhaile@fastmail.fm> writes: > > > The stats collector appears to still be running, since I can see a > > > postgres.exe process with -forkcol. However, I never notice it using > > > I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no > > > change in table stats even for tables that change very frequently. > > > > Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time > > at least) Do you see any pgstat.tmp file in there? > > > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match
Tom, Did this information shed any light on what the problem might be? Any solution or workaround? Thanks! Jeremy Haile On Wed, 24 Jan 2007 14:19:05 -0500, "Jeremy Haile" <jhaile@fastmail.fm> said: > pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp. > > Coincidentally (I think not) - the last auto-analyze was performed at > 2007-01-22 12:24:11.424-05. > > The logs for 1/22 are empty - so no errors or anything like that to give > clues... > > Thanks! > Jeremy Haile > > > On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said: > > "Jeremy Haile" <jhaile@fastmail.fm> writes: > > > The stats collector appears to still be running, since I can see a > > > postgres.exe process with -forkcol. However, I never notice it using > > > I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no > > > change in table stats even for tables that change very frequently. > > > > Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time > > at least) Do you see any pgstat.tmp file in there? > > > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match
"Jeremy Haile" <jhaile@fastmail.fm> writes: > Did this information shed any light on what the problem might be? It seems to buttress Magnus' theory that the intermittent (or not so intermittent) stats-test buildfarm failures we've been seeing have to do with the stats collector actually freezing up, rather than just not reacting fast enough as most of us (or me anyway) thought. But why that is happening remains anyone's guess. I don't suppose you have debugging tools that would let you get a stack trace from the collector process? regards, tom lane
Unfortunately I don't have any debugging tools installed that would work against postgres - although I'd be glad to do something if you could tell me the steps involved. I can reproduce the issue quite easily on two different Windows machines (one is XP, the other is 2003). Please let me know if there is anything else I can do to help debug this problem. Do you know of any workaround other than restarting the whole server? Can the collector be restarted individually? Thanks, Jeremy Haile On Thu, 25 Jan 2007 12:42:11 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said: > "Jeremy Haile" <jhaile@fastmail.fm> writes: > > Did this information shed any light on what the problem might be? > > It seems to buttress Magnus' theory that the intermittent (or not so > intermittent) stats-test buildfarm failures we've been seeing have to > do with the stats collector actually freezing up, rather than just > not reacting fast enough as most of us (or me anyway) thought. But > why that is happening remains anyone's guess. I don't suppose you > have debugging tools that would let you get a stack trace from the > collector process? > > regards, tom lane
"Jeremy Haile" <jhaile@fastmail.fm> writes: > Unfortunately I don't have any debugging tools installed that would work > against postgres - although I'd be glad to do something if you could > tell me the steps involved. I can reproduce the issue quite easily on > two different Windows machines (one is XP, the other is 2003). Sorry, I don't know anything about Windows debugging either. Can you put together a test case that would let one of the Windows-using hackers reproduce it? > Do you know of any workaround other than restarting the whole server? > Can the collector be restarted individually? On Unix you can just 'kill -TERM' the collector process and the postmaster will start a new one without engaging in a database panic cycle. Dunno what the equivalent is on Windows but it's probably possible. regards, tom lane
Tom Lane wrote: > "Jeremy Haile" <jhaile@fastmail.fm> writes: >> Unfortunately I don't have any debugging tools installed that would work >> against postgres - although I'd be glad to do something if you could >> tell me the steps involved. I can reproduce the issue quite easily on >> two different Windows machines (one is XP, the other is 2003). > > Sorry, I don't know anything about Windows debugging either. Can you > put together a test case that would let one of the Windows-using hackers > reproduce it? That would help a lot. >> Do you know of any workaround other than restarting the whole server? >> Can the collector be restarted individually? > > On Unix you can just 'kill -TERM' the collector process and the postmaster > will start a new one without engaging in a database panic cycle. Dunno > what the equivalent is on Windows but it's probably possible. You can use pg_ctl to send the int signal. If it's completely hung, that may not work. In that case you can kill it from task manager, but that's equiv of a kill -9, which means that the postmaster will restart all backends. //Magnus
Magnus Hagander <magnus@hagander.net> writes: >> "Jeremy Haile" <jhaile@fastmail.fm> writes: >>> Do you know of any workaround other than restarting the whole server? >>> Can the collector be restarted individually? > You can use pg_ctl to send the int signal. If it's completely hung, that > may not work. In that case you can kill it from task manager, but that's > equiv of a kill -9, which means that the postmaster will restart all > backends. No, the postmaster does not care how badly the stats collector croaks, because the collector's not connected to shared memory, so there's no risk of collateral damage. It'll just start a new one without bothering the backends. So you can do whatever you have to do to kill the stuck collector --- I was just not sure how to do that under Windows. regards, tom lane
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >>> "Jeremy Haile" <jhaile@fastmail.fm> writes: >>>> Do you know of any workaround other than restarting the whole server? >>>> Can the collector be restarted individually? > >> You can use pg_ctl to send the int signal. If it's completely hung, that >> may not work. In that case you can kill it from task manager, but that's >> equiv of a kill -9, which means that the postmaster will restart all >> backends. > > No, the postmaster does not care how badly the stats collector croaks, > because the collector's not connected to shared memory, so there's no > risk of collateral damage. It'll just start a new one without bothering > the backends. So you can do whatever you have to do to kill the stuck > collector --- I was just not sure how to do that under Windows. Oh, ok. Then just pick it up in Task Manager or Process Explorer or whatever and kill it off. Just make sure you pick the right process. //Magnus
> Then just pick it up in Task Manager or Process Explorer or whatever and > kill it off. Just make sure you pick the right process. I mentioned earlier that killing off the collector didn't work - however I was wrong. I just wasn't giving it enough time. If I kill the "postgres.exe -forkcol" process, it does gets restarted, although sometimes it takes a minute. Since it only seems to update pgstat.stat once after restarting, I'd need to kill it once-a-minute to keep my statistics up to date =) So, unfortunately it's not a great workaround to my problem. Jeremy Haile
I'll try to put together a test case for hackers, although I'm not sure what exactly causes it. Basically, when I fire up PostgreSQL - after about a minute the stats collector runs once (pgstat.stat is updated, autovacuum fires up, etc.) - and then the collector seems to hang. If I watch it's performance information, it does not read or write to disk again and pgstat.stat is never updated again. It never updates pgstat.stat more than once after restart. There are no errors in the log I tried killing the collector a variety of ways on Windows, but it seems to terminate indefinitely. I don't see a kill program for windows that lets me specify the signal to use. So other than restarting PostgreSQL, I'm not sure how to workaround this problem. If anyone else is experiencing similar problems, please post your situation. On Thu, 25 Jan 2007 12:51:31 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said: > "Jeremy Haile" <jhaile@fastmail.fm> writes: > > Unfortunately I don't have any debugging tools installed that would work > > against postgres - although I'd be glad to do something if you could > > tell me the steps involved. I can reproduce the issue quite easily on > > two different Windows machines (one is XP, the other is 2003). > > Sorry, I don't know anything about Windows debugging either. Can you > put together a test case that would let one of the Windows-using hackers > reproduce it? > > > Do you know of any workaround other than restarting the whole server? > > Can the collector be restarted individually? > > On Unix you can just 'kill -TERM' the collector process and the > postmaster > will start a new one without engaging in a database panic cycle. Dunno > what the equivalent is on Windows but it's probably possible. > > regards, tom lane
Jeremy Haile wrote: > I'll try to put together a test case for hackers, although I'm not sure > what exactly causes it. > > Basically, when I fire up PostgreSQL - after about a minute the stats > collector runs once (pgstat.stat is updated, autovacuum fires up, etc.) > - and then the collector seems to hang. If I watch it's performance > information, it does not read or write to disk again and pgstat.stat is > never updated again. It never updates pgstat.stat more than once after > restart. There are no errors in the log > > I tried killing the collector a variety of ways on Windows, but it seems > to terminate indefinitely. I don't see a kill program for windows that > lets me specify the signal to use. So other than restarting PostgreSQL, > I'm not sure how to workaround this problem. > > If anyone else is experiencing similar problems, please post your > situation. All the Windows buildfarm machines are, apparently. We verified this with Magnus. He found that the tenk2 table does not seem to get stat updates -- the numbers are all zero, at all times. I thought I had blogged about this ... oh yeah, it's here http://www.advogato.org/person/alvherre/diary.html?start=11 AFAIR (Magnus can surely confirm) there were some other tables that weren't showing stats as all zeros -- but there's no way to know whether those numbers were put there before the collector had "frozen" (if that's really what's happening). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Jeremy Haile wrote: >> If anyone else is experiencing similar problems, please post your >> situation. > All the Windows buildfarm machines are, apparently. Can't anyone with a debugger duplicate this and get a stack trace for us? If the stats collector is indeed freezing up, a stack trace showing where it's stuck would be exceedingly helpful. regards, tom lane
> AFAIR (Magnus can surely confirm) there were some other tables that > weren't showing stats as all zeros -- but there's no way to know whether > those numbers were put there before the collector had "frozen" (if > that's really what's happening). Yeah - I have numbers that updated before the stats collector started freezing. Do you know which version of PG this started with? I have upgraded 8.1.3, 8.1.4, 8.2, and 8.2.1 in the past months and I didn't have the collector enabled until 8.2.1 - so I'm not sure how long this has been a problem. I might try rolling back to a previous version - it's either that or setup a scheduled vacuum analyze until we figure out this problem. I'm having to manually run it every day now... =) I think this is a pretty critical problem since it cripples autovacuum on Windows. Are you guys in a position to debug the collector process and see where it is freezing (ala Tom's earlier comment)? Anything I can do to help debug this problem faster?
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> Jeremy Haile wrote: >>> If anyone else is experiencing similar problems, please post your >>> situation. > >> All the Windows buildfarm machines are, apparently. > > Can't anyone with a debugger duplicate this and get a stack trace for > us? If the stats collector is indeed freezing up, a stack trace showing > where it's stuck would be exceedingly helpful. Must've been asleep when reading and writing in this thread. Didn't realize it was the same issue as the buildfarm-killer. Will do the debugger+stacktrace tomorrow on my VC install. //Magnus
On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Jeremy Haile wrote: > >> If anyone else is experiencing similar problems, please post your > >> situation. > > > All the Windows buildfarm machines are, apparently. > > Can't anyone with a debugger duplicate this and get a stack trace for > us? If the stats collector is indeed freezing up, a stack trace showing > where it's stuck would be exceedingly helpful. Done some checking here. What happens is that suddenly the pgstats socket stops receiving data. select() (pgstat.c line 1802) returns after timeout, so got_data is always zero. Interesting note: I just ran the serial regression tests, and they pass fine. With the parallel tests, it always stops receiving data somewhere during the first parallel group. //Magnus
On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Jeremy Haile wrote: > >> If anyone else is experiencing similar problems, please post your > >> situation. > > > All the Windows buildfarm machines are, apparently. > > Can't anyone with a debugger duplicate this and get a stack trace for > us? If the stats collector is indeed freezing up, a stack trace showing > where it's stuck would be exceedingly helpful. Another update - what broke it was not the enabling of autovacuum, it was the enabling of row level stats. If I disable stats_row_level, parallel tests pass again. That doesn't actually tell us *why* it's broken, I think, but it does tell us why the autovac patch caused it. I don't have 8.1 or 8.2 around on win32, but it'd be interesting to see if the same issue happens if you run the tests on that with stats_row_level enabled. Most likely the same thing happens. //Magnus
On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Jeremy Haile wrote: > >> If anyone else is experiencing similar problems, please post your > >> situation. > > > All the Windows buildfarm machines are, apparently. > > Can't anyone with a debugger duplicate this and get a stack trace for > us? If the stats collector is indeed freezing up, a stack trace showing > where it's stuck would be exceedingly helpful. Maybe I should finish testing before I send my emails. Apparantly there is a bug lurking somewhere in pgwin32_select(). Because if I put a #undef select right before the select in pgstat.c, the regression tests pass. I guess the bug is shown because with row level stats we simply have more data to process. And it appears only to happen on UDP sockets from what I can tell. Now, what Iwould *like* to do is to re-implement that part of the code using the Win32 APIs instead of going through select(). Since it's very isolated code. I'm going to try that and see how invasive it is, then see if it'll get accepted :-) (This would of course give us better performance in general in that codepath, since all the emulation stuff wouldn't be needed, so there's a point to doing that other than finding the obscure UDP-related bug in pgwin23_select) //Magnus
> Apparantly there is a bug lurking somewhere in pgwin32_select(). Because > if I put a #undef select right before the select in pgstat.c, the > regression tests pass. May be, problem is related to fixed bug in pgwin32_waitforsinglesocket() ? WaitForMultipleObjectsEx might sleep indefinitely while waiting socket to write, so, may be there is symmetrical problem with read? Or pgwin32_select() is used for waiting write too? -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Fri, Jan 26, 2007 at 01:11:00PM +0300, Teodor Sigaev wrote: > >Apparantly there is a bug lurking somewhere in pgwin32_select(). Because > >if I put a #undef select right before the select in pgstat.c, the > >regression tests pass. > > May be, problem is related to fixed bug in pgwin32_waitforsinglesocket() ? > WaitForMultipleObjectsEx might sleep indefinitely while waiting socket to > write, so, may be there is symmetrical problem with read? Or > pgwin32_select() is used for waiting write too? pgwin32_waitforsinglesocket() appears to work fine. And we only use FD_READ, so it's not affected by your patch from what I can tell. I've got it passnig tests with select replaced with waitforsinglesocket - now I just need to implement timeout in that one :-) //Magnus
On Fri, Jan 26, 2007 at 12:26:58PM +0100, Magnus Hagander wrote: > On Fri, Jan 26, 2007 at 01:11:00PM +0300, Teodor Sigaev wrote: > > >Apparantly there is a bug lurking somewhere in pgwin32_select(). Because > > >if I put a #undef select right before the select in pgstat.c, the > > >regression tests pass. > > > > May be, problem is related to fixed bug in pgwin32_waitforsinglesocket() ? > > WaitForMultipleObjectsEx might sleep indefinitely while waiting socket to > > write, so, may be there is symmetrical problem with read? Or > > pgwin32_select() is used for waiting write too? > > > pgwin32_waitforsinglesocket() appears to work fine. And we only use > FD_READ, so it's not affected by your patch from what I can tell. > > I've got it passnig tests with select replaced with waitforsinglesocket > - now I just need to implement timeout in that one :-) Attached patch seems to solve the problem on my machine at least. Uses pgwin32_waitforsinglesocket() instead of pgwin32_select(). Changes pgwin32_waitforsinglesocket() to accept the timeout as a parameter (this is why it touches files outside of the stats area). //Magnus
Attachment
> > > >Apparantly there is a bug lurking somewhere in pgwin32_select(). Because > > > >if I put a #undef select right before the select in pgstat.c, the > > > >regression tests pass. > > > > > > May be, problem is related to fixed bug in pgwin32_waitforsinglesocket() ? > > > WaitForMultipleObjectsEx might sleep indefinitely while waiting socket to > > > write, so, may be there is symmetrical problem with read? Or > > > pgwin32_select() is used for waiting write too? > > > > > > pgwin32_waitforsinglesocket() appears to work fine. And we only use > > FD_READ, so it's not affected by your patch from what I can tell. > > > > I've got it passnig tests with select replaced with waitforsinglesocket > > - now I just need to implement timeout in that one :-) > > Attached patch seems to solve the problem on my machine at least. Uses > pgwin32_waitforsinglesocket() instead of pgwin32_select(). Changes > pgwin32_waitforsinglesocket() to accept the timeout as a parameter (this > is why it touches files outside of the stats area). Magnus - thanks for your investigation and work! Any chance I could get a patched exe for win32 and test on my servers?
Jeremy Haile wrote: > Magnus - thanks for your investigation and work! Any chance I could get > a patched exe for win32 and test on my servers? In the meantime, could you please confirm that turning row_level_stats off makes the regression test pass? And whatever tests you were doing? (Note that by doing that, autovacuum won't work at all, because it needs those stats.) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Magnus Hagander <magnus@hagander.net> writes: > Apparantly there is a bug lurking somewhere in pgwin32_select(). Because > if I put a #undef select right before the select in pgstat.c, the > regression tests pass. > I guess the bug is shown because with row level stats we simply have > more data to process. And it appears only to happen on UDP sockets from > what I can tell. Hmm ... if this theory is correct, then statistics collection has never worked at all on Windows, at least not under more than the most marginal load; and hence neither has autovacuum. Does that conclusion agree with reality? You'd think we'd have heard a whole lot of complaints about it, not just Jeremy's; and I don't remember it being a sore point. (But then again I just woke up.) What seems somewhat more likely is that we broke pgwin32_select recently, in which case we oughta find out why. Or else remove it entirely (does your patch make that possible?). Keep in mind also that we have seen the stats-test failure on non-Windows machines, so we still need to explain that ... regards, tom lane
On Fri, Jan 26, 2007 at 09:55:39AM -0500, Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: > > Apparantly there is a bug lurking somewhere in pgwin32_select(). Because > > if I put a #undef select right before the select in pgstat.c, the > > regression tests pass. > > I guess the bug is shown because with row level stats we simply have > > more data to process. And it appears only to happen on UDP sockets from > > what I can tell. > > Hmm ... if this theory is correct, then statistics collection has > never worked at all on Windows, at least not under more than the most > marginal load; and hence neither has autovacuum. We have had lots of reports of issues with the stats collector on Windows. Some were definitly fixed by the patch by O&T, but I don't think all. The thing is, since it didn't give any error messages at all, most users wouldn't notice. Other than their tables getting bloated, in which case they would do a manual vacuum and conlcude autovacuum wasn't good enough. Or something. > Does that conclusion agree with reality? You'd think we'd have heard > a whole lot of complaints about it, not just Jeremy's; and I don't > remember it being a sore point. (But then again I just woke up.) > What seems somewhat more likely is that we broke pgwin32_select > recently, in which case we oughta find out why. Or else remove it > entirely (does your patch make that possible?). AFAIK, it only affects UDP connections, and this patch takes pgwin32_select out of the loop for all UDP stuff. But if we get this in, pgwin32_select is only used in the postmaster accept-new-connections loop (from what I can tell by a quick look), so I'd definitly want to rewrite that one as well to use a better way than select-emulation. Then it could go away completely. > Keep in mind also that we have seen the stats-test failure on > non-Windows machines, so we still need to explain that ... Yeah. But it *could* be two different stats issues lurking. Perhaps the issue we've seen on non-windows can be fixed by the settings Alvaro had me try (increasing autovacuum_vacuum_cost_delay or the delay in the regression test). //Magnus
> We have had lots of reports of issues with the stats collector on > Windows. Some were definitly fixed by the patch by O&T, but I don't > think all. Here were a couple of other reports I found: http://archives.postgresql.org/pgsql-hackers/2006-09/msg00415.php http://archives.postgresql.org/pgsql-hackers/2006-04/msg00127.php > The thing is, since it didn't give any error messages at all, most users > wouldn't notice. Other than their tables getting bloated, in which case > they would do a manual vacuum and conlcude autovacuum wasn't good > enough. Or something. This is indeed what I assumed at first. I started running "vacuum analyze" hourly and turned off autovacuum. Later, I decided to try the autovacuum route again and investigated why it wasn't working well. Magnus - could you send me a patched exe to try in my environment? Would it be compatible with 8.2.1?
Magnus Hagander <magnus@hagander.net> writes: > Attached patch seems to solve the problem on my machine at least. Uses > pgwin32_waitforsinglesocket() instead of pgwin32_select(). Changes > pgwin32_waitforsinglesocket() to accept the timeout as a parameter (this > is why it touches files outside of the stats area). Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go green, we should probably consider back-porting this to 8.1 and 8.0. regards, tom lane
> Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go > green, we should probably consider back-porting this to 8.1 and 8.0. Not trying to be a nuisance, but I'd really like to try this out in my environment and see if my problems disappear. Is there anyone out there who could provide me with a patched exe for Win32? If not, I could try to get my system setup to build for Windows, but I'm not sure what all that would involve. Thanks, Jeremy Haile
Jeremy Haile wrote: >> Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go >> green, we should probably consider back-porting this to 8.1 and 8.0. > > Not trying to be a nuisance, but I'd really like to try this out in my > environment and see if my problems disappear. Is there anyone out there > who could provide me with a patched exe for Win32? If not, I could try > to get my system setup to build for Windows, but I'm not sure what all > that would involve. I'll see if I can build you something tomorrow. You're on 8.2, right? Do you use any features like Kerberos, SSL or NLS? I don't think I have them set up properly in my mingw build env, so it'd be easier if I could build without them. Unless beaten by someone who has a complete env ;-) //Magnus
Using standard build (none of the things you mentioned) on 8.2.1 currently. I really appreciate it! On Fri, 26 Jan 2007 21:24:09 +0100, "Magnus Hagander" <magnus@hagander.net> said: > Jeremy Haile wrote: > >> Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go > >> green, we should probably consider back-porting this to 8.1 and 8.0. > > > > Not trying to be a nuisance, but I'd really like to try this out in my > > environment and see if my problems disappear. Is there anyone out there > > who could provide me with a patched exe for Win32? If not, I could try > > to get my system setup to build for Windows, but I'm not sure what all > > that would involve. > > I'll see if I can build you something tomorrow. You're on 8.2, right? Do > you use any features like Kerberos, SSL or NLS? I don't think I have > them set up properly in my mingw build env, so it'd be easier if I could > build without them. > > Unless beaten by someone who has a complete env ;-) > > //Magnus
Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> Attached patch seems to solve the problem on my machine at least. Uses >> pgwin32_waitforsinglesocket() instead of pgwin32_select(). Changes >> pgwin32_waitforsinglesocket() to accept the timeout as a parameter (this >> is why it touches files outside of the stats area). > > Applied to HEAD and 8.2 --- assuming the Windows buildfarm machines go > green, we should probably consider back-porting this to 8.1 and 8.0. Seems HEAD now passes thie check stage on at least snake and bandicoot. They're now exposing another problem with ECPG though... //Magnus
Jeremy Haile wrote: > Using standard build (none of the things you mentioned) on 8.2.1 > currently. > > I really appreciate it! Ok. I've built a binary for you at http://www.hagander.net/download/pgsql_8.2.1_win32select.zip Note that while I named the file 8.2.1 it really isn't - it's current head of the REL_8_2_STABLE branch. Which means it has a few other fixes as well. I think it *should* work to just replace the postgres.exe file and ignore the rest, so I didn't put them in the ZIP. But as always, make sure you take a backup first :-) //Magnus
Will do - thanks Magnus! I'll test it for a while and post the results here. Jeremy Haile On Sat, 27 Jan 2007 17:21:23 +0100, "Magnus Hagander" <magnus@hagander.net> said: > Jeremy Haile wrote: > > Using standard build (none of the things you mentioned) on 8.2.1 > > currently. > > > > I really appreciate it! > > Ok. I've built a binary for you at > http://www.hagander.net/download/pgsql_8.2.1_win32select.zip > > Note that while I named the file 8.2.1 it really isn't - it's current > head of the REL_8_2_STABLE branch. Which means it has a few other fixes > as well. > > I think it *should* work to just replace the postgres.exe file and > ignore the rest, so I didn't put them in the ZIP. But as always, make > sure you take a backup first :-) > > //Magnus
So far, the patched binary seems to have fixed the statistics issue in my environment. pgstat.stat is now updating constantly, pg_stat_user_tables is updating correctly, and autovacuum is running! Thanks for your quick response to this issue! Let me know if there is anything else I can do to assist. Jeremy Haile On Sat, 27 Jan 2007 11:23:39 -0500, "Jeremy Haile" <jhaile@fastmail.fm> said: > Will do - thanks Magnus! I'll test it for a while and post the results > here. > > Jeremy Haile > > On Sat, 27 Jan 2007 17:21:23 +0100, "Magnus Hagander" > <magnus@hagander.net> said: > > Jeremy Haile wrote: > > > Using standard build (none of the things you mentioned) on 8.2.1 > > > currently. > > > > > > I really appreciate it! > > > > Ok. I've built a binary for you at > > http://www.hagander.net/download/pgsql_8.2.1_win32select.zip > > > > Note that while I named the file 8.2.1 it really isn't - it's current > > head of the REL_8_2_STABLE branch. Which means it has a few other fixes > > as well. > > > > I think it *should* work to just replace the postgres.exe file and > > ignore the rest, so I didn't put them in the ZIP. But as always, make > > sure you take a backup first :-) > > > > //Magnus > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster
On Sat, Jan 27, 2007 at 03:17:14PM +0100, Magnus Hagander wrote: > Seems HEAD now passes thie check stage on at least snake and bandicoot. > They're now exposing another problem with ECPG though... This should be fixed now. I just missed to change MinGW expected files when I changed the tests. That's why I prefer to have no platform specific expected files. Michael -- Michael Meskes Email: Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org) ICQ: 179140304, AIM/Yahoo: michaelmeskes, Jabber: meskes@jabber.org Go SF 49ers! Go Rhein Fire! Use Debian GNU/Linux! Use PostgreSQL!
Magnus Hagander <magnus@hagander.net> writes: > On Fri, Jan 26, 2007 at 09:55:39AM -0500, Tom Lane wrote: >> Keep in mind also that we have seen the stats-test failure on >> non-Windows machines, so we still need to explain that ... > Yeah. But it *could* be two different stats issues lurking. Perhaps the > issue we've seen on non-windows can be fixed by the settings Alvaro had > me try (increasing autovacuum_vacuum_cost_delay or the delay in the > regression test). I had a sudden thought about that: the stats machinery is designed to be non-reliable, ie, drop messages under load. Maybe the occasional stats failures we see are just an artifact of that happening. It would be pretty unfortunate if the stats test and autovacuum together were sufficient load to cause message drops, but I doubt that's the explanation. I think the important change here has been the default enablement of stats_row_level. That means that some of the tests terminating just before the stats test starts may still be trying to dump statistics out to the collector at the same time the stats test is. (Keep in mind that psql does not wait around for the backend to be actually gone before it exits, hence backend-exit cleanup is very likely to happen in parallel with the start of the next test.) This idea explains why we mostly see the failure in parallel tests not serial: in the serial schedule there's no opportunity to have a gang of backends all exiting at the critical time. If this theory is correct, then we can improve the reliability of the stats test a good deal if we put a sleep() at the *start* of the test, to let any old backends get out of the way. It seems worth a try anyway. I'll add this to HEAD and if the stats failure noise seems to go down, we can back-port it. regards, tom lane