Thread: Unresolved Win32 bug reports
Folks, my mailbox is filling with unresolved Win32 bug reports,
specifically:

    integer division
    shared memory
    statistics collector
    rename
    fsync

I have put the emails at the bottom of the patches_hold queue:

    http://momjian.postgresql.org/cgi-bin/pgpatches_hold

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Here's one to add to the list: running pgbench with a moderately heavy
load on an SMP box likes to trigger a state where the database (or
pgbench) just stops doing work (CPU usage drops to nothing, as does disk
activity). I've been able to repro this on 2 Intel boxes (one a 2 way,
one a 4 way), and a dual Opteron, all running the latest windows binary.
A 50 connection test running 1000 transactions is pretty much ensured to
fail. I've been unable to produce the same behavior on a single-proc
machine.

Please let me know if there's any more info that would be helpful.

On Thu, Apr 20, 2006 at 07:02:01AM -0400, Bruce Momjian wrote:
> Folks, my mailbox is filling with unresolved Win32 bug reports,
> specifically:
>
>     integer division
>     shared memory
>     statistics collector
>     rename
>     fsync
>
> I have put the emails at the bottom of the patches_hold queue:
>
>     http://momjian.postgresql.org/cgi-bin/pgpatches_hold

-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Thu, Apr 20, 2006 at 12:17:07PM -0500, Jim C. Nasby wrote:
> Here's one to add to the list: running pgbench with a moderately heavy
> load on an SMP box likes to trigger a state where the database (or
> pgbench) just stops doing work (CPU usage drops to nothing, as does disk
> activity). I've been able to repro this on 2 Intel boxes (one a 2 way,
> one a 4 way), and a dual Opteron, all running the latest windows binary.
> A 50 connection test running 1000 transactions is pretty much ensured to
> fail.

Well, this sounds like a dead-lock, the obvious step would be to
attach gdb to both and get a stack-trace...

-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
On Thu, Apr 20, 2006 at 07:25:15PM +0200, Martijn van Oosterhout wrote:
> Well, this sounds like a dead-lock, the obvious step would be to
> attach gdb to both and get a stack-trace...

Any pointers on how to get that set up? Is gdb part of the mingw runtime?

BTW, this appears to be readily reproducible, so it might be a lot more
productive for one of the windows hackers to test this themselves...

-- 
Jim C. Nasby
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Thu, Apr 20, 2006 at 12:17:07PM -0500, Jim C. Nasby wrote:
>> Here's one to add to the list: running pgbench with a moderately heavy
>> load on an SMP box likes to trigger a state where the database (or
>> pgbench) just stops doing work (CPU usage drops to nothing, as does disk
>> activity).

> Well, this sounds like a dead-lock, the obvious step would be to
> attach gdb to both and get a stack-trace...

Yeah, I wonder if it's related to that apparent bug Qingqing saw in the
windows semaphore code? It's clearly windows-specific since no one's
ever reported any such thing on Unixen.

regards, tom lane
> > Well, this sounds like a dead-lock, the obvious step would be to
> > attach gdb to both and get a stack-trace...
>
> Any pointers on how to get that set up? Is gdb part of the
> mingw runtime?

Yes. It's quite crappy compared to on unix though - I've never been able
to make it do the right thing all the way :-(

> BTW, this appears to be readily reproducible, so it might be
> a lot more productive for one of the windows hackers to test
> this themselves...

It requires a multi-CPU box, right? I don't have one with pgwin32 on
ATM. Do you know if it's enough with hyperthreading?

//Magnus
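For anyone trying the gdb route suggested earlier in the thread, here is a minimal sketch in Python of collecting a backtrace from each stuck process. It assumes a gdb binary is on the PATH and that the process IDs of the hung backends and of pgbench have already been looked up by hand (for example in Task Manager). The function names are made up for illustration, and given the caveat above about gdb on Windows there is no guarantee the traces will be usable.

```python
import subprocess


def gdb_backtrace_argv(pid, gdb="gdb"):
    """Build a gdb command line that attaches to `pid`, prints a
    backtrace for every thread, and then exits (batch mode)."""
    return [gdb, "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def collect_backtraces(pids, gdb="gdb"):
    """Run gdb once per PID and return a dict of {pid: captured output}.

    The PIDs are an assumption: they have to be found by the caller.
    """
    traces = {}
    for pid in pids:
        result = subprocess.run(
            gdb_backtrace_argv(pid, gdb),
            capture_output=True,
            text=True,
        )
        # Keep stderr too, since attach failures are reported there.
        traces[pid] = result.stdout + result.stderr
    return traces
```

Calling collect_backtraces with the PIDs of one hung backend and of pgbench would give both sides of the suspected deadlock in one go.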
On Thu, Apr 20, 2006 at 08:06:30PM +0200, Magnus Hagander wrote:
> It requires a multi-CPU box, right? I don't have one with pgwin32 on
> ATM. Do you know if it's enough with hyperthreading?

Hrm... not sure. Let me see if I can find a box with HT here and test
it. Running the following batch file with arguments of 40 40 1000 is
almost guaranteed to trigger the problem, though...

@echo off
dropdb bench
createdb bench
pgbench -i -s %1 bench
pgbench -t %3 -c %2 -n bench

-- 
Jim C. Nasby
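The batch file above can also be driven from a script that notices when pgbench stops finishing. What follows is a minimal sketch in Python, not a tested harness: it runs the same four commands with the same flags, then waits for the final pgbench run up to a time limit. The function names and the timeout_s parameter are assumptions added for illustration, and a run that is merely disk bound could exceed the limit without being hung.

```python
import subprocess


def build_commands(scale, clients, transactions, db="bench"):
    """Return the batch file's four commands as argv lists.

    scale, clients and transactions correspond to %1, %2 and %3.
    """
    return [
        ["dropdb", db],
        ["createdb", db],
        ["pgbench", "-i", "-s", str(scale), db],
        ["pgbench", "-t", str(transactions), "-c", str(clients), "-n", db],
    ]


def run_repro(scale, clients, transactions, timeout_s=3600):
    """Run the repro; return True if pgbench finished within timeout_s
    seconds (a hypothetical limit), False if it was still running."""
    *setup, bench = build_commands(scale, clients, transactions)
    for argv in setup:
        # Exit codes are ignored, as in the batch file: dropdb fails
        # harmlessly when the database does not exist yet.
        subprocess.run(argv)
    proc = subprocess.Popen(bench)
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # pgbench is deliberately left running so the stuck
        # processes can still be inspected.
        return False
```

run_repro(40, 40, 1000) matches the "40 40 1000" invocation. Leaving pgbench alive on timeout is a design choice: killing it would throw away the state that a stack trace or dump needs.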
Jim C. Nasby wrote:
> Hrm... not sure. Let me see if I can find a box with HT here and test
> it. Running the following batch file with arguments of 40 40 1000 is
> almost guaranteed to trigger the problem, though...
>
> @echo off
> dropdb bench
> createdb bench
> pgbench -i -s %1 bench
> pgbench -t %3 -c %2 -n bench

It seems to hang up just fine on my XPSP2, PG 8.1.2 HTT box.

:(

LER

-- 
Larry Rosenman
Database Support Engineer

PERVASIVE SOFTWARE. INC.
12365B RIATA TRACE PKWY
3015
AUSTIN TX 78727-6531

Tel: 512.231.6173
Fax: 512.231.6597
Email: Larry.Rosenman@pervasive.com
Web: www.pervasive.com
Larry Rosenman wrote:
> It seems to hang up just fine on my XPSP2, PG 8.1.2 HTT box.
>
> :(
>
> LER

I may have spoken too soon :(

More in a bit.

LER

-- 
Larry Rosenman
On Thu, Apr 20, 2006 at 02:17:35PM -0500, Larry Rosenman wrote:
> > It seems to hang up just fine on my XPSP2, PG 8.1.2 HTT box.
> >
> > :(
> >
> > LER
>
> I may have spoken too soon :(

I took a look and in fact the machine was just disk bound, so it appears
that either HT doesn't exhibit this behavior, or XP doesn't exhibit it
(all the machines I produced the error on are running w2k3 server).

I'll try and pin down better exactly what hardware/software will
reproduce this. In the meantime, if anyone has any good info for getting
a dump of one of these processes...

-- 
Jim C. Nasby
Some of the SysInternals tools might be a start.

ProcessExplorer provides information about processes:
http://www.sysinternals.com/Utilities/ProcessExplorer.html

DebugView shows Debugging output (not sure if PG uses this):
http://www.sysinternals.com/Utilities/DebugView.html

Also, I haven't used it, but this looks like the Windows equivalent of
gdb:
http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Jim C. Nasby
> Sent: Thursday, April 20, 2006 4:14 PM
> To: Larry Rosenman
> Cc: Magnus Hagander; Martijn van Oosterhout; Bruce Momjian;
> PostgreSQL-development
> Subject: Re: [HACKERS] Unresolved Win32 bug reports
>
> I took a look and in fact the machine was just disk bound, so it appears
> that either HT doesn't exhibit this behavior, or XP doesn't exhibit it
> (all the machines I produced the error on are running w2k3 server).
>
> I'll try and pin down better exactly what hardware/software will
> reproduce this. In the meantime, if anyone has any good info for getting
> a dump of one of these processes...
Bruce Momjian said:
> Folks, my mailbox is filling with unresolved Win32 bug reports,
> specifically:
>
>     integer division
>     shared memory
>     statistics collector
>     rename
>     fsync
>
> I have put the emails at the bottom of the patches_hold queue:
>
>     http://momjian.postgresql.org/cgi-bin/pgpatches_hold

There's also a pg_config buglet that David Fetter found that still needs
to be fixed.

I am currently travelling on family business, but when I return home in
a couple of weeks I will be working on getting my new machine built, and
installing a permanent Windows VM (among others), which will make it
easier for me to look at Windows issues within my realm of competence.

cheers

andrew
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
> Yeah, I wonder if it's related to that apparent bug Qingqing saw in the
> windows semaphore code? It's clearly windows-specific since no one's
> ever reported any such thing on Unixen.

I also suspect the EAGAIN error reports are related to the semaphore code.
So if possible, I suggest we patch the code and test it.

Regards,
Qingqing
On Mon, Apr 24, 2006 at 10:23:07AM +0800, Qingqing Zhou wrote:
> I also suspect the EAGAIN error reports are related to the semaphore code.
> So if possible, I suggest we patch the code and test it.

Is there a patched build available for testing? (I'd rather not have to
figure out how to get windows builds working, unless there's some kind
of instructions somewhere...)

-- 
Jim C. Nasby
"Jim C. Nasby" <jnasby@pervasive.com> wrote
> Is there a patched build available for testing? (I'd rather not have to
> figure out how to get windows builds working, unless there's some kind
> of instructions somewhere...)

Not yet - the patch is still pending.

Regards,
Qingqing