Thread: Re: [ADMIN] does wal archiving block the current client connection?
On Fri, 19 May 2006, Tom Lane wrote:
> Well, there's our smoking gun. IIRC, all the failures you showed us are
> consistent with race conditions caused by multiple archiver processes
> all trying to do the same tasks concurrently.
>
> Do you frequently stop and restart the postmaster? Because I don't see
> how you could get into this state without having done so.
>
> I've just been looking at the code, and the archiver does commit
> hara-kiri when it notices its parent postmaster is dead; but it only
> checks that in the outer loop. Given sufficiently long delays in the
> archive_command, that could be a long time after the postmaster died;
> and in the meantime, successive executions of the archive_command could
> be conflicting with those launched by a later archiver incarnation.

Hurray! Unfortunately, the postmaster on the original troubled server almost never gets restarted, and in fact has only one archiver process running right now. Drat! I guess I'll have to try and catch it in the act again the next time the NAS gets wedged so I can debug a little more (it was caught by one of the Windows folks last time) and gather some useful data.

Let me know if you want me to test a patch since I've already got this test case set up.

--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

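The point quoted above, that the archiver checks for a dead postmaster only in its outer loop, can be illustrated with a small sketch. This is not the PostgreSQL source (which is C); it is a minimal Python illustration of moving the parent check inside the per-file loop. The names parent_alive, archive_pending, and run_command are hypothetical, and run_command stands in for one execution of archive_command.

```python
import os


def parent_alive(original_ppid):
    """Return True while this process still has the parent it started with."""
    # When the parent exits, the child is re-parented, so getppid() changes.
    return os.getppid() == original_ppid


def archive_pending(files, run_command, original_ppid):
    """Archive files one at a time, re-checking the parent before each one.

    run_command is a hypothetical callable that returns 0 on success,
    standing in for one execution of archive_command.
    """
    archived = []
    for name in files:
        # The check sits inside the per-file loop, so a slow command delays
        # noticing a dead parent by at most one file.
        if not parent_alive(original_ppid):
            break
        if run_command(name) != 0:
            break  # stop at the first failure; a later pass can retry
        archived.append(name)
    return archived
```

With the check only around the whole batch, an orphaned worker could keep launching commands for as long as the backlog lasts; with the check per file, it stops after the command that is already running.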
Jeff Frost <jeff@frostconsultingllc.com> writes:
> Hurray! Unfortunately, the postmaster on the original troubled server almost
> never gets restarted, and in fact has only one archiver process running
> right now. Drat!

Well, the fact that there's only one archiver *now* doesn't mean there wasn't more than one when the problem happened. The orphaned archiver would eventually quit.

Do you have logs that would let you check when the production postmaster was restarted?

regards, tom lane

I wrote:
> Well, the fact that there's only one archiver *now* doesn't mean there
> wasn't more than one when the problem happened. The orphaned archiver
> would eventually quit.

But, actually, never mind: we have explained the failures you were seeing in the test setup, but a multiple-active-archiver situation still doesn't explain the original situation of incoming connections getting blocked.

What I'd suggest is resuming the test after making sure you've killed off any old archivers, and seeing if you can make any progress on reproducing the original problem. We definitely need a multiple-archiver interlock, but I think that must be unrelated to your real problem.

regards, tom lane

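The thread does not say what form the multiple-archiver interlock should take. As one generic possibility only, and not a description of any PostgreSQL fix, the sketch below uses an exclusively created lock file so that a second worker refuses to start while the first still holds the lock. The function names and the lock file location are assumptions made for illustration.

```python
import os


def try_acquire(lock_path):
    """Try to take the lock; return True on success, False if already held."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists,
        # so only one process can win.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        # Record the owner's PID so a human can see who holds the lock.
        handle.write(str(os.getpid()))
    return True


def release(lock_path):
    """Give up the lock by removing the lock file."""
    os.remove(lock_path)
```

A lock file like this is left behind if its owner crashes, so a real interlock would also need some way to detect a stale lock; that part is deliberately left out here.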
On Fri, 2006-05-19 at 12:20 -0400, Tom Lane wrote:
> I wrote:
> > Well, the fact that there's only one archiver *now* doesn't mean there
> > wasn't more than one when the problem happened. The orphaned archiver
> > would eventually quit.
>
> But, actually, never mind: we have explained the failures you were seeing
> in the test setup, but a multiple-active-archiver situation still
> doesn't explain the original situation of incoming connections getting
> blocked.

Agreed.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

On Fri, 19 May 2006, Tom Lane wrote:
> Well, the fact that there's only one archiver *now* doesn't mean there
> wasn't more than one when the problem happened. The orphaned archiver
> would eventually quit.
>
> Do you have logs that would let you check when the production postmaster
> was restarted?

I looked through /var/log/messages* and there wasn't a restart prior to the problem in the logs. They go back to April 16. The postmaster was restarted on May 15th (this Monday), but that was after the reported problem.

--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

On Fri, 19 May 2006, Tom Lane wrote:
> What I'd suggest is resuming the test after making sure you've killed
> off any old archivers, and seeing if you can make any progress on
> reproducing the original problem. We definitely need a
> multiple-archiver interlock, but I think that must be unrelated to your
> real problem.

Ok, so I've got the old archivers gone (and btw, after a restart I ended up with 3 of them - so I stopped postmaster, and killed them all individually and started postmaster again). Now I can run my same pgbench, or do you guys have any other suggestions on attempting to reproduce the problem?

--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

On Fri, 2006-05-19 at 09:36 -0700, Jeff Frost wrote:
> On Fri, 19 May 2006, Tom Lane wrote:
>
> > What I'd suggest is resuming the test after making sure you've killed
> > off any old archivers, and seeing if you can make any progress on
> > reproducing the original problem. We definitely need a
> > multiple-archiver interlock, but I think that must be unrelated to your
> > real problem.
>
> Ok, so I've got the old archivers gone (and btw, after a restart I ended up
> with 3 of them - so I stopped postmaster, and killed them all individually and
> started postmaster again).

That's good.

> Now I can run my same pgbench, or do you guys
> have any other suggestions on attempting to reproduce the problem?

No. We're back on track to try to reproduce the original error.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

On Fri, 19 May 2006, Simon Riggs wrote:
>> Now I can run my same pgbench, or do you guys
>> have any other suggestions on attempting to reproduce the problem?
>
> No. We're back on track to try to reproduce the original error.

I've been futzing with trying to reproduce the original problem for a few days and so far postgres seems to be just fine with a long delay on archiving, so now I'm rather at a loss. In fact, I currently have 1,234 xlog files in pg_xlog, but the archiver is happily archiving one every 5 minutes. Perhaps I'll try a long delay followed by a failure to see if that could be it.

--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

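The test described above (a long archiving delay, possibly followed by a failure) could be driven by a stand-in for the archive script. The thread does not show the actual script, so the sketch below is an assumption: slow_archive and its parameters are hypothetical names, and the function only copies one file after sleeping, returning a nonzero code to signal failure.

```python
import os
import shutil
import time


def slow_archive(src_path, dest_dir, delay_seconds=0.0, fail=False):
    """Copy one WAL segment after a delay; return 0 on success, 1 on failure."""
    # Simulate a slow or wedged archive target.
    time.sleep(delay_seconds)
    if fail:
        # Simulate the "long delay followed by a failure" case.
        return 1
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    if os.path.exists(dest_path):
        # Refuse to overwrite a segment that has already been archived.
        return 1
    shutil.copy2(src_path, dest_path)
    return 0
```

A small wrapper script calling this function could be named in archive_command, with %p supplying the source path; the wrapper would need to exit with the returned code, since a nonzero exit status is how the archiver learns that the attempt failed.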
On Sun, 2006-05-21 at 14:16 -0700, Jeff Frost wrote:
> On Fri, 19 May 2006, Simon Riggs wrote:
>
> >> Now I can run my same pgbench, or do you guys
> >> have any other suggestions on attempting to reproduce the problem?
> >
> > No. We're back on track to try to reproduce the original error.
>
> I've been futzing with trying to reproduce the original problem for a few days
> and so far postgres seems to be just fine with a long delay on archiving, so
> now I'm rather at a loss. In fact, I currently have 1,234 xlog files in
> pg_xlog, but the archiver is happily archiving one every 5 minutes. Perhaps
> I'll try a long delay followed by a failure to see if that could be it.

So the chances of the original problem being archiver related are receding...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

On Sun, 21 May 2006, Simon Riggs wrote:
>> I've been futzing with trying to reproduce the original problem for a few days
>> and so far postgres seems to be just fine with a long delay on archiving, so
>> now I'm rather at a loss. In fact, I currently have 1,234 xlog files in
>> pg_xlog, but the archiver is happily archiving one every 5 minutes. Perhaps
>> I'll try a long delay followed by a failure to see if that could be it.
>
> So the chances of the original problem being archiver related are
> receding...

This is possible, but I guess I should try and reproduce the actual problem with the same archive_command script and a CIFS mount just to see what happens. Perhaps the real root of the problem is elsewhere; it just seems strange since the archive_command is the only postgres-related process that accesses the CIFS share. More later.

--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

On Sun, 21 May 2006, Jeff Frost wrote:
>> So the chances of the original problem being archiver related are
>> receding...
>
> This is possible, but I guess I should try and reproduce the actual problem
> with the same archive_command script and a CIFS mount just to see what
> happens. Perhaps the real root of the problem is elsewhere; it just seems
> strange since the archive_command is the only postgres-related process that
> accesses the CIFS share. More later.

I tried both pulling the plug on the CIFS server and unsharing the CIFS share, but pgbench continued completely unconcerned. I guess the failure mode of the NAS device in the customer colo must be something different that I don't yet know how to simulate. I suspect I'll have to wait till it happens again and try to gather some more data before restarting the NAS device. Thanks for all your suggestions guys!

--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

Jeff Frost <jeff@frostconsultingllc.com> writes:
> I tried both pulling the plug on the CIFS server and unsharing the CIFS share,
> but pgbench continued completely unconcerned. I guess the failure mode of the
> NAS device in the customer colo must be something different that I don't yet
> know how to simulate. I suspect I'll have to wait till it happens again and
> try to gather some more data before restarting the NAS device. Thanks for all
> your suggestions guys!

I'm still thinking that the simplest explanation is that $PGDATA/pg_clog/ is on the NAS device. Please double-check the file locations.

regards, tom lane

On Tue, 23 May 2006, Tom Lane wrote:
> I'm still thinking that the simplest explanation is that $PGDATA/pg_clog/
> is on the NAS device. Please double-check the file locations.

I know that seems like an excellent candidate, but it really isn't, I swear. In fact, you almost had me convinced the last time you asked me to check. I thought some helpful admin had moved something, but no:

postgres 9194 0.0 0.4 486568 16464 ? S May16 0:11 /usr/local/pgsql/bin/postmaster -p 5432 -D /usr/local/pgsql/data

db3:~/data $ pwd
/usr/local/pgsql/data
db3:~/data $ ls -l
total 64
-rw------- 1 postgres postgres     4 Feb 13 20:13 PG_VERSION
drwx------ 6 postgres postgres  4096 Feb 13 21:00 base
drwx------ 2 postgres postgres  4096 May 22 21:03 global
drwx------ 2 postgres postgres  4096 May 22 17:45 pg_clog
-rw------- 1 postgres postgres  3575 Feb 13 20:13 pg_hba.conf
-rw------- 1 postgres postgres  1460 Feb 13 20:13 pg_ident.conf
drwx------ 4 postgres postgres  4096 Feb 13 20:13 pg_multixact
drwx------ 2 postgres postgres  4096 May 22 20:45 pg_subtrans
drwx------ 2 postgres postgres  4096 Feb 13 20:13 pg_tblspc
drwx------ 2 postgres postgres  4096 Feb 13 20:13 pg_twophase
lrwxrwxrwx 1 postgres postgres     9 Feb 13 22:10 pg_xlog -> /pg_xlog/
-rw------- 1 postgres postgres 13688 May 16 17:50 postgresql.conf
-rw------- 1 postgres postgres    63 May 16 17:54 postmaster.opts
-rw------- 1 postgres postgres    47 May 16 17:54 postmaster.pid
db3:~/data $ mount
/dev/sda2 on / type ext3 (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sda1 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/sdb1 on /usr/local/pgsql type ext3 (rw,data=writeback)
/dev/sda5 on /var type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
//10.1.1.28/pgbackup on /mnt/pgbackup type cifs (rw,mand,noexec,nosuid,nodev)

So, no. I wish it was that easy. :-/

If you have any other suggestions or inspirations, I'm all ears!

--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
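The location check in this message was done by eye, comparing the ls -l listing with the mount table. As a rough sketch of the same check in code, the hypothetical helper below follows symlinks (such as pg_xlog -> /pg_xlog/) and then walks up the directory tree until it reaches a mount point. The function name locate is an assumption made for illustration, not part of any PostgreSQL tool.

```python
import os


def locate(path):
    """Return (real_path, mount_point) for a file or directory."""
    # Resolve symlinks first, so a linked directory is reported where it
    # really lives rather than where the link sits.
    real_path = os.path.realpath(path)
    mount_point = real_path
    # Walk upward until the current directory is a mount point; the
    # filesystem root always is, so the loop terminates.
    while not os.path.ismount(mount_point):
        mount_point = os.path.dirname(mount_point)
    return real_path, mount_point
```

Calling locate on each subdirectory of the data directory would show at a glance whether any of them resolves to a path under the CIFS mount. On a layout like the one listed above, pg_clog would be expected to resolve under /usr/local/pgsql rather than /mnt/pgbackup.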