Thread: Re: [ADMIN] does wal archiving block the current client connection?

Re: [ADMIN] does wal archiving block the current client connection?

From

Jeff Frost

Date:

19 May 2006, 13:08:59

On Fri, 19 May 2006, Tom Lane wrote:

> Well, there's our smoking gun.  IIRC, all the failures you showed us are
> consistent with race conditions caused by multiple archiver processes
> all trying to do the same tasks concurrently.
>
> Do you frequently stop and restart the postmaster?  Because I don't see
> how you could get into this state without having done so.
>
> I've just been looking at the code, and the archiver does commit
> hara-kiri when it notices its parent postmaster is dead; but it only
> checks that in the outer loop.  Given sufficiently long delays in the
> archive_command, that could be a long time after the postmaster died;
> and in the meantime, successive executions of the archive_command could
> be conflicting with those launched by a later archiver incarnation.

Hurray!  Unfortunately, the postmaster on the original troubled server almost
never gets restarted, and in fact has only one archiver process running
right now.  Drat!

I guess I'll have to try and catch it in the act again the next time the NAS
gets wedged so I can debug a little more (it was caught by one of the Windows
folks last time) and gather some useful data.

Let me know if you want me to test a patch since I've already got this test
case set up.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954
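
The timing hole Tom describes (the archiver only notices a dead postmaster in its outer loop, so a slow archive_command can keep an orphan busy for a long time) can be illustrated with a minimal sketch. This is not the PostgreSQL source; `archive_pending`, `run_command` and `parent_alive` are hypothetical names chosen for illustration. The only point is that re-checking the parent before every command, rather than once per batch, bounds how long an orphan keeps launching work.

```python
import os
from typing import Callable, Iterable, List


def parent_is_alive(original_ppid: int) -> bool:
    """True while the process that started us is still our parent."""
    # When the parent exits, the child is re-parented, so getppid() changes.
    return os.getppid() == original_ppid


def archive_pending(
    segments: Iterable[str],
    run_command: Callable[[str], bool],
    parent_alive: Callable[[], bool],
) -> List[str]:
    """Archive segments one by one, re-checking the parent before each.

    Returns the segments that were archived before the loop stopped.
    """
    done: List[str] = []
    for seg in segments:
        if not parent_alive():
            break  # orphaned: stop before launching another command
        if not run_command(seg):
            break  # command failed: leave the rest for a later retry
        done.append(seg)
    return done
```

A caller would record `os.getppid()` once at startup and pass `lambda: parent_is_alive(saved_ppid)` as `parent_alive`.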

Re: [ADMIN] does wal archiving block the current client connection?

From

Tom Lane

Date:

19 May 2006, 13:17:57

Jeff Frost <jeff@frostconsultingllc.com> writes:
> Hurray!  Unfortunately, the postmaster on the original troubled server almost
> never gets restarted, and in fact has only one archiver process running
> right now.  Drat!

Well, the fact that there's only one archiver *now* doesn't mean there
wasn't more than one when the problem happened.  The orphaned archiver
would eventually quit.

Do you have logs that would let you check when the production postmaster
was restarted?

            regards, tom lane

Re: [ADMIN] does wal archiving block the current client connection?

From

Tom Lane

Date:

19 May 2006, 13:21:07

I wrote:
> Well, the fact that there's only one archiver *now* doesn't mean there
> wasn't more than one when the problem happened.  The orphaned archiver
> would eventually quit.

But, actually, never mind: we have explained the failures you were seeing
in the test setup, but a multiple-active-archiver situation still
doesn't explain the original situation of incoming connections getting
blocked.

What I'd suggest is resuming the test after making sure you've killed
off any old archivers, and seeing if you can make any progress on
reproducing the original problem.  We definitely need a
multiple-archiver interlock, but I think that must be unrelated to your
real problem.

            regards, tom lane
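
One common way to build the kind of interlock mentioned above is an advisory lock on a file that every archiver must hold while it works. The sketch below is an assumption-laden illustration, not what PostgreSQL does or later did: the function names and the lock-file path are hypothetical, and it relies on `flock`, so the lock file should live on a local filesystem rather than on a network share.

```python
import fcntl
import os
from typing import Optional


def try_acquire_archiver_lock(lock_path: str) -> Optional[int]:
    """Return an open fd holding an exclusive lock, or None if it is taken."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_archiver_lock(fd: int) -> None:
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A second archiver that gets `None` back would exit rather than compete for the same segments. Because the kernel drops the lock when the holder dies, a crashed archiver cannot leave a stale lock behind.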

Re: [ADMIN] does wal archiving block the current client connection?

From

Simon Riggs

Date:

19 May 2006, 13:25:47

On Fri, 2006-05-19 at 12:20 -0400, Tom Lane wrote:
> I wrote:
> > Well, the fact that there's only one archiver *now* doesn't mean there
> > wasn't more than one when the problem happened.  The orphaned archiver
> > would eventually quit.
>
> But, actually, never mind: we have explained the failures you were seeing
> in the test setup, but a multiple-active-archiver situation still
> doesn't explain the original situation of incoming connections getting
> blocked.

Agreed.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: [ADMIN] does wal archiving block the current client connection?

From

Jeff Frost

Date:

19 May 2006, 13:32:57

On Fri, 19 May 2006, Tom Lane wrote:

> Well, the fact that there's only one archiver *now* doesn't mean there
> wasn't more than one when the problem happened.  The orphaned archiver
> would eventually quit.
>
> Do you have logs that would let you check when the production postmaster
> was restarted?

I looked through /var/log/messages* and there wasn't a restart prior to the
problem in the logs.  They go back to April 16.  The postmaster was restarted
on May 15th (this Monday), but that was after the reported problem.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: [ADMIN] does wal archiving block the current client connection?

From

Jeff Frost

Date:

19 May 2006, 13:37:00

On Fri, 19 May 2006, Tom Lane wrote:

> What I'd suggest is resuming the test after making sure you've killed
> off any old archivers, and seeing if you can make any progress on
> reproducing the original problem.  We definitely need a
> multiple-archiver interlock, but I think that must be unrelated to your
> real problem.

Ok, so I've got the old archivers gone (and btw, after a restart I ended up
with 3 of them - so I stopped postmaster, and killed them all individually and
started postmaster again).  Now I can run my same pgbench, or do you guys
have any other suggestions on attempting to reproduce the problem?

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954
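
Checking for leftover archivers before resuming a test can be scripted. The following is a rough sketch that scans process-listing text for archiver entries; `find_archiver_pids` is a hypothetical helper, and the default marker string `"archiver process"` is an assumption about how the process title appears on a given installation, so adjust it to match what `ps` actually shows.

```python
from typing import List


def find_archiver_pids(ps_output: str, marker: str = "archiver process") -> List[int]:
    """Return PIDs from 'PID ARGS' style lines whose ARGS contain marker."""
    pids: List[int] = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)  # split into PID and the rest of the line
        if len(fields) == 2 and fields[0].isdigit() and marker in fields[1]:
            pids.append(int(fields[0]))
    return pids
```

Fed the output of `ps -eo pid,args`, a result with more than one PID would suggest that an archiver from an earlier postmaster is still around.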

Re: [ADMIN] does wal archiving block the current client connection?

From

Simon Riggs

Date:

19 May 2006, 13:41:55

On Fri, 2006-05-19 at 09:36 -0700, Jeff Frost wrote:
> On Fri, 19 May 2006, Tom Lane wrote:
>
> > What I'd suggest is resuming the test after making sure you've killed
> > off any old archivers, and seeing if you can make any progress on
> > reproducing the original problem.  We definitely need a
> > multiple-archiver interlock, but I think that must be unrelated to your
> > real problem.
>
> Ok, so I've got the old archivers gone (and btw, after a restart I ended up
> with 3 of them - so I stopped postmaster, and killed them all individually and
> started postmaster again).

That's good.

> Now I can run my same pgbench, or do you guys
> have any other suggestions on attempting to reproduce the problem?

No. We're back on track to try to reproduce the original error.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: [ADMIN] does wal archiving block the current client connection?

From

Jeff Frost

Date:

21 May 2006, 18:16:32

On Fri, 19 May 2006, Simon Riggs wrote:

>> Now I can run my same pgbench, or do you guys
>> have any other suggestions on attempting to reproduce the problem?
>
> No. We're back on track to try to reproduce the original error.

I've been futzing with trying to reproduce the original problem for a few days
and so far postgres seems to be just fine with a long delay on archiving, so
now I'm rather at a loss.  In fact, I currently have 1,234 xlog files in
pg_xlog, but the archiver is happily archiving one every 5 minutes.  Perhaps
I'll try a long delay followed by a failure to see if that could be it.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954
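
A test like the one described (a long delay, optionally followed by a failure) can be driven by a small stand-in for the real archive script. This is only a sketch: `slow_archive` and its parameters are hypothetical, and it is not the script used in this thread. It relies on the documented convention that an archive_command signals success with a zero exit status and failure with a non-zero one.

```python
import shutil
import time


def slow_archive(src: str, dest: str, delay_seconds: float, fail: bool = False) -> int:
    """Sleep, then copy src to dest; return a shell-style exit status."""
    time.sleep(delay_seconds)  # simulate a slow or wedged archive target
    if fail:
        return 1  # non-zero status: the segment counts as not archived
    shutil.copy2(src, dest)
    return 0
```

Wrapped in a script that passes the return value to `sys.exit`, it could be called from archive_command with the `%p` (segment path) and `%f` (segment file name) placeholders supplying `src` and the final component of `dest`.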

Re: [ADMIN] does wal archiving block the current client connection?

From

Simon Riggs

Date:

21 May 2006, 18:28:47

On Sun, 2006-05-21 at 14:16 -0700, Jeff Frost wrote:
> On Fri, 19 May 2006, Simon Riggs wrote:
>
> >> Now I can run my same pgbench, or do you guys
> >> have any other suggestions on attempting to reproduce the problem?
> >
> > No. We're back on track to try to reproduce the original error.
>
> I've been futzing with trying to reproduce the original problem for a few days
> and so far postgres seems to be just fine with a long delay on archiving, so
> now I'm rather at a loss.  In fact, I currently have 1,234 xlog files in
> pg_xlog, but the archiver is happily archiving one every 5 minutes.  Perhaps
> I'll try a long delay followed by a failure to see if that could be it.

So the chances of the original problem being archiver related are
receding...

--
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com

Re: [ADMIN] does wal archiving block the current client connection?

From

Jeff Frost

Date:

21 May 2006, 18:39:14

On Sun, 21 May 2006, Simon Riggs wrote:

>> I've been futzing with trying to reproduce the original problem for a few days
>> and so far postgres seems to be just fine with a long delay on archiving, so
>> now I'm rather at a loss.  In fact, I currently have 1,234 xlog files in
>> pg_xlog, but the archiver is happily archiving one every 5 minutes.  Perhaps
>> I'll try a long delay followed by a failure to see if that could be it.
>
> So the chances of the original problem being archiver related are
> receding...

This is possible, but I guess I should try and reproduce the actual problem
with the same archive_command script and a CIFS mount just to see what
happens.  Perhaps the real root of the problem is elsewhere; it just seems
strange since the archive_command is the only postgres-related process that
accesses the CIFS share.  More later.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: [ADMIN] does wal archiving block the current client connection?

From

Jeff Frost

Date:

23 May 2006, 00:59:49

On Sun, 21 May 2006, Jeff Frost wrote:

>> So the chances of the original problem being archiver related are
>> receding...
>
> This is possible, but I guess I should try and reproduce the actual problem
> with the same archive_command script and a CIFS mount just to see what
> happens.  Perhaps the real root of the problem is elsewhere; it just seems
> strange since the archive_command is the only postgres-related process that
> accesses the CIFS share.  More later.

I tried both pulling the plug on the CIFS server and unsharing the CIFS share,
but pgbench continued completely unconcerned.  I guess the failure mode of the
NAS device in the customer colo must be something different that I don't yet
know how to simulate.  I suspect I'll have to wait till it happens again and
try to gather some more data before restarting the NAS device.  Thanks for all
your suggestions guys!

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: [ADMIN] does wal archiving block the current client connection?

From

Tom Lane

Date:

23 May 2006, 01:02:47

Jeff Frost <jeff@frostconsultingllc.com> writes:
> I tried both pulling the plug on the CIFS server and unsharing the CIFS share,
> but pgbench continued completely unconcerned.  I guess the failure mode of the
> NAS device in the customer colo must be something different that I don't yet
> know how to simulate.  I suspect I'll have to wait till it happens again and
> try to gather some more data before restarting the NAS device.  Thanks for all
> your suggestions guys!

I'm still thinking that the simplest explanation is that $PGDATA/pg_clog/
is on the NAS device.  Please double-check the file locations.

            regards, tom lane

Re: [ADMIN] does wal archiving block the current client connection?

From

Jeff Frost

Date:

23 May 2006, 01:11:10

On Tue, 23 May 2006, Tom Lane wrote:

> I'm still thinking that the simplest explanation is that $PGDATA/pg_clog/
> is on the NAS device.  Please double-check the file locations.

I know that seems like an excellent candidate, but it really isn't, I swear.
In fact, you almost had me convinced the last time you asked me to check. I
thought some helpful admin had moved something, but no:

postgres  9194  0.0  0.4 486568 16464 ?      S    May16   0:11 /usr/local/pgsql/bin/postmaster -p 5432 -D /usr/local/pgsql/data

db3:~/data $ pwd
/usr/local/pgsql/data

db3:~/data $ ls -l
total 64
-rw-------  1 postgres postgres     4 Feb 13 20:13 PG_VERSION
drwx------  6 postgres postgres  4096 Feb 13 21:00 base
drwx------  2 postgres postgres  4096 May 22 21:03 global
drwx------  2 postgres postgres  4096 May 22 17:45 pg_clog
-rw-------  1 postgres postgres  3575 Feb 13 20:13 pg_hba.conf
-rw-------  1 postgres postgres  1460 Feb 13 20:13 pg_ident.conf
drwx------  4 postgres postgres  4096 Feb 13 20:13 pg_multixact
drwx------  2 postgres postgres  4096 May 22 20:45 pg_subtrans
drwx------  2 postgres postgres  4096 Feb 13 20:13 pg_tblspc
drwx------  2 postgres postgres  4096 Feb 13 20:13 pg_twophase
lrwxrwxrwx  1 postgres postgres     9 Feb 13 22:10 pg_xlog -> /pg_xlog/
-rw-------  1 postgres postgres 13688 May 16 17:50 postgresql.conf
-rw-------  1 postgres postgres    63 May 16 17:54 postmaster.opts
-rw-------  1 postgres postgres    47 May 16 17:54 postmaster.pid

db3:~/data $ mount
/dev/sda2 on / type ext3 (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sda1 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/sdb1 on /usr/local/pgsql type ext3 (rw,data=writeback)
/dev/sda5 on /var type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
//10.1.1.28/pgbackup on /mnt/pgbackup type cifs (rw,mand,noexec,nosuid,nodev)

So, no... I wish it were that easy. :-/

If you have any other suggestions or inspirations, I'm all ears!

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954
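
The manual check above (compare the data directory against the `mount` listing) can be expressed as a longest-prefix match over mount points. The sketch below assumes the Linux-style `DEVICE on DIR type FSTYPE (options)` line format seen in the listing; `parse_mounts` and `filesystem_for` are hypothetical helper names.

```python
import os
from typing import Iterator, Optional, Tuple


def parse_mounts(mount_output: str) -> Iterator[Tuple[str, str]]:
    """Yield (mount_point, fs_type) from 'DEV on DIR type FS (opts)' lines."""
    for line in mount_output.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[1] == "on" and parts[3] == "type":
            yield parts[2], parts[4]


def filesystem_for(
    path: str, mount_output: str, follow_symlinks: bool = True
) -> Optional[Tuple[str, str]]:
    """Return (mount_point, fs_type) of the longest mount point containing path."""
    # Resolving symlinks matters for entries such as pg_xlog -> /pg_xlog/.
    real = os.path.realpath(path) if follow_symlinks else os.path.normpath(path)
    best: Optional[Tuple[str, str]] = None
    for mount_point, fs_type in parse_mounts(mount_output):
        prefix = mount_point.rstrip("/") + "/"
        if (real + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best
```

Applied to the listing above, `/usr/local/pgsql/data/pg_clog` resolves to the ext3 filesystem mounted on `/usr/local/pgsql`, and only paths under `/mnt/pgbackup` resolve to cifs, which agrees with the conclusion drawn by hand.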