Thread: does wal archiving block the current client connection?


From
Jeff Frost
Date:
I've run into a problem with a PITR setup at a client.  The problem is that
whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
problems, it seems that the current client connection gets blocked and this
eventually builds up to a "sorry, too many clients already" error.  I'm
wondering if this is expected behavior with the archive command and if I
should build in some more smarts to my archive script.  Maybe I should fork
and waitpid such that I can use a manual timeout shorter than whatever the
CIFS timeout is so that I can return an error in a reasonable amount of time?

Has anyone else seen this problem?  Restarting the NAS device fixes the
problem, but it would be much preferable if postgres could soldier along without
the NAS for a little while before we resuscitate it.  We don't have an NFS or
rsync server available in this environment currently, though I suppose setting
up an rsync server for Windows on the NAS would be a possibility.

Any suggestions much appreciated.

Currently the script is fairly simple and just does a 'cp' and then a 'gzip',
although we do use cp -f to copy over a previously failed 'cp'.
The script is below:

. /usr/local/lib/includes.sh

FULLPATH="$1"
FILENAME="$2"

#
# Make sure we have pgbackup dir mounted
#
checkpgbackupmount

#
# Copy the WAL segment to the backup mount
#
if ! /bin/cp -f "$FULLPATH" "$PITRDESTDIR/$FILENAME"; then
         die "Could not cp $FULLPATH to $PITRDESTDIR/$FILENAME"
fi

#
# Compress it; make sure that worked, otherwise roll back
#
if ! /usr/bin/gzip -f "$PITRDESTDIR/$FILENAME"; then
         /bin/rm -f "$PITRDESTDIR/$FILENAME"
         die "Could not /usr/bin/gzip $PITRDESTDIR/$FILENAME"
fi

exit 0
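The fork-and-waitpid idea mentioned at the top of the thread can be approximated in shell with a watchdog around the copy, so the archive command fails fast instead of hanging for the full CIFS timeout.  A minimal sketch, assuming the coreutils `timeout` utility is available; the 30-second limit and the function name are illustrative, not part of the original script:

```shell
#!/bin/sh
# archive_one SRC DESTDIR: copy one WAL segment, but give up after
# TIMEOUT_SECS so a wedged NAS mount cannot hang the archiver's child
# process indefinitely.  A nonzero exit makes PostgreSQL retry this
# segment later.
TIMEOUT_SECS=30

archive_one() {
    src="$1"
    dest="$2/$(basename "$1")"
    if ! timeout "$TIMEOUT_SECS" /bin/cp -f "$src" "$dest"; then
        echo "archive of $src timed out or failed" >&2
        return 1
    fi
    return 0
}
```

PostgreSQL treats any nonzero exit from archive_command as a failure and retries the same WAL file later, so failing fast here only shortens the hang; the segment still accumulates in pg_xlog until the NAS recovers.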


--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: does wal archiving block the current client connection?

From
Tom Arthurs
Date:
What might be more bulletproof would be to make the archive command
copy the file to an intermediate local directory, then have a
daemon/cron job that wakes up once a minute or so, checks for new
files, and copies them to the network mount.  You may want to use
something like lsof to make sure the archive command has finished and
closed the file before moving it to the network drive.

This is what I do, and I've never had a failure of the archive command.
I've had a few network errors on the network drive (I use NFS), which I
fixed at my leisure, with no problems for the PostgreSQL server.
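The two-stage approach described above might look something like the sketch below; the spool-draining function is what the cron job would run every minute.  The function name and directory layout are illustrative assumptions, not from the original setup:

```shell
#!/bin/sh
# drain_spool SPOOLDIR NASDIR: move fully written WAL copies from a
# local spool directory (filled by archive_command) onto the network
# mount.  Copy-then-remove, so a failed network copy leaves the file
# spooled for the next cron run instead of losing it.
drain_spool() {
    spool="$1"
    nas="$2"
    for f in "$spool"/*; do
        [ -f "$f" ] || continue
        if /bin/cp "$f" "$nas/$(basename "$f")"; then
            /bin/rm -f "$f"
        fi
    done
}
```

With this split, archive_command only ever writes to local disk, so a wedged NAS can no longer stall the archiver; the backlog simply builds in the spool directory until the mount comes back.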

Jeff Frost wrote:
> I've run into a problem with a PITR setup at a client.  The problem is
> that whenever the CIFS NAS device that we're mounting at /mnt/pgbackup
> has problems, it seems that the current client connection gets blocked
> and this eventually builds up to a "sorry, too many clients already"
> error.  I'm wondering if this is expected behavior with the archive
> command and if I should build in some more smarts to my archive
> script.  Maybe I should fork and waitpid such that I can use a manual
> timeout shorter than whatever the CIFS timeout is so that I can return
> an error in a reasonable amount of time?
>
> Has anyone else seen this problem?  Restarting the NAS device fixes
> the problem but it would be much preferable if postgres could soldier
> along without the NAS for a little while before we resuscitate it.  We
> don't have an NFS or rsync server available in this environment
> currently, though I suppose setting up an rsync server for windows on
> the NAS would be a possibility.
>
> Any suggestions much appreciated.
>
> Currently the script is fairly simple and just does a 'cp' and then a
> 'gzip' although we do use cp -f to copy over a previously
> failed 'cp'. Script is below:
>
> . /usr/local/lib/includes.sh
>
> FULLPATH="$1"
> FILENAME="$2"
>
> #
> # Make sure we have pgbackup dir mounted
> #
> checkpgbackupmount
>
> /bin/cp -f "$FULLPATH" "$PITRDESTDIR/$FILENAME"
> if [ $? -ne 0 ]; then
>         die "Could not cp $FULLPATH to $PITRDESTDIR/$FILENAME"
> fi
>
> /usr/bin/gzip -f "$PITRDESTDIR/$FILENAME"
> #
> # Make sure it worked, otherwise roll back
> #
> if [ $? -ne 0 ]; then
>         /bin/rm -f "$PITRDESTDIR/$FILENAME"
>         die "Could not /usr/bin/gzip $PITRDESTDIR/$FILENAME"
> fi
>
> exit 0
>
>

Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> I've run into a problem with a PITR setup at a client.  The problem is that
> whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
> problems, it seems that the current client connection gets blocked and this
> eventually builds up to a "sorry, too many clients already" error.  I'm
> wondering if this is expected behavior with the archive command

No, I can't see what the connection should be there.  It's supposed to
be designed so that the archive command can take its sweet old time and
nothing happens except that a backlog of WAL files builds up in pg_xlog.

What PG version is this exactly?  Are you sure that the only connection
of the NAS device to PG operations is through the archive script, ie,
you don't have any part of the normal PG data directory mounted on it?

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Mon, 15 May 2006, Tom Lane wrote:

> No, I can't see what the connection should be there.  It's supposed to
> be designed so that the archive command can take its sweet old time and
> nothing happens except that a backlog of WAL files builds up in pg_xlog.

That's what I thought, but that doesn't seem to be what I'm observing.  Of
course the NAS device only gets wedged about once every month or two, so it's
difficult to reproduce.  I'd probably need to build another similar system,
hit it with pgbench or something similar, then pull the plug on the NAS
device to see if it acts the same way.

>
> What PG version is this exactly?  Are you sure that the only connection
> of the NAS device to PG operations is through the archive script, ie,
> you don't have any part of the normal PG data directory mounted on it?

It's postgresql-8.1.3 compiled from source.

PGDATA is in /usr/local/pgsql/data which is a local RAID10 array, and /pg_xlog
is mounted on a local RAID1.

Things that touch the /mnt/pgbackup directory are the PITR base backup scripts,
which run twice monthly, and the archive_command script.



--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> On Mon, 15 May 2006, Tom Lane wrote:
>> No, I can't see what the connection should be there.  It's supposed to
>> be designed so that the archive command can take its sweet old time and
>> nothing happens except that a backlog of WAL files builds up in pg_xlog.

> That's what I thought, but that doesn't seem to be what I'm observing.  Of
> course the NAS device only gets wedged about once every month or two, so it's
> difficult to reproduce.

If it's really a PG bug, it should be trivial to reproduce: put a long
sleep in the archive-command script.
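A sketch of that repro, with the delay as a parameter (the function name and paths are hypothetical): wrap this in a script, point archive_command at it, and watch whether client backends stay responsive while WAL accumulates in pg_xlog.

```shell
#!/bin/sh
# slow_archive DELAY SRC DESTDIR: deliberately slow archive step.  If
# archiving really doesn't block clients, connections should keep
# working normally during the sleep, with WAL piling up in pg_xlog.
slow_archive() {
    sleep "$1"                              # simulate a hung NAS
    /bin/cp "$2" "$3/$(basename "$2")"
}
```

In postgresql.conf this would be wired up as something like archive_command = '/usr/local/bin/slow_archive.sh 600 %p %f' (the 600-second delay is an arbitrary example).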

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Mon, 15 May 2006, Tom Lane wrote:

>> That's what I thought, but that doesn't seem to be what I'm observing.  Of
>> course the NAS device only gets wedged about once every month or two, so it's
>> difficult to reproduce.
>
> If it's really a PG bug, it should be trivial to reproduce: put a long
> sleep in the archive-command script.

Thanks Tom.  I'll try and reproduce on one of the other PG servers I have
available when I have a chance and get back to you if it looks like a possible
bug.


--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
> I've run into a problem with a PITR setup at a client.  The problem is that
> whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
> problems

What kind of problems?

> , it seems that the current client connection gets blocked and this
> eventually builds up to a "sorry, too many clients already" error.

This sounds like the archiver keeps waking up and trying the command,
which fails, and that request is causing a resource leak on the NAS, so
the archiver's retries eventually fail altogether.  Or am I
misunderstanding your issue?

> I'm
> wondering if this is expected behavior with the archive command and if I
> should build in some more smarts to my archive script.  Maybe I should fork
> and waitpid such that I can use a manual timeout shorter than whatever the
> CIFS timeout is so that I can return an error in a reasonable amount of time?

The archiver is designed around the idea that *attempting* to archive
is a task it can do indefinitely without a problem; it's up to you
to spot that the link is down.

We could put something in to make the retry period lengthen, but you'd
need to make a reasonable case for how that would increase robustness.

--
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com


Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Mon, 15 May 2006, Simon Riggs wrote:

> On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
>> I've run into a problem with a PITR setup at a client.  The problem is that
>> whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
>> problems
>
> What kind of problems?

It becomes unwritable, for whatever reason CIFS shares become unwritable.
It's a Windows 2003 NAS device, and a reboot solves the problem, but it
leaves no event logs on the Windows side of things, so it's difficult to
determine the root cause.

>
>> , it seems that the current client connection gets blocked and this
>> eventually builds up to a "sorry, too many clients already" error.
>
> This sounds like the archiver keeps waking up and trying the command,
> but it fails, yet that request is causing a resource leak on the NAS.
> Eventually, archiver retrying the command eventually fails. Or am I
> misunderstanding your issues?

That's possible.  Does the archiver use a DB connection whenever it tries to
run archive_command?  If so, then that's almost certainly the problem.  I
suspect a faster timeout on the CIFS mount would fix the issue as well, but I
didn't see any such option in the mount.cifs manpage.

> The archiver is designed around the thought that *attempting* to archive
> is a task that it can do indefinitely without a problem; its up to you
> to spot that the link is down.
>
> We can put something in to make the retry period elongate, but you'd
> need to put a reasonable case for how that would increase robustness.

That all sounds perfectly reasonable.  If the archiver is using up a
connection for each archive_command issued, then I suspect that's our problem,
as there were also lots of debug logs showing that the db was trying to
archive several WAL files at nearly the same time, likely pushing us over our
100-connection limit.  If the archiver does not use up a connection, then I
suppose I don't know what's actually going on, unless postgres blocks the
commit of the transaction that triggered the archive_command until the
archive command finishes (or fails).

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: does wal archiving block the current client connection?

From
Scott Marlowe
Date:
On Mon, 2006-05-15 at 16:29, Jeff Frost wrote:
> On Mon, 15 May 2006, Simon Riggs wrote:
>
> > On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
> >> I've run into a problem with a PITR setup at a client.  The problem is that
> >> whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
> >> problems
> >
> > What kind of problems?
>
> It becomes unwritable for whatever reason CIFS shares become unwritable.  It's
> a windows 2003 NAS device and a reboot solves the problem, but it leaves no
> event logs on the windows side of things, so difficult to determine the root
> cause.

I don't think you need logs to figure out that the problem is that it's
a Windows based NAS device.  :)

Seriously, we had BSD-based NAS devices at one place I worked, and the
company making them switched to Win2k3.  After we had one or two do
exactly what you're talking about, we just started building our own
based on Red Hat or BSD.  Same hardware (we just reformatted the drives
on one of the ones that came with W2k3), and we never had another
moment's problem with them.

In other news... Does the archiver actually connect to the database?

Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Mon, 2006-05-15 at 14:29 -0700, Jeff Frost wrote:
> On Mon, 15 May 2006, Simon Riggs wrote:
>
> > On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
> >> I've run into a problem with a PITR setup at a client.  The problem is that
> >> whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
> >> problems
> >
> > What kind of problems?
>
> It becomes unwritable for whatever reason CIFS shares become unwritable.  It's
> a windows 2003 NAS device and a reboot solves the problem, but it leaves no
> event logs on the windows side of things, so difficult to determine the root
> cause.

You should be able to re-create this problem without the database being
involved.  Just set up a driver program on top of the archive script so
it runs in a tighter loop than the archiver would.  If you still get the
Windows NAS error... well, I'll leave that to you.
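The driver described above could be as simple as this hypothetical loop, which exercises the archive script far faster than the archiver's own pace and stops at the first failure:

```shell
#!/bin/sh
# hammer_archive SCRIPT SRC FILENAME N: invoke the archive script N
# times back-to-back.  If the NAS wedges under this load, the problem
# reproduces with no PostgreSQL involvement at all.
hammer_archive() {
    i=0
    while [ "$i" -lt "$4" ]; do
        "$1" "$2" "$3" || {
            echo "archive script failed on iteration $i" >&2
            return 1
        }
        i=$((i + 1))
    done
}
```

Running this against the CIFS mount with a few hundred iterations should show quickly whether the mount, rather than PostgreSQL, is what falls over.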

> >> , it seems that the current client connection gets blocked and this
> >> eventually builds up to a "sorry, too many clients already" error.

Tell us more about what the blockage looks like. We may yet thank
Windows for finding a bug, but I'm not sure yet.

> > This sounds like the archiver keeps waking up and trying the command,
> > but it fails, yet that request is causing a resource leak on the NAS.
> > Eventually, archiver retrying the command eventually fails. Or am I
> > misunderstanding your issues?
>
> that's possible.  Does the archiver use a DB connection whenever it tries to
> run archive_command?

Not at all.

>  If so, then that's almost certainly the problem.  I
> suspect a faster timeout on the CIFS mount would fix the issue as well, but I
> didn't see any such options in the mount.cifs manpage.
>
> > The archiver is designed around the thought that *attempting* to archive
> > is a task that it can do indefinitely without a problem; its up to you
> > to spot that the link is down.
> >
> > We can put something in to make the retry period elongate, but you'd
> > need to put a reasonable case for how that would increase robustness.
>
> That all sounds perfectly reasonable.  If the archiver is using up a
> connection for each archive_command issued, then I suspect that's our problem,
> as there were also lots of debug logs showing that the db was trying to
> archive several WAL files at near the same time, likely pushing us over our
> 100 connection limit.

Oh, you mean database clients cannot connect.  I thought you meant you
were getting a CIFS client connection error from the archiver.  That's
weird.

>  If the archiver does not use up a connection, then I
> suppose I don't know what's actually going on unless postgres blocks the
> commit of the transaction which triggered the archive_command until the
> archive command finishes (or fails).

I think you need to show the database log covering the period of the error.

Are you running out of disk space in the database directory? Can you
check again that pg_xlog and pg_xlog/archive_status are definitely not on
the NAS?

--
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com


Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Mon, 2006-05-15 at 16:58 -0700, Jeff Frost wrote:

> The log is below.  Note that the problem began around 2a.m. around the time
> the complaint about checkpoint segments happens.  After a bit of research it
> appears that the checkpoint complaint happens when our db maintenance job
> kicks off at 2:00a.m. The crisis ended around 3:45a.m. with the restart of the
> NAS device and normal operations resumed just before the end of this log.
> There were also lots of CIFS errors of the following type in dmesg and in the
> log:
>
>   CIFS VFS: Send error in read = -9
>   CIFS VFS: Send error in read = -9

I don't see much evidence of a connection between the archiver and these
issues.  The problems start after the autovacuum of "vb_web" at 02:08.  That
seems much more likely to have something to do with client connections
than the archiver, which is really nothing too special.  However, the
archiver doesn't start processing files again after 02:06.  When is the
next "archived transaction log file" message?

Whatever happened between 02:08 and 02:14 seems important.  Do you have
any other logs to cross-correlate for different events?  What do the
clients think happened?  What did they do during that time?

> May 15 01:57:31 db3 postgres[12723]: [1-1] LOG:  autovacuum: processing database "vb_web"
> May 15 01:58:03 db3 crond(pam_unix)[12755]: session closed for user postgres
> May 15 01:58:31 db3 postgres[12824]: [1-1] LOG:  autovacuum: processing database "template1"
> May 15 01:59:01 db3 crond(pam_unix)[12858]: session closed for user postgres
> May 15 01:59:31 db3 postgres[12928]: [1-1] LOG:  autovacuum: processing database "postgres"
> May 15 02:00:10 db3 crond(pam_unix)[12964]: session closed for user postgres
> May 15 02:00:11 db3 postgres[22008]: [1090-1] LOG:  checkpoints are occurring too frequently (10 seconds apart)
> May 15 02:00:11 db3 postgres[22008]: [1090-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
> May 15 02:00:14 db3 postgres[22009]: [4886-1] LOG:  archived transaction log file "00000001000000160000007F"
> May 15 02:00:23 db3 postgres[22009]: [4887-1] LOG:  archived transaction log file "000000010000001600000080"
> May 15 02:00:23 db3 postgres[22008]: [1091-1] LOG:  checkpoints are occurring too frequently (9 seconds apart)
> May 15 02:00:23 db3 postgres[22008]: [1091-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
> May 15 02:00:31 db3 postgres[22009]: [4888-1] LOG:  archived transaction log file "000000010000001600000081"
> May 15 02:00:36 db3 postgres[13747]: [1-1] LOG:  autovacuum: processing database "vb_web"
> May 15 02:00:36 db3 postgres[22008]: [1092-1] LOG:  checkpoints are occurring too frequently (13 seconds apart)
> May 15 02:00:36 db3 postgres[22008]: [1092-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
> May 15 02:00:38 db3 postgres[22009]: [4889-1] LOG:  archived transaction log file "000000010000001600000082"
> May 15 02:00:45 db3 postgres[22009]: [4890-1] LOG:  archived transaction log file "000000010000001600000083"
> May 15 02:00:53 db3 postgres[22009]: [4891-1] LOG:  archived transaction log file "000000010000001600000084"
> May 15 02:00:54 db3 postgres[22008]: [1093-1] LOG:  checkpoints are occurring too frequently (18 seconds apart)
> May 15 02:00:54 db3 postgres[22008]: [1093-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
> May 15 02:01:00 db3 postgres[22009]: [4892-1] LOG:  archived transaction log file "000000010000001600000085"
> May 15 02:01:05 db3 crond(pam_unix)[13815]: session closed for user postgres
> May 15 02:01:07 db3 postgres[22009]: [4893-1] LOG:  archived transaction log file "000000010000001600000086"
> May 15 02:01:13 db3 postgres[22008]: [1094-1] LOG:  checkpoints are occurring too frequently (19 seconds apart)
> May 15 02:01:13 db3 postgres[22008]: [1094-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
> May 15 02:01:14 db3 postgres[22009]: [4894-1] LOG:  archived transaction log file "000000010000001600000087"
> May 15 02:01:23 db3 postgres[22009]: [4895-1] LOG:  archived transaction log file "000000010000001600000088"
> May 15 02:01:25 db3 postgres[22008]: [1095-1] LOG:  checkpoints are occurring too frequently (12 seconds apart)
> May 15 02:01:25 db3 postgres[22008]: [1095-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
> May 15 02:01:31 db3 postgres[22009]: [4896-1] LOG:  archived transaction log file "000000010000001600000089"
> May 15 02:01:35 db3 postgres[22008]: [1096-1] LOG:  checkpoints are occurring too frequently (10 seconds apart)
> May 15 02:01:35 db3 postgres[22008]: [1096-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
> May 15 02:01:39 db3 postgres[22009]: [4897-1] LOG:  archived transaction log file "00000001000000160000008A"
> May 15 02:01:45 db3 postgres[22008]: [1097-1] LOG:  checkpoints are occurring too frequently (10 seconds apart)
> May 15 02:01:45 db3 postgres[22008]: [1097-2] HINT:  Consider increasing the configuration parameter "checkpoint_segments".
> May 15 02:01:47 db3 postgres[22009]: [4898-1] LOG:  archived transaction log file "00000001000000160000008B"
> May 15 02:01:47 db3 postgres[13964]: [1-1] LOG:  autovacuum: processing database "template1"
> May 15 02:01:54 db3 postgres[22009]: [4899-1] LOG:  archived transaction log file "00000001000000160000008C"
> May 15 02:02:00 db3 postgres[22009]: [4900-1] LOG:  archived transaction log file "00000001000000160000008D"
> May 15 02:02:07 db3 crond(pam_unix)[14000]: session closed for user postgres
> May 15 02:02:08 db3 postgres[22009]: [4901-1] LOG:  archived transaction log file "00000001000000160000008E"
> May 15 02:02:15 db3 postgres[22009]: [4902-1] LOG:  archived transaction log file "00000001000000160000008F"
> May 15 02:02:21 db3 postgres[22009]: [4903-1] LOG:  archived transaction log file "000000010000001600000090"
> May 15 02:02:27 db3 postgres[22009]: [4904-1] LOG:  archived transaction log file "000000010000001600000091"
> May 15 02:02:34 db3 postgres[22009]: [4905-1] LOG:  archived transaction log file "000000010000001600000092"
> May 15 02:02:42 db3 postgres[22009]: [4906-1] LOG:  archived transaction log file "000000010000001600000093"
> May 15 02:02:51 db3 postgres[14150]: [1-1] LOG:  autovacuum: processing database "postgres"
> May 15 02:02:52 db3 postgres[22009]: [4907-1] LOG:  archived transaction log file "000000010000001600000094"
> May 15 02:03:00 db3 postgres[22009]: [4908-1] LOG:  archived transaction log file "000000010000001600000095"
> May 15 02:03:05 db3 crond(pam_unix)[14182]: session closed for user postgres
> May 15 02:03:08 db3 postgres[22009]: [4909-1] LOG:  archived transaction log file "000000010000001600000096"
> May 15 02:03:15 db3 postgres[22009]: [4910-1] LOG:  archived transaction log file "000000010000001600000097"
> May 15 02:03:21 db3 postgres[22009]: [4911-1] LOG:  archived transaction log file "000000010000001600000098"
> May 15 02:03:51 db3 postgres[14292]: [1-1] LOG:  autovacuum: processing database "vb_web"
> May 15 02:04:08 db3 crond(pam_unix)[14305]: session closed for user postgres
> May 15 02:04:51 db3 postgres[22009]: [4912-1] LOG:  archived transaction log file "000000010000001600000099"
> May 15 02:05:05 db3 crond(pam_unix)[14415]: session closed for user postgres
> May 15 02:06:05 db3 crond(pam_unix)[14517]: session closed for user postgres
> May 15 02:06:24 db3 postgres[14582]: [1-1] LOG:  autovacuum: processing database "template1"
> May 15 02:06:43 db3 postgres[22009]: [4913-1] LOG:  archived transaction log file "00000001000000160000009A"
> May 15 02:07:07 db3 crond(pam_unix)[14632]: session closed for user postgres
> May 15 02:07:33 db3 postgres[14702]: [1-1] LOG:  autovacuum: processing database "postgres"
> May 15 02:08:48 db3 postgres[14724]: [1-1] LOG:  autovacuum: processing database "vb_web"
> May 15 02:13:03 db3 postgres[14781]: [1-1] LOG:  unexpected EOF on client connection
> May 15 02:14:42 db3 postgres[14796]: [1-1] LOG:  unexpected EOF on client connection
> May 15 02:15:53 db3 postgres[14755]: [1-1] LOG:  unexpected EOF on client connection
> May 15 02:16:11 db3 postgres[14856]: [1-1] LOG:  unexpected EOF on client connection
> May 15 02:19:52 db3 postgres[14869]: [1-1] LOG:  unexpected EOF on client connection

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Tue, 16 May 2006, Simon Riggs wrote:

> I don't see much evidence for a connection between archiver and these
> issues. The problems start after autovacuum of "vb_web" at 02:08. That
> seems much more likely to have something to do with client connections
> than the archiver - which is really nothing too special. However,
> archiver doesn't start processing files again after 02:06. When is the
> next "archived transaction log file" message?
>
> Whatever happened between 02:08 and 02:14 seems important. Do you have
> any other logs to cross-correlate for different events. What do the
> clients think happened? What did they do during that time?

I have the logs, and after reviewing /var/log/messages for that time period,
there is no other activity besides postgres.  The next "archived transaction
log file" messages are:

May 14 02:24:54 db3 postgres[22009]: [4876-1] LOG:  archived transaction log file "000000010000001600000078"
May 14 02:34:05 db3 postgres[22009]: [4877-1] LOG:  archived transaction log file "000000010000001600000079"
May 14 02:35:49 db3 postgres[22009]: [4878-1] LOG:  archived transaction log file "00000001000000160000007A"
May 14 02:41:58 db3 postgres[22009]: [4879-1] LOG:  archived transaction log file "00000001000000160000007B"
May 14 02:57:35 db3 postgres[22009]: [4880-1] LOG:  archived transaction log file "00000001000000160000007C"

The clients are all on two Windows web servers; does anybody know where to
look for errors with those?


>
>> May 15 01:57:31 db3 postgres[12723]: [1-1] LOG:  autovacuum: processing database "vb_web"
>> May 15 01:58:03 db3 crond(pam_unix)[12755]: session closed for user postgres
>> May 15 01:58:31 db3 postgres[12824]: [1-1] LOG:  autovacuum: processing database "template1"
>> May 15 01:59:01 db3 crond(pam_unix)[12858]: session closed for user postgres
>> May 15 01:59:31 db3 postgres[12928]: [1-1] LOG:  autovacuum: processing database "postgres"
>> May 15 02:00:10 db3 crond(pam_unix)[12964]: session closed for user postgres
>> May 15 02:00:11 db3 postgres[22008]: [1090-1] LOG:  checkpoints are occurring too frequently (10 seconds apart)
>> May 15 02:00:11 db3 postgres[22008]: [1090-2] HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
>> May 15 02:00:14 db3 postgres[22009]: [4886-1] LOG:  archived transaction log file "00000001000000160000007F"
>> May 15 02:00:23 db3 postgres[22009]: [4887-1] LOG:  archived transaction log file "000000010000001600000080"
>> May 15 02:00:23 db3 postgres[22008]: [1091-1] LOG:  checkpoints are occurring too frequently (9 seconds apart)
>> May 15 02:00:23 db3 postgres[22008]: [1091-2] HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
>> May 15 02:00:31 db3 postgres[22009]: [4888-1] LOG:  archived transaction log file "000000010000001600000081"
>> May 15 02:00:36 db3 postgres[13747]: [1-1] LOG:  autovacuum: processing database "vb_web"
>> May 15 02:00:36 db3 postgres[22008]: [1092-1] LOG:  checkpoints are occurring too frequently (13 seconds apart)
>> May 15 02:00:36 db3 postgres[22008]: [1092-2] HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
>> May 15 02:00:38 db3 postgres[22009]: [4889-1] LOG:  archived transaction log file "000000010000001600000082"
>> May 15 02:00:45 db3 postgres[22009]: [4890-1] LOG:  archived transaction log file "000000010000001600000083"
>> May 15 02:00:53 db3 postgres[22009]: [4891-1] LOG:  archived transaction log file "000000010000001600000084"
>> May 15 02:00:54 db3 postgres[22008]: [1093-1] LOG:  checkpoints are occurring too frequently (18 seconds apart)
>> May 15 02:00:54 db3 postgres[22008]: [1093-2] HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
>> May 15 02:01:00 db3 postgres[22009]: [4892-1] LOG:  archived transaction log file "000000010000001600000085"
>> May 15 02:01:05 db3 crond(pam_unix)[13815]: session closed for user postgres
>> May 15 02:01:07 db3 postgres[22009]: [4893-1] LOG:  archived transaction log file "000000010000001600000086"
>> May 15 02:01:13 db3 postgres[22008]: [1094-1] LOG:  checkpoints are occurring too frequently (19 seconds apart)
>> May 15 02:01:13 db3 postgres[22008]: [1094-2] HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
>> May 15 02:01:14 db3 postgres[22009]: [4894-1] LOG:  archived transaction log file "000000010000001600000087"
>> May 15 02:01:23 db3 postgres[22009]: [4895-1] LOG:  archived transaction log file "000000010000001600000088"
>> May 15 02:01:25 db3 postgres[22008]: [1095-1] LOG:  checkpoints are occurring too frequently (12 seconds apart)
>> May 15 02:01:25 db3 postgres[22008]: [1095-2] HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
>> May 15 02:01:31 db3 postgres[22009]: [4896-1] LOG:  archived transaction log file "000000010000001600000089"
>> May 15 02:01:35 db3 postgres[22008]: [1096-1] LOG:  checkpoints are occurring too frequently (10 seconds apart)
>> May 15 02:01:35 db3 postgres[22008]: [1096-2] HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
>> May 15 02:01:39 db3 postgres[22009]: [4897-1] LOG:  archived transaction log file "00000001000000160000008A"
>> May 15 02:01:45 db3 postgres[22008]: [1097-1] LOG:  checkpoints are occurring too frequently (10 seconds apart)
>> May 15 02:01:45 db3 postgres[22008]: [1097-2] HINT:  Consider increasing the configuration parameter
"checkpoint_segments".
>> May 15 02:01:47 db3 postgres[22009]: [4898-1] LOG:  archived transaction log file "00000001000000160000008B"
>> May 15 02:01:47 db3 postgres[13964]: [1-1] LOG:  autovacuum: processing database "template1"
>> May 15 02:01:54 db3 postgres[22009]: [4899-1] LOG:  archived transaction log file "00000001000000160000008C"
>> May 15 02:02:00 db3 postgres[22009]: [4900-1] LOG:  archived transaction log file "00000001000000160000008D"
>> May 15 02:02:07 db3 crond(pam_unix)[14000]: session closed for user postgres
>> May 15 02:02:08 db3 postgres[22009]: [4901-1] LOG:  archived transaction log file "00000001000000160000008E"
>> May 15 02:02:15 db3 postgres[22009]: [4902-1] LOG:  archived transaction log file "00000001000000160000008F"
>> May 15 02:02:21 db3 postgres[22009]: [4903-1] LOG:  archived transaction log file "000000010000001600000090"
>> May 15 02:02:27 db3 postgres[22009]: [4904-1] LOG:  archived transaction log file "000000010000001600000091"
>> May 15 02:02:34 db3 postgres[22009]: [4905-1] LOG:  archived transaction log file "000000010000001600000092"
>> May 15 02:02:42 db3 postgres[22009]: [4906-1] LOG:  archived transaction log file "000000010000001600000093"
>> May 15 02:02:51 db3 postgres[14150]: [1-1] LOG:  autovacuum: processing database "postgres"
>> May 15 02:02:52 db3 postgres[22009]: [4907-1] LOG:  archived transaction log file "000000010000001600000094"
>> May 15 02:03:00 db3 postgres[22009]: [4908-1] LOG:  archived transaction log file "000000010000001600000095"
>> May 15 02:03:05 db3 crond(pam_unix)[14182]: session closed for user postgres
>> May 15 02:03:08 db3 postgres[22009]: [4909-1] LOG:  archived transaction log file "000000010000001600000096"
>> May 15 02:03:15 db3 postgres[22009]: [4910-1] LOG:  archived transaction log file "000000010000001600000097"
>> May 15 02:03:21 db3 postgres[22009]: [4911-1] LOG:  archived transaction log file "000000010000001600000098"
>> May 15 02:03:51 db3 postgres[14292]: [1-1] LOG:  autovacuum: processing database "vb_web"
>> May 15 02:04:08 db3 crond(pam_unix)[14305]: session closed for user postgres
>> May 15 02:04:51 db3 postgres[22009]: [4912-1] LOG:  archived transaction log file "000000010000001600000099"
>> May 15 02:05:05 db3 crond(pam_unix)[14415]: session closed for user postgres
>> May 15 02:06:05 db3 crond(pam_unix)[14517]: session closed for user postgres
>> May 15 02:06:24 db3 postgres[14582]: [1-1] LOG:  autovacuum: processing database "template1"
>> May 15 02:06:43 db3 postgres[22009]: [4913-1] LOG:  archived transaction log file "00000001000000160000009A"
>> May 15 02:07:07 db3 crond(pam_unix)[14632]: session closed for user postgres
>> May 15 02:07:33 db3 postgres[14702]: [1-1] LOG:  autovacuum: processing database "postgres"
>> May 15 02:08:48 db3 postgres[14724]: [1-1] LOG:  autovacuum: processing database "vb_web"
>> May 15 02:13:03 db3 postgres[14781]: [1-1] LOG:  unexpected EOF on client connection
>> May 15 02:14:42 db3 postgres[14796]: [1-1] LOG:  unexpected EOF on client connection
>> May 15 02:15:53 db3 postgres[14755]: [1-1] LOG:  unexpected EOF on client connection
>> May 15 02:16:11 db3 postgres[14856]: [1-1] LOG:  unexpected EOF on client connection
>> May 15 02:19:52 db3 postgres[14869]: [1-1] LOG:  unexpected EOF on client connection

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> On Tue, 16 May 2006, Simon Riggs wrote:
>> Whatever happened between 02:08 and 02:14 seems important.

> I have the logs and after reviewing /var/log/messages for that time period,
> there is no other activity besides postgres.

I have a lurking feeling that the still-hypothetical connection between
archiver and foreground operations might come into operation at pg_clog
page boundaries (which require emitting XLOG events) --- that is, every
32K transactions something special happens.  The time delay between
archiver wedging and foreground wedging would then correspond to how
long it took the XID counter to reach the next 32K multiple.  (Jeff,
what transaction rate do you see on that server --- is that a plausible
delay for some thousands of transactions to pass?)

This is just a guess, but if you check the archives for Chris K-L's
out-of-disk-space server meltdown a year or three ago, you'll see
something similar.

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Wed, 17 May 2006, Tom Lane wrote:

>
> I have a lurking feeling that the still-hypothetical connection between
> archiver and foreground operations might come into operation at pg_clog
> page boundaries (which require emitting XLOG events) --- that is, every
> 32K transactions something special happens.  The time delay between
> archiver wedging and foreground wedging would then correspond to how
> long it took the XID counter to reach the next 32K multiple.  (Jeff,
> what transaction rate do you see on that server --- is that a plausible
> delay for some thousands of transactions to pass?)
>
> This is just a guess, but if you check the archives for Chris K-L's
> out-of-disk-space server meltdown a year or three ago, you'll see
> something similar.

This sounds interesting and I'll go have a look for this thread in the
archives.

It looks like it's running about 3,000 transactions per minute right now
(around 10pm).  I got that by doing the following...tell me if this is
reasonable:

db3:~ $ echo "select xact_commit from pg_stat_database where datname = 'vb_web';" | psql vb_web ; \
        sleep 60 ; \
        echo "select xact_commit from pg_stat_database where datname = 'vb_web';" | psql vb_web
  xact_commit
-------------
       806325
(1 row)

  xact_commit
-------------
       809414
(1 row)

So, doing a little subtraction, I got 3,089 transactions for the 60 second
period I ran it.  I would guess it's quite a bit more during the day.  That
means it would hit 32k transactions in around 10-11 minutes...does that fit in
with our timing from above?  (Tom, I'm not sure which timing you're
using.)
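For what it's worth, the arithmetic above can be checked with a quick shell
sketch (the two sample values are the ones captured above; 32768 is the
pg_clog page boundary Tom mentioned):

```shell
# Back-of-envelope: commit rate from two xact_commit samples taken 60s
# apart, and the worst-case time to the next 32K-transaction boundary.
start=806325    # first xact_commit sample
end=809414      # second sample, taken 60 seconds later
interval=60     # seconds between the two samples
rate=$(( (end - start) * 60 / interval ))
echo "transactions/min: $rate"
echo "minutes to cross a 32K boundary (worst case): $(( 32768 / rate ))"
```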



Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Wed, 2006-05-17 at 00:36 -0400, Tom Lane wrote:
> Jeff Frost <jeff@frostconsultingllc.com> writes:
> > On Tue, 16 May 2006, Simon Riggs wrote:
> >> Whatever happened between 02:08 and 02:14 seems important.
>
> > I have the logs and after reviewing /var/log/messages for that time period,
> > there is no other activity besides postgres.
>
> I have a lurking feeling that the still-hypothetical connection between
> archiver and foreground operations might come into operation at pg_clog
> page boundaries (which require emitting XLOG events) --- that is, every
> 32K transactions something special happens.  The time delay between
> archiver wedging and foreground wedging would then correspond to how
> long it took the XID counter to reach the next 32K multiple.  (Jeff,
> what transaction rate do you see on that server --- is that a plausible
> delay for some thousands of transactions to pass?)
>
> This is just a guess, but if you check the archives for Chris K-L's
> out-of-disk-space server meltdown a year or three ago, you'll see
> something similar.

You'll have to explain a little more. I checked the archives...

archiver looks for archive_status files that end with .ready and that
has got nothing at all to do with transactions, LWlocks etc. If there's
a file ready, it will archive it, if there's not - it won't. There is
very deliberately a very low amount of synchronization there: archiver
holds no locks, LWLocks or spinlocks at any time.
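For readers following along, the protocol described above can be sketched in
shell (purely illustrative -- the real archiver is C code inside the backend,
and archive_cmd here is a hypothetical stand-in for the user's
archive_command):

```shell
# Illustrative sketch of the archiver loop: scan archive_status for
# *.ready markers and rename each to *.done once archiving succeeds.
archive_pending() {
    status_dir="$1"     # e.g. $PGDATA/pg_xlog/archive_status
    for ready in "$status_dir"/*.ready; do
        [ -e "$ready" ] || continue
        seg=$(basename "$ready" .ready)
        if archive_cmd "$seg"; then
            # Success: flip the status marker.  No locks are taken.
            mv "$ready" "$status_dir/$seg.done"
        fi
    done
}
```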

The "lurking feeling" scenario above might or might not be an issue
here, but I can't see how the archiver could be involved at all. I see
no evidence that the archiver is the source of the problem here; the
only reason we're checking it at all is Jeff's original conjecture that
there was a connection. There *was* a problem, yes, but I think we're
looking in the wrong place for the murder weapon.

pg_clog page extension does look like it can offer problems, generally.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> You'll have to explain a little more. I checked the archives...

I was thinking of
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
full explanation here:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php

> The "lurking feeling" scenario above might or might not be an issue
> here, but I can't see how the archiver could be involved at all.

Well, I don't see it either; at this point we're waiting on Jeff to
provide some harder evidence ...

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Wed, 2006-05-17 at 10:01 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > You'll have to explain a little more. I checked the archives...
>
> I was thinking of
> http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
> full explanation here:
> http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php

Sorry, I see my note looks like I couldn't find it. I meant "I've read
the archives". Thanks though.



Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Wed, 17 May 2006, Tom Lane wrote:

>> The "lurking feeling" scenario above might or might not be an issue
>> here, but I can't see how the archiver could be involved at all.
>
> Well, I don't see it either; at this point we're waiting on Jeff to
> provide some harder evidence ...

Was the 3,000 transactions per minute helpful?  What other evidence should I
be looking for?  Really the only evidence I have at this point is that when
the NAS gets rebooted and comes back up, the postmaster resumes normal
operations and the client count goes back down; however, that could just be
coincidental, but we've seen it happen 3 times now.  This might have nothing
to do with archiving; it just seems like the only likely candidate, since there
are only two things that happen on this volume: PITR and rsyncing of
pg_dumps (we dump locally first for speed reasons).  Maybe this is just a
linux kernel thing?  Or maybe something else is blocking because it tries to
stat all the filesystems and /mnt/pgbackup blocks?  I can't imagine postgres
doing that though.


Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> Was the 3,000 transactions per minute helpful?  What other evidence should I
> be looking for?

Did you try generating a test case using a long sleep() as a replacement
for the archive_command script?  If there is a PG bug here it shouldn't
be that hard to expose it in a simple test case.  I'm up to my armpits
in other stuff and don't have time to try it myself...

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Wed, 17 May 2006, Tom Lane wrote:

> Did you try generating a test case using a long sleep() as a replacement
> for the archive_command script?  If there is a PG bug here it shouldn't
> be that hard to expose it in a simple test case.  I'm up to my armpits
> in other stuff and don't have time to try it myself...

Interesting that you should say that as I had just started setting up a test.
I seem to get a lot of these:

May 17 21:34:04 discord postgres[20573]: [4-1] LOG:  archived transaction log
file "000000010000000000000001"
May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename file
"pg_xlog/archive_status/000000010000000000000001.ready" to
May 17 21:34:04 discord postgres[20573]: [5-2]
"pg_xlog/archive_status/000000010000000000000001.done": No such file or
directory

I'm guessing that the archiver moves the WAL file to <wal filename>.ready, and
on successful completion, it renames it to <wal filename>.done?  My
archive_command is the utmost in simplicity...looks like this:

----
#!/bin/sh

FULLPATH="$1"
FILENAME="$2"

# Simulate a slow/hung NAS by stalling before reading the WAL file.
sleep 300

# The script's exit status is cat's exit status.
cat "$FULLPATH" > /dev/null
----

So far I haven't been able to reproduce, but I'm just getting started.
Currently I'm using pgbench to generate WAL rollover, do you guys have any
other handy testing tools for this sort of job or is this the best tool?



Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Wed, 17 May 2006, Jeff Frost wrote:

> On Wed, 17 May 2006, Tom Lane wrote:
>
>> Did you try generating a test case using a long sleep() as a replacement
>> for the archive_command script?  If there is a PG bug here it shouldn't
>> be that hard to expose it in a simple test case.  I'm up to my armpits
>> in other stuff and don't have time to try it myself...
>
> Interesting that you should say that as I had just started setting up a test.
> I seem to get a lot of these:
>
> May 17 21:34:04 discord postgres[20573]: [4-1] LOG:  archived transaction log
> file "000000010000000000000001"
> May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename
> file "pg_xlog/archive_status/000000010000000000000001.ready" to
> May 17 21:34:04 discord postgres[20573]: [5-2]
> "pg_xlog/archive_status/000000010000000000000001.done": No such file or
> directory
>
> I'm guessing that the archiver moves the WAL file to <wal filename>.ready,
> and on successful completion, it renames it to <wal filename>.done?

Oh, except that they are 0 bytes, so they must be just status files.
Nevermind on that question. :-)



Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> I seem to get a lot of these:

> May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename file
> "pg_xlog/archive_status/000000010000000000000001.ready" to
> May 17 21:34:04 discord postgres[20573]: [5-2]
> "pg_xlog/archive_status/000000010000000000000001.done": No such file or
> directory

That seems odd ...

> Currently I'm using pgbench to generate WAL rollover, do you guys have any
> other handy testing tools for this sort of job or is this the best tool?

pgbench seems like an OK load for this, although it doesn't start/end
any new connections while running.  I *think* that that's not critical
--- my guess is that your observation of new connections hanging is just
because each new connection has to run one startup transaction, and
transactions in general are at risk of hanging --- but that could be
wrong.

Also, you could increase the rate of WAL generation by decreasing the
checkpoint segments/timeout parameters, if you need to.
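A test-box postgresql.conf fragment along those lines might look like this
(values deliberately aggressive, purely to force frequent WAL rollover -- not
production advice):

```ini
# Force frequent WAL segment switches and checkpoints on a test box
archive_command = '/usr/local/pgsql-8.1.3/bin/archive_test.sh "%p" "%f"'
checkpoint_segments = 1      # roll over after every segment
checkpoint_timeout = 30      # seconds between forced checkpoints
```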

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Thu, 18 May 2006, Tom Lane wrote:

> Jeff Frost <jeff@frostconsultingllc.com> writes:
>> I seem to get a lot of these:
>
>> May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename file
>> "pg_xlog/archive_status/000000010000000000000001.ready" to
>> May 17 21:34:04 discord postgres[20573]: [5-2]
>> "pg_xlog/archive_status/000000010000000000000001.done": No such file or
>> directory
>
> That seems odd ...

Further interesting items:

May 17 22:41:18 discord postgres[23817]: [3-1] LOG:  archive command
"/usr/local/pgsql-8.1.3/bin/archive_test.sh "pg_xlog/00000001000000000000000E"
"00000001000000000000000E""
May 17 22:41:18 discord postgres[23817]: [3-2]  failed: return code 256

And in the window where I started postgres via pg_ctl, I had this:

cat: pg_xlog/00000001000000000000000E: No such file or directory
cat: pg_xlog/00000001000000000000000E: No such file or directory

Seems bad.




Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Wed, 17 May 2006, Jeff Frost wrote:

> And in the window where I started postgres via pg_ctl, I had this:
>
> cat: pg_xlog/00000001000000000000000E: No such file or directory
> cat: pg_xlog/00000001000000000000000E: No such file or directory

Hrmmm...my pgbench died with an integer out of range error:

Client 56 aborted in state 8: ERROR:  integer out of range
client 81 receiving
client 81 sending UPDATE branches SET bbalance = bbalance + 4512 WHERE bid =
1;
client 86 receiving
Client 86 aborted in state 8: ERROR:  integer out of range
client 85 receiving
Client 85 aborted in state 8: ERROR:  integer out of range
client 81 receiving
Client 81 aborted in state 8: ERROR:  integer out of range
client 78 receiving
client 78 sending UPDATE branches SET bbalance = bbalance + 2868 WHERE bid =
1;
client 78 receiving
Client 78 aborted in state 8: ERROR:  integer out of range
Client 56 aborted in state 8: ERROR:  integer out of range
client 81 receiving
client 81 sending UPDATE branches SET bbalance = bbalance + 4512 WHERE bid =
1;
client 86 receiving
Client 86 aborted in state 8: ERROR:  integer out of range
client 85 receiving
Client 85 aborted in state 8: ERROR:  integer out of range
client 81 receiving
Client 81 aborted in state 8: ERROR:  integer out of range
client 78 receiving
client 78 sending UPDATE branches SET bbalance = bbalance + 2868 WHERE bid =
1;
client 78 receiving
Client 78 aborted in state 8: ERROR:  integer out of range

I'm guessing those bbalance + x updates ran bbalance up too high.

Anyway, I keep getting these on occasion:

May 17 23:01:22 discord postgres[23817]: [8-1] LOG:  archive command
"/usr/local/pgsql-8.1.3/bin/archive_test.sh "pg_xlog/000000010000000000000011"
"000000010000000000000011""
May 17 23:01:22 discord postgres[23817]: [8-2]  failed: return code 256
May 17 23:01:22 discord postgres[23817]: [9-1] WARNING:  transaction log file
"000000010000000000000011" could not be archived: too many failures
May 17 23:04:06 discord postgres[20573]: [40-1] LOG:  archived transaction log
file "000000010000000000000013"
May 17 23:04:06 discord postgres[20573]: [41-1] WARNING:  could not rename
file "pg_xlog/archive_status/000000010000000000000013.ready" to
May 17 23:04:06 discord postgres[20573]: [41-2]
"pg_xlog/archive_status/000000010000000000000013.done": No such file or
directory
May 17 23:04:27 discord postgres[20228]: [22-1] LOG:  archived transaction log
file "000000010000000000000014"

and in the other window:

cat: pg_xlog/000000010000000000000011: No such file or directory
cat: pg_xlog/000000010000000000000011: No such file or directory
cat: pg_xlog/000000010000000000000011: No such file or directory

How on earth can this happen?



Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Wed, 2006-05-17 at 22:45 -0700, Jeff Frost wrote:
> On Thu, 18 May 2006, Tom Lane wrote:
>
> > Jeff Frost <jeff@frostconsultingllc.com> writes:
> >> I seem to get a lot of these:
> >
> >> May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename file
> >> "pg_xlog/archive_status/000000010000000000000001.ready" to
> >> May 17 21:34:04 discord postgres[20573]: [5-2]
> >> "pg_xlog/archive_status/000000010000000000000001.done": No such file or
> >> directory
> >
> > That seems odd ...
>
> Further interesting items:
>
> May 17 22:41:18 discord postgres[23817]: [3-1] LOG:  archive command
> "/usr/local/pgsql-8.1.3/bin/archive_test.sh "pg_xlog/00000001000000000000000E"
> "00000001000000000000000E""
> May 17 22:41:18 discord postgres[23817]: [3-2]  failed: return code 256
> And in the window where I started postgres via pg_ctl, I had this:
>
> cat: pg_xlog/00000001000000000000000E: No such file or directory
> cat: pg_xlog/00000001000000000000000E: No such file or directory
>
> Seems bad.

Seems so.

Can you post the full test, plus full execution log.

[You don't need to "cat"; you could just do "ls" instead, FWIW]

Are you doing *anything* with the pg_xlog directory or below? I
understand you're saying No to that question, and pg_xlog has not been
moved; it's just underneath the data directory, which is on a normal
disk?



Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> Hrmmm...my pgbench died with an integer out of range error:

That's normal, if you run it long enough without re-creating the tables.
It keeps adding small values to the balances, and eventually they
overflow.  (Possibly someone should fix it so that the deltas are
uniformly distributed around zero, instead of being always positive.)

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Thu, 18 May 2006, Simon Riggs wrote:

> Seems so.
>
> Can you post the full test, plus full execution log.
>
> [You don't need to "cat"; you could just do "ls" instead, FWIW]
>
> Are you doing *anything* with the pg_xlog directory or below? I understand
> you're saying No to that question, and pg_xlog has not been moved; it's just
> underneath the data directory, which is on a normal disk?

I did a fresh compile of 8.1.3 with --prefix=/usr/local/pgsql-8.1.3.  On this
particular system, this is actually located on / which is /dev/md1 (a mirror).
I'll try and start a brand new version of the test so I can capture the
logging to syslog for you guys as well as the stdout on the pg_ctl console.

More later.

Unfortunately, I'm not really sure this is related to the problems we saw
before.  (i.e. the system doesn't get unusually slow or anything, nor do the
connections seems to block)


Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Thu, 2006-05-18 at 10:08 -0700, Jeff Frost wrote:

> May 18 08:00:18 discord postgres[20228]: [129-1] LOG:  archived transaction log file "00000001000000000000007F"
> May 18 08:00:41 discord postgres[20573]: [254-1] LOG:  archived transaction log file "00000001000000000000007F"
> May 18 08:00:41 discord postgres[20573]: [255-1] WARNING:  could not rename file
> "pg_xlog/archive_status/00000001000000000000007F.ready" to
> May 18 08:00:41 discord postgres[20573]: [255-2]  "pg_xlog/archive_status/00000001000000000000007F.done": No such
> file or directory

I'm not clear how you can have two processes both issuing this message,
since it is only the archiver process that ever says this.

Do you have two archivers running on this system?
pid = 20228 and pid = 20573

Presumably pg_xlog/archive_status/00000001000000000000007F.done already
exists?

Are there two postmasters running (at all)?
Is there somehow an archiver process running from a previously shutdown
postmaster (somehow)?



Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Fri, 19 May 2006, Tom Lane wrote:

> Jeff Frost <jeff@frostconsultingllc.com> writes:
>> Do you think the postmaster on 5432 is trying to archive the other
>> postmaster's WAL files somehow?
>
> Not as long as they aren't in the same data directory ;-).  What Simon
> was wondering about was whether an archiver process had somehow been
> left over from a previous incarnation of the test postmaster.  The thing
> to do is look through "ps auxww" (or local equivalent) and see if you
> see more than one thing calling itself an archiver process.
>
> (Whether or not this explains Jeff's problem, it definitely seems like
> a failure mode that we need to guard against.  We go to great lengths
> to prevent a new postmaster from starting when there are still live
> backends from a previous postmaster, but I don't think that interlock
> is effective for the archiver.)

Well now, will you look at this:

postgres 20228     1  0 May17 ?        00:00:00 postgres: archiver process
postgres 20573     1  0 May17 ?        00:00:00 postgres: archiver process
postgres 23817 23810  0 May17 pts/11   00:00:00 postgres: archiver process

23810 is the running postmaster:

postgres 23810     1  0 May17 pts/11   00:03:01 /usr/local/pgsql-8.1.3/bin/postm

do you think that got left around the last time I did a pg_ctl restart?

I guess I can stop my re-run of the test after stopping the /var/lib/pgsql
postmaster.
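Incidentally, orphans like the first two above (PPID 1, meaning the
postmaster that spawned them is gone) can be picked out of ps output with a
small filter.  This is just a sketch, and the ps flags are an assumption that
will vary by platform:

```shell
# Given "PID PPID COMMAND" lines, print archiver processes that have been
# reparented to init (PPID 1), i.e. leftovers from a dead postmaster.
# Typical usage (flags assumed): ps -eo pid,ppid,args | find_orphaned_archivers
find_orphaned_archivers() {
    awk '/archiver process/ && $2 == 1 { print $1 }'
}
```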


Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Fri, 19 May 2006, Tom Lane wrote:

> Well, there's our smoking gun.  IIRC, all the failures you showed us are
> consistent with race conditions caused by multiple archiver processes
> all trying to do the same tasks concurrently.
>
> Do you frequently stop and restart the postmaster?  Because I don't see
> how you could get into this state without having done so.
>
> I've just been looking at the code, and the archiver does commit
> hara-kiri when it notices its parent postmaster is dead; but it only
> checks that in the outer loop.  Given sufficiently long delays in the
> archive_command, that could be a long time after the postmaster died;
> and in the meantime, successive executions of the archive_command could
> be conflicting with those launched by a later archiver incarnation.

Hurray!  Unfortunately, the postmaster on the original troubled server almost
never gets restarted, and in fact has only one archiver process running
right now.  Drat!

I guess I'll have to try and catch it in the act again the next time the NAS
gets wedged so I can debug a little more (it was caught by one of the windows
folks last time) and gather some useful data.

Let me know if you want me to test a patch since I've already got this test
case setup.


Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> Hurray!  Unfortunately, the postmaster on the original troubled server almost
> never gets restarted, and in fact has only one archiver process running
> right now.  Drat!

Well, the fact that there's only one archiver *now* doesn't mean there
wasn't more than one when the problem happened.  The orphaned archiver
would eventually quit.

Do you have logs that would let you check when the production postmaster
was restarted?

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
I wrote:
> Well, the fact that there's only one archiver *now* doesn't mean there
> wasn't more than one when the problem happened.  The orphaned archiver
> would eventually quit.

But, actually, never mind: we have explained the failures you were seeing
in the test setup, but a multiple-active-archiver situation still
doesn't explain the original situation of incoming connections getting
blocked.

What I'd suggest is resuming the test after making sure you've killed
off any old archivers, and seeing if you can make any progress on
reproducing the original problem.  We definitely need a
multiple-archiver interlock, but I think that must be unrelated to your
real problem.

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Fri, 2006-05-19 at 12:20 -0400, Tom Lane wrote:
> I wrote:
> > Well, the fact that there's only one archiver *now* doesn't mean there
> > wasn't more than one when the problem happened.  The orphaned archiver
> > would eventually quit.
>
> But, actually, nevermind: we have explained the failures you were seeing
> in the test setup, but a multiple-active-archiver situation still
> doesn't explain the original situation of incoming connections getting
> blocked.

Agreed.



Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Fri, 19 May 2006, Tom Lane wrote:

> Well, the fact that there's only one archiver *now* doesn't mean there
> wasn't more than one when the problem happened.  The orphaned archiver
> would eventually quit.
>
> Do you have logs that would let you check when the production postmaster
> was restarted?

I looked through /var/log/messages* and there wasn't a restart prior to the
problem in the logs.  They go back to April 16.  The postmaster was restarted
on May 15th (this Monday), but that was after the reported problem.



Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Fri, 19 May 2006, Tom Lane wrote:

> What I'd suggest is resuming the test after making sure you've killed
> off any old archivers, and seeing if you can make any progress on
> reproducing the original problem.  We definitely need a
> multiple-archiver interlock, but I think that must be unrelated to your
> real problem.

Ok, so I've got the old archivers gone (and btw, after a restart I ended up
with 3 of them - so I stopped postmaster, and killed them all individually and
started postmaster again).  Now I can run my same pgbench, or do you guys
have any other suggestions on attempting to reproduce the problem?


Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Fri, 2006-05-19 at 09:36 -0700, Jeff Frost wrote:
> On Fri, 19 May 2006, Tom Lane wrote:
>
> > What I'd suggest is resuming the test after making sure you've killed
> > off any old archivers, and seeing if you can make any progress on
> > reproducing the original problem.  We definitely need a
> > multiple-archiver interlock, but I think that must be unrelated to your
> > real problem.
>
> Ok, so I've got the old archivers gone (and btw, after a restart I ended up
> with 3 of them - so I stopped postmaster, and killed them all individually and
> started postmaster again).

That's good.

> Now I can run my same pgbench, or do you guys
> have any other suggestions on attempting to reproduce the problem?

No. We're back on track to try to reproduce the original error.



Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Fri, 19 May 2006, Simon Riggs wrote:

>> Now I can run my same pgbench, or do you guys
>> have any other suggestions on attempting to reproduce the problem?
>
> No. We're back on track to try to reproduce the original error.

I've been futzing with trying to reproduce the original problem for a few days
and so far postgres seems to be just fine with a long delay on archiving, so
now I'm rather at a loss.  In fact, I currently have 1,234 xlog files in
pg_xlog, but the archiver is happily archiving one every 5 minutes.  Perhaps
I'll try a long delay followed by a failure to see if that could be it.
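
For the record, the delay-then-fail test could be driven by an archive_command along these lines (a hypothetical sketch, not my production script; the ARCHIVE_DELAY/ARCHIVE_DEST variables and the /tmp/archive_fail sentinel are inventions for testing):

```shell
#!/bin/sh
# Hypothetical test archive_command: wait a while, then fail if a sentinel
# file exists.  Would be wired up in postgresql.conf as:
#   archive_command = '/path/to/slow_archive.sh "%p" "%f"'

archive_one() {
    fullpath="$1"
    filename="$2"
    dest="${ARCHIVE_DEST:-/tmp/pitr-test}"

    sleep "${ARCHIVE_DELAY:-300}"       # simulate a slow NAS

    if [ -e /tmp/archive_fail ]; then
        echo "simulated archive failure for $filename" >&2
        return 1                        # non-zero => postgres retries later
    fi

    mkdir -p "$dest" || return 1
    cp "$fullpath" "$dest/$filename"    # propagate cp's exit status
}

# archive_one "$1" "$2"   # uncomment when installing as a standalone script
```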

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Sun, 2006-05-21 at 14:16 -0700, Jeff Frost wrote:
> On Fri, 19 May 2006, Simon Riggs wrote:
>
> >> Now I can run my same pg_bench, or do you guys
> >> have any other suggestions on attempting to reproduce the problem?
> >
> > No. We're back on track to try to reproduce the original error.
>
> I've been futzing with trying to reproduce the original problem for a few days
> and so far postgres seems to be just fine with a long delay on archiving, so
> now I'm rather at a loss.  In fact, I currently have 1,234 xlog files in
> pg_xlog, but the archiver is happily archiving one every 5 minutes.  Perhaps
> I'll try a long delay followed by a failure to see if that could be it.

So the chances of the original problem being archiver related are
receding...

--
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com


Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Sun, 21 May 2006, Simon Riggs wrote:

>> I've been futzing with trying to reproduce the original problem for a few days
>> and so far postgres seems to be just fine with a long delay on archiving, so
>> now I'm rather at a loss.  In fact, I currently have 1,234 xlog files in
>> pg_xlog, but the archiver is happily archiving one every 5 minutes.  Perhaps
>> I'll try a long delay followed by a failure to see if that could be it.
>
> So the chances of the original problem being archiver related are
> receding...

This is possible, but I guess I should try to reproduce the actual problem
with the same archive_command script and a CIFS mount just to see what
happens.  Perhaps the real root of the problem is elsewhere; it just seems
strange, since the archive_command is the only postgres-related process that
accesses the CIFS share.  More later.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Sun, 21 May 2006, Jeff Frost wrote:

>> So the chances of the original problem being archiver related are
>> receding...
>
> This is possible, but I guess I should try to reproduce the actual problem
> with the same archive_command script and a CIFS mount just to see what
> happens.  Perhaps the real root of the problem is elsewhere; it just seems
> strange, since the archive_command is the only postgres-related process that
> accesses the CIFS share.  More later.

I tried both pulling the plug on the CIFS server and unsharing the CIFS share,
but pgbench continued completely unconcerned.  I guess the failure mode of the
NAS device in the customer colo must be something different that I don't yet
know how to simulate.  I suspect I'll have to wait till it happens again and
try to gather some more data before restarting the NAS device.  Thanks for all
your suggestions guys!

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> I tried both pulling the plug on the CIFS server and unsharing the CIFS share,
> but pgbench continued completely unconcerned.  I guess the failure mode of the
> NAS device in the customer colo must be something different that I don't yet
> know how to simulate.  I suspect I'll have to wait till it happens again and
> try to gather some more data before restarting the NAS device.  Thanks for all
> your suggestions guys!

I'm still thinking that the simplest explanation is that $PGDATA/pg_clog/
is on the NAS device.  Please double-check the file locations.
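
One quick way to double-check is to print the filesystem backing each directory that the backends touch on every commit (a sketch using plain df/awk; the data-directory default below is an example):

```shell
# Print the device and mountpoint behind each per-commit directory; a CIFS
# share showing up here would explain blocked backends.
check_fs() {
    pgdata="${PGDATA:-/usr/local/pgsql/data}"
    for d in "$pgdata"/pg_clog "$pgdata"/pg_subtrans "$pgdata"/pg_xlog; do
        if [ -d "$d" ]; then
            # df -P guarantees one portable line per path: device ... mountpoint
            df -P "$d" | awk -v d="$d" 'NR==2 {print d, "->", $1, "on", $NF}'
        else
            echo "$d: not found"
        fi
    done
}
check_fs
```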

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Tue, 23 May 2006, Tom Lane wrote:

> I'm still thinking that the simplest explanation is that $PGDATA/pg_clog/
> is on the NAS device.  Please double-check the file locations.

I know that seems like an excellent candidate, but it really isn't, I swear.
In fact, you almost had me convinced the last time you asked me to check.. I
thought some helpful admin had moved something, but no:

postgres  9194  0.0  0.4 486568 16464 ?      S    May16   0:11
/usr/local/pgsql/bin/postmaster -p 5432 -D /usr/local/pgsql/data

db3:~/data $ pwd
/usr/local/pgsql/data

db3:~/data $ ls -l
total 64
-rw-------  1 postgres postgres     4 Feb 13 20:13 PG_VERSION
drwx------  6 postgres postgres  4096 Feb 13 21:00 base
drwx------  2 postgres postgres  4096 May 22 21:03 global
drwx------  2 postgres postgres  4096 May 22 17:45 pg_clog
-rw-------  1 postgres postgres  3575 Feb 13 20:13 pg_hba.conf
-rw-------  1 postgres postgres  1460 Feb 13 20:13 pg_ident.conf
drwx------  4 postgres postgres  4096 Feb 13 20:13 pg_multixact
drwx------  2 postgres postgres  4096 May 22 20:45 pg_subtrans
drwx------  2 postgres postgres  4096 Feb 13 20:13 pg_tblspc
drwx------  2 postgres postgres  4096 Feb 13 20:13 pg_twophase
lrwxrwxrwx  1 postgres postgres     9 Feb 13 22:10 pg_xlog -> /pg_xlog/
-rw-------  1 postgres postgres 13688 May 16 17:50 postgresql.conf
-rw-------  1 postgres postgres    63 May 16 17:54 postmaster.opts
-rw-------  1 postgres postgres    47 May 16 17:54 postmaster.pid

db3:~/data $ mount
/dev/sda2 on / type ext3 (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sda1 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/sdb1 on /usr/local/pgsql type ext3 (rw,data=writeback)
/dev/sda5 on /var type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
//10.1.1.28/pgbackup on /mnt/pgbackup type cifs (rw,mand,noexec,nosuid,nodev)

So, no.. I wish it were that easy. :-/

If you have any other suggestions or inspirations, I'm all ears!

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> Well now, will you look at this:

> postgres 20228     1  0 May17 ?        00:00:00 postgres: archiver process
> postgres 20573     1  0 May17 ?        00:00:00 postgres: archiver process
> postgres 23817 23810  0 May17 pts/11   00:00:00 postgres: archiver process

> 23810 is the running postmaster:

> postgres 23810     1  0 May17 pts/11   00:03:01 /usr/local/pgsql-8.1.3/bin/postm

Well, there's our smoking gun.  IIRC, all the failures you showed us are
consistent with race conditions caused by multiple archiver processes
all trying to do the same tasks concurrently.

Do you frequently stop and restart the postmaster?  Because I don't see
how you could get into this state without having done so.

I've just been looking at the code, and the archiver does commit
hara-kiri when it notices its parent postmaster is dead; but it only
checks that in the outer loop.  Given sufficiently long delays in the
archive_command, that could be a long time after the postmaster died;
and in the meantime, successive executions of the archive_command could
be conflicting with those launched by a later archiver incarnation.

Seems we need an interlock to ensure there's not more than one archiver
active.
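
Until such an interlock exists in the backend, a crude version can live in the archive script itself: mkdir is atomic, so it serves as a mutex across script invocations (a sketch only; the lock path is made up, and a crashed run would need its stale lock removed by hand):

```shell
# Hypothetical mutual exclusion for an archive script: only one instance can
# create the lock directory, so copies launched by two archiver incarnations
# can't work on the same WAL segment concurrently.
LOCKDIR="${LOCKDIR:-/tmp/wal-archive.lock}"

take_lock() {
    if mkdir "$LOCKDIR" 2>/dev/null; then
        trap 'rmdir "$LOCKDIR" 2>/dev/null' EXIT    # release on any exit
        return 0
    fi
    return 1
}

if ! take_lock; then
    echo "another archive run holds $LOCKDIR; giving up" >&2
    exit 1
fi
# ... the usual cp/gzip of "$1" into the archive would go here ...
```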

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Fri, 2006-05-19 at 08:53 -0700, Jeff Frost wrote:
> On Fri, 19 May 2006, Tom Lane wrote:
>
> > Jeff Frost <jeff@frostconsultingllc.com> writes:
> >> Do you think the postmaster on 5432 is trying to archive the other
> >> postmaster's WAL files somehow?
> >
> > Not as long as they aren't in the same data directory ;-).  What Simon
> > was wondering about was whether an archiver process had somehow been
> > left over from a previous incarnation of the test postmaster.  The thing
> > to do is look through "ps auxww" (or local equivalent) and see if you
> > see more than one thing calling itself an archiver process.
> >
> > (Whether or not this explains Jeff's problem, it definitely seems like
> > a failure mode that we need to guard against.  We go to great lengths
> > to prevent a new postmaster from starting when there are still live
> > backends from a previous postmaster, but I don't think that interlock
> > is effective for the archiver.)
>
> Well now, will you look at this:
>
> postgres 20228     1  0 May17 ?        00:00:00 postgres: archiver process
> postgres 20573     1  0 May17 ?        00:00:00 postgres: archiver process
> postgres 23817 23810  0 May17 pts/11   00:00:00 postgres: archiver process
>
> 23810 is the running postmaster:
>
> postgres 23810     1  0 May17 pts/11   00:03:01 /usr/local/pgsql-8.1.3/bin/postm
>
> do you think that got left around the last time I did a pg_ctl restart?

OK, I'm on it.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Re: does wal archiving block the current client connection?

From
Simon Riggs
Date:
On Fri, 2006-05-19 at 08:23 -0700, Jeff Frost wrote:
> On Fri, 19 May 2006, Simon Riggs wrote:
>
> > On Thu, 2006-05-18 at 10:08 -0700, Jeff Frost wrote:
> >
> >> May 18 08:00:18 discord postgres[20228]: [129-1] LOG:  archived transaction log file "00000001000000000000007F"
> >> May 18 08:00:41 discord postgres[20573]: [254-1] LOG:  archived transaction log file "00000001000000000000007F"
> >> May 18 08:00:41 discord postgres[20573]: [255-1] WARNING:  could not rename file "pg_xlog/archive_status/00000001000000000000007F.ready" to
> >> May 18 08:00:41 discord postgres[20573]: [255-2]  "pg_xlog/archive_status/00000001000000000000007F.done": No such file or directory
> >
> > I'm not clear how you can have two processes both issuing this message,
> > since it is only the archiver process that ever says this.
> >
> > Do you have two archivers running on this system?
> pid = 20228 and pid = 20573
> >
> > Presumably pg_xlog/archive_status/00000001000000000000007F.done already
> > exists?
> >
> > Are there two postmasters running (at all)?
> > Is there somehow an archiver process running from a previously shutdown
> > postmaster (somehow)?
>
> Ah!!  Good catch Simon!  There are in fact two postmasters running on this
> system.  The normal postmaster running in /var/lib/pgsql on port 5432 and this
> test postmaster running on port 55432.  The 8.0.7 postmaster on port 5432
> shows the following for its archive_command though:
>
> template1=# show archive_command;
>   archive_command
> -----------------
>   unset
> (1 row)
>
> and the port 55432 postmaster shows:
>
> template1=# show archive_command;
>                     archive_command
> ------------------------------------------------------
>   /usr/local/pgsql-8.1.3/bin/archive_test.sh "%p" "%f"
> (1 row)
>
> Do you think the postmaster on 5432 is trying to archive the other
> postmaster's WAL files somehow?

Some weird interaction is occurring. The log showed two attempts to
archive the file, the second one each time failing, as you'd expect. But
two archivers should not be accessing the same data directory.

What were those two pids? Which postmasters are they children of?

Give me some more info and I can probably track down a bug...
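
(The parentage question can be answered in one go with something like this — a sketch using Linux procps syntax; a stale archiver reparented to init would show parent 1:)

```shell
# List each archiver together with its parent pid, so leftovers from a dead
# postmaster (parent 1, i.e. reparented to init) stand out immediately.
ps -eo pid,ppid,args | awk '/[a]rchiver process/ {print $1, "parent:", $2}'
```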

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Re: does wal archiving block the current client connection?

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> Do you think the postmaster on 5432 is trying to archive the other
> postmaster's WAL files somehow?

Not as long as they aren't in the same data directory ;-).  What Simon
was wondering about was whether an archiver process had somehow been
left over from a previous incarnation of the test postmaster.  The thing
to do is look through "ps auxww" (or local equivalent) and see if you
see more than one thing calling itself an archiver process.
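
That check is easy to script (a sketch; the `[a]rchiver` bracket trick just keeps grep from matching its own command line, and "archiver process" is the title the 8.x archiver reports on Linux):

```shell
# Count processes whose title says "archiver process"; anything above one
# suggests a stale archiver survived a postmaster restart.
count_archivers() {
    ps auxww | grep -c '[a]rchiver process'
}

n=$(count_archivers) || true    # grep -c exits 1 when the count is 0
n=${n:-0}
if [ "$n" -gt 1 ]; then
    echo "WARNING: $n archiver processes are running" >&2
fi
echo "archivers: $n"
```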

(Whether or not this explains Jeff's problem, it definitely seems like
a failure mode that we need to guard against.  We go to great lengths
to prevent a new postmaster from starting when there are still live
backends from a previous postmaster, but I don't think that interlock
is effective for the archiver.)

            regards, tom lane

Re: does wal archiving block the current client connection?

From
Jeff Frost
Date:
On Fri, 19 May 2006, Simon Riggs wrote:

> On Thu, 2006-05-18 at 10:08 -0700, Jeff Frost wrote:
>
>> May 18 08:00:18 discord postgres[20228]: [129-1] LOG:  archived transaction log file "00000001000000000000007F"
>> May 18 08:00:41 discord postgres[20573]: [254-1] LOG:  archived transaction log file "00000001000000000000007F"
>> May 18 08:00:41 discord postgres[20573]: [255-1] WARNING:  could not rename file "pg_xlog/archive_status/00000001000000000000007F.ready" to
>> May 18 08:00:41 discord postgres[20573]: [255-2]  "pg_xlog/archive_status/00000001000000000000007F.done": No such file or directory
>
> I'm not clear how you can have two processes both issuing this message,
> since it is only the archiver process that ever says this.
>
> Do you have two archivers running on this system?
> pid = 20228 and pid = 20573
>
> Presumably pg_xlog/archive_status/00000001000000000000007F.done already
> exists?
>
> Are there two postmasters running (at all)?
> Is there somehow an archiver process running from a previously shutdown
> postmaster (somehow)?

Ah!!  Good catch Simon!  There are in fact two postmasters running on this
system.  The normal postmaster running in /var/lib/pgsql on port 5432 and this
test postmaster running on port 55432.  The 8.0.7 postmaster on port 5432
shows the following for its archive_command though:

template1=# show archive_command;
  archive_command
-----------------
  unset
(1 row)

and the port 55432 postmaster shows:

template1=# show archive_command;
                    archive_command
------------------------------------------------------
  /usr/local/pgsql-8.1.3/bin/archive_test.sh "%p" "%f"
(1 row)

Do you think the postmaster on 5432 is trying to archive the other
postmaster's WAL files somehow?

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954