Thread: testing cvs HEAD - HS/SR - cannot stat

testing cvs HEAD - HS/SR - cannot stat

From
"Erik Rijkers"
Date:
Hello,

Testing 9.0devel HS/SR: I used cvs HEAD (today) with
the new_smart_shutdown_20100201.patch of Fujii Masao.

Replication works well.  The slave can be stopped and restarted.

However, whenever (re-)starting the slave the I get
messages like:

cp: cannot stat `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002': No such
file or directory

At this point,  /var/data1/pg_stuff/dump/replication_archive:

-rw------- 1 xxxxxxxx xxxxxxxx      240 Feb  3 22:35 000000010000000000000001.00000020.backup
-rw------- 1 xxxxxxxx xxxxxxxx 16777216 Feb  3 22:35 000000010000000000000001
-rw------- 1 xxxxxxxx xxxxxxxx 16777216 Feb  3 22:35 000000010000000000000000


Maybe the message is not really a bug: everything works fine otherwise.


thanks,


Erik Rijkers


# primary ./data/postgresql.conf
data_directory = '/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data'
max_connections = 100
shared_buffers = 32MB
checkpoint_segments = 50
archive_mode = 'on'
archive_command= 'cp %p /var/data1/pg_stuff/dump/replication_archive/%f'
max_wal_senders = 5


# standby ./data/postgresql.conf
data_directory = '/var/data1/pg_stuff/pg_installations/pgsql.sr_slavery/data'
port = 6566
max_connections = 100
shared_buffers = 32MB
checkpoint_segments = 50
archive_mode = 'on'
archive_command= 'cp %p /var/data1/pg_stuff/dump/replication_archive/%f'
max_wal_senders = 5

# standby ./data/recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=/tmp port=6565 user=aardvark'
restore_command = 'cp -i /var/data1/pg_stuff/dump/replication_archive/%f %p </dev/null'


Centos, Linux 2.6.18-164.el5  x86_64


./configure \--prefix=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary \--with-pgport=6565 --quiet
--enable-cassert--enable-debug \--enable-depend --with-openssl --with-perl --with-libxml \--with-libxslt
 





Re: testing cvs HEAD - HS/SR - cannot stat

From
Fujii Masao
Date:
On Thu, Feb 4, 2010 at 6:39 AM, Erik Rijkers <er@xs4all.nl> wrote:
> However, whenever (re-)starting the slave the I get
> messages like:
>
> cp: cannot stat `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002': No such
> file or directory
>
> At this point,  /var/data1/pg_stuff/dump/replication_archive:
>
> -rw------- 1 xxxxxxxx xxxxxxxx      240 Feb  3 22:35 000000010000000000000001.00000020.backup
> -rw------- 1 xxxxxxxx xxxxxxxx 16777216 Feb  3 22:35 000000010000000000000001
> -rw------- 1 xxxxxxxx xxxxxxxx 16777216 Feb  3 22:35 000000010000000000000000
>
>
> Maybe the message is not really a bug: everything works fine otherwise.

Yeah, this is not a bug.

At first, the standby performs an archive recovery until an invalid
WAL record is found. Then it starts replication and tries to receive
the missing WAL records from the primary. So such an error message
would be logged whenever an invalid record is found and replication
is started.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: testing cvs HEAD - HS/SR - cannot stat

From
Josh Berkus
Date:
> Yeah, this is not a bug.
> 
> At first, the standby performs an archive recovery until an invalid
> WAL record is found. Then it starts replication and tries to receive
> the missing WAL records from the primary. So such an error message
> would be logged whenever an invalid record is found and replication
> is started.

Can we improve the error message?  Right now it's alarming people.  Such as:

cannot stat
`/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002':
End of Log

Of course, it would be even better to supress this error message
entirely unless the user is at DEBUG1 or better.

--Josh Berkus



Re: testing cvs HEAD - HS/SR - cannot stat

From
Heikki Linnakangas
Date:
Josh Berkus wrote:
> Can we improve the error message?  Right now it's alarming people.  Such as:
> 
> cannot stat
> `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002':
> End of Log

Not really, it's coming from 'cp'. Not sure if we could capture the
stderr and somehow decorate the message to make it less alarming.

This isn't really a new problem, PITR has always done that, but I agree
it's always been a bit clumsy to have those messages in the log.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: testing cvs HEAD - HS/SR - cannot stat

From
Magnus Hagander
Date:
2010/2/4 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Josh Berkus wrote:
>> Can we improve the error message?  Right now it's alarming people.  Such as:
>>
>> cannot stat
>> `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002':
>> End of Log
>
> Not really, it's coming from 'cp'. Not sure if we could capture the
> stderr and somehow decorate the message to make it less alarming.
>
> This isn't really a new problem, PITR has always done that, but I agree
> it's always been a bit clumsy to have those messages in the log.

Can we follow it up with a well-phrased message that we switched to
streaming mode right after it, so people realize it's not a problem?

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: testing cvs HEAD - HS/SR - cannot stat

From
"Erik Rijkers"
Date:
On Thu, February 4, 2010 19:29, Heikki Linnakangas wrote:
> Josh Berkus wrote:
>> Can we improve the error message?  Right now it's alarming people.  Such as:
>>
>> cannot stat
>> `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002':
>> End of Log
>
> Not really, it's coming from 'cp'. Not sure if we could capture the
> stderr and somehow decorate the message to make it less alarming.
>
> This isn't really a new problem, PITR has always done that, but I agree
> it's always been a bit clumsy to have those messages in the log.


It was (eventually) clear to me that it was not really a problem,
but only because I was testing anyway :)

I think most people would do what I did: search the documentation
for the string 'No such file or directory' or 'cannot stat'.
(But that produces no mention of this (non-)problem.)

ISTM mentioning in the documentation is good enough.



Erik Rijkers















Re: testing cvs HEAD - HS/SR - cannot stat

From
Heikki Linnakangas
Date:
Magnus Hagander wrote:
> 2010/2/4 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
>> Josh Berkus wrote:
>>> Can we improve the error message?  Right now it's alarming people.  Such as:
>>>
>>> cannot stat
>>> `/var/data1/pg_stuff/dump/replication_archive/000000010000000000000002':
>>> End of Log
>> Not really, it's coming from 'cp'. Not sure if we could capture the
>> stderr and somehow decorate the message to make it less alarming.
>>
>> This isn't really a new problem, PITR has always done that, but I agree
>> it's always been a bit clumsy to have those messages in the log.
> 
> Can we follow it up with a well-phrased message that we switched to
> streaming mode right after it, so people realize it's not a problem?

Yeah, that would be a good idea. We should give the messages we emit
some holistic thinking. There's quite a few new messages printed thanks
to both Hot Standby and Streaming replication, and it seems we still
need a few more. Which ones are really required, and which ones are just
noise?

Here's what I see in a test standby server:

> LOG:  database system was interrupted while in recovery at log time 2010-02-04 20:45:40 EET
> HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier
recoverytarget.
 

Can we avoid printing this? It's a bit alarming to talk about corrupted
data.

> LOG:  starting archive recovery
> LOG:  restore_command = 'cp /home/hlinnaka/pgsql.cvshead/walarchive/%f %p'
> LOG:  standby_mode = 'true'
> LOG:  primary_conninfo = 'host=localhost port=5432 user=rep_user password=reppass'
> LOG:  trigger_file = '/tmp/standby-trigger'

Do we really need to echo all the lines in recovery.conf? That might be
interesting information, but perhaps it could be condensed and worded
more nicely.

> cp: cannot stat `/home/hlinnaka/pgsql.cvshead/walarchive/000000010000000000000007': No such file or directory

This is the noise line Josh started this thread with.

> LOG:  automatic recovery in progress

Ok, so what?

> LOG:  initializing recovery connections

Seems like unnecessary noise, doesn't mean anything to a DBA.

> LOG:  redo starts at 0/700140C

I guess this could be useful debug information sometimes.

> LOG:  consistent recovery state reached at 0/700142C

It's nice to know that it has reached consistency, but was there any way
to know before this line that it hadn't been reached yet? Perhaps the
"redo starts" line should mention that consistency hasn't been reached yet.

> LOG:  database system is ready to accept read only connections

This is an important piece of information with Hot Standby.

Now, should we print a line when we connect to the master successfully?
Seems like useful information.

Then there's the LOG lines whenever a file is restored successfully from
archive; are the necessary anymore, now that you can connect to the
standby and use pg_last_xlog_replay_location() to poll its status?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: testing cvs HEAD - HS/SR - cannot stat

From
Fujii Masao
Date:
On Fri, Feb 5, 2010 at 3:58 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> LOG:  database system was interrupted while in recovery at log time 2010-02-04 20:45:40 EET
>> HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier
recoverytarget. 
>
> Can we avoid printing this? It's a bit alarming to talk about corrupted
> data.

I think this is useful information as a hint of data corruption. But it's
odd for this to be reported even when restarting the standby server that
has been previously shut down normally (i.e., in fact shutdown mode).

How about adding new system status like DB_SHUTDOWNED_IN_RECOVERY, setting
the status to it when the shutdown is performed during recovery, and
reporting the suitable message when starting up the server from it?

>> LOG:  starting archive recovery

This is reported even if restore_command is not given and the WAL files are
never restored from the archive. We should get rid of this in that case?

>> LOG:  restore_command = 'cp /home/hlinnaka/pgsql.cvshead/walarchive/%f %p'
>> LOG:  standby_mode = 'true'
>> LOG:  primary_conninfo = 'host=localhost port=5432 user=rep_user password=reppass'
>> LOG:  trigger_file = '/tmp/standby-trigger'
>
> Do we really need to echo all the lines in recovery.conf? That might be
> interesting information, but perhaps it could be condensed and worded
> more nicely.

It's OK for me to move them from LOG to DEBUG.

>> cp: cannot stat `/home/hlinnaka/pgsql.cvshead/walarchive/000000010000000000000007': No such file or directory
>
> This is the noise line Josh started this thread with.

Agreed. But the messages other than ENOENT that restore_command emits
might be useful.

>> LOG:  automatic recovery in progress
>
> Ok, so what?

Seems unnecessary.

>> LOG:  initializing recovery connections
>
> Seems like unnecessary noise, doesn't mean anything to a DBA.

Agreed.

>> LOG:  redo starts at 0/700140C
>
> I guess this could be useful debug information sometimes.

Agreed.

>> LOG:  consistent recovery state reached at 0/700142C
>
> It's nice to know that it has reached consistency, but was there any way
> to know before this line that it hadn't been reached yet? Perhaps the
> "redo starts" line should mention that consistency hasn't been reached yet.

But redo might restart from the consistent database if the standby
server was shut down after it reached the consistent status.

>> LOG:  database system is ready to accept read only connections
>
> This is an important piece of information with Hot Standby.

Agreed.

> Now, should we print a line when we connect to the master successfully?
> Seems like useful information.

Agreed.

> Then there's the LOG lines whenever a file is restored successfully from
> archive; are the necessary anymore, now that you can connect to the
> standby and use pg_last_xlog_replay_location() to poll its status?

How about moving those messages from LOG to DEBUG?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: testing cvs HEAD - HS/SR - cannot stat

From
Heikki Linnakangas
Date:
Ok, I committed a patch to reduce the chatter a bit:

Fujii Masao wrote:
> On Fri, Feb 5, 2010 at 3:58 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>>> LOG:  starting archive recovery
> 
> This is reported even if restore_command is not given and the WAL files are
> never restored from the archive. We should get rid of this in that case?

Yeah, removed.

>>> LOG:  restore_command = 'cp /home/hlinnaka/pgsql.cvshead/walarchive/%f %p'
>>> LOG:  standby_mode = 'true'
>>> LOG:  primary_conninfo = 'host=localhost port=5432 user=rep_user password=reppass'
>>> LOG:  trigger_file = '/tmp/standby-trigger'
>> Do we really need to echo all the lines in recovery.conf? That might be
>> interesting information, but perhaps it could be condensed and worded
>> more nicely.
> 
> It's OK for me to move them from LOG to DEBUG.

Done.

>>> cp: cannot stat `/home/hlinnaka/pgsql.cvshead/walarchive/000000010000000000000007': No such file or directory
>> This is the noise line Josh started this thread with.
> 
> Agreed. But the messages other than ENOENT that restore_command emits
> might be useful.
> 
>>> LOG:  automatic recovery in progress
>> Ok, so what?
> 
> Seems unnecessary.

I replaced this with a more informative message that says either:

"entering standby mode", or
"starting point-in-time recovery to XID %u", or
"starting archive recovery"

>>> LOG:  initializing recovery connections
>> Seems like unnecessary noise, doesn't mean anything to a DBA.
> 
> Agreed.

Demoted to DEBUG1.

>>> LOG:  redo starts at 0/700140C
>> I guess this could be useful debug information sometimes.
> 
> Agreed.

I left this as it was.

>>> LOG:  consistent recovery state reached at 0/700142C
>> It's nice to know that it has reached consistency, but was there any way
>> to know before this line that it hadn't been reached yet? Perhaps the
>> "redo starts" line should mention that consistency hasn't been reached yet.
> 
> But redo might restart from the consistent database if the standby
> server was shut down after it reached the consistent status.

Yeah, it would be nice to say whether the database is already consistent
or not, but I've left this alone for now.

>> Now, should we print a line when we connect to the master successfully?
>> Seems like useful information.
> 
> Agreed.
> 
>> Then there's the LOG lines whenever a file is restored successfully from
>> archive; are the necessary anymore, now that you can connect to the
>> standby and use pg_last_xlog_replay_location() to poll its status?
> 
> How about moving those messages from LOG to DEBUG?

Didn't touch these yet.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com