Thread: WARNINGs after starting backup server created with PITR

WARNINGs after starting backup server created with PITR

From
Erik Jones
Date:
I recently started up a server created using PITR (exact detail
below) and received about 125 of these type of errors spanning pages
in about 10 different tables:

2008-01-17 21:47:34 CST 7598 :WARNING:  relation "table_name" page
5728 is uninitialized --- fixing

Looking in the archives I found this thread: http://
archives.postgresql.org/pgsql-general/2007-10/msg01538.php.  However,
in that case he wasn't taking the base backup from the primary server
(which I did) and the actual warning was never explained by anybody
as to what it means, does it entail a botched backup?

This backup was made using rsync across co-locations, so before
actually running pg_start_backup and the actual PITR rsync run, we
ran rsync a couple times over a few days to shorten the amount of
time for the main rsync run.  Sometime after the standby was current
we had an issue where a WAL file wasn't successfully archived to the
standby server (even though rsync apparently reported that it was as
seen in the successful archive message in the primary server's log),
so I stopped the standby and re-ran the backup process:
pg_start_backup('standby') -> rsync primary to standby ->
pg_stop_backup() -> start standby in recovery mode.  At that point,
once the standby was again "caught up" with the most recently shipped
WAL file, I brought it out of recovery mode to run a dump.  It was
once it was out of recovery that those warnings appeared.

Is this backup screwed?

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: WARNINGs after starting backup server created with PITR

From
Tom Lane
Date:
Erik Jones <erik@myemma.com> writes:
> I recently started up a server created using PITR (exact detail
> below) and received about 125 of these type of errors spanning pages
> in about 10 different tables:

> 2008-01-17 21:47:34 CST 7598 :WARNING:  relation "table_name" page
> 5728 is uninitialized --- fixing

If you do a vacuum on the master, do you get the same warnings?

            regards, tom lane

Re: WARNINGs after starting backup server created with PITR

From
Erik Jones
Date:
On Jan 18, 2008, at 3:17 PM, Tom Lane wrote:

> Erik Jones <erik@myemma.com> writes:
>> I recently started up a server created using PITR (exact detail
>> below) and received about 125 of these type of errors spanning pages
>> in about 10 different tables:
>
>> 2008-01-17 21:47:34 CST 7598 :WARNING:  relation "table_name" page
>> 5728 is uninitialized --- fixing
>
> If you do a vacuum on the master, do you get the same warnings?

/me runs VACUUM VERBOSE on the two tables that would matter.

Nope.  What worries me is, that since I have a verified case of rsync
thinking it had successfully transferred a WAL, the same may have
happened with these files during the base backup.  Does that warning,
in fact, entail that there were catalog entries for those files, but
that the file was not there, and by "fixing" it the server just
created empty files?

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: WARNINGs after starting backup server created with PITR

From
Brian Wipf
Date:
On 18-Jan-08, at 2:32 PM, Erik Jones wrote:

> What worries me is, that since I have a verified case of rsync
> thinking it had successfully transferred a WAL, the same may have
> happened with these files during the base backup.  Does that
> warning, in fact, entail that there were catalog entries for those
> files, but that the file was not there, and by "fixing" it the
> server just created empty files?

We archive WALs directly to an NFS mount. We once had a zero-byte WAL
file archived, which I believe was the result of a temporary issue
with the NFS mount. We had to perform a new base backup since the WAL
was deleted/reused by PG because it was told it was archived
successfully. It sounds similar to the problem you experienced. Do you
rsync to an NFS mount?

If this issue is occurring when archiving WALs, I agree that it could
be occurring when trying to get a base backup.

Brian Wipf
<brian@clickspace.com>


Re: WARNINGs after starting backup server created with PITR

From
Erik Jones
Date:
On Jan 18, 2008, at 4:43 PM, Brian Wipf wrote:

> On 18-Jan-08, at 2:32 PM, Erik Jones wrote:
>
>> What worries me is, that since I have a verified case of rsync
>> thinking it had successfully transferred a WAL, the same may have
>> happened with these files during the base backup.  Does that
>> warning, in fact, entail that there were catalog entries for those
>> files, but that the file was not there, and by "fixing" it the
>> server just created empty files?
>
> We archive WALs directly to an NFS mount. We once had a zero-byte
> WAL file archived, which I believe was the result of a temporary
> issue with the NFS mount. We had to perform a new base backup since
> the WAL was deleted/reused by PG because it was told it was
> archived successfully. It sounds similar to the problem you
> experienced. Do you rsync to an NFS mount?
>
> If this issue is occurring when archiving WALs, I agree that it
> could be occurring when trying to get a base backup.

For our primary, er, main, onsite standby server that's also what we
do.  But, this was a co-location to co-location transfer so there was
no NFS mount, it was a direct rsync to the server at the other co-
location.  For WAL files, I've already decided to write a WALShipper
utility that will handle shipping WALs to multiple standbys with
verfication, but for the base backup, this is distressing.  We do
have the option to do the base backup to a portable USB drive and
then carry it to the second co-lo for now.  But, pretty soon we're
going to be surpassing the available limits in portably drive
capacity unless we invest in tape drives.

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: WARNINGs after starting backup server created with PITR

From
Tom Lane
Date:
Erik Jones <erik@myemma.com> writes:
>>> 2008-01-17 21:47:34 CST 7598 :WARNING:  relation "table_name" page
>>> 5728 is uninitialized --- fixing
>>
>> If you do a vacuum on the master, do you get the same warnings?

> /me runs VACUUM VERBOSE on the two tables that would matter.

> Nope.  What worries me is, that since I have a verified case of rsync
> thinking it had successfully transferred a WAL, the same may have
> happened with these files during the base backup.  Does that warning,
> in fact, entail that there were catalog entries for those files, but
> that the file was not there, and by "fixing" it the server just
> created empty files?

Not necessarily.  What the warning actually means is that VACUUM found
an all-zeroes page within a table.  There are scenarios where this is
not unexpected, particularly after a crash on the master.  The reason
is that adding a page to a table is a two-step process.  First we
write() a page of zeroes at the current EOF; this is basically to make
the filesystem reserve the space.  We don't want to report that we've
committed a page-full of new rows and then discover there's no disk
space for them.  Then we initialize the page (ie set up the page header)
and start putting rows into it.  But these latter operations happen
inside a shared buffer, and might not reach disk until the next
checkpoint.  Now, the insertions of the rows are entered into the WAL
log, and once the first such WAL entry has reached disk, the page will
be re-initialized by WAL replay if there's a crash.  But there's an
interval between the filesystem's extension of a table with zeroes and
the first WAL entry related to the page reaching disk.  If you get a
crash in that interval then the all-zeroes page will still be there
after recovery, and will go unused until VACUUM reclaims it (and
produces the WARNING).

So this would explain some zero pages (though not large numbers of
them) if you'd had crashes on the master.  I'm not sure offhand whether
there's any case in which bringing up a PITR slave is close enough to
crash recovery that the same mechanism could apply to produce a zero
page on the slave where there had been none on the master.

In any case, 125 different zeroed pages is pretty hard to explain
by such a mechanism (especially if they were scattered rather than
in contiguous clumps).  I tend to agree that it sounds like there
was something wrong with the rsync mirroring process.

            regards, tom lane

Re: WARNINGs after starting backup server created with PITR

From
Robert Treat
Date:
On Friday 18 January 2008 18:04, Erik Jones wrote:
> For our primary, er, main, onsite standby server that's also what we
> do.  But, this was a co-location to co-location transfer so there was
> no NFS mount, it was a direct rsync to the server at the other co-
> location.  For WAL files, I've already decided to write a WALShipper
> utility that will handle shipping WALs to multiple standbys with
> verfication, but for the base backup, this is distressing.  We do
> have the option to do the base backup to a portable USB drive and
> then carry it to the second co-lo for now.  But, pretty soon we're
> going to be surpassing the available limits in portably drive
> capacity unless we invest in tape drives.
>

Are you guys running ZFS yet? If so it's snapshot / cloning capabilities might
be the way to go.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Re: WARNINGs after starting backup server created with PITR

From
Simon Riggs
Date:
On Fri, 2008-01-18 at 19:58 -0500, Tom Lane wrote:
> In any case, 125 different zeroed pages is pretty hard to explain
> by such a mechanism (especially if they were scattered rather than
> in contiguous clumps).

Can you show us the messages, so we can understand the distribution of
the pages?

Are there different numbers of rows in the two tables? Just a select
count(*) might do, but any way you have of verifying data between the
two systems would be very useful.

The page numbers are identical between both systems, so use the
contrib/pageinspect get_raw_page() function to record the contents on
both systems before they diverge too much. (BTW, the final commit of
those tools seems to have removed the docs I wrote for the original
version and haven't been replaced with a README -- huh!?).

create table bad_blocks as select <blockid> as blockid,
get_raw_page(<relname>, <blockid>)::bytea as raw_page;

I've got a bad feeling about rsync. Mason's recent problems used rsync
and so far they are not properly explained, except as a hardware
problem.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: WARNINGs after starting backup server created with PITR

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> The page numbers are identical between both systems, so use the
> contrib/pageinspect get_raw_page() function to record the contents on
> both systems before they diverge too much. (BTW, the final commit of
> those tools seems to have removed the docs I wrote for the original
> version and haven't been replaced with a README -- huh!?).

You mean
http://developer.postgresql.org/pgdocs/postgres/pageinspect.html
?

But in any case that module doesn't exist in 8.2 or before, so
Erik would probably have a hard time using it.

            regards, tom lane

Re: WARNINGs after starting backup server created with PITR

From
Simon Riggs
Date:
On Sat, 2008-01-19 at 11:51 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The page numbers are identical between both systems, so use the
> > contrib/pageinspect get_raw_page() function to record the contents on
> > both systems before they diverge too much. (BTW, the final commit of
> > those tools seems to have removed the docs I wrote for the original
> > version and haven't been replaced with a README -- huh!?).
>
> You mean
> http://developer.postgresql.org/pgdocs/postgres/pageinspect.html
> ?

Many thanks, I didn't realise we had done that.

> But in any case that module doesn't exist in 8.2 or before, so
> Erik would probably have a hard time using it.

If we can diagnose a bug by doing this, surely that is the right thing
to do? I think the versions wouldn't be that different anyway.

If we can find a row that is missing in the standby version then we will
have at least found a bullet hole.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: WARNINGs after starting backup server created with PITR

From
Erik Jones
Date:
On Jan 19, 2008, at 2:26 AM, Simon Riggs wrote:

> On Fri, 2008-01-18 at 19:58 -0500, Tom Lane wrote:
>> In any case, 125 different zeroed pages is pretty hard to explain
>> by such a mechanism (especially if they were scattered rather than
>> in contiguous clumps).
>
> Can you show us the messages, so we can understand the distribution of
> the pages?
>
> Are there different numbers of rows in the two tables? Just a select
> count(*) might do, but any way you have of verifying data between the
> two systems would be very useful.
>
> The page numbers are identical between both systems, so use the
> contrib/pageinspect get_raw_page() function to record the contents on
> both systems before they diverge too much. (BTW, the final commit of
> those tools seems to have removed the docs I wrote for the original
> version and haven't been replaced with a README -- huh!?).
>
> create table bad_blocks as select <blockid> as blockid,
> get_raw_page(<relname>, <blockid>)::bytea as raw_page;
>
> I've got a bad feeling about rsync. Mason's recent problems used rsync
> and so far they are not properly explained, except as a hardware
> problem.


All of the warnings are below.  For tables that had multiple warnings
they seem to be for consecutive pages.  All of these tables were
seeing some pretty decent write traffic during the base backup which
took place Tuesday night.  The good news, for us, is that none of the
queue tables matter -- that was transient data.  For the other two
history tables, since the PITR recovery ended at 4:58 am Wednesday
morning when it ran up against the dropped WAL file they've already
seen a good amount of traffic.  However, I've been able to verify
missing data by checking the newest timestamp field is older than
those in other tables that had no errors, finding differences of as
much as 11 hours.

If the page_inspect module would work with 8.2 I wouldn't have a
problem building and using it, I already do that with pg_standby.
But, with three days of traffic on these table in the production
server and, having already verified some missing data, I don't see
the point now.    Let me know if it would work, though, so that in
future I'll be aware of an available tool a little sooner.

And, yes, rsync does seem to be suspect, especially given that I'm
not the only one who has reported issues when using it.  So far,
though, we haven't had any issues when using it over local networks,
just between our co-locations. As I mentioned before, where WAL
shipping is concerned, this can be beaten with a more complete and
versatile WAL shipping solution which will be a new personal project
for me.  For base backups, though, it's probably time to explore
other options here.  Also, since iirc, rsync is mentioned directly in
at least one of the chapters dealing with PITR, it may be a good idea
to qualify it's usage with a warning to the effect that some people
have reported inconsistent results from it's use.

2008-01-17 17:34:57 CST 7598 :WARNING:  relation
"userdata_4212_messages_queue" page 170 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1320 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1321 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1322 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1323 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1324 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1325 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1326 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1327 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1328 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1329 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1330 is uninitialized --- fixing
2008-01-17 17:44:48 CST 7598 :WARNING:  relation
"userdata_12176_messages_queue" page 1331 is uninitialized --- fixing
2008-01-17 19:57:05 CST 7598 :WARNING:  relation
"userdata_8520_messages_queue" page 90 is uninitialized --- fixing
2008-01-17 20:02:33 CST 7598 :WARNING:  relation
"userdata_1708_messages_queue" page 262 is uninitialized --- fixing
2008-01-17 20:03:36 CST 7598 :WARNING:  relation
"userdata_7496_messages_queue" page 145 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30712 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30713 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30714 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30715 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30716 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30717 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30718 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30719 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30720 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30721 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30722 is uninitialized --- fixing
2008-01-17 21:10:04 CST 7598 :WARNING:  relation
"userdata_7829_messages_history" page 30723 is uninitialized --- fixing
2008-01-17 21:42:05 CST 7598 :WARNING:  relation
"userdata_16498_messages_queue" page 159 is uninitialized --- fixing
2008-01-17 21:42:05 CST 7598 :WARNING:  relation
"userdata_16498_messages_queue" page 160 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8491 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8492 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8493 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8494 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8495 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8496 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8497 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8498 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8499 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8500 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8501 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8502 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8503 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8504 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8505 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8506 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8507 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8508 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8509 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8510 is uninitialized --- fixing
2008-01-17 21:47:05 CST 7598 :WARNING:  relation
"userdata_16649_messages_history" page 8511 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5671 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5672 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5673 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5674 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5675 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5676 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5677 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5678 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5679 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5680 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5681 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5682 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5683 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5684 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5685 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5686 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5687 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5688 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5689 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5690 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5691 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5692 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5693 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5694 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5695 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5696 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5697 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5698 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5699 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5700 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5701 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5702 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5703 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5704 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5705 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5706 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5707 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5708 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5709 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5710 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5711 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5712 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5713 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5714 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5715 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5716 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5717 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5718 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5719 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5720 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5721 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5722 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5723 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5724 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5725 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5726 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5727 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5728 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5729 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5730 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5731 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5732 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5733 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5734 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5735 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5736 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5737 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5738 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5739 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5740 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5741 is uninitialized --- fixing
2008-01-17 21:47:34 CST 7598 :WARNING:  relation
"userdata_16649_messages_queue" page 5742 is uninitialized --- fixing
2008-01-17 22:09:35 CST 7598 :WARNING:  relation
"userdata_14692_messages_queue" page 449 is uninitialized --- fixing
2008-01-17 22:17:49 CST 7598 :WARNING:  relation
"userdata_15375_messages_queue" page 24 is uninitialized --- fixing

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: WARNINGs after starting backup server created with PITR

From
Tom Lane
Date:
Erik Jones <erik@myemma.com> writes:
> On Jan 19, 2008, at 2:26 AM, Simon Riggs wrote:
>> Can you show us the messages, so we can understand the distribution of
>> the pages?

> All of the warnings are below.  For tables that had multiple warnings
> they seem to be for consecutive pages.

Hmm --- that pattern sure looks like a smoking gun to me.  Have you got
any numbers about the likely relative sizes of the affected tables,
and/or their growth rates while the rsync was in progress?  What I am
thinking is that the uninitialized pages probably represent the ranges
of pages that were added to these tables while the base backup was going
on.  That could be partially confirmed, or conclusively disproven, if
we had numbers about the table sizes and growth rates; but I think we
could take it as a working hypothesis if we don't.

If that is the case, then I can think of two explanations:

1. rsync misfeasance

2. the PITR replay started from some point later than that base backup
started, and the missing pages are those that were added and filled in
between.

Theory #2 has two subtheories: 2A, you didn't follow the procedures
properly, and 2B, you did but there's a bug in the procedures.

            regards, tom lane

Re: WARNINGs after starting backup server created with PITR

From
Simon Riggs
Date:
On Sat, 2008-01-19 at 11:28 -0600, Erik Jones wrote:

> If the page_inspect module would work with 8.2 I wouldn't have a
> problem building and using it, I already do that with pg_standby.
> But, with three days of traffic on these table in the production
> server and, having already verified some missing data, I don't see
> the point now.

Agreed

> Let me know if it would work, though, so that in
> future I'll be aware of an available tool a little sooner.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: WARNINGs after starting backup server created with PITR

From
Simon Riggs
Date:
On Sat, 2008-01-19 at 11:28 -0600, Erik Jones wrote:

> All of the warnings are below.  For tables that had multiple warnings
> they seem to be for consecutive pages.  All of these tables were
> seeing some pretty decent write traffic during the base backup which
> took place Tuesday night.  The good news, for us, is that none of the
> queue tables matter -- that was transient data.  For the other two
> history tables, since the PITR recovery ended at 4:58 am Wednesday
> morning when it ran up against the dropped WAL file they've already
> seen a good amount of traffic.

> However, I've been able to verify
> missing data by checking the newest timestamp field is older than
> those in other tables that had no errors, finding differences of as
> much as 11 hours.

So we definitely have missing data. I think the multi-phase rsync is
definitely suspect and should be avoided until we get to the bottom of
this.

Can you examine the server logs for that 11 hour period and see if there
are any other messages that might be relevant on both primary and
standby servers?

Can you save the WAL files covering that period also. We may want to
inspect them later to confirm whether the data was actually in them or
not. Can you save the very first WAL files after the recovery started?
We can use those to examine the first data blocks touched, which would
help confirm when the gun was fired.

Has rsync produced any messages of note? Might your script be ignoring
errors? What versions of rsync are you using?

Is the History table Insert only? We might use that fact to examine the
LSNs of the equivalent blocks on the Primary. If the LSNs are prior to
the start of the recovery, as noted in the backup label file of the
original base backup, then we can confirm rsync or postgres as the
cause. Do you still have the base backup label file or base backup?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: WARNINGs after starting backup server created with PITR

From
Simon Riggs
Date:
On Sat, 2008-01-19 at 12:56 -0500, Tom Lane wrote:
> Erik Jones <erik@myemma.com> writes:
> > On Jan 19, 2008, at 2:26 AM, Simon Riggs wrote:
> >> Can you show us the messages, so we can understand the distribution of
> >> the pages?
>
> > All of the warnings are below.  For tables that had multiple warnings
> > they seem to be for consecutive pages.
>
> Hmm --- that pattern sure looks like a smoking gun to me.  Have you got
> any numbers about the likely relative sizes of the affected tables,
> and/or their growth rates while the rsync was in progress?  What I am
> thinking is that the uninitialized pages probably represent the ranges
> of pages that were added to these tables while the base backup was going
> on.  That could be partially confirmed, or conclusively disproven, if
> we had numbers about the table sizes and growth rates; but I think we
> could take it as a working hypothesis if we don't.
>
> If that is the case, then I can think of two explanations:
>
> 1. rsync misfeasance
>
> 2. the PITR replay started from some point later than that base backup
> started, and the missing pages are those that were added and filled in
> between.
>
> Theory #2 has two subtheories: 2A, you didn't follow the procedures
> properly, and 2B, you did but there's a bug in the procedures.

Since we already know (1) has occurred with the WAL file transfer, it
seems logical that the same cause could explain the uninitialized pages.
Two unrelated bugs are possible, but less likely.

But we need to check both...

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: WARNINGs after starting backup server created with PITR

From
Erik Jones
Date:
On Jan 19, 2008, at 1:24 PM, Simon Riggs wrote:

> So we definitely have missing data. I think the multi-phase rsync is
> definitely suspect and should be avoided until we get to the bottom of
> this.

By this, do you mean the way we ran the rsync multiple times before
performing the actual base backup?  Easy enough.

> Can you examine the server logs for that 11 hour period and see if
> there
> are any other messages that might be relevant on both primary and
> standby servers?

On the primary, all relevant log entries are just entries for
successfully logged WALs.

> Can you save the WAL files covering that period also. We may want to
> inspect them later to confirm whether the data was actually in them or
> not. Can you save the very first WAL files after the recovery started?
> We can use those to examine the first data blocks touched, which would
> help confirm when the gun was fired.

Sure, I still have all of the WALs that were shipped from just before
the beginning of the base back up until about 8 or 9 hours after the
missing WAL file.  I'm not sure if I'll be able to get to it today as
I've got lots of household chores to get to, but either tomorrow of
first thing Monday I'll zip them and post them here.

> Has rsync produced any messages of note? Might your script be ignoring
> errors? What versions of rsync are you using?

Version 2.6.9 on all three servers (1 primary, 2 standbys).

Here's the python script I was using to ship to both servers, right
now I'm back to a direct rsync call for my archive_command to sb1.
What's really weird, is that for the two WALs that disappeared, or
didn't make it, even though the primary's log said they were
successfully archived, they weren't on either server.

#####
#!/usr/bin/env python
from commands import getstatusoutput
import sys

source_path = sys.argv[1]
file_name = sys.argv[2]

# db2
res, out = getstatusoutput("ssh db2 'test -f /mnt/xfer/wal_archive/%
s'" % file_name)
if res != 0:
    res, out = getstatusoutput("rsync -a %s db2:/mnt/xfer/wal_archive/"
% source_path)

if res != 0:
    sys.exit(res)

# sb1
res, out = getstatusoutput("test -f /mnt/xfer/wal_archive/%s" %
file_name)
if res != 0:
    res, out = getstatusoutput("rsync -a %s /mnt/xfer/wal_archive",
source_path)

sys.exit(res)
#####

> Is the History table Insert only? We might use that fact to examine
> the
> LSNs of the equivalent blocks on the Primary. If the LSNs are prior to
> the start of the recovery, as noted in the backup label file of the
> original base backup, then we can confirm rsync or postgres as the
> cause. Do you still have the base backup label file or base backup?

I still have the backup label but not the base backup.  The history
tables are about 90-95% insert only, on average, but they do see the
occasional update if a message's delivery status is a bounce.  If you
think this would still be an option, I'd be happy to follow whatever
instructions, or supply any files, needed.

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: WARNINGs after starting backup server created with PITR

From
Simon Riggs
Date:
On Sat, 2008-01-19 at 13:53 -0600, Erik Jones wrote:

> > Is the History table Insert only? We might use that fact to examine
> > the
> > LSNs of the equivalent blocks on the Primary. If the LSNs are prior to
> > the start of the recovery, as noted in the backup label file of the
> > original base backup, then we can confirm rsync or postgres as the
> > cause. Do you still have the base backup label file or base backup?
>
> I still have the backup label but not the base backup.  The history
> tables are about 90-95% insert only, on average, but they do see the
> occasional update if a message's delivery status is a bounce.  If you
> think this would still be an option, I'd be happy to follow whatever
> instructions, or supply any files, needed.

OK. Off-list, please post me the backup label file and 2-3 compressed
WAL files covering the period where recovery started.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: WARNINGs after starting backup server created with PITR

From
Simon Riggs
Date:
On Sat, 2008-01-19 at 13:53 -0600, Erik Jones wrote:

> Here's the python script I was using to ship to both servers, right
> now I'm back to a direct rsync call for my archive_command to sb1.
> What's really weird, is that for the two WALs that disappeared, or
> didn't make it, even though the primary's log said they were
> successfully archived, they weren't on either server.

Can you explain this some more? How do you know this? What messages are
received relating to those files?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: WARNINGs after starting backup server created with PITR

From
Erik Jones
Date:
On Jan 20, 2008, at 3:19 AM, Simon Riggs wrote:

> On Sat, 2008-01-19 at 13:53 -0600, Erik Jones wrote:
>
>> Here's the python script I was using to ship to both servers, right
>> now I'm back to a direct rsync call for my archive_command to sb1.
>> What's really weird, is that for the two WALs that disappeared, or
>> didn't make it, even though the primary's log said they were
>> successfully archived, they weren't on either server.
>
> Can you explain this some more? How do you know this? What messages
> are
> received relating to those files?

In my archive directory on the standby, a sorted listing has in it
this sequence:

000000010000059D00000021
000000010000059D00000023

I found this when I check the standby.log pg_standby is writing to
and found it waiting on 000000010000059D00000022.  Checking the
primary server's log I found this line:

2008-01-16 04:58:41 CST 7841 :LOG:  archived transaction log file
"000000010000059D00000022"

The same file was missing on the sb1 server (the one with the local
NFS mounted wal_archive directory).  Since that was the local, to NFS
rsync, which I've never had any issues with, and which was preceded by

if res != 0:
    sys.exit(res)

wherein the value of res comes from a remotely execute command (test
via ssh or rsync via ssh), something weird happened causing my
transfer script to exit abnormally there (or before), but Postgres
somehow saw a 0 exit status, which I understand is required in order
to get that archived transaction log file entry.

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com