Re: CLOG read problem after pg_basebackup - Mailing list pgsql-general

From Adrian Klaver
Subject Re: CLOG read problem after pg_basebackup
Msg-id 54C3B9B1.5000809@aklaver.com
In response to Re: CLOG read problem after pg_basebackup  (David G Johnston <david.g.johnston@gmail.com>)
List pgsql-general
On 01/23/2015 05:18 PM, David G Johnston wrote:
> Petr Novak wrote
>> Three of them failed to start after pg_basebackup completed with:
>>
>> FATAL:  could not access status of transaction 923709700
>> DETAIL:  Could not read from file "pg_clog/0370" at offset 237568: Success.
>>
>> (the clog file differed in each case of course..)
>>
>> As for PG versions one is 9.1.14 (on both master and replica) and the
>> other two 9.2.9 (also on both)
>
> To clarify, the pg_basebackup against the master failed with the above
> message?
>
> You confirmed that the archive did not contain the named clog file?
>
> But when you went and checked the running cluster's pg_clog directory the
> file was present?

In the initial post the OP said that the pg_clog file was present on
both the master and the replica, and that copying the presumably updated
file from the master to the replica, after the pg_basebackup, 'cured' the
problem. This would seem to explain the "could not read from file ... at
offset" error: initially the replicated Postgres was looking for data in
the pg_clog file at an offset that existed only in the version of the file
on the master. Once the replica was provided with the updated file it was
happy. The question is how it got into that state. Your observations
below are better than anything I could come up with.
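
To illustrate why a shorter copy of the file would produce exactly this
error, here is a rough Python sketch of how a transaction ID maps to a
pg_clog segment file and read offset. It assumes the default build
settings as I understand them from the 9.x source (8 kB blocks, two status
bits per transaction, 32 pages per segment file); the function names are
made up for this example and are not part of Postgres.

```python
import os

# Assumed defaults from the PostgreSQL 9.x source (clog.c / slru.c).
# A non-default BLCKSZ build would change these numbers.
BLCKSZ = 8192                    # bytes per page
CLOG_XACTS_PER_BYTE = 4          # 2 status bits per transaction
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE   # 32768
SLRU_PAGES_PER_SEGMENT = 32      # pages per pg_clog segment file


def clog_location(xid):
    """Return (segment file name, byte offset of the page holding xid)."""
    pageno = xid // CLOG_XACTS_PER_PAGE
    segno = pageno // SLRU_PAGES_PER_SEGMENT
    page_in_segment = pageno % SLRU_PAGES_PER_SEGMENT
    return "%04X" % segno, page_in_segment * BLCKSZ


def required_size(xid):
    """Smallest segment file size that lets the whole page be read."""
    _, offset = clog_location(xid)
    return offset + BLCKSZ


def file_covers_xid(path, xid):
    """True if the segment file at path is long enough to hold xid's page."""
    return os.path.getsize(path) >= required_size(xid)


if __name__ == "__main__":
    xid = 923709700              # from the FATAL message in this thread
    print(clog_location(xid))    # -> ('0370', 237568)
    print(required_size(xid))    # -> 245760
```

The result matches the error message: transaction 923709700 lives in
"pg_clog/0370", in the page starting at offset 237568. A copy of that file
shorter than 245760 bytes cannot supply the full page. As far as I can
tell, the trailing "Success" is what you get when the read comes back
short without setting errno. Comparing the file size on the replica
against required_size() would be one way to test this theory.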

>
> What was the timestamp of the file in the running cluster relative to the
> start of the pg_basebackup?
>
> Did you attempt another pg_basebackup against any of the failing servers -
> i.e., is the error now a constant for the server or was it transient?
>
> I am somewhat at a loss to explain how pg_basebackup works with pg_clog
> given this quote from the wiki:
>
> https://wiki.postgresql.org/wiki/Hint_Bits
>
> "CLOG pages don't make their way out to disk until the internal CLOG buffers
> are filled, at which point the least recently used buffer there is evicted
> to permanent storage."
>
> Either pg_clog should be course-corrected by WAL, in which case you
> shouldn't get a fatal error if an incomplete clog file is found to exist, or
> there must be something being done to avoid a race condition in this area.  If
> that isn't happening then your error could potentially be explained - though
> damn bad luck getting it on three servers...
>
> The last observation leads one to wonder if there is some kind of transaction
> volume or I/O difference that makes the failing servers special (more prone
> to getting hit by said race condition)?
>
> I may be just blowing smoke here but maybe it will spark an idea in someone
> more knowledgeable.
>
> David J.


--
Adrian Klaver
adrian.klaver@aklaver.com

