Re: CLOG read problem after pg_basebackup - Mailing list pgsql-general

From David G Johnston
Subject Re: CLOG read problem after pg_basebackup
Msg-id 1422062327407-5835296.post@n5.nabble.com
In response to CLOG read problem after pg_basebackup  (Petr Novak <petr.novak23@gmail.com>)
Responses Re: CLOG read problem after pg_basebackup  (Adrian Klaver <adrian.klaver@aklaver.com>)
List pgsql-general
Petr Novak wrote
> Three of them failed to start after pg_basebackup completed with:
>
> FATAL:  could not access status of transaction 923709700
> DETAIL:  Could not read from file "pg_clog/0370" at offset 237568: Success.
>
> (the clog file differed in each case of course..)
>
> As for PG versions one is 9.1.14 (on both master and replica) and the
> other two 9.2.9 (also on both)

To clarify, the pg_basebackup against the master failed with the above
message?

You confirmed that the archive did not contain the named clog file?

But when you went and checked the running cluster's pg_clog directory the
file was present?
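
For what it's worth, the file name and offset in the error follow directly
from the transaction id, so it is easy to check which part of pg_clog the
server was after. A minimal sketch in Python, assuming the default 8 kB block
size and the usual layout of 2 status bits per transaction and 32 pages per
segment file (the function name is mine, not anything from PostgreSQL):

```python
BLCKSZ = 8192                 # assumed default block size
XACTS_PER_BYTE = 4            # 2 status bits per transaction
PAGES_PER_SEGMENT = 32        # pages per pg_clog segment file
XACTS_PER_PAGE = BLCKSZ * XACTS_PER_BYTE


def clog_location(xid):
    """Return (segment file name, byte offset of the page) for a given xid."""
    page_no = xid // XACTS_PER_PAGE
    segment_no = page_no // PAGES_PER_SEGMENT
    page_in_segment = page_no % PAGES_PER_SEGMENT
    return "%04X" % segment_no, page_in_segment * BLCKSZ


print(clog_location(923709700))   # ('0370', 237568)
```

For the xid in the report this gives file 0370 at offset 237568, which matches
the message. An error text of "Success" there usually means the read came back
short rather than failing outright, i.e. the file exists but is shorter than
the offset plus one page.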

What was the timestamp of the file in the running cluster relative to the
start of the pg_basebackup?
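
If it helps to gather those facts in one go, something along these lines would
report whether a segment file exists, whether it is long enough to contain the
requested page, and whether it was modified after the backup started. This is
only a sketch: the path and the backup start time are placeholders you would
supply, the helper name is made up, and 8192 assumes the default block size.

```python
import os
from datetime import datetime, timezone

BLCKSZ = 8192  # assumed default block size


def inspect_segment(path, offset, backup_start):
    """Report on one pg_clog segment file; returns None if it is missing.

    path, offset and backup_start are supplied by the caller (hypothetical
    inputs); backup_start must be a timezone-aware datetime.
    """
    if not os.path.exists(path):
        return None
    st = os.stat(path)
    mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
    return {
        "size": st.st_size,
        # The page is readable only if the file extends past offset + one page.
        "contains_page": st.st_size >= offset + BLCKSZ,
        "modified_after_backup_start": mtime > backup_start,
    }
```

Running it once against the master's pg_clog directory and once against the
restored copy should show whether the two files differ in length.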

Did you attempt another pg_basebackup against any of the failing servers -
i.e., is the error now a constant for the server or was it transient?

I am somewhat at a loss to explain how pg_basebackup works with pg_clog
given this quote from the wiki:

https://wiki.postgresql.org/wiki/Hint_Bits

"CLOG pages don't make their way out to disk until the internal CLOG buffers
are filled, at which point the least recently used buffer there is evicted
to permanent storage."

Either pg_clog should be course-corrected by WAL, in which case you
shouldn't get a fatal error if an incomplete clog file is found to exist, or
there must be something being done to avoid a race condition in this area.  If
that isn't happening then your error could potentially be explained - though
damn bad luck getting it on three servers...

The last observation leads one to wonder if there is some kind of transaction
volume or I/O difference that makes the failing servers special (more prone
to getting hit by said race condition)?

I may be just blowing smoke here but maybe it will spark an idea in someone
more knowledgeable.

David J.
