On Thu, Oct 13, 2011 at 4:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Oct 13, 2011 at 4:07 PM, Bob Hatfield <bobhatfield@gmail.com> wrote:
>>> have you had any power events? hard shutdowns, etc? I wonder if the
>>> problem is in the clog files, and not the heap itself.
>>
>> Nothing unusual for as long as I can tell. Reminder that as long as I
>> don't restart the primary's pg process, everything works fine
>> (secondary's data is intact).
>>
>> It's as if stopping/starting the primary causes a shipped wal file to
>> be corrupt or contain duplicated data then processed by the secondary.
>
> My money is on clog/visibility related issues. It's a bit of a bear,
> but can you pull the xmin/xmax/ctid for the two duplicate records on
> the standby and the correspondingly non-duplicated record on the
> master? I'm curious if the heap blocks are identical and if the
> standby is incorrectly marking a transaction as valid/invalid.
>
> From there,
>
> We need to:
> *) figure out the transaction bits in clog on both systems and look
> them up there.
> *) also, look for differences in clog generally
> *) digest the heap block containing the records to see if they are identical
> *) double check hint bits?

Any movement on this? There is considerable interest in resolving any
reproducible issues with postgres replication. Do you happen to
remember if you set up the standby while the master was under high
load? Any interesting/unexplained messages in the standby logs?
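
For the clog lookup step quoted above, here is a rough sketch of how a
transaction id maps to its status bits on disk. It assumes the default
8 kB block size, 2 status bits per transaction and 32 pages per segment
file; the helper names (clog_location, clog_status) are made up for
illustration and are not part of postgres. Treat it as a starting point
rather than a vetted tool, and note that recent clog pages may still be
in shared buffers until a checkpoint flushes them.

```python
import os

BLCKSZ = 8192                 # assumed default block size; check pg_controldata
XACTS_PER_BYTE = 4            # clog keeps 2 status bits per transaction
XACTS_PER_PAGE = BLCKSZ * XACTS_PER_BYTE
PAGES_PER_SEGMENT = 32        # one segment file = 32 pages = 256 kB
STATUS = {0: "in progress", 1: "committed", 2: "aborted", 3: "sub-committed"}


def clog_location(xid):
    """Return (segment file name, byte offset in that file, bit shift)."""
    page = xid // XACTS_PER_PAGE
    segment = page // PAGES_PER_SEGMENT
    byte_in_page = (xid % XACTS_PER_PAGE) // XACTS_PER_BYTE
    offset = (page % PAGES_PER_SEGMENT) * BLCKSZ + byte_in_page
    shift = (xid % XACTS_PER_BYTE) * 2
    return "%04X" % segment, offset, shift


def clog_status(clog_dir, xid):
    """Read the 2-bit status for xid from the segment files in clog_dir."""
    name, offset, shift = clog_location(xid)
    with open(os.path.join(clog_dir, name), "rb") as f:
        f.seek(offset)
        raw = f.read(1)
    if not raw:
        raise ValueError("xid %d is past the end of segment %s" % (xid, name))
    return STATUS[(raw[0] >> shift) & 0x03]
```

Run clog_status() against the pg_clog directory (pg_xact on version 10
and later) on both the master and the standby, passing the xmin and
xmax of the duplicated rows, and compare the answers.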
merlin