Re: [HACKERS] Failed recovery with new faster 2PC code - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: [HACKERS] Failed recovery with new faster 2PC code
Msg-id CAMkU=1y98=hMk=giv8LDszkZqGgTkk2yYWeHPiz+4SN6m7RL5g@mail.gmail.com
In response to Re: [HACKERS] Failed recovery with new faster 2PC code  (Nikhil Sontakke <nikhils@2ndquadrant.com>)
Responses Re: [HACKERS] Failed recovery with new faster 2PC code  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers


On Tue, Apr 18, 2017 at 1:17 AM, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
> Hi,
>
> There was a bug in the 2PC remove code path during redo, because of which autovacuum would think that the prepared transaction was gone and remove the corresponding clog entry earlier than needed.
>
> Please find attached the bug fix: 2pc_redo_remove_bug.patch.
>
> I have been testing this on top of Michael's 2pc-restore-fix.patch, and things have looked OK for over an hour now. I will keep it running for a long time.
>
> Jeff, thanks for these very useful scripts. I am going to make a habit of running them on my side from now on. Do you have any other scripts that I could try against these patches? Please let me know.

This script is the only one I have that specifically targets 2PC. I wrote it last year when the previous round of speed-up code (which avoided writing the state files at "PREPARE" time by delaying them until the next checkpoint) was being developed. I just decided to dust that test off and try it again here. I don't know how to change it to make it more targeted towards this set of patches.

Would this bug have been seen on a replica server in the absence of crashes, or was it only exposed during crash recovery rather than streaming replication?

Cheers,

Jeff
