Re: [HACKERS] segfault in hot standby for hash indexes - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: [HACKERS] segfault in hot standby for hash indexes
Date
Msg-id CAMkU=1zdP6jX_afiMc8yJWqjUS6K0xCxbrNhm2O7QyLmFHRzvg@mail.gmail.com
In response to Re: [HACKERS] segfault in hot standby for hash indexes  (Ashutosh Sharma <ashu.coek88@gmail.com>)
Responses Re: [HACKERS] segfault in hot standby for hash indexes  (Ashutosh Sharma <ashu.coek88@gmail.com>)
List pgsql-hackers
On Tue, Mar 21, 2017 at 4:00 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi Jeff,

On Tue, Mar 21, 2017 at 1:54 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 1:28 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Against an unmodified HEAD (17fa3e8), I got a segfault in the hot standby.
>>
>
> I think I see the problem in hash_xlog_vacuum_get_latestRemovedXid().
> It seems to me that we are using different block_id for registering
> the deleted items in xlog XLOG_HASH_VACUUM_ONE_PAGE and then using
> different block_id for fetching those items in
> hash_xlog_vacuum_get_latestRemovedXid().  So probably matching those
> will fix this issue (instead of fetching block number and items from
> block_id 1, we should use block_id 0).
>

Thanks for reporting this issue. As Amit said, it is happening due to
block_id mismatch. Attached is the patch that fixes the same. I
apologise for such a silly mistake. Please note that I was not able
to reproduce the issue on my local machine using the test script you
shared. Could you please check with the attached patch whether you are
still seeing the issue? Thanks in advance.
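
For readers following the thread, the block_id mismatch described above can be pictured with a toy model. This is only a sketch and not PostgreSQL code: ToyRecord, toy_register_block_data, toy_get_block_data and toy_count_deleted are hypothetical names invented for illustration, and the real WAL record layout is more involved. The only point is that a payload attached to one block slot is invisible to a reader that looks in a different slot, which is roughly the situation a redo routine ends up in when it fetches from block_id 1 while the items were registered under block_id 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_BLOCKS 4

/* Hypothetical stand-in for an item offset within a page. */
typedef uint16_t ToyOffset;

/* Hypothetical stand-in for a WAL record: each block slot may carry a payload. */
typedef struct ToyRecord
{
    const void *block_data[TOY_MAX_BLOCKS];
    size_t      block_len[TOY_MAX_BLOCKS];
} ToyRecord;

/* "Writer" side: attach a payload to one block slot of the record. */
static void
toy_register_block_data(ToyRecord *rec, int block_id,
                        const void *data, size_t len)
{
    assert(block_id >= 0 && block_id < TOY_MAX_BLOCKS);
    rec->block_data[block_id] = data;
    rec->block_len[block_id] = len;
}

/* "Replay" side: fetch the payload of one block slot (NULL if none). */
static const void *
toy_get_block_data(const ToyRecord *rec, int block_id, size_t *len)
{
    assert(block_id >= 0 && block_id < TOY_MAX_BLOCKS);
    *len = rec->block_len[block_id];
    return rec->block_data[block_id];
}

/*
 * Count the deleted-item offsets stored under block_id.  Unlike the code
 * discussed in the thread, this toy version checks for a missing payload
 * instead of dereferencing it.
 */
static size_t
toy_count_deleted(const ToyRecord *rec, int block_id)
{
    size_t      len;
    const void *payload = toy_get_block_data(rec, block_id, &len);

    if (payload == NULL)
        return 0;
    return len / sizeof(ToyOffset);
}
```

If a zero-initialized ToyRecord gets three offsets registered under slot 0, toy_count_deleted(&rec, 0) sees all three, while toy_count_deleted(&rec, 1) finds nothing. Code that skipped the NULL check and walked the slot 1 payload anyway would be reading through a null pointer.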


Hi Ashutosh,

I can confirm that that fixes the seg faults for me.

Did you mean you couldn't reproduce the problem in the first place, or that you could reproduce it and now the patch fixes it?  If the former, I forgot to say that you have to wait for the hot standby to reach consistency and open for connections, and then connect to the standby ("psql -p 9874"), before the seg fault will be triggered.

But there are places where hash_xlog_vacuum_get_latestRemovedXid diverges from btree_xlog_delete_get_latestRemovedXid, and I don't understand the reason for the divergence.  Is there a reason we dropped the PANIC if we have not reached consistency?  That case should never happen, but it seems worth preserving the PANIC.  And why does this code get 'unused' from XLogRecGetBlockData(record, 0, &len), while the btree code gets it from xlrec?  Is that because the record being replayed is structured differently between btree and hash, or is there some other reason?

Thanks,

Jeff
