Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processingBRIN indexes in VACUUM - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processingBRIN indexes in VACUUM
Date
Msg-id 20171031173730.cxl6gkye6jrfccsm@alvherre.pgsql
Whole thread Raw
In response to Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
> So in a few more runs this morning using Alvaro's simplified test case,
> I have seen the following behaviors not previously reported:

> 1. Crashes in PageIndexTupleOverwrite, which has the same "invalid index
> offnum: %u" error report as PageIndexTupleDeleteNoCompact.  I note the
> same message appears in plain PageIndexTupleDelete as well.
> I think we've been too hasty to assume all instances of this came out of
> PageIndexTupleDeleteNoCompact.

Ah, I wasn't paying close attention to the originator routine of the
message, but you're right, I see this one too.

> 2. Crashes in the data-insertion process, not only the process running
> summarize_range:

Yeah, I saw these.  I was expecting it, since the two routines
(brininsert and summarize_range) pretty much share the insertion
protocol.

> I really don't understand how any of this "let's release the buffer
> lock and then take it back later" logic is supposed to work reliably.

Yeah, evidently that was way too optimistic and I'll need to figure out
a better mechanism to handle this.

The intention was to avoid deadlocks while locking the target page for
the insertion: by having both pages start unlocked we can simply lock
them in block number order.  If we keep the page containing the tuple
locked, I don't see how to reliably avoid a deadlock while acquiring a
buffer to insert the new tuple.

> BTW, while I'm bitching, it seems fairly insane from a concurrency
> standpoint that brin_getinsertbuffer is calling RecordPageWithFreeSpace
> while holding at least one and possibly two buffer locks.  Shouldn't
> that be done someplace else?

Hmm.  I spent a lot of effort (commit ccc4c074994d) to avoid leaving
pages uninitialized / unrecorded in FSM.  I left this on purpose on the
rationale that trying to fix it would make the callsites more convoluted
(the retry logic doesn't help).  But as I recall this was supposed to be
done only in the rare case where the buffer could not be returned to
caller ... but that's not what the current code does, so there is
something wrong there.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: [HACKERS] SQL procedures
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] Fix dumping pre-10 DBs by pg_dump10 if table "name" exists