Thread: brin index vacuum versus transaction snapshots

brin index vacuum versus transaction snapshots

From
Kevin Grittner
Date:
If you run `make installcheck` against a cluster with
default_transaction_isolation = 'repeatable read' you get one
failure:

*** /home/kgrittn/pg/master/src/test/regress/expected/brin.out 2015-07-21 10:21:58.798619111 -0500
--- /home/kgrittn/pg/master/src/test/regress/results/brin.out 2015-07-21 14:00:25.169320059 -0500
***************
*** 405,409 ****
--- 405,410 ----
      box(point(odd, even), point(thousand, twothousand)) FROM tenk1 ORDER BY unique2 LIMIT 5 OFFSET 5;
  VACUUM brintest;  -- force a summarization cycle in brinidx
+ ERROR:  brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot
  UPDATE brintest SET int8col = int8col * int4col;
  UPDATE brintest SET textcol = '' WHERE textcol IS NOT NULL;
 


The stack trace is:

#0  brinsummarize (index=0x7f1fe0eca6b8, heapRel=0x7f1fe0ed64f8, numSummarized=0x30e1da8, numExisting=0x30e1da8) at brin.c:1080
#1  0x00000000004683d3 in brinvacuumcleanup (fcinfo=0x7ffd41b95c98) at brin.c:731
#2  0x0000000000a69b28 in FunctionCall2Coll (flinfo=0x7ffd41b96078, collation=0, arg1=140725706121624, arg2=0) at fmgr.c:1323
#3  0x00000000004f7b60 in index_vacuum_cleanup (info=0x7ffd41b96198, stats=0x0) at indexam.c:717
#4  0x00000000006e1004 in lazy_cleanup_index (indrel=0x7f1fe0eca6b8, stats=0x0, vacrelstats=0x30e0bb0) at vacuumlazy.c:1397
#5  0x00000000006df637 in lazy_scan_heap (onerel=0x7f1fe0ed64f8, vacrelstats=0x30e0bb0, Irel=0x30e0ca8, nindexes=1, scan_all=0 '\000') at vacuumlazy.c:1108
#6  0x00000000006dda3b in lazy_vacuum_rel (onerel=0x7f1fe0ed64f8, options=1, params=0x7ffd41b96798, bstrategy=0x30d9a38) at vacuumlazy.c:244
#7  0x00000000006dc18d in vacuum_rel (relid=30220, relation=0x2f8d1a8, options=1, params=0x7ffd41b96798) at vacuum.c:1372
#8  0x00000000006db711 in vacuum (options=1, relation=0x2f8d1a8, relid=0, params=0x7ffd41b96798, va_cols=0x0, bstrategy=0x30d9a38, isTopLevel=1 '\001') at vacuum.c:293
#9  0x00000000006db31d in ExecVacuum (vacstmt=0x2f8d200, isTopLevel=1 '\001') at vacuum.c:121
#10 0x00000000008bef36 in standard_ProcessUtility (parsetree=0x2f8d200, queryString=0x2f8c788 "VACUUM brintest;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at utility.c:654
#11 0x00000000008be69e in ProcessUtility (parsetree=0x2f8d200, queryString=0x2f8c788 "VACUUM brintest;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at utility.c:335
#12 0x00000000008be1a7 in PortalRunUtility (portal=0x2f36b18, utilityStmt=0x2f8d200, isTopLevel=1 '\001', dest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at pquery.c:1187
#13 0x00000000008bd1bd in PortalRunMulti (portal=0x2f36b18, isTopLevel=1 '\001', dest=0x2f8d588, altdest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at pquery.c:1318
#14 0x00000000008bc80d in PortalRun (portal=0x2f36b18, count=9223372036854775807, isTopLevel=1 '\001', dest=0x2f8d588, altdest=0x2f8d588, completionTag=0x7ffd41b96d50 "") at pquery.c:816
#15 0x00000000008b7edf in exec_simple_query (query_string=0x2f8c788 "VACUUM brintest;") at postgres.c:1104
#16 0x00000000008b720c in PostgresMain (argc=1, argv=0x2f1d450, dbname=0x2f1d2b0 "regression", username=0x2f1d290 "kgrittn") at postgres.c:4025
#17 0x000000000081ab99 in BackendRun (port=0x2f3d610) at postmaster.c:4183
#18 0x000000000081a17a in BackendStartup (port=0x2f3d610) at postmaster.c:3859
#19 0x0000000000816753 in ServerLoop () at postmaster.c:1618
#20 0x0000000000813d4a in PostmasterMain (argc=3, argv=0x2f1c460) at postmaster.c:1263
#21 0x000000000074ec36 in main (argc=3, argv=0x2f1c460) at main.c:223


Note that the function mentioned in the error message is not
anywhere in the stack trace, and that there was not any explicit
transaction started -- just a VACUUM command for the table, without
any BEGIN/COMMIT.


This is using source at commit 9faa6ae14f6098e4b55f0131f7ec2694a381fb87.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brin index vacuum versus transaction snapshots

From
Kevin Grittner
Date:
Kevin Grittner <kgrittn@ymail.com> wrote:

> If you run `make installcheck` against a cluster with
> default_transaction_isolation = 'repeatable read' you get one
> failure:

> + ERROR: brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot

> This is using source at commit 9faa6ae14f6098e4b55f0131f7ec2694a381fb87.

Ah, the very next commit fixed this while I was running the tests.

Nothing to see here, folks; move along....

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brin index vacuum versus transaction snapshots

From
Jeff Janes
Date:
On Tue, Jul 21, 2015 at 3:10 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>
>> If you run `make installcheck` against a cluster with
>> default_transaction_isolation = 'repeatable read' you get one
>> failure:
>
>> + ERROR: brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot
>
>> This is using source at commit 9faa6ae14f6098e4b55f0131f7ec2694a381fb87.
>
> Ah, the very next commit fixed this while I was running the tests.
>
> Nothing to see here, folks; move along....


Still looks broken from here.




Re: brin index vacuum versus transaction snapshots

From
Kevin Grittner
Date:
Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Jul 21, 2015 at 3:10 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> Kevin Grittner <kgrittn@ymail.com> wrote:
>>
>>> If you run `make installcheck` against a cluster with
>>> default_transaction_isolation = 'repeatable read' you get one
>>> failure:
>>
>>> + ERROR: brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot
>>
>>> This is using source at commit 9faa6ae14f6098e4b55f0131f7ec2694a381fb87.
>>
>> Ah, the very next commit fixed this while I was running the tests.
>
> Still looks broken from here.

You are right; I rushed my test to see whether Tom's patch had any
impact, and neglected to set default_transaction_isolation after
installing the new build.  :-(  The problem still reliably occurs,
and the error message still references a function name that was not
part of the problem.

I have added this to the "PostgreSQL 9.5 Open Items" Wiki page.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brin index vacuum versus transaction snapshots

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> If you run `make installcheck` against a cluster with
> default_transaction_isolation = 'repeatable read' you get one
> failure:

So there's some code that's specifically intended to handle this case:
            /*
             * If using transaction-snapshot mode, it would be possible
             * for another transaction to insert a tuple that's not
             * visible to our snapshot if we have already acquired one,
             * when in snapshot-isolation mode; therefore, disallow this
             * from running in such a transaction unless a snapshot hasn't
             * been acquired yet.
             *
             * This code is called by VACUUM and
             * brin_summarize_new_values. Have the error message mention
             * the latter because VACUUM cannot run in a transaction and
             * thus cannot cause this issue.
             */
            if (IsolationUsesXactSnapshot() && FirstSnapshotSet)
                ereport(ERROR,
                        (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                         errmsg("brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot")));

However, this comment is full of it because a snapshot is obtained even
for VACUUM.  So the fact that brin_summarize_new_values() is mentioned
as being in trouble is just a very minor issue: fixing that would just
be a matter of passing a flag down from the caller into
summarize_range() to indicate whether it's vacuum or the function.  The
real problem is that VACUUM should be allowed to run without error.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: brin index vacuum versus transaction snapshots

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> If you run `make installcheck` against a cluster with
> default_transaction_isolation = 'repeatable read' you get one
> failure:

I am tempted to say that we should just disallow running vacuum on a
table containing a brin index in a transaction-snapshot transaction.
It is possible to silence the problem by checking for vacuum flags, but
(without testing) I think there will be a problem because the snapshot
is acquired too early and it is possible for concurrent transactions to
insert tuples in the table that the summarizing scan will not see, which
will cause the index to become effectively corrupt.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: brin index vacuum versus transaction snapshots

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Kevin Grittner wrote:
>> If you run `make installcheck` against a cluster with
>> default_transaction_isolation = 'repeatable read' you get one
>> failure:

> I am tempted to say that we should just disallow running vacuum on a
> table containing a brin index in a transaction-snapshot transaction.

Huh?  We don't allow vacuum inside a (user started) transaction now.
What new restriction are you proposing?
        regards, tom lane



Re: brin index vacuum versus transaction snapshots

From
Simon Riggs
Date:
On 30 July 2015 at 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Kevin Grittner wrote:
>>> If you run `make installcheck` against a cluster with
>>> default_transaction_isolation = 'repeatable read' you get one
>>> failure:
>
>> I am tempted to say that we should just disallow running vacuum on a
>> table containing a brin index in a transaction-snapshot transaction.
>
> Huh?  We don't allow vacuum inside a (user started) transaction now.
> What new restriction are you proposing?

Forgive me, but I believe there is some confusion here. Nobody is changing VACUUM.

The code comment mentioned as "full of it" by Kevin appears to be accurate and appropriate.

The code is called by VACUUM and brin_summarize_new_values(). It can't be VACUUM, as you say and as the comment already says, so it must be the function brin_summarize_new_values().

The test assumes that the default_transaction_isolation is read committed and so the test fails at other levels. If anything, it is the test that is at fault.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: brin index vacuum versus transaction snapshots

From
Kevin Grittner
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:

>> If you run `make installcheck` against a cluster with
>> default_transaction_isolation = 'repeatable read' you get one
>> failure:
>
> I am tempted to say that we should just disallow running vacuum on
> a table containing a brin index in a transaction-snapshot
> transaction.

You do realize that the VACUUM command was not run inside an
explicit transaction, right?  The only change was that in the
postgresql.conf file I set default_transaction_isolation =
'repeatable read'.  That was my first step before confirming that
nothing was broken for 'serializable'; but if it doesn't pass in
'repeatable read' it's not going to pass in serializable.  Anyway,
for those using serializable transactions to handle race conditions
so that they don't need to use explicit LOCK TABLE statements or
SELECT FOR UPDATE clauses, setting that as the default in
postgresql.conf is standard procedure.  Are you suggesting that
those users should need to run VACUUM commands bracketed with
adjustments to that GUC, like this?

set default_transaction_isolation = 'read committed';
vacuum [...];
reset default_transaction_isolation;

What about autovacuum?

> It is possible to silence the problem by checking for vacuum
> flags, but (without testing) I think there will be a problem
> because the snapshot is acquired too early and it is possible for
> concurrent transactions to insert tuples in the table that the
> summarizing scan will not see, which will cause the index to
> become effectively corrupt.

If that's true, it sounds like vacuum of BRIN indexes might be
broken in a pretty fundamental way.  Shouldn't it be using a
snapshot that lets it see everything past the global xmin
threshold, and let the *users* of the index ignore those tuples
that are not visible to *them*?  From what you're saying here it 
sounds like the BRIN index is failing to include in its bounds any 
tuples not visible to some arbitrary snapshot?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brin index vacuum versus transaction snapshots

From
Kevin Grittner
Date:
Simon Riggs <simon@2ndQuadrant.com> wrote:
> On 30 July 2015 at 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> Kevin Grittner wrote:

>>>> If you run `make installcheck` against a cluster with
>>>> default_transaction_isolation = 'repeatable read' you get one
>>>> failure:
>>>
>>> I am tempted to say that we should just disallow running vacuum
>>> on a table containing a brin index in a transaction-snapshot
>>> transaction.
>>
>> Huh?  We don't allow vacuum inside a (user started) transaction
>> now.  What new restriction are you proposing?
>
>  Forgive me, but I believe there is a confusion here. Nobody is
> changing VACUUM.
>
> The code comment mentioned as "full of it" by Kevin appears to be
> accurate and appropriate.
>
> The code is called by VACUUM and brin_summarize_new_values(). It
> can't be VACUUM,

Reread the thread.  The error is being hit on a VACUUM command
(which is not, of course, in any explicit transaction).

> as you say and as the comment already says, so it must be the
> function brin_summarize_new_values().

It is, in fact, not brin_summarize_new_values() -- as shown by the
stack trace when the error is thrown.  This is why it is such a bad
idea to have any function assume that it knows everywhere it is
being called from, and under what conditions each caller will call
it.  (That's distinct from a function having a documented API about
invariants which must be true when calling the function.)  A simple
thinko like forgetting that even a VACUUM command starts a
transaction if it is not already in an explicit transaction can
cause confusing problems when the function makes a bad guess at who
"must have" called it.  Let the function name itself if it is going
to name something (in a user-facing error message) that is only
visible to those with access to the source code.  Better yet, write
an error message that will mean something to those users who don't
have the source code handy.

> The test assumes that the default_transaction_isolation is read
> committed and so the test fails at other levels. If anything, it
> is the test that is at fault.

Yes, the test being wrong is a bigger problem than an arcane and
misleading error message when it causes a failure.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brin index vacuum versus transaction snapshots

From
Simon Riggs
Date:
On 31 July 2015 at 15:48, Kevin Grittner <kgrittn@ymail.com> wrote:

> Reread the thread.

I stand corrected.

> The error is being hit on a VACUUM command
> (which is not, of course, in any explicit transaction).

Seems like VACUUM only ever needs a read-committed snapshot. Whether it takes a read-committed or repeatable read makes no difference.

Does brin_summarize_new_values() need to track what it reads for serializable transactions?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: brin index vacuum versus transaction snapshots

From
Alvaro Herrera
Date:
Let me explain a bit more about the mechanism at play.

When scanning a block range to summarize it, we use an MVCC snapshot.
Any tuples that are inserted by transactions that appear as in-progress
to that snapshot will not be seen by that range scan; therefore we need
another mechanism to include their values in the summary tuple.

This mechanism is a "placeholder tuple."  We insert this tuple into the
index; any process that tries to insert into the range will update the
placeholder tuple.  All insertions that are not visible according to the
MVCC scan must update the placeholder tuple.

So there are two events that are time-critical regarding each other: the
insertion of the placeholder tuple, and the acquisition of the snapshot
for the MVCC scan.  All transactions in progress for the snapshot *must*
see the placeholder tuple; therefore, we make sure we insert the
placeholder tuple first, and acquire the snapshot later.

This is all regardless of a transaction being a user-declared
transaction block, or an implicit transaction opened by some command;
regardless of vacuum being user-invoked or autovacuum.  The description
of the above is critical for correctness in all those cases.

I assumed (wrongly) that vacuum would not acquire a snapshot.  (For one
thing, we already rely for correctness on lazy vacuum not holding xmin
back, so this is not entirely without sense.)  The problem is that
PortalRunUtility does it unconditionally.  This is not a problem in
read-committed mode, because we simply freshen the snapshot each time
and all is well.  In transaction-snapshot mode, taking another snapshot
is a no-op and thus the whole thing is broken.  The code is right to
raise an error.  (I don't really give a damn whether the error message
is 100% correct or not.  Fixing that part is trivial.)

Even assuming that PortalRunUtility would not grab a snapshot at start,
things would still be broken if vacuum processed more than one range in
transaction-snapshot isolation mode: insert placeholder tuple, grab
snapshot for the first range, scan range: all is well up to this point.
Then we need to process the second range: insert placeholder tuple, grab
snapshot ... which reuses the previous snapshot, older than the
placeholder tuple.  Oops.

I think the real solution to this problem is to avoid use of
GetTransactionSnapshot(), and instead use GetLatestSnapshot().  As far
as I can see, that should completely close the hole.  This requires
patching IndexBuildHeapRangeScan() to allow for that.

Untested patch attached.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: brin index vacuum versus transaction snapshots

From
Alvaro Herrera
Date:
> I think the real solution to this problem is to avoid use of
> GetTransactionSnapshot(), and instead use GetLatestSnapshot().  As far
> as I can see, that should completely close the hole.  This requires
> patching IndexBuildHeapRangeScan() to allow for that.

Actually I think there's another problem: if a transaction starts and
inserts a tuple into the page range, then goes to sleep, and then
another session does the summarization of the page range, session 1 is
seen as "in progress" by session 2 (so the scan does not see the new
tuple), but the placeholder tuple was not modified either because it was
inserted later than the snapshot.  So the update is lost.

I think the only way to close this hole is to have summarize_range()
sleep until all open snapshots are gone after inserting the placeholder
tuple and before acquiring the snapshot, similarly to how CREATE INDEX
CONCURRENTLY does it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: brin index vacuum versus transaction snapshots

From
Kevin Grittner
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

>> I think the real solution to this problem is to avoid use of
>> GetTransactionSnapshot(), and instead use GetLatestSnapshot().  As far
>> as I can see, that should completely close the hole.  This requires
>> patching IndexBuildHeapRangeScan() to allow for that.
>
> Actually I think there's another problem: if a transaction starts and
> inserts a tuple into the page range, then goes to sleep, and then
> another session does the summarization of the page range, session 1 is
> seen as "in progress" by session 2 (so the scan does not see the new
> tuple), but the placeholder tuple was not modified either because it was
> inserted later than the snapshot.  So the update is lost.
>
> I think the only way to close this hole is to have summarize_range()
> sleep until all open snapshots are gone after inserting the placeholder
> tuple and before acquiring the snapshot, similarly to how CREATE INDEX
> CONCURRENTLY does it.

Please excuse my naiveté on the topic, but could you explain (or
point me to the documented explanation) of why we don't scan using
a non-MVCC snapshot and build the page range based on all non-dead
tuples?  I understand that the range being scanned would need to be
locked, but we're OK with doing that for creation of other indexes.
(There is no mention of snapshots or locks in the BRIN README....)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brin index vacuum versus transaction snapshots

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> 
> >> I think the real solution to this problem is to avoid use of
> >> GetTransactionSnapshot(), and instead use GetLatestSnapshot().  As far
> >> as I can see, that should completely close the hole.  This requires
> >> patching IndexBuildHeapRangeScan() to allow for that.
> >
> > Actually I think there's another problem: if a transaction starts and
> > inserts a tuple into the page range, then goes to sleep, and then
> > another session does the summarization of the page range, session 1 is
> > seen as "in progress" by session 2 (so the scan does not see the new
> > tuple), but the placeholder tuple was not modified either because it was
> > inserted later than the snapshot.  So the update is lost.
> >
> > I think the only way to close this hole is to have summarize_range()
> > sleep until all open snapshots are gone after inserting the placeholder
> > tuple and before acquiring the snapshot, similarly to how CREATE INDEX
> > CONCURRENTLY does it.
> 
> Please excuse my naiveté on the topic, but could you explain (or
> point me to the documented explanation) of why we don't scan using
> a non-MVCC snapshot and build the page range based on all non-dead
> tuples?

Because I don't know of any mechanism to lock insertion into the block
range while the scan takes place.  If you can suggest something
workable, that might be better than what I'm proposing.

Note that in my proposal, we would have to wait for all snapshots to go
away *for every page range*, which seems very troublesome.  A better
answer might be to insert all placeholder tuples first, then wait for
concurrent snapshots to go away, then do each scan.  But this is a
larger rework of code.

> I understand that the range being scanned would need to be
> locked, but we're OK with doing that for creation of other indexes.

That might be so, but this is not index creation; it's not acceptable to
lock the table during vacuuming.  Do we have anything with finer
granularity than locking the entire table?  (I don't think holding
buffer lock on all the pages in the range is acceptable.)

> (There is no mention of snapshots or locks in the BRIN README....)

Um.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: brin index vacuum versus transaction snapshots

From
Kevin Grittner
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Untested patch attached.

That fixes the installcheck failure on my machine.


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brin index vacuum versus transaction snapshots

From
Robert Haas
Date:
On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> I think the real solution to this problem is to avoid use of
>> GetTransactionSnapshot(), and instead use GetLatestSnapshot().  As far
>> as I can see, that should completely close the hole.  This requires
>> patching IndexBuildHeapRangeScan() to allow for that.
>
> Actually I think there's another problem: if a transaction starts and
> inserts a tuple into the page range, then goes to sleep, and then
> another session does the summarization of the page range, session 1 is
> seen as "in progress" by session 2 (so the scan does not see the new
> tuple), but the placeholder tuple was not modified either because it was
> inserted later than the snapshot.  So the update is lost.
>
> I think the only way to close this hole is to have summarize_range()
> sleep until all open snapshots are gone after inserting the placeholder
> tuple and before acquiring the snapshot, similarly to how CREATE INDEX
> CONCURRENTLY does it.

That's gonna be really slow, though, right?  Even if you rework things
so that vacuum inserts all the placeholder tuples first, then waits,
then does all the summarization, that could easily turn a vacuum that
would have finished in a second into one that instead takes multiple
hours.  During that time an AV worker is pinned down, and all sorts of
badness can ensue.

Maybe I'm misunderstanding, but this whole thing seems like a pretty
serious problem for BRIN.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: brin index vacuum versus transaction snapshots

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > I think the only way to close this hole is to have summarize_range()
> > sleep until all open snapshots are gone after inserting the placeholder
> > tuple and before acquiring the snapshot, similarly to how CREATE INDEX
> > CONCURRENTLY does it.
> 
> That's gonna be really slow, though, right?  Even if you rework things
> so that vacuum inserts all the placeholder tuples first, then waits,
> then does all the summarization, that could easily turn a vacuum that
> would have finished in a second into one that instead takes multiple
> hours.  During that time an AV worker is pinned down, and all sorts of
> badness can ensue.

Yeah, it is bad and I was concerned about that too.  Thankfully I found
another way to solve it, which is to forgo usage of MVCC here and
instead use SnapshotAny.  There's already a mode in
IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks
to do what we need.  I'm going to propose a patch along those lines
shortly.

> Maybe I'm misunderstanding, but this whole thing seems like a pretty
> serious problem for BRIN.  :-(

With this new approach it shouldn't be.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: brin index vacuum versus transaction snapshots

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> Thankfully I found
> another way to solve it, which is to forgo usage of MVCC here and
> instead use SnapshotAny.  There's already a mode in
> IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks
> to do what we need.  I'm going to propose a patch along those lines
> shortly.

Here's the promised patch.  Includes a new isolation test (which fails
with the old code in two ways: first, when using VACUUM, it causes the
error message Kevin didn't like to be raised; second, when using
brin_summarize_new_values, it causes the inserted tuple to not be part of
the summary, which is wrong).  This test leaves the pageinspect
extension installed in the isolationtest database, but that seems fine
to me.  (No other isolation test installs an extension.)

The error message Kevin complained about is gone, as is some related
commentary.  This is replaced by tweaks in IndexBuildHeapRangeScan that
know to do a SnapshotAny scan without being noisy about in progress
insertions/deletions.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: brin index vacuum versus transaction snapshots

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> Here's the promised patch.

Pushed to master and 9.5.  Thanks for reporting!

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services