Thread: Re: Somebody has not thought through subscription lockingconsiderations

Re: Somebody has not thought through subscription lockingconsiderations

From
Petr Jelinek
Date:
On 30/03/17 07:25, Tom Lane wrote:
> I noticed this failure report:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-03-29%2019%3A45%3A27
> 
> in which we find
> 
> *** /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/expected/updatable_views.out    Thu Mar 30
04:45:432017
 
> --- /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/results/updatable_views.out    Thu Mar 30
05:32:372017
 
> ***************
> *** 349,354 ****
> --- 349,358 ----
>   DROP VIEW ro_view10, ro_view12, ro_view18;
>   DROP SEQUENCE seq CASCADE;
>   NOTICE:  drop cascades to view ro_view19
> + ERROR:  deadlock detected
> + DETAIL:  Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 7577.
> + Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
> + HINT:  See server log for query details.
>   -- simple updatable view
>   CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
>   INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
> 
> and the referenced bit of log is
> 
> [58dc19dd.1d98:175] ERROR:  deadlock detected
> [58dc19dd.1d98:176] DETAIL:  Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by
process7577.
 
>     Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
>     Process 7576: DROP SEQUENCE seq CASCADE;
>     Process 7577: VACUUM FULL pg_class;
> [58dc19dd.1d98:177] HINT:  See server log for query details.
> [58dc19dd.1d98:178] STATEMENT:  DROP SEQUENCE seq CASCADE;
> 
> Of course, 1259 is pg_class and 6102 is pg_subscription_rel.
> 
> I await with interest an explanation of what "VACUUM FULL pg_class" is
> doing trying to acquire ShareRowExclusiveLock on pg_subscription_rel, not
> to mention why a DROP SEQUENCE is holding some fairly strong lock on that
> relation.  *Especially* in a situation where no subscriptions exist ---
> but even if any did, this seems unacceptable on its face.  Access to core
> catalogs like pg_class cannot depend on random other stuff.
> 

Hmm, the DROP SEQUENCE is result of not having dependency info for
relations/subscriptions I think. I was told during review it's needless
bloat of dependency catalog. I guess we should revisit that. It's also
likely that RemoveSubscriptionRel will work fine with lower lock level.

I have no idea why VACUUM FULL of pg_class would touch the
pg_subscription_rel though, I'll have to dig into that.

I can see that locking for example pg_trigger or pg_depend in
ShareRowExclusiveLock will also block VACUUM FULL on pg_class (and other
related tables like pg_attribute). So maybe we just need to be careful
about not taking such a strong lock...

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: Somebody has not thought through subscription locking considerations

From
Masahiko Sawada
Date:
On Fri, Mar 31, 2017 at 9:53 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 30/03/17 07:25, Tom Lane wrote:
>> I noticed this failure report:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-03-29%2019%3A45%3A27
>>
>> in which we find
>>
>> *** /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/expected/updatable_views.out     Thu Mar 30
04:45:432017
 
>> --- /home/buildfarm/data/buildroot/HEAD/pgsql.build/src/test/regress/results/updatable_views.out      Thu Mar 30
05:32:372017
 
>> ***************
>> *** 349,354 ****
>> --- 349,358 ----
>>   DROP VIEW ro_view10, ro_view12, ro_view18;
>>   DROP SEQUENCE seq CASCADE;
>>   NOTICE:  drop cascades to view ro_view19
>> + ERROR:  deadlock detected
>> + DETAIL:  Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 7577.
>> + Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
>> + HINT:  See server log for query details.
>>   -- simple updatable view
>>   CREATE TABLE base_tbl (a int PRIMARY KEY, b text DEFAULT 'Unspecified');
>>   INSERT INTO base_tbl SELECT i, 'Row ' || i FROM generate_series(-2, 2) g(i);
>>
>> and the referenced bit of log is
>>
>> [58dc19dd.1d98:175] ERROR:  deadlock detected
>> [58dc19dd.1d98:176] DETAIL:  Process 7576 waits for AccessShareLock on relation 1259 of database 16384; blocked by
process7577.
 
>>       Process 7577 waits for ShareRowExclusiveLock on relation 6102 of database 16384; blocked by process 7576.
>>       Process 7576: DROP SEQUENCE seq CASCADE;
>>       Process 7577: VACUUM FULL pg_class;
>> [58dc19dd.1d98:177] HINT:  See server log for query details.
>> [58dc19dd.1d98:178] STATEMENT:  DROP SEQUENCE seq CASCADE;
>>
>> Of course, 1259 is pg_class and 6102 is pg_subscription_rel.
>>
>> I await with interest an explanation of what "VACUUM FULL pg_class" is
>> doing trying to acquire ShareRowExclusiveLock on pg_subscription_rel, not
>> to mention why a DROP SEQUENCE is holding some fairly strong lock on that
>> relation.  *Especially* in a situation where no subscriptions exist ---
>> but even if any did, this seems unacceptable on its face.  Access to core
>> catalogs like pg_class cannot depend on random other stuff.
>>
>
> Hmm, the DROP SEQUENCE is result of not having dependency info for
> relations/subscriptions I think. I was told during review it's needless
> bloat of dependency catalog. I guess we should revisit that. It's also
> likely that RemoveSubscriptionRel will work fine with lower lock level.
>
> I have no idea why VACUUM FULL of pg_class would touch the
> pg_subscription_rel though, I'll have to dig into that.

VACUUM FULL of any table acquires ShareRowExclusiveLock on
pg_subscription_rel because when doDeletion removes old heap the
RemoveSubscriptionRel is called in heap_drop_with_catalog. The
following stack trace is what I got when run VACUUM FULL pg_class.

#2  0x000000000084b9d2 in LockRelationOid (relid=6102, lockmode=6) at lmgr.c:115
#3  0x00000000004cec37 in relation_open (relationId=6102, lockmode=6)
at heapam.c:1122
#4  0x00000000004ceef9 in heap_open (relationId=6102, lockmode=6) at
heapam.c:1288
#5  0x0000000000599a02 in RemoveSubscriptionRel (subid=0, relid=16409)
at pg_subscription.c:361
#6  0x000000000056037f in heap_drop_with_catalog (relid=16409) at heap.c:1842
#7  0x000000000055b420 in doDeletion (object=0x1bca758, flags=1) at
dependency.c:1125
#8  0x000000000055b192 in deleteOneObject (object=0x1bca758,
depRel=0x7fff000716a0, flags=1) at dependency.c:1028
#9  0x000000000055a169 in deleteObjectsInList
(targetObjects=0x1b99708, depRel=0x7fff000716a0, flags=1) at
dependency.c:263
#10 0x000000000055a21a in performDeletion (object=0x7fff00071750,
behavior=DROP_RESTRICT, flags=1) at dependency.c:344
#11 0x0000000000615597 in finish_heap_swap (OIDOldHeap=1259,
OIDNewHeap=16409, is_system_catalog=1 '\001', swap_toast_by_content=0
'\000', check_constraints=0 '\000', is_internal=1 '\001',
frozenXid=571, cutoffMulti=1, newrelpersistence=112 'p') at
cluster.c:1574
#12 0x0000000000613bd3 in rebuild_relation (OldHeap=0x7f541f0d24a0,
indexOid=0, verbose=0 '\000') at cluster.c:590
#13 0x000000000061362a in cluster_rel (tableOid=1259, indexOid=0,
recheck=0 '\000', verbose=0 '\000') at cluster.c:404
#14 0x000000000069f40f in vacuum_rel (relid=1259, relation=0x1b9a770,
options=17, params=0x7fff00071a80) at vacuum.c:1441
#15 0x000000000069db9b in vacuum (options=17, relation=0x1b9a770,
relid=0, params=0x7fff00071a80, va_cols=0x0, bstrategy=0x1bf2f50,
isTopLevel=1 '\001') at vacuum.c:304
#16 0x000000000069d7ec in ExecVacuum (vacstmt=0x1b9a7c8, isTopLevel=1
'\001') at vacuum.c:122
#17 0x0000000000873a9c in standard_ProcessUtility (pstmt=0x1b9ab28,
queryString=0x1b99d50 "vacuum FULL pg_class;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x1b9ac20,
completionTag=0x7fff00071eb0 "") at utility.c:670
#18 0x0000000000873222 in ProcessUtility (pstmt=0x1b9ab28,
queryString=0x1b99d50 "vacuum FULL pg_class;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x1b9ac20,
completionTag=0x7fff00071eb0 "") at utility.c:353
#19 0x000000000087220c in PortalRunUtility (portal=0x1b35540,
pstmt=0x1b9ab28, isTopLevel=1 '\001', setHoldSnapshot=0 '\000',
dest=0x1b9ac20, completionTag=0x7fff00071eb0 "") at pquery.c:1174
#20 0x0000000000872419 in PortalRunMulti (portal=0x1b35540,
isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0x1b9ac20,
altdest=0x1b9ac20, completionTag=0x7fff00071eb0 "") at pquery.c:1317
#21 0x0000000000871945 in PortalRun (portal=0x1b35540,
count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001',
dest=0x1b9ac20, altdest=0x1b9ac20, completionTag=0x7fff00071eb0 "") at
pquery.c:795
#22 0x000000000086ba9e in exec_simple_query (query_string=0x1b99d50
"vacuum FULL pg_class;") at postgres.c:1101
#23 0x000000000086fbf8 in PostgresMain (argc=1, argv=0x1b44220,
dbname=0x1b44080 "postgres", username=0x1b44058 "masahiko") at
postgres.c:4071
#24 0x00000000007d6e46 in BackendRun (port=0x1b3c5d0) at postmaster.c:4317
#25 0x00000000007d65c6 in BackendStartup (port=0x1b3c5d0) at postmaster.c:3989
#26 0x00000000007d2bde in ServerLoop () at postmaster.c:1729
#27 0x00000000007d22a2 in PostmasterMain (argc=5, argv=0x1b15ed0) at
postmaster.c:1337
#28 0x0000000000710cfa in main (argc=5, argv=0x1b15ed0) at main.c:228

> I can see that locking for example pg_trigger or pg_depend in
> ShareRowExclusiveLock will also block VACUUM FULL on pg_class (and other
> related tables like pg_attribute). So maybe we just need to be careful
> about not taking such a strong lock...
>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Masahiko Sawada <sawada.mshk@gmail.com> writes:
> On Fri, Mar 31, 2017 at 9:53 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 30/03/17 07:25, Tom Lane wrote:
>>> I await with interest an explanation of what "VACUUM FULL pg_class" is
>>> doing trying to acquire ShareRowExclusiveLock on pg_subscription_rel, not
>>> to mention why a DROP SEQUENCE is holding some fairly strong lock on that
>>> relation.

> VACUUM FULL of any table acquires ShareRowExclusiveLock on
> pg_subscription_rel because when doDeletion removes old heap the
> RemoveSubscriptionRel is called in heap_drop_with_catalog.

This seems entirely horrid: it *guarantees* deadlock possibilities.
And I wonder what happens when I VACUUM FULL pg_subscription_rel
itself.

At the very least, it would be a good idea to exclude the system
catalogs from logical replication, wouldn't it?
        regards, tom lane



Re: Somebody has not thought through subscription lockingconsiderations

From
Petr Jelinek
Date:
On 31/03/17 19:35, Tom Lane wrote:
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>> On Fri, Mar 31, 2017 at 9:53 AM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> On 30/03/17 07:25, Tom Lane wrote:
>>>> I await with interest an explanation of what "VACUUM FULL pg_class" is
>>>> doing trying to acquire ShareRowExclusiveLock on pg_subscription_rel, not
>>>> to mention why a DROP SEQUENCE is holding some fairly strong lock on that
>>>> relation.
> 
>> VACUUM FULL of any table acquires ShareRowExclusiveLock on
>> pg_subscription_rel because when doDeletion removes old heap the
>> RemoveSubscriptionRel is called in heap_drop_with_catalog.
> 
> This seems entirely horrid: it *guarantees* deadlock possibilities.
> And I wonder what happens when I VACUUM FULL pg_subscription_rel
> itself.
> 
> At the very least, it would be a good idea to exclude the system
> catalogs from logical replication, wouldn't it?
> 

They are excluded. It works same way for triggers and many other objects
so I would not say it's horrid.

The problematic part is that the pg_subscription_rel manipulation
functions acquire ShareRowExclusiveLock and not the usual
RowExclusiveLock because there were some worries about concurrency. I
think though that it's not needed though given the access patterns
there. It's only updated by CREATE SUBSCRIPTION/ALTER SUBSCRIPTION
REFRESH and then by tablesync which holds exclusive lock on the table
itself anyway.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
> On 31/03/17 19:35, Tom Lane wrote:
>> At the very least, it would be a good idea to exclude the system
>> catalogs from logical replication, wouldn't it?

> They are excluded.

Well, the exclusion is apparently not checked early enough.  I would say
that touches of system catalogs should never result in any attempts to
access the logical relation infrastructure, but what we have here is
such an access.

> The problematic part is that the pg_subscription_rel manipulation
> functions acquire ShareRowExclusiveLock and not the usual
> RowExclusiveLock because there were some worries about concurrency.

No, the problematic part is that there is any heap_open happening at
all.  That open could very easily result in a recursive attempt to read
pg_class, for example, which is going to be fatal if we're in the midst
of vacuum full'ing or reindex'ing pg_class.  It's frankly astonishing
to me that this patch seems to have survived testing under
CLOBBER_CACHE_ALWAYS, because it's only the catalog caches that are
preventing such recursive lookups.
        regards, tom lane



Re: Somebody has not thought through subscription lockingconsiderations

From
Petr Jelinek
Date:
On 31/03/17 20:23, Tom Lane wrote:
> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
>> On 31/03/17 19:35, Tom Lane wrote:
>>> At the very least, it would be a good idea to exclude the system
>>> catalogs from logical replication, wouldn't it?
> 
>> They are excluded.
> 
> Well, the exclusion is apparently not checked early enough.  I would say
> that touches of system catalogs should never result in any attempts to
> access the logical relation infrastructure, but what we have here is
> such an access.
> 
>> The problematic part is that the pg_subscription_rel manipulation
>> functions acquire ShareRowExclusiveLock and not the usual
>> RowExclusiveLock because there were some worries about concurrency.
> 
> No, the problematic part is that there is any heap_open happening at
> all.  That open could very easily result in a recursive attempt to read
> pg_class, for example, which is going to be fatal if we're in the midst
> of vacuum full'ing or reindex'ing pg_class.  It's frankly astonishing
> to me that this patch seems to have survived testing under
> CLOBBER_CACHE_ALWAYS, because it's only the catalog caches that are
> preventing such recursive lookups.
> 

Hmm okay, so the solution is to either use standard dependency info for
this so that it's only called for tables that are actually know to be
subscribed or have some exceptions in the current code to call the
function only for user catalogs. Any preferences?

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
> On 31/03/17 20:23, Tom Lane wrote:
>> No, the problematic part is that there is any heap_open happening at
>> all.  That open could very easily result in a recursive attempt to read
>> pg_class, for example, which is going to be fatal if we're in the midst
>> of vacuum full'ing or reindex'ing pg_class.  It's frankly astonishing
>> to me that this patch seems to have survived testing under
>> CLOBBER_CACHE_ALWAYS, because it's only the catalog caches that are
>> preventing such recursive lookups.

> Hmm okay, so the solution is to either use standard dependency info for
> this so that it's only called for tables that are actually know to be
> subscribed or have some exceptions in the current code to call the
> function only for user catalogs. Any preferences?

Looking at dependency info isn't going to fix this, it only moves the
unsafe catalog access somewhere else (ie pg_depend instead of
pg_subscription_rel).  I suspect the only safe solution is doing an
IsCatalogRelation or similar test pretty early in the logical replication
code paths.
        regards, tom lane



Re: Somebody has not thought through subscription lockingconsiderations

From
Petr Jelinek
Date:
On 31/03/17 21:00, Tom Lane wrote:
> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
>> On 31/03/17 20:23, Tom Lane wrote:
>>> No, the problematic part is that there is any heap_open happening at
>>> all.  That open could very easily result in a recursive attempt to read
>>> pg_class, for example, which is going to be fatal if we're in the midst
>>> of vacuum full'ing or reindex'ing pg_class.  It's frankly astonishing
>>> to me that this patch seems to have survived testing under
>>> CLOBBER_CACHE_ALWAYS, because it's only the catalog caches that are
>>> preventing such recursive lookups.
> 
>> Hmm okay, so the solution is to either use standard dependency info for
>> this so that it's only called for tables that are actually know to be
>> subscribed or have some exceptions in the current code to call the
>> function only for user catalogs. Any preferences?
> 
> Looking at dependency info isn't going to fix this, it only moves the
> unsafe catalog access somewhere else (ie pg_depend instead of
> pg_subscription_rel).  I suspect the only safe solution is doing an
> IsCatalogRelation or similar test pretty early in the logical replication
> code paths.
> 

I don't follow, everything else does dependency info check in this
situation, how is this any different?

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
> On 31/03/17 21:00, Tom Lane wrote:
>> Looking at dependency info isn't going to fix this, it only moves the
>> unsafe catalog access somewhere else (ie pg_depend instead of
>> pg_subscription_rel).  I suspect the only safe solution is doing an
>> IsCatalogRelation or similar test pretty early in the logical replication
>> code paths.

> I don't follow, everything else does dependency info check in this
> situation, how is this any different?

What do you mean by "everything else"?  The key point here is that
access to the bootstrap catalogs like pg_class and pg_attribute can't
be dependent on accesses to other catalogs, or we get into circularities.
We certainly aren't trying to look in pg_depend when we do a heap_open.

(Or, if you tell me that we are now because the logical replication
patch added it, I'm going to start agitating for reverting the patch
and sending it back for redesign.)
        regards, tom lane



Re: Somebody has not thought through subscription lockingconsiderations

From
Petr Jelinek
Date:
On 01/04/17 00:52, Tom Lane wrote:
> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
>> On 31/03/17 21:00, Tom Lane wrote:
>>> Looking at dependency info isn't going to fix this, it only moves the
>>> unsafe catalog access somewhere else (ie pg_depend instead of
>>> pg_subscription_rel).  I suspect the only safe solution is doing an
>>> IsCatalogRelation or similar test pretty early in the logical replication
>>> code paths.
> 
>> I don't follow, everything else does dependency info check in this
>> situation, how is this any different?
> 
> What do you mean by "everything else"?  The key point here is that
> access to the bootstrap catalogs like pg_class and pg_attribute can't
> be dependent on accesses to other catalogs, or we get into circularities.
> We certainly aren't trying to look in pg_depend when we do a heap_open.
> 
> (Or, if you tell me that we are now because the logical replication
> patch added it, I'm going to start agitating for reverting the patch
> and sending it back for redesign.)

But the pg_subscription_rel is also not accessed on heap_open, the
problematic code is called from heap_drop_with_catalog. And VACUUM FULL
pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
goes through performDeletion so through dependency info which is what I
mean by everything else does this).

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
> But the pg_subscription_rel is also not accessed on heap_open, the
> problematic code is called from heap_drop_with_catalog. And VACUUM FULL
> pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
> goes through performDeletion so through dependency info which is what I
> mean by everything else does this).

Hmmm.  What the heap_drop_with_catalog call is being applied to is a
short-lived table that is not pg_class.  It happens to own the disk
storage that used to belong to pg_class, but it is not pg_class;
in particular it doesn't have the same OID, and it is not what would
be looked in if someone happened to need to fetch a pg_class row
at that point.  So there's no circularity involved.

However ... by that same token, it ought to be okay to consult
pg_subscription_rel during that drop step.  Indeed, if we put
an IsCatalogRelation check in there, it wouldn't fire, because
the target table has OID 16409 according to your stack trace.
So that line of thought is indeed not very helpful.

On further reflection it seems like you were right, the problem is
taking a self-exclusive lock on pg_subscription_rel during low-level
catalog operations.  We're going to have to find a way to reduce that
lock strength, or we're going to have a lot of deadlock problems.
        regards, tom lane



Re: Somebody has not thought through subscription lockingconsiderations

From
Petr Jelinek
Date:
On 01/04/17 01:20, Tom Lane wrote:
> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
>> But the pg_subscription_rel is also not accessed on heap_open, the
>> problematic code is called from heap_drop_with_catalog. And VACUUM FULL
>> pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
>> goes through performDeletion so through dependency info which is what I
>> mean by everything else does this).
> 
> Hmmm.  What the heap_drop_with_catalog call is being applied to is a
> short-lived table that is not pg_class.  It happens to own the disk
> storage that used to belong to pg_class, but it is not pg_class;
> in particular it doesn't have the same OID, and it is not what would
> be looked in if someone happened to need to fetch a pg_class row
> at that point.  So there's no circularity involved.
> 
> On further reflection it seems like you were right, the problem is
> taking a self-exclusive lock on pg_subscription_rel during low-level
> catalog operations.  We're going to have to find a way to reduce that
> lock strength, or we're going to have a lot of deadlock problems.
> 

Well we have heavy lock because we were worried about failure scenarios
in our dump upsert in SetSubscriptionRelState which does cache lookup
and if it finds something it updates, otherwise inserts. We have similar
pattern in other places but they are called from DDL statements usually
so the worst that can happen is DDL fails with weird errors when done
concurrently with same name so using RowExclusiveLock is okay.

That being said, looking at use-cases for SetSubscriptionRelState that's
basically CREATE SUBSCRIPTION, ALTER SUBSCRIPTION REFRESH and tablesync
worker. So the DDL thing applies to first ones as well and tablesync
should not be running in case the record does not exist so it's fine if
it fails. In terms of RemoveSubscriptionRel that's only called from
heap_drop_with_catalog and tablesync holds relation lock so there is no
way heap_drop_with_catalog will happen on the same relation. This leads
me to thinking that RowExclusiveLock is fine for both
SetSubscriptionRelState and RemoveSubscriptionRel as long as we document
that callers should be aware that SetSubscriptionRelState has
concurrency might fail on unique index check.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: Somebody has not thought through subscription lockingconsiderations

From
Petr Jelinek
Date:
On 01/04/17 01:57, Petr Jelinek wrote:
> On 01/04/17 01:20, Tom Lane wrote:
>> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
>>> But the pg_subscription_rel is also not accessed on heap_open, the
>>> problematic code is called from heap_drop_with_catalog. And VACUUM FULL
>>> pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
>>> goes through performDeletion so through dependency info which is what I
>>> mean by everything else does this).
>>
>> Hmmm.  What the heap_drop_with_catalog call is being applied to is a
>> short-lived table that is not pg_class.  It happens to own the disk
>> storage that used to belong to pg_class, but it is not pg_class;
>> in particular it doesn't have the same OID, and it is not what would
>> be looked in if someone happened to need to fetch a pg_class row
>> at that point.  So there's no circularity involved.
>>
>> On further reflection it seems like you were right, the problem is
>> taking a self-exclusive lock on pg_subscription_rel during low-level
>> catalog operations.  We're going to have to find a way to reduce that
>> lock strength, or we're going to have a lot of deadlock problems.
>>
> 
> Well we have heavy lock because we were worried about failure scenarios
> in our dump upsert in SetSubscriptionRelState which does cache lookup

*dumb (ie, like me ;) )


--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: Somebody has not thought through subscription lockingconsiderations

From
Petr Jelinek
Date:
On 01/04/17 01:57, Petr Jelinek wrote:
> On 01/04/17 01:20, Tom Lane wrote:
>> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes:
>>> But the pg_subscription_rel is also not accessed on heap_open, the
>>> problematic code is called from heap_drop_with_catalog. And VACUUM FULL
>>> pg_class calls heap_drop_with_catalog() when doing the heap swap (and it
>>> goes through performDeletion so through dependency info which is what I
>>> mean by everything else does this).
>>
>> Hmmm.  What the heap_drop_with_catalog call is being applied to is a
>> short-lived table that is not pg_class.  It happens to own the disk
>> storage that used to belong to pg_class, but it is not pg_class;
>> in particular it doesn't have the same OID, and it is not what would
>> be looked in if someone happened to need to fetch a pg_class row
>> at that point.  So there's no circularity involved.
>>
>> On further reflection it seems like you were right, the problem is
>> taking a self-exclusive lock on pg_subscription_rel during low-level
>> catalog operations.  We're going to have to find a way to reduce that
>> lock strength, or we're going to have a lot of deadlock problems.
>>
> 
> Well we have heavy lock because we were worried about failure scenarios
> in our dumb upsert in SetSubscriptionRelState which does cache lookup
> and if it finds something it updates, otherwise inserts. We have similar
> pattern in other places but they are called from DDL statements usually
> so the worst that can happen is DDL fails with weird errors when done
> concurrently with same name so using RowExclusiveLock is okay.
> 
> That being said, looking at use-cases for SetSubscriptionRelState that's
> basically CREATE SUBSCRIPTION, ALTER SUBSCRIPTION REFRESH and tablesync
> worker. So the DDL thing applies to first ones as well and tablesync
> should not be running in case the record does not exist so it's fine if
> it fails. In terms of RemoveSubscriptionRel that's only called from
> heap_drop_with_catalog and tablesync holds relation lock so there is no
> way heap_drop_with_catalog will happen on the same relation. This leads
> me to thinking that RowExclusiveLock is fine for both
> SetSubscriptionRelState and RemoveSubscriptionRel as long as we document
> that callers should be aware that SetSubscriptionRelState has
> concurrency issues and fail on unique index check.
> 

And a simple patch to do so. Peter do you see any problem with doing this?

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Attachment