Thread: group locking: incomplete patch, just for discussion

group locking: incomplete patch, just for discussion

From: Robert Haas
For parallelism, I think we need a concept of group locking.  That is,
suppose we have a user backend and N worker backends collaborating to
execute some query.  For the sake of argument, let's say it's a
parallel CLUSTER, requiring a full table lock.  We need all of the
backends to be able to lock the table even though at least one of them
holds AccessExclusiveLock.  This suggests that the backends should all
be members of a locking group, and that locks within the same locking
group should be regarded as mutually non-conflicting.
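
The rule sketched above can be illustrated with a toy model (Python pseudocode, not the actual C patch; the two-mode conflict table, integer backend ids, and the `groups` map are invented for illustration):

```python
# Hypothetical sketch: a lock conflict check that treats requests from
# members of the same lock group as mutually non-conflicting.

ACCESS_SHARE, ACCESS_EXCLUSIVE = "AccessShareLock", "AccessExclusiveLock"

# True means the two modes conflict (only two of the real lock modes shown).
CONFLICTS = {
    (ACCESS_SHARE, ACCESS_SHARE): False,
    (ACCESS_SHARE, ACCESS_EXCLUSIVE): True,
    (ACCESS_EXCLUSIVE, ACCESS_SHARE): True,
    (ACCESS_EXCLUSIVE, ACCESS_EXCLUSIVE): True,
}

def conflicts(requestor, mode, holders, groups):
    """holders: list of (pid, held_mode); groups: pid -> group leader pid."""
    my_group = groups.get(requestor, requestor)
    for pid, held_mode in holders:
        if groups.get(pid, pid) == my_group:
            continue  # same lock group: never a conflict
        if CONFLICTS[(mode, held_mode)]:
            return True
    return False
```

So in the parallel CLUSTER case, a worker can take AccessExclusiveLock on the table even while the leader already holds it, while an unrelated backend still blocks.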

At least to me, that simple scenario is clear-cut[1], but what do we
do in more complicated situations?  For example, suppose backends A
and B are members of the same locking group.  A locks a relation with
AccessShareLock, an unrelated process X queues up waiting for an
AccessExclusiveLock, and then B also requests AccessShareLock.  The
normal behavior here is that B should wait for X to go first, but here
that's a problem.  If A is the user backend and B is a worker backend,
A will eventually wait for B, which is waiting for X, which is waiting
for A: deadlock.  If it's the other way around, we might survive, but
at the very best it's undesirable to have "parallelism" where A and B
are actually completely serialized, with X taking its turn in the
middle.  Now, in practical cases, we probably expect that if A and B
lock the same relation, they're going to do so in very quick
succession, so the actual behavior of the lock manager shouldn't
matter much in terms of fairness.  But since we can't predict what
other processes may do in the window between those lock acquisitions,
a sound theoretical approach is needed to avoid deadlocks.

After some thought I've come to the conclusion that we should assume
that every process in a locking group will eventually wait for every
other process in that locking group.  It follows that once any single
process in a locking group obtains any lock on an object, any future
lock request by another lock group member on that object should, if
non-conflicting, be granted immediately.  It also follows that, when
processes in a locking group are waiting for a lock on an object, none
should be awoken until all of the group's pending lock requests can be
granted.  For
example, suppose process X holds AccessExclusiveLock on a relation.
First, process Y waits for AccessShareLock.  Then, processes A and B,
members of the same locking group, queue up respectively for
AccessShareLock and AccessExclusiveLock, in that order.  Next, process
X releases its lock.  At this point, we should wake Y *but not A*; A
should continue to wait until we can grant locks to both A and B.  The
reason is that a new process Q might come along and, for purposes of
deadlock avoidance, we might need to reorder the lock queue.  Putting
Q ahead of both A and B is fine; but putting it between A and B is bad
for the reasons discussed in the previous paragraph.
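
The X/Y/A/B example can be simulated the same way (again just a toy sketch; real queue-fairness rules, such as not scanning past a blocked waiter, are deliberately simplified here):

```python
ASL, AEL = "AccessShareLock", "AccessExclusiveLock"
CONFLICT = {(ASL, ASL): False, (ASL, AEL): True,
            (AEL, ASL): True, (AEL, AEL): True}

def wake_waiters(queue, groups):
    """queue: waiters as (pid, mode) in arrival order; groups: pid -> leader.
    Returns the pids to wake now.  A lock-group member is woken only when
    every queued request from its group is grantable at the same time."""
    granted, results, decided = [], [], set()

    def grantable(pid, mode):
        g = groups.get(pid, pid)
        for hpid, hmode in granted:
            if groups.get(hpid, hpid) == g:
                continue  # same lock group never conflicts
            if CONFLICT[(mode, hmode)]:
                return False
        return True

    for pid, mode in queue:
        g = groups.get(pid, pid)
        if g in decided:
            continue  # this group was already considered in this pass
        decided.add(g)
        batch = [(p, m) for p, m in queue if groups.get(p, p) == g]
        if all(grantable(p, m) for p, m in batch):
            granted.extend(batch)
            results.extend(p for p, _ in batch)
    return results
```

With X's lock released and Y, A, B queued as in the example, only Y is woken; once Y is gone as well, A and B wake together.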

Attached is an incomplete draft patch.  It modifies
LockAcquireExtended() and ProcSleep() but not ProcWakeup() or the
deadlock detector.  Aside from being incomplete, which is a problem in
and of itself, I don't have a very good idea how to comprehensively
test something like this.  I can of course cobble together some test
code for particular scenarios, but it's not at all clear to me what a
comprehensive test plan would look like.  Any ideas on that point
would be particularly appreciated, as would feedback on the overall
design.   A lot more work is clearly needed here, but I've been known
to advocate soliciting feedback on proposed patches early, so here I
am taking my own advice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] You could argue that one process instead should take locks on
behalf of the group.  But I think this is a terrible approach.  First,
it means that whichever process is holding the locks, presumably the
user backend, had better not try to do any real computation of its
own, so that it can timely service lock requests, which seems like a
huge loss of efficiency; if the optimum degree of parallelism is 2,
we'll need 50% more processes and the whole result set will have to be
copied an extra time.  Yuck.  Second, it means that the lock-holding
process must refuse to die until all of the processes on behalf of
which it is holding locks don't need them any more.  And I hate
backend processes that refuse to die with a fiery passion that cannot
be quenched.


Re: group locking: incomplete patch, just for discussion

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> For parallelism, I think we need a concept of group locking.  That is,
> suppose we have a user backend and N worker backends collaborating to
> execute some query.  For the sake of argument, let's say it's a
> parallel CLUSTER, requiring a full table lock.  We need all of the
> backends to be able to lock the table even though at least one of them
> holds AccessExclusiveLock.  This suggests that the backends should all
> be members of a locking group, and that locks within the same locking
> group should be regarded as mutually non-conflicting.

In the background worker case, I imagined that the foreground process
would hold a lock and the background processes would just assume they
could access the table without holding locks of their own.  Aren't
you building a mountain where a molehill would do?

> [1] You could argue that one process instead should take locks on
> behalf of the group.  But I think this is a terrible approach.  First,
> it means that whichever process is holding the locks, presumably the
> user backend, had better not try to do any real computation of its
> own, so that it can timely service lock requests, which seems like a
> huge loss of efficiency;

What is "timely service lock requests"?  You got the lock before firing
off the background workers, you hold it till they're done.

This is a truly horrible excuse for the amount of complexity (and,
without doubt, bugs and performance penalties) you propose to load onto
the lock mechanism.
        regards, tom lane



Re: group locking: incomplete patch, just for discussion

From: Robert Haas
On Wed, Oct 15, 2014 at 12:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What is "timely service lock requests"?  You got the lock before firing
> off the background workers, you hold it till they're done.

If you want to run any non-trivial (read: useful) code in the
background workers, it rapidly gets very hard to predict which
relation locks they might need, and infeasible to hold them for the
lifetime of the computation.  For example, suppose we restrict
background workers to running only operators found in btree opclasses.
That's a far more draconian restriction than we'd like to have, but
perhaps liveable in a pinch.  But bttextcmp() uses
PG_GETARG_TEXT_PP(), which means it may (or may not) need to lock the
TOAST table; and it can indirectly call lookup_collation_cache() which
does SearchSysCache1(COLLOID, ...) which may result in scanning
pg_collation.  And enum_cmp_internal() will do
SearchSysCache1(ENUMOID, ...) which may result in scanning pg_enum.

There's obviously a balance to be struck, here.  It will never be safe
to run just anything in a background worker, and we'll go crazy if we
try to make that work.  On the other hand, if there's essentially no
code that can run in a background worker without extensive
modification, we might as well not bother implementing parallelism in
the first place.  I admit that getting group locking is far from
simple, but I also don't believe it's as complicated as previous
projects like SSI, logical decoding, or LATERAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From: Simon Riggs
On 15 October 2014 05:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> For parallelism, I think we need a concept of group locking.  That is,
>> suppose we have a user backend and N worker backends collaborating to
>> execute some query.  For the sake of argument, let's say it's a
>> parallel CLUSTER, requiring a full table lock.  We need all of the
>> backends to be able to lock the table even though at least one of them
>> holds AccessExclusiveLock.  This suggests that the backends should all
>> be members of a locking group, and that locks within the same locking
>> group should be regarded as mutually non-conflicting.
>
> In the background worker case, I imagined that the foreground process
> would hold a lock and the background processes would just assume they
> could access the table without holding locks of their own.  Aren't
> you building a mountain where a molehill would do?

Yeh. Locks should be made in the name of the main transaction and
released by it.

When my family goes to a restaurant, any member of the party may ask
for a table and the request is granted for the whole family. But the
lock is released only when I pay the bill. Once we have the table, any
stragglers know we have locked the table and they just come sit at the
table without needing to make their own lock request to the Maitre D',
though they clearly cache the knowledge that we have the table locked.

So all lock requests held until EOX should be made in the name of the
top level process. Any child process wanting a lock should request it,
but on discovering it is already held at parent level should just
update the local lock table. Transient locks, like catalog locks, can
be made and released locally; I think there is more detail there but
it shouldn't affect the generalisation.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From: Robert Haas
On Wed, Oct 15, 2014 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 15 October 2014 05:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> For parallelism, I think we need a concept of group locking.  That is,
>>> suppose we have a user backend and N worker backends collaborating to
>>> execute some query.  For the sake of argument, let's say it's a
>>> parallel CLUSTER, requiring a full table lock.  We need all of the
>>> backends to be able to lock the table even though at least one of them
>>> holds AccessExclusiveLock.  This suggests that the backends should all
>>> be members of a locking group, and that locks within the same locking
>>> group should be regarded as mutually non-conflicting.
>>
>> In the background worker case, I imagined that the foreground process
>> would hold a lock and the background processes would just assume they
>> could access the table without holding locks of their own.  Aren't
>> you building a mountain where a molehill would do?
>
> Yeh. Locks should be made in the name of the main transaction and
> released by it.
>
> When my family goes to a restaurant, any member of the party may ask
> for a table and the request is granted for the whole family. But the
> lock is released only when I pay the bill. Once we have the table, any
> stragglers know we have locked the table and they just come sit at the
> table without needing to make their own lock request to the Maitre D',
> though they clearly cache the knowledge that we have the table locked.
>
> So all lock requests held until EOX should be made in the name of the
> top level process. Any child process wanting a lock should request it,
> but on discovering it is already held at parent level should just
> update the local lock table. Transient locks, like catalog locks, can
> be made and released locally; I think there is more detail there but
> it shouldn't affect the generalisation.

Hmm, interesting idea.  Suppose, though, that the child process
requests a lock that can't immediately be granted, because the catalog
it's trying to access is locked in AccessExclusiveLock mode by an
unrelated transaction.  The unrelated transaction, in turn, is blocked
trying to acquire some resource held by the top-level parallel
process.  Assuming the top-level parallel process is waiting for
the child (or will eventually wait), this is a deadlock, but without
some modification to the deadlock detector, it can't see one of the
edges.

Figuring out what to do about that is really the heart of this
project, I think, and there are a variety of designs possible.  One of
the early ideas that I had was to have the parallel workers directly
twaddle the main process's PGPROC and lock table state.  In other
words, instead of taking locks using their own PGPROCs, everybody uses
a single PGPROC.  I made several attempts at getting designs along
these lines off the ground, but it got complicated and invasive: (1)
The processes need to coordinate to make sure that you don't have two
people twaddling the lock state at the same time; (2) The existing
data structures won't support more than one process waiting at a time,
but there's no reason why one parallel worker couldn't be trying to
lock one catalog while another one is trying to lock a different
catalog; (3) On a related note, when a lock wait ends, you can't just
wake up the process that owns the PGPROC, but rather the one that's
actually waiting; (4) the LWLockReleaseAll algorithm just falls apart
in this environment, as far as I can see.

The alternative design which I've been experimenting with is to have
each process use its own PGPROC and PROCLOCK structures, but to tag
each PROCLOCK with not only the owning PGPROC but also the group
leader's PGPROC.  This has not been entirely smooth sailing, but it
seems to break much less code than trying to have everybody use one
PGPROC.  Most of the changes that seem to be needed to make things
work are pretty well-isolated; rather than totally rearranging the
lock manager, you're just adding extra code that runs only in the
parallel case.
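
A minimal sketch of that tagging (a hypothetical Python analogue of the C structures; the field names are illustrative, not taken from the patch):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcLock:
    """Toy analogue of a PROCLOCK entry, tagged with the group leader."""
    owner: int         # backend that actually acquired the lock
    group_leader: int  # leader's backend id; equals owner when ungrouped
    mode: str

def group_exempt(a: ProcLock, b: ProcLock) -> bool:
    # Entries tagged with the same leader are treated as non-conflicting;
    # otherwise the ordinary conflict table would still be consulted.
    return a.group_leader == b.group_leader
```

Because each worker keeps its own entry, each worker can wait independently, which sidesteps problems (2) and (3) above.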

I'm definitely open to the idea that there's a better, simpler design
out there, but I haven't been able to think of one that doesn't break
deadlock detection.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From: Simon Riggs
On 15 October 2014 14:46, Robert Haas <robertmhaas@gmail.com> wrote:

>> When my family goes to a restaurant, any member of the party may ask
>> for a table and the request is granted for the whole family. But the
>> lock is released only when I pay the bill. Once we have the table, any
>> stragglers know we have locked the table and they just come sit at the
>> table without needing to make their own lock request to the Maitre D',
>> though they clearly cache the knowledge that we have the table locked.

> Hmm, interesting idea.  Suppose, though, that the child process
> requests a lock that can't immediately be granted, because the catalog
> it's trying to access is locked in AccessExclusiveLock mode by an
> unrelated transaction.  The unrelated transaction, in turn, is blocked
> trying to acquire some resource held by the top-level parallel
> process.  Assuming the top-level parallel process is waiting for
> the child (or will eventually wait), this is a deadlock, but without
> some modification to the deadlock detector, it can't see one of the
> edges.

Family disputes are fairly easily resolved ;-)

The first and basic point is that in most cases the parent should
already hold the required locks. This can only happen for briefly held
locks and/or more complex stuff. In the first case, getting
parallelism to work without that complex stuff would be useful. I'd be
happy if the first version simply throws an error if a child can't
acquire a lock immediately. Don't overthink the first version. Knowing
you'll disagree, let's take a further step...

Second point, the relationship between parent and children is clear.
If we do a deadlock detection, we should be able to search for that as
a special case, since we will know that we are a child and that such a
situation might occur. So just add in an edge so the rest of the
deadlock code works fine.

If that doesn't work, use a heuristic. If the parent is waiting when the
child does the deadlock test, assume it's a deadlock and abort the child
speculatively just in case. You can work out how to do that better in
the future, since it won't happen that often.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From: Robert Haas
On Wed, Oct 15, 2014 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 15 October 2014 14:46, Robert Haas <robertmhaas@gmail.com> wrote:
>>> When my family goes to a restaurant, any member of the party may ask
>>> for a table and the request is granted for the whole family. But the
>>> lock is released only when I pay the bill. Once we have the table, any
>>> stragglers know we have locked the table and they just come sit at the
>>> table without needing to make their own lock request to the Maitre D',
>>> though they clearly cache the knowledge that we have the table locked.
>
>> Hmm, interesting idea.  Suppose, though, that the child process
>> requests a lock that can't immediately be granted, because the catalog
>> it's trying to access is locked in AccessExclusiveLock mode by an
>> unrelated transaction.  The unrelated transaction, in turn, is blocked
>> trying to acquire some resource held by the top-level parallel
>> process.  Assuming the top-level parallel process is waiting for
>> the child (or will eventually wait), this is a deadlock, but without
>> some modification to the deadlock detector, it can't see one of the
>> edges.
>
> Family disputes are fairly easily resolved ;-)
>
> The first and basic point is that in most cases the parent should
> already hold the required locks. This can only happen for briefly held
> locks and/or more complex stuff. In the first case, getting
> parallelism to work without that complex stuff would be useful. I'd be
> happy if the first version simply throws an error if a child can't
> acquire a lock immediately. Don't overthink the first version. Knowing
> you'll disagree, let's take a further step...

Well, I'm fervently in agreement with you on one point: the first
version of all this needs to be as simple as possible, or the time to
get to the first version will be longer than we can afford to wait.  I
think what we're discussing here is which things are important enough
that it makes sense to have them in the first version, and which
things can wait.  I also think we are in agreement that at least SOME
thought about lock management is needed; the question we're trying to
hash out is whether what I'm proposing to try to do here is
significantly more ambitious than what's really necessary for V1.
Which is a good question.

> Second point, the relationship between parent and children is clear.
> If we do a deadlock detection, we should be able to search for that as
> a special case, since we will know that we are a child and that such a
> situation might occur. So just add in an edge so the rest of the
> deadlock code works fine.
>
> If that doesn't work, use a heuristic. If the parent is waiting when the
> child does the deadlock test, assume it's a deadlock and abort the child
> speculatively just in case. You can work out how to do that better in
> the future, since it won't happen that often.

Well, the deadlock checker needs to work not only when one of the
parallel processes invokes it, but also when somebody else invokes it.
For example, suppose we have parallel processes A and B.  A holds lock
1 and awaits lock 2, while B awaits lock 3.  No problem!  Now
process X, which is holding lock 2, tries to grab lock 1.  X must be
able to detect the deadlock.  Now, that's not necessarily a problem
with what you said: it just means that the parent-child relationship
has to be clear from the contents of shared memory.

Which is pretty much the direction I'm aiming with the incomplete
patch I posted.  For the sake of simplicity, I want to assume that
every process in a locking group waits for every other process in the
same locking group, even though that might not be technically true in
every case.   If the parent is waiting for a lock and the guy who has
that lock is waiting for a parallel worker in the same group, my plan
(which I think matches your suggestion) is to call that a deadlock.
There are a few sticky points here: sometimes, the deadlock checker
doesn't actually abort transactions, but just rearranges the wait
queue, so something at least somewhat sensible needs to happen in that
case.
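
The "assume every group member waits for every other" rule makes that search tractable: add a soft wait-for edge between each pair of group members, then look for a cycle. A toy sketch (Python; single hard edge per process, whereas the real checker handles multiple edges and queue rearrangement):

```python
def has_deadlock(waits_for, groups):
    """waits_for: pid -> pid it is hard-blocked on (at most one each here).
    groups: pid -> lock-group leader.  Soft edges are added between all
    members of a group before searching for a cycle."""
    edges = {}
    for waiter, holder in waits_for.items():
        edges.setdefault(waiter, set()).add(holder)
    members = {}
    for pid, leader in groups.items():
        members.setdefault(leader, []).append(pid)
    for group in members.values():
        for a in group:
            for b in group:
                if a != b:
                    edges.setdefault(a, set()).add(b)

    def blocked_on(start):  # every pid transitively reachable from start
        seen, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            stack.extend(edges.get(n, ()))
        return seen

    return any(node in blocked_on(nbr)
               for node, nbrs in edges.items() for nbr in nbrs)
```

In the parent/child/X scenario from earlier in the thread, the soft parent-to-child edge is exactly what closes the cycle; drop the grouping and no cycle exists.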

The thing that I'm aiming to do in the patch which I think you are
suggesting might not be necessary is to make it possible for the child
go ahead and request AccessShareLock on the scan relation even though
the parent might already hold some other lock (perhaps even
AccessExclusiveLock).   I want to make the lock manager smart enough
to recognize that those locks are mutually non-conflicting because of
the fact that the two processes are in close cooperation.  Clearly, we
could get by without that, but it adds complexity in other places: the
parent has to never release its lock (even if killed) until the child
is done with the relation; and the scan code itself needs to be
conditional, passing NoLock from children and some other mode in the
parent.  That's all manageable, but it looks to me like doing the
necessary surgery on the lock manager isn't actually going to be that
hard; most of the necessary logic is present in draft form in the
patch already, and it's not that complex.

In any design, the hard part, at least as far as I can see, is making
the deadlock detector work reliably.  I agree with you that it's OK
if, in some unlikely situation, it occasionally diagnoses a deadlock
between parallel processes where perhaps there isn't one.  But it's
not OK, at least in my opinion, if it ever fails to detect a real
deadlock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From: Simon Riggs
On 15 October 2014 17:03, Robert Haas <robertmhaas@gmail.com> wrote:

> Well, I'm fervently in agreement with you on one point: the first
> version of all this needs to be as simple as possible, or the time to
> get to the first version will be longer than we can afford to wait.  I
> think what we're discussing here is which things are important enough
> that it makes sense to have them in the first version, and which
> things can wait.

I'm guessing we might differ slightly on what constitutes "as simple as possible".

Something usable, with severe restrictions, is actually better than we
have now. I understand the journey this work represents, so don't be
embarrassed by submitting things with heuristics and good-enoughs in
it. Our mentor, Mr. Lane, achieved much by spreading work over many
releases, leaving others to join in the task.

Might I gently enquire what "something usable" we are going to see
in this release? I'm not up on current plans.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From: Robert Haas
On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 15 October 2014 17:03, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, I'm fervently in agreement with you on one point: the first
>> version of all this needs to be as simple as possible, or the time to
>> get to the first version will be longer than we can afford to wait.  I
>> think what we're discussing here is which things are important enough
>> that it makes sense to have them in the first version, and which
>> things can wait.
>
> I'm guessing we might differ slightly on what constitutes "as simple as possible".

Yes, I believe there have been occasions in the past when that has
happened, so definitely possible.  :-)

> Something usable, with severe restrictions, is actually better than we
> have now. I understand the journey this work represents, so don't be
> embarrassed by submitting things with heuristics and good-enoughs in
> it. Our mentor, Mr. Lane, achieved much by spreading work over many
> releases, leaving others to join in the task.
>
> Might I gently enquire what the "something usable" we are going to see
> in this release? I'm not up on current plans.

I don't know how far I'm going to get for this release yet.  I think
pg_background is a pretty good milestone, and useful in its own right.
I would like to get something that's truly parallel working sooner
rather than later, but this group locking issue is one of 2 or 3
significant hurdles that I need to climb over first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From: Simon Riggs
On 16 October 2014 16:22, Robert Haas <robertmhaas@gmail.com> wrote:

>> Might I gently enquire what "something usable" we are going to see
>> in this release? I'm not up on current plans.
>
> I don't know how far I'm going to get for this release yet.  I think
> pg_background is a pretty good milestone, and useful in its own right.
> I would like to get something that's truly parallel working sooner
> rather than later, but this group locking issue is one of 2 or 3
> significant hurdles that I need to climb over first.

pg_background is very cute, but it's not really a step forward, or at
least not very far. It's sounding like you've already decided that
is as far as we're going to get this release, which I'm disappointed
about.

Given your description of pg_background it looks an awful lot like
infrastructure to make Autonomous Transactions work, but it doesn't
even do that. I guess it could do in a very small additional patch, so
maybe it is useful for something.

You asked for my help, but I'd like to see some concrete steps towards
an interim feature so I can see some benefit in a clear direction.

Can we please have the first step we discussed? Parallel CREATE INDEX?
(Note the please)

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From: Jim Nasby
On 10/28/14, 3:48 PM, Simon Riggs wrote:
> Given your description of pg_background it looks an awful lot like
> infrastructure to make Autonomous Transactions work, but it doesn't
> even do that. I guess it could do in a very small additional patch, so
> maybe it is useful for something.

What do you see as being missing for autonomous transactions?

BTW, what I think would make this feature VERY useful is if it provided
the ability to fire something up in another backend and leave it running
in the background. I think you can do that with FDW, but then you have
the authentication PITA to deal with (and perhaps pg_background is a
more efficient way to move data between backends than FDW, but that's
just a guess...)
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: group locking: incomplete patch, just for discussion

From: Robert Haas
On Tue, Oct 28, 2014 at 4:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 16 October 2014 16:22, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Might I gently enquire what "something usable" we are going to see
>>> in this release? I'm not up on current plans.
>>
>> I don't know how far I'm going to get for this release yet.  I think
>> pg_background is a pretty good milestone, and useful in its own right.
>> I would like to get something that's truly parallel working sooner
>> rather than later, but this group locking issue is one of 2 or 3
>> significant hurdles that I need to climb over first.
>
> pg_background is very cute, but it's not really a step forward, or at
> least not very far. It's sounding like you've already decided that
> is as far as we're going to get this release, which I'm disappointed
> about.
>
> Given your description of pg_background it looks an awful lot like
> infrastructure to make Autonomous Transactions work, but it doesn't
> even do that. I guess it could do in a very small additional patch, so
> maybe it is useful for something.
>
> You asked for my help, but I'd like to see some concrete steps towards
> an interim feature so I can see some benefit in a clear direction.
>
> Can we please have the first step we discussed? Parallel CREATE INDEX?
> (Note the please)

What I've been thinking about trying to work towards is parallel
sequential scan.  I think that it would actually be pretty easy to
code up a mostly-working version using the existing infrastructure,
but the patch would be rejected with a bazooka, because the
non-working parts would include things like:

1. The cooperating backends might not all be using the same snapshot,
because that requires sharing the snapshot, combo CID hash, and
transaction state.
2. The quals that got pushed down to the workers might not return the
same answers that they would have produced with a single backend,
because we have no mechanism for assessing pushdown-safety.
3. Deadlock detection would be to some greater or lesser degree
broken, the details depending on the implementation choices you made.

There is a bit of a chicken-and-egg problem here.  If I submit a patch
for parallel sequential scan, it'll (justifiably) get rejected because
it doesn't solve those problems.  So I'm trying to solve those
above-enumerated problems first, with working and at least
somewhat-useful examples that show how the incremental bits of
infrastructure can be used to do stuff.  But that leads to your
(understandable) complaint that this isn't very real yet.

Why am I now thinking about parallel sequential scan instead of
parallel CREATE INDEX?  You may remember that I posted a patch for a
new memory allocator some time ago, and it came in for a fair amount
of criticism and not much approbation.   Some of that criticism was
certainly justified, and perhaps I was as hard on myself as anyone
else was.  However you want to look at it, I see the trade-off between
parallel sort and parallel seq-scan this way: parallel seq-scan
requires dealing with the planner (ouch!) but parallel sort requires
dealing with memory allocation in dynamic shared memory segments
(ouch!).  Both of them require solving the three problems listed
above.

And maybe a few others, but I think those are the big ones - and I
think proper deadlock detection is the hardest of them.  A colleague
of mine has drafted patches for sharing snapshots and combo CIDs
between processes, and as you might expect that's pretty easy.
Sharing the transaction state (so we can test whether a transaction ID
is "our" transaction ID inside the worker) is a bit trickier, but I
think not too hard.  Assessing pushdown-safety will probably boil down
to adding some equivalent of proisparallel.  Maybe not the most
elegant, but defensible, and if you're looking for the shortest path
to something usable, that's probably it.  But deadlock detection ...
well, I don't see any simpler solution than what I'm trying to build
here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From: Robert Haas
On Tue, Oct 28, 2014 at 7:22 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/28/14, 3:48 PM, Simon Riggs wrote:
>> Given your description of pg_background it looks an awful lot like
>> infrastructure to make Autonomous Transactions work, but it doesn't
>> even do that. I guess it could do in a very small additional patch, so
>> maybe it is useful for something.
>
> What do you see as being missing for autonomous transactions?

Personally, I don't see this patch set as having much to do with real
autonomous transactions.

> BTW, what I think would make this feature VERY useful is if it provided the
> ability to fire something up in another backend and leave it running in the
> background.

You *can* do that.  I mean, long-running transactions will have their
usual problems, but if you want to kick off a long-running (or
short-running) query in the background and forget about it, this patch
lets you do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Tue, Oct 28, 2014 at 7:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 28, 2014 at 7:22 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 10/28/14, 3:48 PM, Simon Riggs wrote:
>>> Given your description of pg_background it looks an awful lot like
>>> infrastructure to make Autonomous Transactions work, but it doesn't
>>> even do that. I guess it could do in a very small additional patch, so
>>> maybe it is useful for something.
>>
>> What do you see as being missing for autonomous transactions?
>
> Personally, I don't see this patch set as having much to do with real
> autonomous transactions.
>
>> BTW, what I think would make this feature VERY useful is if it provided the
>> ability to fire something up in another backend and leave it running in the
>> background.
>
> You *can* do that.  I mean, long-running transactions will have their
> usual problems, but if you want to kick off a long-running (or
> short-running) query in the background and forget about it, this patch
> lets you do that.

Err, sorry.  pg_background lets you do that, not this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 28 October 2014 23:24, Robert Haas <robertmhaas@gmail.com> wrote:

>> You asked for my help, but I'd like to see some concrete steps towards
>> an interim feature so I can see some benefit in a clear direction.
>>
>> Can we please have the first step we discussed? Parallel CREATE INDEX?
>> (Note the please)
>
> What I've been thinking about trying to work towards is parallel
> sequential scan.  I think that it would actually be pretty easy to
> code up a mostly-working version using the existing infrastructure,
> but the patch would be rejected with a bazooka, because the
> non-working parts would include things like:
>
> 1. The cooperating backends might not all be using the same snapshot,
> because that requires sharing the snapshot, combo CID hash, and
> transaction state.
> 2. The quals that got pushed down to the workers might not return the
> same answers that they would have produced with a single backend,
> because we have no mechanism for assessing pushdown-safety.
> 3. Deadlock detection would be to some greater or lesser degree
> broken, the details depending on the implementation choices you made.
>
> There is a bit of a chicken-and-egg problem here.  If I submit a patch
> for parallel sequential scan, it'll (justifiably) get rejected because
> it doesn't solve those problems.  So I'm trying to solve those
> above-enumerated problems first, with working and at least
> somewhat-useful examples that show how the incremental bits of
> infrastructure can be used to do stuff.  But that leads to your
> (understandable) complaint that this isn't very real yet.
>
> Why am I now thinking about parallel sequential scan instead of
> parallel CREATE INDEX?  You may remember that I posted a patch for a
> new memory allocator some time ago, and it came in for a fair amount
> of criticism and not much approbation.   Some of that criticism was
> certainly justified, and perhaps I was as hard on myself as anyone
> else was.  However you want to look at it, I see the trade-off between
> parallel sort and parallel seq-scan this way: parallel seq-scan
> requires dealing with the planner (ouch!) but parallel sort requires
> dealing with memory allocation in dynamic shared memory segments
> (ouch!).  Both of them require solving the three problems listed
> above.
>
> And maybe a few others, but I think those are the big ones - and I
> think proper deadlock detection is the hardest of them.  A colleague
> of mine has drafted patches for sharing snapshots and combo CIDs
> between processes, and as you might expect that's pretty easy.
> Sharing the transaction state (so we can test whether a transaction ID
> is "our" transaction ID inside the worker) is a bit trickier, but I
> think not too hard.  Assessing pushdown-safety will probably boil down
> to adding some equivalent of proisparallel.  Maybe not the most
> elegant, but defensible, and if you're looking for the shortest path
> to something usable, that's probably it.  But deadlock detection ...
> well, I don't see any simpler solution than what I'm trying to build
> here.

OK, I see where you are headed, and perhaps why.

I said I would help and I mean to do so. Here's my thoughts:

Solving your 3 problems requires significant research, coding and
agreement that you're unlikely to get in 9.5, because there are no
solutions delivered at the end of it; it's all theoretical design,
which becomes more arguable. And even if you do, Parallel (||) Seq
Scan isn't interesting except for marketing purposes, since its use in
real queries is limited because most queries need to do something else
after the SeqScan, which opens up a fuller discussion of parallel
query which we've never had. My feeling is that releasing only || Seq
Scan would backfire since we don't yet do enough other stuff to be a
full solution.

If you do wish to pursue || Seq Scan, then a working prototype would
help. It allows us to see that there is an open source solution we are
working to solve the problems for. People can benchmark it, understand
the benefits and issues it raises and that would help focus attention
on the problems you are trying to solve in infrastructure. People may
have suggestions on how to solve or avoid those that you hadn't
thought of.

Personally, IMHO, it's quite late in the cycle to be releasing such a
prototype since we have only 6-7 weeks until the major patch
submission deadline. Hence my request: can we get *something* good for
users into 9.5, since with respect, infrastructure isn't useful. (If
you take such comments negatively, we might discuss similar issues
with BDR - but we are working to make simple features available in
each release on that project also).

My proposal is we do a parallel index build scan... just as we
discussed earlier, so you would be following the direction set by Dev
Meeting, not just a proposal of mine.

As I mentioned previously when you started discussing shared memory
segments, parallel sort does NOT require shared memory. The only thing
you need to share are files. Split the problem into N pieces, sort
them to produce N files and then merge the files using existing code.
That only applies to large sorts, but then those are the ones you
cared about doing in parallel anyway.
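The split/sort/merge flow described above can be sketched in miniature
(a toy Python model of the file-based approach, not PostgreSQL's actual
tuplesort code; the chunking and temporary-file handling here are purely
illustrative):

```python
import heapq
import tempfile

def parallel_external_sort(rows, n_workers=4):
    """Toy model of the split/sort/merge plan: each 'worker' sorts one
    slice of the input into its own run file; the leader then merges
    the sorted runs with a k-way merge (heapq.merge)."""
    # Split the input into N roughly equal pieces, one per worker.
    chunk = (len(rows) + n_workers - 1) // n_workers
    run_files = []
    for i in range(n_workers):
        piece = sorted(rows[i * chunk:(i + 1) * chunk])  # worker-local sort
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in piece)            # spill run to a file
        f.seek(0)
        run_files.append(f)
    # Leader: k-way merge of the sorted run files using existing merge logic.
    runs = [(int(line) for line in f) for f in run_files]
    return list(heapq.merge(*runs))
```

As the surrounding discussion notes, this shape only pays off for large
sorts, where the per-piece sorting dominates the cost of writing and
merging the run files.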

CREATE INDEX has a lock on the table. Worker tasks can be simply
banned from acquiring new locks and doing many other tasks. We can
prevent transaction chains, so both normal and concurrent versions
have no complex combocid or transaction state issues. So your 3
problems come down to nothing that needs to be solved immediately.
Pushdown can be assumed for the functions we use for existing index
types: we can verify this from our internal code, so we are not at
risk there.

So my proposal is we do a parallel index build scan, a limited subset
of parallel seq scan, we put the data from that into a private sort
node, then output a series of files. The plan for this is the
following...

Parallel Sort Merge
   Local Sort
      Local IndexBuildScanSegment

Local IndexBuildScanSegment scans a portion of the table being
indexed, a "segment" - this is the heart of the division of labour
within the parallel scan.
Local Sort is just a normal sort node except it outputs a file, not a
stream of sorted tuples.
Parallel Sort Merge waits for all Local Sorts to finish and then
merges the inputs using existing sort merge logic.

Now I'm sure you can see there are ways to improve that plan using
shmem. But the above is a workable solution for 9.5 within the time
available. We'll learn much making this happen and it will fuel
everyone for more work in the following release.

For the next release, I suggest that || hash join is more useful goal
than || seq scan since it allows typical queries on big tables to be
optimized, but we can discuss that in more detail another time.

Happy to have a more detailed phone call or meeting to discuss.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Amit Kapila
Date:
On Wed, Oct 29, 2014 at 2:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> My proposal is we do a parallel index build scan... just as we
> discussed earlier, so you would be following the direction set by Dev
> Meeting, not just a proposal of mine.
>
> As I mentioned previously when you started discussing shared memory
> segments, parallel sort does NOT require shared memory. The only thing
> you need to share are files. Split the problem into N pieces, sort
> them to produce N files and then merge the files using existing code.
> That only applies to large sorts, but then those are the ones you
> cared about doing in parallel anyway.
>
> CREATE INDEX has a lock on the table. Worker tasks can be simply
> banned from acquiring new locks and doing many other tasks.

Banning worker tasks from taking locks is only possible if the
worker backend doesn't need to acquire any lock that is not
already acquired by the main backend, but can we safely assume
that?  The bttextcmp() example Robert gave upthread can break
that assumption.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Wed, Oct 29, 2014 at 4:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If you do wish to pursue || Seq Scan, then a working prototype would
> help. It allows us to see that there is an open source solution we are
> working to solve the problems for. People can benchmark it, understand
> the benefits and issues it raises and that would help focus attention
> on the problems you are trying to solve in infrastructure. People may
> have suggestions on how to solve or avoid those that you hadn't
> thought of.

I've mulled that over a bit and it might be worth pursuing further.
Of course there's always the trade-off: doing that means not doing
something else.

> As I mentioned previously when you started discussing shared memory
> segments, parallel sort does NOT require shared memory. The only thing
> you need to share are files. Split the problem into N pieces, sort
> them to produce N files and then merge the files using existing code.
> That only applies to large sorts, but then those are the ones you
> cared about doing in parallel anyway.

A simple implementation of this would work only for simple
pass-by-value types, like integers.  Pass-by-reference types require
the comparator to de-TOAST, and some other types require catalog
lookups.  I don't think that's very useful: Noah previously did some
analysis of this problem and concluded (with apologies if I'm remembering
the details incorrectly here) that the comparator for strings was
something like 1000x as expensive as the comparator for integers, and
that you basically couldn't get the latter to take enough time to be
worth parallelizing.

I care much more about getting the general infrastructure in place to
make parallel programming feasible in PostgreSQL than I do about
getting one particular case working.  And more than feasible: I want
it to be relatively straightforward.  That's not simple, but the
potential rewards are great.  Let's face it: there are people here who
are much better at hacking on the planner and especially the
executor than I am.  Why haven't any of those people implemented
parallel anything?  I think it's because, right now, it's just too
darn hard.  I'm trying to reduce that to something approaching the
difficulty of writing normal PostgreSQL backend code, and I think I'm
6-12 patches away from that.  This is one of them and, yeah, it's not
done, and, yeah, we might not get to parallel anything this release
and, yeah, things would be going faster if I could work on parallelism
full time.  But I think that the progress we are making is meaningful
and the goal is within sight.

I appreciate that you'd probably attack this problem from a different
direction than I'm attacking it from, but I still think that what I'm
trying to do is a legitimate direction of attack which, by the way,
does not preclude anybody else from attacking it from a different
direction and, indeed, such a development would be most welcome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 29 October 2014 12:08, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Oct 29, 2014 at 2:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> My proposal is we do a parallel index build scan... just as we
>> discussed earlier, so you would be following the direction set by Dev
>> Meeting, not just a proposal of mine.
>>
>> As I mentioned previously when you started discussing shared memory
>> segments, parallel sort does NOT require shared memory. The only thing
>> you need to share are files. Split the problem into N pieces, sort
>> them to produce N files and then merge the files using existing code.
>> That only applies to large sorts, but then those are the ones you
>> cared about doing in parallel anyway.
>>
>> CREATE INDEX has a lock on the table. Worker tasks can be simply
>> banned from acquiring new locks and doing many other tasks.
>
> Banning worker tasks from taking locks is only possible if the
> worker backend doesn't need to acquire any lock that is not
> already acquired by the main backend, but can we safely assume
> that?  The bttextcmp() example Robert gave upthread can break
> that assumption.

I think it's reasonable to imagine some prep work will be required in
the main task before workers begin.

Locking the toast table of any main tables we access seems easily
done. Though perhaps we should make weak locking of the toast table
presumed. Do we have cases where the toast table can be accessed when
the main table is not also strong locked first?

As for locking the enums table or collation table, that's simple stuff
also. They're just AccessShareLocks.

I can't imagine we'll be able to presume that all functions of every
kind are parallel-safe.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 29 October 2014 12:28, Robert Haas <robertmhaas@gmail.com> wrote:

> I care much more about getting the general infrastructure in place to
> make parallel programming feasible in PostgreSQL than I do about
> getting one particular case working.  And more than feasible: I want
> it to be relatively straightforward.  That's not simple, but the
> potential rewards are great.  Let's face it: there are people here who
> are much better at hacking on the planner and especially the
> executor than I am.  Why haven't any of those people implemented
> parallel anything?  I think it's because, right now, it's just too
> darn hard.  I'm trying to reduce that to something approaching the
> difficulty of writing normal PostgreSQL backend code, and I think I'm
> 6-12 patches away from that.  This is one of them and, yeah, it's not
> done, and, yeah, we might not get to parallel anything this release
> and, yeah, things would be going faster if I could work on parallelism
> full time.  But I think that the progress we are making is meaningful
> and the goal is within sight.

The role of an immediate working solution is to prove the
infrastructure so far is genuinely useful, helping also to build
interest and understanding.

There is a real danger that your "ta-dah" moment sometime in the
future contains flaws which need to be addressed, but we now have
piles of questionable infrastructure lying around. If you have
similar doubts about anything I'm doing, please just ask.

If we could see the 6-12 patch descriptions and understand where you
are going it would help.

> I appreciate that you'd probably attack this problem from a different
> direction than I'm attacking it from, but I still think that what I'm
> trying to do is a legitimate direction of attack which, by the way,
> does not preclude anybody else from attacking it from a different
> direction and, indeed, such a development would be most welcome.

Well, I don't have time; been a little busy these last 10 years and
for at least one year more yet before projects open up again.

If I did, I don't think it would be that useful to interfere.
Cooperation seems better use of my time, if possible.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Wed, Oct 29, 2014 at 10:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> There is a real danger that your "ta-dah" moment sometime in the
> future contains flaws which need to be addressed, but we now have
> piles of questionable infrastructure lying around. If you have
> similar doubts about anything I'm doing, please just ask.

I think the implication that the infrastructure created to date is of
questionable utility for parallelism is false and unjustified.  Every
infrastructure patch thus far committed, and every patch proposed
other than this one, has been accompanied by code that demonstrates
what it does and how that's useful.  Most of them have had enough
interest by extension authors to generate bug reports from other
people - most notably Petr, who has obviously been playing with all of
the existing modules enough to notice bugs and come to conclusions
about the shortcomings of the existing technology similar to my own.
Interest in dynamic background workers seems to go considerably
further, to the point where it made 9.4's list of major features.

> If we could see the 6-12 patch descriptions and understand where you
> are going it would help.

OK, here's what I think we need (roughly):

1. The remaining pg_background patches.  The main things in there are
(a) the ability to propagate errors from a background worker back to
the user backend and (b) the ability to synchronize the background
worker's GUC state with the user backend.

2. Group locking.

3. Patches to synchronize the (a) snapshot, (b) combo CID hash, and
(c) transaction state from the user backend to a background worker.
Draft patches for (a) and (b) are already complete, but not yet
published because I didn't think anyone would be very excited about
them just yet.

4. A patch to add "parallel mode" in which a whole bunch of things are
prohibited in the user backend and even more things are prohibited in
the background worker - specifically, things that won't be safe in
those contexts.  Noah and Amit Kapila worked on this and we have draft
patches for this as well, also not published and for the same reasons.

5. A patch to add a "proisparallel" flag or similar so that we can
label parallel-safe functions.

6. A patch to add a way for a background worker that hits an error to
kick the user backend, probably via the ProcSignal mechanism, so that
the user doesn't have to poll the shm_mq structures it's using to
communicate with those backends for errors continuously (or
alternatively sit idle while those background workers do all the work).
This might be something that can be skipped for early versions, but
eventually I think we'll need it.

7. Eventually, a dynamic shared memory allocator.  A lot of things can
be done without this, but some can't.

I think that's most of it.  No doubt we'll discover a few other needs
along the way, but if we had that infrastructure, then I think it
would be quite feasible for an experienced hacker to parallelize most
of what we want to parallelize without having to invent too much
that's really fundamental.

> If I did, I don't think it would be that useful to interfere.
> Cooperation seems better use of my time, if possible.

I am not asking anyone to interfere, in the sense of getting in the
way, but I certainly think it would be useful for people to contribute
code and/or reviews.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 29 October 2014 15:43, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 29, 2014 at 10:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> There is a real danger that your "ta-dah" moment sometime in the
>> future contains flaws which need to be addressed, but we now have
>> piles of questionable infrastructure lying around. If you have
>> similar doubts about anything I'm doing, please just ask.
>
> I think the implication that the infrastructure created to date is of
> questionable utility for parallelism is false and unjustified.

Huh? I haven't implied that.

Thanks for your wider explanations.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Amit Kapila
Date:
On Wed, Oct 29, 2014 at 7:38 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 29 October 2014 12:08, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Oct 29, 2014 at 2:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >>
> >> CREATE INDEX has a lock on the table. Worker tasks can be simply
> >> banned from acquiring new locks and doing many other tasks.
> >
> > Banning worker tasks from taking locks is only possible if the
> > worker backend doesn't need to acquire any lock that is not
> > already acquired by the main backend, but can we safely assume
> > that?  The bttextcmp() example Robert gave upthread can break
> > that assumption.
>
> I think its reasonable to imagine some prep work will be required in
> the main task before workers begin.
>
> Locking the toast table of any main tables we access seems easily
> done. Though perhaps we should make weak locking of the toast table
> presumed. Do we have cases where the toast table can be accessed when
> the main table is not also strong locked first?

I think it is possible to have a strong lock on toast table before
main table.
Cluster pg_toast.<toast_table_name> using <toast_index>;
Vacuum Full pg_toast.<toast_table_name>;
Reindex Index pg_toast.<toast_index_name>;
..

Now if we take the lock on the toast table in the main task, it will block
some of these operations before they actually need to be blocked.

> As for locking the enums table or collation table, that's simple stuff
> also. They're just AccessShareLocks.

Yeah, taking AccessShareLocks is not a problem; however, the main
problem is that they block any other operation on those tables that
requires AccessExclusiveLock or any other strong lock, before it is
actually necessary to block any such operation.


> I can't imagine we'll be able to presume that all functions of every
> kind are parallel-safe.

Yeah, that's right, so we need to either block such functions from being
executed in a parallel worker or make arrangements so that they can be
executed safely; I think for the first version blocking might be better.
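A minimal sketch of what "blocking" could look like on the worker side
(hypothetical names throughout; at this point in the thread a
proisparallel-style flag is only a proposal, so the whitelist below is
purely illustrative):

```python
# Hypothetical gate a parallel worker could apply before invoking a
# function: if the function is not known to be parallel-safe, raise an
# error instead of executing it.  The whitelist stands in for a
# proisparallel-style catalog flag; the entries are illustrative only.
PARALLEL_SAFE_FUNCS = {"int4eq", "int4lt"}

def call_in_parallel_worker(func_name, impl, *args):
    if func_name not in PARALLEL_SAFE_FUNCS:
        raise RuntimeError(
            f"function {func_name} is not marked parallel-safe; "
            "refusing to execute it in a parallel worker")
    return impl(*args)
```

The design choice here is the one both posters converge on: rather than
making every function safe to run in a worker, the first version simply
refuses to run anything not explicitly marked safe.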

Now, I think we can find ways to avoid all such things by hacking the code
so that conflicting operations are banned/blocked; however, eventually we
need some kind of group locking to avoid deadlocks that can arise from
operations in parallel workers.  So it seems to me it is not a bad idea
to build a strong infrastructure first for the things (like group locking)
that we think are required for any non-trivial, useful parallel operation,
without putting too many restrictions on users of the feature.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 30 October 2014 04:24, Amit Kapila <amit.kapila16@gmail.com> wrote:

>> Locking the toast table of any main tables we access seems easily
>> done. Though perhaps we should make weak locking of the toast table
>> presumed. Do we have cases where the toast table can be accessed when
>> the main table is not also strong locked first?
>
> I think it is possible to have a strong lock on toast table before
> main table.
> Cluster pg_toast.<toast_table_name> using <toast_index>;
> Vacuum Full pg_toast.<toast_table_name>;
> Reindex Index pg_toast.<toast_index_name>;
> ..
>
> Now if we take the lock on the toast table in the main task, it will block
> some of these operations before they actually need to be blocked.

Very strange.

Those commands are not common operations. I doubt many people even
know they exist.

All of those commands would block SELECTs on the main table, so there
is no significant benefit in that behaviour.

In fact it would be more sensible to lock the toast table earlier.


>> As for locking the enums table or collation table, that's simple stuff
>> also. They're just AccessShareLocks.
>
> Yeah, taking AccessShareLocks is not a problem; however, the main
> problem is that they block any other operation on those tables that
> requires AccessExclusiveLock or any other strong lock, before it is
> actually necessary to block any such operation.

What operations are those? How frequently do they occur?

I can't see any need for major maintenance operations on small catalog tables.


>> I can't imagine we'll be able to presume that all functions of every
>> kind are parallel-safe.
>
> Yeah, that's right, so we need to either block such functions from being
> executed in a parallel worker or make arrangements so that they can be
> executed safely; I think for the first version blocking might be better.

I think that also. That way we won't need complex locking.

> Now, I think we can find ways to avoid all such things by hacking code such
> that conflicting operations can be banned/blocked, however eventually we
> need to have some kind of group locking to avoid deadlocks which can arise
> due to operations in parallel worker.  So it seems to me it is not a bad
> idea
> to have a strong infrastructure first for the things (like group locking)
> which
> we think are required for any non-trivial useful parallel operation without
> putting too much restrictions on users for using the feature.

We are all agreed we need to block some functions from executing in
parallel mode.

Why don't we start by making a list of the functions that might cause
problems in parallel workers, and why. That way we might find out
easier ways of doing things.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Amit Kapila
Date:
On Thu, Oct 30, 2014 at 2:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 30 October 2014 04:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >> Locking the toast table of any main tables we access seems easily
> >> done. Though perhaps we should make weak locking of the toast table
> >> presumed. Do we have cases where the toast table can be accessed when
> >> the main table is not also strong locked first?
> >
> > I think it is possible to have a strong lock on toast table before
> > main table.
> > Cluster pg_toast.<toast_table_name> using <toast_index>;
> > Vacuum Full pg_toast.<toast_table_name>;
> > Reindex Index pg_toast.<toast_index_name>;
> > ..
> >
> > Now if we take the lock on the toast table in the main task, it will block
> > some of these operations before they actually need to be blocked.
>
> Very strange.
>
> Those commands are not common operations? I doubt many people even
> know that exists.
>
> All of those commands would block SELECTs on the main table, so there
> is no significant benefit in that behaviour.
>
> In fact it would be more sensible to lock the toast table earlier.
>

It might make some sense to lock the toast table earlier for this
particular case, but I don't think in general it will be feasible to lock
all the tables (including catalog tables) which might get used in the
overall operation before starting parallel operation.  I believe it will
make doing parallel stuff difficult if we try to go via this route, as
we need to always do this for any operation we want to make parallel.
There will always be the risk that we might miss something, and it
might not be feasible in all kinds of cases.

If I understand correctly, you are suggesting that to get the first version
of parallel operation out earlier, we should avoid all that stuff
(group locking, ...), put some restrictions on the cases in which
Create Index can work, do some hackery in the Create Index path to avoid
any kind of locking problems, do the parallel operation as an External
Sort (which might avoid the need for a shared memory allocator), call
it a day, and then do the things we need for other parallel
operations as and when they are required.  I think this might be a good
approach in itself if somebody wants to target something to release
early; however, in the medium to long term, when we want to provide
non-trivial parallel operations, we will eventually have to solve those
problems, and delaying will only make things worse.

In short, I agree that there is a merit in your idea w.r.t getting the first
version of parallel operation out early, however if we see from medium
to long term, I feel solving group locking problem (or some other general
infrastructure what Robert mentioned upthread) can yield better results,
unless you feel that is not at all required for parallel operations.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 31 October 2014 04:42, Amit Kapila <amit.kapila16@gmail.com> wrote:

>> In fact it would be more sensible to lock the toast table earlier.
>>
>
> It might make some sense to lock the toast table earlier for this
> particular case, but I don't think in general it will be feasible to lock
> all the tables (including catalog tables) which might get used in the
> overall operation before starting parallel operation.  I believe it will
> make doing parallel stuff difficult if we try to go via this route, as
> we need to always do this for any operation we want to make parallel.
> It will always have the risk that we might miss something and another
> thing is it might not be feasible in all kind of cases.
>
> If I understand correctly, you are suggesting that to get the first version
> of parallel operation out earlier, we should try to avoid all the stuff
> (like group locking, ...),

I advocate trying to avoid it, to see. The problems cited so far are
not so extreme that they cannot easily be worked around, if you have
the will to do so.

I've just posted about an idea to reduce catalog locking.

> have some restrictions on which kind of cases
> Create Index can work, do some hackery in Create Index path to avoid
> any kind of locking problems and do the parallel operation for External
> Sort (which might avoid need for shared memory allocator) and then
> call it a day and then do the things which we need for other parallel
> operations as and when they are required.

> I think this might be a good
> approach in itself if somebody wants to target something to release
> early; however, in the short to medium term, when we want to provide
> non-trivial parallel operations, we will eventually have to solve those
> problems, and delaying will only make things worse.

My (by now) long experience on Postgres has shown me that developing
something now is a good route to take. By moving towards a solution at
a reasonable pace you learn things which may affect the longer term
viability of the project. New functionality in each release is useful.
Big bang developments that don't deliver until the end have a habit of
not delivering at the end either. "MVP" is a term frequently used by
VCs for a reason.

Lack of urgency in delivering a useful outcome from the project looks
strange. If all you develop in this release is core infrastructure for
parallelism and a custom scan API, it begins to look like there may
not be an open source version delivered at all. How people make their
income is up to them, but it's a concern that has been raised my way
before, so it seems reasonable to raise it for others also. That is why
we have spent much time explaining clearly the goals and licensing of
BDR, for example. No need for big arguments; these are not allegations
or implications, etc.

> In short, I agree that there is merit in your idea w.r.t. getting the first
> version of parallel operation out early; however, if we look at the medium
> to long term, I feel solving the group locking problem (or some other
> general infrastructure that Robert mentioned upthread) can yield better
> results, unless you feel that is not at all required for parallel operations.

Is it genuinely required for most parallel operations? I think it's
clear that none of us knows the answer. Sure, the general case needs
it, but is the general case the same thing as the reasonably common
case?

Certainly, having a working prototype for something useful would help
build the case for why these things are needed. It would certainly
help the cause to have something working in this release.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Fri, Oct 31, 2014 at 6:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Is it genuinely required for most parallel operations? I think it's
> clear that none of us knows the answer. Sure, the general case needs
> it, but is the general case the same thing as the reasonably common
> case?

Well, I think that the answer is pretty clear.  Most of the time,
perhaps in 99.9% of cases, group locking will make no difference as to
whether a parallel operation succeeds or fails.  Occasionally,
however, it will cause an undetected deadlock.  I don't hear anyone
arguing that that's OK.  Does anyone wish to make that argument?

If not, then we must prevent it.  The only design, other than what
I've proposed here, that seems like it will do that consistently in
all cases is to have the user backend lock every table that the child
backend might possibly want to lock and retain those locks throughout
the entire duration of the computation whether the child would
actually need those locks or not.  I think that could be made to work,
but there are two problems:

1. Turing's theorem being what it is, predicting what catalog tables
the child might lock is not necessarily simple.

2. It might end up taking many more locks than necessary and holding
them for much longer than necessary.  Right now, for example, a
syscache lookup locks the table only if we actually need to read from
it and releases the lock as soon as the actual read is complete.
Under this design, every syscache that the parallel worker might
conceivably consult needs to be locked for the entire duration of the
parallel computation.  I would expect this to provoke a violent
negative reaction from at least one prominent community member (and I
bet users wouldn't like it much, either).

So, I am still of the opinion that group locking makes sense.   While
I appreciate the urge to avoid solving difficult problems where it's
reasonably avoidable, I think we're in danger of spending more effort
trying to avoid solving this particular problem than it would take to
actually solve it.  Based on what I've done so far, I'm guessing that
a complete group locking patch will be between 1000 and 1500 lines of
code and that nearly all of the new logic will be skipped when it's
not in use (i.e. no parallelism).  That sounds to me like a hell of a
deal compared to trying to predict what locks the child might
conceivably take and preemptively acquire them all, which sounds
annoyingly tedious even for simple things (like equality operators)
and nearly impossible for anything more complicated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-10-31 08:54:54 -0400, Robert Haas wrote:
> On Fri, Oct 31, 2014 at 6:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Is it genuinely required for most parallel operations? I think it's
> > clear that none of us knows the answer. Sure, the general case needs
> > it, but is the general case the same thing as the reasonably common
> > case?
> 
> Well, I think that the answer is pretty clear.  Most of the time,
> perhaps in 99.9% of cases, group locking will make no difference as to
> whether a parallel operation succeeds or fails.  Occasionally,
> however, it will cause an undetected deadlock.  I don't hear anyone
> arguing that that's OK.  Does anyone wish to make that argument?

Maybe we can, as a first step, make those edges in the lock graph
visible to the deadlock detector? It's pretty clear that undetected
deadlocks aren't ok, but detectable deadlocks in a couple corner cases
might be acceptable.

> If not, then we must prevent it.  The only design, other than what
> I've proposed here, that seems like it will do that consistently in
> all cases is to have the user backend lock every table that the child
> backend might possibly want to lock and retain those locks throughout
> the entire duration of the computation whether the child would
> actually need those locks or not.  I think that could be made to work,
> but there are two problems:
> 
> 1. Turing's theorem being what it is, predicting what catalog tables
> the child might lock is not necessarily simple.

Well, there's really no need to be absolutely general here. We're only
going to support a subset of functionality as parallelizable. And in
that case we don't need anything complicated to acquire all locks.

It seems quite possible to combine a heuristic approach with improved
deadlock detection.

> 2. It might end up taking many more locks than necessary and holding
> them for much longer than necessary.  Right now, for example, a
> syscache lookup locks the table only if we actually need to read from
> it and releases the lock as soon as the actual read is complete.
> Under this design, every syscache that the parallel worker might
> conceivably consult needs to be locked for the entire duration of the
> parallel computation.  I would expect this to provoke a violent
> negative reaction from at least one prominent community member (and I
> bet users wouldn't like it much, either).

I see little reason to do this for system level relations.

> So, I am still of the opinion that group locking makes sense.   While
> I appreciate the urge to avoid solving difficult problems where it's
> reasonably avoidable, I think we're in danger of spending more effort
> trying to avoid solving this particular problem than it would take to
> actually solve it.  Based on what I've done so far, I'm guessing that
> a complete group locking patch will be between 1000 and 1500 lines of
> code and that nearly all of the new logic will be skipped when it's
> not in use (i.e. no parallelism).  That sounds to me like a hell of a
> deal compared to trying to predict what locks the child might
> conceivably take and preemptively acquire them all, which sounds
> annoyingly tedious even for simple things (like equality operators)
> and nearly impossible for anything more complicated.

What I'm worried about is the performance impact of group locking when
it's not in use. The heavyweight locking code is already quite complex
and often a noticeable bottleneck...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-10-31 08:54:54 -0400, Robert Haas wrote:
>> On Fri, Oct 31, 2014 at 6:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> > Is it genuinely required for most parallel operations? I think it's
>> > clear that none of us knows the answer. Sure, the general case needs
>> > it, but is the general case the same thing as the reasonably common
>> > case?
>>
>> Well, I think that the answer is pretty clear.  Most of the time,
>> perhaps in 99.9% of cases, group locking will make no difference as to
>> whether a parallel operation succeeds or fails.  Occasionally,
>> however, it will cause an undetected deadlock.  I don't hear anyone
>> arguing that that's OK.  Does anyone wish to make that argument?
>
> Maybe we can, as a first step, make those edges in the lock graph
> visible to the deadlock detector? It's pretty clear that undetected
> deadlocks aren't ok, but detectable deadlocks in a couple corner cases
> might be acceptable.

Interesting idea.  I agree that would be more acceptable.  It would be
kind of sad, though, if you got a deadlock from this:

BEGIN;
LOCK TABLE foo;
SELECT * FROM foo;

You can, of course, avoid that, but it's basically transferring the
burden from the lock manager to every bit of parallel code we ever
write.  If it turns out to be easier to do this than what I'm
currently proposing, I'm definitely amenable to going this route for
version 1, but I don't think it's going to be appealing as a permanent
solution.
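To make the semantics being argued for concrete, here is a minimal sketch of a group-aware conflict check (plain Python with invented names and a two-mode conflict table; an illustration of the idea, not the patch's actual code):

```python
# Hypothetical group-aware lock conflict check.  Only two of PostgreSQL's
# lock modes are modeled: AccessExclusiveLock conflicts with everything,
# AccessShareLock conflicts only with AccessExclusiveLock.
CONFLICTS = {
    ("AccessExclusiveLock", "AccessExclusiveLock"): True,
    ("AccessExclusiveLock", "AccessShareLock"): True,
    ("AccessShareLock", "AccessExclusiveLock"): True,
    ("AccessShareLock", "AccessShareLock"): False,
}

def lock_conflicts(held, requested, holder_group, requester_group):
    """Locks held by members of the same locking group never conflict."""
    if holder_group is not None and holder_group == requester_group:
        return False
    return CONFLICTS[(held, requested)]

# The leader (group 1) holds AccessExclusiveLock from LOCK TABLE foo.
# A worker in the same group can still take AccessShareLock for its
# scan, while an unrelated backend (no group) waits as usual.
assert not lock_conflicts("AccessExclusiveLock", "AccessShareLock", 1, 1)
assert lock_conflicts("AccessExclusiveLock", "AccessShareLock", 1, None)
```

Under this rule the LOCK TABLE example above needs no special handling in parallel code: the worker's AccessShareLock request is simply granted.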

>> 1. Turing's theorem being what it is, predicting what catalog tables
>> the child might lock is not necessarily simple.
>
> Well, there's really no need to be absolutely general here. We're only
> going to support a subset of functionality as parallelizable. And in
> that case we don't need anything complicated to acquire all locks.

See, that's what I don't believe.  Even very simple things like
equality operators do a surprising amount of catalog locking.

>> 2. It might end up taking many more locks than necessary and holding
>> them for much longer than necessary.  Right now, for example, a
>> syscache lookup locks the table only if we actually need to read from
>> it and releases the lock as soon as the actual read is complete.
>> Under this design, every syscache that the parallel worker might
>> conceivably consult needs to be locked for the entire duration of the
>> parallel computation.  I would expect this to provoke a violent
>> negative reaction from at least one prominent community member (and I
>> bet users wouldn't like it much, either).
>
> I see little reason to do this for system level relations.

I'm not following you.

>> So, I am still of the opinion that group locking makes sense.   While
>> I appreciate the urge to avoid solving difficult problems where it's
>> reasonably avoidable, I think we're in danger of spending more effort
>> trying to avoid solving this particular problem than it would take to
>> actually solve it.  Based on what I've done so far, I'm guessing that
>> a complete group locking patch will be between 1000 and 1500 lines of
>> code and that nearly all of the new logic will be skipped when it's
>> not in use (i.e. no parallelism).  That sounds to me like a hell of a
>> deal compared to trying to predict what locks the child might
>> conceivably take and preemptively acquire them all, which sounds
>> annoyingly tedious even for simple things (like equality operators)
>> and nearly impossible for anything more complicated.
>
> What I'm worried about is the performance impact of group locking when
> it's not in use. The heavyweight locking code is already quite complex
> and often a noticeable bottleneck...

I certainly agree that it would be unacceptable for group locking to
significantly regress the normal case.  I'm pretty optimistic about
reducing the overhead to near-zero, though - a few extra branch
instructions on the non-fast-path case should be lost in the noise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-10-31 10:08:59 -0400, Robert Haas wrote:
> On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-10-31 08:54:54 -0400, Robert Haas wrote:
> >> On Fri, Oct 31, 2014 at 6:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> > Is it genuinely required for most parallel operations? I think it's
> >> > clear that none of us knows the answer. Sure, the general case needs
> >> > it, but is the general case the same thing as the reasonably common
> >> > case?
> >>
> >> Well, I think that the answer is pretty clear.  Most of the time,
> >> perhaps in 99.9% of cases, group locking will make no difference as to
> >> whether a parallel operation succeeds or fails.  Occasionally,
> >> however, it will cause an undetected deadlock.  I don't hear anyone
> >> arguing that that's OK.  Does anyone wish to make that argument?
> >
> > Maybe we can, as a first step, make those edges in the lock graph
> > visible to the deadlock detector? It's pretty clear that undetected
> > deadlocks aren't ok, but detectable deadlocks in a couple corner cases
> > might be acceptable.
> 
> Interesting idea.  I agree that would be more acceptable.  It would be
> kind of sad, though, if you got a deadlock from this:
> 
> BEGIN;
> LOCK TABLE foo;
> SELECT * FROM foo;

I have serious doubts about the number of cases where it's correct to
access relations in a second backend that are exclusively locked. Not so
much when that happens for a user issued LOCK statement of course, but
when the system has done so. Personally I think to stay sane we'll have
to forbid access to anything normal other backends can't access either -
otherwise things will get too hard to reason about.

So just refusing parallelism as soon as anything has taken an access
exclusive lock doesn't sound too bad to me.

> You can, of course, avoid that, but it's basically transferring the
> burden from the lock manager to every bit of parallel code we ever
> write.  If it turns out to be easier to do this than what I'm
> currently proposing, I'm definitely amenable to going this route for
> version 1, but I don't think it's going to be appealing as a permanent
> solution.

It's probably not the solution forever, I agree. But it might be the
simplest way forward till we have more actual users.

> >> 1. Turing's theorem being what it is, predicting what catalog tables
> >> the child might lock is not necessarily simple.
> >
> > Well, there's really no need to be absolutely general here. We're only
> > going to support a subset of functionality as parallelizable. And in
> > that case we don't need anything complicated to acquire all locks.
> 
> See, that's what I don't believe.  Even very simple things like
> equality operators do a surprising amount of catalog locking.

So what? I don't see catalog locking as something problematic. Maybe I'm
missing something, but in which cases would you expect deadlocks or
anything like it on catalog relations?

> >> So, I am still of the opinion that group locking makes sense.   While
> >> I appreciate the urge to avoid solving difficult problems where it's
> >> reasonably avoidable, I think we're in danger of spending more effort
> >> trying to avoid solving this particular problem than it would take to
> >> actually solve it.  Based on what I've done so far, I'm guessing that
> >> a complete group locking patch will be between 1000 and 1500 lines of
> >> code and that nearly all of the new logic will be skipped when it's
> >> not in use (i.e. no parallelism).  That sounds to me like a hell of a
> >> deal compared to trying to predict what locks the child might
> >> conceivably take and preemptively acquire them all, which sounds
> >> annoyingly tedious even for simple things (like equality operators)
> >> and nearly impossible for anything more complicated.
> >
> > What I'm worried about is the performance impact of group locking when
> > it's not in use. The heavyweight locking code is already quite complex
> > and often a noticeable bottleneck...
> 
> I certainly agree that it would be unacceptable for group locking to
> significantly regress the normal case.  I'm pretty optimistic about
> reducing the overhead to near-zero, though - a few extra branch
> instructions on the non-fast-path case should be lost in the noise.

I'm less worried about the additional branches than about increasing the
size of the datastructures. They're already pretty huge.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 31 October 2014 12:54, Robert Haas <robertmhaas@gmail.com> wrote:

> 1. Turing's theorem being what it is, predicting what catalog tables
> the child might lock is not necessarily simple.

The Pareto principle offers ways to cope with the world's lack of simplicity.

You mentioned earlier that functions would need to be marked proisparallel, etc.

What conditions will that be protecting against? If we aren't going to
support the general case where every single thing works, can we at
least discuss the list of cases that we will support?

I don't think we can argue that everything must be generic when we
already admit it won't be.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Fri, Oct 31, 2014 at 10:38 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I have serious doubts about the number of cases where it's correct to
> access relations in a second backend that are exclusively locked. Not so
> much when that happens for a user issued LOCK statement of course, but
> when the system has done so. Personally I think to stay sane we'll have
> to forbid access to anything normal other backends can't access either -
> otherwise things will get too hard to reason about.
>
> So just refusing parallelism as soon as anything has taken an access
> exclusive lock doesn't sound too bad to me.

That restriction seems onerous to me; for example, it would prevent a
parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL.
Those seem like worthy targets for parallelism, and maybe even early
targets, since I expect a lot of the real complexity here to be in the
query planner, and you can avoid most of that complexity when planning
DDL.

I also think it's unnecessary.  It's wrong to think of two parallel
backends that both want AccessExclusiveLock on a given target relation
as conflicting with each other; if the process were not running in
parallel, those lock requests would both have come from the same
process and both requests would have been granted.  Parallelism is
generally not about making different things happen; it's about
spreading the stuff that would happen anyway across multiple backends,
and things that would succeed if done in a single backend shouldn't
fail when spread across 2 or more without some compelling
justification.  The cases where it's actually unsafe for a table to be
accessed even by some other code within the same backend are handled
not by the lock manager, but by CheckTableNotInUse().  That call
should probably fail categorically if issued while parallelism is
active.

>> >> 1. Turing's theorem being what it is, predicting what catalog tables
>> >> the child might lock is not necessarily simple.
>> >
>> > Well, there's really no need to be absolutely general here. We're only
>> > going to support a subset of functionality as parallelizable. And in
>> > that case we don't need anything complicated to acquire all locks.
>>
>> See, that's what I don't believe.  Even very simple things like
>> equality operators do a surprising amount of catalog locking.
>
> So what? I don't see catalog locking as something problematic. Maybe I'm
> missing something, but in which cases would you expect deadlocks or
> anything like it on catalog relations?

Suppose somebody fires off a parallel sort on a text column, or a
parallel sequential scan involving a qual of the form textcol = 'zog'.
We launch a bunch of workers to do comparisons; they do lookups
against pg_collation.  After some but not all of them have loaded the
collation information they need into their local cache, the DBA types
"cluster pg_collation".  It queues up for an AccessExclusiveLock.  The
remaining parallel workers join the wait queue waiting for their
AccessShareLock. The system is now deadlocked; the only solution is to
move the parallel workers ahead of the AccessExclusiveLock request,
but the deadlock detector won't do that unless it knows about the
relationship among the parallel workers.
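The scenario can be modeled as a wait-for graph; a minimal sketch (Python, invented names; not PostgreSQL's deadlock detector code) shows why the cycle is invisible unless the detector knows about the group:

```python
# The pg_collation scenario as a waits-for graph:
#   W1 holds AccessShareLock on pg_collation.
#   X  (CLUSTER) queues for AccessExclusiveLock -> waits on W1.
#   W2 queues behind X for AccessShareLock      -> waits on X.

def has_cycle(graph):
    """Depth-first search for any cycle in a waits-for graph."""
    def visit(node, path):
        if node in path:
            return True
        path.add(node)
        found = any(visit(n, path) for n in graph.get(node, ()))
        path.discard(node)
        return found
    return any(visit(n, set()) for n in graph)

# Edges the lock manager can see: waiter -> process it waits on.
visible = {"X": ["W1"], "W2": ["X"]}
assert not has_cycle(visible)       # no visible cycle: undetected deadlock

# Add the edge implied by group membership: assume every member of the
# lock group {W1, W2} eventually waits for every other member.
with_group = {"X": ["W1"], "W2": ["X"], "W1": ["W2"]}
assert has_cycle(with_group)        # W1 -> W2 -> X -> W1: now detectable
```

Once the cycle is visible, the detector can break it by granting the workers' AccessShareLock requests ahead of the queued AccessExclusiveLock, which is harmless within the group.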

It's possible to construct more obscure cases as well.  For example,
suppose a user (for reasons known only to himself and God) does LOCK
TABLE pg_opclass and then begins a sort of a column full of enums.
Meanwhile, another user tries to CLUSTER pg_enum.  This will deadlock
if, and only if, some of the enum OIDs are odd.  I mention this
example not because I think it's a particularly useful case but
because it illustrates just how robust the current system is: it will
catch that case, and break the deadlock somehow, and you as a
PostgreSQL backend hacker do not need to think about it.  The guy who
added pg_enum did not need to think about it.  It just works.  If we
decide that parallelism is allowed to break that guarantee, then every
patch that does parallel anything has got to worry about what
undetected deadlocks might result, and then argue about which of them
are obscure enough that we shouldn't care and which are not.  That
doesn't sound like a fun way to spend my time.

But, really, the first case is the one I'm more worried about: a bunch
of parallel backends all grab the same locks, but slightly separated
in time.  A strong locker joins the queue midway through and we're
hosed.  Obviously any system cache lookups whatsoever are going to be
separated in time like this, just because the world isn't synchronous.
The pg_enum case is an example of how the lookups could be more widely
spaced: each backend will scan pg_enum when it first hits an odd OID,
and that could happen much later for some than others.  But even a
small window is enough to render undetected deadlock a possibility,
and you will not convince me that "hope the race condition is narrow
enough in practice that this never happens" is a remotely sane
strategy.

Down the road, it's not hard to imagine the same issues cropping up
with user tables.  For example, the user defines a non-inlinable SQL
or PL/pgsql function getname() that does SELECT name FROM other_table
WHERE id = $1 and returns the result.  They mark this function
parallel safe, or we infer that automatically.  Then they do SELECT *
FROM some_table WHERE somecol = 242 AND getname(othercol) ~
'cranberry'.  There is nothing whatsoever to make this query
non-parallelizable; indeed, it's probably an excellent candidate for
parellelism.  But if you don't have group locking then it is again
susceptible to the risk of undetected deadlock as soon as an unrelated
process tries to grab AccessExclusiveLock on other_table, because it
may be that some parallel backends already have AccessShareLock on it,
and that others will request it later, queueing up behind the
AccessExclusiveLock request.

Even if you think that all of these particular issues are somehow
avoidable or that they don't matter, I think it would be unwise to bet
on them being the only issues.  The lock manager looks like it's old,
ill-maintained code, but that's in no small part because it solved the
problem it aims at so thoroughly that few people have felt the need to
modify it very much in the last decade.  I think that's part of why
there's so much skepticism on this thread - we don't think about the
problems that the lock manager solves as being important problems not
because they truly aren't important, but because they are so
thoroughly solved.  Then every once in a while we complain about the
performance.

> I'm less worried about the additional branches than about increasing the
> size of the datastructures. They're already pretty huge.

I share that concern.  I aimed for a design which would be
memory-efficient and have low impact on the non-group-locking case
rather than a design which would be ideal for group locking.  The
patch could be made to handle large locking groups more efficiently by
changing the shared memory structure around; e.g. by replacing
PROCLOCK's holdMask and waitMask fields, now both of type LOCKMASK,
with int [MAX_LOCKMODES] so that we can represent the entire group's
locks held and awaited on a single object with a single PROCLOCK.
That representation would be significantly more convenient than the
one I actually chose, but it would also use up a LOT more shared
memory and would penalize LockReleaseAll().  The design embodied by
the proposed patch adds one additional pointer per PROCLOCK, which is
still not great, but I haven't been able to think of a reasonable way
to do better.
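Rough arithmetic on that tradeoff (illustrative only; the sizes assume a 4-byte LOCKMASK and MAX_LOCKMODES = 10 as in lock.h, and are not measurements of the patch):

```python
# Back-of-envelope comparison of the two PROCLOCK layouts discussed
# above.  Field sizes are assumptions for illustration, not measured.
LOCKMASK_BYTES = 4    # a LOCKMASK is an int bitmask
MAX_LOCKMODES = 10    # number of heavyweight lock modes
POINTER_BYTES = 8

# Current fields: holdMask + waitMask, one LOCKMASK each.
current = 2 * LOCKMASK_BYTES                      # 8 bytes

# Alternative: per-mode counters, so one PROCLOCK can represent the
# whole group's held and awaited locks.
per_mode_counters = 2 * MAX_LOCKMODES * 4         # 80 bytes

# Proposed patch: keep the masks, add one pointer per PROCLOCK.
proposed = current + POINTER_BYTES                # 16 bytes

assert per_mode_counters > proposed > current
```

Multiplied across max_locks_per_transaction slots for every backend, the ten-fold growth of the counter representation is what makes it unattractive despite its convenience.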

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 31 October 2014 18:29, Robert Haas <robertmhaas@gmail.com> wrote:

> Suppose somebody fires off a parallel sort on a text column, or a
> parallel sequential scan involving a qual of the form textcol = 'zog'.
> We launch a bunch of workers to do comparisons; they do lookups
> against pg_collation.  After some but not all of them have loaded the
> collation information they need into their local cache, the DBA types
> "cluster pg_collation".  It queues up for an AccessExclusiveLock.  The
> remaining parallel workers join the wait queue waiting for their
> AccessShareLock. The system is now deadlocked; the only solution is to
> move the parallel workers ahead of the AccessExclusiveLock request,
> but the deadlock detector won't do that unless it knows about the
> relationship among the parallel workers.

It's an obscure case, and it's not the only solution either.

I'm really surprised that having workers do their own locking doesn't
scare you. Personally, it scares me.

It's not like we're the first to do parallel query. Honestly, how many
parallel databases do you think do this?

I can't see this being a practical, workable solution for running a
parallel query across a cluster of many nodes. Shared, distributed
lock table?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Fri, Oct 31, 2014 at 11:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> You mentioned earlier that functions would need to be marked proisparallel, etc.
>
> What conditions will that be protecting against? If we aren't going to
> support the general case where every single thing works, can we at
> least discuss the list of cases that we will support?

In general, any operation that relies on backend-private state not
shared with a cooperating backend isn't parallel-safe.  For example,
consider:

SELECT setseed(1);
SELECT random() FROM generate_series(1,1000) g;

This sequence of queries can be relied upon to produce the same output
every time.  But if some of the random() calls are executed in another
backend, you'll get different results because random() works by using
backend-private memory to store its pseudo-random state.  It's
unlikely to be worth the effort to move the pseudo-random state to a
DSM for parallelism, so we probably just want to disallow parallel
query when random() is in use, or, better still, disallow
parallelizing only the particular query node that uses random().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-10-31 14:29:57 -0400, Robert Haas wrote:
> On Fri, Oct 31, 2014 at 10:38 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I have serious doubts about the number of cases where it's correct to
> > access relations in a second backend that are exclusively locked. Not so
> > much when that happens for a user issued LOCK statement of course, but
> > when the system has done so. Personally I think to stay sane we'll have
> > to forbid access to anything normal other backends can't access either -
> > otherwise things will get too hard to reason about.
> >
> > So just refusing parallelism as soon as anything has taken an access
> > exclusive lock doesn't sound too bad to me.
>
> That restriction seems onerous to me; for example, it would prevent a
> parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL.
> Those seem like worthy targets for parallelism, and maybe even early
> targets, since I expect a lot of the real complexity here to be in the
> query planner, and you can avoid most of that complexity when planning
> DDL.

Ugh. I think that's aiming too high. You'll suddenly need to share cache
invalidations between the backends. I think we should start by requiring
that there's no DDL done *at all* in the transaction that starts the
parallel activity. For CREATE INDEX PARALLEL that can be done by reusing
the logic we have for CONCURRENTLY, thereby moving the parallelized part
into a transaction that hasn't done any DDL yet.

> I also think it's unnecessary.  It's wrong to think of two parallel
> backends that both want AccessExclusiveLock on a given target relation
> as conflicting with each other; if the process were not running in
> parallel, those lock requests would both have come from the same
> process and both requests would have been granted.

I don't think that's a realistic view. There's lots of backend private
state around. If we try to make all that coherent between backends we'll
fail.

> Parallelism is
> generally not about making different things happen; it's about
> spreading the stuff that would happen anyway across multiple backends,
> and things that would succeed if done in a single backend shouldn't
> fail when spread across 2 or more without some compelling
> justification.  The cases where it's actually unsafe for a table to be
> accessed even by some other code within the same backend are handled
> not by the lock manager, but by CheckTableNotInUse().  That call
> should probably fail categorically if issued while parallelism is
> active.

Which will e.g. prohibit CLUSTER and VACUUM FULL.

> >> >> 1. Turing's theorem being what it is, predicting what catalog tables
> >> >> the child might lock is not necessarily simple.
> >> >
> >> > Well, there's really no need to be absolutely general here. We're only
> >> > going to support a subset of functionality as parallelizable. And in
> >> > that case we don't need anything complicated to acquire all locks.
> >>
> >> See, that's what I don't believe.  Even very simple things like
> >> equality operators do a surprising amount of catalog locking.
> >
> > So what? I don't see catalog locking as something problematic. Maybe I'm
> > missing something, but I don't see in which cases you'd expect deadlocks
> > or anything like it on catalog relations?
>
> Suppose somebody fires off a parallel sort on a text column, or a
> parallel sequential scan involving a qual of the form textcol = 'zog'.
> We launch a bunch of workers to do comparisons; they do lookups
> against pg_collation.  After some but not all of them have loaded the
> collation information they need into their local cache, the DBA types
> "cluster pg_collation".  It queues up for an AccessExclusiveLock.  The
> remaining parallel workers join the wait queue waiting for their
> AccessShareLock. The system is now deadlocked; the only solution is to
> move the parallel workers ahead of the AccessExclusiveLock request,
> but the deadlock detector won't do that unless it knows about the
> relationship among the parallel workers.

I think you would need to make the case more complex - we only hold
locks on system objects for a very short time, and mostly not in a
nested fashion.

But generally I think it's absolutely perfectly alright to fail in such
a case. We need to fail reliably, but *none* of this is a new hazard. You
can have pretty similar issues today, without any parallelism.

> It's possible to construct more obscure cases as well.  For example,
> suppose a user (for reasons known only to himself and God) does LOCK
> TABLE pg_opclass and then begins a sort of a column full of enums.
> Meanwhile, another user tries to CLUSTER pg_enum.  This will deadlock
> if, and only if, some of the enum OIDs are odd.  I mention this
> example not because I think it's a particularly useful case but
> because it illustrates just how robust the current system is: it will
> catch that case, and break the deadlock somehow, and you as a
> PostgreSQL backend hacker do not need to think about it.  The guy who
> added pg_enum did not need to think about it.  It just works.  If we
> decide that parallelism is allowed to break that guarantee, then every
> patch that does parallel anything has got to worry about what
> undetected deadlocks might result, and then argue about which of them
> are obscure enough that we shouldn't care and which are not.  That
> doesn't sound like a fun way to spend my time.

Well, that's why I think we should teach the deadlock detector to catch
these kinds of cases. That doesn't sound particularly hard.

I'm sorry to be a bit pessimistic here. But my intuition says that
starting to do group locking opens a can of worms that'll take us a long
time to close again. Maybe I'm just imagining complexity that won't
actually be there. But I have a hard time believing that.

I wonder if we couldn't implement a 'halfway' by allowing parallel
workers to jump the lockqueue if the parent process already has the
lock. That'd only work for nonconflicting locks, but I think that's
actually good.

> > I'm less worried about the additional branches than about increasing the
> > size of the data structures. They're already pretty huge.
>
> I share that concern.  I aimed for a design which would be
> memory-efficient and have low impact on the non-group-locking case
> rather than a design which would be ideal for group locking.  The
> patch could be made to handle large locking groups more efficiently by
> changing the shared memory structure around; e.g. by replacing
> PROCLOCK's holdMask and waitMask fields, now both of type LOCKMASK,
> with int [MAX_LOCKMODES] so that we can represent the entire group's
> locks held and awaited on a single object with a single PROCLOCK.
> That representation would be significantly more convenient than the
> one I actually chose, but it would also use up a LOT more shared
> memory and would penalize LockReleaseAll().  The design embodied by
> the proposed patch adds one additional pointer per PROCLOCK, which is
> still not great, but I haven't been able to think of a reasonable way
> to do better.

That's in the current prototype you sent or something newer you have
privately?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Fri, Oct 31, 2014 at 2:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 31 October 2014 18:29, Robert Haas <robertmhaas@gmail.com> wrote:
>> Suppose somebody fires off a parallel sort on a text column, or a
>> parallel sequential scan involving a qual of the form textcol = 'zog'.
>> We launch a bunch of workers to do comparisons; they do lookups
>> against pg_collation.  After some but not all of them have loaded the
>> collation information they need into their local cache, the DBA types
>> "cluster pg_collation".  It queues up for an AccessExclusiveLock.  The
>> remaining parallel workers join the wait queue waiting for their
>> AccessShareLock. The system is now deadlocked; the only solution is to
>> move the parallel workers ahead of the AccessExclusiveLock request,
>> but the deadlock detector won't do that unless it knows about the
>> relationship among the parallel workers.
>
> It's an obscure case and it's not the only solution either.

I don't think that's an obscure situation at all.  Do you really think
a patch that could cause an attempt to VACUUM FULL a system catalog to
suffer an undetected deadlock meets this community's quality
standards?  Because that's what we're talking about.

You are right that it isn't the only solution.  I have said the same
thing myself, multiple times, on this thread.

> I'm really surprised that having workers do their own locking doesn't
> scare you. Personally, it scares me.

Frankly, the reverse decision scares me a heck of a lot more.  It
superficially seems like a good idea to have the master take locks on
behalf of the workers, but when you start to really think about how
many low-level parts of the code take locks, it quickly becomes
evident, at least to me, that trying to make the resulting system
robust will be a nightmare.

Don't forget that there are not only relation locks but also page
locks, tuple locks, relation extension locks, XID and VXID locks,
locks on arbitrary database or shared objects.  The deadlock detector
handles them all.  If avoiding deadlock outside the lock manager is so
easy, why do we have a deadlock detector in the first place?  Why not
just change all of our other code to avoid them instead?

> It's not like we're the first to do parallel query. Honestly, how many
> parallel databases do you think do this?

All of them.

I might be wrong, of course, but I see no reason at all to believe
that Oracle or SQL Server ignore deadlock detection when parallel
query is in use.

> I can't see this being a practical, workable solution for running a
> parallel query across a cluster of many nodes. Shared, distributed
> lock table?

I am not proposing a distributed lock manager that can run across many
nodes.  The considerations for such a piece of software are largely
orthogonal to the problem that I'm actually trying to solve here, and
at least an order of magnitude more complex.  If you make it sound
like that's the project that I'm doing, then of course it's going to
sound frighteningly complicated.  But I'm not.

I'm sort of baffled by your concern on this point.  Sure, the lock
manager is a complex and performance-sensitive piece of code, and we
have to be careful not to break it or make it slower.  But we're not
talking about launching a rocket to Mars here; we're just talking
about making the lock manager interface look the same way to two
cooperating backends issuing lock requests in parallel that it looks
to a single backend making those requests serially.  I don't think
it's noticeably harder than getting Hot Standby conflict resolution to
work, or getting historical snapshots working for logical decoding, or
making the visibility map crash-safe - in fact, I'd bet on it being
substantially easier than the first two of those, because all the
logic is inside a single, relatively simple backend subsystem.  Why
we'd want to minimize complexity there at the cost of spreading it
across half the backend is mysterious to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Fri, Oct 31, 2014 at 3:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > So just refusing parallelism as soon as anything has taken an access
>> > exclusive lock doesn't sound too bad to me.
>>
>> That restriction seems onerous to me; for example, it would prevent a
>> parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL.
>> Those seem like worthy targets for parallelism, and maybe even early
>> targets, since I expect a lot of the real complexity here to be in the
>> query planner, and you can avoid most of that complexity when planning
>> DDL.
>
> Ugh. I think that's aiming too high. You'll suddenly need to share cache
> invalidations between the backends. I think we should start by requiring
> that there's no DDL done *at all* in the transaction that starts the
> parallel activity. For CREATE INDEX PARALLEL that can be done by reusing
> the logic we have for CONCURRENTLY, thereby moving the parallelized part
> into a transaction that hasn't done any DDL yet.

I don't think that's correct.  We only need to process local
invalidation messages after CommandCounterIncrement(), which I
anticipate prohibiting during parallel execution (short thought should
convince you that anything else is completely nuts).  So I think
there's no real problem with invalidation messages being generated
during parallel execution.  I wouldn't anticipate supporting that in
V1, because they'd have to be propagated back to the user backend
prior to commit, but in the longer term it seems pretty feasible.

>> I also think it's unnecessary.  It's wrong to think of two parallel
>> backends that both want AccessExclusiveLock on a given target relation
>> as conflicting with each other; if the process were not running in
>> parallel, those lock requests would both have come from the same
>> process and both requests would have been granted.
>
> I don't think that's a realistic view. There's lots of backend private
> state around. If we try to make all that coherent between backends we'll
> fail.

I agree that there's a lot of backend private state, and that's the
principal problem with making this work at all.  The challenge here is
to figure out which parts of that backend-private state we absolutely
must make coherent between backends to enable parallelism to have a
useful, non-trivial scope, and which parts we can punt on.  I have
ideas about that - which are mostly documented at
https://wiki.postgresql.org/wiki/Parallel_Sort - but I'm trying to
keep an open mind on the details.

>> Parallelism is
>> generally not about making different things happen; it's about
>> spreading the stuff that would happen anyway across multiple backends,
>> and things that would succeed if done in a single backend shouldn't
>> fail when spread across 2 or more without some compelling
>> justification.  The cases where it's actually unsafe for a table to be
>> accessed even by some other code within the same backend are handled
>> not by the lock manager, but by CheckTableNotInUse().  That call
>> should probably fail categorically if issued while parallelism is
>> active.
>
> Which will e.g. prohibit CLUSTER and VACUUM FULL.

I don't think so.  I think it's safe to get to the point where we
check that the table is not in use, then do a parallel scan and sort,
then collect the results and write them out.  But I would not try to
make it safe, for example, to be scanning a table and have some
user-defined function invoked along the way to go try to do anything
that involves CheckTableNotInUse().  Many of those would fall over
anyway because of PreventTransactionChain(), but I've got no problem
nailing that hatch shut twice.  And conversely, I would expect we'd
disallow the sort operators invoked by CLUSTER from reaching any code
guarded by CheckTableNotInUse().

>> Suppose somebody fires off a parallel sort on a text column, or a
>> parallel sequential scan involving a qual of the form textcol = 'zog'.
>> We launch a bunch of workers to do comparisons; they do lookups
>> against pg_collation.  After some but not all of them have loaded the
>> collation information they need into their local cache, the DBA types
>> "cluster pg_collation".  It queues up for an AccessExclusiveLock.  The
>> remaining parallel workers join the wait queue waiting for their
>> AccessShareLock. The system is now deadlocked; the only solution is to
>> move the parallel workers ahead of the AccessExclusiveLock request,
>> but the deadlock detector won't do that unless it knows about the
>> relationship among the parallel workers.
>
> I think you would need to make the case more complex - we only hold
> locks on system objects for a very short time, and mostly not in a
> nested fashion.

Hmm.  You're right.  If the parallel backends release their locks
quickly, the CLUSTER could run and then the parallel job could finish.
But it sounds as if you get the idea.

> But generally I think it's absolutely perfectly alright to fail in such
> a case. We need to fail reliably, but *none* of this is a new hazard. You
> can have pretty similar issues today, without any parallelism.

I think it's absolutely fine for it to *fail*.  I think it's not OK
for it to *hang* without triggering the deadlock detector.

> Well, that's why I think we should teach the deadlock detector to catch
> these kinds of cases. That doesn't sound particularly hard.

I could not agree more!  In fact, that's the whole point of the patch.

> I'm sorry to be a bit pessimistic here. But my intuition says that
> starting to do group locking opens a can of worms that'll take us a long
> time to close again. Maybe I'm just imagining complexity that won't
> actually be there. But I have a hard time believing that.

What's the distinction between "teach the deadlock detector to catch
these kinds of cases" and "group locking"?  Because I think those are
at least 90% the same thing.

> I wonder if we couldn't implement a 'halfway' by allowing parallel
> workers to jump the lockqueue if the parent process already has the
> lock. That'd only work for nonconflicting locks, but I think that's
> actually good.

The patch takes precisely that approach; that part of the logic is
already implemented.  I'd like to make it a bit more robust to handle
cases where the deadlock detector gets involved, as I currently
believe that (a) can be done without unmanageable complexity, (b) will
make programming in parallel mode much simpler, and (c) will
significantly decrease the chances of an angry user or
fellow-committer tracking me down and beating me to death with a
keen-edged bug report.

> That's in the current prototype you sent or something newer you have
> privately?

The former.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-10-31 15:56:42 -0400, Robert Haas wrote:
> On Fri, Oct 31, 2014 at 3:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > So just refusing parallelism as soon as anything has taken an access
> >> > exclusive lock doesn't sound too bad to me.
> >>
> >> That restriction seems onerous to me; for example, it would prevent a
> >> parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL.
> >> Those seem like worthy targets for parallelism, and maybe even early
> >> targets, since I expect a lot of the real complexity here to be in the
> >> query planner, and you can avoid most of that complexity when planning
> >> DDL.
> >
> > Ugh. I think that's aiming too high. You'll suddenly need to share cache
> > invalidations between the backends. I think we should start by requiring
> > that there's no DDL done *at all* in the transaction that starts the
> > parallel activity. For CREATE INDEX PARALLEL that can be done by reusing
> > the logic we have for CONCURRENTLY, thereby moving the parallelized part
> > into a transaction that hasn't done any DDL yet.
> 
> I don't think that's correct.  We only need to process local
> invalidation messages after CommandCounterIncrement(), which I
> anticipate prohibiting during parallel execution (short thought should
> convince you that anything else is completely nuts).

It is more complex, even without CCI. As long as you're doing anything
inside a transaction that already has done DDL, you'll have to play
nasty tricks. Imagine the primary backend has done a BEGIN; CLUSTER
pg_class; and then starts a child backend. Which will get the same
snapshot, combocids, yada yada. But it *also* will have preexisting
cache entries. Those you need to invalidate before it can do anything
correct.


> > I'm sorry to be a bit pessimistic here. But my intuition says that
> > starting to do group locking opens a can of worms that'll take us a long
> > time to close again. Maybe I'm just imagining complexity that won't
> > actually be there. But I have a hard time believing that.
> 
> What's the distinction between "teach the deadlock detector to catch
> these kind of cases" and "group locking"?  Because I think those are
> at least 90% the same thing.

I understand under 'group locking' that primary/secondary backends can
co-own a lock that normally is self-exclusive. That's a fair bit more
than adding an edge to the deadlock graph between primary/secondary
backends to make the deadlock detector recognize problems.

What I have serious doubts about is 'co-owning' locks. Especially if two
backends normally wouldn't be able to both get that lock.

> > I wonder if we couldn't implement a 'halfway' by allowing parallel
> > workers to jump the lockqueue if the parent process already has the
> > lock. That'd only work for nonconflicting locks, but I think that's
> > actually good.
> 
> The patch takes precisely that approach; that part of the logic is
> already implemented.

Well, my point is that if the solution is just to jump the queue, there
really aren't any data structure changes needed. Secondary acquirers just
need to check whether a lock is already owned by the primary and then
acquire the lock in the absolutely normal way - with a completely
separate entry. Only that they ignore the queue.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Fri, Oct 31, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I don't think that's correct.  We only need to process local
>> invalidation messages after CommandCounterIncrement(), which I
>> anticipate prohibiting during parallel execution (short thought should
>> convince you that anything else is completely nuts).
>
> It is more complex, even without CCI. As long as you're doing anything
> inside a transaction that already has done DDL, you'll have to play
> nasty tricks. Imagine the primary backend has done a BEGIN; CLUSTER
> pg_class; and then starts a child backend. Which will get the same
> snapshot, combocids, yada yada. But it *also* will have preexisting
> cache entries. Those you need to invalidate before it can do anything
> correct.

Where will those preexisting cache entries come from, exactly?  The
postmaster is forking the parallel worker, not the user backend.

>> > I'm sorry to be a bit pessimistic here. But my intuition says that
>> > starting to do group locking opens a can of worms that'll take us a long
>> > time to close again. Maybe I'm just imagining complexity that won't
>> > actually be there. But I have a hard time believing that.
>>
>> What's the distinction between "teach the deadlock detector to catch
>> these kinds of cases" and "group locking"?  Because I think those are
>> at least 90% the same thing.
>
> I understand under 'group locking' that primary/secondary backends can
> co-own a lock that normally is self-exclusive. That's a fair bit more
> than adding an edge to the deadlock graph between primary/secondary
> backends to make the deadlock detector recognize problems.

I guess.  It seems like a pretty minor extension to me.  Getting the
deadlock detector to do the right thing is the hard part.

> What I have serious doubts about is 'co-owning' locks. Especially if two
> backends normally wouldn't be able to both get that lock.

Perhaps we should discuss that more.  To me it seems pretty obvious
that's both safe and important.

>> > I wonder if we couldn't implement a 'halfway' by allowing parallel
>> > workers to jump the lockqueue if the parent process already has the
>> > lock. That'd only work for nonconflicting locks, but I think that's
>> > actually good.
>>
>> The patch takes precisely that approach; that part of the logic is
>> already implemented.
>
> Well, my point is that if the solution is just to jump the queue, there
> really aren't any data structure changes needed. Secondary acquirers just
> need to check whether a lock is already owned by the primary and then
> acquire the lock in the absolutely normal way - with a completely
> separate entry. Only that they ignore the queue.

Well, they need to be able to identify who is in their group.  You
could possibly do that without any changes at all to the lock manager
data structures, but it seems a bit complex.  I took the approach of
storing the group leader's PGPROC * in each PROCLOCK, which makes it
trivial (and is enough information for the deadlock detector to work
correctly, too).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 31 October 2014 19:36, Robert Haas <robertmhaas@gmail.com> wrote:

>> It's an obscure case and it's not the only solution either.
>
> I don't think that's an obscure situation at all.  Do you really think
> a patch that could cause an attempt to VACUUM FULL a system catalog to
> suffer an undetected deadlock meets this community's quality
> standards?  Because that's what we're talking about.

Nobody has said that allowing undetected deadlocks is acceptable, so
your other comments are void.

I've suggested *stricter* locking, which would obviously allow
deadlock detection. You recognised that by claiming that the locking I
had proposed was actually too strict, which is where the above example
came from.

Yes, I have proposed stricter locking, but as explained, the only
things this would slow down are catalog VAC FULLs, which are already a
problem.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 31 October 2014 18:47, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 31, 2014 at 11:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> You mentioned earlier that functions would need to be marked proisparallel etc..
>>
>> What conditions will that be protecting against? If we aren't going to
>> support the general case where every single thing works, can we at
>> least discuss what the list of cases is that we will support.
>
> In general, any operation that relies on backend-private state not
> shared with a cooperating backend isn't parallel-safe.  For example,

Yes, the principle is easy to understand, but that was not the question.

You are saying that placing restrictions on which functions can
execute is necessary, yet at the same time saying that we must have
generalised locking for workers because restrictions on locking are
not acceptable. I don't point that out because I want an argument or
to prove a point, but because there are important things on the table
here.

What are the specific restrictions you are suggesting we place on
parallel workers? We need that definition clear so we can decide how
to mark the functions. If we don't know, which is easily
understandable, then the best way to find that out is to build a
restricted solution and to expand slowly outwards to find problems.

An obvious milestone there is whether a function contains SQL, which
is the point chosen by some other databases. I personally hope to
expand upon that, but it would be useful to reach it first and then
continue from there.

I was already thinking of implementing CONTAINS NO SQL as a SQL
Standard function marking since it can be used to optimize COPY
further.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-10-31 21:25:02 -0400, Robert Haas wrote:
> On Fri, Oct 31, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I don't think that's correct.  We only need to process local
> >> invalidation messages after CommandCounterIncrement(), which I
> >> anticipate prohibiting during parallel execution (short thought should
> >> convince you that anything else is completely nuts).
> >
> > It is more complex, even without CCI. As long as you're doing anything
> > inside a transaction that already has done DDL, you'll have to play
> > nasty tricks. Imagine the primary backend has done a BEGIN; CLUSTER
> > pg_class; and then starts a child backend. Which will get the same
> > snapshot, combocids, yada yada. But it *also* will have preexisting
> > cache entries. Those you need to invalidate before it can do anything
> > correct.
> 
> Where will those preexisting cache entries come from, exactly?  The
> postmaster is forking the parallel worker, not the user backend.

Several things:
1) The relcache init files load a fair bit
2) There's cache entries made just during startup while we look up user
   information and such
3) There's lots of places you can hook into where it's perfectly legal
   and expected that you access system caches. I don't think you can
   reasonably define those away.
4) I'm pretty sure that one of the early next steps will be to reuse a
   bgworker for multiple tasks. So there'll definitely be preexisting
   cache entries.

I'm pretty sure that it's impossible to define the problem away. We
*might* be able to say that we'll just do an InvalidateSystemCaches() at
start of the parallelized task.

> >> > I'm sorry to be a bit pessimistic here. But my intuition says that
> >> > starting to do group locking opens a can of worms that'll take us a long
> >> > time to close again. Maybe I'm just imagining complexity that won't
> >> > actually be there. But I have a hard time believing that.
> >>
> >> What's the distinction between "teach the deadlock detector to catch
> >> these kind of cases" and "group locking"?  Because I think those are
> >> at least 90% the same thing.
> >
> > I understand under 'group locking' that primary/secondary backends can
> > co-own a lock that normally is self-exclusive. That's a fair bit more
> > than adding an edge to the deadlock graph between primary/secondary
> > backends to make the deadlock detector recognize problems.
> 
> I guess.  It seems like a pretty minor extension to me.  Getting the
> deadlock detector to do the right thing is the hard part.

I doubt it. There's a whole bunch of problems that just exist by virtue
of having a group leader. What if the leader dies but the worker backend
isn't in a section of code that does a CHECK_FOR_INTERRUPTS()?

I also still have serious doubts about allowing self-exclusive locks to be
shared between backends. That's going to cause pain. And that's pretty
much the only reason why you need to share entries.

> > What I have serious doubts about is 'co-owning' locks. Especially if two
> > backends normally wouldn't be able to both get that lock.
> 
> Perhaps we should discuss that more.  To me it seems pretty obvious
> that's both safe and important.

I completely fail to see why you'd generally think it's safe for two
backends to hold the same self-conflicting lock. Yes, within carefully
restricted sections of code that might be doable. But I don't think you
can fully enforce that. People *will* mark their own functions as
parallel safe. And do stupid stuff. That shouldn't cause
segfaults/exploits/low level data corruption/.. as long as it's done
from !C functions.

The aforementioned issue of how to deal with the leader dying while the
other backend still progresses also is nontrivial in my opinion.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Sat, Nov 1, 2014 at 1:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Where will those preexisting cache entries come from, exactly?  The
>> postmaster is forking the parallel worker, not the user backend.
>
> Several things:
> 1) The relcache init files load a fair bit
> 2) There's cache entries made just during startup while we look up user
>    information and such
> 3) There's lots of places you can hook into where it's perfectly legal
>    and expected that you access system caches. I don't think you can
>    reasonably define those away.
> 4) I'm pretty sure that one of the early next steps will be to reuse a
>    bgworker for multiple tasks. So there'll definitely be preexisting
>    cache entries.
>
> I'm pretty sure that it's impossible to define the problem away. We
> *might* be able to say that we'll just do an InvalidateSystemCaches() at
> start of the parallelized task.

I see.  I haven't thought deeply about those things, and there could
indeed be problems there.

>> > What I have serious doubts about is 'co-owning' locks. Especially if two
>> > backends normally wouldn't be able to both get that lock.
>>
>> Perhaps we should discuss that more.  To me it seems pretty obvious
>> that's both safe and important.
>
> I completely fail to see why you'd generally think it's safe for two
> backends to hold the same self-conflicting lock. Yes, within carefully
> restricted sections of code that might be doable. But I don't think you
> can fully enforce that. People *will* mark their own functions as
> parallel safe. And do stupid stuff. That shouldn't cause
> parallel safe. And do stupid stuff. That shouldn't cause
> segfaults/exploits/low level data corruption/.. as long as it's done
> from !C functions.

I agree that people will mark their own functions as parallel-safe,
and I also agree that shouldn't cause segfaults, exploits, or
low-level data corruption.  Clearly, we'll need to impose a bunch of
restrictions to achieve that.  But if you put a LOCK TABLE statement
in an SQL function, it won't conflict with other locking requests made
by the same or another function elsewhere in the same transaction;
surely you don't want that behavior to be *different* in parallel
mode.  That would be a blatant, user-visible behavior difference
justified by ... nothing at all.  There's no hazard there.  Where you
start getting into crash/exploit/data corruption territory is when you
are talking about DDL operations that change the physical structure of
the table.  That's why we have stuff like CheckTableNotInUse() to
verify that, for example, there are no old cursors around that are
still expecting the old relfilenode and tuple descriptor to be valid.
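To make the intended lock-manager behavior concrete, here is a toy model in Python (purely illustrative; it is not the patch's code, and only the two PostgreSQL lock modes needed for the example are included): locks held by members of the same lock group never conflict with each other, while normal conflicts across groups still apply.

```python
# Toy model of group locking: same-group holders are mutually
# non-conflicting; unrelated backends see the usual conflicts.
CONFLICTS = {
    "AccessShareLock": {"AccessExclusiveLock"},
    "AccessExclusiveLock": {"AccessShareLock", "AccessExclusiveLock"},
}

class LockTable:
    def __init__(self):
        self.held = []  # list of (backend, group, mode)

    def can_acquire(self, backend, group, mode):
        for b, g, m in self.held:
            if b == backend or (group is not None and g == group):
                continue  # same backend or same lock group: never a conflict
            if m in CONFLICTS[mode]:
                return False
        return True

    def acquire(self, backend, group, mode):
        ok = self.can_acquire(backend, group, mode)
        if ok:
            self.held.append((backend, group, mode))
        return ok

lt = LockTable()
assert lt.acquire("A", 1, "AccessExclusiveLock")     # leader locks the table
assert lt.acquire("B", 1, "AccessExclusiveLock")     # worker, same group: ok
assert not lt.acquire("X", None, "AccessShareLock")  # unrelated backend blocks
```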

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Sat, Nov 1, 2014 at 1:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I doubt it. There's a whole bunch of problems that just exist by virtue
> of having a group leader. What if the leader dies but the worker backend
> isn't in a section of code that does a CHECK_FOR_INTERRUPTS()?

In between all of the heat about whether I'm writing the correct
patch, or whether I'm overall making progress fast enough, somebody
could, I don't know, read the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-11-01 17:00:59 -0400, Robert Haas wrote:
> On Sat, Nov 1, 2014 at 1:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I doubt it. There's a whole bunch of problems that just exist by virtue
> > of having a group leader. What if the leader dies but the worker backend
> > isn't in a section of code that does a CHECK_FOR_INTERRUPTS()?
> 
> In between all of the heat about whether I'm writing the correct
> patch, or whether I'm overall making progress fast enough, somebody
> could, I don't know, read the patch.

I plan to, but that takes a fair amount of time. And there's at least
one patch of yours ahead in the queue... And I do think that most of the
concerns are more general than a specific implementation.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Sat, Nov 1, 2014 at 1:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> What are the specific restrictions you are suggesting we place on
> parallel workers? We need that definition clear so we can decide how
> to mark the functions. If we don't know, which is easily
> understandable, then the best way to find that out is to build a
> restricted solution and to expand slowly outwards to find problems.
>
> An obvious milestone there is whether a function contains SQL, which
> is the point chosen by some other databases. I personally hope to
> expand upon that, but it would be useful to reach it first and then
> continue from there.
>
> I was already thinking of implementing CONTAINS NO SQL as a SQL
> Standard function marking since it can be used to optimize COPY
> further.

I've written some fairly extensive thoughts on this topic at
http://wiki.postgresql.org/wiki/Parallel_Sort

Personally, I tend to focus more on what needs to be allowed in
parallel workers than what needs to be prohibited.  The motivation for
this patch is simple:

1. Any non-trivial piece of PostgreSQL code is likely to contain
syscache lookups.
2. Syscache lookups had better work in parallel workers, or they'll be
all but useless.
3. Some of those syscache lookups will be satisfied from cache, but
sometimes we'll need to actually read the system catalogs.
4. That requires locks.
5. Gosh, we'd better not deadlock.

There's a lot of infrastructure that I think will work fine in
parallel mode without any changes.  Buffer locking, buffer pins,
buffer lookups, and lwlocks, for example, I expect to all just work
without any real problems.  But heavyweight locks and invalidation messages
seem like they'll need a bit of surgery to work correctly in this
environment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Sat, Nov 1, 2014 at 5:06 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-11-01 17:00:59 -0400, Robert Haas wrote:
>> On Sat, Nov 1, 2014 at 1:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > I doubt it. There's a whole bunch of problems that just exist by virtue
>> > of having a group leader. What if the leader dies but the worker backend
>> > isn't in a section of code that does a CHECK_FOR_INTERRUPTS()?
>>
>> In between all of the heat about whether I'm writing the correct
>> patch, or whether I'm overall making progress fast enough, somebody
>> could, I don't know, read the patch.
>
> I plan to, but that takes a fair amount of time. And there's at least
> one patch of yours ahead in the queue... And I do think that most of the
> concerns are more general than a specific implementation.

Fair enough.  Well, the way I solved that specific problem is to allow
the group leader to terminate but not to return its PGPROC to the free
list until the last group member exits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-11-01 16:59:34 -0400, Robert Haas wrote:
> > I completely fail to see why you'd generally think it's safe for two
> > backends to hold the same self conflicting lock. Yes, within carefully
> > restricted sections of code that might be doable. But I don't think you
> > can fully enforce that. People *will* mark their own functions at
> > parallel safe. And do stupid stuff. That shouldn't cause
> > segfaults/exploits/low level data corruption/.. as long as it's done
> > from !C functions.
> 
> I agree that people will mark their own functions as parallel-safe,
> and I also agree that shouldn't cause segfaults, exploits, or
> low-level data corruption.  Clearly, we'll need to impose a bunch of
> restrictions to achieve that.

> But if you put a LOCK TABLE statement
> in an SQL function, it won't conflict with other locking requests made
> by the same or another function elsewhere in the same transaction;
> surely you don't want that behavior to be *different* in parallel
> mode.  That would be a blatant, user-visible behavior difference
> justified by ... nothing at all.

That's just a mislabeled function. It's obviously not parallel safe at
all. I see absolutely no problem with erroring out.

> There's no hazard there.  Where you
> start getting into crash/exploit/data corruption territory is when you
> are talking about DDL operations that change the physical structure of
> the table.  That's why we have stuff like CheckTableNotInUse() to
> verify that, for example, there are no old cursors around that are
> still expecting the old relfilenode and tuple descriptor to be valid.

It's not just fully structural changes although they are a concern.
It's also that we've amassed a number of hacks to deal with local state
that just won't be nicely transported. What's with stuff like
RelationSetIndexList() (which is infrequently enough used to not be a
> big problem in practice...)? If we only allow parallel access while
> independent backends could acquire the relevant locks, we can rely on
> us already having taken care of the concurrency hazards. Otherwise not.

And before you say that'd prohibit a number of useful things - yes, it
might. But e.g. CREATE INDEX can still be parallelized. The primary
> backend takes an Exclusive/ShareUpdateExclusive (depending on
> CONCURRENTLY), workers just an AccessShare.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Sat, Nov 1, 2014 at 5:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> That's just a mislabeled function. It's obviously not parallel safe at
> all. I see absolutely no problem with erroring out.

I disagree.  It's entirely parallel-safe, as long as you don't
arbitrarily decide to have the lock manager break it.

>> There's no hazard there.  Where you
>> start getting into crash/exploit/data corruption territory is when you
>> are talking about DDL operations that change the physical structure of
>> the table.  That's why we have stuff like CheckTableNotInUse() to
>> verify that, for example, there are no old cursors around that are
>> still expecting the old relfilenode and tuple descriptor to be valid.
>
> It's not just fully structural changes although they are a concern.
> It's also that we've amassed a number of hacks to deal with local state
> that just won't be nicely transported. What's with stuff like
> RelationSetIndexList() (which is infrequently enough used to not be a
> big problem in practice...)? If we only allow parallel access while
> independent backends could acquire the relevant we can rely on us
> already having taken care about the concurrency hazards. Otherwise not.

RelationSetIndexList() is only used inside REINDEX, which I think
illustrates my point that it's mostly DDL we need to be worried about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 15 October 2014 05:03, Robert Haas <robertmhaas@gmail.com> wrote:

> At least to me, that simple scenario is clear-cut[1], but what do we
> do in more complicated situations?  For example, suppose backends A
> and B are members of the same locking group.  A locks a relation with
> AccessShareLock, an unrelated process X queues up waiting for an
> AccessExclusiveLock, and then B also requests AccessShareLock.  The
> normal behavior here is that B should wait for X to go first, but here
> that's a problem.  If A is the user backend and B is a worker backend,
> A will eventually wait for B, which is waiting for X, which is waiting
> for A: deadlock.

Yes, deadlock.

My understanding would be that the lead process would wait on a latch,
not a heavyweight lock. So it would never perform a deadlock
detection. Which leaves only X and B to perform the deadlock check.

Are you aware that the deadlock detector will reorder the lock queue,
if that presents a possible solution to the deadlock?

Would the above example not be resolved simply with the existing code?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 2 November 2014 10:39, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 15 October 2014 05:03, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> At least to me, that simple scenario is clear-cut[1], but what do we
>> do in more complicated situations?  For example, suppose backends A
>> and B are members of the same locking group.  A locks a relation with
>> AccessShareLock, an unrelated process X queues up waiting for an
>> AccessExclusiveLock, and then B also requests AccessShareLock.  The
>> normal behavior here is that B should wait for X to go first, but here
>> that's a problem.  If A is the user backend and B is a worker backend,
>> A will eventually wait for B, which is waiting for X, which is waiting
>> for A: deadlock.
>
> Yes, deadlock.
>
> My understanding would be that the lead process would wait on a latch,
> not a heavyweight lock. So it would never perform a deadlock
> detection. Which leaves only X and B to perform the deadlock check.
>
> Are you aware that the deadlock detector will reorder the lock queue,
> if that presents a possible solution to the deadlock?
>
> Would the above example not be resolved simply with the existing code?

Self Answer: Only if the lock request would not be self-conflicting.

Reading the patch (yay!) you say

*   ...However, if a conflicting lock is present that is
* not held by another member of our lock group, then we can't do this.
* In that case we'll have to wait despite the deadlock risk and hope for
* the best.

I wasn't sure what that meant.

Actually, I like the rest of the patch, assuming I understand it. Not
sure that changes anything I've said about wanting to see something
actually working, even as a prototype. For example, how will you test
the code in this patch actually works? The best way is to make
parallel query work in a restricted form and then drop that in
isolation tester, otherwise you'll have to waste time on a test
harness.

The procgloballist stuff should be the subject of a separate patch
which I agree with.

The main question is still why any of this is necessary? If we are
only acquiring fast path locks in workers, what need is there for any
of this? The leader and workers never conflict with each other, so why
would they ever wait for each other (except at a point where they are
actively transferring data), which is the necessary pre-condition for
a deadlock?

Yes, running a command with a conflicting lock would disrupt the start
of a parallel operation, but as I said above, the deadlock detector
will eventually see that and re-arrange us so there is no deadlock.

The only need I can see is if one of the workers/leader requests a
Strong lock on an object, someone else requests a conflicting lock on
that object and then we perform a parallel operation ourselves. We
could easily solve that by saying that you can't go parallel if you
hold a strong lock AND that workers can't take anything higher than
RowShareLock, like HS backends. That's good enough for now.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Greg Stark
Date:
On Sat, Nov 1, 2014 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 1. Any non-trivial piece of PostgreSQL code is likely to contain
> syscache lookups.
> 2. Syscache lookups had better work in parallel workers, or they'll be
> all but useless.


I've been using parallel sorts and index builds in my mental model of
how this will be used. I note that sorts go out of their way to look
up all the syscache entries in advance precisely so that tuplesort
doesn't start doing catalog lookups in the middle of the sort. In
general I think what people are imagining is that the parallel workers
will be running low-level code like tuplesort that has all the
databasey stuff like catalog lookups done in advance and just operates
on C data structures like function pointers. And I think that's a
valuable coding discipline to enforce; it avoids having low level
infrastructure calling up to higher level abstractions which quickly
becomes hard to reason about.

However in practice I think you're actually right -- but not for the
reasons you've been saying. I think the parallel workers *should* be
written as low level infrastructure and not be directly doing syscache
lookups or tuple locking etc. However there are a million ways in
which Postgres is extensible which causes loops in the call graph that
aren't apparent in the direct code structure. For instance, what
happens if the index you're building is an expression index or partial
index? Worse, what happens if those expressions have a plpython
function that does queries using SPI....

But those kinds of user code exploiting extensibility are the
situations where we need a deadlock detector and where you might need
this infrastructure. We wouldn't and shouldn't need a deadlock
detector for our own core server code. In an ideal world some sort of
compromise that enforces careful locking rules where all locks are
acquired in advance and parallel workers are prohibited from obtaining
locks in the core code while still allowing users to a free-for-all
and detecting deadlocks at runtime for them would be ideal. But I'm
not sure there's any real middle ground here.

-- 
greg



Re: group locking: incomplete patch, just for discussion

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Sat, Nov 1, 2014 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 2. Syscache lookups had better work in parallel workers, or they'll be
>> all but useless.

> I've been using parallel sorts and index builds in my mental model of
> how this will be used. I note that sorts go out of their way to look
> up all the syscache entries in advance precisely so that tuplesort
> doesn't start doing catalog lookups in the middle of the sort.

Well, they try to avoid doing the lookups more than once, but that does
not by any means imply that no lookups happen after the sort starts.
In particular, if you're sorting a container type (array, record)
there will be lookups during the first call of the sort type's comparison
function.  Enums could cause catalog lookups much later than the first
call, too.

I'm with Robert: if you cannot do catalog lookups in a worker process,
the feature will be so crippled as to be essentially useless.
        regards, tom lane



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Mon, Nov 3, 2014 at 10:18 AM, Greg Stark <stark@mit.edu> wrote:
> On Sat, Nov 1, 2014 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1. Any non-trivial piece of PostgreSQL code is likely to contain
>> syscache lookups.
>> 2. Syscache lookups had better work in parallel workers, or they'll be
>> all but useless.
>
> I've been using parallel sorts and index builds in my mental model of
> how this will be used. I note that sorts go out of their way to look
> up all the syscache entries in advance precisely so that tuplesort
> doesn't start doing catalog lookups in the middle of the sort.

This isn't really true.  Routines like varstr_cmp() do syscache
lookups and de-TOASTing.  We've recently tried to reduce that with the
sortsupport stuff, but it hasn't eliminated it completely, and there
are other cases.  This is really the key point that I think everybody
needs to understand: a lot of code that you think doesn't do anything
complicated actually reads from the database - generally either in the
form of syscache lookups, or in the form of de-TOASTing.  Most of that
code doesn't perform those operations *frequently*, but that's because
there are backend-private caches that minimize the number of times we
actually hit the database.  But restructuring all of that code to do
no database access whatsoever would be a very difficult proposition
and would involve painful sacrifices, which is, I imagine, why it is
the way it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Sun, Nov 2, 2014 at 7:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The main question is still why any of this is necessary? If we are
> only acquiring fast path locks in workers, what need is there for any
> of this? The leader and workers never conflict with each other, so why
> would they ever wait for each other (except at a point where they are
> actively transferring data), which is the necessary pre-condition for
> a deadlock?
>
> Yes, running a command with a conflicting lock would disrupt the start
> of a parallel operation, but as I said above, the deadlock detector
> will eventually see that and re-arrange us so there is no deadlock.
>
> The only need I can see is if one of the workers/leader requests a
> Strong lock on an object, someone else requests a conflicting lock on
> that object and then we perform a parallel operation ourselves. We
> could easily solve that by saying that you can't go parallel if you
> hold a strong lock AND that workers can't take anything higher than
> RowShareLock, like HS backends. That's good enough for now.

That would be enough to keep the parallel group from deadlocking with
itself, but it wouldn't be enough to prevent deadlock with other
processes, which is the scenario that actually worries me a lot more.
In particular, I'm worried about deadlocks where some other process
gets sandwiched between two processes from the same parallel group.
That is, let A1 be a parallel group leader and A2 be a process from
the same parallel group, and let B be an unrelated process.  I'm
worried about the situation where A2 waits for B which waits for A1
which (outside the view of the deadlock detector) waits for A2.  All
that can happen even if the parallel backends hold no strong locks,
because B can hold strong locks.
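That sandwich can be illustrated with a small wait-for-graph sketch (hypothetical Python, not the actual deadlock detector): group membership adds implicit edges, since the leader is presumed to eventually wait for its workers and vice versa, and a deadlock is a cycle that includes at least one real lock wait. A cycle made purely of intra-group edges is harmless; that is just the group coordinating.

```python
def detect_deadlock(waits_for, groups):
    """waits_for: dict mapping a process to the set of processes it is
    blocked behind on a heavyweight lock.  groups: sets of processes whose
    members are presumed to eventually wait for one another."""
    graph = {}
    def add(u, v):
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set())
    for u, vs in waits_for.items():
        for v in vs:
            add(u, v)
    # Implicit mutual edges among members of the same lock group.
    for g in groups:
        for u in g:
            for v in g:
                if u != v:
                    add(u, v)
    def reachable(src, dst):
        seen, todo = set(), [src]
        while todo:
            n = todo.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                todo.extend(graph.get(n, ()))
        return False
    # Deadlock: some real lock-wait edge u -> v closes a cycle back to u.
    return any(reachable(v, u)
               for u, vs in waits_for.items() for v in vs)

# The "sandwich": A2 waits for unrelated B, B waits for A1, and the group
# assumption supplies A1 -> A2.  The real edge A2 -> B closes a cycle.
assert detect_deadlock({"A2": {"B"}, "B": {"A1"}}, [{"A1", "A2"}])
# Without the group edges, the same waits are acyclic.
assert not detect_deadlock({"A2": {"B"}, "B": {"A1"}}, [])
```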

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Simon Riggs
Date:
On 3 November 2014 17:00, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 2, 2014 at 7:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The main question is still why any of this is necessary? If we are
>> only acquiring fast path locks in workers, what need is there for any
>> of this? The leader and workers never conflict with each other, so why
>> would they ever wait for each other (except at a point where they are
>> actively transferring data), which is the necessary pre-condition for
>> a deadlock?
>>
>> Yes, running a command with a conflicting lock would disrupt the start
>> of a parallel operation, but as I said above, the deadlock detector
>> will eventually see that and re-arrange us so there is no deadlock.
>>
>> The only need I can see is if one of the workers/leader requests a
>> Strong lock on an object, someone else requests a conflicting lock on
>> that object and then we perform a parallel operation ourselves. We
>> could easily solve that by saying that you can't go parallel if you
>> hold a strong lock AND that workers can't take anything higher than
>> RowShareLock, like HS backends. That's good enough for now.
>
> That would be enough to keep the parallel group from deadlocking with
> itself, but it wouldn't be enough to prevent deadlock with other
> processes, which is the scenario that actually worries me a lot more.
> In particular, I'm worried about deadlocks where some other process
> gets sandwiched between two processes from the same parallel group.
> That is, let A1 be a parallel group leader and A2 be a process from
> the same parallel group, and let B be an unrelated process.  I'm
> worried about the situation where A2 waits for B which waits for A1
> which (outside the view of the deadlock detector) waits for A2.  All
> that can happen even if the parallel backends hold no strong locks,
> because B can hold strong locks.

OK, I think I'm happy with this as a general point.

How will we automatically test this code?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Mon, Nov 3, 2014 at 1:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> OK, I think I'm happy with this as a general point.

Cool!

> How will we automatically test this code?

Good question.  I can see two possible approaches:

1. We write a test_group_locking harness along the lines of
test_shm_mq and test_decoding and put that in contrib.

2. We wait until higher-level facilities built on top of this are
available and get regression test coverage of this code via those
higher-level modules.

Personally, I can't imagine debugging this code without writing some
kind of test harness that only does locking; I don't want my locking
bugs to be mixed with my planner and executor bugs.  But I could go
either way on actually putting that code in contrib.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Sun, Nov 2, 2014 at 7:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The procgloballist stuff should be the subject of a separate patch
> which I agree with.

Yes, I think that's probably a net improvement in robustness quite
apart from what we decide to do about any of the rest of this.  I've
attached it here as revise-procglobal-tracking.patch and will commit
that bit if nobody objects.  The remainder is reattached without
change as group-locking-v0.1.patch.

Per your other comment, I've developed the beginnings of a testing
framework which I attached here as test_group_locking-v0.patch.  That
doesn't look to have much hope of evolving into something we'd want
even in contrib, but I think it'll be rather useful for debugging.  It
works like this:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# select
test_group_locking('1.0:start,2.0:start,1.0:lock:AccessExclusiveLock:foo,2.0:lock:AccessExclusiveLock:foo');
NOTICE:  starting worker 1.0
NOTICE:  starting worker 2.0
NOTICE:  instructing worker 1.0 to acquire AccessExclusiveLock on
relation with OID 16387
NOTICE:  instructing worker 2.0 to acquire AccessExclusiveLock on
relation with OID 16387
ERROR:  could not obtain AccessExclusiveLock on relation with OID 16387
CONTEXT:  background worker, group 2, task 0

The syntax is a little arcane, I guess, but it's "documented" in the
comments within.  In this case I asked it to start up two background
workers and have them both try to take AccessExclusiveLock on table
foo.  As expected, the second one fails.  The idea is that workers are
identified by a pair of numbers X.Y; two workers with the same X-value
are in the same locking group.  So if I call the second worker 1.1
rather than 2.0, it'll join the same locking group as worker 1.0 and
... then it does the wrong thing, and then it crashes the server,
because my completely-untested code is unsurprisingly riddled with
bugs.
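For concreteness, the spec syntax is simple enough to parse in a few lines. This is a hypothetical Python re-implementation of the harness's little command language, not the C code from the patch: steps are comma-separated, and each step is "X.Y:command[:args]" where X is the lock-group number and Y the worker number within the group.

```python
# Hypothetical parser for the test_group_locking spec string.
def parse_spec(spec):
    steps = []
    for step in spec.split(","):
        parts = step.split(":")
        group, task = (int(n) for n in parts[0].split("."))
        command, args = parts[1], parts[2:]
        steps.append({"group": group, "task": task,
                      "command": command, "args": args})
    return steps

steps = parse_spec("1.0:start,1.1:start,"
                   "1.0:lock:AccessExclusiveLock:foo,"
                   "1.1:lock:AccessExclusiveLock:foo")
assert steps[0] == {"group": 1, "task": 0, "command": "start", "args": []}
assert steps[2]["args"] == ["AccessExclusiveLock", "foo"]
# Workers 1.0 and 1.1 share the same group number, so under group locking
# both AccessExclusiveLock requests should be granted.
assert steps[2]["group"] == steps[3]["group"]
```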

Eventually, this needs to be generalized a bit so that we can use it
to test deadlock detection.  That's tricky, because what you really
want to do is tell worker A to wait for some lock and then, once
you're sure it's on the wait queue, tell worker B to go take some
other lock and check that you see the resulting deadlock.  There
doesn't seem to be a good API for the user backend to find out whether
some background worker is waiting for some particular lock, so I may
have to resort to the hacky expedient of having the driver process
wait for a few seconds and assume that's long enough that the
background worker will be on the wait queue by then.  Or maybe I can
drum up some solution, but anyway it's not done yet.

The value of this test code is that we can easily reproduce locking
scenarios which would be hard to reproduce in a real workload - e.g.
because they're timing-dependent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Wed, Nov 5, 2014 at 9:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yes, I think that's probably a net improvement in robustness quite
> apart from what we decide to do about any of the rest of this.  I've
> attached it here as revise-procglobal-tracking.patch and will commit
> that bit if nobody objects.  The remainder is reattached without
> change as group-locking-v0.1.patch.
>
> Per your other comment, I've developed the beginnings of a testing
> framework which I attached here as test_group_locking-v0.patch.

Urk.  I attached a version with some stupid bugs.  Here's a slightly better one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: group locking: incomplete patch, just for discussion

From
Jim Nasby
Date:
On 11/5/14, 8:26 PM, Robert Haas wrote:
> rhaas=# create table foo (a int);
> CREATE TABLE
> rhaas=# select
test_group_locking('1.0:start,2.0:start,1.0:lock:AccessExclusiveLock:foo,2.0:lock:AccessExclusiveLock:foo');
> NOTICE:  starting worker 1.0
> NOTICE:  starting worker 2.0
> NOTICE:  instructing worker 1.0 to acquire AccessExclusiveLock on
> relation with OID 16387
> NOTICE:  instructing worker 2.0 to acquire AccessExclusiveLock on
> relation with OID 16387

Damn that's cool to see! :)

FWIW, some creative use of a composite type, plpgsql and string_to_array() would have saved a bunch of parsing code...
it's surprisingly easy to parse stuff like this using string_to_array().
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Maybe we can, as a first step, make those edges in the lock graph
> visible to the deadlock detector? It's pretty clear that undetected
> deadlocks aren't ok, but detectable deadlocks in a couple corner cases
> might be acceptable.

An interesting point came to mind this morning on this topic, while
discussing this issue with Amit.  Suppose process A1 holds a lock on
some object and process A2, a member of the same locking group, waits
for a lock on that same object.  It follows that there must exist a
process B which either *holds* or *awaits* a lock which conflicts with
the one requested by A2; otherwise, A2 would have been granted the
lock at once.  If B *holds* the conflicting lock, it may eventually
release it; but if B *awaits* the conflicting lock, we have a soft
deadlock, because A2 waits for B, which waits for A1, which we
presume to wait for A2.[1]

The logical consequence of this is that, if a process seeks to acquire
a lock which is already held (in any mode) by a fellow group-member,
it had better jump over the entire wait queue and acquire the lock
immediately if no conflicting locks have already been granted.  If it
fails to do this, it immediately creates a soft deadlock.  What if
there *are* conflicting locks already granted?  In that case, things
are more complicated.  We could, for example, have this situation: A1
holds AccessShareLock, B holds ExclusiveLock, C1 awaits ShareLock, and
then (following it in the wait queue) A2 awaits RowExclusiveLock.
This is not a deadlock, because B can finish, then C1 can get the lock,
then C1 can finish, then A2 can get the lock.  But if C2 now waits for
some member of group A, then we have a soft deadlock between group A
and group C, which can be resolved by moving A2's request ahead of C1.
(This example is relatively realistic, too: a lock upgrade from
AccessShareLock to RowExclusiveLock is probably the most common type
of lock upgrade, for reasons that are hopefully obvious.)
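The grant rule this implies can be sketched as follows (illustrative Python using a fragment of PostgreSQL's standard lock conflict table; `can_jump_queue` is an invented name, not a function from the patch). A requester whose lock group already holds the lock jumps the wait queue, but only if no conflicting lock has been granted to a process outside the group.

```python
# Fragment of PostgreSQL's standard table-level lock conflict table.
CONFLICTS = {
    "AccessShareLock":  {"AccessExclusiveLock"},
    "RowExclusiveLock": {"ShareLock", "ShareRowExclusiveLock",
                         "ExclusiveLock", "AccessExclusiveLock"},
    "ShareLock":        {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                         "ShareRowExclusiveLock", "ExclusiveLock",
                         "AccessExclusiveLock"},
    "ExclusiveLock":    {"RowShareLock", "RowExclusiveLock",
                         "ShareUpdateExclusiveLock", "ShareLock",
                         "ShareRowExclusiveLock", "ExclusiveLock",
                         "AccessExclusiveLock"},
}

def can_jump_queue(granted, group, mode):
    """granted: list of (process, group, mode) already holding the lock.
    Jump the queue iff a fellow group member holds the lock and nothing
    granted outside the group conflicts with the requested mode."""
    group_holds = any(g == group for _, g, _ in granted)
    conflict = any(m in CONFLICTS[mode]
                   for _, g, m in granted if g != group)
    return group_holds and not conflict

# The example from the text: A1 (group A) holds AccessShareLock and B
# (unrelated) holds ExclusiveLock; A2 requests RowExclusiveLock.
granted = [("A1", "A", "AccessShareLock"), ("B", None, "ExclusiveLock")]
# A2 cannot be granted immediately: B's ExclusiveLock conflicts.
assert not can_jump_queue(granted, "A", "RowExclusiveLock")
# Once B releases, A2's request should leapfrog C1's pending ShareLock.
assert can_jump_queue([("A1", "A", "AccessShareLock")],
                      "A", "RowExclusiveLock")
```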

In practice, I suspect it's best to take a more aggressive approach
and have A2 jump straight to the head of the line right out of the
chute.  Letting some parallel workers make progress while holding up
others out of a notion of locking fairness is probably stupid, because
they're probably dependent on each other in some way that makes it
useless for one to make progress while the others don't.  For the same
reason, I'd argue that we ought to wait until all pending lock
requests from a given group can be satisfied before granting any of
them.  In fact, I suspect it's best in general for the locking
machinery and the deadlock detector to view several simultaneous,
pending lock requests as if they were a single request conflicting
with everything that would conflict with any of the requested modes.
Consider the situation where there are 10 lock waiters, 5 in each of
two parallel groups.  The deadlock detector could spend a good deal of
time and energy trying various reorderings of the lock queue, but in
practice it seems highly unlikely that it's sensible to do anything
other than put all the locks from one group first and all of the locks
from the other group afterwards, regardless of whether the processes
want the same mode or different modes.  Most of the other orderings
will either be equivalent to one of those orderings (because commuting
two non-conflicting locks in the wait queue has no effect) or soft
deadlocks (because they interleave a conflicting lock request between
two requests from the same group).  And even if you can find some
corner case where neither of those things is the case, it's probably
still not useful to let half the lock group run and leave the other
half blocked.  What will probably happen is that some internal queue
will pretty quickly either fill up or drain completely and the process
filling it, or draining it, will wait.  Now we're no better off than
if we'd waited to grant the lock in the first place, and we're now
holding more locks that can block other things or cause deadlocks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1]  Here I'm assuming, as in previous discussion, that we regard all
members of a parallel group as mutually waiting for each other, on the
theory that the parallel computation won't complete until all workers
complete and, therefore, the members of the group are either already
waiting for each other or eventually will.  While this might not be
true in every case, I believe it's a good (and generally valid)
simplifying assumption.



Re: group locking: incomplete patch, just for discussion

From
Amit Kapila
Date:
On Fri, Nov 7, 2014 at 3:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Maybe we can, as a first step, make those edges in the lock graph
> > visible to the deadlock detector? It's pretty clear that undetected
> > deadlocks aren't ok, but detectable deadlocks in a couple corner cases
> > might be acceptable.
>
> An interesting point came to mind this morning on this topic, while
> discussing this issue with Amit.  Suppose process A1 holds a lock on
> some object and process A2, a member of the same locking group, waits
> for a lock on that same object.  It follows that there must exist a
> process B which either *holds* or *awaits* a lock which conflicts with
> the one requested by A2; otherwise, A2 would have been granted the
> lock at once.  If B *holds* the conflicting lock, it may eventually
> release it; but if B *awaits* the conflicting lock, we have a soft
> deadlock, because A2 waits for B, which waits for A1, which we
> presume to wait for A2.[1]
>
> The logical consequence of this is that, if a process seeks to acquire
> a lock which is already held (in any mode) by a fellow group-member,
> it had better jump over the entire wait queue and acquire the lock
> immediately if no conflicting locks have already been granted.  If it
> fails to do this, it immediately creates a soft deadlock.  What if
> there *are* conflicting locks already granted?  In that case, things
> are more complicated.  We could, for example, have this situation: A1
> holds AccessShareLock, B holds ExclusiveLock, C1 awaits ShareLock, and
> then (following it in the wait queue) A2 awaits RowExclusiveLock.
> This is not a deadlock, because B can finish, then C1 can get the lock,
> then C1 can finish, then A2 can get the lock.  But if C2 now waits for
> some member of group A, then we have a soft deadlock between group A
> and group C, which can be resolved by moving A2's request ahead of C1.
> (This example is relatively realistic, too: a lock upgrade from
> AccessShareLock to RowExclusiveLock is probably the most common type
> of lock upgrade, for reasons that are hopefully obvious.)
>
> In practice, I suspect it's best to take a more aggressive approach
> and have A2 jump straight to the head of the line right out of the
> chute.  Letting some parallel workers make progress while holding up
> others out of a notion of locking fairness is probably stupid, because
> they're probably dependent on each other in some way that makes it
> useless for one to make progress while the others don't.  For the same
> reason, I'd argue that we ought to wait until all pending lock
> requests from a given group can be satisfied before granting any of
> them.  

I think this logic can sometimes block processes from acquiring a lock
which no one is holding.  Assume Group-1 (G-1) is waiting on two locks
(Lock L1 on Table T1 and Lock L2 on Table T2) which are held by
unrelated processes P-2 and P-3 respectively; now P-3 releases the
lock on table T2 but cannot grant it to G-1, which is waiting on this
lock, because lock L1 still cannot be granted to G-1.  At this point the
situation is that no one holds the lock on T2 and G-1 waits for it;
however, no new process will be able to acquire the lock on T2 either,
as there is already a waiter for it in the wait queue.  OTOH if we had
granted the lock on T2 to G-1, it might have finished its work and
released the lock (considering it is a short-duration lock on a catalog
relation).  If it is possible to have such a situation, then even though
it is beneficial in certain cases to view several pending requests from
the same group as a single request, it can at the same time be harmful
for some unrelated processes.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Mon, Nov 10, 2014 at 6:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think this logic can sometimes block processes from acquiring a lock
> which no one is holding.  Assume Group-1 (G-1) is waiting on two locks
> (Lock L1 on Table T1 and Lock L2 on Table T2) which are held by
> unrelated processes P-2 and P-3 respectively; now P-3 releases the
> lock on table T-2 and cannot grant it to G-1 who is waiting on this lock
> because still L-1 lock cannot be granted to G-1.

That's not what I meant.  I meant we shouldn't grant a lock *on a
particular object* to any process in the group until all lock requests
*for that object* can be granted.  Waiting until every awaited lock in
the system can be granted in all requested modes would be really hard
to implement and probably have all kinds of problems (as you point out
here).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Jeff Davis
Date:
On Wed, 2014-10-15 at 00:03 -0400, Robert Haas wrote:
> For parallelism, I think we need a concept of group locking.  That is,
> suppose we have a user backend and N worker backends collaborating to
> execute some query.  For the sake of argument, let's say it's a
> parallel CLUSTER, requiring a full table lock.  We need all of the
> backends to be able to lock the table even though at least one of them
> holds AccessExclusiveLock.  This suggests that the backends should all
> be members of a locking group, and that locks within the same locking
> group should be regarded as mutually non-conflicting.

Trying to catch up on this thread, please excuse me if these questions
are already covered.

You mention the possibility of undetected deadlocks, which is surely
unacceptable. But why not improve the deadlock detector? How bad are
_detected_ deadlocks? A lot of the concern was around catalog accesses,
but those lock requests would rarely wait anyway.

I also wonder if group locking is generally the wrong approach to
parallelism. Parallel scan/sort should work by assigning workers to
chunks of data, and then merging the results. In principle the workers
don't need to touch the same data at all, so why are we trying so hard
to get them to all take the same locks?

The reason, I assume, is that a full table is the lockable object, but
we are imagining the segment files as the chunks. But maybe there's a
way to address this more directly with an extra field in the lock tag,
and perhaps some catalog knowledge?

Regards,
    Jeff Davis





Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Wed, Nov 12, 2014 at 2:57 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> Trying to catch up on this thread, please excuse me if these questions
> are already covered.

Welcome to the party.  The more, the merrier.  :-)

> You mention the possibility of undetected deadlocks, which is surely
> unacceptable. But why not improve the deadlock detector? How bad are
> _detected_ deadlocks? A lot of the concern was around catalog accesses,
> but those lock requests would rarely wait anyway.

Detected deadlocks are fine.  Improving the deadlock detector is the
heart of what needs to be done here.  As you say, the lock requests
we're talking about will rarely wait, so deadlocks won't be frequent.
The issue is making sure that, if they do happen, we get a better
behavior than "your parallel query hangs forever; good luck figuring
out why".

> I also wonder if group locking is generally the wrong approach to
> parallelism. Parallel scan/sort should work by assigning workers to
> chunks of data, and then merging the results. In principle the workers
> don't need to touch the same data at all, so why are we trying so hard
> to get them to all take the same locks?
>
> The reason, I assume, is that a full table is the lockable object, but
> we are imagining the segment files as the chunks. But maybe there's a
> way to address this more directly with an extra field in the lock tag,
> and perhaps some catalog knowledge?

In the common case, we're only taking AccessShareLock on some relation
to prevent it from being concurrently dropped or rewritten.  Locking
only part of the relation wouldn't lead to any improvement in
concurrency, because the full-relation lock isn't conflicting with
anything that the partial-relation lock wouldn't also need to conflict
with.

More generally, I think there's some misunderstanding about the
overall goal of the parallelism infrastructure that I'm trying to
create.  Many people have proposed interesting strategies for how we
could do various things (locking, planning, execution) better.  Some
of those ideas are, doubtless, good ones.  But my goal is in some ways
the opposite: I'm trying to make it possible to run as much existing
PostgreSQL backend code as possible inside a parallel worker without
any modification.  To do that, I'm trying to modify the subsystems
that are widely used throughout the backend - such as locking - in
such a way that the people using that infrastructure need not put
conditional logic into their code to make it do one thing when in
parallel mode and another thing when not in parallel mode.  And I'm
also trying to avoid fundamentally changing the way major subsystems
work today as a precondition of parallelism.

So I'm taking a very surgical approach to the patches I propose.  I'm
not proposing patches that do anything fundamentally new or different;
instead, I'm proposing patches that let the stuff we already do
continue to work.  In a single backend executing a query, we have lots
of code that assumes you can allocate memory, look data up in a
syscache, throw an error, lock a buffer, examine the value of a GUC,
test a tuple against the active snapshot, and so on.
For parallel mode to be useful, those same things need to work in a
parallel worker.  Of course, not every crazy thing somebody may want
to do in a parallel worker needs to work, nor will it.  But anything that's
done in thousands of places throughout the code had better work, or
parallelism will be so restricted as to be useless.  So commit
2bd9e412f92bc6a68f3e8bcb18e04955cc35001d, for example, is about
letting "ereport" work in a parallel worker, and this patch is about
making things like "SearchSysCache1" and "PG_GETARG_TEXT_PP" work
*reliably* in a parallel worker.

Now, the time to do new and different things is coming, and is maybe
even now not so very far away.  I'm not an expert on parallel
execution, and I'm sure there are a number of people far more skilled
than I at figuring out how to do a parallel scan, join, sort, or
whatever quickly.  What I'm trying to do is get the infrastructure to
a point where those people don't have to worry about whether the basic
facilities that we rely on throughout the backend are gonna work.

By way of comparison, think about the periodic discussions about
making the backend multi-threaded.  Since the backend relies on global
variables in an enormous number of places, it isn't thread-safe.  If
you spawn a thread inside a PostgreSQL backend, it can't safely
ereport(), palloc(), LWLockAcquire(), or just about anything else,
which means that, although anybody can write code that calls
pthread_create(), it's not useful to do so because there's practically
no existing backend code that you could safely call from the new
thread.  Using background worker processes eliminates a lot of those
problems - palloc and lwlocks just work, for example - but not all of
them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Jeff Davis
Date:
On Wed, 2014-11-12 at 14:16 -0500, Robert Haas wrote:
> Detected deadlocks are fine.  Improving the deadlock detector is the
> heart of what needs to be done here.

OK, great.

> As you say, the lock requests
> we're talking about will rarely wait, so deadlocks won't be frequent.
> The issue is making sure that, if they do happen, we get a better
> behavior than "your parallel query hangs forever; good luck figuring
> out why".

Right. We can still use this patch's notion of a lock group in the
deadlock detector, but we don't need it to actually affect the way a
lock is granted. That should eliminate concerns about subtle bugs.

Later, after we understand how this is actually used, and if we see
deadlock problems, we can look for ways to solve/mitigate them.

This seems to be what Andres was saying, here:

http://www.postgresql.org/message-id/20141031130727.GF13584@awork2.anarazel.de

So I'll follow up in that thread, because it's an interesting
discussion.

> More generally, I think there's some misunderstanding about the
> overall goal of the parallelism infrastructure that I'm trying to
> create.  ...  But my goal is in some ways
> the opposite: I'm trying to make it possible to run as much existing
> PostgreSQL backend code as possible inside a parallel worker without
> any modification.

Thank you for clarifying, I think this is a good approach.

Back to the patch:

If I understand correctly, the _primary_ goal of this patch is to make
it safe to take out heavyweight locks in worker processes, even if the
deadlock involves LWLocks/latches synchronizing among the processes
within a lock group.

For example, say processes A1 and A2 are in the same lock group, and B
is in a different lock group. A1 is holding heavyweight lock H1 and
waiting on a LW lock L1; A2 is holding L1 and waiting on heavyweight
lock H2; and B is holding H2 and waiting on H1.

The current deadlock detector would see a dependency graph like:
  A2 -> B -> A1

But with lock groups, it would see:
  (A1 A2) -> B -> (A1 A2)

which is a cycle, and can be detected regardless of the synchronization
method used between A1 and A2. There are some details to work out to
avoid false positives, of course.
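
A toy sketch of that collapse (nothing like the real deadlock.c; all
names here are invented for illustration): merge each process into its
group's node, keep only the inter-group wait-for edges, and look for a
cycle.  The intra-group edges need not be represented at all, because
the collapse itself encodes the assumption that group members wait for
each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_GROUPS 16

/* Collapsed wait-for graph: edge[g][h] means some member of group g
 * waits for some member of group h. */
static bool edge[MAX_GROUPS][MAX_GROUPS];

/* Depth-first search; a back edge to a node on the current stack
 * means the collapsed graph has a cycle. */
static bool
dfs(int g, int ngroups, bool *visited, bool *onstack)
{
    visited[g] = onstack[g] = true;
    for (int h = 0; h < ngroups; h++)
    {
        if (!edge[g][h])
            continue;
        if (onstack[h])
            return true;
        if (!visited[h] && dfs(h, ngroups, visited, onstack))
            return true;
    }
    onstack[g] = false;
    return false;
}

/*
 * group_of[p] maps process p to its lock group; waits_for[p] is the
 * process p is blocked behind, or -1 if p is not waiting.  Returns
 * true if the group-collapsed wait-for graph contains a cycle.
 */
static bool
has_group_deadlock(int nproc, const int *group_of, const int *waits_for,
                   int ngroups)
{
    bool        visited[MAX_GROUPS] = {false};
    bool        onstack[MAX_GROUPS] = {false};

    memset(edge, 0, sizeof(edge));
    for (int p = 0; p < nproc; p++)
        if (waits_for[p] >= 0 && group_of[p] != group_of[waits_for[p]])
            edge[group_of[p]][group_of[waits_for[p]]] = true;

    for (int g = 0; g < ngroups; g++)
        if (!visited[g] && dfs(g, ngroups, visited, onstack))
            return true;
    return false;
}
```

With the group mapping, the chain A2 -> B -> A1 above collapses to
(A) -> B -> (A), a cycle; with every process in its own group the same
chain is acyclic and the deadlock would go undetected.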

Is that about right?

Regards,
    Jeff Davis





Re: group locking: incomplete patch, just for discussion

From
Jeff Davis
Date:
Great discussion.

Robert, I think you make a lot of very valid points, but philosophically
I am in pretty strong agreement with Andres here.

On Fri, 2014-10-31 at 14:29 -0400, Robert Haas wrote:
> > So just refusing parallelism as soon as anything has taken an access
> > exclusive lock doesn't sound too bad to me.
> 
> That restriction seems onerous to me; for example, it would prevent a
> parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL.
> Those seems like worthy targets for parallelism, and maybe even early
> targets, since I expect a lot of the real complexity here to be in the
> query planner, and you can avoid most of that complexity when planning
> DDL.

If two backends both have an exclusive lock on the relation for a join
operation, that implies that they need to do their own synchronization,
because obviously the lock manager is not doing it for them.

In turn, that implies that we need parallel versions of these
operations; going to each one and parallelizing it. I don't think that's
a good long-term strategy... it changes our simple executor (tuples in,
tuples out) to something quite different (each node needs to know a lot
about whatever is underneath it in order to divide the data up).

> process and both requests would have been granted.  Parallelism is
> generally not about making different things happen; it's about
> spreading the stuff that would happen anyway across multiple backends,

This point would be much stronger for declarative queries. When the user
issues LOCK TABLE, they are getting their hands dirty with the
implementation to some degree. If they really want to serialize, they
can serialize around the lock of an object that they don't want to also
read (in parallel).

You're right that this would be surprising to users, but hopefully we
can find a way to disable parallelism in that case, or at least issue a
WARNING that the user is probably doing the wrong thing.

> "cluster pg_collate"... the only solution is to
> move the parallel workers ahead of the AccessExclusiveLock request,
> but the deadlock detector won't do that unless it knows about the
> relationship among the parallel workers.

As discussed later in this thread, I also consider failure a
solution ;-) So we don't need to move anything.

If we have to tell our users that exclusive-locking a catalog can cause
deadlocks for parallel query, I think they will find that reasonable.
The same admin could turn parallelism down to one for non-superusers
before they start doing that, and restore it later.


> patch that does parallel anything has got to worry about what
> undetected deadlocks might result, and then argue about which of them
> are obscure enough that we shouldn't care and which are not.  That
> doesn't sound like a fun way to spend my time.

We're in agreement that deadlocks can't go undetected.

> But, really, the first case is the one I'm more worried about: a bunch
> of parallel backends all grab the same locks, but slightly separated
> in time.  A strong locker joins the queue midway through and we're
> hosed.

I agree the situation sounds bad in the abstract, but none of the
examples seem very compelling. Who locks random catalogs exclusively?
And if they do, why is it the lock manager's job to try to reconcile the
resulting mess without canceling someone? (OK, don't answer that. It
obviously is the lock manager's job to some degree, but we can only
demand so much.)

> Even if you think that all of these particular issues are somehow
> avoidable or that they don't matter, I think it would be unwise to bet
> on them being the only issues.  The lock manager looks like it's old,
> ill-maintained code, but that's in no small part because it solved the
> problem it aims at so thoroughly that few people have felt the need to
> modify it very much in the last decade.  I think that's part of why
> there's so much skepticism on this thread - we don't think about the
> problems that the lock manager solves as being important problems not
> because they truly aren't important, but because they are so
> thoroughly solved.  Then every once in a while we complain about the
> performance.

Guilty as charged.

I won't forever oppose changes to the lock manager, but I resist making
them a precondition of parallelism (aside from making deadlock detection
work, of course). Users taking out exclusive locks recklessly can result
in deadlocks today, and will result in more when we do parallelism
(though perhaps in more surprising situations).

I remember when doing Exclusion Constraints, I went to a lot of effort
to make deadlocks impossible, and was quite proud. When Tom saw it, he
told me not to bother, and to do it the simple way instead, because
deadlocks can happen even with UNIQUE constraints (which I didn't even
know).

We should use the same strategy here. When we see deadlocks becoming a
problem for any reasonable workload, we make a series of tweaks (perhaps
some to the lock manager itself) to reduce them.

Regards,
    Jeff Davis





Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 13, 2014 at 2:26 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> The current deadlock detector would see a dependency graph like:
>
>    A2 -> B -> A1
>
> But with lock groups, it would see:
>
>    (A1 A2) -> B -> (A1 A2)
>
> which is a cycle, and can be detected regardless of the synchronization
> method used between A1 and A2. There are some details to work out to
> avoid false positives, of course.
>
> Is that about right?

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 13, 2014 at 3:38 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> If two backends both have an exclusive lock on the relation for a join
> operation, that implies that they need to do their own synchronization,
> because obviously the lock manager is not doing it for them.

This doesn't make sense to me.  Why would they need to synchronize
access to a relation in order to join it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Nov 13, 2014 at 3:38 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> If two backends both have an exclusive lock on the relation for a join
>> operation, that implies that they need to do their own synchronization,
>> because obviously the lock manager is not doing it for them.

> This doesn't make sense to me.  Why would they need to synchronize
> access to a relation in order to join it?

What's more to the point: why would you take an exclusive lock just to
do a join?
        regards, tom lane



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On November 13, 2014 8:50:18 PM CET, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Nov 13, 2014 at 3:38 AM, Jeff Davis <pgsql@j-davis.com>
>wrote:
>>> If two backends both have an exclusive lock on the relation for a
>join
>>> operation, that implies that they need to do their own
>synchronization,
>>> because obviously the lock manager is not doing it for them.
>
>> This doesn't make sense to me.  Why would they need to synchronize
>> access to a relation in order to join it?
>
>What's more to the point: why would you take an exclusive lock just to
>do a join?

Robert's case basically rests on the premise that it's useful & correct
for normally conflicting locks not to conflict when a normal process and
its workers acquire them. I have serious doubts about the safety of that
and the complexity it requires.

Obviously a join won't need an exclusive lock itself - but with Robert's
design a join inside a transaction that locked a relation could still be
parallelized, even if that rel is involved.
 

Andres

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Jeff Davis
Date:
On Thu, Nov 13, 2014 at 11:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 13, 2014 at 3:38 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> > If two backends both have an exclusive lock on the relation for a join
> > operation, that implies that they need to do their own synchronization,
> > because obviously the lock manager is not doing it for them.
>
> This doesn't make sense to me.  Why would they need to synchronize
> access to a relation in order to join it?

Unfortunate typo: that was supposed to be "joint" operation, just meaning
that they are working together for something (e.g. CLUSTER, VACUUM FULL
as you suggest). Sorry for the confusion.

I meant that the backends need to divide up the work somehow. And if each
operator needs to divide up the work before operating, that means we need
to change every operator.

Regards,
    Jeff Davis

Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 13, 2014 at 4:57 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, Nov 13, 2014 at 11:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Nov 13, 2014 at 3:38 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> > If two backends both have an exclusive lock on the relation for a join
>> > operation, that implies that they need to do their own synchronization,
>> > because obviously the lock manager is not doing it for them.
>>
>> This doesn't make sense to me.  Why would they need to synchronize
>> access to a relation in order to join it?
>
> Unfortunate typo: that was supposed to be "joint" operation, just meaning
> that they are working together for something (e.g. CLUSTER, VACUUM FULL as
> you suggest). Sorry for the confusion.

Ah, well, that's different.

> I meant that the backends need to divide up the work somehow. And if each
> operator needs to divide up the work before operating, that means we need to
> change every operator.

I think I see your point, which it just so happens Amit articulated to
me in different words while we were chatting about this problem this
morning.  We want to avoid waiving the mutual exclusion provided by
the lock manager only to end up re-implementing something very much
like the current lock manager to arbitrate resource contention among
backends within a parallel group.  However, we also don't want the
mutual exclusion that the lock manager provides to serve as a barrier
to implementing useful parallelism; that is, if the parallel backends
want to manage conflicts themselves instead of letting the lock
manager do it, that should be possible.

In http://www.postgresql.org/message-id/CA+TgmoYGpLoJo+LG1beBOs8gdjwjTQ0qdmxsYJ4ihFyJ11Tr-g@mail.gmail.com
I propose a new approach.  The basic idea is that any heavyweight lock
that we hold at the start of a parallel phase can't represent a bona
fide conflict between the activities of two different backends, so we
arrange to waive conflicts over those locks.  The problem (as Andres
points out) is that those locks could be subsequently re-taken by
operations that are not parallel-safe, and the new locks would be
granted locally regardless of the shared lock table because the
current backend would already hold them.  However, I think it's fine
to make it the job of the parallelism infrastructure to plug that
hole, rather than trying to enforce it in the lock manager.  There are
LOTS of things that need to be prohibited in parallel mode and only
allowed once they've been made sufficiently safe, and in the first
version, that's certainly going to include - among other things - any
operation that writes data.  I think it's OK to make it the job of the
people who want to relax those prohibitions to deal with making it safe
to do so.

To reiterate the basic problem here, if we do nothing at all about the
lock manager, a parallel backend can stall trying to grab an
AccessExclusiveLock that the user backend already holds; whether
because the user backend holds an AccessExclusiveLock as well, or
because some other process is waiting for one, we'll deadlock and not
notice.  We
have to do *something* about that.  I'd like to arrive at some
consensus on how to move forward here as soon as possible, because the
clock is ticking here.
http://www.postgresql.org/message-id/CA+TgmoYGpLoJo+LG1beBOs8gdjwjTQ0qdmxsYJ4ihFyJ11Tr-g@mail.gmail.com
articulates what I currently believe to be the best plan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Jeff Davis
Date:
On Mon, 2014-11-17 at 14:32 -0500, Robert Haas wrote:
> I think I see your point, which it just so happens Amit articulated to
> me in different words while we were chatting about this problem this
> morning.  We want to avoid waiving the mutual exclusion provided by
> the lock manager only to end up re-implementing something very much
> like the current lock manager to arbitrate resource contention among
> backends within a parallel group.

Even worse, it seems like you lose composability of executor nodes.
Right now, you can rearrange executor nodes in the planner quite easily,
because they all take tuples as input and give tuples as output (with
few exceptions).

What is the input to a parallelized executor node? What is the output?
How do you hook one parallized node to another in the planner? Does it
stay parallel all the way through, or do you have to serialize between
nodes?

You can answer that for Scan, because it's a leaf node; but how will we
get to HashJoin, etc.?

>   However, we also don't want the
> mutual exclusion that the lock manager provides to serve as a barrier
> to implementing useful parallelism; that is, if the parallel backends
> want to manage conflicts themselves instead of letting the lock
> manager do it, that should be possible.

Agreed. Backends are quite generic, and they should be usable for all
kinds of things.

> To reiterate the basic problem here, if we do nothing at all about the
> lock manager, a parallel backend can stall trying to grab an
> AccessExclusiveLock that the user backend already holds, either because
> the user backend holds an AccessExclusiveLock as well, or because some
> other process is waiting for one, we'll deadlock and not notice.

My feeling is that we should keep the concept of a group and group
leader from your patch, and improve the deadlock detector to make use of
that information (looking at the code, it looks doable but not trivial).
But unless I am missing something, we should separate out the lock
sharing, and defer that until later.

I am actually starting to see that something like lock sharing could be
useful. I think your example of VACUUM (in the other thread) is a good
one. But I don't see anything forcing us to decide now if we can instead
detect the deadlocks and abort. We will have a lot more information
later, and I think we'll make a better decision about the exact form it
takes.

In other words: lock groups is important, but I don't see the rush for
lock sharing specifically.

Regards,
    Jeff Davis





Re: group locking: incomplete patch, just for discussion

From
Amit Kapila
Date:
On Tue, Nov 18, 2014 at 3:10 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2014-11-17 at 14:32 -0500, Robert Haas wrote:
>
> > To reiterate the basic problem here, if we do nothing at all about the
> > lock manager, a parallel backend can stall trying to grab an
> > AccessExclusiveLock that the user backend already holds, either because
> > the user backend holds an AccessExclusiveLock as well, or because some
> > other process is waiting for one, we'll deadlock and not notice.
>
> My feeling is that we should keep the concept of a group and group
> leader from your patch, and improve the deadlock detector to make use of
> that information (looking at the code, it looks doable but not trivial).
> But unless I am missing something, we should separate out the lock
> sharing, and defer that until later.
>

+1 to initially have something like you describe, and maybe an additional
mechanism to grant to child backends the non-conflicting locks which are
already held by the master backend.  I understand that allowing additional
non-conflicting locks could lead to some problems if write operations are
allowed via child backends, as described in point 1 of Robert's other
mail [1].  However, I think that as initially we want to allow only
read-only operations via child backends, this should be okay; later on,
if we want to support writes via child backends, we might want to
consider having an additional property on the lock which will allow the
lock manager or deadlock detector to consider such locks separately, so
that those locks won't be granted to child backends if they conflict
with the parent's.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Tue, Nov 18, 2014 at 4:40 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> To reiterate the basic problem here, if we do nothing at all about the
>> lock manager, a parallel backend can stall trying to grab an
>> AccessExclusiveLock that the user backend already holds, either because
>> the user backend holds an AccessExclusiveLock as well, or because some
>> other process is waiting for one, we'll deadlock and not notice.
>
> My feeling is that we should keep the concept of a group and group
> leader from your patch, and improve the deadlock detector to make use of
> that information (looking at the code, it looks doable but not trivial).
> But unless I am missing something, we should separate out the lock
> sharing, and defer that until later.

Doing as you propose solves this problem:

1. Backend A-1 acquires AccessShareLock on relation R.
2. Backend B waits for AccessExclusiveLock on relation R.
3. Backend A-2 waits for AccessShareLock on relation R.

Knowing that A-1 and A-2 are related, the deadlock detector can move
A-2 ahead of B and grant the lock.  All is well.
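
The reordering described here is easy to model.  Below is a toy sketch
(plain Python, purely illustrative, not PostgreSQL's lock.c/deadlock.c
logic) of a wait queue where a request may jump ahead of conflicting
waiters whenever another member of its own lock group already holds the
lock:

```python
# Toy model of group-aware lock queueing.  Mode names follow PostgreSQL,
# but the queue logic is a simplification for illustration only.

CONFLICTS = {
    ("AccessShareLock", "AccessShareLock"): False,
    ("AccessShareLock", "AccessExclusiveLock"): True,
    ("AccessExclusiveLock", "AccessShareLock"): True,
    ("AccessExclusiveLock", "AccessExclusiveLock"): True,
}

class Lock:
    def __init__(self):
        self.holders = []     # list of (pid, group, mode)
        self.wait_queue = []  # list of (pid, group, mode), FIFO

    def request(self, pid, group, mode):
        """Return True if granted immediately, False if queued."""
        # A request conflicts only with holders outside its own group.
        conflicts_with_holder = any(
            CONFLICTS[(mode, hmode)]
            for _, hgroup, hmode in self.holders if hgroup != group
        )
        # Normal fairness: also wait behind earlier conflicting waiters...
        blocked_by_queue = any(
            CONFLICTS[(mode, wmode)]
            for _, wgroup, wmode in self.wait_queue if wgroup != group
        )
        # ...unless a member of our own group already holds the lock, in
        # which case the deadlock detector would move us ahead anyway.
        group_already_holds = any(g == group for _, g, _ in self.holders)
        if not conflicts_with_holder and (group_already_holds
                                          or not blocked_by_queue):
            self.holders.append((pid, group, mode))
            return True
        self.wait_queue.append((pid, group, mode))
        return False

lock = Lock()
assert lock.request(1, "A", "AccessShareLock")          # A-1: granted
assert not lock.request(2, "B", "AccessExclusiveLock")  # B: waits
assert lock.request(3, "A", "AccessShareLock")          # A-2: moved ahead of B
```

With the group-awareness removed (grant only when `not blocked_by_queue`),
the last request would queue behind B, which is exactly the serialization
problem described above.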

But your proposal does not solve this problem:

1. Backend A-1 acquires AccessExclusiveLock on relation R.
2. Backend A-2 waits for AccessShareLock on relation R.

The good news is that the deadlock detector should realize that since
A-1 and A-2 are in the same group, this is a deadlock.  And it can
abort either A-1 or A-2, which will eventually abort them both.  The
bad news, to borrow a phrase from Peter Geoghegan, is that it's an
unprincipled deadlock; a user confronted with the news that her
parallel scan has self-deadlocked will be justifiably dismayed.  It's
been proposed by Andres, and maybe a few other people, that we can
detect this situation and just not use parallelism in these cases, but
that's actually quite hard to do.

Of course, it's pretty easy for a backend A-1 contemplating parallel
scan of relation R to notice that it holds a lock that conflicts with
the AccessShareLock it expects a prospective parallel backend A-2 to
attempt to acquire.  I don't think there's a lock manager method for
that today, but we could easily add one.  However, A-1 might
incidentally hold other locks (from earlier statements in the
transaction) that conflict with locks that will be sought by A-2, and
as discussed *extensively* upthread, it is not easy to predict all of
the locks a parallel backend might try to take: even seemingly-trivial
code like the equality operators for built-in data types can sometimes
take relation locks, and we have no infrastructure for predicting
which ones.

Even if you could somehow solve that problem, there's also the issue
of plan caching.  If we make a parallel plan for an SQL statement
inside a PL/pgsql procedure because our backend holds no strong locks,
and then we acquire a strong lock on a relevant relation, we have to
invalidate that plan.  I think that will add complexity inside the
plan cache machinery.  It seems to me that it would be better to
handle these issues inside the lock manager rather than letting them
creep into other parts of the system.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Amit Kapila
Date:
On Wed, Nov 19, 2014 at 9:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Nov 18, 2014 at 4:40 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> >> To reiterate the basic problem here, if we do nothing at all about the
> >> lock manager, a parallel backend can stall trying to grab an
> >> AccessExclusiveLock that the user backend already holds, either because
> >> the user backend holds an AccessExclusiveLock as well, or because some
> >> other process is waiting for one, we'll deadlock and not notice.
> >
> > My feeling is that we should keep the concept of a group and group
> > leader from your patch, and improve the deadlock detector to make use of
> > that information (looking at the code, it looks doable but not trivial).
> > But unless I am missing something, we should separate out the lock
> > sharing, and defer that until later.
>
> Doing as you propose solves this problem:
>
> 1. Backend A-1 acquires AccessShareLock on relation R.
> 2. Backend B waits for AccessExclusiveLock on relation R.
> 3. Backend A-2 waits for AccessShareLock on relation R.
>
> Knowing that A-1 and A-2 are related, the deadlock detector can move
> A-2 ahead of B and grant the lock.  All is well.
>
> But your proposal does not solve this problem:
>
> 1. Backend A-1 acquires AccessExclusiveLock on relation R.
> 2. Backend A-2 waits for AccessShareLock on relation R.
>

I think that could be doable under the above proposal by making the
lock-conflict code (LockCheckConflicts) aware of parallel groups.  But
it might lead to a problem in case it allows a parallel worker to acquire
a lock it should not, as in the relation-extension case you mentioned on
the other thread.  However, I think we should block any such operations
and allow only read-only operations in parallel workers.  It seems to me
that we have to do a lot more work to build the capability to allow
parallel workers to perform write operations.

>
> Even if you could somehow solve that problem, there's also the issue
> of plan caching.  If we make a parallel plan for an SQL statement
> inside a PL/pgsql procedure because our backend holds no strong locks,
> and then we acquire a strong lock on a relevant relation, we have to
> invalidate that plan.  I think that will add complexity inside the
> plan cache machinery.  It seems to me that it would be better to
> handle these issues inside the lock manager rather than letting them
> creep into other parts of the system.
>

How would changes in the lock manager guarantee that it won't be a problem
in any case?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: group locking: incomplete patch, just for discussion

From
Jeff Davis
Date:
On Wed, 2014-11-19 at 11:03 -0500, Robert Haas wrote:
> But your proposal does not solve this problem:
> 
> 1. Backend A-1 acquires AccessExclusiveLock on relation R.
> 2. Backend A-2 waits for AccessShareLock on relation R.
> 
> The good news is that the deadlock detector should realize that since
> A-1 and A-2 are in the same group, this is a deadlock.  And it can
> abort either A-1 or A-2, which will eventually abort them both.

Right. It can even give a nice error hint to tell the user how to avoid
the problem.

>   The
> bad news, to borrow a phrase from Peter Geoghegan, is that it's an
> unprincipled deadlock; a user confronted with the news that her
> parallel scan has self-deadlocked will be justifiably dismayed.

You seem to be raising this as a show-stopping problem, and I'm not
convinced that it is.

(a) There are no parallel operators yet, so this is speculative.
(b) Out of the parallel operators you imagine (e.g. Scan), we don't
expect anything other than AccessShare locks in the "normal" case.
(c) The failure mode is not so bad; it's just an error and the user can
probably work around it.

You could argue that we know we're headed for this problem, and
therefore we should solve it now. I disagree. You are assuming that
sharing exclusive heavyweight locks among a group will be a fundamental
part of everything postgres does with parallelism; but not every design
requires it.

Regards,
Jeff Davis





Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 20, 2014 at 4:21 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>>   The
>> bad news, to borrow a phrase from Peter Geoghegan, is that it's an
>> unprincipled deadlock; a user confronted with the news that her
>> parallel scan has self-deadlocked will be justifiably dismayed.
>
> You seem to be raising this as a show-stopping problem, and I'm not
> convinced that it is.

Well, what I'm saying is that at the very minimum we have to be able to
detect deadlocks, and we have two plausible designs for avoiding that:

1. Modify the deadlock detector to know about lock groups.

2. Propagate pre-existing locks from the user backend to all the workers.

I initially proposed #1, but now I think #2 solves more of the
problems for less code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Jeff Davis
Date:
On Thu, 2014-11-20 at 11:22 -0500, Robert Haas wrote:
> 2. Propagate pre-existing locks from the user backend to all the workers.
> 
> I initially proposed #1, but now I think #2 solves more of the
> problems for less code.

OK. The primary concern with that is unintended consequences. But it's
reasonable for you to ask for something more concrete. I will think on
this more.

A few things I'm thinking about now:
* What do you mean by "pre-existing"? Locks existing prior to what
event? (I don't think that's exactly what you meant.)
* What's the conceptual difference between granting locks that would
otherwise conflict with another process in the group (which is what this
proposal is about) and having exactly the same set of locks? Is there
any?
* Let's say you have processes A1 and A2 in one group, and B. A1 and B
both have an AccessShare lock, and A2 tries to acquire an exclusive
lock. B is waiting on A2. That's still a deadlock, right?
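
For what it's worth, that third scenario can be checked mechanically by
collapsing each lock group to a single node in the waits-for graph.  A toy
sketch (plain Python, a hypothetical representation rather than the actual
deadlock.c algorithm; only "hard" edges, where the blocker actually holds
a conflicting lock, are modeled, since a group-aware detector can reorder
away mere queue-order edges):

```python
# A deadlock exists if the waits-for graph, with each lock group
# collapsed to one node, contains a cycle between distinct groups.

def has_cycle(waits_for, group_of):
    # Collapse processes to their group (a lone process is its own group)
    # and keep only inter-group edges.
    edges = {}
    for proc, blockers in waits_for.items():
        g = group_of.get(proc, proc)
        for b in blockers:
            gb = group_of.get(b, b)
            if g != gb:
                edges.setdefault(g, set()).add(gb)
    # Standard DFS cycle detection over the collapsed graph.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    def dfs(node):
        color[node] = GREY
        for nxt in edges.get(node, ()):
            c = color.get(nxt, WHITE)
            if c == GREY or (c == WHITE and dfs(nxt)):
                return True
        color[node] = BLACK
        return False
    return any(dfs(n) for n in list(edges) if color.get(n, WHITE) == WHITE)

group_of = {"A1": "A", "A2": "A"}

# The scenario above: A2 is hard-blocked by B's AccessShare lock, and B
# waits on A2.  Collapsed: A -> B -> A, a genuine deadlock that group
# awareness alone cannot resolve, because B is outside the group.
assert has_cycle({"A2": ["B"], "B": ["A2"]}, group_of)

# Robert's earlier scenario: B is hard-blocked by A1's lock, while A2 is
# merely queued behind B (no hard edge).  No cycle: reordering fixes it.
assert not has_cycle({"B": ["A1"]}, group_of)
```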

Regards,
Jeff Davis






Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-11-20 11:22:37 -0500, Robert Haas wrote:
> On Thu, Nov 20, 2014 at 4:21 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> >>   The
> >> bad news, to borrow a phrase from Peter Geoghegan, is that it's an
> >> unprincipled deadlock; a user confronted with the news that her
> >> parallel scan has self-deadlocked will be justifiably dismayed.
> >
> > You seem to be raising this as a show-stopping problem, and I'm not
> > convinced that it is.
> 
> Well, what I'm saying is that at the very minimum we have to be able to
> detect deadlocks, and we have two plausible designs for avoiding that:
> 
> 1. Modify the deadlock detector to know about lock groups.
> 
> 2. Propagate pre-existing locks from the user backend to all the workers.
> 
> I initially proposed #1, but now I think #2 solves more of the
> problems for less code.

Except that it opens us up for all kinds of concurrency bugs. I'm pretty
strictly set against granting any self exclusive locks en-masse. If we
do this by default for all granted locks when starting a worker backend
it'll get *so* much harder to reason about correctness. Suddenly locks
don't guarantee what they used to anymore. We'll e.g. not be able to
rely that a CheckTableNotInUse() + AEL makes it safe to
drop/rewrite/whatever a relation - even if that happens in the main
backend.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 20, 2014 at 12:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2014-11-20 at 11:22 -0500, Robert Haas wrote:
>> 2. Propagate pre-existing locks from the user backend to all the workers.
>>
>> I initially proposed #1, but now I think #2 solves more of the
>> problems for less code.
>
> OK. The primary concern with that is unintended consequences. But it's
> reasonable for you to ask for something more concrete. I will think on
> this more.
>
> A few things I'm thinking about now:
>
>  * What do you mean by "pre-existing"? Locks existing prior to what
> event? (I don't think that's exactly what you meant.)
>  * What's the conceptual difference between granting locks that would
> otherwise conflict with another process in the group (which is what this
> proposal is about) and having exactly the same set of locks? Is there
> any?
>  * Let's say you have processes A1 and A2 in one group, and B. A1 and B
> both have an AccessShare lock, and A2 tries to acquire an exclusive
> lock. B is waiting on A2. That's still a deadlock, right?

I think I discussed all of these issues on the other thread already.
Am I wrong?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 20, 2014 at 12:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Except that it opens us up for all kinds of concurrency bugs. I'm pretty
> strictly set against granting any self exclusive locks en-masse. If we
> do this by default for all granted locks when starting a worker backend
> it'll get *so* much harder to reason about correctness. Suddenly locks
> don't guarantee what they used to anymore. We'll e.g. not be able to
> rely that a CheckTableNotInUse() + AEL makes it safe to
> drop/rewrite/whatever a relation - even if that happens in the main
> backend.

Haven't I responded to this a few times already?

I see no way, even theoretically, that it can be sane for
CheckTableNotInUse() to succeed in a parallel context.  Ever.  If the
deadlock detector would kill the processes anyway, it doesn't matter,
because CheckTableNotInUse() should do it first, so that we get a
better error and without waiting for deadlock_timeout.  So any
scenario that's predicated on the assumption that CheckTableNotInUse()
will succeed in a parallel context is 100% unconvincing to me as an
argument for anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-11-20 13:57:25 -0500, Robert Haas wrote:
> On Thu, Nov 20, 2014 at 12:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Except that it opens us up for all kinds of concurrency bugs. I'm pretty
> > strictly set against granting any self exclusive locks en-masse. If we
> > do this by default for all granted locks when starting a worker backend
> > it'll get *so* much harder to reason about correctness. Suddenly locks
> > don't guarantee what they used to anymore. We'll e.g. not be able to
> > rely that a CheckTableNotInUse() + AEL makes it safe to
> > drop/rewrite/whatever a relation - even if that happens in the main
> > backend.
> 
> Haven't I responded to this a few times already?

Not in a particularly convincing way at least.

> I see no way, even theoretically, that it can be sane for
> CheckTableNotInUse() to succeed in a parallel context.  Ever.

It's not exactly inconceivable that you'd want a parallel
CLUSTER/REINDEX/VACUUM or something similar. That'll require cooperating
backends, true, but it's still a pretty sane case.

You haven't really made your case why you want to allow access to AEL
locked relations in a parallel context in the first place.

> If the
> deadlock detector would kill the processes anyway, it doesn't matter,
> because CheckTableNotInUse() should do it first, so that we get a
> better error and without waiting for deadlock_timeout.  So any
> scenario that's predicated on the assumption that CheckTableNotInUse()
> will succeed in a parallel context is 100% unconvincing to me as an
> argument for anything.

Meh, there's plenty cases where CheckTableNotInUse() isn't used where we
still rely on locks actually working. And it's not just postgres code,
this *will* break user code.

Your argument essentially is that we don't actually need to lock objects
if the other side only reads. Yes, you add a condition or two on top
(e.g. CheckTableNotInUse() fails), but that's not that much. If we want
to do this we really need to forbid *any* modification to catalog/user
tables while parallel operations are ongoing. In the primary and worker
backends.  I think that's going to end up being much more
problematic/restrictive than not allowing non-cooperative parallelism
after >= ShareUpdateExclusive. If we ever want to parallelize writing
operations we'll regret this quite a bit.

Breaking the locking system - and that's what you're doing by removing
the usual guarantees - seems just fundamentally wrong.  Yes, I can't
name all the dangers that I think are out there.

The 'let's grant all the current locks at start of parallel query'
approach seems quite a bit safer than always doing that during parallel
query, don't get me wrong.

Btw, what are your thoughts about SERIALIZABLE and parallel query?
Afaics that'd be pretty hard to support?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 20, 2014 at 3:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Haven't I responded to this a few times already?
> Not in a particularly convincing way at least.

That's not a very helpful response.

>> I see no way, even theoretically, that it can be sane for
>> CheckTableNotInUse() to succeed in a parallel context.  Ever.
>
> It's not exactly inconceivable that you'd want a parallel
> CLUSTER/REINDEX/VACUUM or something similar. That'll require cooperating
> backends, true, but it's still a pretty sane case.
>
> You haven't really made your case why you want to allow access to AEL
> locked relations in a parallel context in the first place.

Well, what the heck was I doing here?

http://www.postgresql.org/message-id/CA+Tgmoa_+u-qCcFZ9zAVZFk+TGXbqQB1+0ktPZRGgPJ5PvrJVw@mail.gmail.com

I've *repeatedly* pointed out that if you *don't* allow this, stuff
breaks.  And you haven't answered how you'd like to fix that.  The
only answer you've given so far, that I can see, is that maybe we can
foresee all the locks that the child might take and not initiate
parallelism if any of them conflict with locks we already hold.  But
that's not a reasonable approach; we have no reasonable way of
predicting what locks the child will need, and even if we did, things
can change between plan time and execution time.

>> If the
>> deadlock detector would kill the processes anyway, it doesn't matter,
>> because CheckTableNotInUse() should do it first, so that we get a
>> better error and without waiting for deadlock_timeout.  So any
>> scenario that's predicated on the assumption that CheckTableNotInUse()
>> will succeed in a parallel context is 100% unconvincing to me as an
>> argument for anything.
>
> Meh, there's plenty cases where CheckTableNotInUse() isn't used where we
> still rely on locks actually working. And it's not just postgres code,
> this *will* break user code.

It would help if you'd provide some specific examples of the problems
you foresee.  In the absence of specificity, it's hard to come to
any conclusions about whether your concerns are well-founded, or
propose any way to overcome them.

> Your argument essentially is that we don't actually need to lock objects
> if the other side only reads. Yes, you add a condition or two on top
> (e.g. CheckTableNotInUse() fails), but that's not that much.

I don't think that's my argument at all.  What I'm arguing is that
there's more than one reason for taking locks.  Broadly, I think we
take some locks to preserve transaction isolation guarantees and
others to preserve data integrity.  If you're adding or dropping a
column, nobody else had better be looking at that relation, because
they might get confused and crash the server or something.  Backends,
even cooperating parallel backends, won't be able to cope with the
tuple descriptor changing under them.  But once that operation is
complete, there's no reason why the *next* statement in that
transaction can't see the relation you just modified, even though it's
still exclusively locked.  As long as we make sure that new
parallel backends use our snapshot, they'll see the right metadata for
that relation and can safely access it.  We still can't let *other*
people see that relation because the ALTER TABLE is an uncommitted
change, but that's no argument against letting our own parallel
workers look at it.

> If we want
> to do this we really need to forbid *any* modification to catalog/user
> tables while parallel operations are ongoing. In the primary and worker
> backends.  I think that's going to end up being much more
> problematic/restrictive than not allowing non-cooperative paralellism
> after >= ShareUpdateExclusive. If we ever want to parallelize writing
> operations we'll regret this quite a bit.

I can't really follow how this follows from anything I said.

> Breaking the locking system - and that's what you're doing by removing
> the usual guarantees - seems just fundamentally wrong.  Yes, I can't
> name all the dangers that I think are out there.
>
> The 'lets grant all the current locks at start of parallel query'
> approach seems quite a bit safer than always doing that during parallel
> query, don't get me wrong.

"Let's grant all the current locks at start of parallel query" is the
proposal I'm focused on right now.  Based on your arguments and those
of others, I have long-since abandoned the original proposal of having
locks within a group never conflict.  It seemed like a good idea at
the time, but it would put way too much burden on the parallel workers
to coordinate their activity with each other, so it's dead.  What I
want to establish at this time is what types of problems, if any, you
or others can foresee with "Let's grant all the current locks at start
of parallel query" - because I sense considerable skepticism, but
cannot put my finger on anything concretely wrong with it.

> Btw, what are your thoughts about SERIALIZABLE and parallel query?
> Afaics that'd be pretty hard to support?

Hmm, I haven't thought about that.  Since we never *wait* for SIREAD
locks, and there's no deadlock detection to worry about, we might be
able to finagle things so that the parallel backends just take the
same SIREAD locks the parent would have taken, using the parent's
PGPROC.  But that could easily turn out to be unworkable,
insufficient, or otherwise brain-damaged.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-11-20 15:47:39 -0500, Robert Haas wrote:
> On Thu, Nov 20, 2014 at 3:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Haven't I responded to this a few times already?
> > Not in a particularly convincing way at least.
> 
> That's not a very helpful response.

Well...

> >> I see no way, even theoretically, that it can be sane for
> >> CheckTableNotInUse() to succeed in a parallel context.  Ever.
> >
> > It's not exactly inconceivable that you'd want a parallel
> > CLUSTER/REINDEX/VACUUM or something similar. That'll require cooperating
> > backends, true, but it's still a pretty sane case.
> >
> > You haven't really made your case why you want to allow access to AEL
> > locked relations in a parallel context in the first place.
> 
> Well, what the heck was I doing here?
> 
> http://www.postgresql.org/message-id/CA+Tgmoa_+u-qCcFZ9zAVZFk+TGXbqQB1+0ktPZRGgPJ5PvrJVw@mail.gmail.com

I don't see anything about actual use cases there.

> I've *repeatedly* pointed out that if you *don't* allow this, stuff
> breaks.

You've pointed out some strange corner cases that might break yes.

> And you haven't answered how you'd like to fix that.  The
> only answer you've given so far, that I can see, is that maybe we can
> foresee all the locks that the child might take and not initiate
> parallelism if any of them conflict with locks we already hold. But
> that's not a reasonable approach; we have no reasonable way of
> predicting what locks the child will need, and even if we did, things
> can change between plan time and execution time.

We don't need to predict all future locks. We only need to check which
self-exclusive locks we are already holding.
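
A sketch of what that check might look like (plain Python model; the mode
list follows PostgreSQL's standard lock conflict table, but
`parallelism_safe` is a hypothetical helper for illustration, not an
existing API):

```python
# Lock modes that conflict with themselves in PostgreSQL's standard
# conflict table.  If the leader already holds one of these, a worker
# requesting the "same" locks could block against its own leader.
SELF_EXCLUSIVE_MODES = {
    "ShareUpdateExclusiveLock",
    "ShareRowExclusiveLock",
    "ExclusiveLock",
    "AccessExclusiveLock",
}

def parallelism_safe(held_locks):
    """held_locks: iterable of (object, mode) the leader already holds.
    Return False if any held mode is self-exclusive, i.e. launching
    workers without lock sharing risks an unprincipled self-deadlock."""
    return all(mode not in SELF_EXCLUSIVE_MODES for _, mode in held_locks)

assert parallelism_safe([("rel_r", "AccessShareLock")])
assert not parallelism_safe([("rel_r", "AccessExclusiveLock")])
```

This is only the check Andres proposes here; Robert's objection upthread
is that it does not cover locks the workers themselves might take later.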

I don't buy the plan time/execution time argument. We'll need to be able
to adapt plans to the availability of bgworker slots *anyway*. Otherwise
we either need to set the number of bgworkers to "MaxConnections *
MaxParallelism" or live with errors as soon as too many processes use
parallelism. The former is clearly unrealistic given PG's per-backend
overhead and the latter will be a *much* bigger PITA than all the
deadlock scenarios talked about here. It'll constantly fail, even if
everything is ok.

> >> If the
> >> deadlock detector would kill the processes anyway, it doesn't matter,
> >> because CheckTableNotInUse() should do it first, so that we get a
> >> better error and without waiting for deadlock_timeout.  So any
> >> scenario that's predicated on the assumption that CheckTableNotInUse()
> >> will succeed in a parallel context is 100% unconvincing to me as an
> >> argument for anything.
> >
> > Meh, there's plenty cases where CheckTableNotInUse() isn't used where we
> > still rely on locks actually working. And it's not just postgres code,
> > this *will* break user code.
> 
> It would help if you'd provide some specific examples of the problems
> you foresee.  In the absence of specificity, it's hard to come to
> any conclusions about whether your concerns are well-founded, or
> propose any way to overcome them.

How about a frontend process having created a relation that then starts
a parallel query. Then the frontend process ERRORs out and, in the
course of that, does smgrDoPendingDeletes(). Which will delete the new
relation. Boom. The background process might just be accessing it. If
you think that's harmless, think e.g. what'd happen with heap cleanup
records generated in the background process. They'd not replay.

Another one is that the frontend process does, from some function or so,
CREATE POLICY. Triggering a relcache flush. And RelationClearRelation()
essentially assumes that a referenced relcache entry can't change (which
is the reason the keep_* stuff is in RelationClearRelation()).

I'm pretty sure there's at least a dozen other such hazards around.

> > Your argument essentially is that we don't actually need to lock objects
> > if the other side only reads. Yes, you add a condition or two on top
> > (e.g. CheckTableNotInUse() fails), but that's not that much.
> 
> I don't think that's my argument at all.  What I'm arguing is that
> there's more than one reason for taking locks.  Broadly, I think we
> take some locks to preserve transaction isolation guarantees and
> others to preserve data integrity.  If you're adding or dropping a
> column, nobody else had better be looking at that relation, because
> they might get confused and crash the server or something.  Backends,
> even cooperating parallel backends, won't be able to cope with the
> tuple descriptor changing under them.  But once that operation is
> complete, there's no reason why the *next* statement in that
> transaction can't see the relation you just modified, even though it's
> still exclusively locked.  As long as we make sure that new
> parallel backends use our snapshot, they'll see the right metadata for
> that relation and can safely access it.  We still can't let *other*
> people see that relation because the ALTER TABLE is an uncommitted
> change, but that's no argument against letting our own parallel
> workers look at it.

I think the transaction abort example from above refutes that theory.

> > Btw, what are your thoughts about SERIALIZABLE and parallel query?
> > Afaics that'd be pretty hard to support?
> 
> Hmm, I haven't thought about that.  Since we never *wait* for SIREAD
> locks, and there's no deadlock detection to worry about, we might be
> able to finagle things so that the parallel backends just take the
> same SIREAD locks the parent would have taken, using the parent's
> PGPROC.  But that could easily turn out to be unworkable,
> insufficient, or otherwise brain-damaged.

I think there's some local state as well, so that's probably not so
easy. I guess this needs input from Kevin.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 20, 2014 at 4:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> And you haven't answered how you'd like to fix that.  The
>> only answer you've given so far, that I can see, is that maybe we can
>> foresee all the locks that the child might take and not initiate
>> parallelism if any of them conflict with locks we already hold. But
>> that's not a reasonable approach; we have no reasonable way of
>> predicting what locks the child will need, and even if we did, things
>> can change between plan time and execution time.
>
> We don't need to predict all future locks. We only need to check which
> self-exclusive locks we are already holding.

I can't follow that logic.  Why do you think self-exclusive locks are
the only ones that matter?

> I don't buy the plan time/execution time argument. We'll need to be able
> to adapt plans to the availability of bgworker slots *anyway*. Otherwise
> we either need to set the number of bgworkers to "MaxConnections *
> MaxParallelism" or live with errors as soon as too many processes use
> parallelism. The former is clearly unrealistic given PG's per-backend
> overhead and the latter will be a *much* bigger PITA than all the
> deadlock scenarios talked about here. It'll constantly fail, even if
> everything is ok.

I certainly think that any parallel node needs to be able to cope with
getting fewer workers than it would ideally like to have.  But that's
different from saying that when our own backend takes new locks, we
have to invalidate plans.  Those seem like largely unrelated problems.

>> >> If the
>> >> deadlock detector would kill the processes anyway, it doesn't matter,
>> >> because CheckTableNotInUse() should do it first, so that we get a
>> >> better error and without waiting for deadlock_timeout.  So any
>> >> scenario that's predicated on the assumption that CheckTableNotInUse()
>> >> will succeed in a parallel context is 100% unconvincing to me as an
>> >> argument for anything.
>> >
>> > Meh, there's plenty cases where CheckTableNotInUse() isn't used where we
>> > still rely on locks actually working. And it's not just postgres code,
>> > this *will* break user code.
>>
>> It would help if you'd provide some specific examples of the problems
>> you foresee.  In the absence of specificity, it's hard to come to
>> any conclusions about whether your concerns are well-founded, or
>> propose any way to overcome them.
>
> How about a frontend process having created a relation that then starts
> a parallel query. Then the frontend process ERRORs out and, in the
> course of that, does smgrDoPendingDeletes(). Which will delete the new
> relation. Boom. The background process might just be accessing it. If
> you think that's harmless, think e.g. what'd happen with heap cleanup
> records generated in the background process. They'd not replay.

OK, that's an excellent example.  Thanks.  I'll think about that.

> Another one is that the frontend process does, from some function or so,
> CREATE POLICY. Triggering a relcache flush. And RelationClearRelation()
> essentially assumes that a referenced relcache entry can't change (which
> is the reason the keep_* stuff is in RelationClearRelation()).

I don't quite follow that one.  Are you saying that the frontend would
do a CREATE POLICY while there are backend workers running?  I
seriously doubt that can ever be made safe, but I'd choose to prohibit
it using something other than the heavyweight lock manager.

> I think there's some local state as well, so that's probably not so
> easy. I guess this needs input from Kevin.

Yeah.  I'd be OK to have v1 disable parallelism at serializable, and
fix that in v2, but of course if we can avoid that it's even better.
The heavyweight locking issue is really killing me, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-11-20 20:21:07 -0500, Robert Haas wrote:
> On Thu, Nov 20, 2014 at 4:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> And you haven't answered how you'd like to fix that.  The
> >> only answer you've given so far, that I can see, is that maybe we can
> >> foresee all the locks that the child might take and not initiate
> >> parallelism if any of them conflict with locks we already hold. But
> >> that's not a reasonable approach; we have no reasonable way of
> >> predicting what locks the child will need, and even if we did, things
> >> can change between plan time and execution time.
> >
> > We don't need to predict all future locks. We only need to check which
> > self-exclusive locks we are already holding.
> 
> I can't follow that logic.  Why do you think self-exclusive locks are
> the only ones that matter?

All the others can be acquired by jumping in front of the lock's
waitqueue?

> > I don't buy the plan time/execution time argument. We'll need to be able
> > to adapt plans to the availability of bgworker slots *anyway*. Otherwise
>> > we either need to set the number of bgworkers to "MaxConnections *
> > MaxParallelism" or live with errors as soon as too many processes use
> > parallelism. The former is clearly unrealistic given PG's per-backend
> > overhead and the latter will be a *much* bigger PITA than all the
> > deadlock scenarios talked about here. It'll constantly fail, even if
> > everything is ok.
> 
> I certainly think that any parallel node needs to be able to cope with
> getting fewer workers than it would ideally like to have.  But that's
> different from saying that when our own backend takes new locks, we
> have to invalidate plans.  Those seem like largely unrelated problems.

No need to invalidate them if you have the infrastructure to run with no
parallel workers - just use the path that's used if there's no workers
available.

> > Another one is that the frontend process does, from some function or so,
> > CREATE POLICY. Triggering a relcache flush. And RelationClearRelation()
> > essentially assumes that a referenced relcache entry can't change (which
> > is the reason the keep_* stuff is in RelationClearRelation()).
> 
> I don't quite follow that one.  Are you saying that the frontend would
> do a CREATE POLICY while there are backend workers running?

Yes.

>  I
> seriously doubt that can ever be made safe, but I'd choose to prohibit
> it using something other than the heavyweight lock manager.

Right. It's perfectly fine to forbid it imo. But there's lots of places
that have assumptions about the locking behaviour baked in, and the
relevant bugs will be hard to find.

> > I think there's some local state as well, so that's probably not so
> > easy. I guess this needs input from Kevin.
> 
> Yeah.  I'd be OK to have v1 disable parallelism at serializable, and
> fix that in v2, but of course if we can avoid that it's even better.

I don't have a problem with that either.

> The heavyweight locking issue is really killing me, though.

I don't really understand why you're not content with just detecting
deadlocks for now. Everything else seems like bells and whistles for
later.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 20, 2014 at 8:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I can't follow that logic.  Why do you think self-exclusive locks are
>> the only ones that matter?
>
> All the others can be acquired by jumping in front of the lock's
> waitqueue?

That is true, as a practical matter, in many cases.  But from the
point of view of the lock manager, it's really not true.  It is almost
true if you assume that the lock level acquired by the parallel
backend will be weaker than the lock level held by the user backend,
but not quite, because RowExclusiveLock conflicts with ShareLock.

The assumption that the parallel backend's lock level will always be
weaker seems shaky.  It's probably true for relation locks, but if we
ever want to support parallel workers that can write any data
whatsoever, they're going to need to be able to take relation
extension locks.  If one backend in a parallel group attempts to
acquire the relation extension lock while another worker holds it, the
rule that all group members should be regarded as waiting for each
other will result in instantaneous deadlock.  That's pretty
undesirable, and suggests to me that the whole model is defective in
some way.

>> > I don't buy the plan time/execution time argument. We'll need to be able
>> > to adapt plans to the availability of bgworker slots *anyway*. Otherwise
>> > we either need to set the number of bgworkers to "MaxConnections *
>> > MaxParallelism" or live with errors as soon as too many processes use
>> > parallelism. The former is clearly unrealistic given PG's per-backend
>> > overhead and the latter will be a *much* bigger PITA than all the
>> > deadlock scenarios talked about here. It'll constantly fail, even if
>> > everything is ok.
>>
>> I certainly think that any parallel node needs to be able to cope with
>> getting fewer workers than it would ideally like to have.  But that's
>> different from saying that when our own backend takes new locks, we
>> have to invalidate plans.  Those seem like largely unrelated problems.
>
> No need to invalidate them if you have the infrastructure to run with no
> parallel workers - just use the path that's used if there's no workers
> available.

Well, it's OK for a query to run with less parallelism than the
planner thought optimal.  It's not OK for it to attempt to run with
the requested degree of parallelism and deadlock.  I think if the
deadlock detector gets involved it's quite unlikely that you're going
to be able to make things degrade smoothly to non-parallel operation.

>>  I
>> seriously doubt that can ever be made safe, but I'd choose to prohibit
>> it using something other than the heavyweight lock manager.
>
> Right. It's perfectly fine to forbid it imo. But there's lots of places
> that have assumptions about the locking behaviour baked in, and the
> relevant bugs will be hard to find.

That's pretty vague.  I think we can rule out the great bulk of the
problem cases by prohibiting backends from starting a DDL command
while parallel mode is in effect.  (We may be parallelizing a DDL
command or just a regular query, but once we've initiated parallelism,
no NEW DDL commands can be started.)

We should focus on trying to foresee what can go wrong apart from
that.  That's why I think your example about smgrDoPendingDeletes() is
a pretty helpful one to think about, but this one seems like something
that has no chance of slipping through the cracks.

>> The heavyweight locking issue is really killing me, though.
>
> I don't really understand why you're not content with just detecting
> deadlocks for now. Everything else seems like bells and whistles for
> later.

I don't think it's OK for a parallel query to just randomly deadlock.
It's OK for parallel query to excuse itself politely and retire from
the field, but I don't think it's OK for it to say, oh, sure, I can
parallelize that for you, and then fail by deadlocking with itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Andres Freund
Date:
On 2014-11-21 08:31:01 -0500, Robert Haas wrote:
> On Thu, Nov 20, 2014 at 8:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I can't follow that logic.  Why do you think self-exclusive locks are
> >> the only ones that matter?
> >
> > All the others can be acquired by jumping in front of the lock's
> > waitqueue?
> 
> That is true, as a practical matter, in many cases.  But from the
> point of view of the lock manager, it's really not true.  It is almost
> true if you assume that the lock level acquired by the parallel
> backend will be weaker than the lock level held by the user backend,
> but not quite, because RowExclusiveLock conflicts with ShareLock.

The assumption that a parallel backend will acquire a lock level <= the
primary's lock imo is a pretty sane one - we obviously can't predict
things if not. And I don't think any of the other presented approaches
can do that.

I'm not really bothered by RowExclusiveLock vs. ShareLock. Unless I'm
missing something, that can only be problematic if the primary acquires a
ShareLock and the worker tries a RowExclusive. That just can't be sane.

> The assumption that the parallel backend's lock level will always be
> weaker seems shaky.  It's probably true for relation locks, but if we
> ever want to support parallel workers that can write any data
> whatsoever, they're going to need to be able to take relation
> extension locks.

I don't think extension locks are a good argument here. As you noted
upthread, they really need to be used consistently across processes. And
they shouldn't/wouldn't be granted by your suggested mechanism either,
right?

> If one backend in a parallel group attempts to
> acquire the relation extension lock while another worker holds it, the
> rule that all group members should be regarded as waiting for each
> other will result in instantaneous deadlock.  That's pretty
> undesirable, and suggests to me that the whole model is defective in
> some way.

Shouldn't the deadlock checker actually check that all involved
processes are actually waiting for a lock? It seems a bit pointless to
deadlock when one node is actually progressing. That seems like it'd be
an absolute PITA for anything involving system tables and such.

> >> > I don't buy the plan time/execution time argument. We'll need to be able
> >> > to adapt plans to the availability of bgworker slots *anyway*. Otherwise
> >> > we either need to set the number of bgworkers to "MaxConnections *
> >> > MaxParallelism" or live with errors as soon as too many processes use
> >> > parallelism. The former is clearly unrealistic given PG's per-backend
> >> > overhead and the latter will be a *much* bigger PITA than all the
> >> > deadlock scenarios talked about here. It'll constantly fail, even if
> >> > everything is ok.
> >>
> >> I certainly think that any parallel node needs to be able to cope with
> >> getting fewer workers than it would ideally like to have.  But that's
> >> different from saying that when our own backend takes new locks, we
> >> have to invalidate plans.  Those seem like largely unrelated problems.
> >
> > No need to invalidate them if you have the infrastructure to run with no
> > parallel workers - just use the path that's used if there's no workers
> > available.
> 
> Well, it's OK for a query to run with less parallelism than the
> planner thought optimal.  It's not OK for it to attempt to run with
> the requested degree of parallelism and deadlock.  I think if the
> deadlock detector gets involved it's quite unlikely that you're going
> to be able to make things degrade smoothly to non-parallel operation.

That was in response to your argument about plan invalidation - which I
don't buy because it'd only reuse the graceful degradation capabilities
we need anyway.

> >> The heavyweight locking issue is really killing me, though.
> >
> > I don't really understand why you're not content with just detecting
> > deadlocks for now. Everything else seems like bells and whistles for
> > later.
> 
> I don't think it's OK for a parallel query to just randomly deadlock.

I think that should be the end goal, yes.

> It's OK for parallel query to excuse itself politely and retire from
> the field, but I don't think it's OK for it to say, oh, sure, I can
> parallelize that for you, and then fail by deadlocking with itself.

But I also think that right now it doesn't necessarily need to be the
focus immediately. This is a topic that I think will be much easier to
approach if some demonstration of actual parallel users would be
available. Doesn't have to be committable or such, but I think it'll
help to shape how this has to look.  I think unrecognized deadlocks will
make things too annoying to use, but I don't think it needs to be
guaranteed deadlock free or such.

Greetings,

Andres Freund



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Thu, Nov 20, 2014 at 4:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> How about a frontend process having created a relation that then starts
> a parallel query. Then the frontend process ERRORs out and, in the
> course of that, does smgrDoPendingDeletes(). Which will delete the new
> relation. Boom. The background process might just be accessing it. If
> you think that's harmless, think e.g. what'd happen with heap cleanup
> records generated in the background process. They'd not replay.

I spent some time thinking about this case.  I don't think this is
really principally a locking issue; I think it's really a timing
issue around transaction abort.  I believe that, if we wanted to be
able to write any tuples at all from within a parallel worker, there
are certain phases of transaction abort processing that would need to
happen only once we're sure that all of the backends involved have
done their local abort processing.  smgrDoPendingDeletes() is one, but
I think we have pretty much the same problem with
RecordTransactionAbort() and ProcArrayEndTransaction().  Without some
synchronization, it would be possible for a parallel backend to stamp
a tuple with an XID after the abort record is already written, or
after the transaction has already been removed from the ProcArray.
Are we prepared to view that sort of thing as harmless?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: group locking: incomplete patch, just for discussion

From
Peter Geoghegan
Date:
On Sun, Nov 23, 2014 at 3:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> The heavyweight locking issue is really killing me, though.
>> >
>> > I don't really understand why you're not content with just detecting
>> > deadlocks for now. Everything else seems like bells and whistles for
>> > later.
>>
>> I don't think it's OK for a parallel query to just randomly deadlock.
>
> I think that should be the end goal, yes.
>
>> It's OK for parallel query to excuse itself politely and retire from
>> the field, but I don't think it's OK for it to say, oh, sure, I can
>> parallelize that for you, and then fail by deadlocking with itself.
>
> But I also think that right now it doesn't necessarily need to be the
> focus immediately.

I am in strong agreement with Robert that unprincipled deadlocks are
not okay. Now, what that actually means in practice might be kind of
hard to delineate.

Take my ON CONFLICT UPDATE patch. The locking there is optimistic, and
so in theory there are risks of lock starvation [1]. There is no
axiomatic reason why a single backend cannot be starved of a lock in
the event of a perfect storm of conflicts. Similarly, the Lehman and
Yao paper makes what can only be called a plausibility argument around
how livelocks are not a concern in practice [2] when an indefinite
number of "move right" operations are required after a page split;
having written a bunch of theorems around the correctness of other
aspects of their algorithm, the argument here is more or less: "Come
on, that'll never happen!". Concurrency is hard.

If you want another example of this kind of thing, then there's how
Oracle deals with READ COMMITTED row conflicts -- the statement is
rolled back and restarted. Who is to say with certainty that it'll
make any progress the second time? Or the third or fourth?

Note that the optimistic locking stuff was actually my solution to the
"unprincipled deadlocks" problem, a solution that, as I say, also
implies the theoretical risk of lock starvation (but not livelock). I
think I would also have found it acceptable if there was a solution
that reduced the original risk, the risk of *deadlocks* to merely a
theoretical one, as opposed to a very real one (that could be shown
with a simple testcase).

> This is a topic that I think will be much easier to
> approach if some demonstration of actual parallel users would be
> available. Doesn't have to be committable or such, but I think it'll
> help to shape how this has to look.  I think unrecognized deadlocks will
> make things too annoying to use, but I don't think it needs to be
> guaranteed deadlock free or such.

I couldn't agree more. The greater point about livelock, which I
believe applies here as well, is that there are different standards by
which we might consider that unprincipled deadlocks won't happen.
Basically, it seems inevitable that a practical standard sometimes
must be applied. If a well informed person, say myself or Andres,
cannot make any given practical implementation deadlock (or
alternatively be starved for locks, if we're talking about an
"optimistic locking" work-around to unprincipled deadlocks that
introduces that risk), then I guess that means that there are no
unprincipled deadlocks/livelocks, and I for one will probably find
that acceptable given enough testing.

I realize that that might sound terribly loose (I probably forgot a
few caveats), but what it boils down to is that we *don't* have the
tools to analyze certain types of problems like this, or at least I
don't think we do, because no one does. The closest thing we have is
some well written custom stress testing, and lots of scrutiny. Of
course I'd prefer a solution that is provably correct from first
principles (in other words, I'd prefer to use algorithms that are in
some formal sense non-blocking [3] or something along those lines),
but that will often be cost prohibitive or otherwise elusive when
designing concurrent algorithms.

I think I agree with you both.

[1] https://wiki.postgresql.org/wiki/UPSERT#Theoretical_lock_starvation_hazards
[2] http://www.cs.cornell.edu/courses/cs4411/2009sp/blink.pdf ,
Section 6.4 Livelock
[3] https://en.wikipedia.org/wiki/Non-blocking_algorithm
-- 
Peter Geoghegan



Re: group locking: incomplete patch, just for discussion

From
Noah Misch
Date:
On Thu, Nov 20, 2014 at 09:15:51PM +0100, Andres Freund wrote:
> On 2014-11-20 13:57:25 -0500, Robert Haas wrote:
> > On Thu, Nov 20, 2014 at 12:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > If the
> > deadlock detector would kill the processes anyway, it doesn't matter,
> > because CheckTableNotInUse() should do it first, so that we get a
> > better error and without waiting for deadlock_timeout.  So any
> > scenario that's predicated on the assumption that CheckTableNotInUse()
> > will succeed in a parallel context is 100% unconvincing to me as an
> > argument for anything.
> 
> Meh, there's plenty of cases where CheckTableNotInUse() isn't used where we
> still rely on locks actually working. And it's not just postgres code,
> this *will* break user code.

Today, holding a relation AccessExclusiveLock ensures no other process is
using the table.  That is one of two prerequisites for making an alteration
that threatens concurrent users of the table.  The other prerequisite is
verifying that no other subsystem of the current process is using the table,
hence CheckTableNotInUse().  Robert's designs do soften AccessExclusiveLock
for relations; holding it will not preclude use of the table in workers of the
same parallel group.  Pessimizing CheckTableNotInUse() compensates.  In every
scenario where AccessExclusiveLock is softer than before, CheckTableNotInUse()
will fail unconditionally.  Code that, today, performs the correct checks
before making a threatening table alteration will remain correct.

ANALYZE and lazy VACUUM are the only core operations that take self-exclusive
relation locks and skip CheckTableNotInUse().  PreventTransactionChain() will
block naive concurrent VACUUM.  (If that weren't the case, VACUUM would need
CheckTableNotInUse() even without parallelism in the picture.  I wouldn't
oppose adding CheckTableNotInUse() to VACUUM as defense in depth.)  ANALYZE is
special.  It's compatible with most concurrent use of the table, so
CheckTableNotInUse() is unnecessary.  Blocking ANALYZE during parallelism,
even if we someday allow other database writes, is a wart I can live with.

Given a choice between no parallelism and parallelism when the initiating
transaction holds no self-exclusive locks, I'd pick the latter.  But you have
overstated the significance of holding a lock and the degree to which Robert's
proposals change the same.

> Your argument essentially is that we don't actually need to lock objects
> if the other side only reads. Yes, you add a condition or two on top
> (e.g. CheckTableNotInUse() fails), but that's not that much. If we want
> to do this we really need to forbid *any* modification to catalog/user
> tables while parallel operations are ongoing. In the primary and worker
> backends.  I think that's going to end up being much more
> problematic/restrictive than not allowing non-cooperative paralellism
> after >= ShareUpdateExclusive. If we ever want to parallelize writing
> operations we'll regret this quite a bit.

https://wiki.postgresql.org/wiki/Parallel_Sort already contemplates blocking
every kind of database write, due to challenges around combo CIDs (affects
UPDATE and DELETE) and invalidation messages (mostly affects DDL).  I doubt
we'll ever care to allow a worker to start an independent DDL operation.  We
might want to allow INSERT, UPDATE, and DELETE.  I don't see that allowing
multiple workers to hold AccessExclusiveLock will make that more difficult.
Difficulties around XactLockTableWait() are more noteworthy, and they stand
regardless of what we do with heavyweight locking.

> The 'lets grant all the current locks at start of parallel query'
> approach seems quite a bit safer than always doing that during parallel
> query, don't get me wrong.

Agreed.

> Btw, what are your thoughts about SERIALIZABLE and parallel query?
> Afaics that'd be pretty hard to support?

It didn't look obviously awful, because most of the relevant data structures
live in shared memory.  We already have the concept of one backend adjusting
the predicate locks pertaining to another backend's transaction.  (That's not
to say the number of SERIALIZABLE-specific changes will be zero, either.)

Thanks,
nm



Re: group locking: incomplete patch, just for discussion

From
Robert Haas
Date:
On Wed, Nov 5, 2014 at 9:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 2, 2014 at 7:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The procgloballist stuff should be the subject of a separate patch
>> which I agree with.
>
> Yes, I think that's probably a net improvement in robustness quite
> apart from what we decide to do about any of the rest of this.  I've
> attached it here as revise-procglobal-tracking.patch and will commit
> that bit if nobody objects.

In reviewing this thread I realized that I never got around to
committing this bit.  And it still seems like a good idea, so I've
done that now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company