Thread: [HACKERS] Two pass CheckDeadlock in contended case
Good day, hackers.

During heavy workload, a process sometimes reaches the deadlock timeout
even though no real deadlock has occurred. This is easily reproducible
with pg_advisory_xact_lock on the same value, plus some time-consuming
operation (an update) and many clients.

When a backend reaches the deadlock timeout, it calls CheckDeadlock,
which locks all partitions of the lock hash in exclusive mode in order
to walk the wait-for graph and search for a deadlock.

If hundreds of backends reach this timeout while trying to acquire an
advisory lock on the same value, the system stalls hard for many
seconds, because they all traverse the same huge lock graph under an
exclusive lock. During this stall no meaningful operation is possible
(no new transaction can begin).

The attached patch makes CheckDeadlock do two passes:
- The first pass takes LW_SHARED on the partitions of the lock hash.
  DeadLockCheck is called with a "readonly" flag, so it modifies
  nothing.
- If a possible "soft" or "hard" deadlock is detected, i.e. if the lock
  graph may need to be modified, the partitions are relocked with
  LW_EXCLUSIVE and DeadLockCheck is called again.

This fixes the hard stall, because backends walk the lock graph under a
shared lock and find that there is no real deadlock.
Test on a 4-socket Xeon machine:

pgbench_zipf -s 10 -c 800 -j 100 -M prepared -T 450 \
  -f ./ycsb_read_zipf.sql@50 -f ./ycsb_update_lock2_zipf.sql@50 -P 5

ycsb_read_zipf.sql:
\set i random_zipfian(1, 100000 * :scale, 1.01)
SELECT abalance FROM pgbench_accounts WHERE aid = :i;

ycsb_update_lock2_zipf.sql:
\set i random_zipfian(1, 100000 * :scale, 1.01)
select lock_and_update( :i );

CREATE OR REPLACE FUNCTION public.lock_and_update(i integer)
RETURNS void
LANGUAGE sql
AS $function$
SELECT pg_advisory_xact_lock( $1 );
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1;
$function$

Without the attached patch:

progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757
progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353
progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017
progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199
progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570
progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697
progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116
progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166
progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625
progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453
progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741

(autovacuum triggered the deadlock timeout, and therefore the stall)

Patched:

progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980
progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997
progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222
progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296
progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720
progress: 30.0 s, 45202.1 tps, lat 17.733 ms stddev 88.554
progress: 35.0 s, 40035.8 tps, lat 19.343 ms stddev 97.787
progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465
progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114
progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281
progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487
progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141
progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080
progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482

(autovacuum triggered the deadlock timeout, but postgresql did not
stall)

I believe the patch will positively affect other heavy workloads as
well, although I have not collected benchmarks for them.

`make check-world` passes when configured with `--enable-tap-tests
--enable-cassert`.

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 2017-10-03 17:30, Sokolov Yura wrote:
> Good day, hackers.
>
> During hard workload sometimes process reaches deadlock timeout
> even if no real deadlock occurred. [...]
>
> Attached patch makes CheckDeadlock to do two passes: [...]
>
> `make check-world` passes with configured `--enable-tap-tests
> --enable-casserts`

Excuse me, corrected version is in attach.

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Attachment
The patch still applies and it's part of this commitfest.

On Tue, Oct 3, 2017 at 8:36 PM, Sokolov Yura <y.sokolov@postgrespro.ru> wrote:
> On 2017-10-03 17:30, Sokolov Yura wrote:
>>
>> Attached patch makes CheckDeadlock to do two passes: [...]
>>
>> Without attached patch:
>> [...]
>> progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741
>>
>> Patched:
>> [...]
>> progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482

I am confused about interpreting these numbers. Does it mean that
without the patch, at the end of 70 s we have completed 40553.2 * 70
transactions, and with the patch there are 38387.9 * 70 transactions
completed in 70 seconds?

I agree that with the patch there is smoother degradation when a lot
of backends enter the deadlock detection phase. That's nice, but if my
interpretation of the numbers is correct, that smoothness comes at the
cost of decreased TPS. So it would make sense to have a GUC
controlling this behaviour.

Talking about GUCs, the deadlock_timeout [1] value decides when the
deadlock check kicks in. In such cases probably increasing it would
suffice and we may not need the patch. Can anybody else comment on how
desirable this feature is?

> Excuse me, corrected version is in attach.

About the patch itself: I think releasing the shared locks and then
taking exclusive locks on the lock partitions may lead to starvation.
If there are multiple backends that enter deadlock detection
simultaneously, all of them will find that there's a deadlock, and one
of them will succeed in resolving it only after all of them release
their respective locks. Also, there's a comment at the end about the
order in which to release the locks:

/*
 * And release locks.  We do this in reverse order for two reasons: (1)
 * Anyone else who needs more than one of the locks will be trying to lock
 * them in increasing order; we don't want to release the other process
 * until it can get all the locks it needs.  (2) This avoids O(N^2)
 * behavior inside LWLockRelease.
 */
check_done:
	for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
		LWLockRelease(LockHashPartitionLockByIndex(i));

Is that scenario applicable when the shared locks are released in the
first phase?

[1] https://www.postgresql.org/docs/11/static/runtime-config-locks.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 11.07.2018 17:01, Ashutosh Bapat wrote:
> The patch still applies and it's part of this commitfest.
> [...]
> I am confused about interpreting these numbers. Does it mean that
> without the patch at the end of 70s, we have completed 40553.2 * 70
> transactions and with the patch there are 70 * 38387.9 transactions
> completed in 70 seconds?

It is regular pgbench output, so there is no source of confusion: each
tps number is the throughput of that particular 5-second interval.
That is why there are 0 tps and 1 tps intervals without the patch. How
could there be 0 tps and 1 tps intervals if tps were an average since
the start?

It means that without the patch there were
46k+51k+52k+47k+42k+12k+0k+0k+0k+0k+27k+39k+41k = 357 thousand
transactions completed, and with the patch there were
56k+55k+51k+51k+48k+45k+40k+14k+30k+39k+41k+38k+40k+38k = 586 thousand
transactions completed.

> I agree that with the patch there is smoother degradation when a lot
> of backends enter deadlock detection phase. That's nice, but if my
> interpretation of numbers is correct, that smoothness comes with a
> cost of decreased TPS. So, it would make sense to have a GUC
> controlling this behaviour.

Your interpretation is not correct.

> Talking about GUCs, deadlock_timeout [1] value decides when the
> deadlock check kicks in. In such cases probably increasing it would
> suffice and we may not need the patch.

Increasing deadlock_timeout does solve this issue, but increasing the
deadlock timeout is a very bad decision in the presence of real
deadlocks.

> About the patch itself, I think releasing the shared lock and taking
> exclusive lock on lock partitions, may lead to starvation. [...]

It is not worse than the behavior without the patch. Without the
patch, when a real deadlock happens and multiple backends enter
deadlock detection, all of them (except the first) will wait for the
first backend to release all its EXCLUSIVE locks, only to lock them
again and then find that the deadlock has been eliminated.

Given N backends in a deadlock, the total time without the patch is
1*X + (N-1)*Y = Q, where X is the time to detect and resolve the
deadlock, and Y is the time to check for a deadlock after resolution.

With the patch, backends do the first pass in parallel, so the total
time is Z*(1+e) + Q, where Z is the time of the first fast, imprecise
check, and e is a time drift due to process scheduling. Note that in
reality all these times are usually much smaller than the 1-second
default for deadlock_timeout, so backends will wait for
deadlock_timeout much longer than the additional Z*(1+e) needed for
the first inexact pass.

And if some backends try to detect a deadlock a bit later, i.e. after
one backend has already taken the exclusive locks, they will not fully
participate in (N-1)*Y, because they will detect that the deadlock has
been eliminated using shared locks, i.e. in parallel.

> /*
>  * And release locks.  We do this in reverse order for two reasons: (1)
>  * Anyone else who needs more than one of the locks will be trying to lock
>  * them in increasing order; we don't want to release the other process
>  * until it can get all the locks it needs.  (2) This avoids O(N^2)
>  * behavior inside LWLockRelease.
>  */
> check_done:
> 	for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
> 		LWLockRelease(LockHashPartitionLockByIndex(i));
>
> Is that scenario applicable when the shared locks are released in the
> first phase?

I release the shared locks in the same order as described in this
comment, so I don't understand what you mean.

With regards,
Sokolov Yura
Attachment
On Sat, Jul 21, 2018 at 2:58 AM, Yura Sokolov <funny.falcon@gmail.com> wrote:
>
> It is regular pgbench output, so there is no source for confusion.
> tps numbers is number of transactions completed in that particular
> 5 second interval. That is why there are 0 tps and 1 tps intervals
> without patch.

In your output there's no mention of 0s and 1s tps. It might be
something standard with pgbench, which I don't know yet. Sorry for
that.

> It means that without patch there were
> 46k+51k+52k+47k+42k+12k+0k+0k+0k+0k+27k+39k+41k=357 thousands of
> transactions completed.
>
> And with patch there were
> 56k+55k+51k+51k+48k+45k+40k+14k+30k+39k+41k+38k+40k+38k=586 thousands
> of transactions completed.

Thanks for the explanation.

> Increasing deadlock_timeout solves this issue.
> But increasing deadlock timeout is very-very bad decision in presence
> of real deadlocks.

I thought you were trying to improve a situation where the
transactions genuinely take longer than the deadlock timeout and end
up in deadlock detection unnecessarily. For that case, the advice that
deadlock_timeout should be set to a value longer than the usual
transaction period works. In the case of real deadlocks, I think that
penalizes the deadlocking backends by making them wait longer after
the actual deadlock happens. But in such a case the solution is to
avoid the deadlock at the level of the application or in the backend
(for deadlocks with system tables involved). That said, your patch is
small enough to accept if it improves those cases as well.

> It is not worse than behavior without patch. [...]
> And, if some backends tries to detect deadlock a bit later, ie after
> one backend already takes exclusive lock,

I think the case I am talking about is somewhere between these two:
the backend which has imprecisely decided that there's a deadlock will
start taking exclusive locks and will have to wait for all the
backends with shared locks to release their locks. This will never
happen if there is a stream of backends that keeps on taking shared
locks. This is the classical problem of starvation caused by lock
upgrade. Taking the exclusive lock to begin with avoids this problem.

>> /*
>>  * And release locks.  We do this in reverse order for two reasons: (1)
>>  * Anyone else who needs more than one of the locks will be trying to lock
>>  * them in increasing order; we don't want to release the other process
>>  * until it can get all the locks it needs.

Your patch is breaking this assumption. We may end up in a situation
where more than two backends have shared locks. One of the backends
starts releasing its locks and releases even the first lock. Some
different backend gets its first exclusive lock because of that, but
cannot get the next one, since it's held by the second backend holding
shared locks. This might not happen if by "more than one" the comment
means all the locks.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 3 October 2017 at 15:30, Sokolov Yura <y.sokolov@postgrespro.ru> wrote:

> If hundreds of backends reaches this timeout trying to acquire
> advisory lock on a same value, it leads to hard-stuck for many
> seconds, cause they all traverse same huge lock graph under
> exclusive lock.
> During this stuck there is no possibility to do any meaningful
> operations (no new transaction can begin).

Well observed, we clearly need to improve this.

> Attached patch makes CheckDeadlock to do two passes:
> - first pass uses LW_SHARED on partitions of lock hash.
>   DeadLockCheck is called with flag "readonly", so it doesn't
>   modify anything.
> - If there is possibility of "soft" or "hard" deadlock detected,
>   ie if there is need to modify lock graph, then partitions
>   relocked with LW_EXCLUSIVE, and DeadLockCheck is called again.
>
> It fixes hard-stuck, cause backends walk lock graph under shared
> lock, and found that there is no real deadlock.

In phase 2, does this relock only the partitions required to reorder
the lock graph, or does it request all locks? Fewer locks would be
better.

If you decide to reorder the lock graph, then only one backend should
attempt this at a time. We should keep track of reorder requests, so
that if two backends arrive at the same conclusion, only one proceeds
to do this.

Many deadlocks happen between locks in the same table. It would also be
a useful optimization to check just one partition for lock graphs
before we check all partitions.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 23, 2018 at 5:14 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I think the case I am talking about is somewhere between these two -
> the backend which has imprecisely decided that there's a deadlock will
> start taking exclusive locks and will have to wait for all the
> backends with shared locks to release there locks. This will never
> happen if there is a stream of backends which keeps on taking shared
> locks. This is classical problem of starvation because of lock
> upgrade. Taking exclusive lock to begin with avoids this problem.

I don't quite see how this could happen in this case. IIUC, if a
backend takes the shared locks and finds that there is a deadlock, it
will then take exclusive locks. If it takes the shared locks and finds
that there is no deadlock, then it will just wait to be woken up.
After some small fraction of a second, every backend in the system
that was taking shared locks will have taken them and either started
waiting for an exclusive lock or on the process semaphore. It's not
like ProcArrayLock or CLogControlLock, where there can be an endless
stream of new requests.

>>> /*
>>> * And release locks. We do this in reverse order for two reasons: (1)
>>> * Anyone else who needs more than one of the locks will be trying to lock
>>> * them in increasing order; we don't want to release the other process
>>> * until it can get all the locks it needs.
>
> Your patch is breaking this assumption. We may end up in a situation
> where more than two backends have shared locks. One of the backend
> starts releasing its locks and releases even the first lock. Some
> different backend gets its first exclusive lock because of that but
> can not get the next one since it's held by the second backend holding
> shared locks. This might not happen if by "more than one" the comment
> means all the locks.

I don't see the problem. We can't get the first exclusive lock until
all processes have released their shared locks, and since they release
in decreasing order, nobody holds any lock at that point. Unless I'm
confused?

I think this patch is probably a good idea in some form. Whether it's
got all the details exactly right, I'm not sure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 4, 2017 at 12:07 AM Sokolov Yura <y.sokolov@postgrespro.ru> wrote: > > On 2017-10-03 17:30, Sokolov Yura wrote: > > Good day, hackers. > > > > During hard workload sometimes process reaches deadlock timeout > > even if no real deadlock occurred. It is easily reproducible with > > pg_xact_advisory_lock on same value + some time consuming > > operation (update) and many clients. > > > > When backend reaches deadlock timeout, it calls CheckDeadlock, > > which locks all partitions of lock hash in exclusive mode to > > walk through graph and search for deadlock. > > > > If hundreds of backends reaches this timeout trying to acquire > > advisory lock on a same value, it leads to hard-stuck for many > > seconds, cause they all traverse same huge lock graph under > > exclusive lock. > > During this stuck there is no possibility to do any meaningful > > operations (no new transaction can begin). > > > > Attached patch makes CheckDeadlock to do two passes: > > - first pass uses LW_SHARED on partitions of lock hash. > > DeadLockCheck is called with flag "readonly", so it doesn't > > modify anything. > > - If there is possibility of "soft" or "hard" deadlock detected, > > ie if there is need to modify lock graph, then partitions > > relocked with LW_EXCLUSIVE, and DeadLockCheck is called again. > > > > It fixes hard-stuck, cause backends walk lock graph under shared > > lock, and found that there is no real deadlock. 
> >
> > Test on a 4-socket Xeon machine:
> > pgbench_zipf -s 10 -c 800 -j 100 -M prepared -T 450 -f
> > ./ycsb_read_zipf.sql@50 -f ./ycsb_update_lock2_zipf.sql@50 -P 5
> >
> > ycsb_read_zipf.sql:
> > \set i random_zipfian(1, 100000 * :scale, 1.01)
> > SELECT abalance FROM pgbench_accounts WHERE aid = :i;
> >
> > ycsb_update_lock2_zipf.sql:
> > \set i random_zipfian(1, 100000 * :scale, 1.01)
> > select lock_and_update( :i );
> >
> > CREATE OR REPLACE FUNCTION public.lock_and_update(i integer)
> >  RETURNS void
> >  LANGUAGE sql
> > AS $function$
> >   SELECT pg_advisory_xact_lock( $1 );
> >   UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1;
> > $function$
> >
> > Without the attached patch:
> >
> > progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757
> > progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353
> > progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017
> > progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199
> > progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570
> > progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697
> > progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116
> > progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan
> > progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan
> > progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166
> > progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan
> > progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625
> > progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453
> > progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741
> >
> > (autovacuum triggered the deadlock timeout,
> > and therefore the stall)
> >
> > Patched:
> >
> > progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980
> > progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997
> > progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222
> > progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296
> > progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720
> > progress: 30.0 s, 45202.1 tps, lat 17.733 ms stddev 88.554
> > progress: 35.0 s, 40035.8 tps, lat 19.343 ms stddev 97.787
> > progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465
> > progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114
> > progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281
> > progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487
> > progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141
> > progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080
> > progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482
> >
> > (autovacuum triggered the deadlock timeout,
> > but postgresql did not get stuck)
> >
> > I believe the patch will positively affect other heavy workloads
> > as well, although I have not collected benchmarks.
> >
> > `make check-world` passes with configured `--enable-tap-tests
> > --enable-cassert`
>
> Excuse me, the corrected version is in the attachment.

The idea of this patch seems reasonable.

I've looked at this patch. This patch can still be applied cleanly to
the current HEAD and passes the regression tests.

Here are my review comments.

+   if (readonly)
+   {
+       if (nWaitOrders > 0)
+           return DS_HARD_DEADLOCK;
+       else if (blocking_autovacuum_proc != NULL)
+           return DS_BLOCKED_BY_AUTOVACUUM;
+       else
+           return DS_NO_DEADLOCK;
+   }

With this patch DeadLockCheck can return DS_HARD_DEADLOCK in the
readonly = true case, and we then retry the deadlock check, but
DS_HARD_DEADLOCK originally means that we've detected a deadlock that
is unresolvable (see the definition of DeadLockState). So I think it's
better to return DS_SOFT_DEADLOCK, or at least to update the
description of DeadLockState.

------

    if (deadlock_state == DS_HARD_DEADLOCK)
    {
+       if (shared)
+       {
+           shared = false;
+           for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
+               LWLockRelease(LockHashPartitionLockByIndex(i));
+           for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+               LWLockAcquire(LockHashPartitionLockByIndex(i), LW_EXCLUSIVE);
+           goto retry;
+       }
+

This code forces us to check for deadlocks exactly twice.
IIUC, if a hard deadlock is found even while holding the shared locks,
we can abort immediately.

-----

The comment of DeadLockCheck needs to be updated:

 * DeadLockCheck -- Checks for deadlocks for a given process
 *
 * This code looks for deadlocks involving the given process.  If any
 * are found, it tries to rearrange lock wait queues to resolve the
 * deadlock.  If resolution is impossible, return DS_HARD_DEADLOCK ---
 * the caller is then expected to abort the given proc's transaction.

We might not rearrange anything when readonly is true, even if we
found a deadlock.

-----

The first sentence of the following comment also needs to be updated:

 /*
  * Acquire exclusive lock on the entire shared lock data structures. Must
  * grab LWLocks in partition-number order to avoid LWLock deadlock.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Nov 6, 2018 at 5:19 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> The idea of this patch seems reasonable.
>
> I've looked at this patch. This patch can still be applied cleanly to
> the current HEAD and passes the regression tests.
>
> Here are my review comments.

Thanks for the review. Just for the record, the patch still has no
conflicts and passes all the tests. Yura, do you have any plans for
this patch? Could you respond to the feedback? In the meantime I'm
moving it to the next CF.
On Fri, Nov 30, 2018 at 04:44:37PM +0100, Dmitry Dolgov wrote:
> Thanks for the review. Just for the record, the patch still has no
> conflicts and passes all the tests. Yura, do you have any plans for
> this patch? Could you respond to the feedback? In the meantime I'm
> moving it to the next CF.

No answers since the latest review, so I am marking the patch as
returned with feedback.
--
Michael