Re: adding wait_start column to pg_locks - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: adding wait_start column to pg_locks
Date
Msg-id 64102638-7fc3-1941-12f1-e139e3463c03@oss.nttdata.com
In response to Re: adding wait_start column to pg_locks  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: adding wait_start column to pg_locks
List pgsql-hackers

On 2021/02/09 19:11, Fujii Masao wrote:
> 
> 
> On 2021/02/09 18:13, Fujii Masao wrote:
>>
>>
>> On 2021/02/09 17:48, torikoshia wrote:
>>> On 2021-02-05 18:49, Fujii Masao wrote:
>>>> On 2021/02/05 0:03, torikoshia wrote:
>>>>> On 2021-02-03 11:23, Fujii Masao wrote:
>>>>>>> 64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating "waitStart" without
>>>>>>> holding the partition lock? Also GetLockStatusData() needs spinlock when reading "waitStart"?
>>>>>>
>>>>>> Also it might be worth thinking to use 64-bit atomic operations like
>>>>>> pg_atomic_read_u64(), for that.
>>>>>
>>>>> Thanks for your suggestion and advice!
>>>>>
>>>>> In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64().
>>>>>
>>>>> waitStart is TimestampTz, i.e., int64, but it seems pg_atomic_read_xxx and pg_atomic_write_xxx only support
>>>>> unsigned int, so I cast the type.
>>>>>
>>>>> I may be using these functions not correctly, so if something is wrong, I would appreciate any comments.
>>>>>
>>>>>
>>>>> About the documentation, since your suggestion seems better than v6, I used it as is.
>>>>
>>>> Thanks for updating the patch!
>>>>
>>>> +    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
>>>> +        pg_atomic_write_u64(&MyProc->waitStart,
>>>> +                            pg_atomic_read_u64((pg_atomic_uint64 *) &now));
>>>>
>>>> pg_atomic_read_u64() is really necessary? I think that
>>>> "pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.
>>>>
>>>> +        deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
>>>> +        pg_atomic_write_u64(&MyProc->waitStart,
>>>> +                    pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));
>>>>
>>>> Same as above.
>>>>
>>>> +        /*
>>>> +         * Record waitStart reusing the deadlock timeout timer.
>>>> +         *
>>>> +         * It would be ideal if this could be done synchronously with
>>>> +         * updating the lock information. However, since holding the
>>>> +         * partitionLock longer would hurt performance, we do it here
>>>> +         * asynchronously.
>>>> +         */
>>>>
>>>> IMO it's better to comment why we reuse the deadlock timeout timer.
>>>>
>>>>      proc->waitStatus = waitStatus;
>>>> +    pg_atomic_init_u64(&MyProc->waitStart, 0);
>>>>
>>>> pg_atomic_write_u64() should be used instead? Because waitStart can be
>>>> accessed concurrently there.
>>>>
>>>> I updated the patch and addressed the above review comments. Patch attached.
>>>> Barring any objection, I will commit this version.
>>>
>>> Thanks for modifying the patch!
>>> I agree with your comments.
>>>
>>> BTW, I ran pgbench several times before and after applying
>>> this patch.
>>>
>>> The environment is a virtual machine (CentOS 8), so this is
>>> just for reference, but there was no significant difference
>>> in latency or tps (both differences are below 1%).
>>
>> Thanks for the test! I pushed the patch.
> 
> But I reverted the patch because buildfarm members rorqual and
> prion don't like the patch. I'm trying to investigate the cause
> of these failures.
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10

-    relation     | locktype |        mode
------------------+----------+---------------------
- test_prepared_1 | relation | RowExclusiveLock
- test_prepared_1 | relation | AccessExclusiveLock
-(2 rows)
-
+ERROR:  invalid spinlock number: 0

"rorqual" reported that the above error happened in the server built with
--disable-atomics --disable-spinlocks when reading pg_locks after
the transaction was prepared. The cause of this issue is that the "waitStart"
atomic variable in the dummy proc created at the end of PREPARE TRANSACTION
was not initialized. I updated the patch so that pg_atomic_init_u64() is
called for the "waitStart" in the dummy proc for the prepared transaction.
Patch attached. I confirmed that the patched server built with
--disable-atomics --disable-spinlocks passes all the regression tests.

BTW, while investigating this issue, I found that pg_stat_wal_receiver also
could cause this error even in the current master (without the patch).
I will report that in separate thread.


> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16

"prion" reported the following error. But I'm not sure how the changes of
pg_locks caused this error. I found that Heikki also reported at [1] that
"prion" failed with the same error but was not sure how it happened.
This makes me think for now that this issue is not directly related to
the pg_locks changes.

-------------------------------------
pg_dump: error: query failed: ERROR:  missing chunk number 0 for toast value 14444 in pg_toast_2619
pg_dump: error: query was: SELECT
a.attnum,
a.attname,
a.atttypmod,
a.attstattarget,
a.attstorage,
t.typstorage,
a.attnotnull,
a.atthasdef,
a.attisdropped,
a.attlen,
a.attalign,
a.attislocal,
pg_catalog.format_type(t.oid, a.atttypmod) AS atttypname,
array_to_string(a.attoptions, ', ') AS attoptions,
CASE WHEN a.attcollation <> t.typcollation THEN a.attcollation ELSE 0 END AS attcollation,
pg_catalog.array_to_string(ARRAY(SELECT pg_catalog.quote_ident(option_name) || ' ' ||
pg_catalog.quote_literal(option_value) FROM pg_catalog.pg_options_to_table(attfdwoptions) ORDER BY option_name), E',
    ') AS attfdwoptions,
a.attidentity,
CASE WHEN a.atthasmissing AND NOT a.attisdropped THEN a.attmissingval ELSE null END AS attmissingval,
a.attgenerated
FROM pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_type t ON a.atttypid = t.oid
WHERE a.attrelid = '35987'::pg_catalog.oid AND a.attnum > 0::pg_catalog.int2
ORDER BY a.attnum
pg_dumpall: error: pg_dump failed on database "regression", exiting
waiting for server to shut down.... done
server stopped
pg_dumpall of post-upgrade database cluster failed
-------------------------------------

[1]
https://www.postgresql.org/message-id/f03ea04a-9b77-e371-9ab9-182cb35db1f9@iki.fi


Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment
