Thread: A bug in LWLOCK_STATS

A bug in LWLOCK_STATS

From
Fujii Masao
Date:
Hi,

When I compiled PostgreSQL with -DLWLOCK_STATS and tried to check
the statistics of light-weight locks, I observed that more than one
statistics entries were output *for the same backend process and
the same lwlock*. For example, I got the following four statistics
when I checked how the process with PID 81141 processed ProcArrayLock.
This is strange, and IMO only one statistics should be output for
the same backend process and lwlock.

$ grep "PID 81141 lwlock ProcArrayLock" data/log/postgresql-2020-02-05_141842.log
PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 4000 exacq 0 blk 0 spindelay 0 dequeue self 0
PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 2 exacq 0 blk 0 spindelay 0 dequeue self 0
PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 6001 exacq 1 blk 0 spindelay 0 dequeue self 0
PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 5 exacq 1 blk 0 spindelay 0 dequeue self 0

The cause of this issue is that the key variable used for lwlock hash
table was not fully initialized. The key consists of two fields and
they are initialized as follows. But the following 4 bytes allocated
for the alignment was not initialized. So even if the same key was
specified, hash_search(HASH_ENTER) could not find the existing
entry for that key and created new one.

    key.tranche = lock->tranche;
    key.instance = lock;

Attached is the patch fixing this issue by initializing the key
variable with zero. In the patched version, I confirmed that only one
statistics is output for the same process and the same lwlock.
Also this patch would reduce the volume of lwlock statistics
very much.

This issue was introduced by commit 3761fe3c20. So the patch needs
to be back-patch to v10.

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

Attachment

Re: A bug in LWLOCK_STATS

From
Julien Rouhaud
Date:
On Wed, Feb 05, 2020 at 02:43:49PM +0900, Fujii Masao wrote:
> Hi,
> 
> When I compiled PostgreSQL with -DLWLOCK_STATS and tried to check
> the statistics of light-weight locks, I observed that more than one
> statistics entries were output *for the same backend process and
> the same lwlock*. For example, I got the following four statistics
> when I checked how the process with PID 81141 processed ProcArrayLock.
> This is strange, and IMO only one statistics should be output for
> the same backend process and lwlock.
> 
> $ grep "PID 81141 lwlock ProcArrayLock" data/log/postgresql-2020-02-05_141842.log
> PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 4000 exacq 0 blk 0 spindelay 0 dequeue self 0
> PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 2 exacq 0 blk 0 spindelay 0 dequeue self 0
> PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 6001 exacq 1 blk 0 spindelay 0 dequeue self 0
> PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 5 exacq 1 blk 0 spindelay 0 dequeue self 0
> 
> The cause of this issue is that the key variable used for lwlock hash
> table was not fully initialized. The key consists of two fields and
> they are initialized as follows. But the following 4 bytes allocated
> for the alignment was not initialized. So even if the same key was
> specified, hash_search(HASH_ENTER) could not find the existing
> entry for that key and created new one.
> 
>     key.tranche = lock->tranche;
>     key.instance = lock;
> 
> Attached is the patch fixing this issue by initializing the key
> variable with zero. In the patched version, I confirmed that only one
> statistics is output for the same process and the same lwlock.
> Also this patch would reduce the volume of lwlock statistics
> very much.
> 
> This issue was introduced by commit 3761fe3c20. So the patch needs
> to be back-patch to v10.

Good catch!  The patch looks good to me.  Just in case I looked at other users
of HASH_BLOBS and AFAICT there's no other cases of key that can contain padding
bytes that aren't memset first.



Re: A bug in LWLOCK_STATS

From
Kyotaro Horiguchi
Date:
At Wed, 5 Feb 2020 14:43:49 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> The cause of this issue is that the key variable used for lwlock hash
> table was not fully initialized. The key consists of two fields and
> they are initialized as follows. But the following 4 bytes allocated
> for the alignment was not initialized. So even if the same key was
> specified, hash_search(HASH_ENTER) could not find the existing
> entry for that key and created new one.
> 
>     key.tranche = lock->tranche;
>     key.instance = lock;
> 
> Attached is the patch fixing this issue by initializing the key
> variable with zero. In the patched version, I confirmed that only one
> statistics is output for the same process and the same lwlock.
> Also this patch would reduce the volume of lwlock statistics
> very much.

Nice catch. A brielf grepping showed me no another instance of the
same issue.  I found some composite hash key struct are used without
intialization but AFAIS they don't have padding before the last
member, or uses strcmp or custom coparison functions. (I don't think
we don't have that in the regular paths, though..)

> This issue was introduced by commit 3761fe3c20. So the patch needs
> to be back-patch to v10.

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: A bug in LWLOCK_STATS

From
Fujii Masao
Date:

On 2020/02/05 17:13, Julien Rouhaud wrote:
> On Wed, Feb 05, 2020 at 02:43:49PM +0900, Fujii Masao wrote:
>> Hi,
>>
>> When I compiled PostgreSQL with -DLWLOCK_STATS and tried to check
>> the statistics of light-weight locks, I observed that more than one
>> statistics entries were output *for the same backend process and
>> the same lwlock*. For example, I got the following four statistics
>> when I checked how the process with PID 81141 processed ProcArrayLock.
>> This is strange, and IMO only one statistics should be output for
>> the same backend process and lwlock.
>>
>> $ grep "PID 81141 lwlock ProcArrayLock" data/log/postgresql-2020-02-05_141842.log
>> PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 4000 exacq 0 blk 0 spindelay 0 dequeue self 0
>> PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 2 exacq 0 blk 0 spindelay 0 dequeue self 0
>> PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 6001 exacq 1 blk 0 spindelay 0 dequeue self 0
>> PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 5 exacq 1 blk 0 spindelay 0 dequeue self 0
>>
>> The cause of this issue is that the key variable used for lwlock hash
>> table was not fully initialized. The key consists of two fields and
>> they are initialized as follows. But the following 4 bytes allocated
>> for the alignment was not initialized. So even if the same key was
>> specified, hash_search(HASH_ENTER) could not find the existing
>> entry for that key and created new one.
>>
>>     key.tranche = lock->tranche;
>>     key.instance = lock;
>>
>> Attached is the patch fixing this issue by initializing the key
>> variable with zero. In the patched version, I confirmed that only one
>> statistics is output for the same process and the same lwlock.
>> Also this patch would reduce the volume of lwlock statistics
>> very much.
>>
>> This issue was introduced by commit 3761fe3c20. So the patch needs
>> to be back-patch to v10.

Pushed.

> Good catch!  The patch looks good to me.  Just in case I looked at other users
> of HASH_BLOBS and AFAICT there's no other cases of key that can contain padding
> bytes that aren't memset first.

Thanks Julien and Horiguchi-san for reviewing the patch
and checking other cases!

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters