Re: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Parallel Inserts in CREATE TABLE AS
Date
Msg-id CALj2ACVdcrjwHXwvJqT-Fa32vnJEOjteep_3L24X8MK50E7M8w@mail.gmail.com
In response to RE: Parallel Inserts in CREATE TABLE AS  ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
List pgsql-hackers
On Tue, May 25, 2021 at 12:05 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> I noticed one place which could be one of the reasons that cause the performance degradation.
>
> +               /*
> +                * We don't need to skip contacting FSM while inserting tuples for
> +                * parallel mode, while extending the relations, workers instead of
> +                * blocking on a page while another worker is inserting, can check the
> +                * FSM for another page that can accommodate the tuples. This results
> +                * in major benefit for parallel inserts.
> +                */
> +               myState->ti_options = 0;
>
> I am not quite sure that disabling the " SKIP FSM " in parallel worker will bring performance gain.
> In my test environment, if I change this code to use option " TABLE_INSERT_SKIP_FSM ", then there
> seems no performance degradation . Could you please have a try on it ?
> (I test with the SQL you provided earlier[1])

Thanks for trying that out.

Please see the code around the use_fsm flag in
RelationGetBufferForTuple for more understanding of the points below.

What happens if the FSM is skipped, i.e. myState->ti_options =
TABLE_INSERT_SKIP_FSM?
1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple.
2) Each worker initially gets a block and keeps inserting into it
until it is full. When the block is full, the worker doesn't look in
the FSM (GetPageWithFreeSpace) as use_fsm is false. It goes directly
for relation extension and tries to acquire the relation extension
lock with LockRelationForExtension. Note that the bulk extension of
blocks with RelationAddExtraBlocks is not reached as use_fsm is false.
3) After acquiring the relation extension lock, it adds a single new
block with ReadBufferBI(relation, P_NEW, ...); see the comment "In
addition to whatever extension we performed above, we always add at
least one block to satisfy our own request." The tuple is inserted
into this new block.

Basically, the workers can't find the empty pages added by other
workers; each keeps repeating the above steps in its own silo.

What happens if the FSM is not skipped, i.e. myState->ti_options = 0?
1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple.
2) Each worker initially gets a block and keeps inserting into it
until it is full. When the block is full, the worker looks in the FSM
(GetPageWithFreeSpace) for a page with free space, as use_fsm is true.
If it can't find any page with the required amount of free space, it
goes for bulk relation extension (RelationAddExtraBlocks) after
acquiring the relation extension lock with
ConditionalLockRelationForExtension. The worker then adds extraBlocks
= Min(512, lockWaiters * 20) new blocks in RelationAddExtraBlocks and
immediately updates the bottom level of the FSM for each block (see
the comment around RecordPageWithFreeSpace for why only the bottom
level, not the entire FSM tree). After all the blocks are added, it
updates the entire FSM tree (FreeSpaceMapVacuumRange).
3) After the bulk extension, the worker adds one more block for itself
(see the comment "In addition to whatever extension we performed
above, we always add at least one block to satisfy our own request.")
and inserts the tuple into this new block.

Basically, the workers benefit from the bulk extension of the
relation, and they can always look for empty pages among the pages
added by other workers. There is a good chance that blocks will be
available after a bulk extension. Having said that, if the added extra
blocks are consumed by the workers very fast, i.e. if the tuples are
big and very few fit per page, then even the bulk extension can't help
much and there will be more contention on the relation extension lock.
One might think to add more blocks at a time, say Min(1024,
lockWaiters * 128/256/512) instead of the current extraBlocks =
Min(512, lockWaiters * 20). This does work (i.e. we don't see any
regression with the parallel-inserts-in-CTAS patches), but it isn't a
practical solution, because the relation ends up with many more pages,
many of them with plenty of free space, and future sequential scans on
that relation might take much longer.

If myState->ti_options = TABLE_INSERT_SKIP_FSM; is set in only the one
place (within if (myState->is_parallel)), then it is effective only
for the leader, i.e. the leader will not look in the FSM but all the
workers will, because within if (myState->is_parallel_worker) in
intorel_startup, myState->ti_options = 0; is set for the workers.

I ran tests with the configuration shown at [1] for case 4 (2 bigint
columns of 8 bytes each, 16 name columns of 64 bytes each, tuple size
1064 bytes, 10 million tuples) with leader participation, where I'm
seeing the regression:

1) when myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader
and workers, then my results are as follows:
0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275
2) when myState->ti_options = 0; for both leader and workers, then my
results are as follows:
0 workers - 1116184.718, 2 workers - 139798.055, 4 workers - 143022.409

I hope the above explanation and the test results clarify that
skipping the FSM doesn't solve the problem. Let me know if anything is
not clear or I'm missing something.

[1] postgresql.conf parameters used:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
port = 5440

System Configuration:
RAM:     528GB
Disk Type:   SSD
Disk Size:   1.5TB
lscpu output:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:              2
CPU MHz:               1064.000
CPU max MHz:           2129.0000
CPU min MHz:           1064.0000
BogoMIPS:              4266.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


