faster ETL / bulk data load for heap tables - Mailing list pgsql-hackers

From: Luc Vlaming
Subject: faster ETL / bulk data load for heap tables
Msg-id: 1175de79-690b-c605-8bb0-2fb152c2ea25@swarm64.com
Responses: Re: faster ETL / bulk data load for heap tables (Zhihong Yu <zyu@yugabyte.com>)
           Re: faster ETL / bulk data load for heap tables (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
Hi,

In an effort to speed up bulk data loading/transforming, I noticed that
considerable time is spent on the relation extension lock. I know there
are already many other efforts to increase the chances of using bulk
loading [1], [2], [3], [4], to make loading more parallel [5], and to
remove some syscalls [6], as well as completely new I/O systems [7] and
some more extreme measures like disabling WAL logging [8].

Whilst these will all help, they will ultimately be bottlenecked by the
relation extension lock. Also, from the tests I've done so far it seems
that, at least for bulk loading, pwrite() can actually carry us rather
far as long as we are not doing it under a global lock. Moreover, the
solution provided here might be an alternative to [7], because the
results are quite promising even with WAL enabled.

Attached are two WIP patches in the hope of getting feedback.
The first changes the way we do bulk loading: each backend now gets its
own standalone set of blocks, local to that backend. This helps both
with reducing contention and with some OS writeback mechanisms.
The second patch reduces the time spent locking buffer partitions by
shifting the logic around so that each set of 128 blocks uses the same
buffer partition, and by adding a custom function that gets buffer
blocks specifically for extension whilst keeping the previous partition
lock, thereby reducing the time we spend on futexes.
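
To illustrate the partition-lock reuse (a sketch only; the names below
are mine, not necessarily what the patch uses, and it assumes the block
grouping is already in place so that consecutive blocks map to the same
buffer-mapping partition): the extension path can hang on to the
partition lock it already holds and only swap locks when the partition
actually changes, roughly like this:

/*
 * Illustrative sketch, not the patch itself: keep the previously
 * acquired buffer-mapping partition lock and only switch when the next
 * block falls into a different partition. rnode, firstBlock and nblocks
 * come from the extension caller.
 */
LWLock     *heldPartLock = NULL;

for (int i = 0; i < nblocks; i++)
{
    BufferTag   tag;
    uint32      hash;
    LWLock     *partLock;

    INIT_BUFFERTAG(tag, rnode, MAIN_FORKNUM, firstBlock + i);
    hash = BufTableHashCode(&tag);
    partLock = BufMappingPartitionLock(hash);

    if (partLock != heldPartLock)
    {
        /* partition changed: release the old lock, take the new one */
        if (heldPartLock != NULL)
            LWLockRelease(heldPartLock);
        LWLockAcquire(partLock, LW_EXCLUSIVE);
        heldPartLock = partLock;
    }

    /* ... enter the new buffer into the shared hash table under partLock ... */
}

if (heldPartLock != NULL)
    LWLockRelease(heldPartLock);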

The design:
- add a set of buffers into the BulkInsertState that can be used by the
backend without any locks.
- add a ReadBufferExtendBulk function which extends the relation by
BULK_INSERT_BATCH_SIZE blocks at once (a rough sketch follows this list).
- rework FileWrite to take a parameter indicating whether we are calling
it just to extend the file. If supported, this uses ftruncate, as that
is much faster (also faster than posix_fallocate on my system) and,
according to the man page (https://linux.die.net/man/2/ftruncate), the
extended range reads back as zeroed space. This could later be cleaned
up into a dedicated FileExtend() function.
- rework mdextend to take a page count.
- make a specialized version of BufferAlloc, called BufferAllocExtend,
which keeps the lock on the last buffer partition around and tries to
reuse it, so that there are a lot fewer futex calls (see the sketch
above).
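
As a rough sketch of how the pieces above could fit together (all names
here are illustrative and not necessarily what the patches actually use;
RelationExtendBulk in particular is a hypothetical helper): the
BulkInsertState carries a batch of blocks reserved for this backend, and
ReadBufferExtendBulk only goes through relation extension once that
batch is used up:

#define BULK_INSERT_BATCH_SIZE 128      /* blocks claimed per extension */

typedef struct BulkInsertStateData
{
    BufferAccessStrategy strategy;
    Buffer      current_buf;

    /* blocks reserved for this backend, usable without further locking */
    BlockNumber next_free_block;
    int         blocks_left;
} BulkInsertStateData;

static Buffer
ReadBufferExtendBulk(Relation rel, BulkInsertState bistate)
{
    if (bistate->blocks_left == 0)
    {
        /* grow the relation by a whole batch in one call (hypothetical helper) */
        bistate->next_free_block = RelationExtendBulk(rel, BULK_INSERT_BATCH_SIZE);
        bistate->blocks_left = BULK_INSERT_BATCH_SIZE;
    }

    bistate->blocks_left--;
    return ReadBuffer(rel, bistate->next_free_block++);
}

The file-level extension itself could then, where the platform supports
it, avoid writing zero-filled pages altogether; again only a sketch,
with error handling and the pwrite() fallback omitted:

static void
FileExtendSketch(File file, off_t old_size, int nblocks)
{
    /*
     * ftruncate() grows the file without writing the new pages; the
     * extended range reads back as zeroes, which is all we need here.
     */
    if (FileTruncate(file, old_size + (off_t) nblocks * BLCKSZ,
                     WAIT_EVENT_DATA_FILE_EXTEND) < 0)
        elog(ERROR, "could not extend file: %m");
}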

Many things still need to be done; some are:
- reintroduce the FSM, and possibly optimize the lock usage there. In
other words: with this patch the server can currently only start, run
COPY FROM, and run read queries.
- look into WAL logging. Whilst it already seems fairly optimal wrt
locking and such, I noticed there seems to be an alternating pattern
between the bgwriter and the workers; setting some parameters higher
(wal_writer_flush_after, wal_writer_delay, wal_buffers) helped a lot.
- work nicely with the code from [6] so that register_dirty_segment is
indeed no longer needed; or alternatively, optimize that code so that
fewer locks are needed.
- make use of [9] in the fallback case in FileWrite(), when
ftruncate/fallocate is not available, so that the buffer size can be
reduced.

First results are below; all tests loaded the same 1 GB lineitem CSV 32
times into the same table. Tests were done on an NVMe disk, and the more
parallel ones also on a tmpfs disk, to spot potential disk bottlenecks
and to get an idea of the potential wrt using NVDIMM.
=================================
using an unlogged table:
NVME, UNLOGGED table, 4 parallel streams:   HEAD 171s, patched 113s
NVME, UNLOGGED table, 8 parallel streams:   HEAD 113s, patched 67s
NVME, UNLOGGED table, 16 parallel streams:  HEAD 112s, patched 42s
NVME, UNLOGGED table, 32 parallel streams:  HEAD 121s, patched 36s
tmpfs, UNLOGGED table, 16 parallel streams: HEAD 96s, patched 38s
tmpfs, UNLOGGED table, 32 parallel streams: HEAD 104s, patched 25s
=================================
using a normal table, wal_level=minimal, 16 MB WAL segments:
NVME, 4 parallel streams:   HEAD 237s, patched 157s
NVME, 8 parallel streams:   HEAD 200s, patched 142s
NVME, 16 parallel streams:  HEAD 171s, patched 145s
NVME, 32 parallel streams:  HEAD 199s, patched 146s
tmpfs, 16 parallel streams: HEAD 131s, patched 89s
tmpfs, 32 parallel streams: HEAD 148s, patched 98s
=================================
using a normal table, wal_level=minimal, 256 MB WAL segments,
wal_init_zero = off, wal_buffers = 262143, wal_writer_delay = 10000ms,
wal_writer_flush_after = 512MB

NVME, 4 parallel streams:   HEAD 201s, patched 159s
NVME, 8 parallel streams:   HEAD 157s, patched 109s
NVME, 16 parallel streams:  HEAD 150s, patched 78s
NVME, 32 parallel streams:  HEAD 158s, patched 70s
tmpfs, 16 parallel streams: HEAD 106s, patched 54s
tmpfs, 32 parallel streams: HEAD 113s, patched 44s
=================================

Thoughts?

Cheers,
Luc
Swarm64


[1] https://www.postgresql.org/message-id/flat/CAJcOf-f%3DUX1uKbPjDXf%2B8gJOoEPz9VCzh7pKnknfU4sG4LXj0A%40mail.gmail.com#49fe9f2ffcc9916cc5ed3a712aa5f28f
[2] https://www.postgresql.org/message-id/flat/CALj2ACWjymmyTvvhU20Er-LPLaBjzBQxMJwr4nzO7XWmOkxhsg%40mail.gmail.com#be34b5b7861876fc0fd7edb621c067fa
[3] https://www.postgresql.org/message-id/flat/CALj2ACXg-4hNKJC6nFnepRHYT4t5jJVstYvri%2BtKQHy7ydcr8A%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR%3Dgdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ%40mail.gmail.com
[5] https://www.postgresql.org/message-id/flat/CALDaNm3GaZyYPpGu-PpF0SEkJg-eaW3TboHxpxJ-2criv2j_eA%40mail.gmail.com#07292ce654ef58fae7f257a4e36afc41
[6] https://www.postgresql.org/message-id/flat/20200203132319.x7my43whtefeznz7%40alap3.anarazel.de#85a2a0ab915cdf079862d70505abe3db
[7] https://www.postgresql.org/message-id/flat/20201208040227.7rlzpfpdxoau4pvd%40alap3.anarazel.de#b8ea4a3b7f37e88ddfe121c4b3075e7b
[8] https://www.postgresql.org/message-id/flat/CAD21AoA9oK1VOoUuKW-jEO%3DY2nt5kCQKKFgeQwwRUMoh6BE-ow%40mail.gmail.com#0475248a5ff7aed735be41fd4034ae36
[9] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BHf_R_ih1pkBMTWn%3DSTyKMOM2Ks47Y_UqqfU1wRc1VvA%40mail.gmail.com#7a53ad72331e423ba3c6a50e6dc1259f

