Thread: faster ETL / bulk data load for heap tables

faster ETL / bulk data load for heap tables

From
Luc Vlaming
Date:
Hi,

In an effort to speed up bulk data loading/transforming I noticed that 
considerable time is spent in the relation extension lock. I know there 
are already many other efforts to increase the chances of using bulk 
loading [1], [2], [3], [4], efforts to make loading more parallel [5], 
and to remove some syscalls [6], as well as completely new I/O systems 
[7] and some more extreme measures like disabling WAL logging [8].

Whilst they will all help, they will ultimately be stopped by the 
relation extension lock. It also seems, from the tests I've done so far, 
that at least for bulk loading pwrite() can actually carry us rather 
far, as long as we are not doing it under a global lock. Moreover, the 
solution provided here might be an alternative to [7], because the 
results are quite promising even with WAL enabled.

Attached are two WIP patches in the hope of getting feedback.
The first changes the way we do bulk loading: each backend now gets a 
standalone set of blocks allocated that are local to that backend. This 
helps both with reducing contention and with some OS writeback 
mechanisms.
The second patch reduces the time spent locking the buffer partitions 
by shifting around the logic so that each set of 128 blocks uses the 
same buffer partition, and by adding a custom function that gets buffer 
blocks specifically for extension whilst keeping the previous partition 
lock, thereby reducing the amount of time we spend on futexes.

The design:
- Add a set of buffers into the BulkInsertState that can be used by the 
backend without any locks.
- Add a ReadBufferExtendBulk function which extends the relation by 
BULK_INSERT_BATCH_SIZE blocks at once.
- Rework FileWrite to take a parameter indicating that we are calling 
it just to extend the file. If supported, it then uses ftruncate, as 
this is much faster (also faster than posix_fallocate on my system) 
and, according to the man page 
(https://linux.die.net/man/2/ftruncate), the extended space should read 
as zeros. To be cleaned up later, possibly into a dedicated 
FileExtend() function.
- Rework mdextend to take a page count.
- Make a specialized version of BufferAlloc called BufferAllocExtend 
which keeps the lock on the last buffer partition around and tries to 
reuse it, so that there are far fewer futex calls.

Many things still need to be done; some are:
- Reintroduce the FSM, and possibly optimize the lock usage there. In 
other words: this patch currently can only start the server and run 
COPY FROM and read queries.
- Look into the WAL logging. Whilst it seems fairly optimal already wrt 
the locking and such, I noticed there seems to be an alternating 
pattern between the bgwriter and the workers, although increasing some 
parameters (wal_writer_flush_after, wal_writer_delay, wal_buffers) 
helped a lot.
- Work nicely with the code from [6] so that register_dirty_segment is 
indeed not needed anymore; or alternatively optimize that code so that 
fewer locks are needed.
- Make use of [9] in the fallback case in FileWrite(), when 
ftruncate/fallocate is not available, so that the buffer size can be 
reduced.

First results are below; all tests loaded the same 1G lineitem CSV 32 
times into the same table. Tests were done on an NVMe drive, and the 
more parallel ones also on a tmpfs disk, to spot potential disk 
bottlenecks and, e.g., the potential of using NVDIMM.
=================================
using an unlogged table:
NVME, UNLOGGED table, 4 parallel streams:   HEAD 171s, patched 113s
NVME, UNLOGGED table, 8 parallel streams:   HEAD 113s, patched 67s
NVME, UNLOGGED table, 16 parallel streams:  HEAD 112s, patched 42s
NVME, UNLOGGED table, 32 parallel streams:  HEAD 121s, patched 36s
tmpfs, UNLOGGED table, 16 parallel streams: HEAD 96s, patched 38s
tmpfs, UNLOGGED table, 32 parallel streams: HEAD 104s, patched 25s
=================================
using a normal table, wal_level=minimal, 16MB WAL segments:
NVME, 4 parallel streams:   HEAD 237s, patched 157s
NVME, 8 parallel streams:   HEAD 200s, patched 142s
NVME, 16 parallel streams:  HEAD 171s, patched 145s
NVME, 32 parallel streams:  HEAD 199s, patched 146s
tmpfs, 16 parallel streams: HEAD 131s, patched 89s
tmpfs, 32 parallel streams: HEAD 148s, patched 98s
=================================
using a normal table, wal_level=minimal, 256MB WAL segments,
wal_init_zero = off, wal_buffers = 262143, wal_writer_delay = 10000ms,
wal_writer_flush_after = 512MB

NVME, 4 parallel streams:   HEAD 201s, patched 159s
NVME, 8 parallel streams:   HEAD 157s, patched 109s
NVME, 16 parallel streams:  HEAD 150s, patched 78s
NVME, 32 parallel streams:  HEAD 158s, patched 70s
tmpfs, 16 parallel streams: HEAD 106s, patched 54s
tmpfs, 32 parallel streams: HEAD 113s, patched 44s
=================================

Thoughts?

Cheers,
Luc
Swarm64


[1] https://www.postgresql.org/message-id/flat/CAJcOf-f%3DUX1uKbPjDXf%2B8gJOoEPz9VCzh7pKnknfU4sG4LXj0A%40mail.gmail.com#49fe9f2ffcc9916cc5ed3a712aa5f28f
[2] https://www.postgresql.org/message-id/flat/CALj2ACWjymmyTvvhU20Er-LPLaBjzBQxMJwr4nzO7XWmOkxhsg%40mail.gmail.com#be34b5b7861876fc0fd7edb621c067fa
[3] https://www.postgresql.org/message-id/flat/CALj2ACXg-4hNKJC6nFnepRHYT4t5jJVstYvri%2BtKQHy7ydcr8A%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR%3Dgdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ%40mail.gmail.com
[5] https://www.postgresql.org/message-id/flat/CALDaNm3GaZyYPpGu-PpF0SEkJg-eaW3TboHxpxJ-2criv2j_eA%40mail.gmail.com#07292ce654ef58fae7f257a4e36afc41
[6] https://www.postgresql.org/message-id/flat/20200203132319.x7my43whtefeznz7%40alap3.anarazel.de#85a2a0ab915cdf079862d70505abe3db
[7] https://www.postgresql.org/message-id/flat/20201208040227.7rlzpfpdxoau4pvd%40alap3.anarazel.de#b8ea4a3b7f37e88ddfe121c4b3075e7b
[8] https://www.postgresql.org/message-id/flat/CAD21AoA9oK1VOoUuKW-jEO%3DY2nt5kCQKKFgeQwwRUMoh6BE-ow%40mail.gmail.com#0475248a5ff7aed735be41fd4034ae36
[9] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BHf_R_ih1pkBMTWn%3DSTyKMOM2Ks47Y_UqqfU1wRc1VvA%40mail.gmail.com#7a53ad72331e423ba3c6a50e6dc1259f


Re: faster ETL / bulk data load for heap tables

From
Zhihong Yu
Date:
Hi, Luc:
Happy New Year.

Looking at BufferAllocExtend() in v1-0002-WIP-buffer-alloc-specialized-for-relation-extensi.patch, it seems there is code duplicated from the existing BufferAlloc().

It would be good if some refactoring is done by extracting common code into a helper function.

Thanks

On Fri, Jan 1, 2021 at 6:07 AM Luc Vlaming <luc@swarm64.com> wrote:
[...]

Re: faster ETL / bulk data load for heap tables

From
Amit Kapila
Date:
On Fri, Jan 1, 2021 at 7:37 PM Luc Vlaming <luc@swarm64.com> wrote:
>
> Hi,
>
> In an effort to speed up bulk data loading/transforming I noticed that
> considerable time is spent in the relation extension lock.
>

We already extend the relation in bulk when there is contention on the
relation extension lock, via RelationAddExtraBlocks. I wonder why that
is not able to address this kind of workload. On a quick look at your
patch, it seems you are always trying to extend the relation by 128
blocks for the copy operation after acquiring the lock, whereas the
current mechanism has some smarts where it decides based on the number
of waiters. Now, it is possible that we should extend by a larger
number of blocks than we do now, but using some ad-hoc number might
lead to space wastage. Have you tried to fiddle with the current scheme
of bulk extension to see if that addresses some of the gains you are
seeing? I see that you have made quite a few other changes that might
be helping here, but still, it is better to see how much of a
bottleneck the relation extension lock is and whether that can be
addressed with the current mechanism rather than changing things in a
different way.

-- 
With Regards,
Amit Kapila.



Re: faster ETL / bulk data load for heap tables

From
Luc Vlaming
Date:
On 02-01-2021 08:36, Amit Kapila wrote:
> [...]

Hi,

Thanks for looking at the patch!

Yes, I tried that. I guess I should have also shared the other things I 
tried before I ended up with these patches.
I've tried to improve RelationAddExtraBlocks: extending the mechanism 
to allocate blocks more aggressively, not putting them in the FSM 
immediately (as this also seemed like a point of contention), extending 
ReadBufferBI to allocate several pages in a loop, and a few more 
variants like this. To be sure, I just tried a few variants again where 
I set e.g. extraBlocks=128, 256, etc., and disabled e.g. the FSM code. 
None of those bring very big performance gains, or they actually make 
it slower, suggesting the code is already parameterized quite well for 
the current design.

The main problem is that the relation extension lock is taken to extend 
one block at a time, whilst doing (expensive) syscalls like pwrite(). 
Even though we then put these blocks immediately into the FSM and such, 
the bottleneck remains the extension of the file itself, no matter how 
many blocks we allocate in a loop in RelationAddExtraBlocks.

Therefore what I've set out to do is:
- hold the relation extension lock for as short a time as possible.
- reduce the time spent on syscalls as much as possible.

This resulted in a design which hands out blocks to a specific backend, 
so that everything after the file extension can be done safely without 
locks. Given that the current API did not allow this to be specified 
properly, for now I added extra functions whose sole purpose is to 
extend the relation, so that they could be purpose-built for this. I 
did not mean to suggest this is the best API there could be; the 
current state of the patch is, in that sense, still very crude. I just 
wrote the code I needed to be able to do performance testing. So it is, 
as you rightfully pointed out, quite ad hoc and not very generic, not 
the right design, has quite some code duplication, etc.

At this point I was mostly looking for feedback on the design/approach. 
If this requires a (much) cleaner and more production-ready patch then 
I can arrange that. However, I thought to first find out whether this 
approach makes sense at all before doing more testing to make it more 
generic.

Kind regards,
Luc



Re: faster ETL / bulk data load for heap tables

From
Luc Vlaming
Date:
On 01-01-2021 19:55, Zhihong Yu wrote:
> [...]

Hi,

Thanks! Happy new year to you too!

Thanks for your suggestion. I would wait a bit and first get some 
feedback on the design/approach of my patches before doing the 
refactoring. The current code is very much a WIP where I just copied 
functions to be able to make specialized variants of them aimed at bulk 
extension and then benchmark those. If the refactoring needs to be done 
before I can get feedback on the design / approach then let me know.

Kind regards,
Luc



Re: faster ETL / bulk data load for heap tables

From
Amit Kapila
Date:
On Sat, Jan 2, 2021 at 3:34 PM Luc Vlaming <luc@swarm64.com> wrote:
> [...]
> The main problem is that the relation extension lock is taken to extend
> one block at a time, whilst doing (expensive) syscalls like pwrite().
> Even though we then put these blocks immediately in the FSM and such,
> the bottleneck stays the extension of the file itself, no matter how
> many blocks we allocate in a loop in RelationAddExtraBlocks.
>
> Therefore what I've set out to do is:
> - make the relation extension lock taken as short as possible.
> - reduce the time spent on syscalls as much as possible.
>
> This resulted in a design which hands out blocks to a specific backend
> so that everything after the file extension can be done safely without
> locks.
>

Oh, so you mean to say that the benefit comes from the fact that you
have disassociated relation extension from buffer allocation, so that
you extend in multiples and then separately allocate buffers for all
those new blocks. I don't see anything wrong with that immediately, but
I think it would be better if we knew the benefit of each optimization
step. Basically, as of now, you have done multiple things to get the
benefits shown, but it is not clear which of those optimizations has
generated how much benefit.

> [...]
> I was at this point mostly looking for feedback on the design/approach.
> If this requires a (much) cleaner and more productified patch then I can
> arrange that. However I thought to first find out if this approach makes
> sense at all before doing more testing to make this more generic.
>

I think irrespective of which approach is used there is always a
better chance of getting meaningful feedback on cleaner patches.

-- 
With Regards,
Amit Kapila.