Thread: refactoring relation extension and BufferAlloc(), faster COPY

refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock.  We acquire a victim buffer; if that
buffer is dirty, we potentially flush the WAL and then write out that
buffer. Then we zero out the buffer contents and call smgrextend().

Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
   smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS


1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.


The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation extension lock, and
relation extension happens entirely within ReadBuffer* - there's no way to use
a narrower scope for the lock.


My fix for that is to add a dedicated function for extending relations, that
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.


To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffers are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().
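
In rough pseudocode, the flow looks like this (a simplified sketch, not the
patch's actual code; later revisions call this GetVictimBuffer() /
InvalidateVictimBuffer(), as discussed downthread):

static Buffer
GetVictimBuffer(BufferAccessStrategy strategy)
{
    for (;;)
    {
        uint32      buf_state;
        BufferDesc *buf_hdr;

        /* clock sweep; the returned buffer is already pinned */
        buf_hdr = StrategyGetBuffer(strategy, &buf_state);

        /* clean the buffer if necessary (may have to flush WAL first) */
        if (buf_state & BM_DIRTY)
            FlushBuffer(buf_hdr, NULL);

        /*
         * Remove the old buffer mapping entry and clear BM_TAG_VALID /
         * BM_VALID. If we lost a race, e.g. because somebody revived the
         * buffer concurrently, retry with another victim.
         */
        if (!InvalidateVictimBuffer(buf_hdr))
            continue;

        /* still pinned: other backends' clock sweeps will skip it */
        return BufferDescriptorGetBuffer(buf_hdr);
    }
}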

This is similar to Yuri's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty of gains, with no issues around the number of buffer mapping entries
that can exist etc.



Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc()
   does nearly doubles the amount of writes: first the kernel ends up writing
   out all the zeroed buffers after a while, then we write out the actual
   buffer contents.

   The best fix for that seems to be to optionally use posix_fallocate() to
   reserve space, without dirtying pages in the kernel page cache. However, it
   looks like that's only beneficial when extending by multiple pages at once,
   because it ends up causing one filesystem-journal entry for each extension
   on at least some filesystems.

   I added 'smgrzeroextend()' that can extend by multiple blocks, without the
   caller providing a buffer to write out. When extending by 8 or more blocks,
   posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
   used to extend the file.
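
   On the md.c side that looks roughly like this (a simplified sketch;
   segment handling and error paths are omitted, and the exact threshold
   and signatures may differ from the actual patch):

   /* inside mdzeroextend(), extending by numblocks blocks at blocknum */
   if (numblocks >= 8)
   {
       /* reserve space without dirtying pages in the kernel page cache */
       ret = FileFallocate(v->mdfd_vfd,
                           (off_t) blocknum * BLCKSZ,
                           (off_t) numblocks * BLCKSZ,
                           WAIT_EVENT_DATA_FILE_EXTEND);
   }
   else
   {
       /*
        * For small extensions writing zeroes is cheaper, as fallocate can
        * cause one filesystem-journal entry per call. FileZero() writes
        * zeroes via pg_pwritev_with_retry().
        */
       ret = FileZero(v->mdfd_vfd,
                      (off_t) blocknum * BLCKSZ,
                      (off_t) numblocks * BLCKSZ,
                      WAIT_EVENT_DATA_FILE_EXTEND);
   }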


b) I found that it is quite beneficial to bulk-extend the relation with
   smgrextend() even without concurrency. The reason for that is primarily
   the aforementioned dirty buffers that our current extension method causes.

   One bit that stumped me for quite a while is to know how much to extend the
   relation by. RelationGetBufferForTuple() drives the decision whether / how
   much to bulk extend purely on the contention on the extension lock, which
   obviously does not work for non-concurrent workloads.

   After quite a while I figured out that we actually have good information on
   how much to extend by, at least for COPY /
   heap_multi_insert(). heap_multi_insert() can compute how much space is
   needed to store all tuples, and pass that on to
   RelationGetBufferForTuple().

   For that to be accurate we need to recompute that number whenever we use an
   already partially filled page. That's not great, but doesn't appear to be a
   measurable overhead.
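
   Sketched out, the estimate looks about like this (close to, but not
   necessarily identical to, what the patch does):

   static int
   heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples,
                           Size saveFreeSpace)
   {
       size_t      page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
       int         npages = 1;

       for (int i = done; i < ntuples; i++)
       {
           size_t      tup_sz = sizeof(ItemIdData) +
               MAXALIGN(heaptuples[i]->t_len);

           if (page_avail < tup_sz)
           {
               /* tuple doesn't fit, start a fresh page */
               npages++;
               page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
           }
           page_avail -= tup_sz;
       }

       return npages;
   }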


c) Contention on the FSM and the pages returned by it is a serious bottleneck
   after a) and b).

   The biggest issue is that the current bulk insertion logic in hio.c enters
   all but one of the new pages into the freespacemap. That will immediately
   cause all the other backends to contend on the first few pages returned by
   the FSM, and cause contention on the FSM pages themselves.

   I've partially addressed that by using the information about the required
   number of pages from b). Whether we bulk insert or not, the pages we know
   we're going to need for one heap_multi_insert() don't need to be added to
   the FSM - we're going to use them anyway.

   I've stashed the number of free blocks in the BulkInsertState for now, but
   I'm not convinced that that's the right place.

   If I revert just this part, the "concurrent COPY into unlogged table"
   benchmark goes from ~240 tps to ~190 tps.


   Even after that change the FSM is a major bottleneck. Below I included
   benchmarks showing this by just removing the use of the FSM, but I haven't
   done anything further about it. The contention seems to be both from
   updating the FSM, as well as thundering-herd like symptoms from accessing
   the FSM.

   The update part could likely be addressed to some degree with a batch
   update operation updating the state for multiple pages.

   The access part could perhaps be addressed by adding an operation that gets
   a page and immediately marks it as fully used, so other backends won't also
   try to access it.
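
   As API sketches, those two ideas could look something like this (both
   functions are hypothetical, the names invented here just for
   illustration):

   /* batch variant of RecordPageWithFreeSpace(): update the categories of
    * a contiguous block range with one walk of the FSM tree */
   extern void RecordRangeWithFreeSpace(Relation rel, BlockNumber first_block,
                                        int nblocks, Size spaceAvail);

   /* like GetPageWithFreeSpace(), but immediately marks the returned page
    * as full, so that concurrent backends won't be handed the same page */
   extern BlockNumber GetAndReservePageWithFreeSpace(Relation rel,
                                                     Size spaceNeeded);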



d) doing
        /* new buffers are zero-filled */
        MemSet((char *) bufBlock, 0, BLCKSZ);

  under the extension lock is surprisingly expensive on my two socket
  workstation (but much less noticeable on my laptop).

  If I move the MemSet back under the extension lock, the "concurrent COPY
  into unlogged table" benchmark goes from ~240 tps to ~200 tps.


e) When running a few benchmarks for this email, I noticed that there was a
   sharp performance dropoff for the patched code for a pgbench -S -s100 on a
   database with 1GB s_b, starting between 512 and 1024 clients. This started
   with the patch only acquiring one buffer partition lock at a time. Lots of
   debugging ensued, resulting in [3].

   The problem isn't actually related to the change, the change just makes it
   more visible, because the "lock chains" between two partitions reduce the
   average length of the wait queues substantially, by distributing them
   across more partitions.  [3] has a reproducer that's entirely independent
   of this patchset.




Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend and releases the extension lock. After that
buffer[s] are locked (depending on mode and an argument indicating the number
of blocks to be locked), and TerminateBufferIO() is called.
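
In pseudocode (simplified, with names only approximating the patch):

/* acquire victim buffers without holding any contended lock */
for (i = 0; i < extend_by; i++)
    buffers[i] = GetVictimBuffer(strategy);     /* pinned, invalid */

if (!skip_extension_lock)
    LockRelationForExtension(rel, ExclusiveLock);

first_block = smgrnblocks(smgr, fork);
for (i = 0; i < extend_by; i++)
{
    /* insert buffers[i] into the buffer mapping table at block
     * first_block + i, set BM_TAG_VALID and mark IO in progress */
}
smgrzeroextend(smgr, fork, first_block, extend_by, false);

if (!skip_extension_lock)
    UnlockRelationForExtension(rel, ExclusiveLock);

for (i = 0; i < extend_by; i++)
{
    if (i < num_locked_pages)
        LockBuffer(buffers[i], BUFFER_LOCK_EXCLUSIVE);
    /* TerminateBufferIO() sets BM_VALID and wakes any IO waiters */
    TerminateBufferIO(GetBufferDescriptor(buffers[i] - 1), false, BM_VALID);
}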

This requires two new pieces of infrastructure:

First, pinning multiple buffers opens up the obvious danger that we might run
out of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows each
backend to pin a proportional share of buffers (although always allowing one,
as we do today).
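
A minimal sketch of that limit (assumptions: the exact backend count used and
the function's shape may differ in the actual patch):

static void
LimitAdditionalPins(uint32 *additional_pins)
{
    uint32      max_proportional_pins;

    if (*additional_pins <= 1)
        return;                 /* one pin is always allowed */

    max_proportional_pins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);

    if (max_proportional_pins == 0)
        max_proportional_pins = 1;

    if (*additional_pins > max_proportional_pins)
        *additional_pins = max_proportional_pins;
}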

Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added ResourceOwnerRememberBufferIO() etc. to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (in-progress IO
is now released inside ReleaseAuxProcessResources()).
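
The new calls mirror the existing buffer-pin tracking; roughly (the
signatures here are assumptions):

/* called before StartBufferIO(), so remembering the IO cannot fail later */
extern void ResourceOwnerEnlargeBufferIOs(ResourceOwner owner);
/* track / untrack an in-progress IO; on error, resowner cleanup calls
 * AbortBufferIO() for everything still remembered */
extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);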

In very extreme workloads (single backend doing a pgbench -S -s 100 against a
s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible, not sure if that's worth worrying about - we do such
allocations for the much more common pinning of buffers as well.


The new [Bulk]ExtendSharedRelationBuffered() currently have both a Relation
and a SMgrRelation argument, requiring at least one of them to be set. The
reason for that is on the one hand that LockRelationForExtension() requires a
relation and on the other hand, redo routines typically don't have a Relation
around (recovery doesn't require an extension lock).  That's not pretty, but
seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.



I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads, more below.


server "base" configuration:

max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and a bit larger files ([2],
lots of data to write out due to the ringbuffer).

To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.

s_b=24GB

test: unlogged_small_files, format: text, files: 1024, 9015MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       58.63   207     50.22   242     54.35   224
2       32.67   372     25.82   472     27.30   446
4       22.53   540     13.30   916     14.33   851
8       15.14   804     7.43    1640    7.48    1632
16      14.69   829     4.79    2544    4.50    2718
32      15.28   797     4.41    2763    3.32    3710
64      15.34   794     5.22    2334    3.06    4061
128     15.49   786     4.97    2452    3.13    3926
256     15.85   768     5.02    2427    3.26    3769
512     16.02   760     5.29    2303    3.54    3471

test: logged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       68.18   178     59.41   205     63.43   192
2       39.71   306     33.10   368     34.99   348
4       27.26   446     19.75   617     20.09   607
8       18.84   646     12.86   947     12.68   962
16      15.96   763     9.62    1266    8.51    1436
32      15.43   789     8.20    1486    7.77    1579
64      16.11   756     8.91    1367    8.90    1383
128     16.41   742     10.00   1218    9.74    1269
256     17.33   702     11.91   1023    10.89   1136
512     18.46   659     14.07   866     11.82   1049

test: unlogged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       63.27   192     56.14   217     59.25   205
2       40.17   303     29.88   407     31.50   386
4       27.57   442     16.16   754     17.18   709
8       21.26   573     11.89   1025    11.09   1099
16      21.25   573     10.68   1141    10.22   1192
32      21.00   580     10.72   1136    10.35   1178
64      20.64   590     10.15   1200    9.76    1249
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       71.89   169     65.57   217     69.09   69.09
2       47.36   257     36.22   407     38.71   38.71
4       33.10   368     21.76   754     22.78   22.78
8       26.62   457     15.89   1025    15.30   15.30
16      24.89   489     17.08   1141    15.20   15.20
32      25.15   484     17.41   1136    16.14   16.14
64      26.11   466     17.89   1200    16.76   16.76
128     skipped
256     skipped
512     skipped


Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.

test: unlogged_small_files, format: binary, files: 1024, 9508MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       34.14    357    28.04    434    29.46    413
2       22.67    537    14.42    845    14.75    826
4       16.63    732    7.62    1599    7.69    1587
8       13.48    904    4.36    2795    4.13    2959
16      14.37    848    3.78    3224    2.74    4493
32      14.79    823    4.20    2902    2.07    5974
64      14.76    825    5.03    2423    2.21    5561
128     14.95    815    4.36    2796    2.30    5343
256     15.18    802    4.31    2828    2.49    4935
512     15.41    790    4.59    2656    2.84    4327


s_b=4GB

test: unlogged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.55   194     54.22   224
2       37.11   328     28.94   421
4       25.97   469     16.42   742
8       20.01   609     11.92   1022
16      19.55   623     11.05   1102
32      19.34   630     11.27   1081
64      19.07   639     12.04   1012
128     19.22   634     12.27   993
256     19.34   630     12.28   992
512     19.60   621     11.74   1038

test: logged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.71   169     63.63   191
2       46.93   259     36.31   335
4       30.37   401     22.41   543
8       22.86   533     16.90   721
16      20.18   604     14.07   866
32      19.64   620     13.06   933
64      19.71   618     15.08   808
128     19.95   610     15.47   787
256     20.48   595     16.53   737
512     21.56   565     16.86   722

test: unlogged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.65   194     55.74   218
2       40.25   302     29.45   413
4       27.37   445     16.26   749
8       22.07   552     11.75   1037
16      21.29   572     10.64   1145
32      20.98   580     10.70   1139
64      20.65   590     10.21   1193
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.72   169     65.12   187
2       46.46   262     35.74   341
4       32.61   373     21.60   564
8       26.69   456     16.30   747
16      25.31   481     17.00   716
32      24.96   488     17.47   697
64      26.05   467     17.90   680
128     skipped
256     skipped
512     skipped


test: unlogged_small_files, format: binary, files: 1024, 9505MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       37.62   323     32.77   371
2       28.35   429     18.89   645
4       20.87   583     12.18   1000
8       19.37   629     10.38   1173
16      19.41   627     10.36   1176
32      18.62   654     11.04   1103
64      18.33   664     11.89   1024
128     18.41   661     11.91   1023
256     18.52   658     12.10   1007
512     18.78   648     11.49   1060


benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.


The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...

One would need to run pgbench for impractically long to make that effect
vanish.

My not-great solution for this was to run with -T21 -P5 and use the best 5s
as the tps.


s_b=1GB
        tps             tps
clients master          patched
1       49541           48805
2       85342           90010
4       167340          168918
8       308194          303222
16      524294          523678
32      649516          649100
64      932547          937702
128     908249          906281
256     856496          903979
512     764254          934702
1024    653886          925113
2048    569695          917262
4096    526782          903258


s_b=128MB:
        tps             tps
clients master          patched
1       40407           39854
2       73180           72252
4       143334          140860
8       240982          245331
16      429265          420810
32      544593          540127
64      706408          726678
128     713142          718087
256     611030          695582
512     552751          686290
1024    508248          666370
2048    474108          656735
4096    448582          633040


As there might be a small regression at the smallest end, I ran a more extreme
version of the above. Using a pipelined pgbench -S, with a single client, for
longer. With s_b=8MB.

To further reduce noise I pinned the server to one cpu, the client to another
and disabled turbo mode on the CPU.

master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4

Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.

My guess however is that it's the resource owner for in-progress IO that I
added - that adds an additional allocation inside the resowner machinery. I
commented those out (that's obviously incorrect!) just to see whether that
changes anything:

no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2

So it looks like it is indeed the resowner. I am a bit surprised, because we
already use that mechanism for pins, which are obviously much more frequent.

I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:

- We could preallocate an initial element inside the ResourceArray struct, so
  that a newly created resowner won't need to allocate immediately (sketched
  below)
- We could only use resowners if there's more than one IO in progress at the
  same time - but I don't like that idea much
- We could try to store the "in-progress"-ness of a buffer inside the
  'bufferpin' resowner entry - on 64-bit systems there's plenty of space for
  that. But on 32-bit systems...
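
For the first of those, the sketch would be something like this (the existing
ResourceArray from resowner.c, plus an assumed inline element):

typedef struct ResourceArray
{
    Datum      *itemsarr;       /* buffer for storing values; would initially
                                 * point at initialitem, so the first
                                 * Remember() needs no allocation */
    Datum       initialitem;    /* (new) inline storage for one item */
    Datum       invalidval;     /* value that is considered invalid */
    uint32      capacity;       /* allocated length of itemsarr[] */
    uint32      nitems;         /* how many items are stored in items array */
    uint32      maxitems;       /* current limit on nitems before enlarging */
    uint32      lastidx;        /* index of last item returned by GetAny */
} ResourceArray;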


The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.

Greetings,

Andres Freund

[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Bharath Rupireddy
Date:
On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I'm working to extract independently useful bits from my AIO work, to reduce
> the size of that patchset. This is one of those pieces.

Thanks a lot for this great work. There are 12 patches in this thread; I
believe each of these patches is trying to solve a separate problem and can
be reviewed and committed separately, am I correct?

> In workloads that extend relations a lot, we end up being extremely contended
> on the relation extension lock. We've attempted to address that to some degree
> by using batching, which helps, but only so much.

Yes, I too have observed this in the past for parallel inserts in CTAS
work - https://www.postgresql.org/message-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk%2BKD2H7ZmRg%40mail.gmail.com.
Tackling bulk relation extension problems will unblock the parallel
inserts (in CTAS, COPY) work I believe.

> The fundamental issue, in my opinion, is that we do *way* too much while
> holding the relation extension lock.  We acquire a victim buffer, if that
> buffer is dirty, we potentially flush the WAL, then write out that
> buffer. Then we zero out the buffer contents. Call smgrextend().
>
> Most of that work does not actually need to happen while holding the relation
> extension lock. As far as I can tell, the minimum that needs to be covered by
> the extension lock is the following:
>
> 1) call smgrnblocks()
> 2) insert buffer[s] into the buffer mapping table at the location returned by
>    smgrnblocks
> 3) mark buffer[s] as IO_IN_PROGRESS

Makes sense.

I will try to understand and review each patch separately.

Firstly, 0001 avoids extra loop over waiters and looks a reasonable
change, some comments on the patch:

1)
+    int            lwWaiting;        /* 0 if not waiting, 1 if on waitlist,
+                                      * 2 if waiting to be woken */
Use macros instead of hard-coded values for better readability?

#define PROC_LW_LOCK_NOT_WAITING 0
#define PROC_LW_LOCK_ON_WAITLIST 1
#define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2

2) Missing initialization of lwWaiting to 0 or the macro in twophase.c
and proc.c.
    proc->lwWaiting = false;
    MyProc->lwWaiting = false;

3)
+        proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+        found = true;
I guess 'found' is a bit meaningless here as we are doing away with
the proclist_foreach_modify loop. We can directly use
MyProc->lwWaiting == 1 and remove 'found'.

4)
            if (!MyProc->lwWaiting)
            if (!proc->lwWaiting)
Can we modify the above conditions in lwlock.c to MyProc->lwWaiting !=
1 or PROC_LW_LOCK_ON_WAITLIST or the macro?

5) Is there any specific test case that I can see benefit of this
patch? If yes, can you please share it here?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2022-10-29 18:33:53 +0530, Bharath Rupireddy wrote:
> On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > I'm working to extract independently useful bits from my AIO work, to reduce
> > the size of that patchset. This is one of those pieces.
> 
> Thanks a lot for this great work. There are 12 patches in this thread,
> I believe each of these patches is trying to solve separate problems
> and can be reviewed and get committed separately, am I correct?

Mostly, yes.

For 0001 I already started
https://www.postgresql.org/message-id/20221027165914.2hofzp4cvutj6gin%40awork3.anarazel.de
to discuss the specific issue.

We don't strictly need v1-0002-aio-Add-some-error-checking-around-pinning.patch
but I did find it useful.

v1-0012-bufmgr-debug-Add-PrintBuffer-Desc.patch is not used in the patch
series, but I found it quite useful when debugging issues with the patch. A
heck of a lot easier to interpret page flags when they can be printed.

I also think there's some architectural questions that'll influence the number
of patches. E.g. I'm not convinced
v1-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch is quite the
right spot to track which additional pages should be used. It could very well
instead be alongside ->smgr_targblock.  Possibly the best path would instead
be to return the additional pages explicitly to callers of
RelationGetBufferForTuple, but RelationGetBufferForTuple does a bunch of work
around pinning that potentially would need to be repeated in heap_multi_insert().



> > In workloads that extend relations a lot, we end up being extremely contended
> > on the relation extension lock. We've attempted to address that to some degree
> > by using batching, which helps, but only so much.
> 
> Yes, I too have observed this in the past for parallel inserts in CTAS
> work - https://www.postgresql.org/message-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk%2BKD2H7ZmRg%40mail.gmail.com.
> Tackling bulk relation extension problems will unblock the parallel
> inserts (in CTAS, COPY) work I believe.

Yea. There's a lot of places the current approach ended up being a bottleneck.


> Firstly, 0001 avoids extra loop over waiters and looks a reasonable
> change, some comments on the patch:

> 1)
> +    int            lwWaiting;        /* 0 if not waiting, 1 if on waitlist,
> +                                      * 2 if waiting to be woken */
> Use macros instead of hard-coded values for better readability?
> 
> #define PROC_LW_LOCK_NOT_WAITING 0
> #define PROC_LW_LOCK_ON_WAITLIST 1
> #define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2

Yea - this was really more of a prototype patch - I noted that we'd want to
use defines for this in
https://www.postgresql.org/message-id/20221027165914.2hofzp4cvutj6gin%40awork3.anarazel.de


> 3)
> +        proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
> +        found = true;
> I guess 'found' is a bit meaningless here as we are doing away with
> the proclist_foreach_modify loop. We can directly use
> MyProc->lwWaiting == 1 and remove 'found'.

We can rename it, but I think we still do need it; it's easier to analyze the
logic if the relevant check happens on a value read while we held the wait
list lock. Probably should do the reset inside the locked section as well.


> 4)
>             if (!MyProc->lwWaiting)
>             if (!proc->lwWaiting)
> Can we modify the above conditions in lwlock.c to MyProc->lwWaiting !=
> 1 or PROC_LW_LOCK_ON_WAITLIST or the macro?

I think it's better to check it's 0, rather than just != 1.


> 5) Is there any specific test case that I can see benefit of this
> patch? If yes, can you please share it here?

Yep, see the other thread, there's a pretty easy case there. You can also see
it at extreme client counts with a pgbench -S against a cluster with a smaller
shared_buffers. But the difference is not huge before something like 2048-4096
clients, and then it only occurs occasionally (because you need to end up with
most connections waiting for one of the partitions). So the test case from the
other thread is a lot better.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
vignesh C
Date:
On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:
>
> The patches here aren't fully polished (as will be evident). But they should
> be more than good enough to discuss whether this is a sane direction.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
f2857af485a00ab5dbfa2c83af9d83afe4378239 ===
=== applying patch
./v1-0001-wip-lwlock-fix-quadratic-behaviour-with-very-long.patch
patching file src/include/storage/proc.h
Hunk #1 FAILED at 217.
1 out of 1 hunk FAILED -- saving rejects to file src/include/storage/proc.h.rej
patching file src/backend/storage/lmgr/lwlock.c
Hunk #1 succeeded at 988 with fuzz 2 (offset 1 line).
Hunk #2 FAILED at 1047.
Hunk #3 FAILED at 1076.
Hunk #4 FAILED at 1104.
Hunk #5 FAILED at 1117.
Hunk #6 FAILED at 1141.
Hunk #7 FAILED at 1775.
Hunk #8 FAILED at 1790.
7 out of 8 hunks FAILED -- saving rejects to file
src/backend/storage/lmgr/lwlock.c.rej

[1] - http://cfbot.cputube.org/patch_41_3993.log

Regards,
Vignesh



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-01-06 11:52:04 +0530, vignesh C wrote:
> On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:
> >
> > The patches here aren't fully polished (as will be evident). But they should
> > be more than good enough to discuss whether this is a sane direction.
> 
> The patch does not apply on top of HEAD as in [1], please post a rebased
> patch.

Thanks for letting me know. Updated version attached.

Greetings,

Andres Freund

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
David Rowley
Date:
On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:
> Thanks for letting me know. Updated version attached.

I'm not too sure I'm qualified to give a meaningful design review
here, but I have started looking at the patches and so far have only made
it as far as 0006.
it as far as 0006.

I noted down the following while reading:

v2-0001:

1. BufferCheckOneLocalPin needs a header comment

v2-0002:

2. The following comment and corresponding code to release the
extension lock has been moved now.

/*
* Release the file-extension lock; it's now OK for someone else to extend
* the relation some more.
*/

I think it's worth detailing out why it's fine to release the
extension lock in the new location. You've added detail to the commit
message but I think you need to do the same in the comments too.

v2-0003

3. FileFallocate() and FileZero() should likely document what they
return, i.e zero on success and non-zero on failure.

4. I'm not quite clear on why you've modified FileGetRawDesc() to call
FileAccess() twice.

v2-0004:

5. Is it worth having two versions of PinLocalBuffer() one to adjust
the usage count and one that does not? Couldn't the version that does
not adjust the count skip doing pg_atomic_read_u32()?

v2-0005
v2-0006

David



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
I'll continue reviewing this, but here's some feedback on the first two 
patches:

v2-0001-aio-Add-some-error-checking-around-pinning.patch:

I wonder if the extra assertion in LockBufHdr() is worth the overhead. 
It won't add anything without assertions, of course, but still. No 
objections if you think it's worth it.


v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:

Looks as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK, 
which zeroes the page, and then we call PageInit to zero the page again. 
RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache 
previously, but with P_NEW, that is always true.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
> v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
This can be applied separately from the rest of the patches, which is 
nice. Some small comments on it:

* Needs a rebase, it conflicted slightly with commit f30d62c2fc.

* GetVictimBuffer needs a comment to explain what it does. In 
particular, mention that it returns a buffer that's pinned and known 
!BM_TAG_VALID.

* I suggest renaming 'cur_buf' and other such local variables in 
GetVictimBuffer to just 'buf'. 'cur' prefix suggests that there is some 
other buffer involved too, but there is no 'prev' or 'next' or 'other' 
buffer. The old code called it just 'buf' too, and before this patch it 
actually was a bit confusing because there were two buffers involved. 
But with this patch, GetVictimBuffer only deals with one buffer at a time.

* This FIXME:

>         /* OK, do the I/O */
>         /* FIXME: These used the wrong smgr before afaict? */
>         {
>             SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
>                                          InvalidBackendId);
> 
>             TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
>                                                       buf_hdr->tag.blockNum,
>                                                       smgr->smgr_rlocator.locator.spcOid,
>                                                       smgr->smgr_rlocator.locator.dbOid,
>                                                       smgr->smgr_rlocator.locator.relNumber);
> 
>             FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
>             LWLockRelease(content_lock);
> 
>             ScheduleBufferTagForWriteback(&BackendWritebackContext,
>                                           &buf_hdr->tag);
> 
>             TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
>                                                      buf_hdr->tag.blockNum,
>                                                      smgr->smgr_rlocator.locator.spcOid,
>                                                      smgr->smgr_rlocator.locator.dbOid,
>                                                      smgr->smgr_rlocator.locator.relNumber);
>         }

I believe that was intentional. The probes previously reported the block 
and relation whose read *caused* the eviction. It was not just the smgr 
but also the blockNum and forkNum that referred to the block that was 
being read. There's another pair of probe points, 
TRACE_POSTGRESQL_BUFFER_FLUSH_START/DONE, inside FlushBuffer that 
indicate the page that is being flushed.

I see that reporting the evicted page is more convenient with this 
patch, otherwise you'd need to pass the smgr and blocknum of the page 
that's being read to InvalidateVictimBuffer(). IMHO you can just remove 
these probe points. We don't need to bend over backwards to maintain 
specific probe points.

* InvalidateVictimBuffer reads the buffer header with an atomic read op, 
just to check if BM_TAG_VALID is set. If it's not, it does nothing 
(except for a few Asserts). But the caller has already read the buffer 
header. Consider refactoring it so that the caller checks BM_TAG_VALID, 
and only calls InvalidateVictimBuffer if it's set, saving one atomic 
read in InvalidateVictimBuffer. I think it would be just as readable, so 
no loss there. I doubt the atomic read makes any measurable performance 
difference, but it looks redundant.

* I don't understand this comment:

>     /*
>      * Clear out the buffer's tag and flags and usagecount.  We must do
>      * this to ensure that linear scans of the buffer array don't think
>      * the buffer is valid.
>      *
>      * XXX: This is a pre-existing comment I just moved, but isn't it
>      * entirely bogus with regard to the tag? We can't do anything with
>      * the buffer without taking BM_VALID / BM_TAG_VALID into
>      * account. Likely doesn't matter because we're already dirtying the
>      * cacheline, but still.
>      *
>      */
>     ClearBufferTag(&buf_hdr->tag);
>     buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
>     UnlockBufHdr(buf_hdr, buf_state);

What exactly is wrong with clearing the tag? What does dirtying the 
cacheline have to do with the correctness here?

* pgindent

- Heikki



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:
> > v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
> This can be applied separately from the rest of the patches, which is nice.
> Some small comments on it:

Thanks for looking at these!


> * Needs a rebase, it conflicted slightly with commit f30d62c2fc.

Will work on that.


> * GetVictimBuffer needs a comment to explain what it does. In particular,
> mention that it returns a buffer that's pinned and known !BM_TAG_VALID.

Will add.


> * I suggest renaming 'cur_buf' and other such local variables in
> GetVictimBufffer to just 'buf'. 'cur' prefix suggests that there is some
> other buffer involved too, but there is no 'prev' or 'next' or 'other'
> buffer. The old code called it just 'buf' too, and before this patch it
> actually was a bit confusing because there were two buffers involved. But
> with this patch, GetVictimBuffer only deals with one buffer at a time.

Hm. Yea. I probably ended up with these names because initially
GetVictimBuffer() wasn't a separate function, and I indeed constantly got
confused by which buffer was referenced.


> * This FIXME:
>
> >         /* OK, do the I/O */
> >         /* FIXME: These used the wrong smgr before afaict? */
> >         {
> >             SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
> >                                          InvalidBackendId);
> >
> >             TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
> >                                                       buf_hdr->tag.blockNum,
> >                                                       smgr->smgr_rlocator.locator.spcOid,
> >                                                       smgr->smgr_rlocator.locator.dbOid,
> >                                                       smgr->smgr_rlocator.locator.relNumber);
> >
> >             FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
> >             LWLockRelease(content_lock);
> >
> >             ScheduleBufferTagForWriteback(&BackendWritebackContext,
> >                                           &buf_hdr->tag);
> >
> >             TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
> >                                                      buf_hdr->tag.blockNum,
> >                                                      smgr->smgr_rlocator.locator.spcOid,
> >                                                      smgr->smgr_rlocator.locator.dbOid,
> >                                                      smgr->smgr_rlocator.locator.relNumber);
> >         }
>
> I believe that was intentional. The probes previously reported the block and
> relation whose read *caused* the eviction. It was not just the smgr but also
> the blockNum and forkNum that referred to the block that was being read.

You're probably right.  It's certainly not understandable from our docs
though:

    <row>
     <entry><literal>buffer-write-dirty-start</literal></entry>
     <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
     <entry>Probe that fires when a server process begins to write a dirty
      buffer.  (If this happens often, it implies that
      <xref linkend="guc-shared-buffers"/> is too
      small or the background writer control parameters need adjustment.)
      arg0 and arg1 contain the fork and block numbers of the page.
      arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
      identifying the relation.</entry>
    </row>


> I see that reporting the evicted page is more convenient with this patch,
> otherwise you'd need to pass the smgr and blocknum of the page that's being
> read to InvalidateVictimBuffer(). IMHO you can just remove these probe
> points. We don't need to bend over backwards to maintain specific probe
> points.

Agreed.


> * InvalidateVictimBuffer reads the buffer header with an atomic read op,
> just to check if BM_TAG_VALID is set.

It's not a real atomic op, in the sense of being a special instruction. It
does force the compiler to actually read from memory, but that's it.

But you're right, even that is unnecessary. I think it ended up that way
because I also wanted the full buf_hdr, and it seemed somewhat error prone to
pass in both.



> * I don't understand this comment:
>
> >     /*
> >      * Clear out the buffer's tag and flags and usagecount.  We must do
> >      * this to ensure that linear scans of the buffer array don't think
> >      * the buffer is valid.
> >      *
> >      * XXX: This is a pre-existing comment I just moved, but isn't it
> >      * entirely bogus with regard to the tag? We can't do anything with
> >      * the buffer without taking BM_VALID / BM_TAG_VALID into
> >      * account. Likely doesn't matter because we're already dirtying the
> >      * cacheline, but still.
> >      *
> >      */
> >     ClearBufferTag(&buf_hdr->tag);
> >     buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
> >     UnlockBufHdr(buf_hdr, buf_state);
>
> What exactly is wrong with clearing the tag? What does dirtying the
> cacheline have to do with the correctness here?

There's nothing wrong with clearing out the tag, but I don't think it's a hard
requirement today, and certainly not for the reason stated above.

Validity of the buffer isn't determined by the tag, it's determined by
BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).

Without either having pinned the buffer, or holding the buffer header
spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
knows that, and re-checks the the tag after locking the buffer header
spinlock.

Afaict, there'd be no correctness issue with removing the
ClearBufferTag(). There would be an efficiency issue though, because when
encountering an invalid buffer, we'd unnecessarily enter InvalidateBuffer(),
which'd find that BM_[TAG_]VALID isn't set, and not do anything.


Even though it's not a correctness issue, it seems to me that
DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
before doing anything further.  Particularly in DropRelationsAllBuffers(), the
check we do for each buffer isn't cheap. Doing it for buffers that don't even
have a tag seems .. not smart.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
On 2023-02-11 13:36:51 -0800, Andres Freund wrote:
> Even though it's not a correctness issue, it seems to me that
> DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
> before doing anything further.  Particularly in DropRelationsAllBuffers(), the
> check we do for each buffer isn't cheap. Doing it for buffers that don't even
> have a tag seems .. not smart.

There's a small regression for a single relation, but after that it's a clear
benefit.

32GB shared buffers, empty. The test creates N new relations and then rolls
back.

                tps             tps
num relations   HEAD            precheck
1               46.11           45.22
2               43.24           44.87
4               35.14           44.20
8               28.72           42.79

I don't understand the regression at 1, TBH.  I think it must be a random code
layout issue, because the same pre-check in DropRelationBuffers() (exercised
via TRUNCATE of a newly created relation), shows a tiny speedup.



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-10 18:38:50 +0200, Heikki Linnakangas wrote:
> I'll continue reviewing this, but here's some feedback on the first two
> patches:
> 
> v2-0001-aio-Add-some-error-checking-around-pinning.patch:
> 
> I wonder if the extra assertion in LockBufHdr() is worth the overhead. It
> won't add anything without assertions, of course, but still. No objections
> if you think it's worth it.

It's so easy to get confused about local/non-local buffers that I think it is
useful.  I think we really need to consider cleaning up the separation
further. Having half the code for local buffers in bufmgr.c and the other half
in localbuf.c, without any scheme that I can recognize, is not good.


It bothers me somewhat that ConditionalLockBufferForCleanup() silently accepts
multiple pins by the current backend. That's the right thing for
e.g. heap_page_prune_opt(), but for something like lazy_scan_heap() it's
not.  And yes, I did encounter a bug hidden by that when making vacuumlazy use
AIO as part of that patchset.  That's why I made BufferCheckOneLocalPin()
externally visible.



> v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:
> 
> Looks as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK,
> which zeroes the page, and then we call PageInit to zero the page again.
> RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache
> previously, but with P_NEW, that is always true.

It is quite silly, and it shows up noticably in profiles. The zeroing is
definitely needed in other places calling PageInit(), though. I suspect we
should have a PageInitZeroed() or such, that asserts the page is zero, but
otherwise skips it.
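
Something like this, roughly (a sketch; PageInitZeroed() doesn't exist, and
a real version might share code with PageInit() instead of copying it):

static inline void
PageInitZeroed(Page page, Size specialSize)
{
    PageHeader  p = (PageHeader) page;

    specialSize = MAXALIGN(specialSize);

    /* caller promises the page is already zero-filled */
    Assert(PageIsNew(page));

    /* same as PageInit(), minus the MemSet; pd_prune_xid is already 0 */
    p->pd_flags = 0;
    p->pd_lower = SizeOfPageHeaderData;
    p->pd_upper = BLCKSZ - specialSize;
    p->pd_special = BLCKSZ - specialSize;
    PageSetPageSizeAndVersion(page, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
}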

Seems independent enough from this series, that I'd probably tackle it
separately? If you prefer, I'm ok with adding a patch to this series instead,
though.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-01-20 13:40:55 +1300, David Rowley wrote:
> On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:
> > Thanks for letting me know. Updated version attached.
> 
> I'm not too sure I've qualified for giving a meaningful design review
> here, but I have started looking at the patches and so far only made
> it as far as 0006.

Thanks!


> I noted down the following while reading:
> 
> v2-0001:
> 
> 1. BufferCheckOneLocalPin needs a header comment
> 
> v2-0002:
> 
> 2. The following comment and corresponding code to release the
> extension lock has been moved now.
> 
> /*
> * Release the file-extension lock; it's now OK for someone else to extend
> * the relation some more.
> */
> 
> I think it's worth detailing out why it's fine to release the
> extension lock in the new location. You've added detail to the commit
> message but I think you need to do the same in the comments too.

Will do.


> v2-0003
> 
> 3. FileFallocate() and FileZero() should likely document what they
> return, i.e zero on success and non-zero on failure.

I guess I just tried to fit in with the rest of the file :)


> 4. I'm not quite clear on why you've modified FileGetRawDesc() to call
> FileAccess() twice.

I do not have the faintest idea what happened there... Will fix.


> v2-0004:
> 
> 5. Is it worth having two versions of PinLocalBuffer() one to adjust
> the usage count and one that does not? Couldn't the version that does
> not adjust the count skip doing pg_atomic_read_u32()?

I think it'd be nicer to just move the read inside the if
(adjust_usagecount). That way the rest of the function doesn't have to be
duplicated.


Thanks,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-11 14:25:06 -0800, Andres Freund wrote:
> On 2023-01-20 13:40:55 +1300, David Rowley wrote:
> > v2-0004:
> > 
> > 5. Is it worth having two versions of PinLocalBuffer() one to adjust
> > the usage count and one that does not? Couldn't the version that does
> > not adjust the count skip doing pg_atomic_read_u32()?
> 
> I think it'd be nicer to just move the read inside the if
> (adjust_usagecount). That way the rest of the function doesn't have to be
> duplicated.

Ah, no, we need it for the return value. No current users of
  PinLocalBuffer(adjust_usagecount = false)
need the return value, but that won't necessarily remain the case.

I'm somewhat inclined to not duplicate it, but if you think it's worth it,
I'll do that.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 11/02/2023 23:36, Andres Freund wrote:
> On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:
>> * I don't understand this comment:
>>
>>>     /*
>>>      * Clear out the buffer's tag and flags and usagecount.  We must do
>>>      * this to ensure that linear scans of the buffer array don't think
>>>      * the buffer is valid.
>>>      *
>>>      * XXX: This is a pre-existing comment I just moved, but isn't it
>>>      * entirely bogus with regard to the tag? We can't do anything with
>>>      * the buffer without taking BM_VALID / BM_TAG_VALID into
>>>      * account. Likely doesn't matter because we're already dirtying the
>>>      * cacheline, but still.
>>>      *
>>>      */
>>>     ClearBufferTag(&buf_hdr->tag);
>>>     buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
>>>     UnlockBufHdr(buf_hdr, buf_state);
>>
>> What exactly is wrong with clearing the tag? What does dirtying the
>> cacheline have to do with the correctness here?
> 
> There's nothing wrong with clearing out the tag, but I don't think it's a hard
> requirement today, and certainly not for the reason stated above.
> 
> Validity of the buffer isn't determined by the tag, it's determined by
> BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).
> 
> Without either having pinned the buffer, or holding the buffer header
> spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
> knows that, and re-checks the the tag after locking the buffer header
> spinlock.
> 
> Afaict, there'd be no correctness issue with removing the
> ClearBufferTag(). There would be an efficiency issue though, because when
> encountering an invalid buffer, we'd unnecessarily enter InvalidateBuffer(),
> which'd find that BM_[TAG_]VALID isn't set, and not to anything.

Okay, gotcha.

> Even though it's not a correctness issue, it seems to me that
> DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
> before doing anything further.  Particularly in DropRelationsAllBuffers(), the
> check we do for each buffer isn't cheap. Doing it for buffers that don't even
> have a tag seems .. not smart.

Depends on what percentage of buffers are valid, I guess. If all buffers 
are valid, checking BM_TAG_VALID first would lose. In practice, I doubt 
it makes any measurable difference either way.

Since we're micro-optimizing, I noticed that 
BufTagMatchesRelFileLocator() compares the fields in order "spcOid, 
dbOid, relNumber". Before commit 82ac34db20, we used 
RelFileLocatorEqual(), which has this comment:

/*
 * Note: RelFileLocatorEquals and RelFileLocatorBackendEquals compare relNumber
 * first since that is most likely to be different in two unequal
 * RelFileLocators.  It is probably redundant to compare spcOid if the other
 * fields are found equal, but do it anyway to be sure.  Likewise for checking
 * the backend ID in RelFileLocatorBackendEquals.
 */

So we lost that micro-optimization. Should we reorder the checks in 
BufTagMatchesRelFileLocator()?
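
I.e., reordered it would look something like this (sketch):

static inline bool
BufTagMatchesRelFileLocator(const BufferTag *tag,
                            const RelFileLocator *rlocator)
{
    /* compare relNumber first, it is the most likely to differ */
    return (BufTagGetRelNumber(tag) == rlocator->relNumber) &&
        (tag->dbOid == rlocator->dbOid) &&
        (tag->spcOid == rlocator->spcOid);
}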

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
> v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch

This looks straightforward. My only concern is that it changes the order 
that things happen at abort. Currently, AbortBufferIO() is called very 
early in AbortTransaction(), and this patch moves it much later. I don't 
see any immediate problems from that, but it feels scary.


> @@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
>  static void
>  AtProcExit_Buffers(int code, Datum arg)
>  {
> -    AbortBufferIO();
>      UnlockBuffers();
>  
>      CheckForBufferLeaks();

Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on 
elog(FATAL)? Do we need to worry about that?

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
> v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch

> +static BlockNumber
> +BulkExtendSharedRelationBuffered(Relation rel,
> +                                 SMgrRelation smgr,
> +                                 bool skip_extension_lock,
> +                                 char relpersistence,
> +                                 ForkNumber fork, ReadBufferMode mode,
> +                                 BufferAccessStrategy strategy,
> +                                 uint32 *num_pages,
> +                                 uint32 num_locked_pages,
> +                                 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I 
don't have any concrete suggestions, but could we simplify this somehow? 
Needs a comment at least.

> v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> index de1427a1e0e..1810f7ebfef 100644
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
>       * whole relation will be rolled back.
>       */
>  
> -    meta = ReadBuffer(index, P_NEW);
> +    meta = ExtendRelationBuffered(index, NULL, true,
> +                                  index->rd_rel->relpersistence,
> +                                  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> +                                  NULL);
>      Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
> -    LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
>  
>      brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
>                         BRIN_CURRENT_VERSION);

Since we're changing the API anyway, how about introducing a new 
function for this case where we extend the relation but we know what 
block number we're going to get? This pattern of using P_NEW and 
asserting the result has always felt awkward to me.

> -        buf = ReadBuffer(irel, P_NEW);
> +        buf = ExtendRelationBuffered(irel, NULL, false,
> +                                     irel->rd_rel->relpersistence,
> +                                     MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> +                                     NULL);

These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW). 
I'd suggest something like:

buf = ExtendBuffer(rel);

Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with 
ExtendRelationBuffered?

Is it ever possible to call this without a relcache entry? WAL redo 
functions do that with ReadBuffer, but they only extend a relation 
implicitly, by replay a record for a particular block.

All of the above comments are around the BulkExtendRelationBuffered() 
function's API. That needs a closer look and a more thought-out design 
to make it nice. Aside from that, this approach seems valid.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Alvaro Herrera
Date:
On 2023-Feb-21, Heikki Linnakangas wrote:

> > +static BlockNumber
> > +BulkExtendSharedRelationBuffered(Relation rel,
> > +                                 SMgrRelation smgr,
> > +                                 bool skip_extension_lock,
> > +                                 char relpersistence,
> > +                                 ForkNumber fork, ReadBufferMode mode,
> > +                                 BufferAccessStrategy strategy,
> > +                                 uint32 *num_pages,
> > +                                 uint32 num_locked_pages,
> > +                                 Buffer *buffers)
> 
> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> don't have any concrete suggestions, but could we simplify this somehow?
> Needs a comment at least.

Yeah, I noticed this too.  I think it would be easy enough to add a new
struct that can be passed as a pointer, which can be stack-allocated
by the caller, and which holds the input arguments that are common to
both functions, as is sensible.
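
Something along these lines, perhaps (hypothetical; the field list is just
for illustration):

typedef struct ExtendRelationArgs
{
    Relation    rel;            /* either of rel and smgr may be NULL, */
    SMgrRelation smgr;          /* but not both */
    char        relpersistence;
    ForkNumber  fork;
    BufferAccessStrategy strategy;
    bool        skip_extension_lock;
} ExtendRelationArgs;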

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Update: super-fast reaction on the Postgres bugs mailing list. The report
was acknowledged [...], and a fix is under discussion.
The wonders of open-source !"
             https://twitter.com/gunnarmorling/status/1596080409259003906



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 17:40:31 +0200, Heikki Linnakangas wrote:
> > v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
> 
> This looks straightforward. My only concern is that it changes the order
> that things happen at abort. Currently, AbortBufferIO() is called very early
> in AbortTransaction(), and this patch moves it much later. I don't see any
> immediate problems from that, but it feels scary.

Yea, it does feel a bit awkward. But I suspect it's actually the right
thing. We've not even adjusted the transaction state at the point we're
calling AbortBufferIO(). And AbortBufferIO() will sometimes allocate memory
for a WARNING, which conceivably could fail - although I don't think that's a
particularly realistic scenario due to TransactionAbortContext (I guess you
could have a large error context stack or such).


Medium term I think we need to move a lot more of the error handling into
resowners. Having a dozen+ places with their own choreographed sigsetjmp()
recovery blocks is error prone as hell. Not to mention tedious.


> > @@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
> >  static void
> >  AtProcExit_Buffers(int code, Datum arg)
> >  {
> > -    AbortBufferIO();
> >      UnlockBuffers();
> >      CheckForBufferLeaks();
> 
> Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on
> elog(FATAL)? Do we need to worry about that?

We have before_shmem_exit() callbacks that should protect against
that. InitPostgres() registers ShutdownPostgres(), and
CreateAuxProcessResourceOwner() registers
ReleaseAuxProcessResourcesCallback().


I think we'd already be in trouble if we didn't reliably end up doing resowner
cleanup during process exit.

Perhaps ResourceOwnerCreate()/ResourceOwnerDelete() should maintain a list of
"active" resource owners and have a before-exit callback that ensures the list
is empty and PANICs if not?  Better a crash restart than hanging because we
didn't release some shared resource.
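
Roughly along these lines - just a sketch, nothing like this exists today, and
ResourceOwnerCreate()/ResourceOwnerDelete() would need the corresponding
dlist_push_tail()/dlist_delete() calls:

/* sketch only; would live in resowner.c, using lib/ilist.h */
static dlist_head active_resowners = DLIST_STATIC_INIT(active_resowners);

/* registered once via before_shmem_exit() */
static void
AtProcExit_CheckResourceOwners(int code, Datum arg)
{
    /* better a crash restart than hanging on a leaked shared resource */
    if (!dlist_is_empty(&active_resowners))
        elog(PANIC, "resource owner(s) still active at process exit");
}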

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch
> 
> > +static BlockNumber
> > +BulkExtendSharedRelationBuffered(Relation rel,
> > +                                 SMgrRelation smgr,
> > +                                 bool skip_extension_lock,
> > +                                 char relpersistence,
> > +                                 ForkNumber fork, ReadBufferMode mode,
> > +                                 BufferAccessStrategy strategy,
> > +                                 uint32 *num_pages,
> > +                                 uint32 num_locked_pages,
> > +                                 Buffer *buffers)
> 
> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> don't have any concrete suggestions, but could we simplify this somehow?
> Needs a comment at least.

Yea. I think this is the part of the patchset I like the least.

The ugliest bit is accepting both rel and smgr. The background to that is that
we need the relation oid to acquire the extension lock. But during crash
recovery we don't have that - which is fine, because we don't need the
extension lock.

We could have two different functions, but that ends up a mess as well, as
we've seen in other cases.


> > v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch
> 
> > diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> > index de1427a1e0e..1810f7ebfef 100644
> > --- a/src/backend/access/brin/brin.c
> > +++ b/src/backend/access/brin/brin.c
> > @@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
> >       * whole relation will be rolled back.
> >       */
> > -    meta = ReadBuffer(index, P_NEW);
> > +    meta = ExtendRelationBuffered(index, NULL, true,
> > +                                  index->rd_rel->relpersistence,
> > +                                  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> > +                                  NULL);
> >      Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
> > -    LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
> >      brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
> >                         BRIN_CURRENT_VERSION);
> 
> Since we're changing the API anyway, how about introducing a new function
> for this case where we extend the relation but we know what block number we're
> going to get? This pattern of using P_NEW and asserting the result has
> always felt awkward to me.

To me it always felt like a code smell that some code insists on getting
specific block numbers with P_NEW. I guess it's ok for things like
building a new index, but outside of that it feels wrong.

The first case I found just now is revmap_physical_extend(). Which seems to
extend the relation while holding an lwlock. Ugh.

Maybe ExtendRelationBufferedTo() or something like that? With a big comment
saying that users of it are likely bad ;)


> > -        buf = ReadBuffer(irel, P_NEW);
> > +        buf = ExtendRelationBuffered(irel, NULL, false,
> > +                                     irel->rd_rel->relpersistence,
> > +                                     MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> > +                                     NULL);
> 
> These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW). I'd
> suggest something like:

I guess. Not sure if it's worth optimizing for brevity all that much here -
there's not that many places extending relations. Several places end up with
less code, actually, because they don't need to care about the extension lock
themselves anymore. I think an ExtendBuffer() that doesn't mention the fork,
etc, ends up being more confusing than helpful.


> buf = ExtendBuffer(rel);

Without the relation in the name it just seems confusing to me - the extension
isn't "restricted" to shared_buffers. ReadBuffer() isn't great as a name
either, but it makes a bit more sense at least: it reads into a buffer. And
it's a vastly more frequent operation, so optimizing for density is worth it.


> Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
> ExtendRelationBuffered?

Hm. That's a good point. Probably not. Perhaps it could be useful to support
RBM_NORMAL as well? But even if we did, it'd just be a lock release away if we always
used RBM_ZERO_AND_LOCK.


> Is it ever possible to call this without a relcache entry? WAL redo
> functions do that with ReadBuffer, but they only extend a relation
> implicitly, by replaying a record for a particular block.

I think we should use it for crash recovery as well, but the patch doesn't
yet. We have some gnarly code there, see the loop using P_NEW in
XLogReadBufferExtended(). Extending the file one-by-one is a lot more
expensive than doing it in bulk.
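
From memory, the gnarly bit looks roughly like this (paraphrased, not the
verbatim source):

/* paraphrase of the P_NEW loop in XLogReadBufferExtended() */
buffer = InvalidBuffer;
do
{
    if (buffer != InvalidBuffer)
    {
        if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buffer);
    }
    /* extends the fork by exactly one block per iteration */
    buffer = ReadBufferWithoutRelcache(rlocator, forknum, P_NEW,
                                       mode, NULL, true);
} while (BufferGetBlockNumber(buffer) < blkno);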


> All of the above comments are around the BulkExtendRelationBuffered()
> function's API. That needs a closer look and a more thought-out design to
> make it nice. Aside from that, this approach seems valid.

Thanks for looking!  I agree that it can stand a fair bit of polishing...

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Jim Nasby
Date:
On 10/28/22 9:54 PM, Andres Freund wrote:
> b) I found that it is quite beneficial to bulk-extend the relation with
>     smgrextend() even without concurrency. The reason for that is primarily
>     the aforementioned dirty buffers that our current extension method causes.
>
>     One bit that stumped me for quite a while is to know how much to extend the
>     relation by. RelationGetBufferForTuple() drives the decision whether / how
>     much to bulk extend purely on the contention on the extension lock, which
>     obviously does not work for non-concurrent workloads.
>
>     After quite a while I figured out that we actually have good information on
>     how much to extend by, at least for COPY /
>     heap_multi_insert(). heap_multi_insert() can compute how much space is
>     needed to store all tuples, and pass that on to
>     RelationGetBufferForTuple().
>
>     For that to be accurate we need to recompute that number whenever we use an
>     already partially filled page. That's not great, but doesn't appear to be a
>     measurable overhead.
Some food for thought: I think it's also completely fine to extend any 
relation over a certain size by multiple blocks, regardless of 
concurrency. E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't 
have a good feel for what algorithm would make sense here; maybe 
something along the lines of extend = max(relpages / 2048, 128); if 
extend < 8 extend = 1; (presumably extending by just a couple extra 
pages doesn't help much without concurrency).
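
Spelled out as code, purely as a sketch - and reading the max() as a cap,
i.e. min(), since otherwise the < 8 branch could never be taken:

static int
suggested_extend_by(BlockNumber relpages)
{
    int         extend = Min(relpages / 2048, 128);

    /* extending by just a couple extra pages doesn't buy much */
    if (extend < 8)
        extend = 1;

    return extend;
}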



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:
> On 10/28/22 9:54 PM, Andres Freund wrote:
> > b) I found that it is quite beneficial to bulk-extend the relation with
> >     smgrextend() even without concurrency. The reason for that is primarily
> >     the aforementioned dirty buffers that our current extension method causes.
> > 
> >     One bit that stumped me for quite a while is to know how much to extend the
> >     relation by. RelationGetBufferForTuple() drives the decision whether / how
> >     much to bulk extend purely on the contention on the extension lock, which
> >     obviously does not work for non-concurrent workloads.
> > 
> >     After quite a while I figured out that we actually have good information on
> >     how much to extend by, at least for COPY /
> >     heap_multi_insert(). heap_multi_insert() can compute how much space is
> >     needed to store all tuples, and pass that on to
> >     RelationGetBufferForTuple().
> > 
> >     For that to be accurate we need to recompute that number whenever we use an
> >     already partially filled page. That's not great, but doesn't appear to be a
> >     measurable overhead.
> Some food for thought: I think it's also completely fine to extend any
> relation over a certain size by multiple blocks, regardless of concurrency.
> E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
> for what algorithm would make sense here; maybe something along the lines of
> extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
> extending by just a couple extra pages doesn't help much without
> concurrency).

I previously implemented just that. It's not easy to get right. You can easily
end up with several backends each extending the relation by quite a bit, at
the same time (or you re-introduce contention), which can end up with the
relation being larger by a bunch if data loading stops at some point.

We might want that as well at some point, but the approach implemented in the
patchset is precise and thus always a win, so it should be the baseline.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Jim Nasby
Date:
On 2/21/23 3:12 PM, Andres Freund wrote:
> Hi,
>
> On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:
>> Some food for thought: I think it's also completely fine to extend any
>> relation over a certain size by multiple blocks, regardless of concurrency.
>> E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
>> for what algorithm would make sense here; maybe something along the lines of
>> extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
>> extending by just a couple extra pages doesn't help much without
>> concurrency).
> I previously implemented just that. It's not easy to get right. You can easily
> end up with several backends each extending the relation by quite a bit, at
> the same time (or you re-introduce contention), which can end up with the
> relation being larger by a bunch if data loading stops at some point.
>
> We might want that as well at some point, but the approach implemented in the
> patchset is precise and thus always a win, so it should be the baseline.
Yeah, what I was suggesting would only make sense when there *wasn't* 
contention.



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 21/02/2023 21:22, Andres Freund wrote:
> On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
>> Is it ever possible to call this without a relcache entry? WAL redo
>> functions do that with ReadBuffer, but they only extend a relation
>> implicitly, by replaying a record for a particular block.
> 
> I think we should use it for crash recovery as well, but the patch doesn't
> yet. We have some gnarly code there, see the loop using P_NEW in
> XLogReadBufferExtended(). Extending the file one-by-one is a lot more
> expensive than doing it in bulk.

Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the 
gap, and then call ExtendRelationBuffered for the target page. Or the 
new ExtendRelationBufferedTo() function that you mentioned.

In the common case that you load a lot of data into a relation, extending
it, and then crash, the WAL replay would still extend the relation one
page at a time, which is inefficient. Changing that would need bigger 
changes, to WAL-log the relation extension as a separate WAL record, for 
example. I don't think we need to solve that right now, it can be 
addressed separately later.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-22 11:18:57 +0200, Heikki Linnakangas wrote:
> On 21/02/2023 21:22, Andres Freund wrote:
> > On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > > Is it ever possible to call this without a relcache entry? WAL redo
> > > functions do that with ReadBuffer, but they only extend a relation
> > > implicitly, by replaying a record for a particular block.
> > 
> > I think we should use it for crash recovery as well, but the patch doesn't
> > yet. We have some gnarly code there, see the loop using P_NEW in
> > XLogReadBufferExtended(). Extending the file one-by-one is a lot more
> > expensive than doing it in bulk.
> 
> Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the gap,
> and then call ExtendRelationBuffered for the target page. Or the new
> ExtendRelationBufferedTo() function that you mentioned.

I don't think it's safe to just use smgrzeroextend(). Without the page-level
interlock from the buffer entry, a concurrent reader can read/write the
extended portion of the relation while we're extending. That can lead to
losing writes.

It also turns out that just doing smgrzeroextend(), without filling s_b, is
often bad for performance, because it may cause reads when trying to fill the
buffers. Although hopefully that's less of an issue during WAL replay, due to
REGBUF_WILL_INIT.


> In the common case that you load a lot of data into a relation, extending it,
> and then crash, the WAL replay would still extend the relation one page at a
> time, which is inefficient. Changing that would need bigger changes, to
> WAL-log the relation extension as a separate WAL record, for example. I
> don't think we need to solve that right now, it can be addressed separately
> later.

Yea, that seems indeed something for later.

There's several things we could do without adding WAL logging of relation
extension themselves.

One relatively easy thing would be to add information about the number of
blocks we're extending by to XLOG_HEAP2_MULTI_INSERT records. Compared to the
insertions themselves that'd barely be noticeable.
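
As a sketch, that could be as little as one extra field in the record - to be
clear, nothing like this exists in the patchset or in the current WAL format,
it's purely hypothetical:

/* hypothetical; the real xl_heap_multi_insert record has no such field */
typedef struct xl_heap_multi_insert_sketch
{
    uint8       flags;
    uint16      ntuples;
    uint32      nblocks_extended;   /* blocks the inserter extended by */
} xl_heap_multi_insert_sketch;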

A slightly more complicated thing would be to peek ahead in the WAL (we have
infrastructure for that now) and extend by enough for the next few relation
extensions.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 11:22:26 -0800, Andres Freund wrote:
> On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
> > ExtendRelationBuffered?
> 
> Hm. That's a good point. Probably not. Perhaps it could be useful to support
> RBM_NORMAL as well? But even if we did, it'd just be a lock release away if we always
> used RBM_ZERO_AND_LOCK.

There's a fair number of callers using RBM_NORMAL, via ReadBuffer[Extended]()
right now. While some of them are trivial to convert, others aren't (e.g.,
brin_getinsertbuffer()).  So I'm inclined to continue allowing RBM_NORMAL.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 21/02/2023 18:33, Alvaro Herrera wrote:
> On 2023-Feb-21, Heikki Linnakangas wrote:
> 
>>> +static BlockNumber
>>> +BulkExtendSharedRelationBuffered(Relation rel,
>>> +                                 SMgrRelation smgr,
>>> +                                 bool skip_extension_lock,
>>> +                                 char relpersistence,
>>> +                                 ForkNumber fork, ReadBufferMode mode,
>>> +                                 BufferAccessStrategy strategy,
>>> +                                 uint32 *num_pages,
>>> +                                 uint32 num_locked_pages,
>>> +                                 Buffer *buffers)
>>
>> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
>> don't have any concrete suggestions, but could we simplify this somehow?
>> Needs a comment at least.
> 
> Yeah, I noticed this too.  I think it would be easy enough to add a new
> struct that can be passed as a pointer, which can be stack-allocated
> by the caller, and which holds the input arguments that are common to
> both functions, as is sensible.

We also do this in freespace.c and visibilitymap.c:

     /* Extend as needed. */
     while (fsm_nblocks_now < fsm_nblocks)
     {
         PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);

         smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
                    pg.data, false);
         fsm_nblocks_now++;
      }

We could use the new smgrzeroextend function here. But it would be 
better to go through the buffer cache, because after this, the last 
block, at 'fsm_nblocks', will be read with ReadBuffer() and modified.

We could use BulkExtendSharedRelationBuffered() to extend the relation 
and keep the last page locked, but the 
BulkExtendSharedRelationBuffered() signature doesn't allow that. It can 
return the first N pages locked, but there's no way to return the *last* 
page locked.

Perhaps we should decompose this function into several function calls. 
Something like:

/* get N victim buffers, pinned and !BM_VALID */
buffers = BeginExtendRelation(npages);

LockRelationForExtension(rel);

/* Insert buffers into buffer table */
first_blk = smgrnblocks();
for (i = 0, blk = first_blk; blk < last_blk; blk++, i++)
     MapNewBuffer(blk, buffers[i]);

/* extend the file on disk */
smgrzeroextend();

UnlockRelationForExtension(rel);

for (i = 0, blk = first_blk; blk < last_blk; blk++, i++)
{
     memset(BufferGetPage(buffers[i]), 0, BLCKSZ);
     FinishNewBuffer(buffers[i]);
     /* optionally lock the buffer */
     LockBuffer(buffers[i], BUFFER_LOCK_EXCLUSIVE);
}

That's a lot more verbose, of course, but gives the callers the 
flexibility. And might even be more readable than one function call with 
lots of arguments.

This would expose the concept of a buffer that's mapped but marked as 
IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing 
details that shouldn't be exposed. On the other hand, it might come 
handy. Instead of RBM_ZERO_AND_LOCK mode, for example, it might be handy 
to have a function that returns an IO-in-progress buffer that you can 
initialize any way you want.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote:
> We also do this in freespace.c and visibilitymap.c:
> 
>     /* Extend as needed. */
>     while (fsm_nblocks_now < fsm_nblocks)
>     {
>         PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
> 
>         smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
>                    pg.data, false);
>         fsm_nblocks_now++;
>      }
> 
> We could use the new smgrzeroextend function here. But it would be better to
> go through the buffer cache, because after this, the last block, at
> 'fsm_nblocks', will be read with ReadBuffer() and modified.

I doubt it's a particularly crucial thing to optimize.

But, uh, isn't this code racy? Because this doesn't go through shared buffers,
there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
that writing pages isn't atomic vs readers. So another connection could
see the new relation size, but a read might return a
partially written state of the page. Which then would cause checksum
failures. And even worse, I think it could lead to losing a write, if the
concurrent connection writes out a page.


> We could use BulkExtendSharedRelationBuffered() to extend the relation and
> keep the last page locked, but the BulkExtendSharedRelationBuffered()
> signature doesn't allow that. It can return the first N pages locked, but
> there's no way to return the *last* page locked.

We can't rely on bulk extending a potentially large number of pages in one
go anyway (since we might not be allowed to pin that many pages). So I don't
think requiring the last page to be locked is really a viable API.

I think for this case I'd just use the ExtendRelationTo() API we were
discussing nearby. Compared to the gain from reducing syscall / filesystem
overhead when extending the relation, the cost of the buffer mapping lookup doesn't
seem significant. That's different in e.g. the hio.c case, because there we
need a buffer with free space, and concurrent activity could otherwise fill up
the buffer before we can lock it again.


I had started hacking on ExtendRelationTo(), but then I saw problems with the
existing code that made me hesitate:
https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de


> Perhaps we should decompose this function into several function calls.
> Something like:
> 
> /* get N victim buffers, pinned and !BM_VALID */
> buffers = BeginExtendRelation(npages);
> 
> LockRelationForExtension(rel);
> 
> /* Insert buffers into buffer table */
> first_blk = smgrnblocks();
> for (i = 0, blk = first_blk; blk < last_blk; blk++, i++)
>     MapNewBuffer(blk, buffers[i]);
> 
> /* extend the file on disk */
> smgrzeroextend();
> 
> UnlockRelationForExtension(rel);
> 
> for (i = 0, blk = first_blk; blk < last_blk; blk++, i++)
> {
>     memset(BufferGetPage(buffers[i]), 0, BLCKSZ);
>     FinishNewBuffer(buffers[i]);
>     /* optionally lock the buffer */
>     LockBuffer(buffers[i], BUFFER_LOCK_EXCLUSIVE);
> }
> 
> That's a lot more verbose, of course, but gives the callers the flexibility.
> And might even be more readable than one function call with lots of
> arguments.

To me this seems like a quite bad idea. The amount of complexity this would
expose all over the tree is substantial, which would also make it harder to
further improve relation extension at a later date. It certainly shouldn't be
the default interface. And I'm not sure I see a promising use case.


> This would expose the concept of a buffer that's mapped but marked as
> IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing details
> that shouldn't be exposed. On the other hand, it might come handy. Instead
> of RBM_ZERO_AND_LOCK mode, for example, it might be handy to have a function
> that returns an IO-in-progress buffer that you can initialize any way you
> want.

I'd much rather encapsulate that in additional functions, or perhaps a
callback that can make decisions about what to do.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote:
> On 2023-Feb-21, Heikki Linnakangas wrote:
> 
> > > +static BlockNumber
> > > +BulkExtendSharedRelationBuffered(Relation rel,
> > > +                                 SMgrRelation smgr,
> > > +                                 bool skip_extension_lock,
> > > +                                 char relpersistence,
> > > +                                 ForkNumber fork, ReadBufferMode mode,
> > > +                                 BufferAccessStrategy strategy,
> > > +                                 uint32 *num_pages,
> > > +                                 uint32 num_locked_pages,
> > > +                                 Buffer *buffers)
> > 
> > Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> > don't have any concrete suggestions, but could we simplify this somehow?
> > Needs a comment at least.
> 
> Yeah, I noticed this too.  I think it would be easy enough to add a new
> struct that can be passed as a pointer, which can be stack-allocated
> by the caller, and which holds the input arguments that are common to
> both functions, as is sensible.

I played a fair bit with various options. I ended up not using a struct to
pass most options, but instead went for a flags argument. However, I did use a
struct for passing either relation or smgr.


typedef enum ExtendBufferedFlags {
    EB_SKIP_EXTENSION_LOCK = (1 << 0),
    EB_IN_RECOVERY = (1 << 1),
    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
    EB_LOCK_FIRST = (1 << 3),

    /* internal flags follow */
    EB_RELEASE_PINS = (1 << 4),
} ExtendBufferedFlags;

/*
 * To identify the relation - either relation or smgr + relpersistence has to
 * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
 * to use the same function for both crash recovery and normal operation.
 */
typedef struct ExtendBufferedWhat
{
    Relation rel;
    struct SMgrRelationData *smgr;
    char relpersistence;
} ExtendBufferedWhat;

#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
/* requires use of EB_SKIP_EXTENSION_LOCK */
#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})


extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
                                ForkNumber forkNum,
                                BufferAccessStrategy strategy,
                                uint32 flags);
extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
                                       ForkNumber fork,
                                       BufferAccessStrategy strategy,
                                       uint32 flags,
                                       uint32 extend_by,
                                       Buffer *buffers,
                                       uint32 *extended_by);
extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
                                  ForkNumber fork,
                                  BufferAccessStrategy strategy,
                                  uint32 flags,
                                  BlockNumber extend_to,
                                  ReadBufferMode mode);

As you can see I removed ReadBufferMode from most of the functions (as
suggested by Heikki earlier). When extending by 1/multiple pages, we only need
to know whether to lock or not.

The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
falling back to reading the page normally if there was a concurrent relation
extension.

The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated,
gnarly code to do so from vm_extend() and fsm_extend().
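
To give a feel for the intended use, a hypothetical caller wanting up to 10
new blocks would look something like this (a sketch, not code from the
patchset):

Buffer      buffers[10];
uint32      extended_by = 0;
BlockNumber first_block;

/* returns the block number of the first newly added block */
first_block = ExtendBufferedRelBy(EB_REL(rel), MAIN_FORKNUM,
                                  NULL /* no strategy */,
                                  EB_LOCK_FIRST,
                                  lengthof(buffers),
                                  buffers, &extended_by);
/* with EB_LOCK_FIRST, buffers[0] is returned exclusively locked */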


I'm not sure about the function naming pattern. I do like 'By' a lot more than
the Bulk prefix I used before.


What do you think?


Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 27/02/2023 23:45, Andres Freund wrote:
> On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote:
>> We also do this in freespace.c and visibilitymap.c:
>>
>>      /* Extend as needed. */
>>      while (fsm_nblocks_now < fsm_nblocks)
>>      {
>>          PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
>>
>>          smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
>>                     pg.data, false);
>>          fsm_nblocks_now++;
>>       }
>>
>> We could use the new smgrzeroextend function here. But it would be better to
>> go through the buffer cache, because after this, the last block, at
>> 'fsm_nblocks', will be read with ReadBuffer() and modified.
> 
> I doubt it's a particularly crucial thing to optimize.

Yeah, it won't make any practical difference to performance. I'm more 
thinking about whether we can make this more consistent with other places 
where we extend a relation.

> But, uh, isn't this code racy? Because this doesn't go through shared buffers,
> there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
> that writing pages isn't atomic vs readers. So another connection could
> see the new relation size, but a read might return a
> partially written state of the page. Which then would cause checksum
> failures. And even worse, I think it could lead to losing a write, if the
> concurrent connection writes out a page.

fsm_readbuf and vm_readbuf check the relation size first, with 
smgrnblocks(), before trying to read the page. So to have a problem, the 
smgrnblocks() would have to already return the new size, but the 
smgrread() would not return the new contents. I don't think that's 
possible, but not sure.

>> We could use BulkExtendSharedRelationBuffered() to extend the relation and
>> keep the last page locked, but the BulkExtendSharedRelationBuffered()
>> signature doesn't allow that. It can return the first N pages locked, but
>> there's no way to return the *last* page locked.
> 
> We can't rely on bulk extending a potentially large number of pages in one
> go anyway (since we might not be allowed to pin that many pages). So I don't
> think requiring the last page to be locked is really a viable API.
> 
> I think for this case I'd just use the ExtendRelationTo() API we were
> discussing nearby. Compared to the gain from reducing syscall / filesystem
> overhead when extending the relation, the cost of the buffer mapping lookup doesn't
> seem significant. That's different in e.g. the hio.c case, because there we
> need a buffer with free space, and concurrent activity could otherwise fill up
> the buffer before we can lock it again.

Works for me.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-01 11:12:35 +0200, Heikki Linnakangas wrote:
> On 27/02/2023 23:45, Andres Freund wrote:
> > But, uh, isn't this code racy? Because this doesn't go through shared buffers,
> > there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
> > that writing pages isn't atomic vs readers. So another connection could
> > see the new relation size, but a read might return a
> > partially written state of the page. Which then would cause checksum
> > failures. And even worse, I think it could lead to losing a write, if the
> > concurrent connection writes out a page.
> 
> fsm_readbuf and vm_readbuf check the relation size first, with
> smgrnblocks(), before trying to read the page. So to have a problem, the
> smgrnblocks() would have to already return the new size, but the smgrread()
> would not return the new contents. I don't think that's possible, but not
> sure.

I hacked Thomas' torn-read test program to ftruncate the file on the write
side.

It frequently observes a file size that's not the write size (e.g. reading 4k
when writing an 8k block).

After extending the test to more than one reader, I indeed also see torn
reads. So far all the tears have been at a 4k block boundary. However so far
it always has been *prior* page contents, not 0s.
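
For reference, a minimal sketch of this kind of torn-read check - this is not
Thomas' program (my adapted version is attached downthread), and it leaves out
the ftruncate and multi-reader parts:

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCKSZ 8192

int
main(void)
{
    char        buf[BLOCKSZ];
    int         fd = open("tornread.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);
    pid_t       writer;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    writer = fork();
    if (writer == 0)
    {
        /* writer: overwrite one block with a uniform, changing byte */
        for (unsigned int n = 0;; n++)
        {
            memset(buf, (int) (n % 251), BLOCKSZ);
            if (pwrite(fd, buf, BLOCKSZ, 0) != BLOCKSZ)
                _exit(1);
        }
    }

    /* reader: a block whose bytes differ from each other was read torn */
    for (long iter = 0; iter < 10L * 1000 * 1000; iter++)
    {
        ssize_t     nread = pread(fd, buf, BLOCKSZ, 0);

        for (ssize_t i = 1; i < nread; i++)
        {
            if (buf[i] != buf[0])
            {
                printf("torn read after %zd bytes (0x%02x vs 0x%02x)\n",
                       i, (unsigned char) buf[0], (unsigned char) buf[i]);
                break;
            }
        }
    }

    kill(writer, SIGKILL);
    return 0;
}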

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-01 09:02:00 -0800, Andres Freund wrote:
> On 2023-03-01 11:12:35 +0200, Heikki Linnakangas wrote:
> > On 27/02/2023 23:45, Andres Freund wrote:
> > > But, uh, isn't this code racy? Because this doesn't go through shared buffers,
> > > there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
> > > that writing pages isn't atomic vs readers. So another connection could
> > > see the new relation size, but a read might return a
> > > partially written state of the page. Which then would cause checksum
> > > failures. And even worse, I think it could lead to losing a write, if the
> > > concurrent connection writes out a page.
> > 
> > fsm_readbuf and vm_readbuf check the relation size first, with
> > smgrnblocks(), before trying to read the page. So to have a problem, the
> > smgrnblocks() would have to already return the new size, but the smgrread()
> > would not return the new contents. I don't think that's possible, but not
> > sure.
> 
> I hacked Thomas' torn-read test program to ftruncate the file on the write
> side.
> 
> It frequently observes a file size that's not the write size (e.g. reading 4k
> when writing an 8k block).
> 
> After extending the test to more than one reader, I indeed also see torn
> reads. So far all the tears have been at a 4k block boundary. However so far
> it always has been *prior* page contents, not 0s.

On tmpfs the failure rate is much higher, and we also end up reading 0s,
despite never writing them.

I've attached my version of the test program.

ext4: lots of 4k reads with 8k writes, some torn reads at 4k boundaries
xfs: no issues
tmpfs: loads of 4k reads with 8k writes, lots of torn reads reading 0s, some torn reads at 4k boundaries


Greetings,

Andres Freund

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 01/03/2023 09:33, Andres Freund wrote:
> On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote:
>> On 2023-Feb-21, Heikki Linnakangas wrote:
>>
>>>> +static BlockNumber
>>>> +BulkExtendSharedRelationBuffered(Relation rel,
>>>> +                                 SMgrRelation smgr,
>>>> +                                 bool skip_extension_lock,
>>>> +                                 char relpersistence,
>>>> +                                 ForkNumber fork, ReadBufferMode mode,
>>>> +                                 BufferAccessStrategy strategy,
>>>> +                                 uint32 *num_pages,
>>>> +                                 uint32 num_locked_pages,
>>>> +                                 Buffer *buffers)
>>>
>>> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
>>> don't have any concrete suggestions, but could we simplify this somehow?
>>> Needs a comment at least.
>>
>> Yeah, I noticed this too.  I think it would be easy enough to add a new
>> struct that can be passed as a pointer, which can be stack-allocated
>> by the caller, and which holds the input arguments that are common to
>> both functions, as is sensible.
> 
> I played a fair bit with various options. I ended up not using a struct to
> pass most options, but instead went for a flags argument. However, I did use a
> struct for passing either relation or smgr.
> 
> 
> typedef enum ExtendBufferedFlags {
>     EB_SKIP_EXTENSION_LOCK = (1 << 0),
>     EB_IN_RECOVERY = (1 << 1),
>     EB_CREATE_FORK_IF_NEEDED = (1 << 2),
>     EB_LOCK_FIRST = (1 << 3),
> 
>     /* internal flags follow */
>     EB_RELEASE_PINS = (1 << 4),
> } ExtendBufferedFlags;

Is EB_IN_RECOVERY always set when RecoveryInProgress()? Is it really 
needed? What does EB_LOCK_FIRST do?

> /*
>   * To identify the relation - either relation or smgr + relpersistence has to
>   * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
>   * to use the same function for both crash recovery and normal operation.
>   */
> typedef struct ExtendBufferedWhat
> {
>     Relation rel;
>     struct SMgrRelationData *smgr;
>     char relpersistence;
> } ExtendBufferedWhat;
> 
> #define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
> /* requires use of EB_SKIP_EXTENSION_LOCK */
> #define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})

Clever. I'm still not 100% convinced we need the EB_SMGR variant, but 
with this we'll have the flexibility in any case.

> extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
>                                 ForkNumber forkNum,
>                                 BufferAccessStrategy strategy,
>                                 uint32 flags);
> extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
>                                        ForkNumber fork,
>                                        BufferAccessStrategy strategy,
>                                        uint32 flags,
>                                        uint32 extend_by,
>                                        Buffer *buffers,
>                                        uint32 *extended_by);
> extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
>                                   ForkNumber fork,
>                                   BufferAccessStrategy strategy,
>                                   uint32 flags,
>                                   BlockNumber extend_to,
>                                   ReadBufferMode mode);
> 
> As you can see I removed ReadBufferMode from most of the functions (as
> suggested by Heikki earlier). When extending by 1/multiple pages, we only need
> to know whether to lock or not.

Ok, that's better. Still complex and a lot of arguments, but I don't 
have any great suggestions on how to improve it.

> > The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
> > falling back to reading the page normally if there was a concurrent relation
> extension.

Hmm, I think you'll need another return value, to let the caller know if 
the relation was extended or not. Or a flag to ereport(ERROR) if the 
page already exists, for ginbuild() and friends.

> The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated,
> > gnarly code to do so from vm_extend() and fsm_extend().

Makes sense.

> I'm not sure about the function naming pattern. I do like 'By' a lot more than
> the Bulk prefix I used before.

+1

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-02 00:04:14 +0200, Heikki Linnakangas wrote:
> On 01/03/2023 09:33, Andres Freund wrote:
> > typedef enum ExtendBufferedFlags {
> >     EB_SKIP_EXTENSION_LOCK = (1 << 0),
> >     EB_IN_RECOVERY = (1 << 1),
> >     EB_CREATE_FORK_IF_NEEDED = (1 << 2),
> >     EB_LOCK_FIRST = (1 << 3),
> >
> >     /* internal flags follow */
> >     EB_RELEASE_PINS = (1 << 4),
> > } ExtendBufferedFlags;
>
> Is EB_IN_RECOVERY always set when RecoveryInProgress()? Is it really needed?

Right now it's just passed in from the caller. At the moment it's just needed
to know what to pass to smgrcreate(isRedo).

However, XLogReadBufferExtended() doesn't currently use this path, so maybe
it's not actually needed.


> What does EB_LOCK_FIRST do?

Lock the first returned buffer; this is basically the replacement for
num_locked_buffers from the earlier version. I think it's likely that either
locking the first, or potentially at some later point locking all buffers, is
all that's needed for ExtendBufferedRelBy().

EB_LOCK_FIRST_BUFFER would perhaps be better?


> > /*
> >   * To identify the relation - either relation or smgr + relpersistence has to
> >   * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
> >   * to use the same function for both crash recovery and normal operation.
> >   */
> > typedef struct ExtendBufferedWhat
> > {
> >     Relation rel;
> >     struct SMgrRelationData *smgr;
> >     char relpersistence;
> > } ExtendBufferedWhat;
> >
> > #define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
> > /* requires use of EB_SKIP_EXTENSION_LOCK */
> > #define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})
>
> Clever. I'm still not 100% convinced we need the EB_SMGR variant, but with
> this we'll have the flexibility in any case.

Hm - how would you use it from XLogReadBufferExtended() without that?


XLogReadBufferExtended() spends a disappointing amount of time in
smgropen(). Quite visible in profiles.

In the plain read case there's at least one smgropen() in
XLogReadBufferExtended() itself, then another in ReadBufferWithoutRelcache().
The extension path right now is worse - it does one smgropen() for each
extended block.

I think we should avoid using ReadBufferWithoutRelcache() in
XLogReadBufferExtended() in the read path as well, but that's for later.


> > extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
> >                                 ForkNumber forkNum,
> >                                 BufferAccessStrategy strategy,
> >                                 uint32 flags);
> > extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
> >                                        ForkNumber fork,
> >                                        BufferAccessStrategy strategy,
> >                                        uint32 flags,
> >                                        uint32 extend_by,
> >                                        Buffer *buffers,
> >                                        uint32 *extended_by);
> > extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
> >                                   ForkNumber fork,
> >                                   BufferAccessStrategy strategy,
> >                                   uint32 flags,
> >                                   BlockNumber extend_to,
> >                                   ReadBufferMode mode);
> >
> > As you can see I removed ReadBufferMode from most of the functions (as
> > suggested by Heikki earlier). When extending by 1/multiple pages, we only need
> > to know whether to lock or not.
>
> Ok, that's better. Still complex and a lot of arguments, but I don't have
> any great suggestions on how to improve it.

I don't think there are going to be all that many callers of
ExtendBufferedRelBy() and ExtendBufferedRelTo(); most are going to use
ExtendBufferedRel(), I think. So the complexity seems acceptable.


> > The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
> > falling back to reading the page normally if there was a concurrent relation
> > extension.
>
> Hmm, I think you'll need another return value, to let the caller know if the
> relation was extended or not. Or a flag to ereport(ERROR) if the page
> already exists, for ginbuild() and friends.

I don't think ginbuild() et al need to use ExtendBufferedRelTo()? A plain
ExtendBufferedRel() should suffice.  The places that do need it are
fsm_extend() and vm_extend() - I did end up avoiding the redundant lookup.

But I was wondering about a flag controlling this as well.


Attached is my current version of this. Still needs more polishing, including
comments explaining the flags. But I thought it'd be useful to have it out
there.

There's two new patches in the series:
- a patch to not initialize pages in the loop in fsm_extend(), vm_extend()
  anymore - we have a check about initializing pages at a later point, so
  there doesn't really seem to be a need for it?
- a patch to use the new ExtendBufferedRelTo() in fsm_extend(), vm_extend()
  and XLogReadBufferExtended()


In this version I also tried to address some of the other feedback raised in
the thread. One thing I haven't decided what to do about yet is David's
feedback about a version of PinLocalBuffer() that doesn't adjust the
usagecount, which wouldn't need to read the buf_state.

Greetings,

Andres Freund

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2022-10-28 19:54:20 -0700, Andres Freund wrote:
> I've done a fair bit of benchmarking of this patchset. For COPY it comes out
> ahead everywhere. It's possible that there's a very small regression for
> extremely IO miss heavy workloads, more below.
> 
> 
> server "base" configuration:
> 
> max_wal_size=150GB
> shared_buffers=24GB
> huge_pages=on
> autovacuum=0
> backend_flush_after=2MB
> max_connections=5000
> wal_buffers=128MB
> wal_segment_size=1GB
> 
> benchmark: pgbench running COPY into a single table. pgbench -t is set
> according to the client count, so that the same amount of data is inserted.
> This is done oth using small files ([1], ringbuffer not effective, no dirty
> data to write out within the benchmark window) and a bit larger files ([2],
> lots of data to write out due to ringbuffer).
> 
> To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.
> 
> s_b=24GB
> 
> test: unlogged_small_files, format: text, files: 1024, 9015MB total
>         seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
> clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
> 1       58.63   207     50.22   242     54.35   224
> 2       32.67   372     25.82   472     27.30   446
> 4       22.53   540     13.30   916     14.33   851
> 8       15.14   804     7.43    1640    7.48    1632
> 16      14.69   829     4.79    2544    4.50    2718
> 32      15.28   797     4.41    2763    3.32    3710
> 64      15.34   794     5.22    2334    3.06    4061
> 128     15.49   786     4.97    2452    3.13    3926
> 256     15.85   768     5.02    2427    3.26    3769
> 512     16.02   760     5.29    2303    3.54    3471

I just spent a few hours trying to reproduce these benchmark results. For the
longest time I could not get the numbers for *HEAD* to even get close to the
above, while the numbers for the patch were very close.

I was worried it was a performance regression in HEAD etc. But no, same git
commit as back then produces the same issue.


As it turns out, I somehow screwed up my benchmark tooling, and I did not set
the CPU "scaling_governor" and "energy_performance_preference" to
"performance". In a crazy turn of events, that approximately makes no
difference with the patch applied, and a 2x difference for HEAD.


I suspect this is some pathological issue when encountering heavy lock
contention (likely leading to the CPU reducing speed into a deeper state,
which then takes longer to get out of when the lock is released). As the lock
contention is drastically reduced with the patch, that effect is not visible
anymore.


After fixing the performance scaling issue, the results are quite close to the
above numbers again...


Aargh, I want my afternoon back.


Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Melanie Plageman
Date:
Hi,

Below is my review of a slightly older version than you just posted --
much of it you may have already addressed.

From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 20 Mar 2023 21:57:40 -0700
Subject: [PATCH v5 01/15] createdb-using-wal-fixes

This could use a more detailed commit message -- I don't really get what
it is doing

From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v5 02/15] Add some error checking around pinning

---
 src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++---------
 src/include/storage/bufmgr.h        |  1 +
 2 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 95212a3941..fa20fab5a2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
                                     LW_EXCLUSIVE);
 }

+void
+BufferCheckOneLocalPin(Buffer buffer)
+{
+    if (BufferIsLocal(buffer))
+    {
+        /* There should be exactly one pin */
+        if (LocalRefCount[-buffer - 1] != 1)
+            elog(ERROR, "incorrect local pin count: %d",
+                LocalRefCount[-buffer - 1]);
+    }
+    else
+    {
+        /* There should be exactly one local pin */
+        if (GetPrivateRefCount(buffer) != 1)

I'd rather this be an else if (was already like this, but no reason not
to change it now)

+            elog(ERROR, "incorrect local pin count: %d",
+                GetPrivateRefCount(buffer));
+    }
+}

From 00d3044770478eea31e00fee8d1680f22ca6adde Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 27 Feb 2023 17:36:37 -0800
Subject: [PATCH v5 04/15] Add smgrzeroextend(), FileZero(), FileFallocate()

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9fd8444ed4..c34ed41d52 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2206,6 +2206,92 @@ FileSync(File file, uint32 wait_event_info)
     return returnCode;
 }

+/*
+ * Zero a region of the file.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+    int            returnCode;
+    ssize_t        written;
+
+    Assert(FileIsValid(file));
+    returnCode = FileAccess(file);
+    if (returnCode < 0)
+        return returnCode;
+
+    pgstat_report_wait_start(wait_event_info);
+    written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
+    pgstat_report_wait_end();
+
+    if (written < 0)
+        return -1;
+    else if (written != amount)

this doesn't need to be an else if

+    {
+        /* if errno is unset, assume problem is no disk space */
+        if (errno == 0)
+            errno = ENOSPC;
+        return -1;
+    }

+int
+FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+    int            returnCode;
+
+    Assert(FileIsValid(file));
+    returnCode = FileAccess(file);
+    if (returnCode < 0)
+        return returnCode;
+
+    pgstat_report_wait_start(wait_event_info);
+    returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+    pgstat_report_wait_end();
+
+    if (returnCode == 0)
+        return 0;
+
+    /* for compatibility with %m printing etc */
+    errno = returnCode;
+
+    /*
+    * Return in cases of a "real" failure, if fallocate is not supported,
+    * fall through to the FileZero() backed implementation.
+    */
+    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+        return returnCode;

I'm pretty sure you can just delete the below if statement

+    if (returnCode == 0 ||
+        (returnCode != EINVAL && returnCode != EINVAL))
+        return returnCode;

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..59a65a8305 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,6 +28,7 @@
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -500,6 +501,116 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
     Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 }

+/*
+ *    mdzeroextend() -- Add ew zeroed out blocks to the specified relation.

not sure what ew is

+ *
+ *        Similar to mdrextend(), except the relation can be extended by

mdrextend->mdextend

+ *        multiple blocks at once, and that the added blocks will be filled with

I would lose the comma and just say "and the added blocks will be filled..."

+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+            BlockNumber blocknum, int nblocks, bool skipFsync)

So, I think there are a few too many local variables in here, and it
actually makes it more confusing.
Assuming you would like to keep the input parameters blocknum and
nblocks unmodified for debugging/other reasons, here is a suggested
refactor of this function.
Also, I think you can combine the two error cases (I don't know if the
user cares what you were trying to extend the file with). I've done this
below also.

void
mdzeroextend(SMgrRelation reln, ForkNumber forknum,
            BlockNumber blocknum, int nblocks, bool skipFsync)
{
    MdfdVec    *v;
    BlockNumber curblocknum = blocknum;
    int            remblocks = nblocks;

    Assert(nblocks > 0);

    /* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
    Assert(blocknum >= mdnblocks(reln, forknum));
#endif

    /*
    * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
    * more --- we mustn't create a block whose number actually is
    * InvalidBlockNumber or larger.
    */
    if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                errmsg("cannot extend file \"%s\" beyond %u blocks",
                        relpath(reln->smgr_rlocator, forknum),
                        InvalidBlockNumber)));

    while (remblocks > 0)
    {
        int            segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
        int            numblocks = remblocks;
        off_t        seekpos = (off_t) BLCKSZ * segstartblock;
        int            ret;

        if (segstartblock + remblocks > RELSEG_SIZE)
            numblocks = RELSEG_SIZE - segstartblock;

        v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync,
                         EXTENSION_CREATE);

        /*
        * If available and useful, use posix_fallocate() (via FileFallocate())
        * to extend the relation. That's often more efficient than using
        * write(), as it commonly won't cause the kernel to allocate page
        * cache space for the extended pages.
        *
        * However, we don't use FileAllocate() for small extensions, as it
        * defeats delayed allocation on some filesystems. Not clear where
        * that decision should be made though? For now just use a cutoff of
        * 8, anything between 4 and 8 worked OK in some local testing.
        */
        if (numblocks > 8)
            ret = FileFallocate(v->mdfd_vfd,
                                seekpos, (off_t) BLCKSZ * numblocks,
                                WAIT_EVENT_DATA_FILE_EXTEND);
        else
            /*
            * Even if we don't want to use fallocate, we can still extend a
            * bit more efficiently than writing each 8kB block individually.
            * pg_pwrite_zeros() (via FileZero()) uses
            * pg_pwritev_with_retry() to avoid multiple writes or needing a
            * zeroed buffer for the whole length of the extension.
            */
            ret = FileZero(v->mdfd_vfd,
                          seekpos, (off_t) BLCKSZ * numblocks,
                          WAIT_EVENT_DATA_FILE_EXTEND);

        if (ret != 0)
            ereport(ERROR,
                    errcode_for_file_access(),
                    errmsg("could not extend file \"%s\": %m",
                            FilePathName(v->mdfd_vfd)),
                    errhint("Check free disk space."));

        if (!skipFsync && !SmgrIsTemp(reln))
            register_dirty_segment(reln, forknum, v);

        Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));

        remblocks -= numblocks;
        curblocknum += numblocks;
    }
}

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dc466e5414..5224ca5259 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,8 @@ typedef struct f_smgr
+/*
+ *    smgrzeroextend() -- Add new zeroed out blocks to a file.
+ *
+ *        Similar to smgrextend(), except the relation can be extended by
+ *        multiple blocks at once, and that the added blocks will be filled with
+ *        zeroes.
+ */

Similar grammatical feedback as mdzeroextend.


From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa20fab5a2..6f50dbd212 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
 }

 void
-BufferCheckOneLocalPin(Buffer buffer)
+BufferCheckWePinOnce(Buffer buffer)

This name is weird. Who is we?

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..798c5b93a8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+    uint32        buf_state;
+    Buffer        buffer = BufferDescriptorGetBuffer(buf_hdr);
+    int            bufid = -(buffer + 1);

You do
  int            buffid = -buffer - 1;
in UnpinLocalBuffer()
They should be consistent.

  int            bufid = -(buffer + 1);

I think this version is better:

  int            buffid = -buffer - 1;

Since if buffer is INT_MAX, then the -(buffer + 1) version invokes
undefined behavior while the -buffer - 1 version doesn't.
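
That is:

    int     buffer = INT_MAX;       /* from <limits.h> */
    int     a = -(buffer + 1);      /* buffer + 1 overflows: undefined */
    int     b = -buffer - 1;        /* INT_MIN, well defined */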

From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:26:34 -0800
Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately

Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. Particularly for extending relations
needing to hold the extension lock while acquiring a victim buffer is
painful. By separating out the victim buffer acquisition, future commits will
be able to change relation extensions to scale better.

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3d0683593f..ea423ae484 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,

     /*
     * Buffer contents are currently invalid.  Try to obtain the right to
     * start I/O.  If StartBufferIO returns false, then someone else managed
     * to read it before we did, so there's nothing left for BufferAlloc() to
     * do.
     */
-    if (StartBufferIO(buf, true))
+    if (StartBufferIO(victim_buf_hdr, true))
         *foundPtr = false;
     else
         *foundPtr = true;

I know it was already like this, but since you edited the line already,
can we just make this this now?

    *foundPtr = !StartBufferIO(victim_buf_hdr, true);

@@ -1595,6 +1413,237 @@ retry:
     StrategyFreeBuffer(buf);
 }

+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called on a buffer with a valid tag, pinned, but without the
+ * buffer header spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+    /*
+    * Clear out the buffer's tag and flags and usagecount.  This is not
+    * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
+    * doing anything with the buffer. But currently it's beneficial as the
+    * pre-check for several linear scans of shared buffers just checks the
+    * tag.

I don't really understand the above comment -- mainly the last sentence.

+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
+{
+    BufferDesc *buf_hdr;
+    Buffer        buf;
+    uint32        buf_state;
+    bool        from_ring;
+
+    /*
+    * Ensure, while the spinlock's not yet held, that there's a free refcount
+    * entry.
+    */
+    ReservePrivateRefCountEntry();
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    /* we return here if a prospective victim buffer gets used concurrently */
+again:

Why use goto instead of a loop here (again is the goto label)?


From a7597b79dffaf96807f4a9beea0a39634530298d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 16:44:16 -0700
Subject: [PATCH v5 08/15] bufmgr: Support multiple in-progress IOs by using
 resowner

Commit message should describe why we couldn't support multiple
in-progress IOs before, I think (e.g. we couldn't be sure that we
cleared IO_IN_PROGRESS if something happened).

@@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
 {
     uint32        buf_state;

I noticed that the comment above TerminateBufferIO() says
 * TerminateBufferIO: release a buffer we were doing I/O on
 *    (Assumptions)
 *    My process is executing IO for the buffer

Can we still say this is an assumption? What about when it is being
cleaned up after being called from AbortBufferIO()

diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45..fccc59b39d 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData

     /* We have built-in support for remembering: */
     ResourceArray bufferarr;    /* owned buffers */
+    ResourceArray bufferioarr;    /* in-progress buffer IO */
     ResourceArray catrefarr;    /* catcache references */
     ResourceArray catlistrefarr;    /* catcache-list pins */
     ResourceArray relrefarr;    /* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)

Maybe worth mentioning in-progress buffer IO in resowner README? I know
it doesn't claim to be exhaustive, so, up to you.

Also, I realize that existing code in this file has the extraneous
parentheses, but maybe it isn't worth staying consistent with that?
as in: &(owner->bufferioarr)

+ */
+void
+ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
+{
+    ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer));
+}
+


From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:24:19 -0800
Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
 ExtendBufferedRel{By,To,}

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c95b87bca..4e07a5bc48 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+/*
+ * Extend relation by multiple blocks.
+ *
+ * Tries to extend the relation by extend_by blocks. Depending on the
+ * availability of resources the relation may end up being extended by a
+ * smaller number of pages (unless an error is thrown, always by at least one
+ * page). *extended_by is updated to the number of pages the relation has been
+ * extended to.
+ *
+ * buffers needs to be an array that is at least extend_by long. Upon
+ * completion, the first extend_by array elements will point to a pinned
+ * buffer.
+ *
+ * If EB_LOCK_FIRST is part of flags, the first returned buffer is
+ * locked. This is useful for callers that want a buffer that is guaranteed to
+ * be empty.

This should document what the returned BlockNumber is.
Also, instead of having extend_by and extended_by, how about just having
one which is set by the caller to the desired number to extend by and
then overwritten in this function to the value it successfully extended
by.

It would be nice if the function returned the number it extended by
instead of the BlockNumber.

+ */
+BlockNumber
+ExtendBufferedRelBy(ExtendBufferedWhat eb,
+                    ForkNumber fork,
+                    BufferAccessStrategy strategy,
+                    uint32 flags,
+                    uint32 extend_by,
+                    Buffer *buffers,
+                    uint32 *extended_by)
+{
+    Assert((eb.rel != NULL) ^ (eb.smgr != NULL));

Can we turn these into !=

  Assert((eb.rel != NULL) != (eb.smgr != NULL));

since it is easier to understand.

(it is also in ExtendBufferedRelTo())

+    Assert(eb.smgr == NULL || eb.relpersistence != 0);
+    Assert(extend_by > 0);
+
+    if (eb.smgr == NULL)
+    {
+        eb.smgr = RelationGetSmgr(eb.rel);
+        eb.relpersistence = eb.rel->rd_rel->relpersistence;
+    }
+
+    return ExtendBufferedRelCommon(eb, fork, strategy, flags,
+                                  extend_by, InvalidBlockNumber,
+                                  buffers, extended_by);
+}
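
(For orientation, a caller would look roughly like this - a sketch, with the
struct-literal initialization assumed from the ExtendBufferedWhat definition
quoted further down:)

    Buffer        buffers[4];
    uint32        extended_by = 0;
    BlockNumber   first_block;

    first_block = ExtendBufferedRelBy((ExtendBufferedWhat) {.rel = rel},
                                      MAIN_FORKNUM, NULL /* strategy */,
                                      EB_LOCK_FIRST, lengthof(buffers),
                                      buffers, &extended_by);
    /* buffers[0] is pinned and locked, the rest are just pinned */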

+ * Extend the relation so it is at least extend_to blocks large, read buffer

Use of "read buffer" here is confusing. We only read the block if, after
we try extending the relation, someone else already did so and we have
to read the block they extended in, right?

+ * (extend_to - 1).
+ *
+ * This is useful for callers that want to write a specific page, regardless
+ * of the current size of the relation (e.g. useful for visibilitymap and for
+ * crash recovery).
+ */
+Buffer
+ExtendBufferedRelTo(ExtendBufferedWhat eb,
+                    ForkNumber fork,
+                    BufferAccessStrategy strategy,
+                    uint32 flags,
+                    BlockNumber extend_to,
+                    ReadBufferMode mode)
+{

+    while (current_size < extend_to)
+    {

Can declare buffers variable here.
+    Buffer buffers[64];

+        uint32 num_pages = lengthof(buffers);
+        BlockNumber first_block;
+
+        if ((uint64) current_size + num_pages > extend_to)
+            num_pages = extend_to - current_size;
+
+        first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
+                                             num_pages, extend_to,
+                                             buffers, &extended_by);
+
+        current_size = first_block + extended_by;
+        Assert(current_size <= extend_to);
+        Assert(num_pages != 0 || current_size >= extend_to);
+
+        for (int i = 0; i < extended_by; i++)
+        {
+            if (first_block + i != extend_to - 1)

Is there a way we could avoid pinning these other buffers to begin with
(e.g. passing a parameter to ExtendBufferedRelCommon())

+                ReleaseBuffer(buffers[i]);
+            else
+                buffer = buffers[i];
+        }
+    }

+    /*
+    * It's possible that another backend concurrently extended the
+    * relation. In that case read the buffer.
+    *
+    * XXX: Should we control this via a flag?
+    */

I feel like there needs to be a more explicit comment about how you
could end up in this situation -- e.g. someone else extends the relation
and so smgrnblocks returns a value that is greater than extend_to, so
buffer stays InvalidBuffer

+    if (buffer == InvalidBuffer)
+    {
+        bool hit;
+
+        Assert(extended_by == 0);
+        buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
+                                  fork, extend_to - 1, mode, strategy,
+                                  &hit);
+    }
+
+    return buffer;
+}

Do we use compound literals? Here, this could be:

        buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
                                  fork, extend_to - 1, mode, strategy,
                                  &(bool) {0});

To eliminate the extraneous hit variable.



 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
     bool        found;
     IOContext    io_context;
     IOObject    io_object;
-    bool        isExtend;
     bool        isLocalBuf = SmgrIsTemp(smgr);

     *hit = false;

+    /*
+    * Backward compatibility path, most code should use
+    * ExtendRelationBuffered() instead, as acquiring the extension lock
+    * inside ExtendRelationBuffered() scales a lot better.

Think these are old function names in the comment

+static BlockNumber
+ExtendBufferedRelShared(ExtendBufferedWhat eb,
+                        ForkNumber fork,
+                        BufferAccessStrategy strategy,
+                        uint32 flags,
+                        uint32 extend_by,
+                        BlockNumber extend_upto,
+                        Buffer *buffers,
+                        uint32 *extended_by)
+{
+    BlockNumber first_block;
+    IOContext    io_context = IOContextForStrategy(strategy);
+
+    LimitAdditionalPins(&extend_by);
+
+    /*
+    * Acquire victim buffers for extension without holding extension lock.
+    * Writing out victim buffers is the most expensive part of extending the
+    * relation, particularly when doing so requires WAL flushes. Zeroing out
+    * the buffers is also quite expensive, so do that before holding the
+    * extension lock as well.
+    *
+    * These pages are pinned by us and not valid. While we hold the pin they
+    * can't be acquired as victim buffers by another backend.
+    */
+    for (uint32 i = 0; i < extend_by; i++)
+    {
+        Block        buf_block;
+
+        buffers[i] = GetVictimBuffer(strategy, io_context);
+        buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+        /* new buffers are zero-filled */
+        MemSet((char *) buf_block, 0, BLCKSZ);
+    }
+
+    /*
+    * Lock relation against concurrent extensions, unless requested not to.
+    *
+    * We use the same extension lock for all forks. That's unnecessarily
+    * restrictive, but currently extensions for forks don't happen often
+    * enough to make it worth locking more granularly.
+    *
+    * Note that another backend might have extended the relation by the time
+    * we get the lock.
+    */
+    if (!(flags & EB_SKIP_EXTENSION_LOCK))
+    {
+        LockRelationForExtension(eb.rel, ExclusiveLock);
+        eb.smgr = RelationGetSmgr(eb.rel);
+    }
+
+    /*
+    * If requested, invalidate size cache, so that smgrnblocks asks the
+    * kernel.
+    */
+    if (flags & EB_CLEAR_SIZE_CACHE)
+        eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;

I don't see this in master, is it new?

+    first_block = smgrnblocks(eb.smgr, fork);
+

The below needs a better comment explaining what it is handling. e.g. if
we end up extending by less than we planned, unpin all of the surplus
victim buffers.

+    if (extend_upto != InvalidBlockNumber)
+    {
+        uint32 old_num_pages = extend_by;

maybe call this something like original_extend_by

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5b44b0be8b..0528fddf99 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
+BlockNumber
+ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+                      ForkNumber fork,
+                      uint32 flags,
+                      uint32 extend_by,
+                      BlockNumber extend_upto,
+                      Buffer *buffers,
+                      uint32 *extended_by)
+{

+        victim_buf_id = -(buffers[i] + 1);

same comment here as before.

+ * Flags influencing the behaviour of ExtendBufferedRel*
+ */
+typedef enum ExtendBufferedFlags
+{
+    /*
+    * Don't acquire extension lock. This is safe only if the relation isn't
+    * shared, an access exclusive lock is held or if this is the startup
+    * process.
+    */
+    EB_SKIP_EXTENSION_LOCK = (1 << 0),
+
+    /* Is this extension part of recovery? */
+    EB_PERFORMING_RECOVERY = (1 << 1),
+
+    /*
+    * Should the fork be created if it does not currently exist? This likely
+    * only ever makes sense for relation forks.
+    */
+    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
+
+    /* Should the first (possibly only) return buffer be returned locked? */
+    EB_LOCK_FIRST = (1 << 3),
+
+    /* Should the smgr size cache be cleared? */
+    EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+    /* internal flags follow */

I don't understand what this comment means ("internal flags follow")

+    EB_LOCK_TARGET = (1 << 5),
+} ExtendBufferedFlags;

+typedef struct ExtendBufferedWhat

Maybe this should be called something like BufferedExtendTarget?

+{
+    Relation rel;
+    struct SMgrRelationData *smgr;
+    char relpersistence;
+} ExtendBufferedWhat;

From e4438c0eb87035e4cefd1de89458a8d88c90c0e3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:44:43 -0700
Subject: [PATCH v5 11/15] heapam: Add num_pages to RelationGetBufferForTuple()

This will be useful to compute the number of pages to extend a relation by.


diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cf4b917eb4..500904897d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2050,6 +2053,33 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
         return tup;
 }

+/*
+ * Helper for heap_multi_insert() that computes the number of full pages s

no space after "page" before "s"

+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples,
+                        Size saveFreeSpace)
+{
+    size_t        page_avail;
+    int            npages = 0;
+
+    page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+    npages++;

can this not just be this:

size_t        page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
int            npages = 1;


From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:14:11 -0700
Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy()

---
 src/backend/access/heap/hio.c | 285 +++++++++++++++++-----------------
 1 file changed, 146 insertions(+), 139 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 65886839e7..48cfcff975 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len,

so in RelationGetBufferForTuple() up above where your changes start,
there is this code


    /*
    * We first try to put the tuple on the same page we last inserted a tuple
    * on, as cached in the BulkInsertState or relcache entry.  If that
    * doesn't work, we ask the Free Space Map to locate a suitable page.
    * Since the FSM's info might be out of date, we have to be prepared to
    * loop around and retry multiple times. (To insure this isn't an infinite
    * loop, we must update the FSM with the correct amount of free space on
    * each page that proves not to be suitable.)  If the FSM has no record of
    * a page with enough free space, we give up and extend the relation.
    *
    * When use_fsm is false, we either put the tuple onto the existing target
    * page or extend the relation.
    */
    if (bistate && bistate->current_buf != InvalidBuffer)
    {
        targetBlock = BufferGetBlockNumber(bistate->current_buf);
    }
    else
        targetBlock = RelationGetTargetBlock(relation);

    if (targetBlock == InvalidBlockNumber && use_fsm)
    {
        /*
        * We have no cached target page, so ask the FSM for an initial
        * target.
        */
        targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
    }

And I was thinking: ReadBufferBI() only has one caller now
(RelationGetBufferForTuple()), and this caller basically has already
checked for the case handled inside ReadBufferBI() (the code I pasted
above)

    /* If we have the desired block already pinned, re-pin and return it */
    if (bistate->current_buf != InvalidBuffer)
    {
        if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
        {
            /*
            * Currently the LOCK variants are only used for extending
            * relation, which should never reach this branch.
            */
            Assert(mode != RBM_ZERO_AND_LOCK &&
                  mode != RBM_ZERO_AND_CLEANUP_LOCK);

            IncrBufferRefCount(bistate->current_buf);
            return bistate->current_buf;
        }
        /* ... else drop the old buffer */

So, I was thinking maybe there is some way to inline the logic for
ReadBufferBI(), because I think it would feel more streamlined to me.

@@ -558,18 +477,46 @@ loop:
             ReleaseBuffer(buffer);
         }

Oh, and I forget which commit introduced BulkInsertState->next_free and
last_free, but I remember thinking that it didn't seem to fit with the
other parts of that commit.

-        /* Without FSM, always fall out of the loop and extend */
-        if (!use_fsm)
-            break;
+        if (bistate
+            && bistate->next_free != InvalidBlockNumber
+            && bistate->next_free <= bistate->last_free)
+        {
+            /*
+            * We bulk extended the relation before, and there are still some
+            * unused pages from that extension, so we don't need to look in
+            * the FSM for a new page. But do record the free space from the
+            * last page, somebody might insert narrower tuples later.
+            */

Why couldn't we have found out that we bulk-extended before and gotten the
block from there up above the while loop?

+            if (use_fsm)
+                RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);

-        /*
-        * Update FSM as to condition of this page, and ask for another page
-        * to try.
-        */
-        targetBlock = RecordAndGetPageWithFreeSpace(relation,
-                                                    targetBlock,
-                                                    pageFreeSpace,
-                                                    targetFreeSpace);
+            Assert(bistate->last_free != InvalidBlockNumber &&

You don't need the second half of the assert.

+                  bistate->next_free <= bistate->last_free);
+            targetBlock = bistate->next_free;
+            if (bistate->next_free >= bistate->last_free)

they can only be equal at this point

+            {
+                bistate->next_free = InvalidBlockNumber;
+                bistate->last_free = InvalidBlockNumber;
+            }
+            else
+                bistate->next_free++;
+        }
+        else if (!use_fsm)
+        {
+            /* Without FSM, always fall out of the loop and extend */
+            break;
+        }

It would be nice to have a comment explaining why this is in its own
else if instead of breaking earlier (i.e. !use_fsm is still a valid case
in the if branch above it)

+        else
+        {
+            /*
+            * Update FSM as to condition of this page, and ask for another
+            * page to try.
+            */
+            targetBlock = RecordAndGetPageWithFreeSpace(relation,
+                                                        targetBlock,
+                                                        pageFreeSpace,
+                                                        targetFreeSpace);
+        }

we can get rid of needLock and waitcount variables like this

+#define MAX_BUFFERS 64
+        Buffer        victim_buffers[MAX_BUFFERS];
+        BlockNumber firstBlock = InvalidBlockNumber;
+        BlockNumber firstBlockFSM = InvalidBlockNumber;
+        BlockNumber curBlock;
+        uint32        extend_by_pages;
+        uint32        no_fsm_pages;
+        uint32        waitcount;
+
+        extend_by_pages = num_pages;
+
+        /*
+        * Multiply the number of pages to extend by the number of waiters. Do
+        * this even if we're not using the FSM, as it does relieve
+        * contention. Pages will be found via bistate->next_free.
+        */
+        if (needLock)
+            waitcount = RelationExtensionLockWaiterCount(relation);
+        else
+            waitcount = 0;
+        extend_by_pages += extend_by_pages * waitcount;

        if (!RELATION_IS_LOCAL(relation))
            extend_by_pages += extend_by_pages *
                RelationExtensionLockWaiterCount(relation);

+
+        /*
+        * can't extend by more than MAX_BUFFERS, we need to pin them all
+        * concurrently. FIXME: Need an NBuffers / MaxBackends type limit
+        * here.
+        */
+        extend_by_pages = Min(extend_by_pages, MAX_BUFFERS);
+
+        /*
+        * How many of the extended pages not to enter into the FSM.
+        *
+        * Only enter pages that we don't need ourselves into the FSM.
+        * Otherwise every other backend will immediately try to use the pages
+        * this backend needs itself, causing unnecessary contention.
+        *
+        * Bulk extended pages are remembered in bistate->next_free_buffer. So
+        * without a bistate we can't directly make use of them.
+        *
+        * Never enter the page returned into the FSM, we'll immediately use
+        * it.
+        */
+        if (num_pages > 1 && bistate == NULL)
+            no_fsm_pages = 1;
+        else
+            no_fsm_pages = num_pages;

this is more clearly written as:
  no_fsm_pages = bistate == NULL ? 1 : num_pages;

-    /*
-    * Release the file-extension lock; it's now OK for someone else to extend
-    * the relation some more.
-    */
-    if (needLock)
-        UnlockRelationForExtension(relation, ExclusiveLock);
+        if (bistate)
+        {
+            if (extend_by_pages > 1)
+            {
+                bistate->next_free = firstBlock + 1;
+                bistate->last_free = firstBlock + extend_by_pages - 1;
+            }
+            else
+            {
+                bistate->next_free = InvalidBlockNumber;
+                bistate->last_free = InvalidBlockNumber;
+            }
+        }
+
+        buffer = victim_buffers[0];

If we move the "buffer =" assignment up, we can have only one if (bistate)

+        if (bistate)
+        {
+            IncrBufferRefCount(buffer);
+            bistate->current_buf = buffer;
+        }
+    }

like this:

        buffer = victim_buffers[0];

        if (bistate)
        {
            if (extend_by_pages > 1)
            {
                bistate->next_free = firstBlock + 1;
                bistate->last_free = firstBlock + extend_by_pages - 1;
            }
            else
            {
                bistate->next_free = InvalidBlockNumber;
                bistate->last_free = InvalidBlockNumber;
            }

            IncrBufferRefCount(buffer);
            bistate->current_buf = buffer;
        }


From 6711e45bed59ee07ec277b9462f4745603a3d4a4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:41:46 -0700
Subject: [PATCH v5 15/15] bufmgr: debug: Add PrintBuffer[Desc]

Useful for development. Perhaps we should polish these and keep them?
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e07a5bc48..0d382cd787 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+
+    fprintf(stderr, "%d: [%u] msg: %s, rel: %s, block %u: refcount: %u / %u, usagecount: %u, flags:%s%s%s%s%s%s%s%s%s%s\n",
+            MyProcPid,
+            buffer,
+            msg,
+            path,
+            blockno,
+            BUF_STATE_GET_REFCOUNT(buf_state),
+            GetPrivateRefCount(buffer),
+            BUF_STATE_GET_USAGECOUNT(buf_state),
+            buf_state & BM_LOCKED ? " BM_LOCKED" : "",
+            buf_state & BM_DIRTY ? " BM_DIRTY" : "",
+            buf_state & BM_VALID ? " BM_VALID" : "",
+            buf_state & BM_TAG_VALID ? " BM_TAG_VALID" : "",
+            buf_state & BM_IO_IN_PROGRESS ? " BM_IO_IN_PROGRESS" : "",
+            buf_state & BM_IO_ERROR ? " BM_IO_ERROR" : "",
+            buf_state & BM_JUST_DIRTIED ? " BM_JUST_DIRTIED" : "",
+            buf_state & BM_PIN_COUNT_WAITER ? " BM_PIN_COUNT_WAITER" : "",
+            buf_state & BM_CHECKPOINT_NEEDED ? " BM_CHECKPOINT_NEEDED" : "",
+            buf_state & BM_PERMANENT ? " BM_PERMANENT" : ""
+        );
+}

How about this

#define FLAG_DESC(flag) (buf_state & (flag) ? " " #flag : "")
            FLAG_DESC(BM_LOCKED),
            FLAG_DESC(BM_DIRTY),
            FLAG_DESC(BM_VALID),
            FLAG_DESC(BM_TAG_VALID),
            FLAG_DESC(BM_IO_IN_PROGRESS),
            FLAG_DESC(BM_IO_ERROR),
            FLAG_DESC(BM_JUST_DIRTIED),
            FLAG_DESC(BM_PIN_COUNT_WAITER),
            FLAG_DESC(BM_CHECKPOINT_NEEDED),
            FLAG_DESC(BM_PERMANENT)
#undef FLAG_DESC

+
+void
+PrintBuffer(Buffer buffer, const char *msg)
+{
+    BufferDesc *buf_hdr = GetBufferDescriptor(buffer - 1);

no need for this variable

+
+    PrintBufferDesc(buf_hdr, msg);
+}

- Melanie



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Kyotaro Horiguchi
Date:
At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the
> relation extension related code in hio.c back into its own function.
> 
> While reviewing the hio.c code, I did realize that too much stuff is done
> while holding the buffer lock. See also the pre-existing issue
> https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de

0001, 0002 looks fine to me.

0003 adds the new function FileFallocate, but we already have
AllocateFile. Although fd.c contains functions with varying word
orders, it could be confusing that closely named functions have
different naming conventions.

+    /*
+     * Return in cases of a "real" failure, if fallocate is not supported,
+     * fall through to the FileZero() backed implementation.
+     */
+    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+        return returnCode;

I'm not entirely sure, but man 2 fallocate says that ENOSYS can also
be returned. Some googling indicates that ENOSYS might need the same
treatment as EOPNOTSUPP. However, I'm not clear on why man
posix_fallocate doesn't mention the former.
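
Concretely, the amendment would presumably look like:

    if (returnCode != EINVAL && returnCode != EOPNOTSUPP &&
        returnCode != ENOSYS)
        return returnCode;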


+        (returnCode != EINVAL && returnCode != EINVAL))
:)


FileGetRawDesc(File file)
 {
     Assert(FileIsValid(file));
+
+    if (FileAccess(file) < 0)
+        return -1;
+

The function's comment is provided below.

> * The returned file descriptor will be valid until the file is closed, but
> * there are a lot of things that can make that happen.  So the caller should
> * be careful not to do much of anything else before it finishes using the
> * returned file descriptor.

So, the responsibility to make sure the file is valid seems to lie
with the callers, although I'm not sure, since there aren't any users
of the function in the tree. I'm unclear as to why FileSize omits the
case lruLessRecently != file. When examining similar functions, such
as FileGetRawFlags and FileGetRawMode, I'm puzzled to find that
neither FileAccess() nor BasicOpenFilePerm() sets the struct members
referred to by those functions. This makes me question the usefulness
of these functions, including FileGetRawDesc(). Regardless, since the
patchset doesn't use FileGetRawDesc(), I don't believe the fix is
necessary in this patch set.

+    if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)

I'm not sure it is appropriate to assume InvalidBlockNumber equals
MaxBlockNumber + 1 in this context.
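
(For reference, block.h defines:

    #define InvalidBlockNumber        ((BlockNumber) 0xFFFFFFFF)
    #define MaxBlockNumber            ((BlockNumber) 0xFFFFFFFE)

so the identity does hold today; the question is whether code should rely
on it.)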


+        int            segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+        int            segendblock = (curblocknum % ((BlockNumber) RELSEG_SIZE)) + remblocks;
+        off_t        seekpos = (off_t) BLCKSZ * segstartblock;

segendblock can be defined as "segstartblock + remblocks", which would
be clearer.

+         * If available and useful, use posix_fallocate() (via FileAllocate())

FileFallocate()?


+         * However, we don't use FileAllocate() for small extensions, as it
+         * defeats delayed allocation on some filesystems. Not clear where
+         * that decision should be made though? For now just use a cutoff of
+         * 8, anything between 4 and 8 worked OK in some local testing.

The choice is quite similar to the one FileFallocate() makes. However,
I'm not sure FileFallocate() itself should be doing this.
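
(The choice in mdzeroextend() being referred to is, roughly:

    if (numblocks > 8)
        ret = FileFallocate(v->mdfd_vfd, seekpos,
                            (off_t) BLCKSZ * numblocks,
                            WAIT_EVENT_DATA_FILE_EXTEND);
    else
        ret = FileZero(v->mdfd_vfd, seekpos,
                       (off_t) BLCKSZ * numblocks,
                       WAIT_EVENT_DATA_FILE_EXTEND);

i.e. both layers decide between fallocate and zero-filling.)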



regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote:
> Below is my review of a slightly older version than you just posted --
> much of it you may have already addressed.

Far from all of it is addressed already - thanks for the review!


> From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Mon, 20 Mar 2023 21:57:40 -0700
> Subject: [PATCH v5 01/15] createdb-using-wal-fixes
>
> This could use a more detailed commit message -- I don't really get what
> it is doing

It's a fix for a bug that I encountered while hacking on this; it has
since been committed.

commit 5df319f3d55d09fadb4f7e4b58c5b476a3aeceb4
Author: Andres Freund <andres@anarazel.de>
Date:   2023-03-20 21:57:40 -0700

    Fix memory leak and inefficiency in CREATE DATABASE ... STRATEGY WAL_LOG

    RelationCopyStorageUsingBuffer() did not free the strategies used to access
    the source / target relation. The memory was released at the end of the
    transaction, but when using a template database with a lot of relations, the
    temporary leak can become prohibitively big.

    RelationCopyStorageUsingBuffer() acquired the buffer for the target relation
    with RBM_NORMAL, therefore requiring a read of a block guaranteed to be
    zero. Use RBM_ZERO_AND_LOCK instead.

    Reviewed-by: Robert Haas <robertmhaas@gmail.com>
    Discussion: https://postgr.es/m/20230321070113.o2vqqxogjykwgfrr@awork3.anarazel.de
    Backpatch: 15-, where STRATEGY WAL_LOG was introduced


> From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 1 Jul 2020 19:06:45 -0700
> Subject: [PATCH v5 02/15] Add some error checking around pinning
>
> ---
>  src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++---------
>  src/include/storage/bufmgr.h        |  1 +
>  2 files changed, 29 insertions(+), 12 deletions(-)
>

> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 95212a3941..fa20fab5a2 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
>                                      LW_EXCLUSIVE);
>  }
>
> +void
> +BufferCheckOneLocalPin(Buffer buffer)
> +{
> +    if (BufferIsLocal(buffer))
> +    {
> +        /* There should be exactly one pin */
> +        if (LocalRefCount[-buffer - 1] != 1)
> +            elog(ERROR, "incorrect local pin count: %d",
> +                LocalRefCount[-buffer - 1]);
> +    }
> +    else
> +    {
> +        /* There should be exactly one local pin */
> +        if (GetPrivateRefCount(buffer) != 1)
>
> I'd rather this be an else if (it was already like this, but no reason not
> to change it now)

I don't like that much - it'd break the symmetry between local / non-local.


> +/*
> + * Zero a region of the file.
> + *
> + * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
> + * appropriate error.
> + */
> +int
> +FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
> +{
> +    int            returnCode;
> +    ssize_t        written;
> +
> +    Assert(FileIsValid(file));
> +    returnCode = FileAccess(file);
> +    if (returnCode < 0)
> +        return returnCode;
> +
> +    pgstat_report_wait_start(wait_event_info);
> +    written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
> +    pgstat_report_wait_end();
> +
> +    if (written < 0)
> +        return -1;
> +    else if (written != amount)
>
> this doesn't need to be an else if

You mean it could be a "bare" if instead? I don't really think that's clearer.


> +    {
> +        /* if errno is unset, assume problem is no disk space */
> +        if (errno == 0)
> +            errno = ENOSPC;
> +        return -1;
> +    }
>
> +int
> +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
> +{
> +#ifdef HAVE_POSIX_FALLOCATE
> +    int            returnCode;
> +
> +    Assert(FileIsValid(file));
> +    returnCode = FileAccess(file);
> +    if (returnCode < 0)
> +        return returnCode;
> +
> +    pgstat_report_wait_start(wait_event_info);
> +    returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
> +    pgstat_report_wait_end();
> +
> +    if (returnCode == 0)
> +        return 0;
> +
> +    /* for compatibility with %m printing etc */
> +    errno = returnCode;
> +
> +    /*
> +    * Return in cases of a "real" failure, if fallocate is not supported,
> +    * fall through to the FileZero() backed implementation.
> +    */
> +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> +        return returnCode;
>
> I'm pretty sure you can just delete the below if statement
>
> +    if (returnCode == 0 ||
> +        (returnCode != EINVAL && returnCode != EINVAL))
> +        return returnCode;

Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
FileFallocate() succeeded or failed (rather than not being supported)?

>
> +/*
> + *    mdzeroextend() -- Add ew zeroed out blocks to the specified relation.
>
> not sure what ew is

A hurried new :)


> + *
> + *        Similar to mdrextend(), except the relation can be extended by
>
> mdrextend->mdextend

> + *        multiple blocks at once, and that the added blocks will be
> filled with
>
> I would lose the comma and just say "and the added blocks will be filled..."

Done.


> +void
> +mdzeroextend(SMgrRelation reln, ForkNumber forknum,
> +            BlockNumber blocknum, int nblocks, bool skipFsync)
>
> So, I think there are  a few too many local variables in here, and it
> actually makes it more confusing.
> Assuming you would like to keep the input parameters blocknum and
> nblocks unmodified for debugging/other reasons, here is a suggested
> refactor of this function

I'm mostly adopting this.


> Also, I think you can combine the two error cases (I don't know if the
> user cares what you were trying to extend the file with).

Hm. I do find it a somewhat useful distinction for figuring out problems - we
haven't used posix_fallocate for files so far, so it seems plausible we'd hit
some portability issues. We could make it an errdetail(), I guess?



> void
> mdzeroextend(SMgrRelation reln, ForkNumber forknum,
>             BlockNumber blocknum, int nblocks, bool skipFsync)
> {
>     MdfdVec    *v;
>     BlockNumber curblocknum = blocknum;
>     int            remblocks = nblocks;
>     Assert(nblocks > 0);
>
>     /* This assert is too expensive to have on normally ... */
> #ifdef CHECK_WRITE_VS_EXTEND
>     Assert(blocknum >= mdnblocks(reln, forknum));
> #endif
>
>     /*
>     * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
>     * more --- we mustn't create a block whose number actually is
>     * InvalidBlockNumber or larger.
>     */
>     if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
>         ereport(ERROR,
>                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>                 errmsg("cannot extend file \"%s\" beyond %u blocks",
>                         relpath(reln->smgr_rlocator, forknum),
>                         InvalidBlockNumber)));
>
>     while (remblocks > 0)
>     {
>         int            segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);

Hm - this shouldn't be an int - that's my fault, not yours...




> From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 26 Oct 2022 12:05:07 -0700
> Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index fa20fab5a2..6f50dbd212 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
>  }
>
>  void
> -BufferCheckOneLocalPin(Buffer buffer)
> +BufferCheckWePinOnce(Buffer buffer)
>
> This name is weird. Who is we?

The current backend. I.e. the function checks that the current backend pins
the buffer exactly once, rather than that *any* backend pins it once.

I now see that BufferIsPinned() is, IMO, misleadingly named, suggesting more
generality, even though it also just applies to pins by the current backend.


> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 5325ddb663..798c5b93a8 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> +bool
> +PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
> +{
> +    uint32        buf_state;
> +    Buffer        buffer = BufferDescriptorGetBuffer(buf_hdr);
> +    int            bufid = -(buffer + 1);
>
> You do
>   int            buffid = -buffer - 1;
> in UnpinLocalBuffer()
> They should be consistent.
>
>   int            bufid = -(buffer + 1);
>
> I think this version is better:
>
>   int            buffid = -buffer - 1;
>
> Since if buffer is INT_MAX, then the -(buffer + 1) version invokes
> undefined behavior while the -buffer - 1 version doesn't.

You are right! Not sure what I was doing there...

Ah - turns out there's pre-existing code in localbuf.c that does it that way :(
See at least MarkLocalBufferDirty().

We really need to wrap this in something, rather than open coding it all over
bufmgr.c/localbuf.c. I really dislike this indexing :(.
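
Something like this, perhaps (hypothetical helper, just to illustrate):

    static inline int
    LocalBufferIndex(Buffer buffer)
    {
        Assert(BufferIsLocal(buffer));
        return -buffer - 1;
    }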


> From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Fri, 17 Feb 2023 18:26:34 -0800
> Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately
>
> Previously we held buffer locks for two buffer mapping partitions at the same
> time to change the identity of buffers. Particularly for extending relations
> needing to hold the extension lock while acquiring a victim buffer is
> painful. By separating out the victim buffer acquisition, future commits will
> be able to change relation extensions to scale better.
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 3d0683593f..ea423ae484 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>
>      /*
>      * Buffer contents are currently invalid.  Try to obtain the right to
>      * start I/O.  If StartBufferIO returns false, then someone else managed
>      * to read it before we did, so there's nothing left for BufferAlloc() to
>      * do.
>      */
> -    if (StartBufferIO(buf, true))
> +    if (StartBufferIO(victim_buf_hdr, true))
>          *foundPtr = false;
>      else
>          *foundPtr = true;
>
> I know it was already like this, but since you edited the line already,
> can we just make this this now?
>
>     *foundPtr = !StartBufferIO(victim_buf_hdr, true);

Hm, I do think it's easier to review if largely unchanged code is just moved
around, rather also rewritten. So I'm hesitant to do that here.


> @@ -1595,6 +1413,237 @@ retry:
>      StrategyFreeBuffer(buf);
>  }
>
> +/*
> + * Helper routine for GetVictimBuffer()
> + *
> + * Needs to be called on a buffer with a valid tag, pinned, but without the
> + * buffer header spinlock held.
> + *
> + * Returns true if the buffer can be reused, in which case the buffer is only
> + * pinned by this backend and marked as invalid, false otherwise.
> + */
> +static bool
> +InvalidateVictimBuffer(BufferDesc *buf_hdr)
> +{
> +    /*
> +    * Clear out the buffer's tag and flags and usagecount.  This is not
> +    * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
> +    * doing anything with the buffer. But currently it's beneficial as the
> +    * pre-check for several linear scans of shared buffers just checks the
> +    * tag.
>
> I don't really understand the above comment -- mainly the last sentence.

To start with, it's s/checks/check/

"linear scans" is a reference to functions like DropRelationBuffers(), which
iterate over all buffers, and just check the tag for a match. If we leave the
tag around, it'll still work, as InvalidateBuffer() etc will figure out that
the buffer is invalid. But of course that's slower than just skipping the
buffer "early on".


> +static Buffer
> +GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
> +{
> +    BufferDesc *buf_hdr;
> +    Buffer        buf;
> +    uint32        buf_state;
> +    bool        from_ring;
> +
> +    /*
> +    * Ensure, while the spinlock's not yet held, that there's a free refcount
> +    * entry.
> +    */
> +    ReservePrivateRefCountEntry();
> +    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> +
> +    /* we return here if a prospective victim buffer gets used concurrently */
> +again:
>
> Why use goto instead of a loop here (again is the goto label)?

I find it way more readable this way. I'd use a loop if it were the common
case to loop, but it's the rare case, and for that I find the goto more
readable.



> @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
>  {
>      uint32        buf_state;
>
> I noticed that the comment above TerminateBufferIO() says
>  * TerminateBufferIO: release a buffer we were doing I/O on
>  *    (Assumptions)
>  *    My process is executing IO for the buffer
>
> Can we still say this is an assumption? What about when it is being
> cleaned up after being called from AbortBufferIO()

That hasn't really changed - it was already called by AbortBufferIO().

I think it's still correct, too. We must have marked the IO as being in
progress to get there.


> diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
> index 19b6241e45..fccc59b39d 100644
> --- a/src/backend/utils/resowner/resowner.c
> +++ b/src/backend/utils/resowner/resowner.c
> @@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
>
>      /* We have built-in support for remembering: */
>      ResourceArray bufferarr;    /* owned buffers */
> +    ResourceArray bufferioarr;    /* in-progress buffer IO */
>      ResourceArray catrefarr;    /* catcache references */
>      ResourceArray catlistrefarr;    /* catcache-list pins */
>      ResourceArray relrefarr;    /* relcache references */
> @@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
>
> Maybe worth mentioning in-progress buffer IO in resowner README? I know
> it doesn't claim to be exhaustive, so, up to you.

Hm. Given the few types of resources mentioned in the README, I don't think
it's worth doing so.


> Also, I realize that existing code in this file has the extraneous
> parentheses, but maybe it isn't worth staying consistent with that?
> as in: &(owner->bufferioarr)

I personally don't find it worth being consistent with that, but if you /
others think it is, I'd be ok with adapting to that.


> From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 1 Mar 2023 13:24:19 -0800
> Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
>  ExtendBufferedRel{By,To,}
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 3c95b87bca..4e07a5bc48 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> +/*
> + * Extend relation by multiple blocks.
> + *
> + * Tries to extend the relation by extend_by blocks. Depending on the
> + * availability of resources the relation may end up being extended by a
> + * smaller number of pages (unless an error is thrown, always by at least one
> + * page). *extended_by is updated to the number of pages the relation has been
> + * extended to.
> + *
> + * buffers needs to be an array that is at least extend_by long. Upon
> + * completion, the first extend_by array elements will point to a pinned
> + * buffer.
> + *
> + * If EB_LOCK_FIRST is part of flags, the first returned buffer is
> + * locked. This is useful for callers that want a buffer that is guaranteed to
> + * be empty.
>
> This should document what the returned BlockNumber is.

Ok.


> Also, instead of having extend_by and extended_by, how about just having
> one which is set by the caller to the desired number to extend by and
> then overwritten in this function to the value it successfully extended
> by.

I had it that way at first - but I think it turned out to be more confusing.


> It would be nice if the function returned the number it extended by
> instead of the BlockNumber.

It's not actually free to get the block number from a buffer (it causes more
sharing of the BufferDesc cacheline, which then makes modifications of the
cacheline more expensive). We should work on removing all those
BufferGetBlockNumber(). So I don't want to introduce a new function that
requires using BufferGetBlockNumber().

So I don't think this would be an improvement.


> +    Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
>
> Can we turn these into !=
>
>   Assert((eb.rel != NULL) != (eb.smgr != NULL));
>
> since it is easier to understand.

Done.


> + * Extend the relation so it is at least extend_to blocks large, read buffer
>
> Use of "read buffer" here is confusing. We only read the block if, after
> we try extending the relation, someone else already did so and we have
> to read the block they extended in, right?

That's one case, yes. I think there's also some unfortunate other case that
I'd like to get rid of. See my archeology at
https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de



> +        uint32 num_pages = lengthof(buffers);
> +        BlockNumber first_block;
> +
> +        if ((uint64) current_size + num_pages > extend_to)
> +            num_pages = extend_to - current_size;
> +
> +        first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
> +                                             num_pages, extend_to,
> +                                             buffers, &extended_by);
> +
> +        current_size = first_block + extended_by;
> +        Assert(current_size <= extend_to);
> +        Assert(num_pages != 0 || current_size >= extend_to);
> +
> +        for (int i = 0; i < extended_by; i++)
> +        {
> +            if (first_block + i != extend_to - 1)
>
> Is there a way we could avoid pinning these other buffers to begin with
> (e.g. passing a parameter to ExtendBufferedRelCommon())

We can't avoid pinning them. We could make ExtendBufferedRelCommon() release
them though - but I'm not sure that'd be an improvement.  I actually had a
flag for that temporarily, but



> +    if (buffer == InvalidBuffer)
> +    {
> +        bool hit;
> +
> +        Assert(extended_by == 0);
> +        buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
> +                                  fork, extend_to - 1, mode, strategy,
> +                                  &hit);
> +    }
> +
> +    return buffer;
> +}
>
> Do we use compound literals? Here, this could be:
>
>         buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
>                                   fork, extend_to - 1, mode, strategy,
>                                   &(bool) {0});
>
> To eliminate the extraneous hit variable.

We do use compound literals in a few places. However, I don't think it's a
good idea to pass a pointer to a temporary.  At least I need to look up the
lifetime rules for those every time. And this isn't a huge win, so I wouldn't
go for it here.
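
(For the record, the rule: a block-scope compound literal has automatic
storage duration associated with its enclosing block, so e.g.:

    bool       *p;

    {
        p = &(bool) {0};    /* literal lives until the end of this block */
    }
    /* *p is dangling here */

In the suggested call the literal's block outlives the call, so it would be
fine - but one has to check every time.)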



>  /*
>   * ReadBuffer_common -- common logic for all ReadBuffer variants
> @@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>      bool        found;
>      IOContext    io_context;
>      IOObject    io_object;
> -    bool        isExtend;
>      bool        isLocalBuf = SmgrIsTemp(smgr);
>
>      *hit = false;
>
> +    /*
> +    * Backward compatibility path, most code should use
> +    * ExtendRelationBuffered() instead, as acquiring the extension lock
> +    * inside ExtendRelationBuffered() scales a lot better.
>
> Think these are old function names in the comment

Indeed.


> +static BlockNumber
> +ExtendBufferedRelShared(ExtendBufferedWhat eb,
> +                        ForkNumber fork,
> +                        BufferAccessStrategy strategy,
> +                        uint32 flags,
> +                        uint32 extend_by,
> +                        BlockNumber extend_upto,
> +                        Buffer *buffers,
> +                        uint32 *extended_by)
> +{
> +    BlockNumber first_block;
> +    IOContext    io_context = IOContextForStrategy(strategy);
> +
> +    LimitAdditionalPins(&extend_by);
> +
> +    /*
> +    * Acquire victim buffers for extension without holding extension lock.
> +    * Writing out victim buffers is the most expensive part of extending the
> +    * relation, particularly when doing so requires WAL flushes. Zeroing out
> +    * the buffers is also quite expensive, so do that before holding the
> +    * extension lock as well.
> +    *
> +    * These pages are pinned by us and not valid. While we hold the pin they
> +    * can't be acquired as victim buffers by another backend.
> +    */
> +    for (uint32 i = 0; i < extend_by; i++)
> +    {
> +        Block        buf_block;
> +
> +        buffers[i] = GetVictimBuffer(strategy, io_context);
> +        buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
> +
> +        /* new buffers are zero-filled */
> +        MemSet((char *) buf_block, 0, BLCKSZ);
> +    }
> +
> +    /*
> +    * Lock relation against concurrent extensions, unless requested not to.
> +    *
> +    * We use the same extension lock for all forks. That's unnecessarily
> +    * restrictive, but currently extensions for forks don't happen often
> +    * enough to make it worth locking more granularly.
> +    *
> +    * Note that another backend might have extended the relation by the time
> +    * we get the lock.
> +    */
> +    if (!(flags & EB_SKIP_EXTENSION_LOCK))
> +    {
> +        LockRelationForExtension(eb.rel, ExclusiveLock);
> +        eb.smgr = RelationGetSmgr(eb.rel);
> +    }
> +
> +    /*
> +    * If requested, invalidate size cache, so that smgrnblocks asks the
> +    * kernel.
> +    */
> +    if (flags & EB_CLEAR_SIZE_CACHE)
> +        eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
>
> I don't see this in master, is it new?

Not really - it's just elsewhere. See vm_extend() and fsm_extend(). I could
move this part back into "Convert a few places to ExtendBufferedRelTo", but I
don't think that'd be better.
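
E.g. vm_extend() on master has, roughly:

    /* Invalidate cache so that smgrnblocks() asks the kernel. */
    reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;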


> + * Flags influencing the behaviour of ExtendBufferedRel*
> + */
> +typedef enum ExtendBufferedFlags
> +{
> +    /*
> +    * Don't acquire extension lock. This is safe only if the relation isn't
> +    * shared, an access exclusive lock is held or if this is the startup
> +    * process.
> +    */
> +    EB_SKIP_EXTENSION_LOCK = (1 << 0),
> +
> +    /* Is this extension part of recovery? */
> +    EB_PERFORMING_RECOVERY = (1 << 1),
> +
> +    /*
> +    * Should the fork be created if it does not currently exist? This likely
> +    * only ever makes sense for relation forks.
> +    */
> +    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
> +
> +    /* Should the first (possibly only) return buffer be returned locked? */
> +    EB_LOCK_FIRST = (1 << 3),
> +
> +    /* Should the smgr size cache be cleared? */
> +    EB_CLEAR_SIZE_CACHE = (1 << 4),
> +
> +    /* internal flags follow */
>
> I don't understand what this comment means ("internal flags follow")

Hm - just that the flags defined subsequently are for internal use, not for
callers to specify.





> + */
> +static int
> +heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples,
> +                        Size saveFreeSpace)
> +{
> +    size_t        page_avail;
> +    int            npages = 0;
> +
> +    page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
> +    npages++;
>
> can this not just be this:
>
> size_t        page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
> int            npages = 1;

Yes.


>
> From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 26 Oct 2022 14:14:11 -0700
> Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy()

> ---
>  src/backend/access/heap/hio.c | 285 +++++++++++++++++-----------------
>  1 file changed, 146 insertions(+), 139 deletions(-)
>
> diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
> index 65886839e7..48cfcff975 100644
> --- a/src/backend/access/heap/hio.c
> +++ b/src/backend/access/heap/hio.c
> @@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
>
> so in RelationGetBufferForTuple() up above where your changes start,
> there is this code
>
>
>     /*
>     * We first try to put the tuple on the same page we last inserted a tuple
>     * on, as cached in the BulkInsertState or relcache entry.  If that
>     * doesn't work, we ask the Free Space Map to locate a suitable page.
>     * Since the FSM's info might be out of date, we have to be prepared to
>     * loop around and retry multiple times. (To insure this isn't an infinite
>     * loop, we must update the FSM with the correct amount of free space on
>     * each page that proves not to be suitable.)  If the FSM has no record of
>     * a page with enough free space, we give up and extend the relation.
>     *
>     * When use_fsm is false, we either put the tuple onto the existing target
>     * page or extend the relation.
>     */
>     if (bistate && bistate->current_buf != InvalidBuffer)
>     {
>         targetBlock = BufferGetBlockNumber(bistate->current_buf);
>     }
>     else
>         targetBlock = RelationGetTargetBlock(relation);
>
>     if (targetBlock == InvalidBlockNumber && use_fsm)
>     {
>         /*
>         * We have no cached target page, so ask the FSM for an initial
>         * target.
>         */
>         targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
>     }
>
> And I was thinking: ReadBufferBI() only has one caller now
> (RelationGetBufferForTuple()), and this caller basically has already
> checked for the case handled inside ReadBufferBI() (the code I pasted
> above)
>
>     /* If we have the desired block already pinned, re-pin and return it */
>     if (bistate->current_buf != InvalidBuffer)
>     {
>         if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
>         {
>             /*
>             * Currently the LOCK variants are only used for extending
>             * relation, which should never reach this branch.
>             */
>             Assert(mode != RBM_ZERO_AND_LOCK &&
>                   mode != RBM_ZERO_AND_CLEANUP_LOCK);
>
>             IncrBufferRefCount(bistate->current_buf);
>             return bistate->current_buf;
>         }
>         /* ... else drop the old buffer */
>
> So, I was thinking maybe there is some way to inline the logic for
> ReadBufferBI(), because I think it would feel more streamlined to me.

I don't really see how - I'd welcome suggestions?


> @@ -558,18 +477,46 @@ loop:
>              ReleaseBuffer(buffer);
>          }
>
> Oh, and I forget which commit introduced BulkInsertState->next_free and
> last_free, but I remember thinking that it didn't seem to fit with the
> other parts of that commit.

I'll move it into the one using ExtendBufferedRelBy().


> -        /* Without FSM, always fall out of the loop and extend */
> -        if (!use_fsm)
> -            break;
> +        if (bistate
> +            && bistate->next_free != InvalidBlockNumber
> +            && bistate->next_free <= bistate->last_free)
> +        {
> +            /*
> +            * We bulk extended the relation before, and there are still some
> +            * unused pages from that extension, so we don't need to look in
> +            * the FSM for a new page. But do record the free space from the
> +            * last page, somebody might insert narrower tuples later.
> +            */
>
> Why couldn't we have found out that we bulk-extended before and get the
> block from there up above the while loop?

I'm not quite sure I follow - above the loop there might still have been space
on the prior page? We also need the ability to loop if the space has been used
since.

I guess there's an argument for also checking above the loop, but I don't
think that'd currently ever be reachable.


> +            {
> +                bistate->next_free = InvalidBlockNumber;
> +                bistate->last_free = InvalidBlockNumber;
> +            }
> +            else
> +                bistate->next_free++;
> +        }
> +        else if (!use_fsm)
> +        {
> +            /* Without FSM, always fall out of the loop and extend */
> +            break;
> +        }
>
> It would be nice to have a comment explaining why this is in its own
> else if instead of breaking earlier (i.e. !use_fsm is still a valid case
> in the if branch above it)

I'm not quite following. Breaking where earlier?

Note that that branch is old code, it's just that a new way of getting a page
that potentially has free space is preceding it.
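
To sketch the resulting priority of page sources inside the loop
(simplified from the quoted patch; RecordAndGetPageWithFreeSpace() is the
pre-existing FSM call):

    if (bistate
        && bistate->next_free != InvalidBlockNumber
        && bistate->next_free <= bistate->last_free)
    {
        /* 1) leftover empty pages from our own earlier bulk extension */
        targetBlock = bistate->next_free;
        if (bistate->next_free >= bistate->last_free)
        {
            bistate->next_free = InvalidBlockNumber;
            bistate->last_free = InvalidBlockNumber;
        }
        else
            bistate->next_free++;
    }
    else if (!use_fsm)
    {
        /* 2) without FSM, always fall out of the loop and extend (old code) */
        break;
    }
    else
    {
        /* 3) otherwise record this page's free space and ask the FSM again */
        targetBlock = RecordAndGetPageWithFreeSpace(relation, targetBlock,
                                                    pageFreeSpace,
                                                    targetFreeSpace);
    }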



> we can get rid of needLock and waitcount variables like this
>
> +#define MAX_BUFFERS 64
> +        Buffer        victim_buffers[MAX_BUFFERS];
> +        BlockNumber firstBlock = InvalidBlockNumber;
> +        BlockNumber firstBlockFSM = InvalidBlockNumber;
> +        BlockNumber curBlock;
> +        uint32        extend_by_pages;
> +        uint32        no_fsm_pages;
> +        uint32        waitcount;
> +
> +        extend_by_pages = num_pages;
> +
> +        /*
> +        * Multiply the number of pages to extend by the number of waiters. Do
> +        * this even if we're not using the FSM, as it does relieve
> +        * contention. Pages will be found via bistate->next_free.
> +        */
> +        if (needLock)
> +            waitcount = RelationExtensionLockWaiterCount(relation);
> +        else
> +            waitcount = 0;
> +        extend_by_pages += extend_by_pages * waitcount;
>
>         if (!RELATION_IS_LOCAL(relation))
>             extend_by_pages += extend_by_pages *
>                 RelationExtensionLockWaiterCount(relation);

I guess I find it useful to be able to quickly add logging messages for stuff
like this. I don't think local variables are as bad as you make them out to be
:)
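
For example, with the locals in place, instrumenting the decision is a
one-liner (hypothetical debugging aid, not part of the patch):

    elog(DEBUG1, "bulk extending by %u pages (%u extension lock waiters)",
         extend_by_pages, waitcount);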

Thanks for the review!

Andres



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-27 15:32:47 +0900, Kyotaro Horiguchi wrote:
> At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund <andres@anarazel.de> wrote in
> > Hi,
> >
> > Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the
> > relation extension related code in hio.c back into its own function.
> >
> > While reviewing the hio.c code, I did realize that too much stuff is done
> > while holding the buffer lock. See also the pre-existing issue
> > https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de
>
> 0001, 0002 looks fine to me.
>
> 0003 adds the new function FileFallocate, but we already have
> AllocateFile. Although fd.c contains functions with varying word
> orders, it could be confusing that closely named functions follow
> different naming conventions.

The syscall is named fallocate, I don't think we'd gain anything by inventing
a different name for it? Given that there's a number of File$syscall
operations, I think it's clear enough that it just fits into that. Unless you
have a better proposal?


> +    /*
> +     * Return in cases of a "real" failure, if fallocate is not supported,
> +     * fall through to the FileZero() backed implementation.
> +     */
> +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> +        return returnCode;
>
> I'm not entirely sure, but man 2 fallocate says that ENOSYS can also
> be returned. Some googling indicates that ENOSYS might need the same
> treatment as EOPNOTSUPP. However, I'm not clear on why man
> posix_fallocate doesn't mention the former.

posix_fallocate() and its errors are specified by posix, I guess. I think
glibc etc will map ENOSYS to EOPNOTSUPP.
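
If we wanted to be defensive against a libc that passes ENOSYS through,
the fall-through check could cover it as well - illustrative only, not
what the patch currently does:

    /*
     * "Real" failures are returned to the caller; EINVAL, EOPNOTSUPP and
     * ENOSYS all mean the operation isn't supported here, so fall through
     * to the FileZero() backed implementation.
     */
    if (returnCode != EINVAL &&
        returnCode != EOPNOTSUPP &&
        returnCode != ENOSYS)
        return returnCode;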

I really dislike this bit from the posix_fallocate manpage:

EINVAL offset was less than 0, or len was less than or equal to 0, or the underlying filesystem does not support the
operation.

Why oh why would you add the "or .." portion into EINVAL, when there's also
EOPNOTSUPP?

>
> +        (returnCode != EINVAL && returnCode != EINVAL))
> :)
>
>
> FileGetRawDesc(File file)
>  {
>      Assert(FileIsValid(file));
> +
> +    if (FileAccess(file) < 0)
> +        return -1;
> +
>
> The function's comment is provided below.
>
> > * The returned file descriptor will be valid until the file is closed, but
> > * there are a lot of things that can make that happen.  So the caller should
> > * be careful not to do much of anything else before it finishes using the
> > * returned file descriptor.
>
> So, the responsibility to make sure the file is valid seems to lie
> with the callers, although I'm not sure, since there aren't any
> users of the function in the tree.

Except, as I think you realized as well, external callers *can't* call
FileAccess(), it's static.


> I'm unclear as to why FileSize omits the case lruLessRecently != file.

Not quite following - why would FileSize() deal with lruLessRecently itself?
Or do you mean why FileSize() uses FileIsNotOpen() itself, rather than relying
on FileAccess() doing that internally?


> When examining similar functions, such as FileGetRawFlags and
> FileGetRawMode, I'm puzzled to find that neither FileAccess() nor
> BasicOpenFilePerm() sets the struct members referred to by those
> functions.

Those aren't involved with the LRU mechanism, IIRC. Note that BasicOpenFilePerm()
returns an actual fd, not a File. So you can't call FileGetRawMode() on it. As
BasicOpenFilePerm() says:
 * This is exported for use by places that really want a plain kernel FD,
 * but need to be proof against running out of FDs. ...


I don't think FileAccess() needs to set those struct members, that's already
been done in PathNameOpenFilePerm().


> This makes my question the usefulness of these functions including
> FileGetRawDesc().

It's quite weird that we have FileGetRawDesc(), but don't provide a safe way
to use it...


> Regardless, since the
> patchset doesn't use FileGetRawDesc(), I don't believe the fix is
> necessary in this patch set.

Yea. It was used in an earlier version, but not anymore.



>
> +    if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
>
> I'm not sure it is appropriate to assume InvalidBlockNumber equals
> MaxBlockNumber + 1 in this context.

Hm. This is just the multi-block equivalent of what mdextend() already
does. It's not pretty, indeed. I'm not sure there's really a better thing to
do for mdzeroextend(), given the mdextend() precedent? mdzeroextend() (just as
mdextend()) will be called with blockNum == InvalidBlockNumber, if you try to
extend past the size limit.


>
> +         * However, we don't use FileAllocate() for small extensions, as it
> +         * defeats delayed allocation on some filesystems. Not clear where
> +         * that decision should be made though? For now just use a cutoff of
> +         * 8, anything between 4 and 8 worked OK in some local testing.
>
> The choice is quite similar to the one FileFallocate() makes. However, I'm
> not sure FileFallocate() itself should be doing this.

I'm not following - there's no such choice in FileFallocate()? Do you mean
that FileFallocate() also falls back to FileZero()? I don't think those are
comparable.

We don't want the code outside of fd.c to have to implement a fallback for
platforms that don't have fallocate (or something similar), that's why
FileFallocate() falls back to FileZero().

Here we care about not calling FileFallocate() in too small increments, when
the relation might extend further. If we somehow knew in mdzeroextend() that
the file won't be extended further, it'd be a good idea to call
FileFallocate(), even if just for a single block - it'd prevent the kernel
from wasting memory for delayed allocation.  But unfortunately we don't know
if it's the final size, hence the heuristic.
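
Concretely, the shape of the logic ends up roughly like this (sketch; the
cutoff of 8 and the FileZero() fallback match the quoted comment and patch):

    if (numblocks > 8)
    {
        /* large extension: reserve space without dirtying the page cache */
        ret = FileFallocate(v->mdfd_vfd,
                            seekpos, (off_t) BLCKSZ * numblocks,
                            WAIT_EVENT_DATA_FILE_EXTEND);
    }
    else
    {
        /* small extension: write zeroes, preserving delayed allocation */
        ret = FileZero(v->mdfd_vfd,
                       seekpos, (off_t) BLCKSZ * numblocks,
                       WAIT_EVENT_DATA_FILE_EXTEND);
    }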

Does that make sense?


Thanks!

Andres



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Melanie Plageman
Date:
On Tue, Mar 28, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote:
> > +    {
> > +        /* if errno is unset, assume problem is no disk space */
> > +        if (errno == 0)
> > +            errno = ENOSPC;
> > +        return -1;
> > +    }
> >
> > +int
> > +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
> > +{
> > +#ifdef HAVE_POSIX_FALLOCATE
> > +    int            returnCode;
> > +
> > +    Assert(FileIsValid(file));
> > +    returnCode = FileAccess(file);
> > +    if (returnCode < 0)
> > +        return returnCode;
> > +
> > +    pgstat_report_wait_start(wait_event_info);
> > +    returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
> > +    pgstat_report_wait_end();
> > +
> > +    if (returnCode == 0)
> > +        return 0;
> > +
> > +    /* for compatibility with %m printing etc */
> > +    errno = returnCode;
> > +
> > +    /*
> > +    * Return in cases of a "real" failure, if fallocate is not supported,
> > +    * fall through to the FileZero() backed implementation.
> > +    */
> > +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> > +        return returnCode;
> >
> > I'm pretty sure you can just delete the below if statement
> >
> > +    if (returnCode == 0 ||
> > +        (returnCode != EINVAL && returnCode != EINVAL))
> > +        return returnCode;
>
> Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
> FileFallocate() succeeded or failed (rather than not being supported)?

Uh...I'm confused...maybe my eyes aren't working. If returnCode was 0,
you already would have returned and if returnCode wasn't EINVAL, you
also already would have returned.
Not to mention (returnCode != EINVAL && returnCode != EINVAL) contains
two identical operands.

> > +void
> > +mdzeroextend(SMgrRelation reln, ForkNumber forknum,
> > +            BlockNumber blocknum, int nblocks, bool skipFsync)
> >
> > So, I think there are  a few too many local variables in here, and it
> > actually makes it more confusing.
> > Assuming you would like to keep the input parameters blocknum and
> > nblocks unmodified for debugging/other reasons, here is a suggested
> > refactor of this function
>
> I'm mostly adopting this.
>
>
> > Also, I think you can combine the two error cases (I don't know if the
> > user cares what you were trying to extend the file with).
>
> Hm. I do find it a somewhat useful distinction for figuring out problems - we
> haven't used posix_fallocate for files so far, it seems plausible we'd hit
> some portability issues.  We could make it an errdetail(), I guess?

I think that would be clearer.
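
Something along these lines, presumably (the errdetail() wording is
invented for illustration):

    ereport(ERROR,
            errcode_for_file_access(),
            errmsg("could not extend file \"%s\": %m",
                   FilePathName(v->mdfd_vfd)),
            errdetail("Failed while trying to reserve space with FileFallocate()."),
            errhint("Check free disk space."));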

> > From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Wed, 26 Oct 2022 12:05:07 -0700
> > Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()
> >
> > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> > index fa20fab5a2..6f50dbd212 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
> >  }
> >
> >  void
> > -BufferCheckOneLocalPin(Buffer buffer)
> > +BufferCheckWePinOnce(Buffer buffer)
> >
> > This name is weird. Who is we?
>
> The current backend. I.e. the function checks that the current backend pins
> the buffer exactly once, rather than that *any* backend pins it once.
>
> I now see that BufferIsPinned() is, IMO, misleadingly named more generally,
> even though it also applies only to pins by the current backend.

Maybe there is a way to use "self" instead of a pronoun? But, if you
feel quite strongly about a pronoun, I think "we" implies more than one
backend, so "I" would be better.

> > @@ -1595,6 +1413,237 @@ retry:
> >      StrategyFreeBuffer(buf);
> >  }
> >
> > +/*
> > + * Helper routine for GetVictimBuffer()
> > + *
> > + * Needs to be called on a buffer with a valid tag, pinned, but without the
> > + * buffer header spinlock held.
> > + *
> > + * Returns true if the buffer can be reused, in which case the buffer is only
> > + * pinned by this backend and marked as invalid, false otherwise.
> > + */
> > +static bool
> > +InvalidateVictimBuffer(BufferDesc *buf_hdr)
> > +{
> > +    /*
> > +    * Clear out the buffer's tag and flags and usagecount.  This is not
> > +    * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
> > +    * doing anything with the buffer. But currently it's beneficial as the
> > +    * pre-check for several linear scans of shared buffers just checks the
> > +    * tag.
> >
> > I don't really understand the above comment -- mainly the last sentence.
>
> To start with, it's s/checks/check/
>
> "linear scans" is a reference to functions like DropRelationBuffers(), which
> iterate over all buffers, and just check the tag for a match. If we leave the
> tag around, it'll still work, as InvalidateBuffer() etc will figure out that
> the buffer is invalid. But of course that's slower than just skipping the
> buffer "early on".

Ah. I see the updated comment on your branch and find it to be more clear.

> > @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
> >  {
> >      uint32        buf_state;
> >
> I noticed that the comment above TerminateBufferIO() says
> >  * TerminateBufferIO: release a buffer we were doing I/O on
> >  *    (Assumptions)
> >  *    My process is executing IO for the buffer
> >
> > Can we still say this is an assumption? What about when it is being
> > cleaned up after being called from AbortBufferIO()
>
> That hasn't really changed - it was already called by AbortBufferIO().
>
> I think it's still correct, too. We must have marked the IO as being in
> progress to get there.

Oh, no I meant the "my process is executing IO for the buffer" --
couldn't another backend clear IO_IN_PROGRESS (i.e. not the original one
who set it on all the victim buffers)?

> > Also, I realize that existing code in this file has the extraneous
> parentheses, but maybe it isn't worth staying consistent with that?
> > as in: &(owner->bufferioarr)
>
> I personally don't find it worth being consistent with that, but if you /
> others think it is, I'd be ok with adapting to that.

Right, so I'm saying you should remove the extraneous parentheses in the
code you added.

> > From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Wed, 1 Mar 2023 13:24:19 -0800
> > Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
> >  ExtendBufferedRel{By,To,}
> > + * Flags influencing the behaviour of ExtendBufferedRel*
> > + */
> > +typedef enum ExtendBufferedFlags
> > +{
> > +    /*
> > +    * Don't acquire extension lock. This is safe only if the relation isn't
> > +    * shared, an access exclusive lock is held or if this is the startup
> > +    * process.
> > +    */
> > +    EB_SKIP_EXTENSION_LOCK = (1 << 0),
> > +
> > +    /* Is this extension part of recovery? */
> > +    EB_PERFORMING_RECOVERY = (1 << 1),
> > +
> > +    /*
> > +    * Should the fork be created if it does not currently exist? This likely
> > +    * only ever makes sense for relation forks.
> > +    */
> > +    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
> > +
> > +    /* Should the first (possibly only) return buffer be returned locked? */
> > +    EB_LOCK_FIRST = (1 << 3),
> > +
> > +    /* Should the smgr size cache be cleared? */
> > +    EB_CLEAR_SIZE_CACHE = (1 << 4),
> > +
> > +    /* internal flags follow */
> >
> > I don't understand what this comment means ("internal flags follow")
>
> Hm - just that the flags defined subsequently are for internal use, not for
> callers to specify.

If EB_LOCK_TARGET is the only one of these, it might be more clear, for
now, to just say "an internal flag" or "for internal use" above
EB_LOCK_TARGET, since it is the only one.

> > -        /* Without FSM, always fall out of the loop and extend */
> > -        if (!use_fsm)
> > -            break;
> > +        if (bistate
> > +            && bistate->next_free != InvalidBlockNumber
> > +            && bistate->next_free <= bistate->last_free)
> > +        {
> > +            /*
> > +            * We bulk extended the relation before, and there are still some
> > +            * unused pages from that extension, so we don't need to look in
> > +            * the FSM for a new page. But do record the free space from the
> > +            * last page, somebody might insert narrower tuples later.
> > +            */
> >
> > Why couldn't we have found out that we bulk-extended before and get the
> > block from there up above the while loop?
>
> I'm not quite sure I follow - above the loop there might still have been space
> on the prior page? We also need the ability to loop if the space has been used
> since.
>
> I guess there's an argument for also checking above the loop, but I don't
> think that'd currently ever be reachable.

My idea was that directly below this code in RelationGetBufferForTuple():

    if (bistate && bistate->current_buf != InvalidBuffer)
        targetBlock = BufferGetBlockNumber(bistate->current_buf);
    else
        targetBlock = RelationGetTargetBlock(relation);

We could check bistate->next_free instead of checking the freespace map first.

But, perhaps we couldn't hit this because we would already have set
current_buf if we had a next_free?

- Melanie



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-29 20:51:04 -0400, Melanie Plageman wrote:
> > > +    if (returnCode == 0)
> > > +        return 0;
> > > +
> > > +    /* for compatibility with %m printing etc */
> > > +    errno = returnCode;
> > > +
> > > +    /*
> > > +    * Return in cases of a "real" failure, if fallocate is not supported,
> > > +    * fall through to the FileZero() backed implementation.
> > > +    */
> > > +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> > > +        return returnCode;
> > >
> > > I'm pretty sure you can just delete the below if statement
> > >
> > > +    if (returnCode == 0 ||
> > > +        (returnCode != EINVAL && returnCode != EINVAL))
> > > +        return returnCode;
> >
> > Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
> > FileFallocate() succeeded or failed (rather than not being supported)?
> 
> Uh...I'm confused...maybe my eyes aren't working. If returnCode was 0,
> you already would have returned and if returnCode wasn't EINVAL, you
> also already would have returned.
> Not to mention (returnCode != EINVAL && returnCode != EINVAL) contains
> two identical operands.

I'm afraid it was not your eyes that weren't working...


> > >  void
> > > -BufferCheckOneLocalPin(Buffer buffer)
> > > +BufferCheckWePinOnce(Buffer buffer)
> > >
> > > This name is weird. Who is we?
> >
> > The current backend. I.e. the function checks that the current backend pins
> > the buffer exactly once, rather than that *any* backend pins it once.
> >
> > I now see that BufferIsPinned() is, IMO, misleadingly named more generally,
> > even though it also applies only to pins by the current backend.
> 
> Maybe there is a way to use "self" instead of a pronoun? But, if you
> feel quite strongly about a pronoun, I think "we" implies more than one
> backend, so "I" would be better.

I have no strong feelings around this in any form :)




> > > @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
> > >  {
> > >      uint32        buf_state;
> > >
> > > I noticed that the comment above TerminateBufferIO() says
> > >  * TerminateBufferIO: release a buffer we were doing I/O on
> > >  *    (Assumptions)
> > >  *    My process is executing IO for the buffer
> > >
> > > Can we still say this is an assumption? What about when it is being
> > > cleaned up after being called from AbortBufferIO()
> >
> > That hasn't really changed - it was already called by AbortBufferIO().
> >
> > I think it's still correct, too. We must have marked the IO as being in
> > progress to get there.
> 
> Oh, no I meant the "my process is executing IO for the buffer" --
> couldn't another backend clear IO_IN_PROGRESS (i.e. not the original one
> who set it on all the victim buffers)?

No. Or at least not yet ;) - with AIO we will... Only the IO issuing backend
currently is allowed to reset IO_IN_PROGRESS.


> > > +    /* Should the smgr size cache be cleared? */
> > > +    EB_CLEAR_SIZE_CACHE = (1 << 4),
> > > +
> > > +    /* internal flags follow */
> > >
> > > I don't understand what this comment means ("internal flags follow")
> >
> > Hm - just that the flags defined subsequently are for internal use, not for
> > callers to specify.
> 
> If EB_LOCK_TARGET is the only one of these, it might be more clear, for
> now, to just say "an internal flag" or "for internal use" above
> EB_LOCK_TARGET, since it is the only one.

I am quite certain it won't be the only one...


> > > -        /* Without FSM, always fall out of the loop and extend */
> > > -        if (!use_fsm)
> > > -            break;
> > > +        if (bistate
> > > +            && bistate->next_free != InvalidBlockNumber
> > > +            && bistate->next_free <= bistate->last_free)
> > > +        {
> > > +            /*
> > > +            * We bulk extended the relation before, and there are still some
> > > +            * unused pages from that extension, so we don't need to look in
> > > +            * the FSM for a new page. But do record the free space from the
> > > +            * last page, somebody might insert narrower tuples later.
> > > +            */
> > >
> > > Why couldn't we have found out that we bulk-extended before and get the
> > > block from there up above the while loop?
> >
> > I'm not quite sure I follow - above the loop there might still have been space
> > on the prior page? We also need the ability to loop if the space has been used
> > since.
> >
> > I guess there's an argument for also checking above the loop, but I don't
> > think that'd currently ever be reachable.
> 
> My idea was that directly below this code in RelationGetBufferForTuple():
> 
>     if (bistate && bistate->current_buf != InvalidBuffer)
>         targetBlock = BufferGetBlockNumber(bistate->current_buf);
>     else
>         targetBlock = RelationGetTargetBlock(relation);
> 
> We could check bistate->next_free instead of checking the freespace map first.
> 
> But, perhaps we couldn't hit this because we would have already have set
> current_buf if we had a next_free?

Correct. I think it might be worth doing a larger refactoring of that function
at some point not too far away...

It's definitely somewhat sad that we spend time locking the buffer, rechecking
pins, etc., for callers like heap_multi_insert() and heap_update() that
already know that the page is full. But that seems independent enough
that I'd not tackle it now.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

Attached is v6. Changes:

- Try to address Melanie and Horiguchi-san's review. I think there's one or
  two further things that need to be done

- Avoided inserting newly extended pages into the FSM while holding a buffer
  lock. If we need to do so, we now drop the buffer lock and recheck if there
  still is space (very very likely). See also [1]. I use the infrastructure
  introduced over in that thread in this patchset.

- Lots of comment and commit message polishing. More needed, particularly for
  the latter, but ...

- Added a patch to fix the pre-existing undefined behaviour in localbuf.c that
  Melanie pointed out. Plan to commit that soon.

- Added a patch to fix some pre-existing DO_DB() format code issues. Plan to
  commit that soon.


I did some benchmarking on "bufmgr: Acquire and clean victim buffer
separately" in isolation. For workloads that do a *lot* of reads, that proves
to be a substantial benefit on its own.  For the, obviously unrealistically
extreme, workload of N backends doing
  SELECT pg_prewarm('pgbench_accounts', 'buffer');
in a scale 100 database (with a 1281MB pgbench_accounts) and shared_buffers of
128MB, I see > 2x gains at 128, 512 clients.  Of course realistic workloads
will have much smaller gains, but it's still good to see.


Looking at the patchset, I am mostly happy with the breakdown into individual
commits. However "bufmgr: Move relation extension handling into
ExtendBufferedRel{By,To,}" is quite large. But I don't quite see how to break
it into smaller pieces without making things awkward (e.g. due to static
functions being unused, or temporarily duplicating the code doing relation
extensions).


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
John Naylor
Date:

On Thu, Mar 30, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> Attached is v6. Changes:

0006:

+        ereport(ERROR,
+            errcode_for_file_access(),
+            errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
+                 FilePathName(v->mdfd_vfd)),
+            errhint("Check free disk space."));

Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was used in FileFallocate().

--
John Naylor
EDB: http://www.enterprisedb.com

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-30 12:28:57 +0700, John Naylor wrote:
> On Thu, Mar 30, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> > Attached is v6. Changes:
> 
> 0006:
> 
> +        ereport(ERROR,
> +            errcode_for_file_access(),
> +            errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
> +                 FilePathName(v->mdfd_vfd)),
> +            errhint("Check free disk space."));
> 
> Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was
> used in FileFallocate().

Fair point. I would however like to see a different error message for the two
ways of extending, at least initially. What about referencing FileFallocate()?

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-29 20:02:33 -0700, Andres Freund wrote:
> Attached is v6. Changes:

Attached is v7. Not much in the way of changes:
- polished a lot of the commit messages
- reordered commits to not be blocked as immediately by
  https://www.postgresql.org/message-id/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de
- Used the new relation extension function in two more places (finding an
  independent bug on the way), not sure why I didn't convert those earlier...

I'm planning to push the patches up to the hio.c changes soon, unless somebody
would like me to hold off.

After that I'm planning to wait for a buildfarm cycle, and push the changes
necessary to use bulk extension in hio.c (the main win).

I might split the patch to use ExtendBufferedRelTo() into two, one for
vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more
complicated and has more of a complicated history (see [1]).

Greetings,

Andres Freund


[1] https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
John Naylor
Date:

On Wed, Apr 5, 2023 at 7:32 AM Andres Freund <andres@anarazel.de> wrote:

> > Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was
> > used in FileFallocate().
>
> Fair point. I would however like to see a different error message for the two
> ways of extending, at least initially. What about referencing FileFallocate()?

Seems logical.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-04 17:39:45 -0700, Andres Freund wrote:
> I'm planning to push the patches up to the hio.c changes soon, unless somebody
> would like me to hold off.

Done that.


> After that I'm planning to wait for a buildfarm cycle, and push the changes
> necessary to use bulk extension in hio.c (the main win).

Working on that. Might end up being tomorrow.


> I might split the patch to use ExtendBufferedRelTo() into two, one for
> vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more
> complicated and has more of a complicated history (see [1]).

I've pushed the vm_extend() and fsm_extend() piece, and did split out the
xlogutils.c case.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-05 18:46:16 -0700, Andres Freund wrote:
> On 2023-04-04 17:39:45 -0700, Andres Freund wrote:
> > After that I'm planning to wait for a buildfarm cycle, and push the changes
> > necessary to use bulk extension in hio.c (the main win).
> 
> Working on that. Might end up being tomorrow.

It did. So far no complaints from the buildfarm. Although I pushed the last
piece just now...

Besides editing the commit message, not a lot of changes between what I posted
last and what I pushed. A few typos and awkward sentences, code formatting,
etc. I did change the API of RelationAddBlocks() to set *did_unlock = false,
if it didn't unlock (and thus removed setting it in the caller).


> > I might split the patch to use ExtendBufferedRelTo() into two, one for
> > vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more
> > complicated and has more of a complicated history (see [1]).
> 
> I've pushed the vm_extend() and fsm_extend() piece, and did split out the
> xlogutils.c case.

Which I pushed, just now. I did perform manual testing, creating
disconnected segments on the standby and checking that everything behaves
well in that case.


I think it might be worth having a C test for some of the bufmgr.c API. Things
like testing that retrying a failed relation extension works the second time
round.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-06 18:15:14 -0700, Andres Freund wrote:
> I think it might be worth having a C test for some of the bufmgr.c API. Things
> like testing that retrying a failed relation extension works the second time
> round.

A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would
hopefully have been uncovered by such a test...

I guess we could even test this specific instance without a more complicated
framework.  Create table with some data, rename the file, checkpoint - should
fail, rename back, checkpoint - should succeed.

It's much harder to exercise the error paths inside the backend extending the
relation unfortunately, because we require the file to be opened rw before
doing much. And once the FD is open, removing the permissions doesn't help.
The least complicated approach I can see is creating directory quotas, but
that's quite filesystem specific...

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Alexander Lakhin
Date:
Hi Andres,

07.04.2023 11:39, Andres Freund wrote:
> Hi,
>
> On 2023-04-06 18:15:14 -0700, Andres Freund wrote:
>> I think it might be worth having a C test for some of the bufmgr.c API. Things
>> like testing that retrying a failed relation extension works the second time
>> round.
> A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would
> hopefully have been uncovered by such a test...

A few days later I've found a new defect introduced with 31966b151.
The following script:
echo "
CREATE TABLE tbl(id int);
INSERT INTO tbl(id) SELECT i FROM generate_series(1, 1000) i;
DELETE FROM tbl;
CHECKPOINT;
" | psql -q

sleep 2

grep -C2 'automatic vacuum of table ".*.tbl"' server.log

tf=$(psql -Aqt -c "SELECT format('%s/%s', pg_database.oid, relfilenode) FROM pg_database, pg_class WHERE datname = 
current_database() AND relname = 'tbl'")
ls -l "$PGDB/base/$tf"

pg_ctl -D $PGDB stop -m immediate
pg_ctl -D $PGDB -l server.log start

with the autovacuum enabled as follows:
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_naptime = 1

gives:
2023-04-11 20:56:56.261 MSK [675708] LOG:  checkpoint starting: immediate force wait
2023-04-11 20:56:56.324 MSK [675708] LOG:  checkpoint complete: wrote 900 buffers (5.5%); 0 WAL file(s) added, 0 
removed, 0 recycled; write=0.016 s, sync=0.034 s, total=0.063 s; sync files=252, longest=0.017 s, average=0.001 s; 
distance=4162 kB, estimate=4162 kB; lsn=0/1898588, redo lsn=0/1898550
2023-04-11 20:56:57.558 MSK [676060] LOG:  automatic vacuum of table "testdb.public.tbl": index scans: 0
         pages: 5 removed, 0 remain, 5 scanned (100.00% of total)
         tuples: 1000 removed, 0 remain, 0 are dead but not yet removable
-rw------- 1 law law 0 апр 11 20:56 .../tmpdb/base/16384/16385
waiting for server to shut down.... done
server stopped
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.

server stops with the following stack trace:
Core was generated by `postgres: startup recovering 000000010000000000000001                         '.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/790626' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140209454906240, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f850ec53476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f850ec397f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000557950889c0b in ExceptionalCondition (
     conditionName=0x557950a67680 "mode == RBM_NORMAL || mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_ON_ERROR",
     fileName=0x557950a673e8 "bufmgr.c", lineNumber=1008) at assert.c:66
#6  0x000055795064f2d0 in ReadBuffer_common (smgr=0x557952739f38, relpersistence=112 'p', forkNum=MAIN_FORKNUM,
     blockNum=4294967295, mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, hit=0x7fff22dd648f) at bufmgr.c:1008
#7  0x000055795064ebe7 in ReadBufferWithoutRelcache (rlocator=..., forkNum=MAIN_FORKNUM, blockNum=4294967295,
     mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, permanent=true) at bufmgr.c:800
#8  0x000055795021c0fa in XLogReadBufferExtended (rlocator=..., forknum=MAIN_FORKNUM, blkno=0,
     mode=RBM_ZERO_AND_CLEANUP_LOCK, recent_buffer=0) at xlogutils.c:536
#9  0x000055795021bd92 in XLogReadBufferForRedoExtended (record=0x5579526c4998, block_id=0 '\000', mode=RBM_NORMAL,
     get_cleanup_lock=true, buf=0x7fff22dd6598) at xlogutils.c:391
#10 0x00005579501783b1 in heap_xlog_prune (record=0x5579526c4998) at heapam.c:8726
#11 0x000055795017b7db in heap2_redo (record=0x5579526c4998) at heapam.c:9960
#12 0x0000557950215b34 in ApplyWalRecord (xlogreader=0x5579526c4998, record=0x7f85053d0120, replayTLI=0x7fff22dd6720)
     at xlogrecovery.c:1915
#13 0x0000557950215611 in PerformWalRecovery () at xlogrecovery.c:1746
#14 0x0000557950201ce3 in StartupXLOG () at xlog.c:5433
#15 0x00005579505cb6d2 in StartupProcessMain () at startup.c:267
#16 0x00005579505be9f7 in AuxiliaryProcessMain (auxtype=StartupProcess) at auxprocess.c:141
#17 0x00005579505ca2b5 in StartChildProcess (type=StartupProcess) at postmaster.c:5369
#18 0x00005579505c5224 in PostmasterMain (argc=3, argv=0x5579526c3e70) at postmaster.c:1455
#19 0x000055795047a97d in main (argc=3, argv=0x5579526c3e70) at main.c:200

As far as I can see, autovacuum removes pages from the table file, and this causes
the crash while replaying the record:
rmgr: Heap2       len (rec/tot):     60/   988, tx:          0, lsn: 0/01898600, prev 0/01898588, desc: PRUNE 
snapshotConflictHorizon 732 nredirected 0 ndead 226, blkref #0: rel 1663/16384/16385 blk 0 FPW

Best regards,
Alexander



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote:
> A few days later I've found a new defect introduced with 31966b151.

That's the same issue that Tom also just reported, at
https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us

Attached is my WIP fix, including a test.

Greetings,

Andres Freund

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Alexander Lakhin
Date:
12.04.2023 02:21, Andres Freund wrote:
> Hi,
>
> On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote:
>> A few days later I've found a new defect introduced with 31966b151.
> That's the same issue that Tom also just reported, at
> https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us
>
> Attached is my WIP fix, including a test.

Thanks for the fix. I can confirm that the issue is gone.
ReadBuffer_common() contains an Assert() that is similar to the fixed one,
but it looks unreachable for the WAL replay case after 26158b852.

Best regards,
Alexander



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-12 08:00:00 +0300, Alexander Lakhin wrote:
> 12.04.2023 02:21, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote:
> > > A few days later I've found a new defect introduced with 31966b151.
> > That's the same issue that Tom also just reported, at
> > https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us
> > 
> > Attached is my WIP fix, including a test.
> 
> Thanks for the fix. I can confirm that the issue is gone.
> ReadBuffer_common() contains an Assert(), that is similar to the fixed one,
> but it looks unreachable for the WAL replay case after 26158b852.

Good catch. I implemented it there too. As all of the modes are now supported,
I removed the assertion.

I also extended the test slightly to also test the case of dropped relations.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Muhammad Malik
Date:
Hi,

Could you please share repro steps for running these benchmarks? I am doing performance testing in this area and want to use the same benchmarks.

Thanks,
Muhammad

From: Andres Freund <andres@anarazel.de>
Sent: Friday, October 28, 2022 7:54 PM
To: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>; Thomas Munro <thomas.munro@gmail.com>; Melanie Plageman <melanieplageman@gmail.com>
Cc: Yura Sokolov <y.sokolov@postgrespro.ru>; Robert Haas <robertmhaas@gmail.com>
Subject: refactoring relation extension and BufferAlloc(), faster COPY
 
Hi,

I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock.  We acquire a victim buffer, if that
buffer is dirty, we potentially flush the WAL, then write out that
buffer. Then we zero out the buffer contents. Call smgrextend().

Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
   smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS


1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.


The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation lock and that relation
extension happens entirely within ReadBuffer* - there's no way to use a
narrower scope for the lock.


My fix for that is to add a dedicated function for extending relations, that
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.


To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffer are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().

This is similar to Yuri's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty gains, with no issues around the number of buffer mapping entries that
can exist etc.



Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgwrite() that BufferAlloc()
   does, nearly doubles the amount of writes. First the kernel ends up writing
   out all the zeroed out buffers after a while, then when we write out the
   actual buffer contents.

   The best fix for that seems to be to optionally use posix_fallocate() to
   reserve space, without dirtying pages in the kernel page cache. However, it
   looks like that's only beneficial when extending by multiple pages at once,
   because it ends up causing one filesystem-journal entry for each extension
   on at least some filesystems.

   I added 'smgrzeroextend()' that can extend by multiple blocks, without the
   caller providing a buffer to write out. When extending by 8 or more blocks,
   posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
   used to extend the file.


b) I found that it is quite beneficial to bulk-extend the relation with
   smgrextend() even without concurrency. The reason for that is primarily
   the aforementioned dirty buffers that our current extension method causes.

   One bit that stumped me for quite a while is to know how much to extend the
   relation by. RelationGetBufferForTuple() drives the decision whether / how
   much to bulk extend purely on the contention on the extension lock, which
   obviously does not work for non-concurrent workloads.

   After quite a while I figured out that we actually have good information on
   how much to extend by, at least for COPY /
   heap_multi_insert(). heap_multi_insert() can compute how much space is
   needed to store all tuples, and pass that on to
   RelationGetBufferForTuple().

   For that to be accurate we need to recompute that number whenever we use an
   already partially filled page. That's not great, but doesn't appear to be a
   measurable overhead.


c) Contention on the FSM and the pages returned by it is a serious bottleneck
   after a) and b).

   The biggest issue is that the current bulk insertion logic in hio.c enters
   all but one of the new pages into the freespacemap. That will immediately
   cause all the other backends to contend on the first few pages returned by
   the FSM, and cause contention on the FSM pages themselves.

   I've partially addressed that by using the information about the required
   number of pages from b). Whether we bulk insert or not, the pages
   we know we're going to need for one heap_multi_insert() don't need to be
   added to the FSM - we're going to use them anyway.

   I've stashed the number of free blocks in the BulkInsertState for now, but
   I'm not convinced that that's the right place.
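
   For reference, the bookkeeping currently looks roughly like this
   (next_free/last_free are the names used in the patch; the other fields
   are as in HEAD):

       typedef struct BulkInsertStateData
       {
           BufferAccessStrategy strategy;  /* our BULKWRITE strategy object */
           Buffer      current_buf;        /* current insertion target page */

           /*
            * Pages left over from the last bulk extension, known to be
            * empty. Consumed before asking the FSM for more space.
            */
           BlockNumber next_free;
           BlockNumber last_free;
       } BulkInsertStateData;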

   If I revert just this part, the "concurrent COPY into unlogged table"
   benchmark goes from ~240 tps to ~190 tps.


   Even after that change the FSM is a major bottleneck. Below I included
   benchmarks showing this by just removing the use of the FSM, but I haven't
   done anything further about it. The contention seems to be both from
   updating the FSM, as well as thundering-herd like symptoms from accessing
   the FSM.

   The update part could likely be addressed to some degree with a batch
   update operation updating the state for multiple pages.

   The access part could perhaps be addressed by adding an operation that gets
   a page and immediately marks it as fully used, so other backends won't also
   try to access it.
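
   In terms of API, the two ideas might look something like this - purely
   hypothetical signatures, nothing like them exists in the tree:

       /* batch update: record free space for a contiguous run of new pages */
       extern void RecordPageRangeFreeSpace(Relation rel,
                                            BlockNumber first_block,
                                            BlockNumber last_block,
                                            Size spaceAvail);

       /*
        * Get a page and immediately mark it as fully used, so concurrent
        * backends aren't all handed the same block.
        */
       extern BlockNumber GetAndConsumePageWithFreeSpace(Relation rel,
                                                         Size spaceNeeded);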



d) doing
                /* new buffers are zero-filled */
                MemSet((char *) bufBlock, 0, BLCKSZ);

  under the extension lock is surprisingly expensive on my two-socket
  workstation (but much less noticeable on my laptop).

  If I move the MemSet back under the extension lock, the "concurrent COPY
  into unlogged table" benchmark goes from ~240 tps to ~200 tps.


e) When running a few benchmarks for this email, I noticed that there was a
   sharp performance dropoff for the patched code for a pgbench -S -s100 on a
   database with 1GB s_b, starting between 512 and 1024 clients. This started with
   the patch only acquiring one buffer partition lock at a time. Lots of
   debugging ensued, resulting in [3].

   The problem isn't actually related to the change, it just makes it more
   visible, because the "lock chains" between two partitions reduce the
   average length of the wait queues substantially, by distributing them
   across more partitions.  [3] has a reproducer that's entirely independent
   of this patchset.




Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend and releases the extension lock. After that
buffer[s] are locked (depending on mode and an argument indicating the number
of blocks to be locked), and TerminateBufferIO() is called.
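
In pseudocode the sequence is roughly as follows (names are approximate;
InsertBufferIntoMapping() stands in for the BufTable insertion):

    LimitAdditionalPins(&extend_by);        /* cap how many we may pin */
    for (i = 0; i < extend_by; i++)
        victims[i] = GetVictimBuffer(...);  /* pinned, clean, invalid */

    LockRelationForExtension(rel, ExclusiveLock);
    first_block = smgrnblocks(smgr, fork);
    for (i = 0; i < extend_by; i++)
    {
        InsertBufferIntoMapping(victims[i], fork, first_block + i);
        StartBufferIO(victims[i], true);    /* mark as io-in-progress */
    }
    smgrzeroextend(smgr, fork, first_block, extend_by, false); /* or smgrextend() */
    UnlockRelationForExtension(rel, ExclusiveLock);

    for (i = 0; i < extend_by; i++)
    {
        if (i < blocks_to_lock)             /* per mode / caller argument */
            LockBuffer(BufferDescriptorGetBuffer(victims[i]),
                       BUFFER_LOCK_EXCLUSIVE);
        TerminateBufferIO(victims[i], false, BM_VALID);
    }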

This requires two new pieces of infrastructure:

First, pinning multiple buffers opens up the obvious danger that we might run
out of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows each
backend to pin a proportional share of buffers (although always allowing one,
as we do today).
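
A minimal sketch of that limit - essentially a proportional-share
computation (simplified):

    static void
    LimitAdditionalPins(uint32 *additional_pins)
    {
        uint32  max_backends;
        int     max_proportional_pins;

        if (*additional_pins <= 1)
            return;                         /* one pin is always allowed */

        max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
        max_proportional_pins = NBuffers / max_backends;

        /* each backend may pin its proportional share, but at least one */
        if (max_proportional_pins <= 0)
            max_proportional_pins = 1;

        if (*additional_pins > max_proportional_pins)
            *additional_pins = max_proportional_pins;
    }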

Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added a ResourceOwnerRememberBufferIO() etc to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (now released
inside ReleaseAuxProcessResources()).

In very extreme workloads (single backend doing a pgbench -S -s 100 against a
s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible, not sure if that's worth worrying about - we do such
allocations for the much more common pinning of buffers as well.


The new [Bulk]ExtendSharedRelationBuffered() currently have both a Relation
and a SMgrRelation argument, requiring at least one of them to be set. The
reason for that is on the one hand that LockRelationForExtension() requires a
relation and on the other hand, redo routines typically don't have a Relation
around (recovery doesn't require an extension lock).  That's not pretty, but
seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.



I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads, more below.


server "base" configuration:

max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and somewhat larger files ([2],
lots of data to write out due to ringbuffer).

To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.

s_b=24GB

test: unlogged_small_files, format: text, files: 1024, 9015MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       58.63   207     50.22   242     54.35   224
2       32.67   372     25.82   472     27.30   446
4       22.53   540     13.30   916     14.33   851
8       15.14   804     7.43    1640    7.48    1632
16      14.69   829     4.79    2544    4.50    2718
32      15.28   797     4.41    2763    3.32    3710
64      15.34   794     5.22    2334    3.06    4061
128     15.49   786     4.97    2452    3.13    3926
256     15.85   768     5.02    2427    3.26    3769
512     16.02   760     5.29    2303    3.54    3471

test: logged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       68.18   178     59.41   205     63.43   192
2       39.71   306     33.10   368     34.99   348
4       27.26   446     19.75   617     20.09   607
8       18.84   646     12.86   947     12.68   962
16      15.96   763     9.62    1266    8.51    1436
32      15.43   789     8.20    1486    7.77    1579
64      16.11   756     8.91    1367    8.90    1383
128     16.41   742     10.00   1218    9.74    1269
256     17.33   702     11.91   1023    10.89   1136
512     18.46   659     14.07   866     11.82   1049

test: unlogged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       63.27s  192     56.14   217     59.25   205
2       40.17s  303     29.88   407     31.50   386
4       27.57s  442     16.16   754     17.18   709
8       21.26s  573     11.89   1025    11.09   1099
16      21.25s  573     10.68   1141    10.22   1192
32      21.00s  580     10.72   1136    10.35   1178
64      20.64s  590     10.15   1200    9.76    1249
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       71.89s  169     65.57   217     69.09   69.09
2       47.36s  257     36.22   407     38.71   38.71
4       33.10s  368     21.76   754     22.78   22.78
8       26.62s  457     15.89   1025    15.30   15.30
16      24.89s  489     17.08   1141    15.20   15.20
32      25.15s  484     17.41   1136    16.14   16.14
64      26.11s  466     17.89   1200    16.76   16.76
128     skipped
256     skipped
512     skipped


Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.

test: unlogged_small_files, format: binary, files: 1024, 9508MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       34.14   357     28.04   434     29.46   413
2       22.67   537     14.42   845     14.75   826
4       16.63   732     7.62    1599    7.69    1587
8       13.48   904     4.36    2795    4.13    2959
16      14.37   848     3.78    3224    2.74    4493
32      14.79   823     4.20    2902    2.07    5974
64      14.76   825     5.03    2423    2.21    5561
128     14.95   815     4.36    2796    2.30    5343
256     15.18   802     4.31    2828    2.49    4935
512     15.41   790     4.59    2656    2.84    4327


s_b=4GB

test: unlogged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.55   194     54.22   224
2       37.11   328     28.94   421
4       25.97   469     16.42   742
8       20.01   609     11.92   1022
16      19.55   623     11.05   1102
32      19.34   630     11.27   1081
64      19.07   639     12.04   1012
128     19.22   634     12.27   993
256     19.34   630     12.28   992
512     19.60   621     11.74   1038

test: logged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.71   169     63.63   191
2       46.93   259     36.31   335
4       30.37   401     22.41   543
8       22.86   533     16.90   721
16      20.18   604     14.07   866
32      19.64   620     13.06   933
64      19.71   618     15.08   808
128     19.95   610     15.47   787
256     20.48   595     16.53   737
512     21.56   565     16.86   722

test: unlogged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.65   194     55.74   218
2       40.25   302     29.45   413
4       27.37   445     16.26   749
8       22.07   552     11.75   1037
16      21.29   572     10.64   1145
32      20.98   580     10.70   1139
64      20.65   590     10.21   1193
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.72   169     65.12   187
2       46.46   262     35.74   341
4       32.61   373     21.60   564
8       26.69   456     16.30   747
16      25.31   481     17.00   716
32      24.96   488     17.47   697
64      26.05   467     17.90   680
128     skipped
256     skipped
512     skipped


test: unlogged_small_files, format: binary, files: 1024, 9505MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       37.62   323     32.77   371
2       28.35   429     18.89   645
4       20.87   583     12.18   1000
8       19.37   629     10.38   1173
16      19.41   627     10.36   1176
32      18.62   654     11.04   1103
64      18.33   664     11.89   1024
128     18.41   661     11.91   1023
256     18.52   658     12.10   1007
512     18.78   648     11.49   1060


benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.


The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...

One would need to run pgbench for impractically long to make that effect
vanish.

My not-great solution for this was to run with -T21 -P5 and use the best 5s
interval as the tps.
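
A run along those lines might look like this sketch (everything beyond -S,
-T21 and -P5 is an assumption, not taken from the original setup):

c=64
pgbench -n -S -M prepared -c$c -j$c -T21 -P5 postgres 2>&1 | tee run.log
# take the best 5s interval's tps from the "progress:" lines
grep '^progress:' run.log | awk '{print $4}' | sort -rn | head -1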


s_b=1GB
        master          patched
clients tps             tps
1          49541           48805
2          85342           90010
4         167340          168918
8         308194          303222
16        524294          523678
32        649516          649100
64        932547          937702
128       908249          906281
256       856496          903979
512       764254          934702
1024      653886          925113
2048      569695          917262
4096      526782          903258


s_b=128MB:
        master          patched
clients tps             tps
1          40407           39854
2          73180           72252
4         143334          140860
8         240982          245331
16        429265          420810
32        544593          540127
64        706408          726678
128       713142          718087
256       611030          695582
512       552751          686290
1024      508248          666370
2048      474108          656735
4096      448582          633040


As there might be a small regression at the smallest end, I ran a more extreme
version of the above: a pipelined pgbench -S, with a single client, running for
longer, with s_b=8MB.

To further reduce noise I pinned the server to one CPU and the client to
another, and disabled turbo mode on the CPU.
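
The pipelined script itself wasn't shown above; a minimal sketch of a
pipelined -S equivalent, using pgbench's \startpipeline/\endpipeline
(available since v14), could look like:

cat > /tmp/pipelined_select.sql <<'EOF'
\set aid1 random(1, 100000 * :scale)
\set aid2 random(1, 100000 * :scale)
\startpipeline
SELECT abalance FROM pgbench_accounts WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
\endpipeline
EOF

# pipeline mode requires the extended query protocol, hence -M prepared;
# -T60 is an arbitrary stand-in for "longer"
pgbench -n -M prepared -c1 -j1 -T60 -P5 -f /tmp/pipelined_select.sql postgres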

master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4

Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.

My guess, however, is that it's the resource owner tracking for in-progress IO
that I added - it adds an additional allocation inside the resowner machinery.
I commented the relevant calls out (that's obviously incorrect!) just to see
whether that changes anything:

no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2

So indeed, it looks like it's the resowner. I am a bit surprised, because we
already use that mechanism for pins, which are obviously far more frequent.

I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:

- We could preallocate an initial element inside the ResourceArray struct, so
  that a newly created resowner won't need to allocate immediately
- We could only use resowners if there's more than one IO in progress at the
  same time - but I don't like that idea much
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin'
  resowner entry - on 64-bit systems there's plenty of space for that. But on
  32-bit systems...


The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.

Greetings,

Andres Freund

[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-05-03 18:25:51 +0000, Muhammad Malik wrote:
> Could you please share repro steps for running these benchmarks? I am doing
> performance testing in this area and want to use the same benchmarks.

The email should contain all the necessary information. What are you missing?


> I've done a fair bit of benchmarking of this patchset. For COPY it comes out
> ahead everywhere. It's possible that there's a very small regression for
> extremely IO miss heavy workloads, more below.
>
>
> server "base" configuration:
>
> max_wal_size=150GB
> shared_buffers=24GB
> huge_pages=on
> autovacuum=0
> backend_flush_after=2MB
> max_connections=5000
> wal_buffers=128MB
> wal_segment_size=1GB
>
> benchmark: pgbench running COPY into a single table. pgbench -t is set
> according to the client count, so that the same amount of data is inserted.
> This is done both using small files ([1], ringbuffer not effective, no dirty
> data to write out within the benchmark window) and a bit larger files ([2],
> lots of data to write out due to ringbuffer).

I use a script like:

c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && \
  time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c \
  -t$((1024/$c)) -f ~/tmp/copy.sql && \
  psql -c 'TRUNCATE copytest_0'


> [1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000))
>     TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> [2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000))
>     TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);


Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Muhammad Malik
Date:

> I use a script like:

> c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'

> >[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> >[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);

When I ran this script it did not insert anything into the copytest_0 table. It only generated a single copytest_data_text.copy file of size 9.236MB. 
Please help me understand how this is 'pgbench running COPY into a single table'. Also, what are the 'seconds' and 'tbl-MBs' metrics that were reported?

Thank you,
Muhammad


Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-05-03 19:29:46 +0000, Muhammad Malik wrote:
> > I use a script like:
>
> > c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data
> > text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1
> > -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'
>
> > >[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000))
> > >    TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> > >[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000))
> > >    TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
>
> When I ran this script it did not insert anything into the copytest_0 table.
> It only generated a single copytest_data_text.copy file of size 9.236MB.
>
> Please help me understand how this is 'pgbench running COPY into a single
> table'.

That's the data generation for the file to be COPYed in. The script passed to
pgbench is just something like

COPY copytest_0 FROM '/tmp/copytest_data_text.copy';
or
COPY copytest_0 FROM '/tmp/copytest_data_binary.copy';


> Also, what are the 'seconds' and 'tbl-MBs' metrics that were reported?

"seconds" is the total time for running the N COPYs (N = 1024 for the small
files, 64 for the larger ones). "tbl-MBs" is the size of the resulting table
divided by that time, i.e. a measure of throughput.
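
In other words, something along these lines (a sketch - the actual harness
wasn't posted; assumes $c and ~/tmp/copy.sql from the script above):

start=$(date +%s.%N)
/srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c \
  -t$((1024/$c)) -f ~/tmp/copy.sql
seconds=$(echo "$(date +%s.%N) - $start" | bc)

# table size in MB, via pg_table_size()
tbl_mb=$(psql -XtAc "SELECT pg_table_size('copytest_0') / (1024 * 1024)")
echo "seconds: $seconds  tbl-MBs: $(echo "$tbl_mb / $seconds" | bc)"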

Greetings,

Andres Freund