Thread: refactoring relation extension and BufferAlloc(), faster COPY

refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock.  We acquire a victim buffer; if that
buffer is dirty, we potentially flush the WAL and then write out that
buffer. Then we zero out the buffer contents and call smgrextend().

Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
   smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS


1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.


The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation extension lock, and
relation extension happens entirely within ReadBuffer* - there's no way to use
a narrower scope for the lock.


My fix for that is to add a dedicated function for extending relations, that
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.


To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffers are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().
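
In rough pseudocode, the flow looks like this (a simplified sketch, not the
patch's actual code; later revisions call this GetVictimBuffer() /
InvalidateVictimBuffer(), as discussed downthread):

static Buffer
GetVictimBuffer(BufferAccessStrategy strategy)
{
    for (;;)
    {
        uint32      buf_state;
        BufferDesc *buf_hdr;

        /* clock sweep; the returned buffer is already pinned */
        buf_hdr = StrategyGetBuffer(strategy, &buf_state);

        /* clean the buffer if necessary (may have to flush WAL first) */
        if (buf_state & BM_DIRTY)
            FlushBuffer(buf_hdr, NULL);

        /*
         * Remove the old buffer mapping entry and clear BM_TAG_VALID /
         * BM_VALID. If we lost a race, e.g. because somebody revived the
         * buffer concurrently, retry with another victim.
         */
        if (!InvalidateVictimBuffer(buf_hdr))
            continue;

        /* still pinned: other backends' clock sweeps will skip it */
        return BufferDescriptorGetBuffer(buf_hdr);
    }
}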

This is similar to Yuri's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty of gains, with no issues around the number of buffer mapping entries
that can exist etc.



Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc()
   does nearly doubles the amount of writes: first the kernel ends up writing
   out all the zeroed buffers after a while, then we write out the actual
   buffer contents.

   The best fix for that seems to be to optionally use posix_fallocate() to
   reserve space, without dirtying pages in the kernel page cache. However, it
   looks like that's only beneficial when extending by multiple pages at once,
   because it ends up causing one filesystem-journal entry for each extension
   on at least some filesystems.

   I added 'smgrzeroextend()' that can extend by multiple blocks, without the
   caller providing a buffer to write out. When extending by 8 or more blocks,
   posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
   used to extend the file.
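
   On the md.c side that looks roughly like this (a simplified sketch;
   segment handling and error paths are omitted, and the exact threshold
   and signatures may differ from the actual patch):

   /* inside mdzeroextend(), extending by numblocks blocks at blocknum */
   if (numblocks >= 8)
   {
       /* reserve space without dirtying pages in the kernel page cache */
       ret = FileFallocate(v->mdfd_vfd,
                           (off_t) blocknum * BLCKSZ,
                           (off_t) numblocks * BLCKSZ,
                           WAIT_EVENT_DATA_FILE_EXTEND);
   }
   else
   {
       /*
        * For small extensions writing zeroes is cheaper, as fallocate can
        * cause one filesystem-journal entry per call. FileZero() writes
        * zeroes via pg_pwritev_with_retry().
        */
       ret = FileZero(v->mdfd_vfd,
                      (off_t) blocknum * BLCKSZ,
                      (off_t) numblocks * BLCKSZ,
                      WAIT_EVENT_DATA_FILE_EXTEND);
   }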


b) I found that it is quite beneficial to bulk-extend the relation with
   smgrextend() even without concurrency. The reason for that is primarily
   the aforementioned dirty buffers that our current extension method causes.

   One bit that stumped me for quite a while is to know how much to extend the
   relation by. RelationGetBufferForTuple() drives the decision whether / how
   much to bulk extend purely on the contention on the extension lock, which
   obviously does not work for non-concurrent workloads.

   After quite a while I figured out that we actually have good information on
   how much to extend by, at least for COPY /
   heap_multi_insert(). heap_multi_insert() can compute how much space is
   needed to store all tuples, and pass that on to
   RelationGetBufferForTuple().

   For that to be accurate we need to recompute that number whenever we use an
   already partially filled page. That's not great, but doesn't appear to be a
   measurable overhead.
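
   Sketched out, the estimate looks about like this (close to, but not
   necessarily identical to, what the patch does):

   static int
   heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples,
                           Size saveFreeSpace)
   {
       size_t      page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
       int         npages = 1;

       for (int i = done; i < ntuples; i++)
       {
           size_t      tup_sz = sizeof(ItemIdData) +
               MAXALIGN(heaptuples[i]->t_len);

           if (page_avail < tup_sz)
           {
               /* tuple doesn't fit, start a fresh page */
               npages++;
               page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
           }
           page_avail -= tup_sz;
       }

       return npages;
   }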


c) Contention on the FSM and the pages returned by it is a serious bottleneck
   after a) and b).

   The biggest issue is that the current bulk insertion logic in hio.c enters
   all but one of the new pages into the freespacemap. That will immediately
   cause all the other backends to contend on the first few pages returned by
   the FSM, and cause contention on the FSM pages themselves.

   I've partially addressed that by using the information about the required
   number of pages from b). Whether we bulk insert or not, the pages we know
   we're going to need for one heap_multi_insert() don't need to be added to
   the FSM - we're going to use them anyway.

   I've stashed the number of free blocks in the BulkInsertState for now, but
   I'm not convinced that that's the right place.

   If I revert just this part, the "concurrent COPY into unlogged table"
   benchmark goes from ~240 tps to ~190 tps.


   Even after that change the FSM is a major bottleneck. Below I included
   benchmarks showing this by just removing the use of the FSM, but I haven't
   done anything further about it. The contention seems to be both from
   updating the FSM, as well as thundering-herd like symptoms from accessing
   the FSM.

   The update part could likely be addressed to some degree with a batch
   update operation updating the state for multiple pages.

   The access part could perhaps be addressed by adding an operation that gets
   a page and immediately marks it as fully used, so other backends won't also
   try to access it.
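
   As API sketches, those two ideas could look something like this (both
   functions are hypothetical, the names invented here just for
   illustration):

   /* batch variant of RecordPageWithFreeSpace(): update the categories of
    * a contiguous block range with one walk of the FSM tree */
   extern void RecordRangeWithFreeSpace(Relation rel, BlockNumber first_block,
                                        int nblocks, Size spaceAvail);

   /* like GetPageWithFreeSpace(), but immediately marks the returned page
    * as full, so that concurrent backends won't be handed the same page */
   extern BlockNumber GetAndReservePageWithFreeSpace(Relation rel,
                                                     Size spaceNeeded);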



d) doing
        /* new buffers are zero-filled */
        MemSet((char *) bufBlock, 0, BLCKSZ);

  under the extension lock is surprisingly expensive on my two socket
  workstation (but much less noticeable on my laptop).

  If I move the MemSet back under the extension lock, the "concurrent COPY
  into unlogged table" benchmark goes from ~240 tps to ~200 tps.


e) When running a few benchmarks for this email, I noticed that there was a
   sharp performance dropoff for the patched code for a pgbench -S -s100 on a
   database with 1GB s_b, starting between 512 and 1024 clients. This started
   with the patch only acquiring one buffer partition lock at a time. Lots of
   debugging ensued, resulting in [3].

   The problem isn't actually related to the change, the change just makes it
   more visible, because the "lock chains" between two partitions reduce the
   average length of the wait queues substantially, by distributing them
   across more partitions.  [3] has a reproducer that's entirely independent
   of this patchset.




Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend and releases the extension lock. After that
buffer[s] are locked (depending on mode and an argument indicating the number
of blocks to be locked), and TerminateBufferIO() is called.
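
In pseudocode (simplified, with names only approximating the patch):

/* acquire victim buffers without holding any contended lock */
for (i = 0; i < extend_by; i++)
    buffers[i] = GetVictimBuffer(strategy);     /* pinned, invalid */

if (!skip_extension_lock)
    LockRelationForExtension(rel, ExclusiveLock);

first_block = smgrnblocks(smgr, fork);
for (i = 0; i < extend_by; i++)
{
    /* insert buffers[i] into the buffer mapping table at block
     * first_block + i, set BM_TAG_VALID and mark IO in progress */
}
smgrzeroextend(smgr, fork, first_block, extend_by, false);

if (!skip_extension_lock)
    UnlockRelationForExtension(rel, ExclusiveLock);

for (i = 0; i < extend_by; i++)
{
    if (i < num_locked_pages)
        LockBuffer(buffers[i], BUFFER_LOCK_EXCLUSIVE);
    /* TerminateBufferIO() sets BM_VALID and wakes any IO waiters */
    TerminateBufferIO(GetBufferDescriptor(buffers[i] - 1), false, BM_VALID);
}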

This requires two new pieces of infrastructure:

First, pinning multiple buffers opens up the obvious danger that we might run
out of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows each
backend to pin a proportional share of buffers (although always allowing one,
as we do today).
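
A minimal sketch of that limit (assumptions: the exact backend count used and
the function's shape may differ in the actual patch):

static void
LimitAdditionalPins(uint32 *additional_pins)
{
    uint32      max_proportional_pins;

    if (*additional_pins <= 1)
        return;                 /* one pin is always allowed */

    max_proportional_pins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);

    if (max_proportional_pins == 0)
        max_proportional_pins = 1;

    if (*additional_pins > max_proportional_pins)
        *additional_pins = max_proportional_pins;
}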

Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added ResourceOwnerRememberBufferIO() etc. to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (in-progress IO
is now released inside ReleaseAuxProcessResources()).
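
The new calls mirror the existing buffer-pin tracking; roughly (the
signatures here are assumptions):

/* called before StartBufferIO(), so remembering the IO cannot fail later */
extern void ResourceOwnerEnlargeBufferIOs(ResourceOwner owner);
/* track / untrack an in-progress IO; on error, resowner cleanup calls
 * AbortBufferIO() for everything still remembered */
extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);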

In very extreme workloads (single backend doing a pgbench -S -s 100 against a
s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible, not sure if that's worth worrying about - we do such
allocations for the much more common pinning of buffers as well.


The new [Bulk]ExtendSharedRelationBuffered() currently have both a Relation
and a SMgrRelation argument, requiring at least one of them to be set. The
reason for that is on the one hand that LockRelationForExtension() requires a
relation and on the other hand, redo routines typically don't have a Relation
around (recovery doesn't require an extension lock).  That's not pretty, but
seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.



I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads, more below.


server "base" configuration:

max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and a bit larger files ([2],
lots of data to write out due to the ringbuffer).

To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.

s_b=24GB

test: unlogged_small_files, format: text, files: 1024, 9015MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       58.63   207     50.22   242     54.35   224
2       32.67   372     25.82   472     27.30   446
4       22.53   540     13.30   916     14.33   851
8       15.14   804     7.43    1640    7.48    1632
16      14.69   829     4.79    2544    4.50    2718
32      15.28   797     4.41    2763    3.32    3710
64      15.34   794     5.22    2334    3.06    4061
128     15.49   786     4.97    2452    3.13    3926
256     15.85   768     5.02    2427    3.26    3769
512     16.02   760     5.29    2303    3.54    3471

test: logged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       68.18   178     59.41   205     63.43   192
2       39.71   306     33.10   368     34.99   348
4       27.26   446     19.75   617     20.09   607
8       18.84   646     12.86   947     12.68   962
16      15.96   763     9.62    1266    8.51    1436
32      15.43   789     8.20    1486    7.77    1579
64      16.11   756     8.91    1367    8.90    1383
128     16.41   742     10.00   1218    9.74    1269
256     17.33   702     11.91   1023    10.89   1136
512     18.46   659     14.07   866     11.82   1049

test: unlogged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       63.27   192     56.14   217     59.25   205
2       40.17   303     29.88   407     31.50   386
4       27.57   442     16.16   754     17.18   709
8       21.26   573     11.89   1025    11.09   1099
16      21.25   573     10.68   1141    10.22   1192
32      21.00   580     10.72   1136    10.35   1178
64      20.64   590     10.15   1200    9.76    1249
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       71.89   169     65.57   217     69.09   69.09
2       47.36   257     36.22   407     38.71   38.71
4       33.10   368     21.76   754     22.78   22.78
8       26.62   457     15.89   1025    15.30   15.30
16      24.89   489     17.08   1141    15.20   15.20
32      25.15   484     17.41   1136    16.14   16.14
64      26.11   466     17.89   1200    16.76   16.76
128     skipped
256     skipped
512     skipped


Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.

test: unlogged_small_files, format: binary, files: 1024, 9508MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       34.14    357    28.04    434    29.46    413
2       22.67    537    14.42    845    14.75    826
4       16.63    732    7.62    1599    7.69    1587
8       13.48    904    4.36    2795    4.13    2959
16      14.37    848    3.78    3224    2.74    4493
32      14.79    823    4.20    2902    2.07    5974
64      14.76    825    5.03    2423    2.21    5561
128     14.95    815    4.36    2796    2.30    5343
256     15.18    802    4.31    2828    2.49    4935
512     15.41    790    4.59    2656    2.84    4327


s_b=4GB

test: unlogged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.55   194     54.22   224
2       37.11   328     28.94   421
4       25.97   469     16.42   742
8       20.01   609     11.92   1022
16      19.55   623     11.05   1102
32      19.34   630     11.27   1081
64      19.07   639     12.04   1012
128     19.22   634     12.27   993
256     19.34   630     12.28   992
512     19.60   621     11.74   1038

test: logged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.71   169     63.63   191
2       46.93   259     36.31   335
4       30.37   401     22.41   543
8       22.86   533     16.90   721
16      20.18   604     14.07   866
32      19.64   620     13.06   933
64      19.71   618     15.08   808
128     19.95   610     15.47   787
256     20.48   595     16.53   737
512     21.56   565     16.86   722

test: unlogged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.65   194     55.74   218
2       40.25   302     29.45   413
4       27.37   445     16.26   749
8       22.07   552     11.75   1037
16      21.29   572     10.64   1145
32      20.98   580     10.70   1139
64      20.65   590     10.21   1193
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.72   169     65.12   187
2       46.46   262     35.74   341
4       32.61   373     21.60   564
8       26.69   456     16.30   747
16      25.31   481     17.00   716
32      24.96   488     17.47   697
64      26.05   467     17.90   680
128     skipped
256     skipped
512     skipped


test: unlogged_small_files, format: binary, files: 1024, 9505MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       37.62   323     32.77   371
2       28.35   429     18.89   645
4       20.87   583     12.18   1000
8       19.37   629     10.38   1173
16      19.41   627     10.36   1176
32      18.62   654     11.04   1103
64      18.33   664     11.89   1024
128     18.41   661     11.91   1023
256     18.52   658     12.10   1007
512     18.78   648     11.49   1060


benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.


The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...

One would need to run pgbench for impractically long to make that effect
vanish.

My not-great solution for this was to run with -T21 -P5 and use the best 5s
as the tps.


s_b=1GB
        tps             tps
clients master          patched
1       49541           48805
2       85342           90010
4       167340          168918
8       308194          303222
16      524294          523678
32      649516          649100
64      932547          937702
128     908249          906281
256     856496          903979
512     764254          934702
1024    653886          925113
2048    569695          917262
4096    526782          903258


s_b=128MB:
        tps             tps
clients master          patched
1       40407           39854
2       73180           72252
4       143334          140860
8       240982          245331
16      429265          420810
32      544593          540127
64      706408          726678
128     713142          718087
256     611030          695582
512     552751          686290
1024    508248          666370
2048    474108          656735
4096    448582          633040


As there might be a small regression at the smallest end, I ran a more extreme
version of the above. Using a pipelined pgbench -S, with a single client, for
longer. With s_b=8MB.

To further reduce noise I pinned the server to one cpu, the client to another
and disabled turbo mode on the CPU.

master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4

Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.

My guess however is that it's the resource owner for in-progress IO that I
added - that adds an additional allocation inside the resowner machinery. I
commented those out (that's obviously incorrect!) just to see whether that
changes anything:

no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2

So it looks like it is indeed the resowner. I am a bit surprised, because we
already use that mechanism for pins, which are obviously much more frequent.

I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:

- We could preallocate an initial element inside the ResourceArray struct, so
  that a newly created resowner won't need to allocate immediately (sketched
  below)
- We could only use resowners if there's more than one IO in progress at the
  same time - but I don't like that idea much
- We could try to store the "in-progress"-ness of a buffer inside the
  'bufferpin' resowner entry - on 64-bit systems there's plenty of space for
  that. But on 32-bit systems...
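
For the first of those, the sketch would be something like this (the existing
ResourceArray from resowner.c, plus an assumed inline element):

typedef struct ResourceArray
{
    Datum      *itemsarr;       /* buffer for storing values; would initially
                                 * point at initialitem, so the first
                                 * Remember() needs no allocation */
    Datum       initialitem;    /* (new) inline storage for one item */
    Datum       invalidval;     /* value that is considered invalid */
    uint32      capacity;       /* allocated length of itemsarr[] */
    uint32      nitems;         /* how many items are stored in items array */
    uint32      maxitems;       /* current limit on nitems before enlarging */
    uint32      lastidx;        /* index of last item returned by GetAny */
} ResourceArray;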


The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.

Greetings,

Andres Freund

[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Bharath Rupireddy
Date:
On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I'm working to extract independently useful bits from my AIO work, to reduce
> the size of that patchset. This is one of those pieces.

Thanks a lot for this great work. There are 12 patches in this thread; I
believe each of these patches is trying to solve a separate problem and can
be reviewed and committed separately, am I correct?

> In workloads that extend relations a lot, we end up being extremely contended
> on the relation extension lock. We've attempted to address that to some degree
> by using batching, which helps, but only so much.

Yes, I too have observed this in the past for parallel inserts in CTAS
work - https://www.postgresql.org/message-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk%2BKD2H7ZmRg%40mail.gmail.com.
Tackling bulk relation extension problems will unblock the parallel
inserts (in CTAS, COPY) work I believe.

> The fundamental issue, in my opinion, is that we do *way* too much while
> holding the relation extension lock.  We acquire a victim buffer, if that
> buffer is dirty, we potentially flush the WAL, then write out that
> buffer. Then we zero out the buffer contents. Call smgrextend().
>
> Most of that work does not actually need to happen while holding the relation
> extension lock. As far as I can tell, the minimum that needs to be covered by
> the extension lock is the following:
>
> 1) call smgrnblocks()
> 2) insert buffer[s] into the buffer mapping table at the location returned by
>    smgrnblocks
> 3) mark buffer[s] as IO_IN_PROGRESS

Makes sense.

I will try to understand and review each patch separately.

Firstly, 0001 avoids extra loop over waiters and looks a reasonable
change, some comments on the patch:

1)
+    int            lwWaiting;        /* 0 if not waiting, 1 if on waitlist,
+                                      * 2 if waiting to be woken */
Use macros instead of hard-coded values for better readability?

#define PROC_LW_LOCK_NOT_WAITING 0
#define PROC_LW_LOCK_ON_WAITLIST 1
#define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2

2) Missing initialization of lwWaiting to 0 or the macro in twophase.c
and proc.c.
    proc->lwWaiting = false;
    MyProc->lwWaiting = false;

3)
+        proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+        found = true;
I guess 'found' is a bit meaningless here as we are doing away with
the proclist_foreach_modify loop. We can directly use
MyProc->lwWaiting == 1 and remove 'found'.

4)
            if (!MyProc->lwWaiting)
            if (!proc->lwWaiting)
Can we modify the above conditions in lwlock.c to MyProc->lwWaiting !=
1 or PROC_LW_LOCK_ON_WAITLIST or the macro?

5) Is there any specific test case that I can see benefit of this
patch? If yes, can you please share it here?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2022-10-29 18:33:53 +0530, Bharath Rupireddy wrote:
> On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > I'm working to extract independently useful bits from my AIO work, to reduce
> > the size of that patchset. This is one of those pieces.
> 
> Thanks a lot for this great work. There are 12 patches in this thread,
> I believe each of these patches is trying to solve separate problems
> and can be reviewed and get committed separately, am I correct?

Mostly, yes.

For 0001 I already started
https://www.postgresql.org/message-id/20221027165914.2hofzp4cvutj6gin%40awork3.anarazel.de
to discuss the specific issue.

We don't strictly need v1-0002-aio-Add-some-error-checking-around-pinning.patch
but I did find it useful.

v1-0012-bufmgr-debug-Add-PrintBuffer-Desc.patch is not used in the patch
series, but I found it quite useful when debugging issues with the patch. A
heck of a lot easier to interpret page flags when they can be printed.

I also think there's some architectural questions that'll influence the number
of patches. E.g. I'm not convinced
v1-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch is quite the
right spot to track which additional pages should be used. It could very well
instead be alongside ->smgr_targblock.  Possibly the best path would instead
be to return the additional pages explicitly to callers of
RelationGetBufferForTuple, but RelationGetBufferForTuple does a bunch of work
around pinning that potentially would need to be repeated in heap_multi_insert().



> > In workloads that extend relations a lot, we end up being extremely contended
> > on the relation extension lock. We've attempted to address that to some degree
> > by using batching, which helps, but only so much.
> 
> Yes, I too have observed this in the past for parallel inserts in CTAS
> work - https://www.postgresql.org/message-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk%2BKD2H7ZmRg%40mail.gmail.com.
> Tackling bulk relation extension problems will unblock the parallel
> inserts (in CTAS, COPY) work I believe.

Yea. There's a lot of places the current approach ended up being a bottleneck.


> Firstly, 0001 avoids extra loop over waiters and looks a reasonable
> change, some comments on the patch:

> 1)
> +    int            lwWaiting;        /* 0 if not waiting, 1 if on waitlist,
> +                                      * 2 if waiting to be woken */
> Use macros instead of hard-coded values for better readability?
> 
> #define PROC_LW_LOCK_NOT_WAITING 0
> #define PROC_LW_LOCK_ON_WAITLIST 1
> #define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2

Yea - this was really more of a prototype patch - I noted that we'd want to
use defines for this in
https://www.postgresql.org/message-id/20221027165914.2hofzp4cvutj6gin%40awork3.anarazel.de


> 3)
> +        proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
> +        found = true;
> I guess 'found' is a bit meaningless here as we are doing away with
> the proclist_foreach_modify loop. We can directly use
> MyProc->lwWaiting == 1 and remove 'found'.

We can rename it, but I think we still do need it; it's easier to analyze the
logic if the relevant check happens on a value read while we held the wait
list lock. Probably should do the reset inside the locked section as well.


> 4)
>             if (!MyProc->lwWaiting)
>             if (!proc->lwWaiting)
> Can we modify the above conditions in lwlock.c to MyProc->lwWaiting !=
> 1 or PROC_LW_LOCK_ON_WAITLIST or the macro?

I think it's better to check it's 0, rather than just != 1.


> 5) Is there any specific test case that I can see benefit of this
> patch? If yes, can you please share it here?

Yep, see the other thread, there's a pretty easy case there. You can also see
it at extreme client counts with a pgbench -S against a cluster with a smaller
shared_buffers. But the difference is not huge before something like 2048-4096
clients, and then it only occurs occasionally (because you need to end up with
most connections waiting for one of the partitions). So the test case from the
other thread is a lot better.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
vignesh C
Date:
On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:
>
> The patches here aren't fully polished (as will be evident). But they should
> be more than good enough to discuss whether this is a sane direction.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
f2857af485a00ab5dbfa2c83af9d83afe4378239 ===
=== applying patch
./v1-0001-wip-lwlock-fix-quadratic-behaviour-with-very-long.patch
patching file src/include/storage/proc.h
Hunk #1 FAILED at 217.
1 out of 1 hunk FAILED -- saving rejects to file src/include/storage/proc.h.rej
patching file src/backend/storage/lmgr/lwlock.c
Hunk #1 succeeded at 988 with fuzz 2 (offset 1 line).
Hunk #2 FAILED at 1047.
Hunk #3 FAILED at 1076.
Hunk #4 FAILED at 1104.
Hunk #5 FAILED at 1117.
Hunk #6 FAILED at 1141.
Hunk #7 FAILED at 1775.
Hunk #8 FAILED at 1790.
7 out of 8 hunks FAILED -- saving rejects to file
src/backend/storage/lmgr/lwlock.c.rej

[1] - http://cfbot.cputube.org/patch_41_3993.log

Regards,
Vignesh



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-01-06 11:52:04 +0530, vignesh C wrote:
> On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:
> >
> > The patches here aren't fully polished (as will be evident). But they should
> > be more than good enough to discuss whether this is a sane direction.
> 
> The patch does not apply on top of HEAD as in [1], please post a rebased
> patch.

Thanks for letting me know. Updated version attached.

Greetings,

Andres Freund

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
David Rowley
Date:
On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:
> Thanks for letting me know. Updated version attached.

I'm not too sure I'm qualified to give a meaningful design review
here, but I have started looking at the patches and so far have only made
it as far as 0006.
it as far as 0006.

I noted down the following while reading:

v2-0001:

1. BufferCheckOneLocalPin needs a header comment

v2-0002:

2. The following comment and corresponding code to release the
extension lock has been moved now.

/*
* Release the file-extension lock; it's now OK for someone else to extend
* the relation some more.
*/

I think it's worth detailing out why it's fine to release the
extension lock in the new location. You've added detail to the commit
message but I think you need to do the same in the comments too.

v2-0003

3. FileFallocate() and FileZero() should likely document what they
return, i.e zero on success and non-zero on failure.

4. I'm not quite clear on why you've modified FileGetRawDesc() to call
FileAccess() twice.

v2-0004:

5. Is it worth having two versions of PinLocalBuffer() one to adjust
the usage count and one that does not? Couldn't the version that does
not adjust the count skip doing pg_atomic_read_u32()?

v2-0005
v2-0006

David



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
I'll continue reviewing this, but here's some feedback on the first two 
patches:

v2-0001-aio-Add-some-error-checking-around-pinning.patch:

I wonder if the extra assertion in LockBufHdr() is worth the overhead. 
It won't add anything without assertions, of course, but still. No 
objections if you think it's worth it.


v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:

Looks as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK, 
which zeroes the page, and then we call PageInit to zero the page again. 
RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache 
previously, but with P_NEW, that is always true.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
> v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
This can be applied separately from the rest of the patches, which is 
nice. Some small comments on it:

* Needs a rebase, it conflicted slightly with commit f30d62c2fc.

* GetVictimBuffer needs a comment to explain what it does. In 
particular, mention that it returns a buffer that's pinned and known 
!BM_TAG_VALID.

* I suggest renaming 'cur_buf' and other such local variables in 
GetVictimBuffer to just 'buf'. 'cur' prefix suggests that there is some 
other buffer involved too, but there is no 'prev' or 'next' or 'other' 
buffer. The old code called it just 'buf' too, and before this patch it 
actually was a bit confusing because there were two buffers involved. 
But with this patch, GetVictimBuffer only deals with one buffer at a time.

* This FIXME:

>         /* OK, do the I/O */
>         /* FIXME: These used the wrong smgr before afaict? */
>         {
>             SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
>                                          InvalidBackendId);
> 
>             TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
>                                                       buf_hdr->tag.blockNum,
>                                                       smgr->smgr_rlocator.locator.spcOid,
>                                                       smgr->smgr_rlocator.locator.dbOid,
>                                                       smgr->smgr_rlocator.locator.relNumber);
> 
>             FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
>             LWLockRelease(content_lock);
> 
>             ScheduleBufferTagForWriteback(&BackendWritebackContext,
>                                           &buf_hdr->tag);
> 
>             TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
>                                                      buf_hdr->tag.blockNum,
>                                                      smgr->smgr_rlocator.locator.spcOid,
>                                                      smgr->smgr_rlocator.locator.dbOid,
>                                                      smgr->smgr_rlocator.locator.relNumber);
>         }

I believe that was intentional. The probes previously reported the block 
and relation whose read *caused* the eviction. It was not just the smgr 
but also the blockNum and forkNum that referred to the block that was 
being read. There's another pair of probe points, 
TRACE_POSTGRESQL_BUFFER_FLUSH_START/DONE, inside FlushBuffer that 
indicate the page that is being flushed.

I see that reporting the evicted page is more convenient with this 
patch, otherwise you'd need to pass the smgr and blocknum of the page 
that's being read to InvalidateVictimBuffer(). IMHO you can just remove 
these probe points. We don't need to bend over backwards to maintain 
specific probe points.

* InvalidateVictimBuffer reads the buffer header with an atomic read op, 
just to check if BM_TAG_VALID is set. If it's not, it does nothing 
(except for a few Asserts). But the caller has already read the buffer 
header. Consider refactoring it so that the caller checks BM_TAG_VALID, 
and only calls InvalidateVictimBuffer if it's set, saving one atomic 
read in InvalidateVictimBuffer. I think it would be just as readable, so 
no loss there. I doubt the atomic read makes any measurable performance 
difference, but it looks redundant.

* I don't understand this comment:

>     /*
>      * Clear out the buffer's tag and flags and usagecount.  We must do
>      * this to ensure that linear scans of the buffer array don't think
>      * the buffer is valid.
>      *
>      * XXX: This is a pre-existing comment I just moved, but isn't it
>      * entirely bogus with regard to the tag? We can't do anything with
>      * the buffer without taking BM_VALID / BM_TAG_VALID into
>      * account. Likely doesn't matter because we're already dirtying the
>      * cacheline, but still.
>      *
>      */
>     ClearBufferTag(&buf_hdr->tag);
>     buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
>     UnlockBufHdr(buf_hdr, buf_state);

What exactly is wrong with clearing the tag? What does dirtying the 
cacheline have to do with the correctness here?

* pgindent

- Heikki



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:
> > v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
> This can be applied separately from the rest of the patches, which is nice.
> Some small comments on it:

Thanks for looking at these!


> * Needs a rebase, it conflicted slightly with commit f30d62c2fc.

Will work on that.


> * GetVictimBuffer needs a comment to explain what it does. In particular,
> mention that it returns a buffer that's pinned and known !BM_TAG_VALID.

Will add.


> * I suggest renaming 'cur_buf' and other such local variables in
> GetVictimBufffer to just 'buf'. 'cur' prefix suggests that there is some
> other buffer involved too, but there is no 'prev' or 'next' or 'other'
> buffer. The old code called it just 'buf' too, and before this patch it
> actually was a bit confusing because there were two buffers involved. But
> with this patch, GetVictimBuffer only deals with one buffer at a time.

Hm. Yea. I probably ended up with these names because initially
GetVictimBuffer() wasn't a separate function, and I indeed constantly got
confused by which buffer was referenced.


> * This FIXME:
>
> >         /* OK, do the I/O */
> >         /* FIXME: These used the wrong smgr before afaict? */
> >         {
> >             SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
> >                                          InvalidBackendId);
> >
> >             TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
> >                                                       buf_hdr->tag.blockNum,
> >                                                       smgr->smgr_rlocator.locator.spcOid,
> >                                                       smgr->smgr_rlocator.locator.dbOid,
> >                                                       smgr->smgr_rlocator.locator.relNumber);
> >
> >             FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
> >             LWLockRelease(content_lock);
> >
> >             ScheduleBufferTagForWriteback(&BackendWritebackContext,
> >                                           &buf_hdr->tag);
> >
> >             TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
> >                                                      buf_hdr->tag.blockNum,
> >                                                      smgr->smgr_rlocator.locator.spcOid,
> >                                                      smgr->smgr_rlocator.locator.dbOid,
> >                                                      smgr->smgr_rlocator.locator.relNumber);
> >         }
>
> I believe that was intentional. The probes previously reported the block and
> relation whose read *caused* the eviction. It was not just the smgr but also
> the blockNum and forkNum that referred to the block that was being read.

You're probably right.  It's certainly not understandable from our docs
though:

    <row>
     <entry><literal>buffer-write-dirty-start</literal></entry>
     <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
     <entry>Probe that fires when a server process begins to write a dirty
      buffer.  (If this happens often, it implies that
      <xref linkend="guc-shared-buffers"/> is too
      small or the background writer control parameters need adjustment.)
      arg0 and arg1 contain the fork and block numbers of the page.
      arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
      identifying the relation.</entry>
    </row>


> I see that reporting the evicted page is more convenient with this patch,
> otherwise you'd need to pass the smgr and blocknum of the page that's being
> read to InvalidateVictimBuffer(). IMHO you can just remove these probe
> points. We don't need to bend over backwards to maintain specific probe
> points.

Agreed.


> * InvalidateVictimBuffer reads the buffer header with an atomic read op,
> just to check if BM_TAG_VALID is set.

It's not a real atomic op, in the sense of being a special instruction. It
does force the compiler to actually read from memory, but that's it.

But you're right, even that is unnecessary. I think it ended up that way
because I also wanted the full buf_hdr, and it seemed somewhat error prone to
pass in both.



> * I don't understand this comment:
>
> >     /*
> >      * Clear out the buffer's tag and flags and usagecount.  We must do
> >      * this to ensure that linear scans of the buffer array don't think
> >      * the buffer is valid.
> >      *
> >      * XXX: This is a pre-existing comment I just moved, but isn't it
> >      * entirely bogus with regard to the tag? We can't do anything with
> >      * the buffer without taking BM_VALID / BM_TAG_VALID into
> >      * account. Likely doesn't matter because we're already dirtying the
> >      * cacheline, but still.
> >      *
> >      */
> >     ClearBufferTag(&buf_hdr->tag);
> >     buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
> >     UnlockBufHdr(buf_hdr, buf_state);
>
> What exactly is wrong with clearing the tag? What does dirtying the
> cacheline have to do with the correctness here?

There's nothing wrong with clearing out the tag, but I don't think it's a hard
requirement today, and certainly not for the reason stated above.

Validity of the buffer isn't determined by the tag, it's determined by
BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).

Without either having pinned the buffer, or holding the buffer header
spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
knows that, and re-checks the the tag after locking the buffer header
spinlock.

Afaict, there'd be no correctness issue with removing the
ClearBufferTag(). There would be an efficiency issue though, because when
encountering an invalid buffer, we'd unnecessarily enter InvalidateBuffer(),
which'd find that BM_[TAG_]VALID isn't set, and not do anything.


Even though it's not a correctness issue, it seems to me that
DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
before doing anything further.  Particularly in DropRelationsAllBuffers(), the
check we do for each buffer isn't cheap. Doing it for buffers that don't even
have a tag seems .. not smart.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
On 2023-02-11 13:36:51 -0800, Andres Freund wrote:
> Even though it's not a correctness issue, it seems to me that
> DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
> before doing anything further.  Particularly in DropRelationsAllBuffers(), the
> check we do for each buffer isn't cheap. Doing it for buffers that don't even
> have a tag seems .. not smart.

There's a small regression for a single relation, but after that it's a clear
benefit.

32GB shared buffers, empty. The test creates N new relations and then rolls
back.

                tps             tps
num relations   HEAD            precheck
1               46.11           45.22
2               43.24           44.87
4               35.14           44.20
8               28.72           42.79

I don't understand the regression at 1, TBH.  I think it must be a random code
layout issue, because the same pre-check in DropRelationBuffers() (exercised
via TRUNCATE of a newly created relation), shows a tiny speedup.



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-10 18:38:50 +0200, Heikki Linnakangas wrote:
> I'll continue reviewing this, but here's some feedback on the first two
> patches:
> 
> v2-0001-aio-Add-some-error-checking-around-pinning.patch:
> 
> I wonder if the extra assertion in LockBufHdr() is worth the overhead. It
> won't add anything without assertions, of course, but still. No objections
> if you think it's worth it.

It's so easy to get confused about local/non-local buffers that I think it is
useful.  I think we really need to consider cleaning up the separation
further. Having half the code for local buffers in bufmgr.c and the other half
in localbuf.c, without any scheme that I can recognize, is not good.


It bothers me somewhat that ConditionalLockBufferForCleanup() silently accepts
multiple pins by the current backend. That's the right thing for
e.g. heap_page_prune_opt(), but for something like lazy_scan_heap() it's
not.  And yes, I did encounter a bug hidden by that when making vacuumlazy use
AIO as part of that patchset.  That's why I made BufferCheckOneLocalPin()
externally visible.



> v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:
> 
> Looks as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK,
> which zeroes the page, and then we call PageInit to zero the page again.
> RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache
> previously, but with P_NEW, that is always true.

It is quite silly, and it shows up noticably in profiles. The zeroing is
definitely needed in other places calling PageInit(), though. I suspect we
should have a PageInitZeroed() or such, that asserts the page is zero, but
otherwise skips it.
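
Something like this, roughly (a sketch; PageInitZeroed() doesn't exist, and
a real version might share code with PageInit() instead of copying it):

static inline void
PageInitZeroed(Page page, Size specialSize)
{
    PageHeader  p = (PageHeader) page;

    specialSize = MAXALIGN(specialSize);

    /* caller promises the page is already zero-filled */
    Assert(PageIsNew(page));

    /* same as PageInit(), minus the MemSet; pd_prune_xid is already 0 */
    p->pd_flags = 0;
    p->pd_lower = SizeOfPageHeaderData;
    p->pd_upper = BLCKSZ - specialSize;
    p->pd_special = BLCKSZ - specialSize;
    PageSetPageSizeAndVersion(page, BLCKSZ, PG_PAGE_LAYOUT_VERSION);
}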

Seems independent enough from this series, that I'd probably tackle it
separately? If you prefer, I'm ok with adding a patch to this series instead,
though.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-01-20 13:40:55 +1300, David Rowley wrote:
> On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:
> > Thanks for letting me know. Updated version attached.
> 
> I'm not too sure I've qualified for giving a meaningful design review
> here, but I have started looking at the patches and so far only made
> it as far as 0006.

Thanks!


> I noted down the following while reading:
> 
> v2-0001:
> 
> 1. BufferCheckOneLocalPin needs a header comment
> 
> v2-0002:
> 
> 2. The following comment and corresponding code to release the
> extension lock has been moved now.
> 
> /*
> * Release the file-extension lock; it's now OK for someone else to extend
> * the relation some more.
> */
> 
> I think it's worth detailing out why it's fine to release the
> extension lock in the new location. You've added detail to the commit
> message but I think you need to do the same in the comments too.

Will do.


> v2-0003
> 
> 3. FileFallocate() and FileZero() should likely document what they
> return, i.e zero on success and non-zero on failure.

I guess I just tried to fit in with the rest of the file :)


> 4. I'm not quite clear on why you've modified FileGetRawDesc() to call
> FileAccess() twice.

I do not have the faintest idea what happened there... Will fix.


> v2-0004:
> 
> 5. Is it worth having two versions of PinLocalBuffer() one to adjust
> the usage count and one that does not? Couldn't the version that does
> not adjust the count skip doing pg_atomic_read_u32()?

I think it'd be nicer to just move the read inside the if
(adjust_usagecount). That way the rest of the function doesn't have to be
duplicated.


Thanks,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-11 14:25:06 -0800, Andres Freund wrote:
> On 2023-01-20 13:40:55 +1300, David Rowley wrote:
> > v2-0004:
> > 
> > 5. Is it worth having two versions of PinLocalBuffer() one to adjust
> > the usage count and one that does not? Couldn't the version that does
> > not adjust the count skip doing pg_atomic_read_u32()?
> 
> I think it'd be nicer to just move the read inside the if
> (adjust_usagecount). That way the rest of the function doesn't have to be
> duplicated.

Ah, no, we need it for the return value. No current users of
  PinLocalBuffer(adjust_usagecount = false)
need the return value, but that won't necessarily remain the case.

I'm somewhat inclined to not duplicate it, but if you think it's worth it,
I'll do that.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 11/02/2023 23:36, Andres Freund wrote:
> On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:
>> * I don't understand this comment:
>>
>>>     /*
>>>      * Clear out the buffer's tag and flags and usagecount.  We must do
>>>      * this to ensure that linear scans of the buffer array don't think
>>>      * the buffer is valid.
>>>      *
>>>      * XXX: This is a pre-existing comment I just moved, but isn't it
>>>      * entirely bogus with regard to the tag? We can't do anything with
>>>      * the buffer without taking BM_VALID / BM_TAG_VALID into
>>>      * account. Likely doesn't matter because we're already dirtying the
>>>      * cacheline, but still.
>>>      *
>>>      */
>>>     ClearBufferTag(&buf_hdr->tag);
>>>     buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
>>>     UnlockBufHdr(buf_hdr, buf_state);
>>
>> What exactly is wrong with clearing the tag? What does dirtying the
>> cacheline have to do with the correctness here?
> 
> There's nothing wrong with clearing out the tag, but I don't think it's a hard
> requirement today, and certainly not for the reason stated above.
> 
> Validity of the buffer isn't determined by the tag, it's determined by
> BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).
> 
> Without either having pinned the buffer, or holding the buffer header
> spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
> knows that, and re-checks the the tag after locking the buffer header
> spinlock.
> 
> Afaict, there'd be no correctness issue with removing the
> ClearBufferTag(). There would be an efficiency issue though, because when
> encountering an invalid buffer, we'd unnecessarily enter InvalidateBuffer(),
> which'd find that BM_[TAG_]VALID isn't set, and not to anything.

Okay, gotcha.

> Even though it's not a correctness issue, it seems to me that
> DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
> before doing anything further.  Particularly in DropRelationsAllBuffers(), the
> check we do for each buffer isn't cheap. Doing it for buffers that don't even
> have a tag seems .. not smart.

Depends on what percentage of buffers are valid, I guess. If all buffers 
are valid, checking BM_TAG_VALID first would lose. In practice, I doubt 
it makes any measurable difference either way.

Since we're micro-optimizing, I noticed that 
BufTagMatchesRelFileLocator() compares the fields in order "spcOid, 
dbOid, relNumber". Before commit 82ac34db20, we used 
RelFileLocatorEqual(), which has this comment:

/*
 * Note: RelFileLocatorEquals and RelFileLocatorBackendEquals compare relNumber
 * first since that is most likely to be different in two unequal
 * RelFileLocators.  It is probably redundant to compare spcOid if the other
 * fields are found equal, but do it anyway to be sure.  Likewise for checking
 * the backend ID in RelFileLocatorBackendEquals.
 */

So we lost that micro-optimization. Should we reorder the checks in 
BufTagMatchesRelFileLocator()?
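
I.e., reordered it would look something like this (sketch):

static inline bool
BufTagMatchesRelFileLocator(const BufferTag *tag,
                            const RelFileLocator *rlocator)
{
    /* compare relNumber first, it is the most likely to differ */
    return (BufTagGetRelNumber(tag) == rlocator->relNumber) &&
        (tag->dbOid == rlocator->dbOid) &&
        (tag->spcOid == rlocator->spcOid);
}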

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
> v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch

This looks straightforward. My only concern is that it changes the order 
that things happen at abort. Currently, AbortBufferIO() is called very 
early in AbortTransaction(), and this patch moves it much later. I don't 
see any immediate problems from that, but it feels scary.


> @@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
>  static void
>  AtProcExit_Buffers(int code, Datum arg)
>  {
> -    AbortBufferIO();
>      UnlockBuffers();
>  
>      CheckForBufferLeaks();

Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on 
elog(FATAL)? Do we need to worry about that?

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
> v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch

> +static BlockNumber
> +BulkExtendSharedRelationBuffered(Relation rel,
> +                                 SMgrRelation smgr,
> +                                 bool skip_extension_lock,
> +                                 char relpersistence,
> +                                 ForkNumber fork, ReadBufferMode mode,
> +                                 BufferAccessStrategy strategy,
> +                                 uint32 *num_pages,
> +                                 uint32 num_locked_pages,
> +                                 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I 
don't have any concrete suggestions, but could we simplify this somehow? 
Needs a comment at least.

> v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch

> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> index de1427a1e0e..1810f7ebfef 100644
> --- a/src/backend/access/brin/brin.c
> +++ b/src/backend/access/brin/brin.c
> @@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
>       * whole relation will be rolled back.
>       */
>  
> -    meta = ReadBuffer(index, P_NEW);
> +    meta = ExtendRelationBuffered(index, NULL, true,
> +                                  index->rd_rel->relpersistence,
> +                                  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> +                                  NULL);
>      Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
> -    LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
>  
>      brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
>                         BRIN_CURRENT_VERSION);

Since we're changing the API anyway, how about introducing a new 
function for this case where we extend the relation but we know what 
block number we're going to get? This pattern of using P_NEW and 
asserting the result has always felt awkward to me.

> -        buf = ReadBuffer(irel, P_NEW);
> +        buf = ExtendRelationBuffered(irel, NULL, false,
> +                                     irel->rd_rel->relpersistence,
> +                                     MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> +                                     NULL);

These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW). 
I'd suggest something like:

buf = ExtendBuffer(rel);

Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with 
ExtendRelationBuffered?

Is it ever possible to call this without a relcache entry? WAL redo 
functions do that with ReadBuffer, but they only extend a relation 
implicitly, by replay a record for a particular block.

All of the above comments are around the BulkExtendRelationBuffered() 
function's API. That needs a closer look and a more thought-out design 
to make it nice. Aside from that, this approach seems valid.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Alvaro Herrera
Date:
On 2023-Feb-21, Heikki Linnakangas wrote:

> > +static BlockNumber
> > +BulkExtendSharedRelationBuffered(Relation rel,
> > +                                 SMgrRelation smgr,
> > +                                 bool skip_extension_lock,
> > +                                 char relpersistence,
> > +                                 ForkNumber fork, ReadBufferMode mode,
> > +                                 BufferAccessStrategy strategy,
> > +                                 uint32 *num_pages,
> > +                                 uint32 num_locked_pages,
> > +                                 Buffer *buffers)
> 
> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> don't have any concrete suggestions, but could we simplify this somehow?
> Needs a comment at least.

Yeah, I noticed this too.  I think it would be easy enough to add a new
struct that can be passed as a pointer, which can be stack-allocated
by the caller, and which holds the input arguments that are common to
both functions, as is sensible.
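
Something along these lines, perhaps (hypothetical; the field list is just
for illustration):

typedef struct ExtendRelationArgs
{
    Relation    rel;            /* either of rel and smgr may be NULL, */
    SMgrRelation smgr;          /* but not both */
    char        relpersistence;
    ForkNumber  fork;
    BufferAccessStrategy strategy;
    bool        skip_extension_lock;
} ExtendRelationArgs;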

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Update: super-fast reaction on the Postgres bugs mailing list. The report
was acknowledged [...], and a fix is under discussion.
The wonders of open-source !"
             https://twitter.com/gunnarmorling/status/1596080409259003906



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 17:40:31 +0200, Heikki Linnakangas wrote:
> > v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
> 
> This looks straightforward. My only concern is that it changes the order
> that things happen at abort. Currently, AbortBufferIO() is called very early
> in AbortTransaction(), and this patch moves it much later. I don't see any
> immediate problems from that, but it feels scary.

Yea, it does feel a bit awkward. But I suspect it's actually the right
thing. We've not even adjusted the transaction state at the point we're
calling AbortBufferIO(). And AbortBufferIO() will sometimes allocate memory
for a WARNING, which conceivably could fail - although I don't think that's a
particularly realistic scenario due to TransactionAbortContext (I guess you
could have a large error context stack or such).


Medium term I think we need to move a lot more of the error handling into
resowners. Having a dozen+ places with their own choreographed sigsetjmp()
recovery blocks is error prone as hell. Not to mention tedious.


> > @@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
> >  static void
> >  AtProcExit_Buffers(int code, Datum arg)
> >  {
> > -    AbortBufferIO();
> >      UnlockBuffers();
> >      CheckForBufferLeaks();
> 
> Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on
> elog(FATAL)? Do we need to worry about that?

We have before_shmem_exit() callbacks that should protect against
that. InitPostgres() registers ShutdownPostgres(), and
CreateAuxProcessResourceOwner() registers
ReleaseAuxProcessResourcesCallback().


I think we'd already be in trouble if we didn't reliably end up doing resowner
cleanup during process exit.

Perhaps ResourceOwnerCreate()/ResourceOwnerDelete() should maintain a list of
"active" resource owners and have a before-exit callback that ensures the list
is empty and PANICs if not?  Better a crash restart than hanging because we
didn't release some shared resource.
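
Roughly along these lines - just a sketch, nothing like this exists today, and
ResourceOwnerCreate()/ResourceOwnerDelete() would need the corresponding
dlist_push_tail()/dlist_delete() calls:

/* sketch only; would live in resowner.c, using lib/ilist.h */
static dlist_head active_resowners = DLIST_STATIC_INIT(active_resowners);

/* registered once via before_shmem_exit() */
static void
AtProcExit_CheckResourceOwners(int code, Datum arg)
{
    /* better a crash restart than hanging on a leaked shared resource */
    if (!dlist_is_empty(&active_resowners))
        elog(PANIC, "resource owner(s) still active at process exit");
}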

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch
> 
> > +static BlockNumber
> > +BulkExtendSharedRelationBuffered(Relation rel,
> > +                                 SMgrRelation smgr,
> > +                                 bool skip_extension_lock,
> > +                                 char relpersistence,
> > +                                 ForkNumber fork, ReadBufferMode mode,
> > +                                 BufferAccessStrategy strategy,
> > +                                 uint32 *num_pages,
> > +                                 uint32 num_locked_pages,
> > +                                 Buffer *buffers)
> 
> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> don't have any concrete suggestions, but could we simplify this somehow?
> Needs a comment at least.

Yea. I think this is the part of the patchset I like the least.

The ugliest bit is accepting both rel and smgr. The background to that is that
we need the relation oid to acquire the extension lock. But during crash
recovery we don't have that - which is fine, because we don't need the
extension lock.

We could have two different functions, but that ends up a mess as well, as
we've seen in other cases.


> > v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch
> 
> > diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
> > index de1427a1e0e..1810f7ebfef 100644
> > --- a/src/backend/access/brin/brin.c
> > +++ b/src/backend/access/brin/brin.c
> > @@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
> >       * whole relation will be rolled back.
> >       */
> > -    meta = ReadBuffer(index, P_NEW);
> > +    meta = ExtendRelationBuffered(index, NULL, true,
> > +                                  index->rd_rel->relpersistence,
> > +                                  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> > +                                  NULL);
> >      Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
> > -    LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
> >      brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
> >                         BRIN_CURRENT_VERSION);
> 
> Since we're changing the API anyway, how about introducing a new function
> for this case where we extend the relation but we know what block number we're
> going to get? This pattern of using P_NEW and asserting the result has
> always felt awkward to me.

To me it always felt like a code smell that some code insists on getting
specific block numbers with P_NEW. I guess it's ok for things like
building a new index, but outside of that it feels wrong.

The first case I found just now is revmap_physical_extend(). Which seems to
extend the relation while holding an lwlock. Ugh.

Maybe ExtendRelationBufferedTo() or something like that? With a big comment
saying that users of it are likely bad ;)


> > -        buf = ReadBuffer(irel, P_NEW);
> > +        buf = ExtendRelationBuffered(irel, NULL, false,
> > +                                     irel->rd_rel->relpersistence,
> > +                                     MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
> > +                                     NULL);
> 
> These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW). I'd
> suggest something like:

I guess. Not sure if it's worth optimizing for brevity all that much here -
there's not that many places extending relations. Several places end up with
less code, actually, because they don't need to care about the extension lock
themselves anymore. I think an ExtendBuffer() that doesn't mention the fork,
etc, ends up being more confusing than helpful.


> buf = ExtendBuffer(rel);

Without the relation in the name it just seems confusing to me - the extension
isn't "restricted" to shared_buffers. ReadBuffer() isn't great as a name
either, but it makes a bit more sense at least: it reads into a buffer. And
it's a vastly more frequent operation, so optimizing for density is worth it.


> Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
> ExtendRelationBuffered?

Hm. That's a good point. Probably not. Perhaps it could be useful to support
RBM_NORMAL as well? But even if we did, it'd just be a lock release away if we always
used RBM_ZERO_AND_LOCK.


> Is it ever possible to call this without a relcache entry? WAL redo
> functions do that with ReadBuffer, but they only extend a relation
> implicitly, by replaying a record for a particular block.

I think we should use it for crash recovery as well, but the patch doesn't
yet. We have some gnarly code there, see the loop using P_NEW in
XLogReadBufferExtended(). Extending the file one-by-one is a lot more
expensive than doing it in bulk.
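
From memory, the gnarly bit looks roughly like this (paraphrased, not the
verbatim source):

/* paraphrase of the P_NEW loop in XLogReadBufferExtended() */
buffer = InvalidBuffer;
do
{
    if (buffer != InvalidBuffer)
    {
        if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buffer);
    }
    /* extends the fork by exactly one block per iteration */
    buffer = ReadBufferWithoutRelcache(rlocator, forknum, P_NEW,
                                       mode, NULL, true);
} while (BufferGetBlockNumber(buffer) < blkno);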


> All of the above comments are around the BulkExtendRelationBuffered()
> function's API. That needs a closer look and a more thought-out design to
> make it nice. Aside from that, this approach seems valid.

Thanks for looking!  I agree that it can stand a fair bit of polishing...

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Jim Nasby
Date:
On 10/28/22 9:54 PM, Andres Freund wrote:
> b) I found that it is quite beneficial to bulk-extend the relation with
>     smgrextend() even without concurrency. The reason for that is primarily
>     the aforementioned dirty buffers that our current extension method causes.
>
>     One bit that stumped me for quite a while is to know how much to extend the
>     relation by. RelationGetBufferForTuple() drives the decision whether / how
>     much to bulk extend purely on the contention on the extension lock, which
>     obviously does not work for non-concurrent workloads.
>
>     After quite a while I figured out that we actually have good information on
>     how much to extend by, at least for COPY /
>     heap_multi_insert(). heap_multi_insert() can compute how much space is
>     needed to store all tuples, and pass that on to
>     RelationGetBufferForTuple().
>
>     For that to be accurate we need to recompute that number whenever we use an
>     already partially filled page. That's not great, but doesn't appear to be a
>     measurable overhead.
Some food for thought: I think it's also completely fine to extend any 
relation over a certain size by multiple blocks, regardless of 
concurrency. E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't 
have a good feel for what algorithm would make sense here; maybe 
something along the lines of extend = max(relpages / 2048, 128); if 
extend < 8 extend = 1; (presumably extending by just a couple extra 
pages doesn't help much without concurrency).
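
Spelled out as code, purely as a sketch - and reading the max() as a cap,
i.e. min(), since otherwise the < 8 branch could never be taken:

static int
suggested_extend_by(BlockNumber relpages)
{
    int         extend = Min(relpages / 2048, 128);

    /* extending by just a couple extra pages doesn't buy much */
    if (extend < 8)
        extend = 1;

    return extend;
}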



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:
> On 10/28/22 9:54 PM, Andres Freund wrote:
> > b) I found that it is quite beneficial to bulk-extend the relation with
> >     smgrextend() even without concurrency. The reason for that is primarily
> >     the aforementioned dirty buffers that our current extension method causes.
> > 
> >     One bit that stumped me for quite a while is to know how much to extend the
> >     relation by. RelationGetBufferForTuple() drives the decision whether / how
> >     much to bulk extend purely on the contention on the extension lock, which
> >     obviously does not work for non-concurrent workloads.
> > 
> >     After quite a while I figured out that we actually have good information on
> >     how much to extend by, at least for COPY /
> >     heap_multi_insert(). heap_multi_insert() can compute how much space is
> >     needed to store all tuples, and pass that on to
> >     RelationGetBufferForTuple().
> > 
> >     For that to be accurate we need to recompute that number whenever we use an
> >     already partially filled page. That's not great, but doesn't appear to be a
> >     measurable overhead.
> Some food for thought: I think it's also completely fine to extend any
> relation over a certain size by multiple blocks, regardless of concurrency.
> E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
> for what algorithm would make sense here; maybe something along the lines of
> extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
> extending by just a couple extra pages doesn't help much without
> concurrency).

I previously implemented just that. It's not easy to get right. You can easily
end up with several backends each extending the relation by quite a bit, at
the same time (or you re-introduce contention), which can end up with the
relation being larger by a bunch if data loading stops at some point.

We might want that as well at some point, but the approach implemented in the
patchset is precise and thus always a win, so it should be the baseline.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Jim Nasby
Date:
On 2/21/23 3:12 PM, Andres Freund wrote:
> Hi,
>
> On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:
>> Some food for thought: I think it's also completely fine to extend any
>> relation over a certain size by multiple blocks, regardless of concurrency.
>> E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
>> for what algorithm would make sense here; maybe something along the lines of
>> extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
>> extending by just a couple extra pages doesn't help much without
>> concurrency).
> I previously implemented just that. It's not easy to get right. You can easily
> end up with several backends each extending the relation by quite a bit, at
> the same time (or you re-introduce contention), which can end up with the
> relation being larger by a bunch if data loading stops at some point.
>
> We might want that as well at some point, but the approach implemented in the
> patchset is precise and thus always a win, so it should be the baseline.
Yeah, what I was suggesting would only make sense when there *wasn't* 
contention.



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 21/02/2023 21:22, Andres Freund wrote:
> On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
>> Is it ever possible to call this without a relcache entry? WAL redo
>> functions do that with ReadBuffer, but they only extend a relation
>> implicitly, by replaying a record for a particular block.
> 
> I think we should use it for crash recovery as well, but the patch doesn't
> yet. We have some gnarly code there, see the loop using P_NEW in
> XLogReadBufferExtended(). Extending the file one-by-one is a lot more
> expensive than doing it in bulk.

Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the 
gap, and then call ExtendRelationBuffered for the target page. Or the 
new ExtendRelationBufferedTo() function that you mentioned.

In the common case that you load a lot of data into a relation, extending
it, and then crash, the WAL replay would still extend the relation one
page at a time, which is inefficient. Changing that would need bigger 
changes, to WAL-log the relation extension as a separate WAL record, for 
example. I don't think we need to solve that right now, it can be 
addressed separately later.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-22 11:18:57 +0200, Heikki Linnakangas wrote:
> On 21/02/2023 21:22, Andres Freund wrote:
> > On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > > Is it ever possible to call this without a relcache entry? WAL redo
> > > functions do that with ReadBuffer, but they only extend a relation
> > > implicitly, by replaying a record for a particular block.
> > 
> > I think we should use it for crash recovery as well, but the patch doesn't
> > yet. We have some gnarly code there, see the loop using P_NEW in
> > XLogReadBufferExtended(). Extending the file one-by-one is a lot more
> > expensive than doing it in bulk.
> 
> Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the gap,
> and then call ExtendRelationBuffered for the target page. Or the new
> ExtendRelationBufferedTo() function that you mentioned.

I don't think it's safe to just use smgrzeroextend(). Without the page-level
interlock from the buffer entry, a concurrent reader can read/write the
extended portion of the relation while we're extending. That can lead to
losing writes.

It also turns out that just doing smgrzeroextend(), without filling s_b, is
often bad for performance, because it may cause reads when trying to fill the
buffers. Although hopefully that's less of an issue during WAL replay, due to
REGBUF_WILL_INIT.


> In the common case that you load a lot of data into a relation, extending it,
> and then crash, the WAL replay would still extend the relation one page at a
> time, which is inefficient. Changing that would need bigger changes, to
> WAL-log the relation extension as a separate WAL record, for example. I
> don't think we need to solve that right now, it can be addressed separately
> later.

Yea, that seems indeed something for later.

There's several things we could do without adding WAL logging of relation
extension themselves.

One relatively easy thing would be to add information about the number of
blocks we're extending by to XLOG_HEAP2_MULTI_INSERT records. Compared to the
insertions themselves that'd barely be noticeable.
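
As a sketch, that could be as little as one extra field in the record - to be
clear, nothing like this exists in the patchset or in the current WAL format,
it's purely hypothetical:

/* hypothetical; the real xl_heap_multi_insert record has no such field */
typedef struct xl_heap_multi_insert_sketch
{
    uint8       flags;
    uint16      ntuples;
    uint32      nblocks_extended;   /* blocks the inserter extended by */
} xl_heap_multi_insert_sketch;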

A slightly more complicated thing would be to peek ahead in the WAL (we have
infrastructure for that now) and extend by enough for the next few relation
extensions.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 11:22:26 -0800, Andres Freund wrote:
> On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
> > Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
> > ExtendRelationBuffered?
> 
> Hm. That's a good point. Probably not. Perhaps it could be useful to support
> RBM_NORMAL as well? But even if we did, it'd just be a lock release away if we always
> used RBM_ZERO_AND_LOCK.

There's a fair number of callers using RBM_NORMAL, via ReadBuffer[Extended]()
right now. While some of them are trivial to convert, others aren't (e.g.,
brin_getinsertbuffer()).  So I'm inclined to continue allowing RBM_NORMAL.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 21/02/2023 18:33, Alvaro Herrera wrote:
> On 2023-Feb-21, Heikki Linnakangas wrote:
> 
>>> +static BlockNumber
>>> +BulkExtendSharedRelationBuffered(Relation rel,
>>> +                                 SMgrRelation smgr,
>>> +                                 bool skip_extension_lock,
>>> +                                 char relpersistence,
>>> +                                 ForkNumber fork, ReadBufferMode mode,
>>> +                                 BufferAccessStrategy strategy,
>>> +                                 uint32 *num_pages,
>>> +                                 uint32 num_locked_pages,
>>> +                                 Buffer *buffers)
>>
>> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
>> don't have any concrete suggestions, but could we simplify this somehow?
>> Needs a comment at least.
> 
> Yeah, I noticed this too.  I think it would be easy enough to add a new
> struct that can be passed as a pointer, which can be stack-allocated
> by the caller, and which holds the input arguments that are common to
> both functions, as is sensible.

We also do this in freespace.c and visibilitymap.c:

     /* Extend as needed. */
     while (fsm_nblocks_now < fsm_nblocks)
     {
         PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);

         smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
                    pg.data, false);
         fsm_nblocks_now++;
      }

We could use the new smgrzeroextend function here. But it would be 
better to go through the buffer cache, because after this, the last 
block, at 'fsm_nblocks', will be read with ReadBuffer() and modified.

We could use BulkExtendSharedRelationBuffered() to extend the relation 
and keep the last page locked, but the 
BulkExtendSharedRelationBuffered() signature doesn't allow that. It can 
return the first N pages locked, but there's no way to return the *last* 
page locked.

Perhaps we should decompose this function into several function calls. 
Something like:

/* get N victim buffers, pinned and !BM_VALID */
buffers = BeginExtendRelation(npages);

LockRelationForExtension(rel);

/* Insert buffers into buffer table */
first_blk = smgrnblocks();
for (i = 0, blk = first_blk; blk < last_blk; blk++, i++)
     MapNewBuffer(blk, buffers[i]);

/* extend the file on disk */
smgrzeroextend();

UnlockRelationForExtension(rel);

for (i = 0, blk = first_blk; blk < last_blk; blk++, i++)
{
     memset(BufferGetPage(buffers[i]), 0, BLCKSZ);
     FinishNewBuffer(buffers[i]);
     /* optionally lock the buffer */
     LockBuffer(buffers[i], BUFFER_LOCK_EXCLUSIVE);
}

That's a lot more verbose, of course, but gives the callers the 
flexibility. And might even be more readable than one function call with 
lots of arguments.

This would expose the concept of a buffer that's mapped but marked as 
IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing 
details that shouldn't be exposed. On the other hand, it might come 
handy. Instead of RBM_ZERO_AND_LOCK mode, for example, it might be handy 
to have a function that returns an IO-in-progress buffer that you can 
initialize any way you want.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote:
> We also do this in freespace.c and visibilitymap.c:
> 
>     /* Extend as needed. */
>     while (fsm_nblocks_now < fsm_nblocks)
>     {
>         PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
> 
>         smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
>                    pg.data, false);
>         fsm_nblocks_now++;
>      }
> 
> We could use the new smgrzeroextend function here. But it would be better to
> go through the buffer cache, because after this, the last block, at
> 'fsm_nblocks', will be read with ReadBuffer() and modified.

I doubt it's a particularly crucial thing to optimize.

But, uh, isn't this code racy? Because this doesn't go through shared buffers,
there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
that writing pages isn't atomic vs readers. So another connection could
see the new relation size, but a read might return a
partially written state of the page. Which then would cause checksum
failures. And even worse, I think it could lead to losing a write, if the
concurrent connection writes out a page.


> We could use BulkExtendSharedRelationBuffered() to extend the relation and
> keep the last page locked, but the BulkExtendSharedRelationBuffered()
> signature doesn't allow that. It can return the first N pages locked, but
> there's no way to return the *last* page locked.

We can't rely on bulk extending a potentially large number of pages in one
go anyway (since we might not be allowed to pin that many pages). So I don't
think requiring the last page to be locked is really a viable API.

I think for this case I'd just use the ExtendRelationTo() API we were
discussing nearby. Compared to the gain from reducing syscall / filesystem
overhead when extending the relation, the cost of the buffer mapping lookup doesn't
seem significant. That's different in e.g. the hio.c case, because there we
need a buffer with free space, and concurrent activity could otherwise fill up
the buffer before we can lock it again.


I had started hacking on ExtendRelationTo(), but then I saw problems with the
existing code that made me hesitate:
https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de


> Perhaps we should decompose this function into several function calls.
> Something like:
> 
> /* get N victim buffers, pinned and !BM_VALID */
> buffers = BeginExtendRelation(npages);
> 
> LockRelationForExtension(rel);
> 
> /* Insert buffers into buffer table */
> first_blk = smgrnblocks();
> for (i = 0, blk = first_blk; blk < last_blk; blk++, i++)
>     MapNewBuffer(blk, buffers[i]);
> 
> /* extend the file on disk */
> smgrzeroextend();
> 
> UnlockRelationForExtension(rel);
> 
> for (i = 0, blk = first_blk; blk < last_blk; blk++, i++)
> {
>     memset(BufferGetPage(buffers[i]), 0, BLCKSZ);
>     FinishNewBuffer(buffers[i]);
>     /* optionally lock the buffer */
>     LockBuffer(buffers[i], BUFFER_LOCK_EXCLUSIVE);
> }
> 
> That's a lot more verbose, of course, but gives the callers the flexibility.
> And might even be more readable than one function call with lots of
> arguments.

To me this seems like a quite bad idea. The amount of complexity this would
expose all over the tree is substantial, which would also make it harder to
further improve relation extension at a later date. It certainly shouldn't be
the default interface. And I'm not sure I see a promising use case.


> This would expose the concept of a buffer that's mapped but marked as
> IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing details
> that shouldn't be exposed. On the other hand, it might come handy. Instead
> of RBM_ZERO_AND_LOCK mode, for example, it might be handy to have a function
> that returns an IO-in-progress buffer that you can initialize any way you
> want.

I'd much rather encapsulate that in additional functions, or perhaps a
callback that can make decisions about what to do.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote:
> On 2023-Feb-21, Heikki Linnakangas wrote:
> 
> > > +static BlockNumber
> > > +BulkExtendSharedRelationBuffered(Relation rel,
> > > +                                 SMgrRelation smgr,
> > > +                                 bool skip_extension_lock,
> > > +                                 char relpersistence,
> > > +                                 ForkNumber fork, ReadBufferMode mode,
> > > +                                 BufferAccessStrategy strategy,
> > > +                                 uint32 *num_pages,
> > > +                                 uint32 num_locked_pages,
> > > +                                 Buffer *buffers)
> > 
> > Ugh, that's a lot of arguments, some are inputs and some are outputs. I
> > don't have any concrete suggestions, but could we simplify this somehow?
> > Needs a comment at least.
> 
> Yeah, I noticed this too.  I think it would be easy enough to add a new
> struct that can be passed as a pointer, which can be stack-allocated
> by the caller, and which holds the input arguments that are common to
> both functions, as is sensible.

I played a fair bit with various options. I ended up not using a struct to
pass most options, but instead went for a flags argument. However, I did use a
struct for passing either relation or smgr.


typedef enum ExtendBufferedFlags {
    EB_SKIP_EXTENSION_LOCK = (1 << 0),
    EB_IN_RECOVERY = (1 << 1),
    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
    EB_LOCK_FIRST = (1 << 3),

    /* internal flags follow */
    EB_RELEASE_PINS = (1 << 4),
} ExtendBufferedFlags;

/*
 * To identify the relation - either relation or smgr + relpersistence has to
 * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
 * to use the same function for both crash recovery and normal operation.
 */
typedef struct ExtendBufferedWhat
{
    Relation rel;
    struct SMgrRelationData *smgr;
    char relpersistence;
} ExtendBufferedWhat;

#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
/* requires use of EB_SKIP_EXTENSION_LOCK */
#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})


extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
                                ForkNumber forkNum,
                                BufferAccessStrategy strategy,
                                uint32 flags);
extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
                                       ForkNumber fork,
                                       BufferAccessStrategy strategy,
                                       uint32 flags,
                                       uint32 extend_by,
                                       Buffer *buffers,
                                       uint32 *extended_by);
extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
                                  ForkNumber fork,
                                  BufferAccessStrategy strategy,
                                  uint32 flags,
                                  BlockNumber extend_to,
                                  ReadBufferMode mode);

As you can see I removed ReadBufferMode from most of the functions (as
suggested by Heikki earlier). When extending by 1/multiple pages, we only need
to know whether to lock or not.

The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
falling back to reading the page normally if there was a concurrent relation
extension.

The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated,
gnarly code to do so from vm_extend() and fsm_extend().
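
To give a feel for the intended use, a hypothetical caller wanting up to 10
new blocks would look something like this (a sketch, not code from the
patchset):

Buffer      buffers[10];
uint32      extended_by = 0;
BlockNumber first_block;

/* returns the block number of the first newly added block */
first_block = ExtendBufferedRelBy(EB_REL(rel), MAIN_FORKNUM,
                                  NULL /* no strategy */,
                                  EB_LOCK_FIRST,
                                  lengthof(buffers),
                                  buffers, &extended_by);
/* with EB_LOCK_FIRST, buffers[0] is returned exclusively locked */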


I'm not sure about the function naming pattern. I do like 'By' a lot more than
the Bulk prefix I used before.


What do you think?


Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 27/02/2023 23:45, Andres Freund wrote:
> On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote:
>> We also do this in freespace.c and visibilitymap.c:
>>
>>      /* Extend as needed. */
>>      while (fsm_nblocks_now < fsm_nblocks)
>>      {
>>          PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
>>
>>          smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
>>                     pg.data, false);
>>          fsm_nblocks_now++;
>>       }
>>
>> We could use the new smgrzeroextend function here. But it would be better to
>> go through the buffer cache, because after this, the last block, at
>> 'fsm_nblocks', will be read with ReadBuffer() and modified.
> 
> I doubt it's a particularly crucial thing to optimize.

Yeah, it won't make any practical difference to performance. I'm more 
thinking about whether we can make this more consistent with other places 
where we extend a relation.

> But, uh, isn't this code racy? Because this doesn't go through shared buffers,
> there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
> that writing pages isn't atomic vs readers. So another connection could
> see the new relation size, but a read might return a
> partially written state of the page. Which then would cause checksum
> failures. And even worse, I think it could lead to losing a write, if the
> concurrent connection writes out a page.

fsm_readbuf and vm_readbuf check the relation size first, with 
smgrnblocks(), before trying to read the page. So to have a problem, the 
smgrnblocks() would have to already return the new size, but the 
smgrread() would not return the new contents. I don't think that's 
possible, but not sure.

>> We could use BulkExtendSharedRelationBuffered() to extend the relation and
>> keep the last page locked, but the BulkExtendSharedRelationBuffered()
>> signature doesn't allow that. It can return the first N pages locked, but
>> there's no way to return the *last* page locked.
> 
> We can't rely on bulk extending a potentially large number of pages in one
> go anyway (since we might not be allowed to pin that many pages). So I don't
> think requiring the last page to be locked is really a viable API.
> 
> I think for this case I'd just use the ExtendRelationTo() API we were
> discussing nearby. Compared to the gain from reducing syscall / filesystem
> overhead when extending the relation, the cost of the buffer mapping lookup doesn't
> seem significant. That's different in e.g. the hio.c case, because there we
> need a buffer with free space, and concurrent activity could otherwise fill up
> the buffer before we can lock it again.

Works for me.

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-01 11:12:35 +0200, Heikki Linnakangas wrote:
> On 27/02/2023 23:45, Andres Freund wrote:
> > But, uh, isn't this code racy? Because this doesn't go through shared buffers,
> > there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
> > that writing pages isn't atomic vs readers. So another connection could
> > see the new relation size, but a read might return a
> > partially written state of the page. Which then would cause checksum
> > failures. And even worse, I think it could lead to losing a write, if the
> > concurrent connection writes out a page.
> 
> fsm_readbuf and vm_readbuf check the relation size first, with
> smgrnblocks(), before trying to read the page. So to have a problem, the
> smgrnblocks() would have to already return the new size, but the smgrread()
> would not return the new contents. I don't think that's possible, but not
> sure.

I hacked Thomas' torn-read test program to ftruncate the file on the write
side.

It frequently observes a file size that's not the write size (e.g. reading 4k
when writing an 8k block).

After extending the test to more than one reader, I indeed also see torn
reads. So far all the tears have been at a 4k block boundary. However so far
it always has been *prior* page contents, not 0s.
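
For reference, a minimal sketch of this kind of torn-read check - this is not
Thomas' program (my adapted version is attached downthread), and it leaves out
the ftruncate and multi-reader parts:

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCKSZ 8192

int
main(void)
{
    char        buf[BLOCKSZ];
    int         fd = open("tornread.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);
    pid_t       writer;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    writer = fork();
    if (writer == 0)
    {
        /* writer: overwrite one block with a uniform, changing byte */
        for (unsigned int n = 0;; n++)
        {
            memset(buf, (int) (n % 251), BLOCKSZ);
            if (pwrite(fd, buf, BLOCKSZ, 0) != BLOCKSZ)
                _exit(1);
        }
    }

    /* reader: a block whose bytes differ from each other was read torn */
    for (long iter = 0; iter < 10L * 1000 * 1000; iter++)
    {
        ssize_t     nread = pread(fd, buf, BLOCKSZ, 0);

        for (ssize_t i = 1; i < nread; i++)
        {
            if (buf[i] != buf[0])
            {
                printf("torn read after %zd bytes (0x%02x vs 0x%02x)\n",
                       i, (unsigned char) buf[0], (unsigned char) buf[i]);
                break;
            }
        }
    }

    kill(writer, SIGKILL);
    return 0;
}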

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-01 09:02:00 -0800, Andres Freund wrote:
> On 2023-03-01 11:12:35 +0200, Heikki Linnakangas wrote:
> > On 27/02/2023 23:45, Andres Freund wrote:
> > > But, uh, isn't this code racy? Because this doesn't go through shared buffers,
> > > there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
> > > that writing pages isn't atomic vs readers. So another connection could
> > > see the new relation size, but a read might return a
> > > partially written state of the page. Which then would cause checksum
> > > failures. And even worse, I think it could lead to losing a write, if the
> > > concurrent connection writes out a page.
> > 
> > fsm_readbuf and vm_readbuf check the relation size first, with
> > smgrnblocks(), before trying to read the page. So to have a problem, the
> > smgrnblocks() would have to already return the new size, but the smgrread()
> > would not return the new contents. I don't think that's possible, but not
> > sure.
> 
> I hacked Thomas' torn-read test program to ftruncate the file on the write
> side.
> 
> It frequently observes a file size that's not the write size (e.g. reading 4k
> when writing an 8k block).
> 
> After extending the test to more than one reader, I indeed also see torn
> reads. So far all the tears have been at a 4k block boundary. However so far
> it always has been *prior* page contents, not 0s.

On tmpfs the failure rate is much higher, and we also end up reading 0s,
despite never writing them.

I've attached my version of the test program.

ext4: lots of 4k reads with 8k writes, some torn reads at 4k boundaries
xfs: no issues
tmpfs: loads of 4k reads with 8k writes, lots of torn reads reading 0s, some torn reads at 4k boundaries


Greetings,

Andres Freund

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Heikki Linnakangas
Date:
On 01/03/2023 09:33, Andres Freund wrote:
> On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote:
>> On 2023-Feb-21, Heikki Linnakangas wrote:
>>
>>>> +static BlockNumber
>>>> +BulkExtendSharedRelationBuffered(Relation rel,
>>>> +                                 SMgrRelation smgr,
>>>> +                                 bool skip_extension_lock,
>>>> +                                 char relpersistence,
>>>> +                                 ForkNumber fork, ReadBufferMode mode,
>>>> +                                 BufferAccessStrategy strategy,
>>>> +                                 uint32 *num_pages,
>>>> +                                 uint32 num_locked_pages,
>>>> +                                 Buffer *buffers)
>>>
>>> Ugh, that's a lot of arguments, some are inputs and some are outputs. I
>>> don't have any concrete suggestions, but could we simplify this somehow?
>>> Needs a comment at least.
>>
>> Yeah, I noticed this too.  I think it would be easy enough to add a new
>> struct that can be passed as a pointer, which can be stack-allocated
>> by the caller, and which holds the input arguments that are common to
>> both functions, as is sensible.
> 
> I played a fair bit with various options. I ended up not using a struct to
> pass most options, but instead went for a flags argument. However, I did use a
> struct for passing either relation or smgr.
> 
> 
> typedef enum ExtendBufferedFlags {
>     EB_SKIP_EXTENSION_LOCK = (1 << 0),
>     EB_IN_RECOVERY = (1 << 1),
>     EB_CREATE_FORK_IF_NEEDED = (1 << 2),
>     EB_LOCK_FIRST = (1 << 3),
> 
>     /* internal flags follow */
>     EB_RELEASE_PINS = (1 << 4),
> } ExtendBufferedFlags;

Is EB_IN_RECOVERY always set when RecoveryInProgress()? Is it really 
needed? What does EB_LOCK_FIRST do?

> /*
>   * To identify the relation - either relation or smgr + relpersistence has to
>   * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
>   * to use the same function for both crash recovery and normal operation.
>   */
> typedef struct ExtendBufferedWhat
> {
>     Relation rel;
>     struct SMgrRelationData *smgr;
>     char relpersistence;
> } ExtendBufferedWhat;
> 
> #define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
> /* requires use of EB_SKIP_EXTENSION_LOCK */
> #define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})

Clever. I'm still not 100% convinced we need the EB_SMGR variant, but 
with this we'll have the flexibility in any case.

> extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
>                                 ForkNumber forkNum,
>                                 BufferAccessStrategy strategy,
>                                 uint32 flags);
> extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
>                                        ForkNumber fork,
>                                        BufferAccessStrategy strategy,
>                                        uint32 flags,
>                                        uint32 extend_by,
>                                        Buffer *buffers,
>                                        uint32 *extended_by);
> extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
>                                   ForkNumber fork,
>                                   BufferAccessStrategy strategy,
>                                   uint32 flags,
>                                   BlockNumber extend_to,
>                                   ReadBufferMode mode);
> 
> As you can see I removed ReadBufferMode from most of the functions (as
> suggested by Heikki earlier). When extending by 1/multiple pages, we only need
> to know whether to lock or not.

Ok, that's better. Still complex and a lot of arguments, but I don't 
have any great suggestions on how to improve it.

> > The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
> > falling back to reading the page normally if there was a concurrent relation
> extension.

Hmm, I think you'll need another return value, to let the caller know if 
the relation was extended or not. Or a flag to ereport(ERROR) if the 
page already exists, for ginbuild() and friends.

> The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated,
> > gnarly code to do so from vm_extend() and fsm_extend().

Makes sense.

> I'm not sure about the function naming pattern. I do like 'By' a lot more than
> the Bulk prefix I used before.

+1

- Heikki




Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-02 00:04:14 +0200, Heikki Linnakangas wrote:
> On 01/03/2023 09:33, Andres Freund wrote:
> > typedef enum ExtendBufferedFlags {
> >     EB_SKIP_EXTENSION_LOCK = (1 << 0),
> >     EB_IN_RECOVERY = (1 << 1),
> >     EB_CREATE_FORK_IF_NEEDED = (1 << 2),
> >     EB_LOCK_FIRST = (1 << 3),
> >
> >     /* internal flags follow */
> >     EB_RELEASE_PINS = (1 << 4),
> > } ExtendBufferedFlags;
>
> Is EB_IN_RECOVERY always set when RecoveryInProgress()? Is it really needed?

Right now it's just passed in from the caller. At the moment it's just needed
to know what to pass to smgrcreate(isRedo).

However, XLogReadBufferExtended() doesn't currently use this path, so maybe
it's not actually needed.


> What does EB_LOCK_FIRST do?

Lock the first returned buffer; this is basically the replacement for
num_locked_buffers from the earlier version. I think it's likely that either
locking the first, or potentially at some later point locking all buffers, is
all that's needed for ExtendBufferedRelBy().

EB_LOCK_FIRST_BUFFER would perhaps be better?


> > /*
> >   * To identify the relation - either relation or smgr + relpersistence has to
> >   * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
> >   * to use the same function for both crash recovery and normal operation.
> >   */
> > typedef struct ExtendBufferedWhat
> > {
> >     Relation rel;
> >     struct SMgrRelationData *smgr;
> >     char relpersistence;
> > } ExtendBufferedWhat;
> >
> > #define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
> > /* requires use of EB_SKIP_EXTENSION_LOCK */
> > #define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})
>
> Clever. I'm still not 100% convinced we need the EB_SMGR variant, but with
> this we'll have the flexibility in any case.

Hm - how would you use it from XLogReadBufferExtended() without that?


XLogReadBufferExtended() spends a disappointing amount of time in
smgropen(). Quite visible in profiles.

In the plain read case there's at least one smgropen() in
XLogReadBufferExtended() itself, then another in ReadBufferWithoutRelcache().
The extension path right now is worse - it does one smgropen() for each
extended block.

I think we should avoid using ReadBufferWithoutRelcache() in
XLogReadBufferExtended() in the read path as well, but that's for later.


> > extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
> >                                 ForkNumber forkNum,
> >                                 BufferAccessStrategy strategy,
> >                                 uint32 flags);
> > extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
> >                                        ForkNumber fork,
> >                                        BufferAccessStrategy strategy,
> >                                        uint32 flags,
> >                                        uint32 extend_by,
> >                                        Buffer *buffers,
> >                                        uint32 *extended_by);
> > extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
> >                                   ForkNumber fork,
> >                                   BufferAccessStrategy strategy,
> >                                   uint32 flags,
> >                                   BlockNumber extend_to,
> >                                   ReadBufferMode mode);
> >
> > As you can see I removed ReadBufferMode from most of the functions (as
> > suggested by Heikki earlier). When extending by 1/multiple pages, we only need
> > to know whether to lock or not.
>
> Ok, that's better. Still complex and a lot of arguments, but I don't have
> any great suggestions on how to improve it.

I don't think there are going to be all that many callers of
ExtendBufferedRelBy() and ExtendBufferedRelTo(); most are going to use
ExtendBufferedRel(), I think. So the complexity seems acceptable.


> > The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
> > falling back to reading the page normally if there was a concurrent relation
> > extension.
>
> Hmm, I think you'll need another return value, to let the caller know if the
> relation was extended or not. Or a flag to ereport(ERROR) if the page
> already exists, for ginbuild() and friends.

I don't think ginbuild() et al need to use ExtendBufferedRelTo()? A plain
ExtendBufferedRel() should suffice.  The places that do need it are
fsm_extend() and vm_extend() - I did end up avoiding the redundant lookup.

But I was wondering about a flag controlling this as well.


Attached is my current version of this. Still needs more polishing, including
comments explaining the flags. But I thought it'd be useful to have it out
there.

There's two new patches in the series:
- a patch to not initialize pages in the loop in fsm_extend(), vm_extend()
  anymore - we have a check about initializing pages at a later point, so
  there doesn't really seem to be a need for it?
- a patch to use the new ExtendBufferedRelTo() in fsm_extend(), vm_extend()
  and XLogReadBufferExtended()


In this version I also tried to address some of the other feedback raised in
the thread. One thing I haven't decided what to do about yet is David's
feedback about a version of PinLocalBuffer() that doesn't adjust the
usagecount, which wouldn't need to read the buf_state.

Greetings,

Andres Freund

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2022-10-28 19:54:20 -0700, Andres Freund wrote:
> I've done a fair bit of benchmarking of this patchset. For COPY it comes out
> ahead everywhere. It's possible that there's a very small regression for
> extremely IO miss heavy workloads, more below.
> 
> 
> server "base" configuration:
> 
> max_wal_size=150GB
> shared_buffers=24GB
> huge_pages=on
> autovacuum=0
> backend_flush_after=2MB
> max_connections=5000
> wal_buffers=128MB
> wal_segment_size=1GB
> 
> benchmark: pgbench running COPY into a single table. pgbench -t is set
> according to the client count, so that the same amount of data is inserted.
> This is done oth using small files ([1], ringbuffer not effective, no dirty
> data to write out within the benchmark window) and a bit larger files ([2],
> lots of data to write out due to ringbuffer).
> 
> To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.
> 
> s_b=24GB
> 
> test: unlogged_small_files, format: text, files: 1024, 9015MB total
>         seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
> clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
> 1       58.63   207     50.22   242     54.35   224
> 2       32.67   372     25.82   472     27.30   446
> 4       22.53   540     13.30   916     14.33   851
> 8       15.14   804     7.43    1640    7.48    1632
> 16      14.69   829     4.79    2544    4.50    2718
> 32      15.28   797     4.41    2763    3.32    3710
> 64      15.34   794     5.22    2334    3.06    4061
> 128     15.49   786     4.97    2452    3.13    3926
> 256     15.85   768     5.02    2427    3.26    3769
> 512     16.02   760     5.29    2303    3.54    3471

I just spent a few hours trying to reproduce these benchmark results. For the
longest time I could not get the numbers for *HEAD* to even get close to the
above, while the numbers for the patch were very close.

I was worried it was a performance regression in HEAD etc. But no, same git
commit as back then produces the same issue.


As it turns out, I somehow screwed up my benchmark tooling, and I did not set
the CPU "scaling_governor" and "energy_performance_preference" to
"performance". In a crazy turn of events, that approximately makes no
difference with the patch applied, and a 2x difference for HEAD.


I suspect this is some pathological issue when encountering heavy lock
contention (likely leading to the CPU reducing speed into a deeper state,
which then takes longer to get out of when the lock is released). As the lock
contention is drastically reduced with the patch, that effect is not visible
anymore.


After fixing the performance scaling issue, the results are quite close to the
above numbers again...


Aargh, I want my afternoon back.


Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Melanie Plageman
Date:
Hi,

Below is my review of a slightly older version than you just posted --
much of it you may have already addressed.

From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 20 Mar 2023 21:57:40 -0700
Subject: [PATCH v5 01/15] createdb-using-wal-fixes

This could use a more detailed commit message -- I don't really get what
it is doing

From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v5 02/15] Add some error checking around pinning

---
 src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++---------
 src/include/storage/bufmgr.h        |  1 +
 2 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 95212a3941..fa20fab5a2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
                                     LW_EXCLUSIVE);
 }

+void
+BufferCheckOneLocalPin(Buffer buffer)
+{
+    if (BufferIsLocal(buffer))
+    {
+        /* There should be exactly one pin */
+        if (LocalRefCount[-buffer - 1] != 1)
+            elog(ERROR, "incorrect local pin count: %d",
+                LocalRefCount[-buffer - 1]);
+    }
+    else
+    {
+        /* There should be exactly one local pin */
+        if (GetPrivateRefCount(buffer) != 1)

I'd rather this be an else if (was already like this, but no reason not
to change it now)

+            elog(ERROR, "incorrect local pin count: %d",
+                GetPrivateRefCount(buffer));
+    }
+}

From 00d3044770478eea31e00fee8d1680f22ca6adde Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 27 Feb 2023 17:36:37 -0800
Subject: [PATCH v5 04/15] Add smgrzeroextend(), FileZero(), FileFallocate()

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9fd8444ed4..c34ed41d52 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2206,6 +2206,92 @@ FileSync(File file, uint32 wait_event_info)
     return returnCode;
 }

+/*
+ * Zero a region of the file.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+    int            returnCode;
+    ssize_t        written;
+
+    Assert(FileIsValid(file));
+    returnCode = FileAccess(file);
+    if (returnCode < 0)
+        return returnCode;
+
+    pgstat_report_wait_start(wait_event_info);
+    written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
+    pgstat_report_wait_end();
+
+    if (written < 0)
+        return -1;
+    else if (written != amount)

this doesn't need to be an else if

+    {
+        /* if errno is unset, assume problem is no disk space */
+        if (errno == 0)
+            errno = ENOSPC;
+        return -1;
+    }

+int
+FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+    int            returnCode;
+
+    Assert(FileIsValid(file));
+    returnCode = FileAccess(file);
+    if (returnCode < 0)
+        return returnCode;
+
+    pgstat_report_wait_start(wait_event_info);
+    returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+    pgstat_report_wait_end();
+
+    if (returnCode == 0)
+        return 0;
+
+    /* for compatibility with %m printing etc */
+    errno = returnCode;
+
+    /*
+    * Return in cases of a "real" failure, if fallocate is not supported,
+    * fall through to the FileZero() backed implementation.
+    */
+    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+        return returnCode;

I'm pretty sure you can just delete the below if statement

+    if (returnCode == 0 ||
+        (returnCode != EINVAL && returnCode != EINVAL))
+        return returnCode;

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..59a65a8305 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,6 +28,7 @@
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -500,6 +501,116 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
     Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
 }

+/*
+ *    mdzeroextend() -- Add ew zeroed out blocks to the specified relation.

not sure what ew is

+ *
+ *        Similar to mdrextend(), except the relation can be extended by

mdrextend->mdextend

+ *        multiple blocks at once, and that the added blocks will be filled with

I would lose the comma and just say "and the added blocks will be filled..."

+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+            BlockNumber blocknum, int nblocks, bool skipFsync)

So, I think there are a few too many local variables in here, and it
actually makes it more confusing.
Assuming you would like to keep the input parameters blocknum and
nblocks unmodified for debugging/other reasons, here is a suggested
refactor of this function.
Also, I think you can combine the two error cases (I don't know if the
user cares what you were trying to extend the file with). I've done this
below also.

void
mdzeroextend(SMgrRelation reln, ForkNumber forknum,
            BlockNumber blocknum, int nblocks, bool skipFsync)
{
    MdfdVec    *v;
    BlockNumber curblocknum = blocknum;
    int            remblocks = nblocks;

    Assert(nblocks > 0);

    /* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
    Assert(blocknum >= mdnblocks(reln, forknum));
#endif

    /*
    * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
    * more --- we mustn't create a block whose number actually is
    * InvalidBlockNumber or larger.
    */
    if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                errmsg("cannot extend file \"%s\" beyond %u blocks",
                        relpath(reln->smgr_rlocator, forknum),
                        InvalidBlockNumber)));

    while (remblocks > 0)
    {
        int            segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
        int            numblocks = remblocks;
        off_t        seekpos = (off_t) BLCKSZ * segstartblock;
        int            ret;

        if (segstartblock + remblocks > RELSEG_SIZE)
            numblocks = RELSEG_SIZE - segstartblock;

        v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync,
                         EXTENSION_CREATE);

        /*
        * If available and useful, use posix_fallocate() (via FileFallocate())
        * to extend the relation. That's often more efficient than using
        * write(), as it commonly won't cause the kernel to allocate page
        * cache space for the extended pages.
        *
        * However, we don't use FileAllocate() for small extensions, as it
        * defeats delayed allocation on some filesystems. Not clear where
        * that decision should be made though? For now just use a cutoff of
        * 8, anything between 4 and 8 worked OK in some local testing.
        */
        if (numblocks > 8)
            ret = FileFallocate(v->mdfd_vfd,
                                seekpos, (off_t) BLCKSZ * numblocks,
                                WAIT_EVENT_DATA_FILE_EXTEND);
        else
            /*
            * Even if we don't want to use fallocate, we can still extend a
            * bit more efficiently than writing each 8kB block individually.
            * pg_pwrite_zeros() (via FileZero()) uses
            * pg_pwritev_with_retry() to avoid multiple writes or needing a
            * zeroed buffer for the whole length of the extension.
            */
            ret = FileZero(v->mdfd_vfd,
                          seekpos, (off_t) BLCKSZ * numblocks,
                          WAIT_EVENT_DATA_FILE_EXTEND);

        if (ret != 0)
            ereport(ERROR,
                    errcode_for_file_access(),
                    errmsg("could not extend file \"%s\": %m",
                            FilePathName(v->mdfd_vfd)),
                    errhint("Check free disk space."));

        if (!skipFsync && !SmgrIsTemp(reln))
            register_dirty_segment(reln, forknum, v);

        Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));

        remblocks -= numblocks;
        curblocknum += numblocks;
    }
}

diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dc466e5414..5224ca5259 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,8 @@ typedef struct f_smgr
+/*
+ *    smgrzeroextend() -- Add new zeroed out blocks to a file.
+ *
+ *        Similar to smgrextend(), except the relation can be extended by
+ *        multiple blocks at once, and that the added blocks will be filled with
+ *        zeroes.
+ */

Similar grammatical feedback as mdzeroextend.


From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa20fab5a2..6f50dbd212 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
 }

 void
-BufferCheckOneLocalPin(Buffer buffer)
+BufferCheckWePinOnce(Buffer buffer)

This name is weird. Who is we?

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..798c5b93a8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+    uint32        buf_state;
+    Buffer        buffer = BufferDescriptorGetBuffer(buf_hdr);
+    int            bufid = -(buffer + 1);

You do
  int            buffid = -buffer - 1;
in UnpinLocalBuffer()
They should be consistent.

  int            bufid = -(buffer + 1);

I think this version is better:

  int            buffid = -buffer - 1;

Since if buffer is INT_MAX, then the -(buffer + 1) version invokes
undefined behavior while the -buffer - 1 version doesn't.
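
That is:

    int     buffer = INT_MAX;       /* from <limits.h> */
    int     a = -(buffer + 1);      /* buffer + 1 overflows: undefined */
    int     b = -buffer - 1;        /* INT_MIN, well defined */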

From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:26:34 -0800
Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately

Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. Particularly for extending relations
needing to hold the extension lock while acquiring a victim buffer is
painful. By separating out the victim buffer acquisition, future commits will
be able to change relation extensions to scale better.

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3d0683593f..ea423ae484 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,

     /*
     * Buffer contents are currently invalid.  Try to obtain the right to
     * start I/O.  If StartBufferIO returns false, then someone else managed
     * to read it before we did, so there's nothing left for BufferAlloc() to
     * do.
     */
-    if (StartBufferIO(buf, true))
+    if (StartBufferIO(victim_buf_hdr, true))
         *foundPtr = false;
     else
         *foundPtr = true;

I know it was already like this, but since you edited the line already,
can we just make this this now?

    *foundPtr = !StartBufferIO(victim_buf_hdr, true);

@@ -1595,6 +1413,237 @@ retry:
     StrategyFreeBuffer(buf);
 }

+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called on a buffer with a valid tag, pinned, but without the
+ * buffer header spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+    /*
+    * Clear out the buffer's tag and flags and usagecount.  This is not
+    * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
+    * doing anything with the buffer. But currently it's beneficial as the
+    * pre-check for several linear scans of shared buffers just checks the
+    * tag.

I don't really understand the above comment -- mainly the last sentence.

+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
+{
+    BufferDesc *buf_hdr;
+    Buffer        buf;
+    uint32        buf_state;
+    bool        from_ring;
+
+    /*
+    * Ensure, while the spinlock's not yet held, that there's a free refcount
+    * entry.
+    */
+    ReservePrivateRefCountEntry();
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    /* we return here if a prospective victim buffer gets used concurrently */
+again:

Why use goto instead of a loop here (again is the goto label)?


From a7597b79dffaf96807f4a9beea0a39634530298d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 16:44:16 -0700
Subject: [PATCH v5 08/15] bufmgr: Support multiple in-progress IOs by using
 resowner

Commit message should describe why we couldn't support multiple
in-progress IOs before, I think (e.g. we couldn't be sure that we
cleared IO_IN_PROGRESS if something happened).

@@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
 {
     uint32        buf_state;

I noticed that the comment above TerminateBufferIO() says
 * TerminateBufferIO: release a buffer we were doing I/O on
 *    (Assumptions)
 *    My process is executing IO for the buffer

Can we still say this is an assumption? What about when it is being
cleaned up after being called from AbortBufferIO()

diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45..fccc59b39d 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData

     /* We have built-in support for remembering: */
     ResourceArray bufferarr;    /* owned buffers */
+    ResourceArray bufferioarr;    /* in-progress buffer IO */
     ResourceArray catrefarr;    /* catcache references */
     ResourceArray catlistrefarr;    /* catcache-list pins */
     ResourceArray relrefarr;    /* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)

Maybe worth mentioning in-progress buffer IO in resowner README? I know
it doesn't claim to be exhaustive, so, up to you.

Also, I realize that existing code in this file has the extraneous
parentheses, but maybe it isn't worth staying consistent with that?
as in: &(owner->bufferioarr)

+ */
+void
+ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
+{
+    ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer));
+}
+


From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:24:19 -0800
Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
 ExtendBufferedRel{By,To,}

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c95b87bca..4e07a5bc48 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+/*
+ * Extend relation by multiple blocks.
+ *
+ * Tries to extend the relation by extend_by blocks. Depending on the
+ * availability of resources the relation may end up being extended by a
+ * smaller number of pages (unless an error is thrown, always by at least one
+ * page). *extended_by is updated to the number of pages the relation has been
+ * extended to.
+ *
+ * buffers needs to be an array that is at least extend_by long. Upon
+ * completion, the first extend_by array elements will point to a pinned
+ * buffer.
+ *
+ * If EB_LOCK_FIRST is part of flags, the first returned buffer is
+ * locked. This is useful for callers that want a buffer that is guaranteed to
+ * be empty.

This should document what the returned BlockNumber is.
Also, instead of having extend_by and extended_by, how about just having
one which is set by the caller to the desired number to extend by and
then overwritten in this function to the value it successfully extended
by.

It would be nice if the function returned the number it extended by
instead of the BlockNumber.

+ */
+BlockNumber
+ExtendBufferedRelBy(ExtendBufferedWhat eb,
+                    ForkNumber fork,
+                    BufferAccessStrategy strategy,
+                    uint32 flags,
+                    uint32 extend_by,
+                    Buffer *buffers,
+                    uint32 *extended_by)
+{
+    Assert((eb.rel != NULL) ^ (eb.smgr != NULL));

Can we turn these into !=

  Assert((eb.rel != NULL) != (eb.smgr != NULL));

since it is easier to understand.

(it is also in ExtendBufferedRelTo())

+    Assert(eb.smgr == NULL || eb.relpersistence != 0);
+    Assert(extend_by > 0);
+
+    if (eb.smgr == NULL)
+    {
+        eb.smgr = RelationGetSmgr(eb.rel);
+        eb.relpersistence = eb.rel->rd_rel->relpersistence;
+    }
+
+    return ExtendBufferedRelCommon(eb, fork, strategy, flags,
+                                  extend_by, InvalidBlockNumber,
+                                  buffers, extended_by);
+}
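
(For orientation, a caller would look roughly like this - a sketch, with the
struct-literal initialization assumed from the ExtendBufferedWhat definition
quoted further down:)

    Buffer        buffers[4];
    uint32        extended_by = 0;
    BlockNumber   first_block;

    first_block = ExtendBufferedRelBy((ExtendBufferedWhat) {.rel = rel},
                                      MAIN_FORKNUM, NULL /* strategy */,
                                      EB_LOCK_FIRST, lengthof(buffers),
                                      buffers, &extended_by);
    /* buffers[0] is pinned and locked, the rest are just pinned */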

+ * Extend the relation so it is at least extend_to blocks large, read buffer

Use of "read buffer" here is confusing. We only read the block if, after
we try extending the relation, someone else already did so and we have
to read the block they extended in, right?

+ * (extend_to - 1).
+ *
+ * This is useful for callers that want to write a specific page, regardless
+ * of the current size of the relation (e.g. useful for visibilitymap and for
+ * crash recovery).
+ */
+Buffer
+ExtendBufferedRelTo(ExtendBufferedWhat eb,
+                    ForkNumber fork,
+                    BufferAccessStrategy strategy,
+                    uint32 flags,
+                    BlockNumber extend_to,
+                    ReadBufferMode mode)
+{

+    while (current_size < extend_to)
+    {

Can declare buffers variable here.
+    Buffer buffers[64];

+        uint32 num_pages = lengthof(buffers);
+        BlockNumber first_block;
+
+        if ((uint64) current_size + num_pages > extend_to)
+            num_pages = extend_to - current_size;
+
+        first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
+                                             num_pages, extend_to,
+                                             buffers, &extended_by);
+
+        current_size = first_block + extended_by;
+        Assert(current_size <= extend_to);
+        Assert(num_pages != 0 || current_size >= extend_to);
+
+        for (int i = 0; i < extended_by; i++)
+        {
+            if (first_block + i != extend_to - 1)

Is there a way we could avoid pinning these other buffers to begin with
(e.g. passing a parameter to ExtendBufferedRelCommon())

+                ReleaseBuffer(buffers[i]);
+            else
+                buffer = buffers[i];
+        }
+    }

+    /*
+    * It's possible that another backend concurrently extended the
+    * relation. In that case read the buffer.
+    *
+    * XXX: Should we control this via a flag?
+    */

I feel like there needs to be a more explicit comment about how you
could end up in this situation -- e.g. someone else extends the relation
and so smgrnblocks returns a value that is greater than extend_to, so
buffer stays InvalidBuffer

+    if (buffer == InvalidBuffer)
+    {
+        bool hit;
+
+        Assert(extended_by == 0);
+        buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
+                                  fork, extend_to - 1, mode, strategy,
+                                  &hit);
+    }
+
+    return buffer;
+}

Do we use compound literals? Here, this could be:

        buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
                                  fork, extend_to - 1, mode, strategy,
                                  &(bool) {0});

To eliminate the extraneous hit variable.



 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
     bool        found;
     IOContext    io_context;
     IOObject    io_object;
-    bool        isExtend;
     bool        isLocalBuf = SmgrIsTemp(smgr);

     *hit = false;

+    /*
+    * Backward compatibility path, most code should use
+    * ExtendRelationBuffered() instead, as acquiring the extension lock
+    * inside ExtendRelationBuffered() scales a lot better.

Think these are old function names in the comment

+static BlockNumber
+ExtendBufferedRelShared(ExtendBufferedWhat eb,
+                        ForkNumber fork,
+                        BufferAccessStrategy strategy,
+                        uint32 flags,
+                        uint32 extend_by,
+                        BlockNumber extend_upto,
+                        Buffer *buffers,
+                        uint32 *extended_by)
+{
+    BlockNumber first_block;
+    IOContext    io_context = IOContextForStrategy(strategy);
+
+    LimitAdditionalPins(&extend_by);
+
+    /*
+    * Acquire victim buffers for extension without holding extension lock.
+    * Writing out victim buffers is the most expensive part of extending the
+    * relation, particularly when doing so requires WAL flushes. Zeroing out
+    * the buffers is also quite expensive, so do that before holding the
+    * extension lock as well.
+    *
+    * These pages are pinned by us and not valid. While we hold the pin they
+    * can't be acquired as victim buffers by another backend.
+    */
+    for (uint32 i = 0; i < extend_by; i++)
+    {
+        Block        buf_block;
+
+        buffers[i] = GetVictimBuffer(strategy, io_context);
+        buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+        /* new buffers are zero-filled */
+        MemSet((char *) buf_block, 0, BLCKSZ);
+    }
+
+    /*
+    * Lock relation against concurrent extensions, unless requested not to.
+    *
+    * We use the same extension lock for all forks. That's unnecessarily
+    * restrictive, but currently extensions for forks don't happen often
+    * enough to make it worth locking more granularly.
+    *
+    * Note that another backend might have extended the relation by the time
+    * we get the lock.
+    */
+    if (!(flags & EB_SKIP_EXTENSION_LOCK))
+    {
+        LockRelationForExtension(eb.rel, ExclusiveLock);
+        eb.smgr = RelationGetSmgr(eb.rel);
+    }
+
+    /*
+    * If requested, invalidate size cache, so that smgrnblocks asks the
+    * kernel.
+    */
+    if (flags & EB_CLEAR_SIZE_CACHE)
+        eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;

I don't see this in master, is it new?

+    first_block = smgrnblocks(eb.smgr, fork);
+

The below needs a better comment explaining what it is handling. e.g. if
we end up extending by less than we planned, unpin all of the surplus
victim buffers.

+    if (extend_upto != InvalidBlockNumber)
+    {
+        uint32 old_num_pages = extend_by;

maybe call this something like original_extend_by

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5b44b0be8b..0528fddf99 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
+BlockNumber
+ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+                      ForkNumber fork,
+                      uint32 flags,
+                      uint32 extend_by,
+                      BlockNumber extend_upto,
+                      Buffer *buffers,
+                      uint32 *extended_by)
+{

+        victim_buf_id = -(buffers[i] + 1);

same comment here as before.

+ * Flags influencing the behaviour of ExtendBufferedRel*
+ */
+typedef enum ExtendBufferedFlags
+{
+    /*
+    * Don't acquire extension lock. This is safe only if the relation isn't
+    * shared, an access exclusive lock is held or if this is the startup
+    * process.
+    */
+    EB_SKIP_EXTENSION_LOCK = (1 << 0),
+
+    /* Is this extension part of recovery? */
+    EB_PERFORMING_RECOVERY = (1 << 1),
+
+    /*
+    * Should the fork be created if it does not currently exist? This likely
+    * only ever makes sense for relation forks.
+    */
+    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
+
+    /* Should the first (possibly only) return buffer be returned locked? */
+    EB_LOCK_FIRST = (1 << 3),
+
+    /* Should the smgr size cache be cleared? */
+    EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+    /* internal flags follow */

I don't understand what this comment means ("internal flags follow")

+    EB_LOCK_TARGET = (1 << 5),
+} ExtendBufferedFlags;

+typedef struct ExtendBufferedWhat

Maybe this should be called something like BufferedExtendTarget?

+{
+    Relation rel;
+    struct SMgrRelationData *smgr;
+    char relpersistence;
+} ExtendBufferedWhat;

From e4438c0eb87035e4cefd1de89458a8d88c90c0e3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:44:43 -0700
Subject: [PATCH v5 11/15] heapam: Add num_pages to RelationGetBufferForTuple()

This will be useful to compute the number of pages to extend a relation by.


diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cf4b917eb4..500904897d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2050,6 +2053,33 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
         return tup;
 }

+/*
+ * Helper for heap_multi_insert() that computes the number of full pages s

no space after "page" before "s"

+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples,
+                        Size saveFreeSpace)
+{
+    size_t        page_avail;
+    int            npages = 0;
+
+    page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+    npages++;

can this not just be this:

size_t        page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
int            npages = 1;


From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:14:11 -0700
Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy()

---
 src/backend/access/heap/hio.c | 285 +++++++++++++++++-----------------
 1 file changed, 146 insertions(+), 139 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 65886839e7..48cfcff975 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len,

so in RelationGetBufferForTuple() up above where your changes start,
there is this code


    /*
    * We first try to put the tuple on the same page we last inserted a tuple
    * on, as cached in the BulkInsertState or relcache entry.  If that
    * doesn't work, we ask the Free Space Map to locate a suitable page.
    * Since the FSM's info might be out of date, we have to be prepared to
    * loop around and retry multiple times. (To insure this isn't an infinite
    * loop, we must update the FSM with the correct amount of free space on
    * each page that proves not to be suitable.)  If the FSM has no record of
    * a page with enough free space, we give up and extend the relation.
    *
    * When use_fsm is false, we either put the tuple onto the existing target
    * page or extend the relation.
    */
    if (bistate && bistate->current_buf != InvalidBuffer)
    {
        targetBlock = BufferGetBlockNumber(bistate->current_buf);
    }
    else
        targetBlock = RelationGetTargetBlock(relation);

    if (targetBlock == InvalidBlockNumber && use_fsm)
    {
        /*
        * We have no cached target page, so ask the FSM for an initial
        * target.
        */
        targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
    }

And I was thinking: ReadBufferBI() only has one caller now
(RelationGetBufferForTuple()), and this caller basically has already
checked for the case handled inside ReadBufferBI() (the code I pasted
above)

    /* If we have the desired block already pinned, re-pin and return it */
    if (bistate->current_buf != InvalidBuffer)
    {
        if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
        {
            /*
            * Currently the LOCK variants are only used for extending
            * relation, which should never reach this branch.
            */
            Assert(mode != RBM_ZERO_AND_LOCK &&
                  mode != RBM_ZERO_AND_CLEANUP_LOCK);

            IncrBufferRefCount(bistate->current_buf);
            return bistate->current_buf;
        }
        /* ... else drop the old buffer */

So, I was thinking maybe there is some way to inline the logic for
ReadBufferBI(), because I think it would feel more streamlined to me.

@@ -558,18 +477,46 @@ loop:
             ReleaseBuffer(buffer);
         }

Oh, and I forget which commit introduced BulkInsertState->next_free and
last_free, but I remember thinking that it didn't seem to fit with the
other parts of that commit.

-        /* Without FSM, always fall out of the loop and extend */
-        if (!use_fsm)
-            break;
+        if (bistate
+            && bistate->next_free != InvalidBlockNumber
+            && bistate->next_free <= bistate->last_free)
+        {
+            /*
+            * We bulk extended the relation before, and there are still some
+            * unused pages from that extension, so we don't need to look in
+            * the FSM for a new page. But do record the free space from the
+            * last page, somebody might insert narrower tuples later.
+            */

Why couldn't we have found out that we bulk-extended before and gotten the
block from there up above the while loop?

+            if (use_fsm)
+                RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);

-        /*
-        * Update FSM as to condition of this page, and ask for another page
-        * to try.
-        */
-        targetBlock = RecordAndGetPageWithFreeSpace(relation,
-                                                    targetBlock,
-                                                    pageFreeSpace,
-                                                    targetFreeSpace);
+            Assert(bistate->last_free != InvalidBlockNumber &&

You don't need the second half of the assert.

+                  bistate->next_free <= bistate->last_free);
+            targetBlock = bistate->next_free;
+            if (bistate->next_free >= bistate->last_free)

they can only be equal at this point

+            {
+                bistate->next_free = InvalidBlockNumber;
+                bistate->last_free = InvalidBlockNumber;
+            }
+            else
+                bistate->next_free++;
+        }
+        else if (!use_fsm)
+        {
+            /* Without FSM, always fall out of the loop and extend */
+            break;
+        }

It would be nice to have a comment explaining why this is in its own
else if instead of breaking earlier (i.e. !use_fsm is still a valid case
in the if branch above it)

+        else
+        {
+            /*
+            * Update FSM as to condition of this page, and ask for another
+            * page to try.
+            */
+            targetBlock = RecordAndGetPageWithFreeSpace(relation,
+                                                        targetBlock,
+                                                        pageFreeSpace,
+                                                        targetFreeSpace);
+        }

we can get rid of needLock and waitcount variables like this

+#define MAX_BUFFERS 64
+        Buffer        victim_buffers[MAX_BUFFERS];
+        BlockNumber firstBlock = InvalidBlockNumber;
+        BlockNumber firstBlockFSM = InvalidBlockNumber;
+        BlockNumber curBlock;
+        uint32        extend_by_pages;
+        uint32        no_fsm_pages;
+        uint32        waitcount;
+
+        extend_by_pages = num_pages;
+
+        /*
+        * Multiply the number of pages to extend by the number of waiters. Do
+        * this even if we're not using the FSM, as it does relieve
+        * contention. Pages will be found via bistate->next_free.
+        */
+        if (needLock)
+            waitcount = RelationExtensionLockWaiterCount(relation);
+        else
+            waitcount = 0;
+        extend_by_pages += extend_by_pages * waitcount;

        if (!RELATION_IS_LOCAL(relation))
            extend_by_pages += extend_by_pages *
                RelationExtensionLockWaiterCount(relation);

+
+        /*
+        * can't extend by more than MAX_BUFFERS, we need to pin them all
+        * concurrently. FIXME: Need an NBuffers / MaxBackends type limit
+        * here.
+        */
+        extend_by_pages = Min(extend_by_pages, MAX_BUFFERS);
+
+        /*
+        * How many of the extended pages not to enter into the FSM.
+        *
+        * Only enter pages that we don't need ourselves into the FSM.
+        * Otherwise every other backend will immediately try to use the pages
+        * this backend needs itself, causing unnecessary contention.
+        *
+        * Bulk extended pages are remembered in bistate->next_free_buffer. So
+        * without a bistate we can't directly make use of them.
+        *
+        * Never enter the page returned into the FSM, we'll immediately use
+        * it.
+        */
+        if (num_pages > 1 && bistate == NULL)
+            no_fsm_pages = 1;
+        else
+            no_fsm_pages = num_pages;

this is more clearly written as:
  no_fsm_pages = bistate == NULL ? 1 : num_pages;

-    /*
-    * Release the file-extension lock; it's now OK for someone else to extend
-    * the relation some more.
-    */
-    if (needLock)
-        UnlockRelationForExtension(relation, ExclusiveLock);
+        if (bistate)
+        {
+            if (extend_by_pages > 1)
+            {
+                bistate->next_free = firstBlock + 1;
+                bistate->last_free = firstBlock + extend_by_pages - 1;
+            }
+            else
+            {
+                bistate->next_free = InvalidBlockNumber;
+                bistate->last_free = InvalidBlockNumber;
+            }
+        }
+
+        buffer = victim_buffers[0];

If we move the "buffer =" assignment up, we can have only one if (bistate)

+        if (bistate)
+        {
+            IncrBufferRefCount(buffer);
+            bistate->current_buf = buffer;
+        }
+    }

like this:

        buffer = victim_buffers[0];

        if (bistate)
        {
            if (extend_by_pages > 1)
            {
                bistate->next_free = firstBlock + 1;
                bistate->last_free = firstBlock + extend_by_pages - 1;
            }
            else
            {
                bistate->next_free = InvalidBlockNumber;
                bistate->last_free = InvalidBlockNumber;
            }

            IncrBufferRefCount(buffer);
            bistate->current_buf = buffer;
        }


From 6711e45bed59ee07ec277b9462f4745603a3d4a4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:41:46 -0700
Subject: [PATCH v5 15/15] bufmgr: debug: Add PrintBuffer[Desc]

Useful for development. Perhaps we should polish these and keep them?
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e07a5bc48..0d382cd787 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+
+    fprintf(stderr, "%d: [%u] msg: %s, rel: %s, block %u: refcount: %u / %u, usagecount: %u, flags:%s%s%s%s%s%s%s%s%s%s\n",
+            MyProcPid,
+            buffer,
+            msg,
+            path,
+            blockno,
+            BUF_STATE_GET_REFCOUNT(buf_state),
+            GetPrivateRefCount(buffer),
+            BUF_STATE_GET_USAGECOUNT(buf_state),
+            buf_state & BM_LOCKED ? " BM_LOCKED" : "",
+            buf_state & BM_DIRTY ? " BM_DIRTY" : "",
+            buf_state & BM_VALID ? " BM_VALID" : "",
+            buf_state & BM_TAG_VALID ? " BM_TAG_VALID" : "",
+            buf_state & BM_IO_IN_PROGRESS ? " BM_IO_IN_PROGRESS" : "",
+            buf_state & BM_IO_ERROR ? " BM_IO_ERROR" : "",
+            buf_state & BM_JUST_DIRTIED ? " BM_JUST_DIRTIED" : "",
+            buf_state & BM_PIN_COUNT_WAITER ? " BM_PIN_COUNT_WAITER" : "",
+            buf_state & BM_CHECKPOINT_NEEDED ? " BM_CHECKPOINT_NEEDED" : "",
+            buf_state & BM_PERMANENT ? " BM_PERMANENT" : ""
+        );
+}

How about this

#define FLAG_DESC(flag) (buf_state & (flag) ? " " #flag : "")
            FLAG_DESC(BM_LOCKED),
            FLAG_DESC(BM_DIRTY),
            FLAG_DESC(BM_VALID),
            FLAG_DESC(BM_TAG_VALID),
            FLAG_DESC(BM_IO_IN_PROGRESS),
            FLAG_DESC(BM_IO_ERROR),
            FLAG_DESC(BM_JUST_DIRTIED),
            FLAG_DESC(BM_PIN_COUNT_WAITER),
            FLAG_DESC(BM_CHECKPOINT_NEEDED),
            FLAG_DESC(BM_PERMANENT)
#undef FLAG_DESC

+
+void
+PrintBuffer(Buffer buffer, const char *msg)
+{
+    BufferDesc *buf_hdr = GetBufferDescriptor(buffer - 1);

no need for this variable

+
+    PrintBufferDesc(buf_hdr, msg);
+}

- Melanie



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Kyotaro Horiguchi
Date:
At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the
> relation extension related code in hio.c back into its own function.
> 
> While reviewing the hio.c code, I did realize that too much stuff is done
> while holding the buffer lock. See also the pre-existing issue
> https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de

0001, 0002 looks fine to me.

0003 adds the new function FileFallocate, but we already have
AllocateFile. Although fd.c contains functions with varying word
orders, it could be confusing that closely named functions have
different naming conventions.

+    /*
+     * Return in cases of a "real" failure, if fallocate is not supported,
+     * fall through to the FileZero() backed implementation.
+     */
+    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+        return returnCode;

I'm not entirely sure, but man 2 fallocate says that ENOSYS can also
be returned. Some googling indicates that ENOSYS might need the same
treatment as EOPNOTSUPP. However, I'm not clear on why man
posix_fallocate doesn't mention the former.
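
Concretely, the amendment would presumably look like:

    if (returnCode != EINVAL && returnCode != EOPNOTSUPP &&
        returnCode != ENOSYS)
        return returnCode;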


+        (returnCode != EINVAL && returnCode != EINVAL))
:)


FileGetRawDesc(File file)
 {
     Assert(FileIsValid(file));
+
+    if (FileAccess(file) < 0)
+        return -1;
+

The function's comment is provided below.

> * The returned file descriptor will be valid until the file is closed, but
> * there are a lot of things that can make that happen.  So the caller should
> * be careful not to do much of anything else before it finishes using the
> * returned file descriptor.

So, the responsibility to make sure the file is valid seems to lie
with the callers, although I'm not sure, since there aren't any users
of the function in the tree. I'm unclear as to why FileSize omits the
case lruLessRecently != file. When examining similar functions, such
as FileGetRawFlags and FileGetRawMode, I'm puzzled to find that
neither FileAccess() nor BasicOpenFilePerm() sets the struct members
referred to by those functions. This makes me question the usefulness
of these functions, including FileGetRawDesc(). Regardless, since the
patchset doesn't use FileGetRawDesc(), I don't believe the fix is
necessary in this patch set.

+    if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)

I'm not sure it is appropriate to assume InvalidBlockNumber equals
MaxBlockNumber + 1 in this context.
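
(For reference, block.h defines:

    #define InvalidBlockNumber        ((BlockNumber) 0xFFFFFFFF)
    #define MaxBlockNumber            ((BlockNumber) 0xFFFFFFFE)

so the identity does hold today; the question is whether code should rely
on it.)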


+        int            segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+        int            segendblock = (curblocknum % ((BlockNumber) RELSEG_SIZE)) + remblocks;
+        off_t        seekpos = (off_t) BLCKSZ * segstartblock;

segendblock can be defined as "segstartblock + remblocks", which would
be clearer.

+         * If available and useful, use posix_fallocate() (via FileAllocate())

FileFallocate()?


+         * However, we don't use FileAllocate() for small extensions, as it
+         * defeats delayed allocation on some filesystems. Not clear where
+         * that decision should be made though? For now just use a cutoff of
+         * 8, anything between 4 and 8 worked OK in some local testing.

The choice is quite similar to the one FileFallocate() makes. However,
I'm not sure FileFallocate() itself should be doing this.
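
(The choice in mdzeroextend() being referred to is, roughly:

    if (numblocks > 8)
        ret = FileFallocate(v->mdfd_vfd, seekpos,
                            (off_t) BLCKSZ * numblocks,
                            WAIT_EVENT_DATA_FILE_EXTEND);
    else
        ret = FileZero(v->mdfd_vfd, seekpos,
                       (off_t) BLCKSZ * numblocks,
                       WAIT_EVENT_DATA_FILE_EXTEND);

i.e. both layers decide between fallocate and zero-filling.)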



regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote:
> Below is my review of a slightly older version than you just posted --
> much of it you may have already addressed.

Far from all of it is addressed already - thanks for the review!


> From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Mon, 20 Mar 2023 21:57:40 -0700
> Subject: [PATCH v5 01/15] createdb-using-wal-fixes
>
> This could use a more detailed commit message -- I don't really get what
> it is doing

It's a fix for a bug that I encountered while hacking on this; it has
since been committed.

commit 5df319f3d55d09fadb4f7e4b58c5b476a3aeceb4
Author: Andres Freund <andres@anarazel.de>
Date:   2023-03-20 21:57:40 -0700

    Fix memory leak and inefficiency in CREATE DATABASE ... STRATEGY WAL_LOG

    RelationCopyStorageUsingBuffer() did not free the strategies used to access
    the source / target relation. The memory was released at the end of the
    transaction, but when using a template database with a lot of relations, the
    temporary leak can become prohibitively big.

    RelationCopyStorageUsingBuffer() acquired the buffer for the target relation
    with RBM_NORMAL, therefore requiring a read of a block guaranteed to be
    zero. Use RBM_ZERO_AND_LOCK instead.

    Reviewed-by: Robert Haas <robertmhaas@gmail.com>
    Discussion: https://postgr.es/m/20230321070113.o2vqqxogjykwgfrr@awork3.anarazel.de
    Backpatch: 15-, where STRATEGY WAL_LOG was introduced


> From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 1 Jul 2020 19:06:45 -0700
> Subject: [PATCH v5 02/15] Add some error checking around pinning
>
> ---
>  src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++---------
>  src/include/storage/bufmgr.h        |  1 +
>  2 files changed, 29 insertions(+), 12 deletions(-)
>

> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 95212a3941..fa20fab5a2 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
>                                      LW_EXCLUSIVE);
>  }
>
> +void
> +BufferCheckOneLocalPin(Buffer buffer)
> +{
> +    if (BufferIsLocal(buffer))
> +    {
> +        /* There should be exactly one pin */
> +        if (LocalRefCount[-buffer - 1] != 1)
> +            elog(ERROR, "incorrect local pin count: %d",
> +                LocalRefCount[-buffer - 1]);
> +    }
> +    else
> +    {
> +        /* There should be exactly one local pin */
> +        if (GetPrivateRefCount(buffer) != 1)
>
> I'd rather this be an else if (it was already like this, but no reason not
> to change it now)

I don't like that much - it'd break the symmetry between local / non-local.


> +/*
> + * Zero a region of the file.
> + *
> + * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
> + * appropriate error.
> + */
> +int
> +FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
> +{
> +    int            returnCode;
> +    ssize_t        written;
> +
> +    Assert(FileIsValid(file));
> +    returnCode = FileAccess(file);
> +    if (returnCode < 0)
> +        return returnCode;
> +
> +    pgstat_report_wait_start(wait_event_info);
> +    written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
> +    pgstat_report_wait_end();
> +
> +    if (written < 0)
> +        return -1;
> +    else if (written != amount)
>
> this doesn't need to be an else if

You mean it could be a "bare" if instead? I don't really think that's clearer.


> +    {
> +        /* if errno is unset, assume problem is no disk space */
> +        if (errno == 0)
> +            errno = ENOSPC;
> +        return -1;
> +    }
>
> +int
> +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
> +{
> +#ifdef HAVE_POSIX_FALLOCATE
> +    int            returnCode;
> +
> +    Assert(FileIsValid(file));
> +    returnCode = FileAccess(file);
> +    if (returnCode < 0)
> +        return returnCode;
> +
> +    pgstat_report_wait_start(wait_event_info);
> +    returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
> +    pgstat_report_wait_end();
> +
> +    if (returnCode == 0)
> +        return 0;
> +
> +    /* for compatibility with %m printing etc */
> +    errno = returnCode;
> +
> +    /*
> +    * Return in cases of a "real" failure, if fallocate is not supported,
> +    * fall through to the FileZero() backed implementation.
> +    */
> +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> +        return returnCode;
>
> I'm pretty sure you can just delete the below if statement
>
> +    if (returnCode == 0 ||
> +        (returnCode != EINVAL && returnCode != EINVAL))
> +        return returnCode;

Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
FileFallocate() succeeded or failed (rather than not being supported)?

>
> +/*
> + *    mdzeroextend() -- Add ew zeroed out blocks to the specified relation.
>
> not sure what ew is

A hurried new :)


> + *
> + *        Similar to mdrextend(), except the relation can be extended by
>
> mdrextend->mdextend

> + *        multiple blocks at once, and that the added blocks will be
> filled with
>
> I would lose the comma and just say "and the added blocks will be filled..."

Done.


> +void
> +mdzeroextend(SMgrRelation reln, ForkNumber forknum,
> +            BlockNumber blocknum, int nblocks, bool skipFsync)
>
> So, I think there are  a few too many local variables in here, and it
> actually makes it more confusing.
> Assuming you would like to keep the input parameters blocknum and
> nblocks unmodified for debugging/other reasons, here is a suggested
> refactor of this function

I'm mostly adopting this.


> Also, I think you can combine the two error cases (I don't know if the
> user cares what you were trying to extend the file with).

Hm. I do find it a somewhat useful distinction for figuring out problems - we
haven't used posix_fallocate for files so far, so it seems plausible we'd hit
some portability issues. We could make it an errdetail(), I guess?



> void
> mdzeroextend(SMgrRelation reln, ForkNumber forknum,
>             BlockNumber blocknum, int nblocks, bool skipFsync)
> {
>     MdfdVec    *v;
>     BlockNumber curblocknum = blocknum;
>     int            remblocks = nblocks;
>     Assert(nblocks > 0);
>
>     /* This assert is too expensive to have on normally ... */
> #ifdef CHECK_WRITE_VS_EXTEND
>     Assert(blocknum >= mdnblocks(reln, forknum));
> #endif
>
>     /*
>     * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
>     * more --- we mustn't create a block whose number actually is
>     * InvalidBlockNumber or larger.
>     */
>     if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
>         ereport(ERROR,
>                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>                 errmsg("cannot extend file \"%s\" beyond %u blocks",
>                         relpath(reln->smgr_rlocator, forknum),
>                         InvalidBlockNumber)));
>
>     while (remblocks > 0)
>     {
>         int            segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);

Hm - this shouldn't be an int - that's my fault, not yours...




> From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 26 Oct 2022 12:05:07 -0700
> Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index fa20fab5a2..6f50dbd212 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
>  }
>
>  void
> -BufferCheckOneLocalPin(Buffer buffer)
> +BufferCheckWePinOnce(Buffer buffer)
>
> This name is weird. Who is we?

The current backend. I.e. the function checks that the current backend pins
the buffer exactly once, rather than that *any* backend pins it once.

I now see that BufferIsPinned() is, IMO, misleadingly named, suggesting more
generality, even though it also just applies to pins by the current backend.


> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 5325ddb663..798c5b93a8 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> +bool
> +PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
> +{
> +    uint32        buf_state;
> +    Buffer        buffer = BufferDescriptorGetBuffer(buf_hdr);
> +    int            bufid = -(buffer + 1);
>
> You do
>   int            buffid = -buffer - 1;
> in UnpinLocalBuffer()
> They should be consistent.
>
>   int            bufid = -(buffer + 1);
>
> I think this version is better:
>
>   int            buffid = -buffer - 1;
>
> Since if buffer is INT_MAX, then the -(buffer + 1) version invokes
> undefined behavior while the -buffer - 1 version doesn't.

You are right! Not sure what I was doing there...

Ah - turns out there's pre-existing code in localbuf.c that does it that way :(
See at least MarkLocalBufferDirty().

We really need to wrap this in something, rather than open coding it all over
bufmgr.c/localbuf.c. I really dislike this indexing :(.
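
Something like this, perhaps (hypothetical helper, just to illustrate):

    static inline int
    LocalBufferIndex(Buffer buffer)
    {
        Assert(BufferIsLocal(buffer));
        return -buffer - 1;
    }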


> From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Fri, 17 Feb 2023 18:26:34 -0800
> Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately
>
> Previously we held buffer locks for two buffer mapping partitions at the same
> time to change the identity of buffers. Particularly for extending relations
> needing to hold the extension lock while acquiring a victim buffer is
> painful. By separating out the victim buffer acquisition, future commits will
> be able to change relation extensions to scale better.
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 3d0683593f..ea423ae484 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>
>      /*
>      * Buffer contents are currently invalid.  Try to obtain the right to
>      * start I/O.  If StartBufferIO returns false, then someone else managed
>      * to read it before we did, so there's nothing left for BufferAlloc() to
>      * do.
>      */
> -    if (StartBufferIO(buf, true))
> +    if (StartBufferIO(victim_buf_hdr, true))
>          *foundPtr = false;
>      else
>          *foundPtr = true;
>
> I know it was already like this, but since you edited the line already,
> can we just make this this now?
>
>     *foundPtr = !StartBufferIO(victim_buf_hdr, true);

Hm, I do think it's easier to review if largely unchanged code is just moved
around, rather also rewritten. So I'm hesitant to do that here.


> @@ -1595,6 +1413,237 @@ retry:
>      StrategyFreeBuffer(buf);
>  }
>
> +/*
> + * Helper routine for GetVictimBuffer()
> + *
> + * Needs to be called on a buffer with a valid tag, pinned, but without the
> + * buffer header spinlock held.
> + *
> + * Returns true if the buffer can be reused, in which case the buffer is only
> + * pinned by this backend and marked as invalid, false otherwise.
> + */
> +static bool
> +InvalidateVictimBuffer(BufferDesc *buf_hdr)
> +{
> +    /*
> +    * Clear out the buffer's tag and flags and usagecount.  This is not
> +    * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
> +    * doing anything with the buffer. But currently it's beneficial as the
> +    * pre-check for several linear scans of shared buffers just checks the
> +    * tag.
>
> I don't really understand the above comment -- mainly the last sentence.

To start with, it's s/checks/check/

"linear scans" is a reference to functions like DropRelationBuffers(), which
iterate over all buffers, and just check the tag for a match. If we leave the
tag around, it'll still work, as InvalidateBuffer() etc will figure out that
the buffer is invalid. But of course that's slower than just skipping the
buffer "early on".


> +static Buffer
> +GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
> +{
> +    BufferDesc *buf_hdr;
> +    Buffer        buf;
> +    uint32        buf_state;
> +    bool        from_ring;
> +
> +    /*
> +    * Ensure, while the spinlock's not yet held, that there's a free refcount
> +    * entry.
> +    */
> +    ReservePrivateRefCountEntry();
> +    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> +
> +    /* we return here if a prospective victim buffer gets used concurrently */
> +again:
>
> Why use goto instead of a loop here (again is the goto label)?

I find it way more readable this way. I'd use a loop if it were the common
case to loop, but it's the rare case, and for that I find the goto more
readable.



> @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
>  {
>      uint32        buf_state;
>
> I noticed that the comment above TerminateBufferIO() says
>  * TerminateBufferIO: release a buffer we were doing I/O on
>  *    (Assumptions)
>  *    My process is executing IO for the buffer
>
> Can we still say this is an assumption? What about when it is being
> cleaned up after being called from AbortBufferIO()

That hasn't really changed - it was already called by AbortBufferIO().

I think it's still correct, too. We must have marked the IO as being in
progress to get there.


> diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
> index 19b6241e45..fccc59b39d 100644
> --- a/src/backend/utils/resowner/resowner.c
> +++ b/src/backend/utils/resowner/resowner.c
> @@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
>
>      /* We have built-in support for remembering: */
>      ResourceArray bufferarr;    /* owned buffers */
> +    ResourceArray bufferioarr;    /* in-progress buffer IO */
>      ResourceArray catrefarr;    /* catcache references */
>      ResourceArray catlistrefarr;    /* catcache-list pins */
>      ResourceArray relrefarr;    /* relcache references */
> @@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
>
> Maybe worth mentioning in-progress buffer IO in resowner README? I know
> it doesn't claim to be exhaustive, so, up to you.

Hm. Given the few types of resources mentioned in the README, I don't think
it's worth doing so.


> Also, I realize that existing code in this file has the extraneous
> parentheses, but maybe it isn't worth staying consistent with that?
> as in: &(owner->bufferioarr)

I personally don't find it worth being consistent with that, but if you /
others think it is, I'd be ok with adapting to that.


> From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 1 Mar 2023 13:24:19 -0800
> Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
>  ExtendBufferedRel{By,To,}
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 3c95b87bca..4e07a5bc48 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
>
> +/*
> + * Extend relation by multiple blocks.
> + *
> + * Tries to extend the relation by extend_by blocks. Depending on the
> + * availability of resources the relation may end up being extended by a
> + * smaller number of pages (unless an error is thrown, always by at least one
> + * page). *extended_by is updated to the number of pages the relation has been
> + * extended to.
> + *
> + * buffers needs to be an array that is at least extend_by long. Upon
> + * completion, the first extend_by array elements will point to a pinned
> + * buffer.
> + *
> + * If EB_LOCK_FIRST is part of flags, the first returned buffer is
> + * locked. This is useful for callers that want a buffer that is guaranteed to
> + * be empty.
>
> This should document what the returned BlockNumber is.

Ok.


> Also, instead of having extend_by and extended_by, how about just having
> one which is set by the caller to the desired number to extend by and
> then overwritten in this function to the value it successfully extended
> by.

I had it that way at first - but I think it turned out to be more confusing.


> It would be nice if the function returned the number it extended by
> instead of the BlockNumber.

It's not actually free to get the block number from a buffer (it causes more
sharing of the BufferDesc cacheline, which then makes modifications of the
cacheline more expensive). We should work on removing all those
BufferGetBlockNumber(). So I don't want to introduce a new function that
requires using BufferGetBlockNumber().

So I don't think this would be an improvement.


> +    Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
>
> Can we turn these into !=
>
>   Assert((eb.rel != NULL) != (eb.smgr != NULL));
>
> since it is easier to understand.

Done.


> + * Extend the relation so it is at least extend_to blocks large, read buffer
>
> Use of "read buffer" here is confusing. We only read the block if, after
> we try extending the relation, someone else already did so and we have
> to read the block they extended in, right?

That's one case, yes. I think there's also some unfortunate other case that
I'd like to get rid of. See my archeology at
https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de



> +        uint32 num_pages = lengthof(buffers);
> +        BlockNumber first_block;
> +
> +        if ((uint64) current_size + num_pages > extend_to)
> +            num_pages = extend_to - current_size;
> +
> +        first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
> +                                             num_pages, extend_to,
> +                                             buffers, &extended_by);
> +
> +        current_size = first_block + extended_by;
> +        Assert(current_size <= extend_to);
> +        Assert(num_pages != 0 || current_size >= extend_to);
> +
> +        for (int i = 0; i < extended_by; i++)
> +        {
> +            if (first_block + i != extend_to - 1)
>
> Is there a way we could avoid pinning these other buffers to begin with
> (e.g. passing a parameter to ExtendBufferedRelCommon())

We can't avoid pinning them. We could make ExtendBufferedRelCommon() release
them though - but I'm not sure that'd be an improvement.  I actually had a
flag for that temporarily, but



> +    if (buffer == InvalidBuffer)
> +    {
> +        bool hit;
> +
> +        Assert(extended_by == 0);
> +        buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
> +                                  fork, extend_to - 1, mode, strategy,
> +                                  &hit);
> +    }
> +
> +    return buffer;
> +}
>
> Do we use compound literals? Here, this could be:
>
>         buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
>                                   fork, extend_to - 1, mode, strategy,
>                                   &(bool) {0});
>
> To eliminate the extraneous hit variable.

We do use compound literals in a few places. However, I don't think it's a
good idea to pass a pointer to a temporary.  At least I need to look up the
lifetime rules for those every time. And this isn't a huge win, so I wouldn't
go for it here.
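
(For the record, the rule: a block-scope compound literal has automatic
storage duration associated with its enclosing block, so e.g.:

    bool       *p;

    {
        p = &(bool) {0};    /* literal lives until the end of this block */
    }
    /* *p is dangling here */

In the suggested call the literal's block outlives the call, so it would be
fine - but one has to check every time.)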



>  /*
>   * ReadBuffer_common -- common logic for all ReadBuffer variants
> @@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>      bool        found;
>      IOContext    io_context;
>      IOObject    io_object;
> -    bool        isExtend;
>      bool        isLocalBuf = SmgrIsTemp(smgr);
>
>      *hit = false;
>
> +    /*
> +    * Backward compatibility path, most code should use
> +    * ExtendRelationBuffered() instead, as acquiring the extension lock
> +    * inside ExtendRelationBuffered() scales a lot better.
>
> Think these are old function names in the comment

Indeed.


> +static BlockNumber
> +ExtendBufferedRelShared(ExtendBufferedWhat eb,
> +                        ForkNumber fork,
> +                        BufferAccessStrategy strategy,
> +                        uint32 flags,
> +                        uint32 extend_by,
> +                        BlockNumber extend_upto,
> +                        Buffer *buffers,
> +                        uint32 *extended_by)
> +{
> +    BlockNumber first_block;
> +    IOContext    io_context = IOContextForStrategy(strategy);
> +
> +    LimitAdditionalPins(&extend_by);
> +
> +    /*
> +    * Acquire victim buffers for extension without holding extension lock.
> +    * Writing out victim buffers is the most expensive part of extending the
> +    * relation, particularly when doing so requires WAL flushes. Zeroing out
> +    * the buffers is also quite expensive, so do that before holding the
> +    * extension lock as well.
> +    *
> +    * These pages are pinned by us and not valid. While we hold the pin they
> +    * can't be acquired as victim buffers by another backend.
> +    */
> +    for (uint32 i = 0; i < extend_by; i++)
> +    {
> +        Block        buf_block;
> +
> +        buffers[i] = GetVictimBuffer(strategy, io_context);
> +        buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
> +
> +        /* new buffers are zero-filled */
> +        MemSet((char *) buf_block, 0, BLCKSZ);
> +    }
> +
> +    /*
> +    * Lock relation against concurrent extensions, unless requested not to.
> +    *
> +    * We use the same extension lock for all forks. That's unnecessarily
> +    * restrictive, but currently extensions for forks don't happen often
> +    * enough to make it worth locking more granularly.
> +    *
> +    * Note that another backend might have extended the relation by the time
> +    * we get the lock.
> +    */
> +    if (!(flags & EB_SKIP_EXTENSION_LOCK))
> +    {
> +        LockRelationForExtension(eb.rel, ExclusiveLock);
> +        eb.smgr = RelationGetSmgr(eb.rel);
> +    }
> +
> +    /*
> +    * If requested, invalidate size cache, so that smgrnblocks asks the
> +    * kernel.
> +    */
> +    if (flags & EB_CLEAR_SIZE_CACHE)
> +        eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
>
> I don't see this in master, is it new?

Not really - it's just elsewhere. See vm_extend() and fsm_extend(). I could
move this part back into "Convert a few places to ExtendBufferedRelTo", but I
don't think that'd be better.
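
E.g. vm_extend() on master has, roughly:

    /* Invalidate cache so that smgrnblocks() asks the kernel. */
    reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;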


> + * Flags influencing the behaviour of ExtendBufferedRel*
> + */
> +typedef enum ExtendBufferedFlags
> +{
> +    /*
> +    * Don't acquire extension lock. This is safe only if the relation isn't
> +    * shared, an access exclusive lock is held or if this is the startup
> +    * process.
> +    */
> +    EB_SKIP_EXTENSION_LOCK = (1 << 0),
> +
> +    /* Is this extension part of recovery? */
> +    EB_PERFORMING_RECOVERY = (1 << 1),
> +
> +    /*
> +    * Should the fork be created if it does not currently exist? This likely
> +    * only ever makes sense for relation forks.
> +    */
> +    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
> +
> +    /* Should the first (possibly only) return buffer be returned locked? */
> +    EB_LOCK_FIRST = (1 << 3),
> +
> +    /* Should the smgr size cache be cleared? */
> +    EB_CLEAR_SIZE_CACHE = (1 << 4),
> +
> +    /* internal flags follow */
>
> I don't understand what this comment means ("internal flags follow")

Hm - just that the flags defined subsequently are for internal use, not for
callers to specify.





> + */
> +static int
> +heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples,
> +                        Size saveFreeSpace)
> +{
> +    size_t        page_avail;
> +    int            npages = 0;
> +
> +    page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
> +    npages++;
>
> can this not just be this:
>
> size_t        page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
> int            npages = 1;

Yes.


>
> From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Wed, 26 Oct 2022 14:14:11 -0700
> Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy()

> ---
>  src/backend/access/heap/hio.c | 285 +++++++++++++++++-----------------
>  1 file changed, 146 insertions(+), 139 deletions(-)
>
> diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
> index 65886839e7..48cfcff975 100644
> --- a/src/backend/access/heap/hio.c
> +++ b/src/backend/access/heap/hio.c
> @@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
>
> so in RelationGetBufferForTuple() up above where your changes start,
> there is this code
>
>
>     /*
>     * We first try to put the tuple on the same page we last inserted a tuple
>     * on, as cached in the BulkInsertState or relcache entry.  If that
>     * doesn't work, we ask the Free Space Map to locate a suitable page.
>     * Since the FSM's info might be out of date, we have to be prepared to
>     * loop around and retry multiple times. (To insure this isn't an infinite
>     * loop, we must update the FSM with the correct amount of free space on
>     * each page that proves not to be suitable.)  If the FSM has no record of
>     * a page with enough free space, we give up and extend the relation.
>     *
>     * When use_fsm is false, we either put the tuple onto the existing target
>     * page or extend the relation.
>     */
>     if (bistate && bistate->current_buf != InvalidBuffer)
>     {
>         targetBlock = BufferGetBlockNumber(bistate->current_buf);
>     }
>     else
>         targetBlock = RelationGetTargetBlock(relation);
>
>     if (targetBlock == InvalidBlockNumber && use_fsm)
>     {
>         /*
>         * We have no cached target page, so ask the FSM for an initial
>         * target.
>         */
>         targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
>     }
>
> And I was thinking: ReadBufferBI() only has one caller now
> (RelationGetBufferForTuple()), and this caller basically has already
> checked for the case handled inside ReadBufferBI() (the code I pasted
> above)
>
>     /* If we have the desired block already pinned, re-pin and return it */
>     if (bistate->current_buf != InvalidBuffer)
>     {
>         if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
>         {
>             /*
>             * Currently the LOCK variants are only used for extending
>             * relation, which should never reach this branch.
>             */
>             Assert(mode != RBM_ZERO_AND_LOCK &&
>                   mode != RBM_ZERO_AND_CLEANUP_LOCK);
>
>             IncrBufferRefCount(bistate->current_buf);
>             return bistate->current_buf;
>         }
>         /* ... else drop the old buffer */
>
> So, I was thinking maybe there is some way to inline the logic for
> ReadBufferBI(), because I think it would feel more streamlined to me.

I don't really see how - I'd welcome suggestions?


> @@ -558,18 +477,46 @@ loop:
>              ReleaseBuffer(buffer);
>          }
>
> Oh, and I forget which commit introduced BulkInsertState->next_free and
> last_free, but I remember thinking that it didn't seem to fit with the
> other parts of that commit.

I'll move it into the one using ExtendBufferedRelBy().


> -        /* Without FSM, always fall out of the loop and extend */
> -        if (!use_fsm)
> -            break;
> +        if (bistate
> +            && bistate->next_free != InvalidBlockNumber
> +            && bistate->next_free <= bistate->last_free)
> +        {
> +            /*
> +            * We bulk extended the relation before, and there are still some
> +            * unused pages from that extension, so we don't need to look in
> +            * the FSM for a new page. But do record the free space from the
> +            * last page, somebody might insert narrower tuples later.
> +            */
>
> Why couldn't we have found out that we bulk-extended before and get the
> block from there up above the while loop?

I'm not quite sure I follow - above the loop there might still have been space
on the prior page? We also need the ability to loop if the space has been used
since.

I guess there's an argument for also checking above the loop, but I don't
think that'd currently ever be reachable.


> +            {
> +                bistate->next_free = InvalidBlockNumber;
> +                bistate->last_free = InvalidBlockNumber;
> +            }
> +            else
> +                bistate->next_free++;
> +        }
> +        else if (!use_fsm)
> +        {
> +            /* Without FSM, always fall out of the loop and extend */
> +            break;
> +        }
>
> It would be nice to have a comment explaining why this is in its own
> else if instead of breaking earlier (i.e. !use_fsm is still a valid case
> in the if branch above it)

I'm not quite following. Breaking where earlier?

Note that that branch is old code, it's just that a new way of getting a page
that potentially has free space is preceding it.
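
To sketch the resulting priority of page sources inside the loop
(simplified from the quoted patch; RecordAndGetPageWithFreeSpace() is the
pre-existing FSM call):

    if (bistate
        && bistate->next_free != InvalidBlockNumber
        && bistate->next_free <= bistate->last_free)
    {
        /* 1) leftover empty pages from our own earlier bulk extension */
        targetBlock = bistate->next_free;
        if (bistate->next_free >= bistate->last_free)
        {
            bistate->next_free = InvalidBlockNumber;
            bistate->last_free = InvalidBlockNumber;
        }
        else
            bistate->next_free++;
    }
    else if (!use_fsm)
    {
        /* 2) without FSM, always fall out of the loop and extend (old code) */
        break;
    }
    else
    {
        /* 3) otherwise record this page's free space and ask the FSM again */
        targetBlock = RecordAndGetPageWithFreeSpace(relation, targetBlock,
                                                    pageFreeSpace,
                                                    targetFreeSpace);
    }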



> we can get rid of needLock and waitcount variables like this
>
> +#define MAX_BUFFERS 64
> +        Buffer        victim_buffers[MAX_BUFFERS];
> +        BlockNumber firstBlock = InvalidBlockNumber;
> +        BlockNumber firstBlockFSM = InvalidBlockNumber;
> +        BlockNumber curBlock;
> +        uint32        extend_by_pages;
> +        uint32        no_fsm_pages;
> +        uint32        waitcount;
> +
> +        extend_by_pages = num_pages;
> +
> +        /*
> +        * Multiply the number of pages to extend by the number of waiters. Do
> +        * this even if we're not using the FSM, as it does relieve
> +        * contention. Pages will be found via bistate->next_free.
> +        */
> +        if (needLock)
> +            waitcount = RelationExtensionLockWaiterCount(relation);
> +        else
> +            waitcount = 0;
> +        extend_by_pages += extend_by_pages * waitcount;
>
>         if (!RELATION_IS_LOCAL(relation))
>             extend_by_pages += extend_by_pages *
>                 RelationExtensionLockWaiterCount(relation);

I guess I find it useful to be able to quickly add logging messages for stuff
like this. I don't think local variables are as bad as you make them out to be
:)
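
For example, with the locals in place, instrumenting the decision is a
one-liner (hypothetical debugging aid, not part of the patch):

    elog(DEBUG1, "bulk extending by %u pages (%u extension lock waiters)",
         extend_by_pages, waitcount);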

Thanks for the review!

Andres



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-27 15:32:47 +0900, Kyotaro Horiguchi wrote:
> At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund <andres@anarazel.de> wrote in
> > Hi,
> >
> > Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the
> > relation extension related code in hio.c back into its own function.
> >
> > While reviewing the hio.c code, I did realize that too much stuff is done
> > while holding the buffer lock. See also the pre-existing issue
> > https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de
>
> 0001, 0002 looks fine to me.
>
> 0003 adds the new function FileFallocate, but we already have
> AllocateFile. Although fd.c contains functions with varying word
> orders, it could be confusing that closely named functions follow
> different naming conventions.

The syscall is named fallocate, I don't think we'd gain anything by inventing
a different name for it? Given that there's a number of File$syscall
operations, I think it's clear enough that it just fits into that. Unless you
have a better proposal?


> +    /*
> +     * Return in cases of a "real" failure, if fallocate is not supported,
> +     * fall through to the FileZero() backed implementation.
> +     */
> +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> +        return returnCode;
>
> I'm not entirely sure, but man 2 fallocate says that ENOSYS can also
> be returned. Some googling indicates that ENOSYS might need the same
> treatment as EOPNOTSUPP. However, I'm not clear on why man
> posix_fallocate doesn't mention the former.

posix_fallocate() and its errors are specified by posix, I guess. I think
glibc etc will map ENOSYS to EOPNOTSUPP.
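
If we wanted to be defensive against a libc that passes ENOSYS through,
the fall-through check could cover it as well - illustrative only, not
what the patch currently does:

    /*
     * "Real" failures are returned to the caller; EINVAL, EOPNOTSUPP and
     * ENOSYS all mean the operation isn't supported here, so fall through
     * to the FileZero() backed implementation.
     */
    if (returnCode != EINVAL &&
        returnCode != EOPNOTSUPP &&
        returnCode != ENOSYS)
        return returnCode;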

I really dislike this bit from the posix_fallocate manpage:

EINVAL offset was less than 0, or len was less than or equal to 0, or the underlying filesystem does not support the
operation.

Why oh why would you add the "or .." portion into EINVAL, when there's also
EOPNOTSUPP?

>
> +        (returnCode != EINVAL && returnCode != EINVAL))
> :)
>
>
> FileGetRawDesc(File file)
>  {
>      Assert(FileIsValid(file));
> +
> +    if (FileAccess(file) < 0)
> +        return -1;
> +
>
> The function's comment is provided below.
>
> > * The returned file descriptor will be valid until the file is closed, but
> > * there are a lot of things that can make that happen.  So the caller should
> > * be careful not to do much of anything else before it finishes using the
> > * returned file descriptor.
>
> So, the responsibility to make sure the file is valid seems to lie
> with the callers, although I'm not sure, since there aren't any
> users of the function in the tree.

Except, as I think you realized as well, external callers *can't* call
FileAccess(), it's static.


> I'm unclear as to why FileSize omits the case lruLessRecently != file.

Not quite following - why would FileSize() deal with lruLessRecently itself?
Or do you mean why FileSize() uses FileIsNotOpen() itself, rather than relying
on FileAccess() doing that internally?


> When examining similar functions, such as FileGetRawFlags and
> FileGetRawMode, I'm puzzled to find that neither FileAccess() nor
> BasicOpenFilePerm() sets the struct members referred to by those
> functions.

Those aren't involved with the LRU mechanism, IIRC. Note that BasicOpenFilePerm()
returns an actual fd, not a File. So you can't call FileGetRawMode() on it. As
BasicOpenFilePerm() says:
 * This is exported for use by places that really want a plain kernel FD,
 * but need to be proof against running out of FDs. ...


I don't think FileAccess() needs to set those struct members, that's already
been done in PathNameOpenFilePerm().


> This makes my question the usefulness of these functions including
> FileGetRawDesc().

It's quite weird that we have FileGetRawDesc(), but don't provide a safe way
to use it...


> Regardless, since the
> patchset doesn't use FileGetRawDesc(), I don't believe the fix is
> necessary in this patch set.

Yea. It was used in an earlier version, but not anymore.



>
> +    if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
>
> I'm not sure it is appropriate to assume InvalidBlockNumber equals
> MaxBlockNumber + 1 in this context.

Hm. This is just the multi-block equivalent of what mdextend() already
does. It's not pretty, indeed. I'm not sure there's really a better thing to
do for mdzeroextend(), given the mdextend() precedent? mdzeroextend() (just as
mdextend()) will be called with blockNum == InvalidBlockNumber, if you try to
extend past the size limit.


>
> +         * However, we don't use FileAllocate() for small extensions, as it
> +         * defeats delayed allocation on some filesystems. Not clear where
> +         * that decision should be made though? For now just use a cutoff of
> +         * 8, anything between 4 and 8 worked OK in some local testing.
>
> The choice is quite similar to the one FileFallocate() makes. However, I'm
> not sure FileFallocate() itself should be doing this.

I'm not following - there's no such choice in FileFallocate()? Do you mean
that FileFallocate() also falls back to FileZero()? I don't think those are
comparable.

We don't want the code outside of fd.c to have to implement a fallback for
platforms that don't have fallocate (or something similar), that's why
FileFallocate() falls back to FileZero().

Here we care about not calling FileFallocate() in too small increments, when
the relation might extend further. If we somehow knew in mdzeroextend() that
the file won't be extended further, it'd be a good idea to call
FileFallocate(), even if just for a single block - it'd prevent the kernel
from wasting memory for delayed allocation.  But unfortunately we don't know
if it's the final size, hence the heuristic.
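
Concretely, the shape of the logic ends up roughly like this (sketch; the
cutoff of 8 and the FileZero() fallback match the quoted comment and patch):

    if (numblocks > 8)
    {
        /* large extension: reserve space without dirtying the page cache */
        ret = FileFallocate(v->mdfd_vfd,
                            seekpos, (off_t) BLCKSZ * numblocks,
                            WAIT_EVENT_DATA_FILE_EXTEND);
    }
    else
    {
        /* small extension: write zeroes, preserving delayed allocation */
        ret = FileZero(v->mdfd_vfd,
                       seekpos, (off_t) BLCKSZ * numblocks,
                       WAIT_EVENT_DATA_FILE_EXTEND);
    }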

Does that make sense?


Thanks!

Andres



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Melanie Plageman
Date:
On Tue, Mar 28, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote:
> > +    {
> > +        /* if errno is unset, assume problem is no disk space */
> > +        if (errno == 0)
> > +            errno = ENOSPC;
> > +        return -1;
> > +    }
> >
> > +int
> > +FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
> > +{
> > +#ifdef HAVE_POSIX_FALLOCATE
> > +    int            returnCode;
> > +
> > +    Assert(FileIsValid(file));
> > +    returnCode = FileAccess(file);
> > +    if (returnCode < 0)
> > +        return returnCode;
> > +
> > +    pgstat_report_wait_start(wait_event_info);
> > +    returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
> > +    pgstat_report_wait_end();
> > +
> > +    if (returnCode == 0)
> > +        return 0;
> > +
> > +    /* for compatibility with %m printing etc */
> > +    errno = returnCode;
> > +
> > +    /*
> > +    * Return in cases of a "real" failure, if fallocate is not supported,
> > +    * fall through to the FileZero() backed implementation.
> > +    */
> > +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> > +        return returnCode;
> >
> > I'm pretty sure you can just delete the below if statement
> >
> > +    if (returnCode == 0 ||
> > +        (returnCode != EINVAL && returnCode != EINVAL))
> > +        return returnCode;
>
> Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
> FileFallocate() succeeded or failed (rather than not being supported)?

Uh...I'm confused...maybe my eyes aren't working. If returnCode was 0,
you already would have returned and if returnCode wasn't EINVAL, you
also already would have returned.
Not to mention (returnCode != EINVAL && returnCode != EINVAL) contains
two identical operands.

> > +void
> > +mdzeroextend(SMgrRelation reln, ForkNumber forknum,
> > +            BlockNumber blocknum, int nblocks, bool skipFsync)
> >
> > So, I think there are  a few too many local variables in here, and it
> > actually makes it more confusing.
> > Assuming you would like to keep the input parameters blocknum and
> > nblocks unmodified for debugging/other reasons, here is a suggested
> > refactor of this function
>
> I'm mostly adopting this.
>
>
> > Also, I think you can combine the two error cases (I don't know if the
> > user cares what you were trying to extend the file with).
>
> Hm. I do find it a somewhat useful distinction for figuring out problems - we
> haven't used posix_fallocate for files so far, it seems plausible we'd hit
> some portability issues.  We could make it an errdetail(), I guess?

I think that would be clearer.
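
Something along these lines, presumably (the errdetail() wording is
invented for illustration):

    ereport(ERROR,
            errcode_for_file_access(),
            errmsg("could not extend file \"%s\": %m",
                   FilePathName(v->mdfd_vfd)),
            errdetail("Failed while trying to reserve space with FileFallocate()."),
            errhint("Check free disk space."));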

> > From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Wed, 26 Oct 2022 12:05:07 -0700
> > Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()
> >
> > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> > index fa20fab5a2..6f50dbd212 100644
> > --- a/src/backend/storage/buffer/bufmgr.c
> > +++ b/src/backend/storage/buffer/bufmgr.c
> > @@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
> >  }
> >
> >  void
> > -BufferCheckOneLocalPin(Buffer buffer)
> > +BufferCheckWePinOnce(Buffer buffer)
> >
> > This name is weird. Who is we?
>
> The current backend. I.e. the function checks that the current backend pins
> the buffer exactly once, rather than that *any* backend pins it once.
>
> I now see that BufferIsPinned() is, IMO, misleadingly named more generally,
> even though it also applies only to pins by the current backend.

Maybe there is a way to use "self" instead of a pronoun? But, if you
feel quite strongly about a pronoun, I think "we" implies more than one
backend, so "I" would be better.

> > @@ -1595,6 +1413,237 @@ retry:
> >      StrategyFreeBuffer(buf);
> >  }
> >
> > +/*
> > + * Helper routine for GetVictimBuffer()
> > + *
> > + * Needs to be called on a buffer with a valid tag, pinned, but without the
> > + * buffer header spinlock held.
> > + *
> > + * Returns true if the buffer can be reused, in which case the buffer is only
> > + * pinned by this backend and marked as invalid, false otherwise.
> > + */
> > +static bool
> > +InvalidateVictimBuffer(BufferDesc *buf_hdr)
> > +{
> > +    /*
> > +    * Clear out the buffer's tag and flags and usagecount.  This is not
> > +    * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
> > +    * doing anything with the buffer. But currently it's beneficial as the
> > +    * pre-check for several linear scans of shared buffers just checks the
> > +    * tag.
> >
> > I don't really understand the above comment -- mainly the last sentence.
>
> To start with, it's s/checks/check/
>
> "linear scans" is a reference to functions like DropRelationBuffers(), which
> iterate over all buffers, and just check the tag for a match. If we leave the
> tag around, it'll still work, as InvalidateBuffer() etc will figure out that
> the buffer is invalid. But of course that's slower than just skipping the
> buffer "early on".

Ah. I see the updated comment on your branch and find it to be more clear.

> > @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
> >  {
> >      uint32        buf_state;
> >
> I noticed that the comment above TerminateBufferIO() says
> >  * TerminateBufferIO: release a buffer we were doing I/O on
> >  *    (Assumptions)
> >  *    My process is executing IO for the buffer
> >
> > Can we still say this is an assumption? What about when it is being
> > cleaned up after being called from AbortBufferIO()
>
> That hasn't really changed - it was already called by AbortBufferIO().
>
> I think it's still correct, too. We must have marked the IO as being in
> progress to get there.

Oh, no I meant the "my process is executing IO for the buffer" --
couldn't another backend clear IO_IN_PROGRESS (i.e. not the original one
who set it on all the victim buffers)?

> > Also, I realize that existing code in this file has the extraneous
> parentheses, but maybe it isn't worth staying consistent with that?
> > as in: &(owner->bufferioarr)
>
> I personally don't find it worth being consistent with that, but if you /
> others think it is, I'd be ok with adapting to that.

Right, so I'm saying you should remove the extraneous parentheses in the
code you added.

> > From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Wed, 1 Mar 2023 13:24:19 -0800
> > Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
> >  ExtendBufferedRel{By,To,}
> > + * Flags influencing the behaviour of ExtendBufferedRel*
> > + */
> > +typedef enum ExtendBufferedFlags
> > +{
> > +    /*
> > +    * Don't acquire extension lock. This is safe only if the relation isn't
> > +    * shared, an access exclusive lock is held or if this is the startup
> > +    * process.
> > +    */
> > +    EB_SKIP_EXTENSION_LOCK = (1 << 0),
> > +
> > +    /* Is this extension part of recovery? */
> > +    EB_PERFORMING_RECOVERY = (1 << 1),
> > +
> > +    /*
> > +    * Should the fork be created if it does not currently exist? This likely
> > +    * only ever makes sense for relation forks.
> > +    */
> > +    EB_CREATE_FORK_IF_NEEDED = (1 << 2),
> > +
> > +    /* Should the first (possibly only) return buffer be returned locked? */
> > +    EB_LOCK_FIRST = (1 << 3),
> > +
> > +    /* Should the smgr size cache be cleared? */
> > +    EB_CLEAR_SIZE_CACHE = (1 << 4),
> > +
> > +    /* internal flags follow */
> >
> > I don't understand what this comment means ("internal flags follow")
>
> Hm - just that the flags defined subsequently are for internal use, not for
> callers to specify.

If EB_LOCK_TARGET is the only one of these, it might be more clear, for
now, to just say "an internal flag" or "for internal use" above
EB_LOCK_TARGET, since it is the only one.

> > -        /* Without FSM, always fall out of the loop and extend */
> > -        if (!use_fsm)
> > -            break;
> > +        if (bistate
> > +            && bistate->next_free != InvalidBlockNumber
> > +            && bistate->next_free <= bistate->last_free)
> > +        {
> > +            /*
> > +            * We bulk extended the relation before, and there are still some
> > +            * unused pages from that extension, so we don't need to look in
> > +            * the FSM for a new page. But do record the free space from the
> > +            * last page, somebody might insert narrower tuples later.
> > +            */
> >
> > Why couldn't we have found out that we bulk-extended before and get the
> > block from there up above the while loop?
>
> I'm not quite sure I follow - above the loop there might still have been space
> on the prior page? We also need the ability to loop if the space has been used
> since.
>
> I guess there's an argument for also checking above the loop, but I don't
> think that'd currently ever be reachable.

My idea was that directly below this code in RelationGetBufferForTuple():

    if (bistate && bistate->current_buf != InvalidBuffer)
        targetBlock = BufferGetBlockNumber(bistate->current_buf);
    else
        targetBlock = RelationGetTargetBlock(relation);

We could check bistate->next_free instead of checking the freespace map first.

But, perhaps we couldn't hit this because we would already have set
current_buf if we had a next_free?

- Melanie



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-29 20:51:04 -0400, Melanie Plageman wrote:
> > > +    if (returnCode == 0)
> > > +        return 0;
> > > +
> > > +    /* for compatibility with %m printing etc */
> > > +    errno = returnCode;
> > > +
> > > +    /*
> > > +    * Return in cases of a "real" failure, if fallocate is not supported,
> > > +    * fall through to the FileZero() backed implementation.
> > > +    */
> > > +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> > > +        return returnCode;
> > >
> > > I'm pretty sure you can just delete the below if statement
> > >
> > > +    if (returnCode == 0 ||
> > > +        (returnCode != EINVAL && returnCode != EINVAL))
> > > +        return returnCode;
> >
> > Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
> > FileFallocate() succeeded or failed (rather than not being supported)?
> 
> Uh...I'm confused...maybe my eyes aren't working. If returnCode was 0,
> you already would have returned and if returnCode wasn't EINVAL, you
> also already would have returned.
> Not to mention (returnCode != EINVAL && returnCode != EINVAL) contains
> two identical operands.

I'm afraid it was not your eyes that weren't working...


> > >  void
> > > -BufferCheckOneLocalPin(Buffer buffer)
> > > +BufferCheckWePinOnce(Buffer buffer)
> > >
> > > This name is weird. Who is we?
> >
> > The current backend. I.e. the function checks that the current backend pins
> > the buffer exactly once, rather than that *any* backend pins it once.
> >
> > I now see that BufferIsPinned() is, IMO, misleadingly named more generally,
> > even though it also applies only to pins by the current backend.
> 
> Maybe there is a way to use "self" instead of a pronoun? But, if you
> feel quite strongly about a pronoun, I think "we" implies more than one
> backend, so "I" would be better.

I have no strong feelings around this in any form :)




> > > @@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
> > >  {
> > >      uint32        buf_state;
> > >
> > > I noticed that the comment above TerminateBufferIO() says
> > >  * TerminateBufferIO: release a buffer we were doing I/O on
> > >  *    (Assumptions)
> > >  *    My process is executing IO for the buffer
> > >
> > > Can we still say this is an assumption? What about when it is being
> > > cleaned up after being called from AbortBufferIO()
> >
> > That hasn't really changed - it was already called by AbortBufferIO().
> >
> > I think it's still correct, too. We must have marked the IO as being in
> > progress to get there.
> 
> Oh, no I meant the "my process is executing IO for the buffer" --
> couldn't another backend clear IO_IN_PROGRESS (i.e. not the original one
> who set it on all the victim buffers)?

No. Or at least not yet ;) - with AIO we will... Only the IO issuing backend
currently is allowed to reset IO_IN_PROGRESS.


> > > +    /* Should the smgr size cache be cleared? */
> > > +    EB_CLEAR_SIZE_CACHE = (1 << 4),
> > > +
> > > +    /* internal flags follow */
> > >
> > > I don't understand what this comment means ("internal flags follow")
> >
> > Hm - just that the flags defined subsequently are for internal use, not for
> > callers to specify.
> 
> If EB_LOCK_TARGET is the only one of these, it might be more clear, for
> now, to just say "an internal flag" or "for internal use" above
> EB_LOCK_TARGET, since it is the only one.

I am quite certain it won't be the only one...


> > > -        /* Without FSM, always fall out of the loop and extend */
> > > -        if (!use_fsm)
> > > -            break;
> > > +        if (bistate
> > > +            && bistate->next_free != InvalidBlockNumber
> > > +            && bistate->next_free <= bistate->last_free)
> > > +        {
> > > +            /*
> > > +            * We bulk extended the relation before, and there are still some
> > > +            * unused pages from that extension, so we don't need to look in
> > > +            * the FSM for a new page. But do record the free space from the
> > > +            * last page, somebody might insert narrower tuples later.
> > > +            */
> > >
> > > Why couldn't we have found out that we bulk-extended before and get the
> > > block from there up above the while loop?
> >
> > I'm not quite sure I follow - above the loop there might still have been space
> > on the prior page? We also need the ability to loop if the space has been used
> > since.
> >
> > I guess there's an argument for also checking above the loop, but I don't
> > think that'd currently ever be reachable.
> 
> My idea was that directly below this code in RelationGetBufferForTuple():
> 
>     if (bistate && bistate->current_buf != InvalidBuffer)
>         targetBlock = BufferGetBlockNumber(bistate->current_buf);
>     else
>         targetBlock = RelationGetTargetBlock(relation);
> 
> We could check bistate->next_free instead of checking the freespace map first.
> 
> But, perhaps we couldn't hit this because we would have already have set
> current_buf if we had a next_free?

Correct. I think it might be worth doing a larger refactoring of that function
at some point not too far away...

It's definitely somewhat sad that we spend time locking the buffer, rechecking
pins, etc., for callers like heap_multi_insert() and heap_update() that
already know that the page is full. But that seems independent enough
that I'd not tackle it now.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

Attached is v6. Changes:

- Try to address Melanie and Horiguchi-san's review. I think there's one or
  two further things that need to be done

- Avoided inserting newly extended pages into the FSM while holding a buffer
  lock. If we need to do so, we now drop the buffer lock and recheck if there
  still is space (very very likely). See also [1]. I use the infrastructure
  introduced over in that thread in this patchset.

- Lots of comment and commit message polishing. More needed, particularly for
  the latter, but ...

- Added a patch to fix the pre-existing undefined behaviour in localbuf.c that
  Melanie pointed out. Plan to commit that soon.

- Added a patch to fix some pre-existing DO_DB() format code issues. Plan to
  commit that soon.


I did some benchmarking on "bufmgr: Acquire and clean victim buffer
separately" in isolation. For workloads that do a *lot* of reads, that proves
to be a substantial benefit on its own.  For the, obviously unrealistically
extreme, workload of N backends doing
  SELECT pg_prewarm('pgbench_accounts', 'buffer');
in a scale 100 database (with a 1281MB pgbench_accounts) and shared_buffers of
128MB, I see > 2x gains at 128, 512 clients.  Of course realistic workloads
will have much smaller gains, but it's still good to see.


Looking at the patchset, I am mostly happy with the breakdown into individual
commits. However "bufmgr: Move relation extension handling into
ExtendBufferedRel{By,To,}" is quite large. But I don't quite see how to break
it into smaller pieces without making things awkward (e.g. due to static
functions being unused, or temporarily duplicating the code doing relation
extensions).


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
John Naylor
Date:

On Thu, Mar 30, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> Attached is v6. Changes:

0006:

+        ereport(ERROR,
+            errcode_for_file_access(),
+            errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
+                 FilePathName(v->mdfd_vfd)),
+            errhint("Check free disk space."));

Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was used in FileFallocate().

--
John Naylor
EDB: http://www.enterprisedb.com

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-30 12:28:57 +0700, John Naylor wrote:
> On Thu, Mar 30, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> > Attached is v6. Changes:
> 
> 0006:
> 
> +        ereport(ERROR,
> +            errcode_for_file_access(),
> +            errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
> +                 FilePathName(v->mdfd_vfd)),
> +            errhint("Check free disk space."));
> 
> Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was
> used in FileFallocate().

Fair point. I would however like to see a different error message for the two
ways of extending, at least initially. What about referencing FileFallocate()?

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-03-29 20:02:33 -0700, Andres Freund wrote:
> Attached is v6. Changes:

Attached is v7. Not much in the way of changes:
- polished a lot of the commit messages
- reordered commits to not be blocked as immediately by
  https://www.postgresql.org/message-id/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de
- Used the new relation extension function in two more places (finding an
  independent bug on the way), not sure why I didn't convert those earlier...

I'm planning to push the patches up to the hio.c changes soon, unless somebody
would like me to hold off.

After that I'm planning to wait for a buildfarm cycle, and push the changes
necessary to use bulk extension in hio.c (the main win).

I might split the patch to use ExtendBufferedRelTo() into two, one for
vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more
complicated and has more of a complicated history (see [1]).

Greetings,

Andres Freund


[1] https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
John Naylor
Date:

On Wed, Apr 5, 2023 at 7:32 AM Andres Freund <andres@anarazel.de> wrote:

> > Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was
> > used in FileFallocate().
>
> Fair point. I would however like to see a different error message for the two
> ways of extending, at least initially. What about referencing FileFallocate()?

Seems logical.

--
John Naylor
EDB: http://www.enterprisedb.com

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-04 17:39:45 -0700, Andres Freund wrote:
> I'm planning to push the patches up to the hio.c changes soon, unless somebody
> would like me to hold off.

Done that.


> After that I'm planning to wait for a buildfarm cycle, and push the changes
> necessary to use bulk extension in hio.c (the main win).

Working on that. Might end up being tomorrow.


> I might split the patch to use ExtendBufferedRelTo() into two, one for
> vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more
> complicated and has more of a complicated history (see [1]).

I've pushed the vm_extend() and fsm_extend() piece, and did split out the
xlogutils.c case.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-05 18:46:16 -0700, Andres Freund wrote:
> On 2023-04-04 17:39:45 -0700, Andres Freund wrote:
> > After that I'm planning to wait for a buildfarm cycle, and push the changes
> > necessary to use bulk extension in hio.c (the main win).
> 
> Working on that. Might end up being tomorrow.

It did. So far no complaints from the buildfarm. Although I pushed the last
piece just now...

Besides editing the commit message, not a lot of changes between what I posted
last and what I pushed. A few typos and awkward sentences, code formatting,
etc. I did change the API of RelationAddBlocks() to set *did_unlock = false,
if it didn't unlock (and thus removed setting it in the caller).


> > I might split the patch to use ExtendBufferedRelTo() into two, one for
> > vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more
> > complicated and has more of a complicated history (see [1]).
> 
> I've pushed the vm_extend() and fsm_extend() piece, and did split out the
> xlogutils.c case.

Which I pushed, just now. I did perform manual testing, creating
disconnected segments on the standby and checking that everything behaves
well in that case.


I think it might be worth having a C test for some of the bufmgr.c API. Things
like testing that retrying a failed relation extension works the second time
round.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-06 18:15:14 -0700, Andres Freund wrote:
> I think it might be worth having a C test for some of the bufmgr.c API. Things
> like testing that retrying a failed relation extension works the second time
> round.

A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would
hopefully have been uncovered by such a test...

I guess we could even test this specific instance without a more complicated
framework.  Create table with some data, rename the file, checkpoint - should
fail, rename back, checkpoint - should succeed.

It's much harder to exercise the error paths inside the backend extending the
relation unfortunately, because we require the file to be opened rw before
doing much. And once the FD is open, removing the permissions doesn't help.
The least complicated approach I can see is creating directory quotas, but
that's quite filesystem specific...

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Alexander Lakhin
Date:
Hi Andres,

07.04.2023 11:39, Andres Freund wrote:
> Hi,
>
> On 2023-04-06 18:15:14 -0700, Andres Freund wrote:
>> I think it might be worth having a C test for some of the bufmgr.c API. Things
>> like testing that retrying a failed relation extension works the second time
>> round.
> A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would
> hopefully have been uncovered by such a test...

A few days later I've found a new defect introduced with 31966b151.
The following script:
echo "
CREATE TABLE tbl(id int);
INSERT INTO tbl(id) SELECT i FROM generate_series(1, 1000) i;
DELETE FROM tbl;
CHECKPOINT;
" | psql -q

sleep 2

grep -C2 'automatic vacuum of table ".*.tbl"' server.log

tf=$(psql -Aqt -c "SELECT format('%s/%s', pg_database.oid, relfilenode) FROM pg_database, pg_class WHERE datname = 
current_database() AND relname = 'tbl'")
ls -l "$PGDB/base/$tf"

pg_ctl -D $PGDB stop -m immediate
pg_ctl -D $PGDB -l server.log start

with the autovacuum enabled as follows:
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_naptime = 1

gives:
2023-04-11 20:56:56.261 MSK [675708] LOG:  checkpoint starting: immediate force wait
2023-04-11 20:56:56.324 MSK [675708] LOG:  checkpoint complete: wrote 900 buffers (5.5%); 0 WAL file(s) added, 0 
removed, 0 recycled; write=0.016 s, sync=0.034 s, total=0.063 s; sync files=252, longest=0.017 s, average=0.001 s; 
distance=4162 kB, estimate=4162 kB; lsn=0/1898588, redo lsn=0/1898550
2023-04-11 20:56:57.558 MSK [676060] LOG:  automatic vacuum of table "testdb.public.tbl": index scans: 0
         pages: 5 removed, 0 remain, 5 scanned (100.00% of total)
         tuples: 1000 removed, 0 remain, 0 are dead but not yet removable
-rw------- 1 law law 0 апр 11 20:56 .../tmpdb/base/16384/16385
waiting for server to shut down.... done
server stopped
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.

server stops with the following stack trace:
Core was generated by `postgres: startup recovering 000000010000000000000001                         '.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/790626' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140209454906240, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f850ec53476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f850ec397f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000557950889c0b in ExceptionalCondition (
     conditionName=0x557950a67680 "mode == RBM_NORMAL || mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_ON_ERROR",
     fileName=0x557950a673e8 "bufmgr.c", lineNumber=1008) at assert.c:66
#6  0x000055795064f2d0 in ReadBuffer_common (smgr=0x557952739f38, relpersistence=112 'p', forkNum=MAIN_FORKNUM,
     blockNum=4294967295, mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, hit=0x7fff22dd648f) at bufmgr.c:1008
#7  0x000055795064ebe7 in ReadBufferWithoutRelcache (rlocator=..., forkNum=MAIN_FORKNUM, blockNum=4294967295,
     mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, permanent=true) at bufmgr.c:800
#8  0x000055795021c0fa in XLogReadBufferExtended (rlocator=..., forknum=MAIN_FORKNUM, blkno=0,
     mode=RBM_ZERO_AND_CLEANUP_LOCK, recent_buffer=0) at xlogutils.c:536
#9  0x000055795021bd92 in XLogReadBufferForRedoExtended (record=0x5579526c4998, block_id=0 '\000', mode=RBM_NORMAL,
     get_cleanup_lock=true, buf=0x7fff22dd6598) at xlogutils.c:391
#10 0x00005579501783b1 in heap_xlog_prune (record=0x5579526c4998) at heapam.c:8726
#11 0x000055795017b7db in heap2_redo (record=0x5579526c4998) at heapam.c:9960
#12 0x0000557950215b34 in ApplyWalRecord (xlogreader=0x5579526c4998, record=0x7f85053d0120, replayTLI=0x7fff22dd6720)
     at xlogrecovery.c:1915
#13 0x0000557950215611 in PerformWalRecovery () at xlogrecovery.c:1746
#14 0x0000557950201ce3 in StartupXLOG () at xlog.c:5433
#15 0x00005579505cb6d2 in StartupProcessMain () at startup.c:267
#16 0x00005579505be9f7 in AuxiliaryProcessMain (auxtype=StartupProcess) at auxprocess.c:141
#17 0x00005579505ca2b5 in StartChildProcess (type=StartupProcess) at postmaster.c:5369
#18 0x00005579505c5224 in PostmasterMain (argc=3, argv=0x5579526c3e70) at postmaster.c:1455
#19 0x000055795047a97d in main (argc=3, argv=0x5579526c3e70) at main.c:200

As far as I can see, autovacuum removes pages from the table file, and this causes
the crash while replaying the record:
rmgr: Heap2       len (rec/tot):     60/   988, tx:          0, lsn: 0/01898600, prev 0/01898588, desc: PRUNE 
snapshotConflictHorizon 732 nredirected 0 ndead 226, blkref #0: rel 1663/16384/16385 blk 0 FPW

Best regards,
Alexander



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote:
> A few days later I've found a new defect introduced with 31966b151.

That's the same issue that Tom also just reported, at
https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us

Attached is my WIP fix, including a test.

Greetings,

Andres Freund

Attachment

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Alexander Lakhin
Date:
12.04.2023 02:21, Andres Freund wrote:
> Hi,
>
> On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote:
>> A few days later I've found a new defect introduced with 31966b151.
> That's the same issue that Tom also just reported, at
> https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us
>
> Attached is my WIP fix, including a test.

Thanks for the fix. I can confirm that the issue is gone.
ReadBuffer_common() contains an Assert() that is similar to the fixed one,
but it looks unreachable for the WAL replay case after 26158b852.

Best regards,
Alexander



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-04-12 08:00:00 +0300, Alexander Lakhin wrote:
> 12.04.2023 02:21, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote:
> > > A few days later I've found a new defect introduced with 31966b151.
> > That's the same issue that Tom also just reported, at
> > https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us
> > 
> > Attached is my WIP fix, including a test.
> 
> Thanks for the fix. I can confirm that the issue is gone.
> ReadBuffer_common() contains an Assert(), that is similar to the fixed one,
> but it looks unreachable for the WAL replay case after 26158b852.

Good catch. I implemented it there too. As all of the modes are now supported,
I removed the assertion.

I also extended the test slightly to also test the case of dropped relations.

Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Muhammad Malik
Date:
Hi,

Could you please share repro steps for running these benchmarks? I am doing performance testing in this area and want to use the same benchmarks.

Thanks,
Muhammad

From: Andres Freund <andres@anarazel.de>
Sent: Friday, October 28, 2022 7:54 PM
To: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>; Thomas Munro <thomas.munro@gmail.com>; Melanie Plageman <melanieplageman@gmail.com>
Cc: Yura Sokolov <y.sokolov@postgrespro.ru>; Robert Haas <robertmhaas@gmail.com>
Subject: refactoring relation extension and BufferAlloc(), faster COPY
 
Hi,

I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock.  We acquire a victim buffer, if that
buffer is dirty, we potentially flush the WAL, then write out that
buffer. Then we zero out the buffer contents. Call smgrextend().

Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
   smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS


1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.


The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation lock and that relation
extension happens entirely within ReadBuffer* - there's no way to use a
narrower scope for the lock.


My fix for that is to add a dedicated function for extending relations, that
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.


To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffer are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().

This is similar to Yuri's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty gains, with no issues around the number of buffer mapping entries that
can exist etc.



Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgwrite() that BufferAlloc()
   does, nearly doubles the amount of writes. First the kernel ends up writing
   out all the zeroed out buffers after a while, then when we write out the
   actual buffer contents.

   The best fix for that seems to be to optionally use posix_fallocate() to
   reserve space, without dirtying pages in the kernel page cache. However, it
   looks like that's only beneficial when extending by multiple pages at once,
   because it ends up causing one filesystem-journal entry for each extension
   on at least some filesystems.

   I added 'smgrzeroextend()' that can extend by multiple blocks, without the
   caller providing a buffer to write out. When extending by 8 or more blocks,
   posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
   used to extend the file.


b) I found that it is quite beneficial to bulk-extend the relation with
   smgrextend() even without concurrency. The reason for that is primarily
   the aforementioned dirty buffers that our current extension method causes.

   One bit that stumped me for quite a while is to know how much to extend the
   relation by. RelationGetBufferForTuple() drives the decision whether / how
   much to bulk extend purely on the contention on the extension lock, which
   obviously does not work for non-concurrent workloads.

   After quite a while I figured out that we actually have good information on
   how much to extend by, at least for COPY /
   heap_multi_insert(). heap_multi_insert() can compute how much space is
   needed to store all tuples, and pass that on to
   RelationGetBufferForTuple().

   For that to be accurate we need to recompute that number whenever we use an
   already partially filled page. That's not great, but doesn't appear to be a
   measurable overhead.


c) Contention on the FSM and the pages returned by it is a serious bottleneck
   after a) and b).

   The biggest issue is that the current bulk insertion logic in hio.c enters
   all but one of the new pages into the freespacemap. That will immediately
   cause all the other backends to contend on the first few pages returned by
   the FSM, and cause contention on the FSM pages themselves.

   I've partially addressed that by using the information about the required
   number of pages from b). Whether we bulk insert or not, the pages
   we know we're going to need for one heap_multi_insert() don't need to be
   added to the FSM - we're going to use them anyway.

   I've stashed the number of free blocks in the BulkInsertState for now, but
   I'm not convinced that that's the right place.
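
   For reference, the bookkeeping currently looks roughly like this
   (next_free/last_free are the names used in the patch; the other fields
   are as in HEAD):

       typedef struct BulkInsertStateData
       {
           BufferAccessStrategy strategy;  /* our BULKWRITE strategy object */
           Buffer      current_buf;        /* current insertion target page */

           /*
            * Pages left over from the last bulk extension, known to be
            * empty. Consumed before asking the FSM for more space.
            */
           BlockNumber next_free;
           BlockNumber last_free;
       } BulkInsertStateData;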

   If I revert just this part, the "concurrent COPY into unlogged table"
   benchmark goes from ~240 tps to ~190 tps.


   Even after that change the FSM is a major bottleneck. Below I included
   benchmarks showing this by just removing the use of the FSM, but I haven't
   done anything further about it. The contention seems to be both from
   updating the FSM, as well as thundering-herd like symptoms from accessing
   the FSM.

   The update part could likely be addressed to some degree with a batch
   update operation updating the state for multiple pages.

   The access part could perhaps be addressed by adding an operation that gets
   a page and immediately marks it as fully used, so other backends won't also
   try to access it.
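
   In terms of API, the two ideas might look something like this - purely
   hypothetical signatures, nothing like them exists in the tree:

       /* batch update: record free space for a contiguous run of new pages */
       extern void RecordPageRangeFreeSpace(Relation rel,
                                            BlockNumber first_block,
                                            BlockNumber last_block,
                                            Size spaceAvail);

       /*
        * Get a page and immediately mark it as fully used, so concurrent
        * backends aren't all handed the same block.
        */
       extern BlockNumber GetAndConsumePageWithFreeSpace(Relation rel,
                                                         Size spaceNeeded);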



d) doing
                /* new buffers are zero-filled */
                MemSet((char *) bufBlock, 0, BLCKSZ);

  under the extension lock is surprisingly expensive on my two-socket
  workstation (but much less noticeable on my laptop).

  If I move the MemSet back under the extension lock, the "concurrent COPY
  into unlogged table" benchmark goes from ~240 tps to ~200 tps.


e) When running a few benchmarks for this email, I noticed that there was a
   sharp performance dropoff for the patched code for a pgbench -S -s100 on a
   database with 1GB s_b, starting between 512 and 1024 clients. This started with
   the patch only acquiring one buffer partition lock at a time. Lots of
   debugging ensued, resulting in [3].

   The problem isn't actually related to the change, it just makes it more
   visible, because the "lock chains" between two partitions reduce the
   average length of the wait queues substantially, by distributing them
   across more partitions.  [3] has a reproducer that's entirely independent
   of this patchset.




Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend and releases the extension lock. After that
buffer[s] are locked (depending on mode and an argument indicating the number
of blocks to be locked), and TerminateBufferIO() is called.
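
In pseudocode the sequence is roughly as follows (names are approximate;
InsertBufferIntoMapping() stands in for the BufTable insertion):

    LimitAdditionalPins(&extend_by);        /* cap how many we may pin */
    for (i = 0; i < extend_by; i++)
        victims[i] = GetVictimBuffer(...);  /* pinned, clean, invalid */

    LockRelationForExtension(rel, ExclusiveLock);
    first_block = smgrnblocks(smgr, fork);
    for (i = 0; i < extend_by; i++)
    {
        InsertBufferIntoMapping(victims[i], fork, first_block + i);
        StartBufferIO(victims[i], true);    /* mark as io-in-progress */
    }
    smgrzeroextend(smgr, fork, first_block, extend_by, false); /* or smgrextend() */
    UnlockRelationForExtension(rel, ExclusiveLock);

    for (i = 0; i < extend_by; i++)
    {
        if (i < blocks_to_lock)             /* per mode / caller argument */
            LockBuffer(BufferDescriptorGetBuffer(victims[i]),
                       BUFFER_LOCK_EXCLUSIVE);
        TerminateBufferIO(victims[i], false, BM_VALID);
    }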

This requires two new pieces of infrastructure:

First, pinning multiple buffers opens up the obvious danger that we might run
out of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows each
backend to pin a proportional share of buffers (although always allowing one,
as we do today).
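
A minimal sketch of that limit - essentially a proportional-share
computation (simplified):

    static void
    LimitAdditionalPins(uint32 *additional_pins)
    {
        uint32  max_backends;
        int     max_proportional_pins;

        if (*additional_pins <= 1)
            return;                         /* one pin is always allowed */

        max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
        max_proportional_pins = NBuffers / max_backends;

        /* each backend may pin its proportional share, but at least one */
        if (max_proportional_pins <= 0)
            max_proportional_pins = 1;

        if (*additional_pins > max_proportional_pins)
            *additional_pins = max_proportional_pins;
    }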

Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added a ResourceOwnerRememberBufferIO() etc to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (now released
inside ReleaseAuxProcessResources()).

In very extreme workloads (single backend doing a pgbench -S -s 100 against a
s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible, not sure if that's worth worrying about - we do such
allocations for the much more common pinning of buffers as well.


The new [Bulk]ExtendSharedRelationBuffered() currently have both a Relation
and a SMgrRelation argument, requiring at least one of them to be set. The
reason for that is on the one hand that LockRelationForExtension() requires a
relation and on the other hand, redo routines typically don't have a Relation
around (recovery doesn't require an extension lock).  That's not pretty, but
seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.



I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads, more below.


server "base" configuration:

max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and somewhat larger files ([2],
lots of data to write out due to ringbuffer).

To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.

s_b=24GB

test: unlogged_small_files, format: text, files: 1024, 9015MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       58.63   207     50.22   242     54.35   224
2       32.67   372     25.82   472     27.30   446
4       22.53   540     13.30   916     14.33   851
8       15.14   804     7.43    1640    7.48    1632
16      14.69   829     4.79    2544    4.50    2718
32      15.28   797     4.41    2763    3.32    3710
64      15.34   794     5.22    2334    3.06    4061
128     15.49   786     4.97    2452    3.13    3926
256     15.85   768     5.02    2427    3.26    3769
512     16.02   760     5.29    2303    3.54    3471

test: logged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       68.18   178     59.41   205     63.43   192
2       39.71   306     33.10   368     34.99   348
4       27.26   446     19.75   617     20.09   607
8       18.84   646     12.86   947     12.68   962
16      15.96   763     9.62    1266    8.51    1436
32      15.43   789     8.20    1486    7.77    1579
64      16.11   756     8.91    1367    8.90    1383
128     16.41   742     10.00   1218    9.74    1269
256     17.33   702     11.91   1023    10.89   1136
512     18.46   659     14.07   866     11.82   1049

test: unlogged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       63.27s  192     56.14   217     59.25   205
2       40.17s  303     29.88   407     31.50   386
4       27.57s  442     16.16   754     17.18   709
8       21.26s  573     11.89   1025    11.09   1099
16      21.25s  573     10.68   1141    10.22   1192
32      21.00s  580     10.72   1136    10.35   1178
64      20.64s  590     10.15   1200    9.76    1249
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       71.89s  169     65.57   217     69.09   69.09
2       47.36s  257     36.22   407     38.71   38.71
4       33.10s  368     21.76   754     22.78   22.78
8       26.62s  457     15.89   1025    15.30   15.30
16      24.89s  489     17.08   1141    15.20   15.20
32      25.15s  484     17.41   1136    16.14   16.14
64      26.11s  466     17.89   1200    16.76   16.76
128     skipped
256     skipped
512     skipped


Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.

test: unlogged_small_files, format: binary, files: 1024, 9508MB total
        seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch   no_fsm  no_fsm
1       34.14   357     28.04   434     29.46   413
2       22.67   537     14.42   845     14.75   826
4       16.63   732     7.62    1599    7.69    1587
8       13.48   904     4.36    2795    4.13    2959
16      14.37   848     3.78    3224    2.74    4493
32      14.79   823     4.20    2902    2.07    5974
64      14.76   825     5.03    2423    2.21    5561
128     14.95   815     4.36    2796    2.30    5343
256     15.18   802     4.31    2828    2.49    4935
512     15.41   790     4.59    2656    2.84    4327


s_b=4GB

test: unlogged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.55   194     54.22   224
2       37.11   328     28.94   421
4       25.97   469     16.42   742
8       20.01   609     11.92   1022
16      19.55   623     11.05   1102
32      19.34   630     11.27   1081
64      19.07   639     12.04   1012
128     19.22   634     12.27   993
256     19.34   630     12.28   992
512     19.60   621     11.74   1038

test: logged_small_files, format: text, files: 1024, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.71   169     63.63   191
2       46.93   259     36.31   335
4       30.37   401     22.41   543
8       22.86   533     16.90   721
16      20.18   604     14.07   866
32      19.64   620     13.06   933
64      19.71   618     15.08   808
128     19.95   610     15.47   787
256     20.48   595     16.53   737
512     21.56   565     16.86   722

test: unlogged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       62.65   194     55.74   218
2       40.25   302     29.45   413
4       27.37   445     16.26   749
8       22.07   552     11.75   1037
16      21.29   572     10.64   1145
32      20.98   580     10.70   1139
64      20.65   590     10.21   1193
128     skipped
256     skipped
512     skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       71.72   169     65.12   187
2       46.46   262     35.74   341
4       32.61   373     21.60   564
8       26.69   456     16.30   747
16      25.31   481     17.00   716
32      24.96   488     17.47   697
64      26.05   467     17.90   680
128     skipped
256     skipped
512     skipped


test: unlogged_small_files, format: binary, files: 1024, 9505MB total
        seconds tbl-MBs seconds tbl-MBs
clients HEAD    HEAD    patch   patch
1       37.62   323     32.77   371
2       28.35   429     18.89   645
4       20.87   583     12.18   1000
8       19.37   629     10.38   1173
16      19.41   627     10.36   1176
32      18.62   654     11.04   1103
64      18.33   664     11.89   1024
128     18.41   661     11.91   1023
256     18.52   658     12.10   1007
512     18.78   648     11.49   1060


benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.


The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...

One would need to run pgbench for impractically long to make that effect
vanish.

My not-great solution for this was to run with -T21 -P5 and use the best 5s
interval as the tps.
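
A run along those lines might look like this sketch (everything beyond -S,
-T21 and -P5 is an assumption, not taken from the original setup):

c=64
pgbench -n -S -M prepared -c$c -j$c -T21 -P5 postgres 2>&1 | tee run.log
# take the best 5s interval's tps from the "progress:" lines
grep '^progress:' run.log | awk '{print $4}' | sort -rn | head -1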


s_b=1GB
        master          patched
clients tps             tps
1          49541           48805
2          85342           90010
4         167340          168918
8         308194          303222
16        524294          523678
32        649516          649100
64        932547          937702
128       908249          906281
256       856496          903979
512       764254          934702
1024      653886          925113
2048      569695          917262
4096      526782          903258


s_b=128MB:
        master          patched
clients tps             tps
1          40407           39854
2          73180           72252
4         143334          140860
8         240982          245331
16        429265          420810
32        544593          540127
64        706408          726678
128       713142          718087
256       611030          695582
512       552751          686290
1024      508248          666370
2048      474108          656735
4096      448582          633040


As there might be a small regression at the smallest end, I ran a more extreme
version of the above: a pipelined pgbench -S, with a single client, running for
longer, with s_b=8MB.

To further reduce noise I pinned the server to one CPU and the client to
another, and disabled turbo mode on the CPU.
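
The pipelined script itself wasn't shown above; a minimal sketch of a
pipelined -S equivalent, using pgbench's \startpipeline/\endpipeline
(available since v14), could look like:

cat > /tmp/pipelined_select.sql <<'EOF'
\set aid1 random(1, 100000 * :scale)
\set aid2 random(1, 100000 * :scale)
\startpipeline
SELECT abalance FROM pgbench_accounts WHERE aid = :aid1;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid2;
\endpipeline
EOF

# pipeline mode requires the extended query protocol, hence -M prepared;
# -T60 is an arbitrary stand-in for "longer"
pgbench -n -M prepared -c1 -j1 -T60 -P5 -f /tmp/pipelined_select.sql postgres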

master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4

Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.

My guess, however, is that it's the resource owner tracking for in-progress IO
that I added - it adds an additional allocation inside the resowner machinery.
I commented the relevant calls out (that's obviously incorrect!) just to see
whether that changes anything:

no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2

So indeed, it looks like it's the resowner. I am a bit surprised, because we
already use that mechanism for pins, which are obviously far more frequent.

I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:

- We could preallocate an initial element inside the ResourceArray struct, so
  that a newly created resowner won't need to allocate immediately
- We could only use resowners if there's more than one IO in progress at the
  same time - but I don't like that idea much
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin'
  resowner entry - on 64-bit systems there's plenty of space for that. But on
  32-bit systems...


The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.

Greetings,

Andres Freund

[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de

Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-05-03 18:25:51 +0000, Muhammad Malik wrote:
> Could you please share repro steps for running these benchmarks? I am doing
> performance testing in this area and want to use the same benchmarks.

The email should contain all the necessary information. What are you missing?


> I've done a fair bit of benchmarking of this patchset. For COPY it comes out
> ahead everywhere. It's possible that there's a very small regression for
> extremely IO miss heavy workloads, more below.
>
>
> server "base" configuration:
>
> max_wal_size=150GB
> shared_buffers=24GB
> huge_pages=on
> autovacuum=0
> backend_flush_after=2MB
> max_connections=5000
> wal_buffers=128MB
> wal_segment_size=1GB
>
> benchmark: pgbench running COPY into a single table. pgbench -t is set
> according to the client count, so that the same amount of data is inserted.
> This is done both using small files ([1], ringbuffer not effective, no dirty
> data to write out within the benchmark window) and a bit larger files ([2],
> lots of data to write out due to ringbuffer).

I use a script like:

c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && \
  time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c \
  -t$((1024/$c)) -f ~/tmp/copy.sql && \
  psql -c 'TRUNCATE copytest_0'


> [1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000))
>     TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> [2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000))
>     TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);


Greetings,

Andres Freund



Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Muhammad Malik
Date:

> I use a script like:

> c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'

> >[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> >[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);

When I ran this script it did not insert anything into the copytest_0 table. It only generated a single copytest_data_text.copy file of size 9.236MB. 
Please help me understand how this is 'pgbench running COPY into a single table'. Also, what are the 'seconds' and 'tbl-MBs' metrics that were reported?

Thank you,
Muhammad


Re: refactoring relation extension and BufferAlloc(), faster COPY

From
Andres Freund
Date:
Hi,

On 2023-05-03 19:29:46 +0000, Muhammad Malik wrote:
> > I use a script like:
>
> > c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data
> > text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1
> > -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'
>
> > >[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000))
> > >    TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
> > >[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000))
> > >    TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
>
> When I ran this script it did not insert anything into the copytest_0 table.
> It only generated a single copytest_data_text.copy file of size 9.236MB.
>
> Please help me understand how this is 'pgbench running COPY into a single
> table'.

That's the data generation for the file to be COPYed in. The script passed to
pgbench is just something like

COPY copytest_0 FROM '/tmp/copytest_data_text.copy';
or
COPY copytest_0 FROM '/tmp/copytest_data_binary.copy';


> Also, what are the 'seconds' and 'tbl-MBs' metrics that were reported?

"seconds" is the total time for running the N COPYs (N = 1024 for the small
files, 64 for the larger ones). "tbl-MBs" is the size of the resulting table
divided by that time, i.e. a measure of throughput.
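
In other words, something along these lines (a sketch - the actual harness
wasn't posted; assumes $c and ~/tmp/copy.sql from the script above):

start=$(date +%s.%N)
/srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c \
  -t$((1024/$c)) -f ~/tmp/copy.sql
seconds=$(echo "$(date +%s.%N) - $start" | bc)

# table size in MB, via pg_table_size()
tbl_mb=$(psql -XtAc "SELECT pg_table_size('copytest_0') / (1024 * 1024)")
echo "seconds: $seconds  tbl-MBs: $(echo "$tbl_mb / $seconds" | bc)"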

Greetings,

Andres Freund