Thread: Make ringbuffer threshold and ringbuffer sizes configurable?


From: Andres Freund
Hi,

The ringbuffers we use for seqscans, vacuum, copy etc. can cause very
drastic slowdowns (see e.g. [1]), and can cause some workloads to
practically never end up utilizing shared buffers. ETL workloads,
for example, regularly fight with that problem.

While I think there are a number of improvements [2] we could make to the
ringbuffer logic, I think we should also just allow making them
configurable.  That would let a decent number of systems perform
better (especially on slightly bigger systems, where the current
ringbuffers are *way* too small), make the thresholds more discoverable
(e.g. the NBuffers / 4 threshold is very confusing), and make it
easier to experiment with better default values.

I think it would make sense to have seqscan_ringbuffer_threshold and
{bulkread,bulkwrite,vacuum}_ringbuffer_size. They are often sensibly
set in proportion to shared_buffers, so I suggest defining them as
floats, where negative values divide shared_buffers, positive
values are absolute sizes, and 0 disables the use of ringbuffers.

I.e. to maintain the current defaults, seqscan_ringbuffer_threshold
would be -4.0, but it could also be set to an absolute 4GB (converted to
pages). We would probably want a GUC show function that displays
proportional values in a nice way.

We probably should also just increase all the ringbuffer sizes by an
order of magnitude or two, especially the one for VACUUM.

Greetings,

Andres Freund

[1] https://postgr.es/m/20190507201619.lnyg2nyhmpxcgeau%40alap3.anarazel.de

[2] The two most important things imo:
    a) Don't evict buffers when falling off the ringbuffer as long as
       there are unused buffers on the freelist. Possibly just set their
       usagecount to zero as long as that is the case.
    b) The biggest performance pain comes from ringbuffers where the
       buffers are likely to be dirty (vacuum, copy), because evicting
       them requires that the corresponding WAL be flushed. That often
       turns many individual buffer evictions into an fdatasync,
       slowing things down to a crawl, and the contention caused by it
       is a significant concurrency issue too. By doing writes, but not
       flushes, shortly after the insertion, we can reduce the cost
       significantly.



RE: Make ringbuffer threshold and ringbuffer sizes configurable?

From: "tsunakawa.takay@fujitsu.com"
From: Andres Freund <andres@anarazel.de>
> While I think there are a number of improvements [2] we could make to the
> ringbuffer logic, I think we should also just allow making them
> configurable.  That would let a decent number of systems perform
> better (especially on slightly bigger systems, where the current
> ringbuffers are *way* too small), make the thresholds more discoverable
> (e.g. the NBuffers / 4 threshold is very confusing), and make it
> easier to experiment with better default values.

+1
The NBuffers / 4 logic sometimes caused unexpected behavior.  IIRC, even when some batch or analytic processing needed
to read large tables sequentially multiple times, the second and subsequent reads didn't get the benefit of caching.
Another example is that before pg_prewarm became available, I couldn't cache the entire table by running "SELECT * FROM
table" before benchmarking performance.


> I think it would make sense to have seqscan_ringbuffer_threshold,
> {bulkread,bulkwrite,vacuum}_ringbuffer_size. I think they often sensibly
> are set in proportion of shared_buffers, so I suggest defining them as
> floats, where negative values divide shared_buffers, whereas positive
> values are absolute sizes, and 0 disables the use of ringbuffers.
>
> I.e. to maintain the current defaults, seqscan_ringbuffer_threshold
> would be -4.0, but could be also be set to an absolute 4GB (converted to
> pages). Probably would want a GUC show function that displays
> proportional values in a nice way.

I think a per-table reloption is necessary as well as, or instead of, a GUC, because the need for caching depends on the
table (see below for Oracle's manual).

I'm afraid it would be confusing for a user-settable parameter to have different units (percent and size).  I think
just the positive percentage would suffice.


CREATE TABLE

https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-TABLE.html#GUID-F9CE0CC3-13AE-4744-A43C-EAC7A71AAAB6
--------------------------------------------------
CACHE | NOCACHE | CACHE READS

Use these clauses to indicate how Oracle Database should store blocks in the buffer cache. For LOB storage, you can
specify CACHE, NOCACHE, or CACHE READS. For other types of storage, you can specify only CACHE or NOCACHE.

The behavior of CACHE and NOCACHE described in this section does not apply when Oracle Database chooses to use direct
reads or to perform table scans using parallel query.

CACHE

For data that is accessed frequently, this clause indicates that the blocks retrieved for this table are placed at the
most recently used end of the least recently used (LRU) list in the buffer cache when a full table scan is performed.
This attribute is useful for small lookup tables.

NOCACHE

For data that is not accessed frequently, this clause indicates that the blocks retrieved for this table are placed at
the least recently used end of the LRU list in the buffer cache when a full table scan is performed. NOCACHE is the
default for LOB storage.

CACHE READS

CACHE READS applies only to LOB storage. It specifies that LOB values are brought into the buffer cache only during
read operations but not during write operations.
--------------------------------------------------


Regards
Takayuki Tsunakawa




Re: Make ringbuffer threshold and ringbuffer sizes configurable?

From: Andres Freund
Hi,

On 2020-02-06 05:12:11 +0000, tsunakawa.takay@fujitsu.com wrote:
> I think a per-table reloption is necessary as well as, or instead of, a GUC, because the need for caching depends on the
> table (see below for Oracle's manual).

I'm inclined to not do that initially. It's going to be controversial
enough to add the GUCs.


> I'm afraid it would be confusing for a user-settable parameter to have
> different units (percent and size).  I think just the positive
> percentage would suffice.

IDK, I feel like there are good reasons to use either. But I'd gladly take
just percent if that's the general consensus, rather than not getting
the improvement at all.

Greetings,

Andres Freund



Re: Make ringbuffer threshold and ringbuffer sizes configurable?

From: Laurenz Albe
On Wed, 2020-02-05 at 20:00 -0800, Andres Freund wrote:
> The ringbuffers we use for seqscans, vacuum, copy etc. can cause very
> drastic slowdowns (see e.g. [1]), and can cause some workloads to
> practically never end up utilizing shared buffers. ETL workloads,
> for example, regularly fight with that problem.
> 
> I think it would make sense to have seqscan_ringbuffer_threshold,
> {bulkread,bulkwrite,vacuum}_ringbuffer_size. I think they often sensibly
> are set in proportion of shared_buffers, so I suggest defining them as
> floats, where negative values divide shared_buffers, whereas positive
> values are absolute sizes, and 0 disables the use of ringbuffers.

Sounds reasonable.

I feel that it should be as few GUCs as possible, so I think that
having one per type of operation might be too granular.

This should of course also be a storage parameter that can be
set per table.

Yours,
Laurenz Albe




Re: Make ringbuffer threshold and ringbuffer sizes configurable?

From: Andres Freund
Hi,

On 2020-02-06 07:18:00 +0100, Laurenz Albe wrote:
> On Wed, 2020-02-05 at 20:00 -0800, Andres Freund wrote:
> > The ringbuffers we use for seqscans, vacuum, copy etc. can cause very
> > drastic slowdowns (see e.g. [1]), and can cause some workloads to
> > practically never end up utilizing shared buffers. ETL workloads,
> > for example, regularly fight with that problem.
> > 
> > I think it would make sense to have seqscan_ringbuffer_threshold and
> > {bulkread,bulkwrite,vacuum}_ringbuffer_size. They are often sensibly
> > set in proportion to shared_buffers, so I suggest defining them as
> > floats, where negative values divide shared_buffers, positive
> > values are absolute sizes, and 0 disables the use of ringbuffers.
> 
> Sounds reasonable.

> I feel that it should be as few GUCs as possible, so I think that
> having one per type of operation might be too granular.

They are already set to different sizes, so I don't really see how
collapsing them into one knob would work. The relevant bits of code are:

BufferAccessStrategy
GetAccessStrategy(BufferAccessStrategyType btype)
{
    BufferAccessStrategy strategy;
    int            ring_size;

    /*
     * Select ring size to use.  See buffer/README for rationales.
     *
     * Note: if you change the ring size for BAS_BULKREAD, see also
     * SYNC_SCAN_REPORT_INTERVAL in access/heap/syncscan.c.
     */
    switch (btype)
    {
        case BAS_NORMAL:
            /* if someone asks for NORMAL, just give 'em a "default" object */
            return NULL;

        case BAS_BULKREAD:
            ring_size = 256 * 1024 / BLCKSZ;
            break;
        case BAS_BULKWRITE:
            ring_size = 16 * 1024 * 1024 / BLCKSZ;
            break;
        case BAS_VACUUM:
            ring_size = 256 * 1024 / BLCKSZ;
            break;

and


    /*
     * If the table is large relative to NBuffers, use a bulk-read access
     * strategy and enable synchronized scanning (see syncscan.c).  Although
     * the thresholds for these features could be different, we make them the
     * same so that there are only two behaviors to tune rather than four.
     * (However, some callers need to be able to disable one or both of these
     * behaviors, independently of the size of the table; also there is a GUC
     * variable that can disable synchronized scanning.)
     *
     * Note that table_block_parallelscan_initialize has a very similar test;
     * if you change this, consider changing that one, too.
     */
    if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
        scan->rs_nblocks > NBuffers / 4)
    {
        allow_strat = (scan->rs_base.rs_flags & SO_ALLOW_STRAT) != 0;
        allow_sync = (scan->rs_base.rs_flags & SO_ALLOW_SYNC) != 0;
    }
    else
        allow_strat = allow_sync = false;


> This should of course also be a storage parameter that can be
> set per table.

I honestly don't quite get that. What precisely is this addressing? I
mean, fine, I can add that, but it's sufficiently more complicated than
the GUCs, and I don't really foresee it being particularly useful to
tune on a per-table basis. A lot of the reason behind having the
ringbuffers is about managing whole-system impact, rather than
making individual tables fast or slow.

- Andres



Re: Make ringbuffer threshold and ringbuffer sizes configurable?

From: Robert Haas
On Wed, Feb 5, 2020 at 11:00 PM Andres Freund <andres@anarazel.de> wrote:
> I.e. to maintain the current defaults, seqscan_ringbuffer_threshold
> would be -4.0, but could be also be set to an absolute 4GB (converted to
> pages). Probably would want a GUC show function that displays
> proportional values in a nice way.

I think this is kind of awkward given that our GUC system attributes
units to everything. It'd sort of be nicer to have two separate GUCs,
one measured as a multiple and the other measured in bytes, but maybe
that's just exchanging one form of confusion for another.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Make ringbuffer threshold and ringbuffer sizes configurable?

From: Andres Freund
Hi,

On 2020-02-06 13:15:04 -0500, Robert Haas wrote:
> On Wed, Feb 5, 2020 at 11:00 PM Andres Freund <andres@anarazel.de> wrote:
> > I.e. to maintain the current defaults, seqscan_ringbuffer_threshold
> > would be -4.0, but could be also be set to an absolute 4GB (converted to
> > pages). Probably would want a GUC show function that displays
> > proportional values in a nice way.
> 
> I think this is kind of awkward given that our GUC system attributes
> units to everything.

I admit it's awkward. I think we possibly could still just make the size
displayed in bytes in either case, reducing that issue a *bit*?


> It'd sort of be nicer to have two separate GUCs,
> one measured as a multiple and the other measured in bytes, but maybe
> that's just exchanging one form of confusion for another.

We don't really have a good way to deal with GUCs where setting one
precludes the other, especially when those GUCs should be changeable at
runtime :(.

Greetings,

Andres Freund



Re: Make ringbuffer threshold and ringbuffer sizes configurable?

From: Robert Haas
On Thu, Feb 6, 2020 at 1:52 PM Andres Freund <andres@anarazel.de> wrote:
> I admit it's awkward. I think we possibly could still just make the size
> displayed in bytes in either case, reducing that issue a *bit*?

That seems like it makes it even more confusing, honestly.

> > It'd sort of be nicer to have two separate GUCs,
> > one measured as a multiple and the other measured in bytes, but maybe
> > that's just exchanging one form of confusion for another.
>
> We don't really have a good way to deal with GUCs where setting one
> precludes the other, especially when those GUCs should be changeable at
> runtime :(.

It can work if one of the GUCs is king, and the other one takes effect
only when the first one is set to some value that means "ignore me". We
have a number of examples of that, e.g. autovacuum_work_mem and
autovacuum_vacuum_cost_limit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Make ringbuffer threshold and ringbuffer sizes configurable?

From: Justin Pryzby
On Wed, Feb 05, 2020 at 08:00:26PM -0800, Andres Freund wrote:
> I think it would make sense to have seqscan_ringbuffer_threshold,
> {bulkread,bulkwrite,vacuum}_ringbuffer_size.

I suggest also providing some way of forcing a ringbuffer for non-bulk writes
for the current session.

In our use-case, we have loader processes INSERTing data using prepared
statements, UPSERT, and/or multiple VALUES(),() lists.  Some of that data will
be accessed in the near future (15min-24hr) but some parts (large parts, even)
may never be accessed.  I imagine most of the buffer pages never get
usagecount > 0 before being evicted.

I think it'd still be desirable to make the backend write() its own dirty
buffers to the OS, rather than leaving behind large numbers of dirty buffers
for another backend to deal with, since that *could* be a customer-facing
report.  I'd prefer the report run 10% faster due to rarely hitting dirty
buffers (by avoiding the need to write out lots of someone else's data) than
the loaders run 25% slower due to constantly writing to the OS.

The speed of loaders is not something our customers would be concerned with.
It's okay if they are slower than they might be.  They need to keep up with
incoming data, but it'd rarely matter if we load a 15min interval of data in
5min instead of in 4.

We would use COPY if we could, to get a ring buffer during writes.  But we
cannot, due to UPSERT (and maybe other reasons).

I have considered the possibility of loading data into a separate instance with
small (or in any case separate) shared_buffers and then transferring its data to
a customer-facing report instance using pg_restore (COPY)... but the overhead to
maintain that would be significant for us (me).

-- 
Justin