Thread: Make ringbuffer threshold and ringbuffer sizes configurable?
Hi,

The ringbuffers we use for seqscans, vacuum, copy etc. can cause very drastic slowdowns (see e.g. [1]), and can cause some workloads to practically never end up utilizing shared buffers. ETL workloads, for example, regularly fight with that problem.

While I think there's a number of improvements[2] we could make to the ringbuffer logic, I think we should also just allow them to be made configurable. I think that'll allow a decent number of systems to perform better (especially on slightly bigger systems the current ringbuffers are *way* too small), make the thresholds more discoverable (e.g. the NBuffers / 4 threshold is very confusing), and will make it easier to experiment with better default values.

I think it would make sense to have seqscan_ringbuffer_threshold and {bulkread,bulkwrite,vacuum}_ringbuffer_size. I think they often are sensibly set in proportion to shared_buffers, so I suggest defining them as floats, where negative values divide shared_buffers, positive values are absolute sizes, and 0 disables the use of ringbuffers.

I.e. to maintain the current defaults, seqscan_ringbuffer_threshold would be -4.0, but it could also be set to an absolute 4GB (converted to pages). We'd probably want a GUC show function that displays proportional values in a nice way.

We probably should also just increase all the ringbuffer sizes by an order of magnitude or two, especially the one for VACUUM.

Greetings,

Andres Freund

[1] https://postgr.es/m/20190507201619.lnyg2nyhmpxcgeau%40alap3.anarazel.de

[2] The two most important things imo:

a) Don't evict buffers when falling off the ringbuffer as long as there are unused buffers on the freelist. Possibly just set their usagecount to zero as long as that is the case.

b) The biggest performance pain comes from ringbuffers where it's likely that buffers are dirty (vacuum, copy), because evicting a dirty buffer requires that the corresponding WAL be flushed. That often ends up turning many individual buffer evictions into an fdatasync, slowing things down to a crawl. And the contention caused by that is a significant concurrency issue too. By doing writes, but not flushes, shortly after the insertion, we can reduce the cost significantly.
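[Editor's note: to make the proposed GUC semantics concrete, here is a minimal standalone sketch, not part of any patch. The helper name ringbuffer_threshold_pages() is hypothetical; seqscan_ringbuffer_threshold is the GUC proposed above.]

#include <math.h>

/* Hypothetical GUC: < 0 divides shared_buffers, > 0 is bytes, 0 disables. */
double		seqscan_ringbuffer_threshold = -4.0;

/*
 * Resolve the float GUC into a number of buffer pages.
 * nbuffers is the size of shared_buffers in pages; blcksz is the page size.
 * Returns 0 to signal "don't use a ringbuffer".
 */
static int
ringbuffer_threshold_pages(double guc_val, int nbuffers, int blcksz)
{
	if (guc_val == 0.0)
		return 0;				/* ringbuffer use disabled */
	if (guc_val < 0.0)
		return (int) (nbuffers / -guc_val); /* -4.0 => NBuffers / 4 */
	return (int) ceil(guc_val / blcksz);	/* bytes => pages */
}

With the default of -4.0 this reproduces the current NBuffers / 4 threshold; an absolute setting of 4GB would come out to 524288 pages at the default 8kB block size.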
RE: Make ringbuffer threshold and ringbuffer sizes configurable?
From: "tsunakawa.takay@fujitsu.com"
Date: 2020-02-06 05:12:11 +0000
From: Andres Freund <andres@anarazel.de>
> While I think there's a number of improvements[2] we could make to the
> ringbuffer logic, I think we should also just allow them to be made
> configurable. I think that'll allow a decent number of systems to perform
> better (especially on slightly bigger systems the current
> ringbuffers are *way* too small), make the thresholds more discoverable
> (e.g. the NBuffers / 4 threshold is very confusing), and will make it
> easier to experiment with better default values.

+1

The NBuffers / 4 logic sometimes caused unexpected behavior. IIRC, even when some batch or analytic processing needed to read large tables sequentially multiple times, the second and subsequent reads didn't get the benefit of caching. Another example is that before pg_prewarm became available, I couldn't cache the entire table by running "SELECT * FROM table" before benchmarking performance.

> I think it would make sense to have seqscan_ringbuffer_threshold and
> {bulkread,bulkwrite,vacuum}_ringbuffer_size. I think they often are
> sensibly set in proportion to shared_buffers, so I suggest defining them
> as floats, where negative values divide shared_buffers, positive
> values are absolute sizes, and 0 disables the use of ringbuffers.
>
> I.e. to maintain the current defaults, seqscan_ringbuffer_threshold
> would be -4.0, but it could also be set to an absolute 4GB (converted to
> pages). We'd probably want a GUC show function that displays
> proportional values in a nice way.

I think a per-table reloption is necessary as well as, or instead of, a GUC, because the need for caching depends on the table (see below for Oracle's manual).

I'm afraid it would be confusing for a user-settable parameter to have different units (percent and size). I think just the positive percentage would suffice.

CREATE TABLE
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-TABLE.html#GUID-F9CE0CC3-13AE-4744-A43C-EAC7A71AAAB6
--------------------------------------------------
CACHE | NOCACHE | CACHE READS

Use these clauses to indicate how Oracle Database should store blocks in the buffer cache. For LOB storage, you can specify CACHE, NOCACHE, or CACHE READS. For other types of storage, you can specify only CACHE or NOCACHE.

The behavior of CACHE and NOCACHE described in this section does not apply when Oracle Database chooses to use direct reads or to perform table scans using parallel query.

CACHE
For data that is accessed frequently, this clause indicates that the blocks retrieved for this table are placed at the most recently used end of the least recently used (LRU) list in the buffer cache when a full table scan is performed. This attribute is useful for small lookup tables.

NOCACHE
For data that is not accessed frequently, this clause indicates that the blocks retrieved for this table are placed at the least recently used end of the LRU list in the buffer cache when a full table scan is performed. NOCACHE is the default for LOB storage.

CACHE READS
CACHE READS applies only to LOB storage. It specifies that LOB values are brought into the buffer cache only during read operations but not during write operations.
--------------------------------------------------

Regards
Takayuki Tsunakawa
Hi,

On 2020-02-06 05:12:11 +0000, tsunakawa.takay@fujitsu.com wrote:
> I think a per-table reloption is necessary as well as, or instead of, a
> GUC, because the need for caching depends on the table (see below for
> Oracle's manual).

I'm inclined not to do that initially. It's going to be controversial enough to add the GUCs.

> I'm afraid it would be confusing for a user-settable parameter to have
> different units (percent and size). I think just the positive
> percentage would suffice.

IDK, I feel like there's good reasons to use either. But I'd gladly take just percent if that's the general consensus, rather than not getting the improvement at all.

Greetings,

Andres Freund
On Wed, 2020-02-05 at 20:00 -0800, Andres Freund wrote:
> The ringbuffers we use for seqscans, vacuum, copy etc. can cause very
> drastic slowdowns (see e.g. [1]), and can cause some workloads to
> practically never end up utilizing shared buffers. ETL workloads,
> for example, regularly fight with that problem.
>
> I think it would make sense to have seqscan_ringbuffer_threshold and
> {bulkread,bulkwrite,vacuum}_ringbuffer_size. I think they often are
> sensibly set in proportion to shared_buffers, so I suggest defining them
> as floats, where negative values divide shared_buffers, positive
> values are absolute sizes, and 0 disables the use of ringbuffers.

Sounds reasonable.

I feel that there should be as few GUCs as possible, so I think that having one per type of operation might be too granular.

This should of course also be a storage parameter that can be set per table.

Yours,
Laurenz Albe
Hi,

On 2020-02-06 07:18:00 +0100, Laurenz Albe wrote:
> On Wed, 2020-02-05 at 20:00 -0800, Andres Freund wrote:
> > The ringbuffers we use for seqscans, vacuum, copy etc. can cause very
> > drastic slowdowns (see e.g. [1]), and can cause some workloads to
> > practically never end up utilizing shared buffers. ETL workloads,
> > for example, regularly fight with that problem.
> >
> > I think it would make sense to have seqscan_ringbuffer_threshold and
> > {bulkread,bulkwrite,vacuum}_ringbuffer_size. I think they often are
> > sensibly set in proportion to shared_buffers, so I suggest defining
> > them as floats, where negative values divide shared_buffers, positive
> > values are absolute sizes, and 0 disables the use of ringbuffers.
>
> Sounds reasonable.
> I feel that there should be as few GUCs as possible, so I think that
> having one per type of operation might be too granular.

They already are set to different sizes, so I don't really see how that's doable. The relevant bits of code are:

BufferAccessStrategy
GetAccessStrategy(BufferAccessStrategyType btype)
{
	BufferAccessStrategy strategy;
	int			ring_size;

	/*
	 * Select ring size to use.  See buffer/README for rationales.
	 *
	 * Note: if you change the ring size for BAS_BULKREAD, see also
	 * SYNC_SCAN_REPORT_INTERVAL in access/heap/syncscan.c.
	 */
	switch (btype)
	{
		case BAS_NORMAL:
			/* if someone asks for NORMAL, just give 'em a "default" object */
			return NULL;

		case BAS_BULKREAD:
			ring_size = 256 * 1024 / BLCKSZ;
			break;
		case BAS_BULKWRITE:
			ring_size = 16 * 1024 * 1024 / BLCKSZ;
			break;
		case BAS_VACUUM:
			ring_size = 256 * 1024 / BLCKSZ;
			break;

and

	/*
	 * If the table is large relative to NBuffers, use a bulk-read access
	 * strategy and enable synchronized scanning (see syncscan.c).  Although
	 * the thresholds for these features could be different, we make them the
	 * same so that there are only two behaviors to tune rather than four.
	 * (However, some callers need to be able to disable one or both of these
	 * behaviors, independently of the size of the table; also there is a GUC
	 * variable that can disable synchronized scanning.)
	 *
	 * Note that table_block_parallelscan_initialize has a very similar test;
	 * if you change this, consider changing that one, too.
	 */
	if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
		scan->rs_nblocks > NBuffers / 4)
	{
		allow_strat = (scan->rs_base.rs_flags & SO_ALLOW_STRAT) != 0;
		allow_sync = (scan->rs_base.rs_flags & SO_ALLOW_SYNC) != 0;
	}
	else
		allow_strat = allow_sync = false;

> This should of course also be a storage parameter that can be
> set per table.

I honestly don't quite get that. What precisely is this addressing? I mean fine, I can add that, but it's sufficiently more complicated than the GUCs, and I don't really foresee it being particularly useful to tune on a per-table basis. A lot of the reason behind having the ringbuffers is about managing whole-system impact, rather than making individual tables fast/slow.

- Andres
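[Editor's note: purely as an illustration of the proposal, the hardcoded switch above could be driven by the proposed GUCs. This fragment is a sketch, not a patch; ring_size_from_guc() is a hypothetical helper applying the negative-is-proportional / positive-is-bytes / zero-disables rule shown earlier.]

	switch (btype)
	{
		case BAS_NORMAL:
			return NULL;

		case BAS_BULKREAD:
			ring_size = ring_size_from_guc(bulkread_ringbuffer_size);
			break;
		case BAS_BULKWRITE:
			ring_size = ring_size_from_guc(bulkwrite_ringbuffer_size);
			break;
		case BAS_VACUUM:
			ring_size = ring_size_from_guc(vacuum_ringbuffer_size);
			break;
	}

	/* a resolved size of 0 would mean "don't use a ringbuffer at all" */
	if (ring_size == 0)
		return NULL;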
On Wed, Feb 5, 2020 at 11:00 PM Andres Freund <andres@anarazel.de> wrote:
> I.e. to maintain the current defaults, seqscan_ringbuffer_threshold
> would be -4.0, but it could also be set to an absolute 4GB (converted to
> pages). We'd probably want a GUC show function that displays
> proportional values in a nice way.

I think this is kind of awkward given that our GUC system attributes units to everything. It'd sort of be nicer to have two separate GUCs, one measured as a multiple and the other measured in bytes, but maybe that's just exchanging one form of confusion for another.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-02-06 13:15:04 -0500, Robert Haas wrote:
> On Wed, Feb 5, 2020 at 11:00 PM Andres Freund <andres@anarazel.de> wrote:
> > I.e. to maintain the current defaults, seqscan_ringbuffer_threshold
> > would be -4.0, but it could also be set to an absolute 4GB (converted
> > to pages). We'd probably want a GUC show function that displays
> > proportional values in a nice way.
>
> I think this is kind of awkward given that our GUC system attributes
> units to everything.

I admit it's awkward. I think we could possibly still just display the size in bytes in either case, reducing that issue a *bit*?

> It'd sort of be nicer to have two separate GUCs,
> one measured as a multiple and the other measured in bytes, but maybe
> that's just exchanging one form of confusion for another.

We don't really have a good way to deal with GUCs where setting one precludes the other, especially when those GUCs should be changeable at runtime :(.

Greetings,

Andres Freund
On Thu, Feb 6, 2020 at 1:52 PM Andres Freund <andres@anarazel.de> wrote:
> I admit it's awkward. I think we could possibly still just display the
> size in bytes in either case, reducing that issue a *bit*?

That seems like it makes it even more confusing, honestly.

> > It'd sort of be nicer to have two separate GUCs,
> > one measured as a multiple and the other measured in bytes, but maybe
> > that's just exchanging one form of confusion for another.
>
> We don't really have a good way to deal with GUCs where setting one
> precludes the other, especially when those GUCs should be changeable at
> runtime :(.

It can work if one of the GUCs is king, and the other one takes effect only if the first one is set to some value that means "ignore me". We have a number of examples of that, e.g. autovacuum_work_mem and autovacuum_vacuum_cost_limit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
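[Editor's note: a minimal sketch of that "one GUC is king" pattern applied here. The two GUC variables and the helper are hypothetical; the existing precedent is autovacuum_work_mem, which falls back to maintenance_work_mem when set to -1.]

/*
 * Hypothetical: seqscan_ringbuffer_bytes wins when set (>= 0); the
 * sentinel -1 means "ignore me", falling back to the proportional GUC,
 * mirroring how autovacuum_work_mem = -1 falls back to
 * maintenance_work_mem.
 */
static int
effective_seqscan_threshold_pages(void)
{
	if (seqscan_ringbuffer_bytes >= 0)
		return seqscan_ringbuffer_bytes / BLCKSZ;	/* absolute GUC is king */

	/* -1: fall back to the divide-shared_buffers GUC */
	return (int) (NBuffers / seqscan_ringbuffer_divisor);
}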
On Wed, Feb 05, 2020 at 08:00:26PM -0800, Andres Freund wrote:
> I think it would make sense to have seqscan_ringbuffer_threshold and
> {bulkread,bulkwrite,vacuum}_ringbuffer_size.

I suggest the possibility of somehow forcing a ringbuffer for non-bulk writes for the current session. In our use-case, we have loader processes INSERTing data using prepared statements, UPSERT, and/or multiple VALUES(),() lists. Some of that data will be accessed in the near future (15min-24hr), but some parts (large parts, even) may never be accessed. I imagine most of the buffer pages never get usagecount > 0 before being evicted.

I think it'd still be desirable to make the backend write() its own dirty buffers to the OS, rather than leaving behind large numbers of dirty buffers for another backend to deal with, since that *could* be a customer-facing report. I'd prefer the report to run 10% faster due to rarely hitting dirty buffers (by avoiding the need to write out lots of someone else's data) than the loaders to run 25% slower due to constantly writing to the OS. The speed of the loaders is not something our customers would be concerned with. It's okay if they are slower than they might be. They need to keep up with incoming data, but it'd rarely matter if we load a 15min interval of data in 5min instead of in 4.

We would use COPY if we could, to get a ring buffer during writes, but we cannot due to UPSERT (and maybe other reasons). I have considered the possibility of loading data into a separate instance with small (or in any case separate) shared_buffers and then transferring its data to a customer-facing report instance using pg_restore (COPY)... but the overhead to maintain that would be significant for us (me).

--
Justin