Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Streaming I/O, vectored I/O (WIP)
Msg-id: CA+hUKGLDqKhxOODYD2C0Q5YGM697hWMksMoSN-SsUMRGgcH2+A@mail.gmail.com
In response to: Re: Streaming I/O, vectored I/O (WIP) (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Streaming I/O, vectored I/O (WIP) (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
1. I tried out Tomas's suggestion ALTER TABLESPACE ts SET (io_combine_limit = ...). I like it: it's simple and works nicely. Unfortunately we don't have support for units like '128kB' in reloptions.c, so for now it requires a number of blocks. That's not great, and we should probably fix it before merging, so I'm leaving that patch (v14-0004) separate, hopefully for later.

2. I also tried Tomas's suggestion of inventing a way to tell PostgreSQL what the OS readahead window size is. That allows for better modelling of whether the kernel is going to consider an access pattern to be sequential. Again, the ALTER TABLESPACE version would probably need unit support. This is initially just for experimentation, as it came up in discussions of BHS behaviour. I was against this idea originally, as it seemed like more complexity to have to explain and tune, and I'm not sure if it's worth the trouble... but in fact we are already in the business of second-guessing kernel logic, so there doesn't seem to be any good reason not to do it a bit better if we can. Do you think I have correctly understood what Linux is doing? The name I came up with is: effective_io_readahead_window. I wanted to make clear that it's a property of the system we are telling it about, not to be confused with our own look-ahead concept or whatever. Better names very welcome. This is also in a separate patch (v14-0005), left for later.

3. Another question I wondered about while retesting: does this need to be so low? I don't think so, so I've added a patch for that.

src/include/port/pg_iovec.h:#define PG_IOV_MAX Min(IOV_MAX, 32)

Now that I'm not using an array full of arrays of that size, I don't care so much how big we make that 32 (= 256kB @ 8kB), which clamps io_combine_limit. I think 128 (= 1MB @ 8kB) might be a decent arbitrary number. Sometimes we use it to size stack arrays, so I don't want to make it insanely large, but 128 should be fine.
I think it would be good to be able to at least experiment with up to 1MB (I'm not saying it's a good idea to do it, who knows?, just that there isn't a technical reason not to allow it AFAIK). FWIW every system on our target list that has p{read,write}v has IOV_MAX == 1024 (I checked {Free,Net,Open}BSD, macOS, illumos and Linux), so the Min(IOV_MAX, ...) really only clamps the systems where pg_{read,write}v fall back to loop-based emulation (Windows, Solaris), which is fine. PG_IOV_MAX also affects the routine that initialises new WAL files. I don't currently see a downside to doing that in 1MB chunks, as there was nothing sacred about the previous arbitrary number, and the code deals with short writes by retrying as it should.

4. I agree with Heikki's complaints about the BMR interface. It should be made more consistent and faster. I didn't want to make all of those changes touching AMs etc a dependency, though, so I spent some time trying to squeeze out regressions using some of these clues about calling conventions, likely hints, memory access and batching. I'm totally open to later improvements and refactoring of that stuff! Attached is the version with the best results I've managed to get.

My test is GCC -O3, pg_prewarm of a table of 220_000_000 integers = 7.6GB, which sometimes comes out around the same ~250ms on master and streaming pg_prewarm v14 on a random cloud ARM box I'm testing with, but not always; sometimes it's ~5-7ms more. (Unfortunately I don't have access to good benchmarking equipment right now; better numbers welcome.)

Two new ideas:

* give fast path mode a single flag, instead of testing all the conditions for every block
* give fast path mode a little buffer of future block numbers, so it can call the callback in batches

I'd tried that batch-calling thing before, and results were inconclusive, but I think sometimes it helps a bit. Note that it replaces the 'unget' thing from before, and it is possibly a tiny bit nicer anyway.
I'm a bit stumped about how to improve this further -- if anyone has any ideas for further improvements, I'm all ears. Zooming back out of micro-benchmark mode, it must be pretty hard to see in a real workload that actually does something with the buffers, like a sequential scan. Earlier complaints about all-cached sequential scan regressions were resolved many versions ago, AFAIK, by minimising the pin count in that case.

I just tried Melanie's streaming sequential scan patch, with a simple SELECT COUNT(*) WHERE i = -1, with the same all-cached table of 220 million integers. Patched consistently comes out ahead for all-in-kernel-cache, none-in-PG-cache: ~14.7s -> ~14.4s, and all-in-PG-cache: ~13.5s -> ~13.3s (which I don't have an explanation for). I don't claim any of that is particularly scientific; I just wanted to note that single-digit numbers of milliseconds of regression while pinning a million pages is clearly lost in the noise of other effects once you add in real query execution. That's half a dozen nanoseconds per page, if I counted right.

So, I am finally starting to think we should commit this, and decide which user patches are candidates.
Attachment
- v14-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v14-0002-Provide-API-for-streaming-relation-data.patch
- v14-0003-Use-streaming-I-O-in-pg_prewarm.patch
- v14-0004-ALTER-TABLESPACE-.-SET-io_combine_limit.patch
- v14-0005-Add-effective_io_readahead_window-setting.patch
- v14-0006-Increase-PG_IOV_MAX-for-bigger-io_combine_limit.patch