Re: Asynchronous and "direct" IO support for PostgreSQL. - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Asynchronous and "direct" IO support for PostgreSQL.
Msg-id: 20210223210545.67wihyu47z3nyii5@alap3.anarazel.de
In response to: Re: Asynchronous and "direct" IO support for PostgreSQL. (Greg Stark <stark@mit.edu>)
List: pgsql-hackers
Hi,

On 2021-02-23 14:58:32 -0500, Greg Stark wrote:
> So firstly this is all just awesome work and I have questions but I
> don't want them to come across in any way as criticism or as a demand
> for more work.

I posted it to get argued with ;).

> The callbacks make me curious about two questions:
>
> 1) Is there a chance that a backend issues i/o, the i/o completes in
> some other backend and by the time this backend gets around to looking
> at the buffer it's already been overwritten again? Do we have to
> initiate I/O again or have you found a way to arrange that this
> backend has the buffer pinned from the time the i/o starts even though
> it doesn't handle the completion?

The initiator of the IO can just keep a pin for the buffer to prevent
that.

There's a lot of complexity around how to handle pinning and locking
around asynchronous buffer IO. I plan to send a separate email with
more details.

In short: I made it so that shared buffer IO holds a separate refcount
for the duration of the IO - that way the issuer can release its own
pin without causing a problem (consider e.g. an error while an IO is in
flight). The pin held by the IO gets released in the completion
callback. There's similar trickery with locks - but more about that
later.

> 2) Have you made (or considered making) things like sequential scans
> (or more likely bitmap index scans) asynchronous at a higher level.
> That is, issue a bunch of asynchronous i/o and then handle the pages
> and return the tuples as the pages arrive. Since sequential scans and
> bitmap scans don't guarantee to read the pages in order they're
> generally free to return tuples from any page in any order.
> I'm not sure how much of a win that would actually be since all the
> same i/o would be getting executed and the savings in shared buffers
> would be small but if there are mostly hot pages you could imagine
> interleaving a lot of in-memory pages with the few i/os instead of
> sitting idle waiting for the async i/o to return.

I have not. Mostly because it'd basically break the entire regression
test suite. And probably a lot of user expectations (yes,
synchronize_seqscans exists, but it pretty rarely triggers).

I'm not sure how big the win would be - the readahead for heapam.c that
is in the patch set tries to keep ahead of the "current scan position"
by a certain amount - so it'll issue the reads for the "missing" pages
before they're needed. Hopefully early enough to avoid unnecessary
stalls. But I'm pretty sure there'll be cases where that'd require a
prohibitively long "readahead distance".

I think there's a lot of interesting things along these lines that we
could tackle, but since they involve changing results and/or larger
changes to avoid those (e.g. detecting when sequential scan order isn't
visible to the user), I think it'd make sense to separate them from the
aio patchset itself.

If we had the infrastructure to detect whether seqscan order matters,
we could also switch the tuple-iteration order to "backwards" in
heapgetpage() - right now iterating forward in "itemid order" causes a
lot of cache misses, because the hardware prefetcher doesn't predict
that the tuples are laid out in "decreasing pointer order".

> > ## Stats
> >
> > There are two new views: pg_stat_aios showing AIOs that are currently
> > in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.
>
> This is impressive. How easy is it to correlate with system aio stats?

Could you say a bit more about what you are trying to correlate?

Here's some example IOs from pg_stat_aios.
┌─[ RECORD 1 ]─┬───────────────────────────────────────────────────────────────────────────────────────────────────┐
│ backend_type │ client backend                                                                                    │
│ id           │ 98                                                                                                │
│ gen          │ 13736                                                                                             │
│ op           │ write                                                                                             │
│ scb          │ sb                                                                                                │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                            │
│ ring         │ 5                                                                                                 │
│ owner_pid    │ 1866051                                                                                           │
│ merge_with   │ (null)                                                                                            │
│ result       │ 0                                                                                                 │
│ desc         │ fd: 38, offset: 329588736, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 238576         │
├─[ RECORD 24 ]┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ checkpointer                                                                                      │
│ id           │ 1501                                                                                              │
│ gen          │ 15029                                                                                             │
│ op           │ write                                                                                             │
│ scb          │ sb                                                                                                │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                            │
│ ring         │ 3                                                                                                 │
│ owner_pid    │ 1865275                                                                                           │
│ merge_with   │ 1288                                                                                              │
│ result       │ 0                                                                                                 │
│ desc         │ fd: 24, offset: 105136128, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 202705         │
├─[ RECORD 31 ]┼───────────────────────────────────────────────────────────────────────────────────────────────────┤
│ backend_type │ walwriter                                                                                         │
│ id           │ 90                                                                                                │
│ gen          │ 26498                                                                                             │
│ op           │ write                                                                                             │
│ scb          │ wal                                                                                               │
│ flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT                                                            │
│ ring         │ 5                                                                                                 │
│ owner_pid    │ 1865281                                                                                           │
│ merge_with   │ 181                                                                                               │
│ result       │ 0                                                                                                 │
│ desc         │ write_no: 17, fd: 12, offset: 6307840, nbytes: 1048576, already_done: 0, bufdata: 0x7f94dd670000  │
└──────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────┘

And the per-backend AIO view shows information like this (too wide to
display all cols):

SELECT pid, last_context, backend_type,
       executed_total_count, issued_total_count, submissions_total_count,
       inflight_count
FROM pg_stat_aio_backends sab
JOIN pg_stat_activity USING (pid);

┌─────────┬──────────────┬──────────────────────────────┬──────────────────────┬────────────────────┬─────────────────────────┬────────────────┐
│   pid   │ last_context │         backend_type         │ executed_total_count │ issued_total_count │ submissions_total_count │ inflight_count │
├─────────┼──────────────┼──────────────────────────────┼──────────────────────┼────────────────────┼─────────────────────────┼────────────────┤
│ 1865291 │            3 │ logical replication launcher │                    0 │                  0 │                       0 │              0 │
│ 1865296 │            0 │ client backend               │                   85 │                 85 │                      85 │              0 │
│ 1865341 │            7 │ client backend               │              9574416 │            2905321 │                  345642 │              0 │
...
│ 1873501 │            3 │ client backend               │                 3565 │               3565 │                     467 │             10 │
...
│ 1865277 │            0 │ background writer            │               695402 │             575906 │                   13513 │              0 │
│ 1865275 │            3 │ checkpointer                 │              4664110 │            3488530 │                 1399896 │              0 │
│ 1865281 │            3 │ walwriter                    │                77203 │               7759 │                    7747 │              3 │
└─────────┴──────────────┴──────────────────────────────┴──────────────────────┴────────────────────┴─────────────────────────┴────────────────┘

It's not super obvious at this point, but executed_total_count /
issued_total_count shows the rate at which IOs have been merged.
issued_total_count / submissions_total_count shows how many (already
merged) IOs were submitted together in one "submission batch" (for
io_method=io_uring, one io_uring_enter() call).

So pg_stat_aios should make it possible to enrich some system stats -
e.g. by being able to split out WAL writes from data file writes. And -
except that the current callbacks for that aren't great - it should
even allow splitting the IO by different relfilenodes etc.

I assume pg_stat_aio_backends can also be helpful, e.g. for seeing
which backends currently have the deep IO queues that cause latency
delays in other backends, and which backends do a lot of sequential IO
(high executed_total_count / issued_total_count) and which do only
random IO...

Greetings,

Andres Freund