Re: automatic analyze: readahead - add "IO read time" log message - Mailing list pgsql-hackers
From | Jakub Wartak |
---|---|
Subject | Re: automatic analyze: readahead - add "IO read time" log message |
Date | |
Msg-id | VI1PR0701MB69603772FDF522FE0B4BFD80F6160@VI1PR0701MB6960.eurprd07.prod.outlook.com Whole thread Raw |
In response to | automatic analyze: readahead - add "IO read time" log message (Jakub Wartak <Jakub.Wartak@tomtom.com>) |
Responses |
Re: automatic analyze: readahead - add "IO read time" log message
|
List | pgsql-hackers |
Hi Stephen, hackers, > The analyze is doing more-or-less random i/o since it's skipping through > the table picking out select blocks, not doing regular sequential i/o. VS >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984 >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984 >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984 >Not really sure what's interesting here, but it does look like we're >skipping through the table as expected. Yes, but not randomly in this case. I wanted to point out that this is incrementing block number, therefore I've includedthis debug output which might trigger readahead heuristics. Perhaps this depends on how the table was built / vacuumed ? (in this case, pure-INSERT-only; I would expect the same intime series DBs and DWHs). > With all those 'readahead' calls it certainly makes one wonder if the > Linux kernel is reading more than just the block we're looking for > because it thinks we're doing a sequential read and will therefore want > the next few blocks when, in reality, we're going to skip past them, > meaning that any readahead the kernel is doing is likely just wasted > I/O. I've done some quick&dirty tests with blockdev --setra/setfra 0 after spending time looking at the smgr/md/fd API changesrequired to find shortcut, but I'm getting actually a little bit worse timings at least on "laptop DB tests". Onething that I've noticed is that needs to be only for automatic-analyze, but not for automatic-vacuum where apparentlythere is some boost due to readahead. > That definitely seems like a useful thing to include and thanks for the > patch! Please be sure to register it in the commitfest app: > https://commitfest.postgresql.org Thank you! Thread is now registered. > I would think that, ideally, we'd teach analyze.c to work in the same > way that bitmap heap scans do- that is, use posix_fadvise to let the > kernel know what pages we're going to want next instead of the kernel > guessing (incorrectly) or not doing any pre-fetching. I didn't spend a > lot of time poking, but it doesn't look like analyze.c tries to do any > prefetching today. In a similar vein, I wonder if VACUUM should be > doing prefetching too today, at least when it's skipping through the > heap based on the visibility map and jumping over all-frozen pages. My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM issuedon block-range before, so maybe the the optimization is possible, but not until we'll have AIO ;) > Haven't looked too closely at this but in general +1 on the idea and > this approach looks pretty reasonable to me. Only thing I can think of > off-hand is to check how it compares to other places where we report IO > read time and make sure that it looks similar. Ok, I've changed the output in 002 version to include "avg read rate" just like in the autovacuum case but still maintainingsingle line output, e.g: automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s (read time: 2.52 s), system usage: CPU: user:0.28 s, system: 0.26 s, elapsed: 2.94 s -J.
Attachment
pgsql-hackers by date: