Re: automatic analyze: readahead - add "IO read time" log message - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: automatic analyze: readahead - add "IO read time" log message
Date
Msg-id VI1PR0701MB69603772FDF522FE0B4BFD80F6160@VI1PR0701MB6960.eurprd07.prod.outlook.com
Whole thread Raw
In response to automatic analyze: readahead - add "IO read time" log message  (Jakub Wartak <Jakub.Wartak@tomtom.com>)
Responses Re: automatic analyze: readahead - add "IO read time" log message
List pgsql-hackers
Hi Stephen, hackers,

> The analyze is doing more-or-less random i/o since it's skipping through
> the table picking out select blocks, not doing regular sequential i/o.
VS
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at
heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at
heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at
heapam_handler.c:984
>Not really sure what's interesting here, but it does look like we're
>skipping through the table as expected.

Yes, but not randomly in this case. I wanted to point out that this is incrementing block number, therefore I've
includedthis debug output which might trigger readahead heuristics. 
Perhaps this depends on how the table was built / vacuumed ? (in this case, pure-INSERT-only; I would expect the same
intime series DBs and DWHs).  

> With all those 'readahead' calls it certainly makes one wonder if the
> Linux kernel is reading more than just the block we're looking for
> because it thinks we're doing a sequential read and will therefore want
> the next few blocks when, in reality, we're going to skip past them,
> meaning that any readahead the kernel is doing is likely just wasted
> I/O.

I've done some quick&dirty tests with blockdev --setra/setfra 0 after spending time looking at the smgr/md/fd API
changesrequired to find shortcut, but I'm getting actually a little bit worse timings at least on "laptop DB tests".
Onething that I've noticed is that needs to be only for automatic-analyze, but not for automatic-vacuum where
apparentlythere is some boost due to readahead. 

> That definitely seems like a useful thing to include and thanks for the
> patch!  Please be sure to register it in the commitfest app:
> https://commitfest.postgresql.org

Thank you! Thread is now registered.

> I would think that, ideally, we'd teach analyze.c to work in the same
> way that bitmap heap scans do- that is, use posix_fadvise to let the
> kernel know what pages we're going to want next instead of the kernel
> guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
> lot of time poking, but it doesn't look like analyze.c tries to do any
> prefetching today.  In a similar vein, I wonder if VACUUM should be
> doing prefetching too today, at least when it's skipping through the
> heap based on the visibility map and jumping over all-frozen pages.

My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM
issuedon block-range before, so maybe the the optimization is possible, but not until we'll have AIO ;) 

> Haven't looked too closely at this but in general +1 on the idea and
> this approach looks pretty reasonable to me.  Only thing I can think of
> off-hand is to check how it compares to other places where we report IO
> read time and make sure that it looks similar.

Ok, I've changed the output in 002 version to include "avg read rate" just like in the autovacuum case but still
maintainingsingle line output, e.g:   
automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s (read time: 2.52 s), system usage: CPU:
user:0.28 s, system: 0.26 s, elapsed: 2.94 s 

-J.
Attachment

pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: parallel distinct union and aggregate support patch
Next
From: vignesh C
Date:
Subject: Re: Parallel copy