Re: automatic analyze: readahead - add "IO read time" log message - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: automatic analyze: readahead - add "IO read time" log message
Date
Msg-id 20201104143041.GB16415@tamriel.snowman.net
In response to Re: automatic analyze: readahead - add "IO read time" log message  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On Wed, Nov 04, 2020 at 09:07:59AM +0000, Jakub Wartak wrote:
> >I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's a
> >huge ~5.2x improvement, which is amazing for such a simple patch.

Nice!

> >The system and data were identical to last time, so the results are directly
> >comparable to the previous post. The only change is that I've applied your
> >patch on top of REL12_STABLE, as this is the cluster where I had the original
> >data. I think it deserves its own commitfest entry, right? Or do we combine
> >those two?
> >
> >Each test was conducted with
> >1. alter system set effective_io_concurrency=$N;
> >2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart;
> >3. analyze verbose t;
> >4. the measurements were observed by pidstat -d and iostat -dxm.
> >
> >no readahead (blockdev --setra 0), cold cache:
> >effective_io_concurrency=64; => ~26MB/s
> >effective_io_concurrency=32; => ~29MB/s
> >effective_io_concurrency=8; => ~31MB/s
> >effective_io_concurrency=4; => ~32MB/s
> >effective_io_concurrency=1; => ~32MB/s
> >effective_io_concurrency=0; => ~32MB/s

I'm speculating here, but it seems likely that by disabling read-ahead,
the posix_fadvise() calls are basically getting ignored and that's why
there's not much difference here...?

> >readahead=256 (128kB, Linux defaults), cold cache:
> >effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s , baseline)
> >effective_io_concurrency=1; => ~62MB/s
> >effective_io_concurrency=4; => ~370MB/s
> >effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100%
> >effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)

Very nice.

> Thanks for the testing, those are indeed quite promising numbers for
> such a simple patch. I wonder how this would behave on different
> storage - I'll see if I can do some testing, but I guess it should be
> the same as for other places that do prefetching.

That would certainly be cool.

> >I think one needs to be sure to restart, so that the autovacuum workers
> >pick up the new effective_io_concurrency; maybe that's a documentation
> >thing only.
> >
> >I have just one question, with this patch the strace I/O of analyze
> >look like below:
> >
> >pread(72, ..., 8192, 738181120) = 8192
> >fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
> >pread(72, ..., 8192, 738500608) = 8192
> >fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
> >pread(72, ..., 8192, 738607104) = 8192
> >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
> >pread(72, ..., 8192, 738754560) = 8192
> >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
> >
> >If you search for "738754560" in the output, you can see the syscalls for
> >that offset being duplicated until the pread() for it happens - that offset
> >was advised 3 times. Also, I would never have imagined in my wildest dreams
> >that posix_fadvise(POSIX_FADV_WILLNEED) would be such a cheap syscall.
>
> IMHO that's a bug in the patch, which always tries to prefetch all
> "future" blocks, including those that were already prefetched. It
> probably needs to do something like bitmap heap scan does, where we track
> what was already prefetched and only issue advice for the new blocks.

Yeah, this was just something quickly thrown together to see if it'd
help.  I'll clean it up by, as you say, doing something similar to what
we do with bitmap heap scan, by having a separate BlockSampler that's
set up early on and then have the pre-fetching keep pace with the actual
scan instead of posix_fadvise()'ing the same pages over and over.

I'll also look at doing something for VACUUM too, as that could also
benefit from posix_fadvise() calls, when we're skipping over all-frozen
pages.

Thanks!

Stephen
