Thread: automatic analyze: readahead - add "IO read time" log message

automatic analyze: readahead - add "IO read time" log message

From: Jakub Wartak
Greetings hackers,

I have, I hope, an interesting observation (and a nano patch proposal) from a system where statistics freshness is a
critical factor. Autovacuum/automatic statistics gathering was tuned to be very aggressive (applied roughly as sketched
below the list):
autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
autovacuum_naptime=1s
autovacuum_max_workers=4
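The table name in this sketch is only an example, and the per-partition thresholds correspond to the reloptions
described in the next paragraph:

    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0;
    ALTER SYSTEM SET autovacuum_naptime = '1s';
    ALTER SYSTEM SET autovacuum_max_workers = 4;
    SELECT pg_reload_conf();
    ALTER TABLE t1 SET (autovacuum_analyze_scale_factor = 0.001,
                        autovacuum_analyze_threshold = 50000);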

Some critical table partitions are configured with autovacuum_analyze_scale_factor=0.001,
autovacuum_analyze_threshold=50000 to force auto-analyze jobs pretty often. The interesting logs are like this:
automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 s, elapsed: 252.14 s
automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 s, elapsed: 317.33 s
automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 s, elapsed: 213.58 s
automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 s, elapsed: 229.45 s

and this is the root cause of my question.  As you can see there is a huge 3x-4x discrepancy between "elapsed" and
user+system, which is strange, at least for me, as there should be no waiting (autovacuum_vacuum_cost_delay=0). According
to various tools this is true: time was spent somewhere else, but not in the PostgreSQL analyze itself. ps(1) and
pidstat(1) also report the same for the worker:

06:56:12 AM       PID    %usr %system  %guest    %CPU   CPU  Command
06:56:13 AM    114774    8.00   10.00    0.00   18.00    18  postgres
06:56:14 AM    114774    8.00   11.00    0.00   19.00    15  postgres
06:56:15 AM    114774    5.00   13.00    0.00   18.00    18  postgres

06:56:17 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
06:56:18 AM    114774  63746.53      0.00      0.00  postgres
06:56:19 AM    114774  62896.00      0.00      0.00  postgres
06:56:20 AM    114774  62920.00      0.00      0.00  postgres

One could argue that such an autoanalyze worker could perform 5x more work (%CPU -> 100%) here. The I/O system is not
doing a lot either (total = 242MB/s reads@22k IOPS, 7MB/s writes@7k IOPS, with service time 0.04ms); although it reports
high utilization, I'm pretty sure it could push much more. There can be up to 3x-4x such 70-80MB/s analyzes running in
parallel (let's say 300MB/s for statistics collection alone).

According to various gdb backtraces, a lot of time is spent here:
#0  0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192, __buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84
#2  FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005", amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
#3  0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
#4  0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902, mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f) at bufmgr.c:897
#5  0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimized out>, strategy=<optimized out>) at bufmgr.c:665
#6  0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>, blockno=<optimized out>, bstrategy=<optimized out>) at heapam_handler.c:998
#7  0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>, blockno=<optimized out>, scan=0x10c8098) at ../../../src/include/access/tableam.h:1462
#8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, targrows=1500000, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at analyze.c:1048
#9  0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>, relpages=26763525, inh=false, in_outer_xact=false, elevel=13) at analyze.c:502
[..]
#12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, tab=<optimized out>) at autovacuum.c:3118
[..]

The interesting thing is that targrows=1.5 million and that blocks are accessed (as expected) in sorted order:

Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, bstrategy=0x1102278) at heapam_handler.c:984

And this probably explains the discrepancy: perf with CPU usage reporting shows a lot of frames waiting on I/O in
readaheads done by ext4, trimmed for clarity:

# Children      Self       sys       usr  Command   Shared Object       Symbol
    63.64%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
            ---entry_SYSCALL_64_after_hwframe
               do_syscall_64
               |--59.66%--sys_pread64
               |          vfs_read
               |           --59.09%--__vfs_read
               |                      --58.24%--generic_file_read_iter
               |                                |--47.44%--ondemand_readahead
               |                                |           --46.88%--__do_page_cache_readahead
               |                                |                     |--32.67%--ext4_mpage_readpages
               |                                |                     |          |--16.76%--submit_bio
               |                                |                     |--10.23%--blk_finish_plug
[..]
    63.64%     1.99%     1.99%     0.00%  postgres  [kernel.kallsyms]   [k] do_syscall_64
            |--61.65%--do_syscall_64
            |          |--59.66%--sys_pread64
            |          |          vfs_read
            |          |           --59.09%--__vfs_read
            |          |                      --58.24%--generic_file_read_iter
            |          |                                |--47.44%--ondemand_readahead
            |          |                                |           --46.88%--__do_page_cache_readahead

    61.36%     0.00%     0.00%     0.00%  postgres  postgres            [.] FileRead
            ---FileRead
               __pread_nocancel
                --60.51%--entry_SYSCALL_64_after_hwframe
                          do_syscall_64
                           --59.66%--sys_pread64
                                     vfs_read
                                      --59.09%--__vfs_read
                                                 --58.24%--generic_file_read_iter
                                                           |--47.44%--ondemand_readahead
                                                           |           --46.88%--__do_page_cache_readahead

    61.36%     0.85%     0.00%     0.85%  postgres  libpthread-2.17.so  [.] __pread_nocancel
            |--60.51%--__pread_nocancel
            |          entry_SYSCALL_64_after_hwframe
            |          do_syscall_64
            |           --59.66%--sys_pread64
            |                     vfs_read
            |                      --59.09%--__vfs_read
            |                                 --58.24%--generic_file_read_iter
            |                                           |--47.44%--ondemand_readahead
            |                                           |           --46.88%--__do_page_cache_readahead


    59.66%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] sys_pread64
            ---sys_pread64
               vfs_read
                --59.09%--__vfs_read
                           --58.24%--generic_file_read_iter
                                     |--47.44%--ondemand_readahead
                                     |           --46.88%--__do_page_cache_readahead
                                     |                     |--32.67%--ext4_mpage_readpages


[..]
perf --no-children also confirms (for the third time) that there isn't any single function burning a lot of CPU inside the worker:

# Overhead       sys       usr  Command   Shared Object       Symbol
     5.40%     0.00%     5.40%  postgres  [vdso]              [.] __vdso_clock_gettime
     5.11%     0.00%     5.11%  postgres  postgres            [.] acquire_sample_rows
            ---acquire_sample_rows
     3.98%     0.00%     3.98%  postgres  postgres            [.] heapam_scan_analyze_next_tuple
            ---heapam_scan_analyze_next_tuple
     3.69%     3.69%     0.00%  postgres  [kernel.kallsyms]   [k] pvclock_clocksource_read

My questions are:
a) Does anybody know whether it is expected that getrusage() doesn't include readahead time in the current thread's
system time? (I don't know, but maybe the readahead is performed by other kernel threads?) ru_stime is defined as "This
is the total amount of time spent executing in kernel mode". Maybe "executing" is the keyword here? (waiting != executing?)
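To see this effect outside PostgreSQL, a toy program along these lines is enough - time spent blocked on uncached reads
shows up in the elapsed time but not in ru_utime/ru_stime, which only count time actually on-CPU; this is just an
illustrative sketch:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    static double tv(struct timeval t)  { return t.tv_sec + t.tv_usec / 1e6; }
    static double ts(struct timespec t) { return t.tv_sec + t.tv_nsec / 1e9; }

    int main(int argc, char **argv)
    {
        char            buf[8192];
        struct rusage   r0, r1;
        struct timespec t0, t1;
        int             fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)   /* pass a large, uncached file */
            return 1;
        getrusage(RUSAGE_SELF, &r0);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (read(fd, buf, sizeof(buf)) > 0)
            ;                                   /* just drag the file through the page cache */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        getrusage(RUSAGE_SELF, &r1);
        printf("elapsed: %.2f s  user: %.2f s  system: %.2f s\n",
               ts(t1) - ts(t0),
               tv(r1.ru_utime) - tv(r0.ru_utime),
               tv(r1.ru_stime) - tv(r0.ru_stime));
        close(fd);
        return 0;
    }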

b) Initially I wanted to add a new pg_rusage_show_verbose() that would also add ru_inblock, but that wouldn't add much
value for the end user. Also, adding another timing directly around table_scan_analyze_next_block() seems like a bad
idea as it involves locking underneath. So I've tried the easiest approach: simply log pgStatBlockReadTime as strictly
the I/O time spent in pread() (ReadBuffer_common() already measures it). The attached patch for PgSQL 14-devel under
heavy I/O conditions (with track_io_timing=on) logs the following:
"LOG:  automatic analyze of table "test.public.t1_default" system usage: IO read time 0.69 s, CPU: user: 0.18 s,
system: 0.13 s, elapsed: 0.92 s"
My interpretation would be that I/O read time was the most limiting factor (0.69/0.92 = 75%), while *CPU* time on the
kernel side was just 0.13 s. It could give the end user/DBA the information needed to see where the bottleneck is, given
autovacuum_vacuum_cost_delay=0. With autovacuum_vacuum_cost_delay>0, maybe it would also make sense to include the time
spent sleeping?

c) I'm curious whether anybody has any I/O-related insights into analyze.c processing, especially related to readahead.
E.g. maybe disabling readahead would help the PostgreSQL analyze.c use case on NVMe? Is it worth it given that only x%
of the blocks are needed? The only option I'm aware of would be to e.g. hash-partition the table (to introduce
parallelism via several autovacuum workers). Any hints or comments?
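Purely as an illustration of that hash-partitioning workaround (column names and partition count are made up),
something like the following would give every partition its own autoanalyze job, so several autovacuum workers can
sample in parallel:

    CREATE TABLE t1_part (id bigint, payload text) PARTITION BY HASH (id);
    CREATE TABLE t1_part_0 PARTITION OF t1_part FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE t1_part_1 PARTITION OF t1_part FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE t1_part_2 PARTITION OF t1_part FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE t1_part_3 PARTITION OF t1_part FOR VALUES WITH (MODULUS 4, REMAINDER 3);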

All of the above observations are from PostgreSQL 12.4 on Linux kernel 4.14, with ext4 on a striped dm over 3-4 NVMes.

-Jakub Wartak.

Re: automatic analyze: readahead - add "IO read time" log message

From: Stephen Frost
Greetings,

* Jakub Wartak (Jakub.Wartak@tomtom.com) wrote:
> I have, I hope, an interesting observation (and a nano patch proposal) from a system where statistics freshness is a
> critical factor. Autovacuum/automatic statistics gathering was tuned to be very aggressive:
> autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
> autovacuum_naptime=1s
> autovacuum_max_workers=4
>
> Some critical table partitions are configured with autovacuum_analyze_scale_factor=0.001,
> autovacuum_analyze_threshold=50000 to force auto-analyze jobs pretty often. The interesting logs are like this:
> automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 s, elapsed: 252.14 s
> automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 s, elapsed: 317.33 s
> automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 s, elapsed: 213.58 s
> automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 s, elapsed: 229.45 s

That's certainly pretty aggressive. :)

> and this is the root cause of my question.  As you can see there is a huge 3x-4x discrepancy between "elapsed" and
> user+system, which is strange, at least for me, as there should be no waiting (autovacuum_vacuum_cost_delay=0). According
> to various tools this is true: time was spent somewhere else, but not in the PostgreSQL analyze itself. ps(1) and
> pidstat(1) also report the same for the worker:

The user/system time there is time-on-CPU (hence the 'CPU: ' prefix).

> 06:56:12 AM       PID    %usr %system  %guest    %CPU   CPU  Command
> 06:56:13 AM    114774    8.00   10.00    0.00   18.00    18  postgres
> 06:56:14 AM    114774    8.00   11.00    0.00   19.00    15  postgres
> 06:56:15 AM    114774    5.00   13.00    0.00   18.00    18  postgres
>
> 06:56:17 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
> 06:56:18 AM    114774  63746.53      0.00      0.00  postgres
> 06:56:19 AM    114774  62896.00      0.00      0.00  postgres
> 06:56:20 AM    114774  62920.00      0.00      0.00  postgres
>
> One could argue that such an autoanalyze worker could perform 5x more work (%CPU -> 100%) here. The I/O system is not
> doing a lot either (total = 242MB/s reads@22k IOPS, 7MB/s writes@7k IOPS, with service time 0.04ms); although it reports
> high utilization, I'm pretty sure it could push much more. There can be up to 3x-4x such 70-80MB/s analyzes running in
> parallel (let's say 300MB/s for statistics collection alone).

The analyze is doing more-or-less random i/o since it's skipping through
the table picking out select blocks, not doing regular sequential i/o.

> According to various gdb backtraces, a lot of time is spent here:
> #0  0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
> #1  0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192, __buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84
> #2  FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005", amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
> #3  0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
> #4  0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902, mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f) at bufmgr.c:897
> #5  0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimized out>, strategy=<optimized out>) at bufmgr.c:665
> #6  0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>, blockno=<optimized out>, bstrategy=<optimized out>) at heapam_handler.c:998
> #7  0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>, blockno=<optimized out>, scan=0x10c8098) at ../../../src/include/access/tableam.h:1462
> #8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, targrows=1500000, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at analyze.c:1048
> #9  0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>, relpages=26763525, inh=false, in_outer_xact=false, elevel=13) at analyze.c:502
> [..]
> #12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, tab=<optimized out>) at autovacuum.c:3118
> [..]

Sure, we're blocked on a read call trying to get the next block.

> The interesting thing is that targrows=1.5 million and that blocks are accessed (as expected) in sorted order:
>
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, bstrategy=0x1102278) at heapam_handler.c:984

Not really sure what's interesting here, but it does look like we're
skipping through the table as expected.

> And this probably explains the discrepancy: perf with CPU usage reporting shows a lot of frames waiting on I/O in
> readaheads done by ext4, trimmed for clarity:
>
> # Children      Self       sys       usr  Command   Shared Object       Symbol
>     63.64%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
>             ---entry_SYSCALL_64_after_hwframe
>                do_syscall_64
>                |--59.66%--sys_pread64
>                |          vfs_read
>                |           --59.09%--__vfs_read
>                |                      --58.24%--generic_file_read_iter
>                |                                |--47.44%--ondemand_readahead
>                |                                |           --46.88%--__do_page_cache_readahead
>                |                                |                     |--32.67%--ext4_mpage_readpages
>                |                                |                     |          |--16.76%--submit_bio
>                |                                |                     |--10.23%--blk_finish_plug
> [..]
>     63.64%     1.99%     1.99%     0.00%  postgres  [kernel.kallsyms]   [k] do_syscall_64
>             |--61.65%--do_syscall_64
>             |          |--59.66%--sys_pread64
>             |          |          vfs_read
>             |          |           --59.09%--__vfs_read
>             |          |                      --58.24%--generic_file_read_iter
>             |          |                                |--47.44%--ondemand_readahead
>             |          |                                |           --46.88%--__do_page_cache_readahead
>
>     61.36%     0.00%     0.00%     0.00%  postgres  postgres            [.] FileRead
>             ---FileRead
>                __pread_nocancel
>                 --60.51%--entry_SYSCALL_64_after_hwframe
>                           do_syscall_64
>                            --59.66%--sys_pread64
>                                      vfs_read
>                                       --59.09%--__vfs_read
>                                                  --58.24%--generic_file_read_iter
>                                                            |--47.44%--ondemand_readahead
>                                                            |           --46.88%--__do_page_cache_readahead
>
>     61.36%     0.85%     0.00%     0.85%  postgres  libpthread-2.17.so  [.] __pread_nocancel
>             |--60.51%--__pread_nocancel
>             |          entry_SYSCALL_64_after_hwframe
>             |          do_syscall_64
>             |           --59.66%--sys_pread64
>             |                     vfs_read
>             |                      --59.09%--__vfs_read
>             |                                 --58.24%--generic_file_read_iter
>             |                                           |--47.44%--ondemand_readahead
>             |                                           |           --46.88%--__do_page_cache_readahead
>
>
>     59.66%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] sys_pread64
>             ---sys_pread64
>                vfs_read
>                 --59.09%--__vfs_read
>                            --58.24%--generic_file_read_iter
>                                      |--47.44%--ondemand_readahead
>                                      |           --46.88%--__do_page_cache_readahead
>                                      |                     |--32.67%--ext4_mpage_readpages
>

With all those 'readahead' calls it certainly makes one wonder if the
Linux kernel is reading more than just the block we're looking for
because it thinks we're doing a sequential read and will therefore want
the next few blocks when, in reality, we're going to skip past them,
meaning that any readahead the kernel is doing is likely just wasted
I/O.

> [..]
> Perf --no-children also triple confirms that there isn't any function that is burning a lot inside the worker:
>
> # Overhead       sys       usr  Command   Shared Object       Symbol
>      5.40%     0.00%     5.40%  postgres  [vdso]              [.] __vdso_clock_gettime
>      5.11%     0.00%     5.11%  postgres  postgres            [.] acquire_sample_rows
>             ---acquire_sample_rows
>      3.98%     0.00%     3.98%  postgres  postgres            [.] heapam_scan_analyze_next_tuple
>             ---heapam_scan_analyze_next_tuple
>      3.69%     3.69%     0.00%  postgres  [kernel.kallsyms]   [k] pvclock_clocksource_read

Sure, makes sense.

> My questions are:
> a) Does anybody know whether it is expected that getrusage() doesn't include readahead time in the current thread's
> system time? (I don't know, but maybe the readahead is performed by other kernel threads?) ru_stime is defined as "This
> is the total amount of time spent executing in kernel mode". Maybe "executing" is the keyword here? (waiting != executing?)

getrusage()'s user/system CPU times are reporting time-on-CPU, not
counting time blocking for i/o.  Waiting isn't the same as executing,
no.

> b) Initially I wanted to add a new pg_rusage_show_verbose() that would also add ru_inblock, but that wouldn't add much
> value for the end user. Also, adding another timing directly around table_scan_analyze_next_block() seems like a bad
> idea as it involves locking underneath. So I've tried the easiest approach: simply log pgStatBlockReadTime as strictly
> the I/O time spent in pread() (ReadBuffer_common() already measures it). The attached patch for PgSQL 14-devel under
> heavy I/O conditions (with track_io_timing=on) logs the following:
> "LOG:  automatic analyze of table "test.public.t1_default" system usage: IO read time 0.69 s, CPU: user: 0.18 s,
> system: 0.13 s, elapsed: 0.92 s"

That definitely seems like a useful thing to include and thanks for the
patch!  Please be sure to register it in the commitfest app:
https://commitfest.postgresql.org

> My interpretation would be that I/O read time was the most limiting factor (0.69/0.92 = 75%), while *CPU* time on the
> kernel side was just 0.13 s. It could give the end user/DBA the information needed to see where the bottleneck is, given
> autovacuum_vacuum_cost_delay=0. With autovacuum_vacuum_cost_delay>0, maybe it would also make sense to include the time
> spent sleeping?

Yeah, that would certainly be useful.

> c) I'm curious whether anybody has any I/O-related insights into analyze.c processing, especially related to readahead.
> E.g. maybe disabling readahead would help the PostgreSQL analyze.c use case on NVMe? Is it worth it given that only x%
> of the blocks are needed? The only option I'm aware of would be to e.g. hash-partition the table (to introduce
> parallelism via several autovacuum workers). Any hints or comments?

I would think that, ideally, we'd teach analyze.c to work in the same
way that bitmap heap scans do- that is, use posix_fadvise to let the
kernel know what pages we're going to want next instead of the kernel
guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
lot of time poking, but it doesn't look like analyze.c tries to do any
prefetching today.  In a similar vein, I wonder if VACUUM should be
doing prefetching too today, at least when it's skipping through the
heap based on the visibility map and jumping over all-frozen pages.

> All of the above observations are from PostgreSQL 12.4 on Linux kernel 4.14, with ext4 on a striped dm over 3-4 NVMes.
>
> -Jakub Wartak.

> diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
> index 8af12b5c6b..fea1bd6f44 100644
> --- a/src/backend/commands/analyze.c
> +++ b/src/backend/commands/analyze.c
> @@ -312,6 +312,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
>      Oid            save_userid;
>      int            save_sec_context;
>      int            save_nestlevel;
> +    PgStat_Counter startblockreadtime = 0;
>
>      if (inh)
>          ereport(elevel,
> @@ -347,6 +348,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
>      if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
>      {
>          pg_rusage_init(&ru0);
> +        startblockreadtime = pgStatBlockReadTime;
>          if (params->log_min_duration > 0)
>              starttime = GetCurrentTimestamp();
>      }
> @@ -686,10 +688,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
>              TimestampDifferenceExceeds(starttime, GetCurrentTimestamp(),
>                                         params->log_min_duration))
>              ereport(LOG,
> -                    (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: %s",
> +                    (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: IO read time %.2f s, %s",
>                              get_database_name(MyDatabaseId),
>                              get_namespace_name(RelationGetNamespace(onerel)),
>                              RelationGetRelationName(onerel),
> +                            (double) (pgStatBlockReadTime - startblockreadtime)/1000000,
>                              pg_rusage_show(&ru0))));
>      }
>

Haven't looked too closely at this but in general +1 on the idea and
this approach looks pretty reasonable to me.  Only thing I can think of
off-hand is to check how it compares to other places where we report IO
read time and make sure that it looks similar.

Thanks,

Stephen


Re: automatic analyze: readahead - add "IO read time" log message

From: Jakub Wartak
Hi Stephen, hackers,

> The analyze is doing more-or-less random i/o since it's skipping through
> the table picking out select blocks, not doing regular sequential i/o.
VS
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
>Not really sure what's interesting here, but it does look like we're
>skipping through the table as expected.

Yes, but not randomly in this case. I wanted to point out that the block number is incrementing, therefore I've
included this debug output, which might trigger the readahead heuristics.
Perhaps this depends on how the table was built / vacuumed? (In this case, pure INSERT-only; I would expect the same
in time-series DBs and DWHs.)

> With all those 'readahead' calls it certainly makes one wonder if the
> Linux kernel is reading more than just the block we're looking for
> because it thinks we're doing a sequential read and will therefore want
> the next few blocks when, in reality, we're going to skip past them,
> meaning that any readahead the kernel is doing is likely just wasted
> I/O.

I've done some quick & dirty tests with blockdev --setra/--setfra 0 (after spending time looking at the smgr/md/fd API
changes that would be required, to find a shortcut), but I'm actually getting slightly worse timings, at least in
"laptop DB tests". One thing I've noticed is that disabling readahead would need to apply only to automatic analyze,
not to automatic vacuum, where apparently there is some boost from readahead.

> That definitely seems like a useful thing to include and thanks for the
> patch!  Please be sure to register it in the commitfest app:
> https://commitfest.postgresql.org

Thank you! Thread is now registered.

> I would think that, ideally, we'd teach analyze.c to work in the same
> way that bitmap heap scans do- that is, use posix_fadvise to let the
> kernel know what pages we're going to want next instead of the kernel
> guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
> lot of time poking, but it doesn't look like analyze.c tries to do any
> prefetching today.  In a similar vein, I wonder if VACUUM should be
> doing prefetching too today, at least when it's skipping through the
> heap based on the visibility map and jumping over all-frozen pages.

My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM
issued on the block range beforehand, so maybe the optimization is possible, but not until we have AIO ;)

> Haven't looked too closely at this but in general +1 on the idea and
> this approach looks pretty reasonable to me.  Only thing I can think of
> off-hand is to check how it compares to other places where we report IO
> read time and make sure that it looks similar.

Ok, I've changed the output in the 002 version to include "avg read rate" just like in the autovacuum case, while still
maintaining single-line output, e.g.:
automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s (read time: 2.52 s), system usage: CPU:
user: 0.28 s, system: 0.26 s, elapsed: 2.94 s

-J.

Re: automatic analyze: readahead - add "IO read time" log message

From: Stephen Frost
Greetings Jakub,

* Jakub Wartak (Jakub.Wartak@tomtom.com) wrote:
> > The analyze is doing more-or-less random i/o since it's skipping through
> > the table picking out select blocks, not doing regular sequential i/o.
> VS
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
> >Not really sure what's interesting here, but it does look like we're
> >skipping through the table as expected.
>
> Yes, but not randomly in this case. I wanted to point out that the block number is incrementing, therefore I've
> included this debug output, which might trigger the readahead heuristics.

Sure, it's incrementing, but it's skipping- this is very similar to what
we do with Bitmap Heap Scans, and that makes it different from a typical
sequential scan.

> Perhaps this depends on how the table was built / vacuumed? (In this case, pure INSERT-only; I would expect the same
> in time-series DBs and DWHs.)

No, this is how ANALYZE behaves and hasn't got much to do with how the
table was built or vacuumed.

> > With all those 'readahead' calls it certainly makes one wonder if the
> > Linux kernel is reading more than just the block we're looking for
> > because it thinks we're doing a sequential read and will therefore want
> > the next few blocks when, in reality, we're going to skip past them,
> > meaning that any readahead the kernel is doing is likely just wasted
> > I/O.
>
> I've done some quick & dirty tests with blockdev --setra/--setfra 0 (after spending time looking at the smgr/md/fd API
> changes that would be required, to find a shortcut), but I'm actually getting slightly worse timings, at least in
> "laptop DB tests". One thing I've noticed is that disabling readahead would need to apply only to automatic analyze,
> not to automatic vacuum, where apparently there is some boost from readahead.

Interesting that you weren't seeing any benefit to disabling readahead.
Were you able to see where the time in the kernel was going when
readahead was turned off for the ANALYZE?

Note that you shouldn't need to make changes to smgr/md/fd to leverage
posix_fadvise- what you would do is use PrefetchBuffer(), see
BitmapPrefetch().
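Very roughly - this is only a sketch, not a worked patch, and the prefetch_bs variable and the use of
effective_io_concurrency here are purely illustrative - the sampling loop in acquire_sample_rows() could run a copy of
the block sampler a bit ahead of the reads and hand each upcoming block to PrefetchBuffer():

    #ifdef USE_PREFETCH
        BlockSamplerData prefetch_bs = bs;     /* independent copy that runs ahead of bs */

        /* prime the prefetch window */
        for (int i = 0; i < effective_io_concurrency && BlockSampler_HasMore(&prefetch_bs); i++)
            PrefetchBuffer(onerel, MAIN_FORKNUM, BlockSampler_Next(&prefetch_bs));
    #endif

        while (BlockSampler_HasMore(&bs))
        {
            BlockNumber targblock = BlockSampler_Next(&bs);

    #ifdef USE_PREFETCH
            /* keep the window effective_io_concurrency blocks ahead of the current read */
            if (BlockSampler_HasMore(&prefetch_bs))
                PrefetchBuffer(onerel, MAIN_FORKNUM, BlockSampler_Next(&prefetch_bs));
    #endif

            /* ... existing heapam_scan_analyze_next_block()/_next_tuple() logic ... */
        }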

The VACUUM case is going to be complicated by what's in the visibility
map.  A VACUUM that isn't able to skip any pages yet is certainly going
to benefit from the kernel's readahead, but a VACUUM that's able to skip
over pages likely wouldn't benefit as much.

The way to test this would look something like this (a rough SQL sketch follows the list):

- COPY a bunch of data into a table
- make sure to commit that and any other ongoing transactions
- VACUUM FREEZE the table
- check the visibility map to make sure most of the pages are marked as
  all-frozen after the VACUUM FREEZE
- randomly UPDATE the table, to really get the effect, maybe update 20%
  of the pages while leaving the rest alone (and therefore 80% of the
  table should still have the all-frozen bit set in the visibility map)
- *then* do a VACUUM on the table and see what happens with different
  amounts of read-ahead (or, ideally, with posix_fadvise() being used to
  let the kernel know what pages we're going to actually want).
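A rough SQL sketch of those steps (the table name, the data file and the UPDATE predicate are only illustrative, and
pg_visibility provides the visibility-map summary):

    CREATE EXTENSION IF NOT EXISTS pg_visibility;
    CREATE TABLE vac_test (id bigint, payload text);
    COPY vac_test FROM '/tmp/vac_test.csv';                  -- bulk load, then make sure everything is committed
    VACUUM (FREEZE, VERBOSE) vac_test;
    SELECT * FROM pg_visibility_map_summary('vac_test');     -- expect most pages to be all-frozen
    UPDATE vac_test SET payload = payload || 'x'
      WHERE (id / 100) % 5 = 0;                              -- ~20% of the pages, assuming ~100 rows/page in id order
    VACUUM (VERBOSE) vac_test;                               -- repeat with different readahead / prefetch settings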

> > That definitely seems like a useful thing to include and thanks for the
> > patch!  Please be sure to register it in the commitfest app:
> > https://commitfest.postgresql.org
>
> Thank you! Thread is now registered.

Great!

> > I would think that, ideally, we'd teach analyze.c to work in the same
> > way that bitmap heap scans do- that is, use posix_fadvise to let the
> > kernel know what pages we're going to want next instead of the kernel
> > guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
> > lot of time poking, but it doesn't look like analyze.c tries to do any
> > prefetching today.  In a similar vein, I wonder if VACUUM should be
> > doing prefetching too today, at least when it's skipping through the
> > heap based on the visibility map and jumping over all-frozen pages.
>
> My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM
> issued on the block range beforehand, so maybe the optimization is possible, but not until we have AIO ;)

Well, the idea is that posix_fadvise() usage through PrefetchBuffer()
gets us some of that by letting the kernel know what we're going to ask
for next.  AIO is a whole other animal that's been discussed off and on
around here but it's a much larger and more invasive change than just
calling posix_fadvise().

> > Haven't looked too closely at this but in general +1 on the idea and
> > this approach looks pretty reasonable to me.  Only thing I can think of
> > off-hand is to check how it compares to other places where we report IO
> > read time and make sure that it looks similar.
>
> Ok, I've changed the output in the 002 version to include "avg read rate" just like in the autovacuum case, while still
> maintaining single-line output, e.g.:
> automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s (read time: 2.52 s), system usage: CPU:
> user: 0.28 s, system: 0.26 s, elapsed: 2.94 s

Ah, in this case I had been thinking actually about the other things
that track_io_timing controls, as in:

            ExplainPropertyFloat("I/O Read Time", "ms",
                                 INSTR_TIME_GET_MILLISEC(usage->blk_read_time),
                                 3, es);
            ExplainPropertyFloat("I/O Write Time", "ms",
                                 INSTR_TIME_GET_MILLISEC(usage->blk_write_time),
                                 3, es);

Seems we already aren't exactly consistent in how we report these
things, unfortunately.

That said, since this is ANALYZE, yeah, it probably makes more sense to
have it look more like what VACUUM reports than what you'd get from
EXPLAIN.  As an interesting point also- what VACUUM reports isn't
through using track_io_timing, which makes one wonder if we should be
doing that here or not..

Thanks!

Stephen


Re: automatic analyze: readahead - add "IO read time" log message

From: Jakub Wartak
Hi Stephen, hackers,

>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.
>> I've done some quick & dirty tests with blockdev --setra/--setfra 0 (after spending time looking at the smgr/md/fd API
>> changes that would be required, to find a shortcut), but I'm actually getting slightly worse timings, at least in
>> "laptop DB tests". One thing I've noticed is that disabling readahead would need to apply only to automatic analyze,
>> not to automatic vacuum, where apparently there is some boost from readahead.

>Interesting that you weren't seeing any benefit to disabling readahead.

I've got some free minutes and have repeated the exercise in a more realistic and stricter environment than the
previous one, concluding that the current situation (readahead enabled) is preferable:

The analyzed table was 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 35GB, 65GB (as reported by \di+).
Linux kernel 4.14.x, 2x NVMe under dm-0 (it might matter, as /dev/dm-0 is a different layer and might have different
storage settings), VG on top of dm-0, LV with stripe size 8kB, ext4.
s_b=128MB, RAM=128GB (minus ~30GB reserved for HugePages), typical output of PgSQL 12:
INFO:  "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows and 0 dead rows; 1500000 rows in sample,
1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent in heapam_scan_analyze_next_block() -> ..
-> pread(), which causes ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average request
size per IO=25 sectors (*512/1024 = ~12kB), readahead on, hot caches, total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s read, ondemand_readahead visible in perf,
total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read
IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, double-confirmed no readaheads
[ pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, total elapsed 5m55s

One thing that is not clear here, and maybe worth measuring in the future, is how striped LVs are affected by
readahead.

>Were you able to see where the time in the kernel was going when
>readahead was turned off for the ANALYZE?

Yes, my interpretation is that the time spent goes into directly block I/O layer waiting.

    54.67%     1.33%  postgres  postgres            [.] FileRead
            ---FileRead
               --53.33%--__pread_nocancel
                  --50.67%--entry_SYSCALL_64_after_hwframe
                            do_syscall_64
                            sys_pread64
                            |--49.33%--vfs_read
                            |    --48.00%--__vfs_read
                            |       |--45.33%--generic_file_read_iter
                            |       |   |--42.67%--ondemand_readahead
                            |       |   |          __do_page_cache_readahead
                            |       |   |   |--25.33%--ext4_mpage_readpages
                            |       |   |   |   |--10.67%--submit_bio
                            |       |   |   |   |          generic_make_request
                            |       |   |   |   |   |--8.00%--blk_mq_make_request
                            |       |   |   |   |   |   |--4.00%--blk_mq_get_request
                            |       |   |   |   |   |   |   |--1.33%--blk_mq_get_tag
                            |       |   |   |   |   |   |    --1.33%--sched_clock
                            |       |   |   |   |   |   |              xen_sched_clock
                            |       |   |   |   |   |   |              pvclock_clocksource_read
                            |       |   |   |   |   |   |--1.33%--bio_integrity_prep
                            |       |   |   |   |   |    --1.33%--blk_account_io_start
                            |       |   |   |   |   |              part_round_stats
                            |       |   |   |   |   |              blk_mq_in_flight
                            |       |   |   |   |   |              blk_mq_queue_tag_busy_iter
                            |       |   |   |   |    --2.67%--dm_make_request
                            |       |   |   |   |              __split_and_process_bio
                            |       |   |   |   |              __split_and_process_non_flush
                            |       |   |   |   |        |--1.33%--__map_bio
                            |       |   |   |   |        |          generic_make_request
                            |       |   |   |   |        |          generic_make_request_checks
                            |       |   |   |   |        |          percpu_counter_add_batch
                            |       |   |   |   |         --1.33%--bio_alloc_bioset
                            |       |   |   |   |                   mempool_alloc
                            |       |   |   |   |                   kmem_cache_alloc
                            |       |   |   |   |--6.67%--ext4_map_blocks
                            |       |   |   |   |   |--4.00%--ext4_es_lookup_extent
                            |       |   |   |   |   |    --2.67%--_raw_read_lock
                            |       |   |   |   |    --2.67%--__check_block_validity.constprop.81
                            |       |   |   |   |              ext4_data_block_valid
                            |       |   |   |    --6.67%--add_to_page_cache_lru
                            |       |   |   |       |--4.00%--__add_to_page_cache_locked
                            |       |   |   |       |    --1.33%--mem_cgroup_try_charge
                            |       |   |   |       |              get_mem_cgroup_from_mm
                            |       |   |   |        --2.67%--__lru_cache_add
                            |       |   |   |                  pagevec_lru_move_fn
                            |       |   |   |                  __lock_text_start
                            |       |   |   |--12.00%--blk_finish_plug
                            |       |   |   |          blk_flush_plug_list
                            |       |   |   |          blk_mq_flush_plug_list
                            |       |   |   |   |--10.67%--__blk_mq_delay_run_hw_queue
                            |       |   |   |   |          __blk_mq_run_hw_queue
                            |       |   |   |   |          blk_mq_sched_dispatch_requests
                            |       |   |   |   |    --9.33%--blk_mq_dispatch_rq_list
                            |       |   |   |   |              nvme_queue_rq
                            |       |   |   |   |               --1.33%--blk_mq_start_request




>The VACUUM case is going to be complicated by what's in the visibility
>map. (..)

After observing the benefit readahead brings to ANALYZE, I've abandoned the idea of testing the much more advanced
VACUUM processing; clearly Linux read-ahead is beneficial even in simple cases.

>> My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM
>> issued on the block range beforehand, so maybe the optimization is possible, but not until we have AIO ;)
>
> (..)AIO is a whole other animal that's been discussed off and on
>around here but it's a much larger and more invasive change than just
>calling posix_fadvise().

Yes, I'm aware and I'm keeping my fingers crossed that maybe some day....

ANALYZE just seems a natural candidate to use it. The only easy way to accelerate stats gathering - at least from my
and the end user's point of view - is to have more parallel autoanalyze workers running to drive more I/O concurrency
(by e.g. partitioning the data), both in readahead and non-readahead scenarios. Which is a pity, because such a process
sits idle 70-80% of the time. The readahead might read 10x more unnecessary data, but pread() doesn't have to wait.
<speculation>Once we have AIO it could throw thousands of requests at the device without readahead and probably achieve
much better efficiency.</speculation>

I hope the previous simple patch goes into master and helps other people understand
the picture more easily.

-J.


Re: automatic analyze: readahead - add "IO read time" log message

From: Stephen Frost
Greetings,

* Jakub Wartak (Jakub.Wartak@tomtom.com) wrote:
> >Interesting that you weren't seeing any benefit to disabling readahead.
>
> I've got some free minutes and have repeated the exercise in a more realistic and stricter environment than the
> previous one, concluding that the current situation (readahead enabled) is preferable:

Thanks for spending time on this!

> The analyzed table was 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 35GB, 65GB (as reported by \di+).
> Linux kernel 4.14.x, 2x NVMe under dm-0 (it might matter, as /dev/dm-0 is a different layer and might have different
> storage settings), VG on top of dm-0, LV with stripe size 8kB, ext4.
> s_b=128MB, RAM=128GB (minus ~30GB reserved for HugePages), typical output of PgSQL 12:
> INFO:  "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows and 0 dead rows; 1500000 rows in sample,
> 1678321053 estimated total rows
>
> Hot VFS cache:
> Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent in heapam_scan_analyze_next_block() -> ..
> -> pread(), which causes ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average request
> size per IO=25 sectors (*512/1024 = ~12kB), readahead on, hot caches, total elapsed ~3m
> Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
> Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
> Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s
>
> No VFS cache:
> Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s read, ondemand_readahead visible in perf,
> total elapsed 3m30s
> Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m29s
> Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m28s
>
> No VFS cache, readahead off:
> Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read
> IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), total elapsed 5m59s
> Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, double-confirmed no readaheads
> [ pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total elapsed 5m56s
> Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, total elapsed 5m55s

[ ... ]

> >The VACUUM case is going to be complicated by what's in the visibility
> >map. (..)
>
> After observing the ANALYZE readahead behavior benefit I've abandoned
> the case of testing much more advanced VACUUM processing, clearly Linux
> read-ahead is beneficial in even simple cases.

This seems to be indicating that while the Linux kernel may end up
reading pages we don't end up needing, it's much more often the case
that it's ending up reading *some* pages that we do need, and that's
happening often enough that it more than makes up for the extra reads
being done.

Instead of having these guessing games between the kernel and what PG's
doing, however, we could potentially do better using posix_fadvise() to
tell the kernel, up front, exactly what blocks we are going to ask for,
and perhaps that would end up improving things.

Attached is a very much rough-n-ready patch for doing that, using
effective_io_concurrency to control how many blocks to pre-fetch for
ANALYZE (0 meaning 'none').  If you've got a chance to test with
different settings for effective_io_concurrency with the patch applied
to see what impact posix_fadvise() has on these ANALYZE runs, that would
be very cool to see.

Going between effective_io_concurrency = 0 and effective_io_concurrency = 10
with this patch, in some very quick testing on a laptop NVMe, while
making sure to drop caches and restart PG in between to clear out
shared_buffers, definitely shows that prefetching done this way is an
improvement over letting the kernel's normal read ahead handle it.

> >> My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM
> >> issued on the block range beforehand, so maybe the optimization is possible, but not until we have AIO ;)
> >
> > (..)AIO is a whole other animal that's been discussed off and on
> >around here but it's a much larger and more invasive change than just
> >calling posix_fadvise().
>
> Yes, I'm aware and I'm keeping my fingers crossed that maybe some day....

I don't think we should throw out the idea of using PrefetchBuffer()
here.  "Real" AIO would certainly be good to have one of these days, but
until then, posix_fadvise() could net us some of those gains in the
meantime.

> The ANALYZE just seem fit to be natural candidate to use it. The only easy chance
> of acceleration of stats gathering - at least to me and enduser point of view -
> is to have more parallel autoanalyze workers running to drive more I/O concurrency
> (by e.g. partitioning the data),  both in readahead and non-readahead scenarios.
> Which is a pity because 70-80% of such process sits idle. The readahead might read
> 10x more unnecessary data, but pread() doesn't have to wait. <speculation>Once AIO
> would be it could throw thousands of requests without readahead and achieve much
> better efficiency probably</speculation>

Actual parallel ANALYZE could be interesting, of course, but that's also
quite a bit more complicated to make happen, and even in that case it
would seem like we'd probably want to use posix_fadvise() anyway since
it'd hopefully help.

> I hope the previous simple patch goes into master and helps other people understand
> the picture more easily.

I agree that we should also be providing users with more information
about what's going on, but maybe we can do that and improve things by
leveraging posix_fadvise().

Thanks,

Stephen


Re: automatic analyze: readahead - add "IO read time" log message

From: Jakub Wartak
Greetings Stephen,

I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's
a huge ~5.2x improvement, which is amazing for such a simple patch.

The system and data were identical to last time, so the results are directly comparable
to the previous post. The only change is that I've applied your patch on top of
REL12_STABLE, as this is the cluster where I had the original data. I think it deserves
its own commitfest entry, right? Or do we combine the two?

Each test was conducted with
1. alter system set effective_io_concurrency=$N;
2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart;
3. analyze verbose t;
4. the measurements were observed by pidstat -d and iostat -dxm.

no readahead (blockdev --setra 0), cold cache:
effective_io_concurrency=64; => ~26MB/s
effective_io_concurrency=32; => ~29MB/s
effective_io_concurrency=8; => ~31MB/s
effective_io_concurrency=4; => ~32MB/s
effective_io_concurrency=1; => ~32MB/s
effective_io_concurrency=0; => ~32MB/s

readahead=256 (128kB, Linux defaults), cold cache:
effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s , baseline)
effective_io_concurrency=1; => ~62MB/s
effective_io_concurrency=4; => ~370MB/s
effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100%
effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)

I think one needs to be sure to restart, so that the autovacuum
workers pick up the new effective_io_concurrency; maybe that's only a
documentation thing.

I have just one question. With this patch, the strace output for the ANALYZE
I/O looks like below:

pread(72, ..., 8192, 738181120) = 8192
fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, "..., 8192, 738500608) = 8192
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, .., 8192, 738607104) = 8192
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, ..., 8192, 738754560) = 8192
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0

If you track "738754560" in the output, the same fadvise64() calls appear to be
repeated until the block is finally pread() - in the case of the "738754560" offset it
was asked for 3 times. Also, I wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.

-J.

----------------------------------------------------------------------------------------------------------------

From: Stephen Frost
Sent: Tuesday, November 3, 2020 6:47 PM
To: Jakub Wartak
Cc: pgsql-hackers
Subject: Re: automatic analyze: readahead - add "IO read time" log message

Greetings,

* Jakub Wartak (Jakub.Wartak@tomtom.com) wrote:
> >Interesting that you weren't seeing any benefit to disabling readahead.
>
> I've got some free minutes and I have repeated the exercise in more realistic
> and strict environment that previous one to conclude that the current situation is preferable:

Thanks for spending time on this!

> Analyzed table was having 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 35GB, 65GB (as reported by \di+)
> Linux kernel 4.14.x, 2x NVME under dm-0 (it might matter as /dev/dm-0 might be a different layer and might have
> different storage settings), VG on top of dm-0, LV with stripe-size 8kB, ext4.
> s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical output of PgSQL12:
> INFO:  "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows and 0 dead rows; 1500000 rows in
> sample, 1678321053 estimated total rows
>
> Hot VFS cache:
> Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent heapam_scan_analyze_next_block() -> ..
> ->pread() which causes ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average request
> size per IO=25 sectors (*512/1024 = ~12kB), readahead on, hot caches, total elapsed ~3m
> Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
> Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
> Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s
>
> No VFS cache:
> Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s read, ondemand_readahead visible in
> perf, total elapsed 3m30s
> Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m29s
> Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m28s
>
> No VFS cache, readahead off:
> Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read
> IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), total elapsed 5m59s
> Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, double-confirmed no
> readaheads [ pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total elapsed 5m56s
> Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, total elapsed 5m55s

[ ... ]

> >The VACUUM case is going to be complicated by what's in the visibility
> >map. (..)
>
> After observing the ANALYZE readahead behavior benefit I've abandoned
> the case of testing much more advanced VACUUM processing, clearly Linux
> read-ahead is beneficial in even simple cases.

This seems to be indicating that while the Linux kernel may end up
reading pages we don't end up needing, it's much more often the case
that it's ending up reading *some* pages that we do need, and that's
happening often enough that it more than makes up for the extra reads
being done.

Instead of having these guessing games between the kernel and what PG's
doing, however, we could potentially do better using posix_fadvise() to
tell the kernel, up front, exactly what blocks we are going to ask for,
and perhaps that would end up improving things.

Attached is a very much rough-n-ready patch for doing that, using
effective_io_concurrency to control how many blocks to pre-fetch for
ANALYZE (0 meaning 'none').  If you've got a chance to test with
different settings for effective_io_concurrency with the patch applied
to see what impact posix_fadvise() has on these ANALYZE runs, that would
be very cool to see.
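
As a rough illustration of what such a prefetch hint boils down to at the
syscall level - a minimal sketch only, not the patch itself (in PostgreSQL
the call is wrapped by PrefetchBuffer()/smgrprefetch() rather than being
issued directly like this):

#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192             /* PostgreSQL heap block size */

/* Hint the kernel that we'll soon read the given block of file "fd". */
static void
prefetch_block(int fd, unsigned int blocknum)
{
    (void) posix_fadvise(fd, (off_t) blocknum * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}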

Going between effective_io_concurrency = 0 and effective_io_concurrency = 10
with this patch, in some very quick testing on a laptop NVMe, while
making sure to drop caches and restart PG in between to clear out
shared_buffers, definitely shows that prefetching done this way is an
improvement over letting the kernel's normal read ahead handle it.

> >> My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with
> >> POSIX_FADV_RANDOM issued on block-range before, so maybe the optimization is possible, but not until we'll have AIO
> >> ;)
> >
> > (..)AIO is a whole other animal that's been discussed off and on
> >around here but it's a much larger and more invasive change than just
> >calling posix_fadvise().
>
> Yes, I'm aware and I'm keeping my fingers crossed that maybe some day....

I don't think we should throw out the idea of using PrefetchBuffer()
here.  "Real" AIO would certainly be good to have one of these days, but
until then, posix_fadvise() could net us some of those gains in the
meantime.

> The ANALYZE just seem fit to be natural candidate to use it. The only easy chance
> of acceleration of stats gathering - at least to me and enduser point of view -
> is to have more parallel autoanalyze workers running to drive more I/O concurrency
> (by e.g. partitioning the data),  both in readahead and non-readahead scenarios.
> Which is a pity because 70-80% of such process sits idle. The readahead might read
> 10x more unnecessary data, but pread() doesn't have to wait. <speculation>Once AIO
> is in, it could throw thousands of requests without readahead and probably achieve
> much better efficiency</speculation>

Actual parallel ANALYZE could be interesting, of course, but that's also
quite a bit more complicated to make happen, and even in that case it
would seem like we'd probably want to use posix_fadvise() anyway since
it'd hopefully help.

> I hope the previous simple patch goes into master and helps other people understand
> the picture more easily.

I agree that we should also be providing users with more information
about what's going on, but maybe we can do that and improve things by
leveraging posix_fadvise().

Thanks,

Stephen


Re: automatic analyze: readahead - add "IO read time" log message

From
Tomas Vondra
Date:
On Wed, Nov 04, 2020 at 09:07:59AM +0000, Jakub Wartak wrote:
>
>Greetings Stephen,
>
>I saw up 410MB/s for a few seconds with this patch on NVMe, and that's
>huge ~5.2x improvement which is amazing for a such simple patch.
>
>The system and data was identical like last time, so results are directly comparable
>to the previous post. The only change is that I've applied Yours patch on top of
>REL12_STABLE as this is cluster where I was having original data. I think it deserves
> it's own commitfest entry, right? Or do we combine those two?
>
>Each test was conducted with
>1. alter system set effective_io_concurrency=$N;
>2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart;
>3. analyze verbose t;
>4. the measurements were observed by pidstat -d and iostat -dxm.
>
>no readahead (blockdev --setra 0), cold cache:
>effective_io_concurrency=64; => ~26MB/s
>effective_io_concurrency=32; => ~29MB/s
>effective_io_concurrency=8; => ~31MB/s
>effective_io_concurrency=4; => ~32MB/s
>effective_io_concurrency=1; => ~32MB/s
>effective_io_concurrency=0; => ~32MB/s
>
>readahead=256 (128kB, Linux defaults), cold cache:
>effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s , baseline)
>effective_io_concurrency=1; => ~62MB/s
>effective_io_concurrency=4; => ~370MB/s
>effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100%
>effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)
>

Thanks for the testing, those are indeed quite promising numbers for
such a simple patch. I wonder how this would behave on different
storage - I'll see if I can do some testing, but I guess it should be
the same as for other places that do prefetching.

>I think one needs to be sure to restart, so that autovacuum
>workers get the new effective_io_concurrency, maybe that's a
>documentation thing only.
>
>I have just one question, with this patch the strace I/O of analyze
>look like below:
>
>pread(72, ..., 8192, 738181120) = 8192
>fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
>pread(72, "..., 8192, 738500608) = 8192
>fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
>pread(72, .., 8192, 738607104) = 8192
>fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
>pread(72, ..., 8192, 738754560) = 8192
>fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
>fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
>
>If you highlight "738754560" in the output it appears to duplicate the
>syscalls issued until it preads() - in case of "738754560" offset it was
>asked for 3 times. Also I wouldn't  imagine in wildest dreams that
>posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
>

IMHO that's a bug in the patch, which always tries to prefetch all
"future" blocks, including those that were already prefetched. It
probably needs to do something like bitmap heap scan where we track
what was already prefetched and only issue the new blocks.
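
Something along these lines, as a very rough sketch of that idea (the names
here are made up for illustration, not taken from the patch): keep a separate
prefetch position that only moves forward, so each sampled block is advised
at most once and stays a fixed distance ahead of the block currently being
read.

#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192

typedef struct
{
    int         next_to_read;       /* next sample block index to pread() */
    int         next_to_prefetch;   /* next sample block index to advise */
} SampleProgress;

/* Advise blocks so that we stay "distance" blocks ahead of the reader. */
static void
advance_prefetch(SampleProgress *p, int distance, int fd,
                 const unsigned int *sample_blocks, int nblocks)
{
    while (p->next_to_prefetch < nblocks &&
           p->next_to_prefetch < p->next_to_read + distance)
    {
        unsigned int blkno = sample_blocks[p->next_to_prefetch++];

        (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }
}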

BTW it seems your e-mail client does something funny, stripping the
"references" headers, which breaks threading and makes following the
discussion very hard (and will confuse the CF app too). If you look at
pgsql-hackers archives, each of your responses starts a new thread.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: automatic analyze: readahead - add "IO read time" log message

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On Wed, Nov 04, 2020 at 09:07:59AM +0000, Jakub Wartak wrote:
> >I saw up 410MB/s for a few seconds with this patch on NVMe, and that's
> >huge ~5.2x improvement which is amazing for a such simple patch.

Nice!

> >The system and data was identical like last time, so results are directly comparable
> >to the previous post. The only change is that I've applied Yours patch on top of
> >REL12_STABLE as this is cluster where I was having original data. I think it deserves
> >it's own commitfest entry, right? Or do we combine those two?
> >
> >Each test was conducted with
> >1. alter system set effective_io_concurrency=$N;
> >2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart;
> >3. analyze verbose t;
> >4. the measurements were observed by pidstat -d and iostat -dxm.
> >
> >no readahead (blockdev --setra 0), cold cache:
> >effective_io_concurrency=64; => ~26MB/s
> >effective_io_concurrency=32; => ~29MB/s
> >effective_io_concurrency=8; => ~31MB/s
> >effective_io_concurrency=4; => ~32MB/s
> >effective_io_concurrency=1; => ~32MB/s
> >effective_io_concurrency=0; => ~32MB/s

I'm speculating here, but it seems likely that by disabling read-ahead,
the posix_fadvise() calls are basically getting ignored and that's why
there's not much difference here...?

> >readahead=256 (128kB, Linux defaults), cold cache:
> >effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s , baseline)
> >effective_io_concurrency=1; => ~62MB/s
> >effective_io_concurrency=4; => ~370MB/s
> >effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100%
> >effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)

Very nice.

> Thanks for the testing, those are indeed quite promising numbers for
> such a simple patch. I wonder how would this behave on a different
> storage - I'll see if I can do some testing, but I guess it should be
> the same as for other places that do prefetching.

That would certainly be cool.

> >I think one needs to be sure to restart, so that autovacuum
> >workers get the new effective_io_concurrency, maybe that's a
> >documentation thing only.
> >
> >I have just one question, with this patch the strace I/O of analyze
> >look like below:
> >
> >pread(72, ..., 8192, 738181120) = 8192
> >fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
> >pread(72, "..., 8192, 738500608) = 8192
> >fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
> >pread(72, .., 8192, 738607104) = 8192
> >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
> >pread(72, ..., 8192, 738754560) = 8192
> >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
> >fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
> >
> >If you highlight "738754560" in the output it appears to duplicate the
> >syscalls issued until it preads() - in case of "738754560" offset it was
> >asked for 3 times. Also I wouldn't  imagine in wildest dreams that
> >posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
>
> IMHO that'a a bug in the patch, which always tries to prefetch all
> "future" blocks, including those that were already prefetched. It
> probably needs to do something like bitmap heap scan where we track
> what was already prefetched and only issue the new blocks.

Yeah, this was just something quickly thrown together to see if it'd
help.  I'll clean it up by, as you say, doing something similar to what
we do with bitmap heap scan, by having a separate BlockSampler that's
set up early on and then have the pre-fetching keep pace with the actual
scan instead of posix_fadvise()'ing the same pages over and over.

I'll also look at doing something for VACUUM too, as that could also
benefit from posix_fadvise() calls, when we're skipping over all-frozen
pages.

Thanks!

Stephen

Attachment

Re: automatic analyze: readahead - add "IO read time" log message

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >If you highlight "738754560" in the output it appears to duplicate the
> >syscalls issued until it preads() - in case of "738754560" offset it was
> >asked for 3 times. Also I wouldn't  imagine in wildest dreams that
> >posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
>
> IMHO that'a a bug in the patch, which always tries to prefetch all
> "future" blocks, including those that were already prefetched. It
> probably needs to do something like bitmap heap scan where we track
> what was already prefetched and only issue the new blocks.

Updated patch attached which:

- Starts out by pre-fetching the first effective_io_concurrency number
  of blocks we are going to want (a rough sketch of this priming step is
  below), hopefully making it so the kernel will trust our fadvises over
  its own read-ahead, right from the start.
- Makes sure the prefetch iterator is pushed forward whenever the
  regular iterator is moved forward.
- After each page read, issues a prefetch, similar to BitmapHeapScan, to
  hopefully avoid having the prefetching get in the way of the
  regular i/o.
- Added some comments, ran pgindent, added a commit message.
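
A rough sketch of what that initial priming step could look like (hedged -
the attached patch is authoritative; prefetch_bs is the extra BlockSampler
mentioned above):

#ifdef USE_PREFETCH
    /*
     * Prime the kernel with the first effective_io_concurrency blocks,
     * which also moves the prefetch BlockSampler out ahead of the main one.
     */
    for (int i = 0; i < effective_io_concurrency; i++)
    {
        BlockNumber prefetch_block;

        if (!BlockSampler_HasMore(&prefetch_bs))
            break;

        prefetch_block = BlockSampler_Next(&prefetch_bs);
        PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
    }
#endif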

I do think we should also include the patch that Jakub wrote previously,
which adds information about the read rate of ANALYZE.

I'll look at integrating that into this patch and then look at a new
patch to do something similar for VACUUM in a bit.

If you're doing further benchmarking of ANALYZE though, this would
probably be the better patch to use.  Certainly improved performance
here quite a bit with effective_io_concurrency set to 16.

Thanks,

Stephen

Attachment

Re: automatic analyze: readahead - add "IO read time" log message

From
Tomas Vondra
Date:
Hi,

On 11/4/20 5:02 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>>> If you highlight "738754560" in the output it appears to duplicate the
>>> syscalls issued until it preads() - in case of "738754560" offset it was
>>> asked for 3 times. Also I wouldn't  imagine in wildest dreams that
>>> posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
>>
>> IMHO that'a a bug in the patch, which always tries to prefetch all
>> "future" blocks, including those that were already prefetched. It
>> probably needs to do something like bitmap heap scan where we track
>> what was already prefetched and only issue the new blocks.
> 
> Updated patch attached which:
> 
> - Starts out by pre-fetching the first effective_io_concurrency number
>    of blocks we are going to want, hopefully making it so the kernel will
>    trust our fadvise's over its own read-ahead, right from the start.
> - Makes sure the prefetch iterator is pushed forward whenever the
>    regular interator is moved forward.
> - After each page read, issues a prefetch, similar to BitmapHeapScan, to
>    hopefully avoiding having the prefetching get in the way of the
>    regular i/o.
> - Added some comments, ran pgindent, added a commit message.
> 

Nice, that was quick ;-)

> I do think we should also include patch that Jakub wrote previously
> which adds information about the read rate of ANALYZE.
> 

+1

> I'll look at integrating that into this patch and then look at a new
> patch to do something similar for VACUUM in a bit.
> 

+1

> If you're doing further benchmarking of ANALYZE though, this would
> probably be the better patch to use.  Certainly improved performance
> here quite a bit with effective_io_concurrency set to 16.
> 

Yeah. I'd expect this to be heavily dependent on hardware.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: automatic analyze: readahead - add "IO read time" log message

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 11/4/20 5:02 PM, Stephen Frost wrote:
> >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >>>If you highlight "738754560" in the output it appears to duplicate the
> >>>syscalls issued until it preads() - in case of "738754560" offset it was
> >>>asked for 3 times. Also I wouldn't  imagine in wildest dreams that
> >>>posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
> >>
> >>IMHO that'a a bug in the patch, which always tries to prefetch all
> >>"future" blocks, including those that were already prefetched. It
> >>probably needs to do something like bitmap heap scan where we track
> >>what was already prefetched and only issue the new blocks.
> >
> >Updated patch attached which:
> >
> >- Starts out by pre-fetching the first effective_io_concurrency number
> >   of blocks we are going to want, hopefully making it so the kernel will
> >   trust our fadvise's over its own read-ahead, right from the start.
> >- Makes sure the prefetch iterator is pushed forward whenever the
> >   regular interator is moved forward.
> >- After each page read, issues a prefetch, similar to BitmapHeapScan, to
> >   hopefully avoiding having the prefetching get in the way of the
> >   regular i/o.
> >- Added some comments, ran pgindent, added a commit message.
>
> Nice, that was quick ;-)

:)

> >I do think we should also include patch that Jakub wrote previously
> >which adds information about the read rate of ANALYZE.
>
> +1

Attached is an updated patch which updates the documentation and
integrates Jakub's initial work on improving the logging around
auto-analyze (and I made the logging in auto-vacuum more-or-less match
it).

> >I'll look at integrating that into this patch and then look at a new
> >patch to do something similar for VACUUM in a bit.
>
> +1

I spent some time looking into this but it's a bit complicated.  For
some sound reasons, VACUUM will avoid skipping through a table when
there's only a few pages that it could skip (not skipping allows us to
move forward the relfrozenxid).  That said, perhaps we could start doing
prefetching once we've decided that we're skipping.  We'd need to think
about whether we have to worry about the VM changing between the pre-fetching
and the time when we're actually going to ask for the page.  I don't
*think* that's an issue because only VACUUM would be changing the pages
to be all-frozen or all-visible, and so if we see a page that isn't one
of those then we're going to want to visit that page and that's not
going to change, and we probably don't need to care about a page that
used to be all-frozen and now isn't during this run - but if the prefetch
went ahead and got page 10, and now page 8 is not-all-frozen and the
actual scan is at page 5, then maybe it wants page 8 next and that isn't
what we pre-fetched...

Anyhow, all-in-all, definitely more complicated and probably best
considered and discussed independently.

> >If you're doing further benchmarking of ANALYZE though, this would
> >probably be the better patch to use.  Certainly improved performance
> >here quite a bit with effective_io_concurrency set to 16.
>
> Yeah. I'd expect this to be heavily dependent on hardware.

Sure, I agree with that too.

Thanks,

Stephen

Attachment

Re: automatic analyze: readahead - add "IO read time" log message

From
Tomas Vondra
Date:
On 11/9/20 7:06 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 11/4/20 5:02 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>>>>> If you highlight "738754560" in the output it appears to duplicate the
>>>>> syscalls issued until it preads() - in case of "738754560" offset it was
>>>>> asked for 3 times. Also I wouldn't  imagine in wildest dreams that
>>>>> posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
>>>>
>>>> IMHO that'a a bug in the patch, which always tries to prefetch all
>>>> "future" blocks, including those that were already prefetched. It
>>>> probably needs to do something like bitmap heap scan where we track
>>>> what was already prefetched and only issue the new blocks.
>>>
>>> Updated patch attached which:
>>>
>>> - Starts out by pre-fetching the first effective_io_concurrency number
>>>   of blocks we are going to want, hopefully making it so the kernel will
>>>   trust our fadvise's over its own read-ahead, right from the start.
>>> - Makes sure the prefetch iterator is pushed forward whenever the
>>>   regular interator is moved forward.
>>> - After each page read, issues a prefetch, similar to BitmapHeapScan, to
>>>   hopefully avoiding having the prefetching get in the way of the
>>>   regular i/o.
>>> - Added some comments, ran pgindent, added a commit message.
>>
>> Nice, that was quick ;-)
> 
> :)
> 
>>> I do think we should also include patch that Jakub wrote previously
>>> which adds information about the read rate of ANALYZE.
>>
>> +1
> 
> Attached is an updated patch which updates the documentation and
> integrates Jakub's initial work on improving the logging around
> auto-analyze (and I made the logging in auto-vacuum more-or-less match
> it).
> 

Thanks. I'll do some testing/benchmarking once my machines are free, in
a couple days perhaps. But as I said before, I don't expect this to
behave very differently from other places that already do prefetching.

>>> I'll look at integrating that into this patch and then look at a new
>>> patch to do something similar for VACUUM in a bit.
>>
>> +1
> 
> I spent some time looking into this but it's a bit complicated..  For
> some sound reasons, VACUUM will avoid skipping through a table when
> there's only a few pages that it could skip (not skipping allows us to
> move forward the relfrozenxid).  That said, perhaps we could start doing
> prefetching once we've decided that we're skipping.  We'd need to think
> about if we have to worry about the VM changing between the pre-fetching
> and the time when we're actually going to ask for the page..  I don't
> *think* that's an issue because only VACUUM would be changing the pages
> to be all-frozen or all-visible, and so if we see a page that isn't one
> of those then we're going to want to visit that page and that's not
> going to change, and we probably don't need to care about a page that
> used to be all-frozen and now isn't during this run- but if the prefetch
> went ahead and got page 10, and now page 8 is not-all-frozen and the
> actual scan is at page 5, then maybe it wants page 8 next and that isn't
> what we pre-fetched...
> 
> Anyhow, all-in-all, definitely more complicated and probably best
> considered and discussed independently.

+1

FWIW I wonder if this should be tracked separately in the CF app, as
it's very different from the original "add some logging" patch, which
makes the CF entry rather misleading.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: automatic analyze: readahead - add "IO read time" log message

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
> Thanks. I'll do some testing/benchmarking once my machines are free, in
> a couple days perhaps. But as I said before, I don't expect this to
> behave very differently from other places that already do prefetching.

Agreed, but would still be nice to see test results beyond just what
I've done.

> FWIW I wonder if this should be tracked separately in the CF app, as
> it's very different from the original "add some logging" patch, which
> makes the CF entry rather misleading.

I've gone ahead and updated the CF entry for this to hopefully make it
clearer for those interested in looking at it.  I'll try to come back to
this in the next CF, ideally we'd at least get someone else to take a
look at the code beyond me. :)  (Obviously, you looked at it some, but
wasn't really clear if you were alright with it or if you felt it needed
more review.)

Thanks!

Stephen

Attachment

Re: automatic analyze: readahead - add "IO read time" log message

From
Stephen Frost
Date:
Greetings,

* Stephen Frost (sfrost@snowman.net) wrote:
> * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
> > Thanks. I'll do some testing/benchmarking once my machines are free, in
> > a couple days perhaps. But as I said before, I don't expect this to
> > behave very differently from other places that already do prefetching.
>
> Agreed, but would still be nice to see test results beyond just what
> I've done.

Any chance you were able to run those tests..?

> > FWIW I wonder if this should be tracked separately in the CF app, as
> > it's very different from the original "add some logging" patch, which
> > makes the CF entry rather misleading.
>
> I've gone ahead and updated the CF entry for this to hopefully make it
> clearer for those interested in looking at it.  I'll try to come back to
> this in the next CF, ideally we'd at least get someone else to take a
> look at the code beyond me. :)  (Obviously, you looked at it some, but
> wasn't really clear if you were alright with it or if you felt it needed
> more review.)

I've updated the patch to leverage Tom's introduction of
TimestampDifferenceMilliseconds, which simplifies things a bit (and I
don't think we need to worry about an analyze taking over 25 days...)
and generally rebased this up to current HEAD.
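
For anyone curious, the simplification amounts to something like the
following sketch (illustrative variable names only, not necessarily what
the patch uses; starttime is assumed to have been taken with
GetCurrentTimestamp() before the sampling loop, and pages_read stands for
however many blocks the sampling scan ended up reading):

    long        delay_in_ms;
    double      read_rate = 0;

    /* elapsed wall-clock time of the sampling, in milliseconds */
    delay_in_ms = TimestampDifferenceMilliseconds(starttime,
                                                  GetCurrentTimestamp());
    if (delay_in_ms > 0)
        read_rate = (double) BLCKSZ * pages_read / (1024 * 1024) /
            (delay_in_ms / 1000.0);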

Would be great to get a review / comments from others as to if there's
any concerns.  I'll admit that it seems reasonably straight-forward to
me, but hey, I wrote most of it, so that's not really a fair
assessment... ;)

Thanks,

Stephen

Attachment

Re: automatic analyze: readahead - add "IO read time" log message

From
Heikki Linnakangas
Date:
On 13/01/2021 23:17, Stephen Frost wrote:
> Would be great to get a review / comments from others as to if there's
> any concerns.  I'll admit that it seems reasonably straight-forward to
> me, but hey, I wrote most of it, so that's not really a fair
> assessment... ;)

Looks good overall. A few minor comments:

The patch consists of two parts: add stats to the log for auto-analyze,
and implement prefetching. They seem like independent features; consider
splitting into two patches.

It's a bit weird that you get more stats in the log for 
autovacuum/autoanalyze than you get with VACUUM/ANALYZE VERBOSE. Not 
really this patch's fault though.

This conflicts with the patch at 
https://commitfest.postgresql.org/31/2907/, to refactor the table AM 
analyze API. That's OK, it's straightforward to resolve regardless of 
which patch is committed first.

>     /* Outer loop over blocks to sample */
>     while (BlockSampler_HasMore(&bs))
>     {
> #ifdef USE_PREFETCH
>         BlockNumber prefetch_targblock = InvalidBlockNumber;
> #endif
>         BlockNumber targblock = BlockSampler_Next(&bs);
> 
> #ifdef USE_PREFETCH
> 
>         /*
>          * Make sure that every time the main BlockSampler is moved forward
>          * that our prefetch BlockSampler also gets moved forward, so that we
>          * always stay out ahead.
>          */
>         if (BlockSampler_HasMore(&prefetch_bs))
>             prefetch_targblock = BlockSampler_Next(&prefetch_bs);
> #endif
> 
>         vacuum_delay_point();
> 
>         if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
>             continue;
> 
> #ifdef USE_PREFETCH
> 
>         /*
>          * When pre-fetching, after we get a block, tell the kernel about the
>          * next one we will want, if there's any left.
>          */
>         if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber)
>             PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
> #endif
> 
>         while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
>         {
>             ...
>         }
> 
>         pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
>                                      ++blksdone);
>     }

If effective_io_concurrency == 0, this calls 
BlockSampler_Next(&prefetch_bs) anyway, which is a waste of cycles.

If table_scan_analyze_next_block() returns false, we skip the 
PrefetchBuffer() call. That seems wrong.

Is there any potential harm from calling PrefetchBuffer() on a page that 
table_scan_analyze_next_block() later deems as unsuitable for sampling 
and returns false? That's theoretical at the moment, because 
heapam_scan_analyze_next_block() always returns true. (The tableam 
ANALYZE API refactor patch will make this moot, as it moves this logic 
into the tableam's implementation, so the implementation can do whatever 
makes sense for the particular AM.)

- Heikki



Re: automatic analyze: readahead - add "IO read time" log message

From
Stephen Frost
Date:
Greetings,

* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
> On 13/01/2021 23:17, Stephen Frost wrote:
> >Would be great to get a review / comments from others as to if there's
> >any concerns.  I'll admit that it seems reasonably straight-forward to
> >me, but hey, I wrote most of it, so that's not really a fair
> >assessment... ;)
>
> Look good overall. A few minor comments:

Thanks a lot for the review!

> The patch consists of two part: add stats to the log for auto-analyze, and
> implement prefetching. They seem like independent features, consider
> splitting into two patches.

Yeah, that's a good point.  I had anticipated that there would be
overlap but in the end there really wasn't.  Done in the attached.

> It's a bit weird that you get more stats in the log for
> autovacuum/autoanalyze than you get with VACUUM/ANALYZE VERBOSE. Not really
> this patch's fault though.

Agreed.

> This conflicts with the patch at https://commitfest.postgresql.org/31/2907/,
> to refactor the table AM analyze API. That's OK, it's straightforward to
> resolve regardless of which patch is committed first.

Agreed.

> >    /* Outer loop over blocks to sample */
> >    while (BlockSampler_HasMore(&bs))
> >    {
> >#ifdef USE_PREFETCH
> >        BlockNumber prefetch_targblock = InvalidBlockNumber;
> >#endif
> >        BlockNumber targblock = BlockSampler_Next(&bs);
> >
> >#ifdef USE_PREFETCH
> >
> >        /*
> >         * Make sure that every time the main BlockSampler is moved forward
> >         * that our prefetch BlockSampler also gets moved forward, so that we
> >         * always stay out ahead.
> >         */
> >        if (BlockSampler_HasMore(&prefetch_bs))
> >            prefetch_targblock = BlockSampler_Next(&prefetch_bs);
> >#endif
> >
> >        vacuum_delay_point();
> >
> >        if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
> >            continue;
> >
> >#ifdef USE_PREFETCH
> >
> >        /*
> >         * When pre-fetching, after we get a block, tell the kernel about the
> >         * next one we will want, if there's any left.
> >         */
> >        if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber)
> >            PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
> >#endif
> >
> >        while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
> >        {
> >            ...
> >        }
> >
> >        pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
> >                                     ++blksdone);
> >    }
>
> If effective_io_concurrency == 0, this calls BlockSampler_Next(&prefetch_bs)
> anyway, which is a waste of cycles.

Good point, fixed.

> If table_scan_analyze_next_block() returns false, we skip the
> PrefetchBuffer() call. That seem wrong.

Agreed, fixed.
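
In case it helps review, the reordering ends up along these lines (a
sketch only - the attached patch is what actually matters):

while (BlockSampler_HasMore(&bs))
{
    bool        block_accepted;
    BlockNumber targblock = BlockSampler_Next(&bs);
#ifdef USE_PREFETCH
    BlockNumber prefetch_targblock = InvalidBlockNumber;

    /* Only pay for the prefetch sampler when prefetching is enabled. */
    if (effective_io_concurrency && BlockSampler_HasMore(&prefetch_bs))
        prefetch_targblock = BlockSampler_Next(&prefetch_bs);
#endif

    vacuum_delay_point();

    block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);

#ifdef USE_PREFETCH
    /*
     * Issue the hint even if this block was rejected for sampling, so
     * the prefetch stream keeps pace with the main scan.
     */
    if (prefetch_targblock != InvalidBlockNumber)
        PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
#endif

    if (!block_accepted)
        continue;

    /* ... sample tuples from the block as before ... */
}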

> Is there any potential harm from calling PrefetchBuffer() on a page that
> table_scan_analyze_next_block() later deems as unsuitable for smapling and
> returns false? That's theoretical at the moment, because
> heapam_scan_analyze_next_block() always returns true. (The tableam ANALYZE
> API refactor patch will make this moot, as it moves this logic into the
> tableam's implementation, so the implementation can do whatever make sense
> for the particular AM.)

I can't see any potential harm, and it seems pretty likely that if a
heapam_scan_analyze_next_block()-equivalent were to decide that a block
isn't appropriate to analyze it'd have to do so after reading that block
anyway, making the prefetch still useful.

Perhaps there'll be a case in the future where a given AM would know
based on just the block number that it isn't useful to analyze, in which
case it'd make sense to adjust the code to skip that block for both
prefetching and actually reading, but I don't think that would be too
hard to do.  Doesn't seem sensible to invent that in advance of actually
having that case though - it's certainly not the case for heap AM today,
at least, as you say.

Unless there's anything else on this, I'll commit these sometime next
week.

Thanks again for the review!

Stephen

Attachment

Re: automatic analyze: readahead - add "IO read time" log message

From
Heikki Linnakangas
Date:
On 05/02/2021 23:22, Stephen Frost wrote:
> Unless there's anything else on this, I'll commit these sometime next
> week.

One more thing: instead of using the 'effective_io_concurrency' GUC
directly, this should call get_tablespace_maintenance_io_concurrency().
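
i.e. something along these lines (just a sketch, assuming the relation
being analyzed is at hand as onerel and the result is kept in a
prefetch_maximum variable):

    /* respect the per-tablespace maintenance_io_concurrency setting */
    prefetch_maximum =
        get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);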

- Heikki



Re: automatic analyze: readahead - add "IO read time" log message

From
Stephen Frost
Date:
Greetings,

* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
> On 05/02/2021 23:22, Stephen Frost wrote:
> >Unless there's anything else on this, I'll commit these sometime next
> >week.
>
> One more thing: Instead of using 'effective_io_concurrency' GUC directly,
> should call get_tablespace_maintenance_io_concurrency().

Ah, yeah, of course.

Updated patch attached.

Thanks!

Stephen

Attachment

Re: automatic analyze: readahead - add "IO read time" log message

From
Egor Rogov
Date:
Hi,

On 11.02.2021 01:10, Stephen Frost wrote:

> Greetings,
>
> * Heikki Linnakangas (hlinnaka@iki.fi) wrote:
>> On 05/02/2021 23:22, Stephen Frost wrote:
>>> Unless there's anything else on this, I'll commit these sometime next
>>> week.
>> One more thing: Instead of using 'effective_io_concurrency' GUC directly,
>> should call get_tablespace_maintenance_io_concurrency().
> Ah, yeah, of course.
>
> Updated patch attached.


I happened to notice that get_tablespace_io_concurrency() is called 
instead of get_tablespace_maintenance_io_concurrency(). It doesn't look 
right, no?


Regards,
Egor Rogov.




Re: automatic analyze: readahead - add "IO read time" log message

From
Stephen Frost
Date:
Greetings,

* Egor Rogov (e.rogov@postgrespro.ru) wrote:
> On 11.02.2021 01:10, Stephen Frost wrote:
> >* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
> >>On 05/02/2021 23:22, Stephen Frost wrote:
> >>>Unless there's anything else on this, I'll commit these sometime next
> >>>week.
> >>One more thing: Instead of using 'effective_io_concurrency' GUC directly,
> >>should call get_tablespace_maintenance_io_concurrency().
> >Ah, yeah, of course.
> >
> >Updated patch attached.
>
> I happened to notice that get_tablespace_io_concurrency() is called instead
> of get_tablespace_maintenance_io_concurrency(). It doesn't look right, no?

Hah, indeed you're right.  Will fix.

Thanks!

Stephen

Attachment

Re: automatic analyze: readahead - add "IO read time" log message

From
Stephen Frost
Date:
Greetings,

* Stephen Frost (sfrost@snowman.net) wrote:
> * Egor Rogov (e.rogov@postgrespro.ru) wrote:
> > On 11.02.2021 01:10, Stephen Frost wrote:
> > >* Heikki Linnakangas (hlinnaka@iki.fi) wrote:
> > >>On 05/02/2021 23:22, Stephen Frost wrote:
> > >>>Unless there's anything else on this, I'll commit these sometime next
> > >>>week.
> > >>One more thing: Instead of using 'effective_io_concurrency' GUC directly,
> > >>should call get_tablespace_maintenance_io_concurrency().
> > >Ah, yeah, of course.
> > >
> > >Updated patch attached.
> >
> > I happened to notice that get_tablespace_io_concurrency() is called instead
> > of get_tablespace_maintenance_io_concurrency(). It doesn't look right, no?
>
> Hah, indeed you're right.  Will fix.

Found this under a bit of a pile in my todo list. :)

Fix pushed.

Thanks!

Stephen

Attachment