automatic analyze: readahead - add "IO read time" log message

From: Jakub Wartak
Subject: automatic analyze: readahead - add "IO read time" log message
Date:
Msg-id: VI1PR0701MB6960361E83FE46E71FBC8811F6190@VI1PR0701MB6960.eurprd07.prod.outlook.com
List: pgsql-hackers
Greetings hackers,

I have what I hope is an interesting observation (and a nano patch proposal) from a system where statistics freshness is a critical factor. Autovacuum/automatic statistics gathering was tuned to be very aggressive:
autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
autovacuum_naptime=1s
autovacuum_max_workers=4

Some critical table partitions are configured with autovacuum_analyze_scale_factor=0.001 and autovacuum_analyze_threshold=50000 to force auto-analyze jobs to run pretty often. The interesting log entries look like this:
automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 s, elapsed: 252.14 s
automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 s, elapsed: 317.33 s
automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 s, elapsed: 213.58 s
automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 s, elapsed: 229.45 s

and this is the root cause of my question. As you can see there is a huge 3x-4x discrepancy between "elapsed" and user+system, which is strange, at least to me, as there should be no waiting (autovacuum_vacuum_cost_delay=0). According to various tools that is indeed the case: the time was spent somewhere else, but not inside PostgreSQL's analyze code. ps(1) and pidstat(1) also report the same for the worker:

06:56:12 AM       PID    %usr %system  %guest    %CPU   CPU  Command
06:56:13 AM    114774    8.00   10.00    0.00   18.00    18  postgres
06:56:14 AM    114774    8.00   11.00    0.00   19.00    15  postgres
06:56:15 AM    114774    5.00   13.00    0.00   18.00    18  postgres

06:56:17 AM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
06:56:18 AM    114774  63746.53      0.00      0.00  postgres
06:56:19 AM    114774  62896.00      0.00      0.00  postgres
06:56:20 AM    114774  62920.00      0.00      0.00  postgres

One could argue that such an autoanalyze worker could perform 5x more work here (%CPU -> 100%). The I/O system is not doing much (total = 242 MB/s reads @ 22k IOPS, 7 MB/s writes @ 7k IOPS, with 0.04 ms service time); although it reports high utilization, I'm pretty sure it could push much more. There can be up to 3x-4x such 70-80 MB/s analyzes running in parallel (say 300 MB/s for statistics collection alone).

According to various gdb backtraces, a lot of time is spent here:
#0  0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192, __buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84
#2  FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005", amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
#3  0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
#4  0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902, mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f) at bufmgr.c:897
#5  0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimized out>, strategy=<optimized out>) at bufmgr.c:665
#6  0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>, blockno=<optimized out>, bstrategy=<optimized out>) at heapam_handler.c:998
#7  0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>, blockno=<optimized out>, scan=0x10c8098) at ../../../src/include/access/tableam.h:1462
#8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, targrows=1500000, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at analyze.c:1048
#9  0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>, relpages=26763525, inh=false, in_outer_xact=false, elevel=13) at analyze.c:502
[..]
#12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, tab=<optimized out>) at autovacuum.c:3118
[..]

The interesting thing is that targrows=1500000 (1.5 million) and that blocks are accessed (as expected) in sorted order:

Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, bstrategy=0x1102278) at heapam_handler.c:984
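
(Side note on why the block numbers come out ascending: as far as I understand, ANALYZE's block selection behaves like Knuth's selection sampling, which walks blocks 0..nblocks-1 in order and keeps each one with the appropriate probability. Below is a simplified, self-contained sketch of that idea only, not the actual analyze.c code, and with toy numbers:)

/*
 * Simplified selection-sampling sketch (Knuth's Algorithm S): every block is
 * considered once, in order, and kept with probability
 * (still wanted) / (blocks remaining), so chosen blocks come out ascending.
 */
#include <stdio.h>
#include <stdlib.h>

static void
sample_blocks(unsigned long nblocks, unsigned long wanted)
{
    unsigned long chosen = 0;

    for (unsigned long b = 0; b < nblocks && chosen < wanted; b++)
    {
        if (drand48() * (double) (nblocks - b) < (double) (wanted - chosen))
        {
            printf("block %lu\n", b);   /* always printed in ascending order */
            chosen++;
        }
    }
}

int
main(void)
{
    srand48(42);
    /* toy sizes; the real case above had relpages=26763525, targrows=1500000 */
    sample_blocks(1000, 50);
    return 0;
}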

And this probably explains the discrepancy: perf with CPU usage reporting shows a lot of frames waiting on I/O in readaheads done by ext4 (trimmed for clarity):

# Children      Self       sys       usr  Command   Shared Object       Symbol
    63.64%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
            ---entry_SYSCALL_64_after_hwframe
               do_syscall_64
               |--59.66%--sys_pread64
               |          vfs_read
               |           --59.09%--__vfs_read
               |                      --58.24%--generic_file_read_iter
               |                                |--47.44%--ondemand_readahead
               |                                |           --46.88%--__do_page_cache_readahead
               |                                |                     |--32.67%--ext4_mpage_readpages
               |                                |                     |          |--16.76%--submit_bio
               |                                |                     |--10.23%--blk_finish_plug
[..]
    63.64%     1.99%     1.99%     0.00%  postgres  [kernel.kallsyms]   [k] do_syscall_64
            |--61.65%--do_syscall_64
            |          |--59.66%--sys_pread64
            |          |          vfs_read
            |          |           --59.09%--__vfs_read
            |          |                      --58.24%--generic_file_read_iter
            |          |                                |--47.44%--ondemand_readahead
            |          |                                |           --46.88%--__do_page_cache_readahead

    61.36%     0.00%     0.00%     0.00%  postgres  postgres            [.] FileRead
            ---FileRead
               __pread_nocancel
                --60.51%--entry_SYSCALL_64_after_hwframe
                          do_syscall_64
                           --59.66%--sys_pread64
                                     vfs_read
                                      --59.09%--__vfs_read
                                                 --58.24%--generic_file_read_iter
                                                           |--47.44%--ondemand_readahead
                                                           |           --46.88%--__do_page_cache_readahead

    61.36%     0.85%     0.00%     0.85%  postgres  libpthread-2.17.so  [.] __pread_nocancel
            |--60.51%--__pread_nocancel
            |          entry_SYSCALL_64_after_hwframe
            |          do_syscall_64
            |           --59.66%--sys_pread64
            |                     vfs_read
            |                      --59.09%--__vfs_read
            |                                 --58.24%--generic_file_read_iter
            |                                           |--47.44%--ondemand_readahead
            |                                           |           --46.88%--__do_page_cache_readahead


    59.66%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]   [k] sys_pread64
            ---sys_pread64
               vfs_read
                --59.09%--__vfs_read
                           --58.24%--generic_file_read_iter
                                     |--47.44%--ondemand_readahead
                                     |           --46.88%--__do_page_cache_readahead
                                     |                     |--32.67%--ext4_mpage_readpages


[..]
perf --no-children also confirms, for the third time, that there isn't any single function burning a lot of CPU inside the worker:

# Overhead       sys       usr  Command   Shared Object       Symbol
     5.40%     0.00%     5.40%  postgres  [vdso]              [.] __vdso_clock_gettime
     5.11%     0.00%     5.11%  postgres  postgres            [.] acquire_sample_rows
            ---acquire_sample_rows
     3.98%     0.00%     3.98%  postgres  postgres            [.] heapam_scan_analyze_next_tuple
            ---heapam_scan_analyze_next_tuple
     3.69%     3.69%     0.00%  postgres  [kernel.kallsyms]   [k] pvclock_clocksource_read

My questions are:
a) Does anybody know whether it is expected that getrusage() doesn't include readahead time in the current process's system time? (I don't know; maybe the readahead is performed by other kernel threads?) ru_stime is defined as "This is the total amount of time spent executing in kernel mode". Maybe "executing" is the keyword here (waiting != executing)?
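
To make the waiting != executing point concrete, here is a toy, self-contained program (hypothetical, not part of the patch) that pread()s a file in 8 kB chunks and compares getrusage() CPU time against wall-clock time; for a large uncached file I would expect elapsed to be far larger than user+system, just like in the autoanalyze log lines above:

/* Toy demo: CPU time from getrusage() vs. wall-clock time for reading a file. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char        buf[8192];
    struct rusage ru;
    struct timeval start, end;
    off_t       off = 0;
    ssize_t     n;
    int         fd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <big-uncached-file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    gettimeofday(&start, NULL);
    while ((n = pread(fd, buf, sizeof(buf), off)) > 0)   /* 8 kB reads, like a buffer manager */
        off += n;
    gettimeofday(&end, NULL);
    getrusage(RUSAGE_SELF, &ru);

    /* time spent blocked on I/O shows up only in "elapsed", not in user/system */
    printf("elapsed: %.2f s, user: %.2f s, system: %.2f s\n",
           (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6,
           ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6,
           ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6);
    close(fd);
    return 0;
}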

b) Initially I wanted to add a new pg_rusage_show_verbose() that would also report ru_inblock, but that wouldn't add much value for the end user. Adding another timing directly around table_scan_analyze_next_block() also seems like a bad idea, as it involves locking underneath. So I tried the easiest approach: simply log pgStatBlockReadTime as strictly the I/O time spent in pread() (ReadBuffer_common() already measures it). The attached patch for PgSQL 14-devel, under heavy I/O conditions (with track_io_timing=on), logs the following:
"LOG:  automatic analyze of table "test.public.t1_default" system usage: IO read time 0.69 s, CPU: user: 0.18 s,
system:0.13 s, elapsed: 0.92 s" 
My interpretation would be that I/O read time was the most limiting factor (0.69/0.92 ≈ 75%), while *CPU* time on the kernel side was just 0.13 s. This could give the end user/DBA the information needed about where the bottleneck is, given autovacuum_vacuum_cost_delay=0. With autovacuum_vacuum_cost_delay>0, maybe it would also make sense to include the time spent sleeping?
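
For reference, here is a rough sketch of the kind of change I mean in do_analyze_rel() (the attachment is the authoritative patch; names and placement below are approximate):

    /*
     * Sketch only: remember the cumulative block-read time (microseconds,
     * maintained by pgstat when track_io_timing is on) before the sampling...
     */
    PgStat_Counter startreadtime = 0;

    if (track_io_timing)
        startreadtime = pgStatBlockReadTime;

    /* ... existing acquire_sample_rows() / statistics computation ... */

    /* ... and report the delta in the existing autoanalyze log line */
    if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
    {
        StringInfoData buf;

        initStringInfo(&buf);
        if (track_io_timing)
            appendStringInfo(&buf, "IO read time %.2f s, ",
                             (double) (pgStatBlockReadTime - startreadtime) / 1000000.0);
        appendStringInfoString(&buf, pg_rusage_show(&ru0));

        ereport(LOG,
                (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: %s",
                        get_database_name(MyDatabaseId),
                        get_namespace_name(RelationGetNamespace(onerel)),
                        RelationGetRelationName(onerel),
                        buf.data)));
        pfree(buf.data);
    }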

c) I'm curious whether anybody has any I/O-related insights into analyze.c processing, especially related to readahead. E.g. maybe disabling readahead would help the PostgreSQL analyze.c use case on NVMe? Is it worth it, given that only x% of the blocks are needed? The only option I'm aware of would be to e.g. hash-partition the table (to introduce parallelism by autovacuum and engage even more workers). Any hints or comments?
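
Regarding the "maybe disabling readahead would help" part, one low-effort experiment (a hypothetical hack, not existing PostgreSQL behavior) would be a per-descriptor hint, since POSIX_FADV_RANDOM disables kernel readahead for that fd on Linux:

#include <fcntl.h>

/*
 * Hypothetical experiment only: disable kernel readahead for one file
 * descriptor, so only the 8 kB blocks ANALYZE actually requests get read.
 */
static void
disable_readahead(int fd)
{
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);  /* best-effort hint */
}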

All of the above observations are from PostgreSQL 12.4 on Linux kernel 4.14 with ext4 on a striped dm volume across 3-4 NVMe devices.

-Jakub Wartak.
Attachment
