automatic analyze: readahead - add "IO read time" log message
From: Jakub Wartak
Subject: automatic analyze: readahead - add "IO read time" log message
Date:
Msg-id: VI1PR0701MB6960361E83FE46E71FBC8811F6190@VI1PR0701MB6960.eurprd07.prod.outlook.com
Responses: Re: automatic analyze: readahead - add "IO read time" log message
List: pgsql-hackers
Greetings hackers,

I have what I hope is an interesting observation (and a nano patch proposal) from a system where statistics freshness is a critical factor. Autovacuum/automatic gathering of statistics was tuned to be very aggressive:

autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
autovacuum_naptime=1s
autovacuum_max_workers=4

Some critical table partitions are configured with autovacuum_analyze_scale_factor=0.001 and autovacuum_analyze_threshold=50000 to force auto-analyze jobs to run pretty often. The interesting log entries look like this:

automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 s, elapsed: 252.14 s
automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 s, elapsed: 317.33 s
automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 s, elapsed: 213.58 s
automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 s, elapsed: 229.45 s

and this is the root cause of my question. As you can see there is a huge 3x-4x discrepancy between "elapsed" and user+system, which is strange, at least to me, as there should be no waiting (autovacuum_vacuum_cost_delay=0). According to various tools this is true: the time was wasted somewhere else, but not in the PostgreSQL analyze itself. ps(1) and pidstat(1) report the same for the worker:

06:56:12 AM   PID      %usr  %system  %guest   %CPU  CPU  Command
06:56:13 AM   114774   8.00  10.00    0.00     18.00 18   postgres
06:56:14 AM   114774   8.00  11.00    0.00     19.00 15   postgres
06:56:15 AM   114774   5.00  13.00    0.00     18.00 18   postgres

06:56:17 AM   PID      kB_rd/s    kB_wr/s   kB_ccwr/s  Command
06:56:18 AM   114774   63746.53   0.00      0.00       postgres
06:56:19 AM   114774   62896.00   0.00      0.00       postgres
06:56:20 AM   114774   62920.00   0.00      0.00       postgres

One could argue that such an autoanalyze worker could perform 5x more work here (%CPU -> 100%). The I/O system is not doing a lot either (total = 242MB/s reads @ 22k IOPS, 7MB/s writes @ 7k IOPS, with 0.04ms service time); although it reports high utilization, I'm pretty sure it could push much more. There can be up to 3x-4x such 70-80MB/s analyzes running in parallel (let's say 300MB/s for statistics collection alone).
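To convince myself that this is simply how getrusage() accounting behaves (time spent blocked inside pread() waiting for the device shows up in wall-clock elapsed time, but not in ru_utime/ru_stime), I put together the following throwaway standalone program. It has nothing to do with the patch, it is purely illustrative: it reads whatever large file you point it at in 8kB chunks and prints both views. Drop the page cache first (e.g. echo 3 > /proc/sys/vm/drop_caches) to make the effect visible.

/*
 * rusage_vs_elapsed.c -- illustrative sketch only, not part of the patch.
 * Reads the given file in 8kB preads and prints getrusage() user/system
 * time vs wall-clock elapsed time.  With a cold page cache most of the
 * elapsed time is spent blocked in pread(), which is not accounted as
 * ru_utime/ru_stime -- the same pattern as in the autoanalyze log lines.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

static double tv2d(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    char            buf[8192];
    struct rusage   ru0, ru1;
    struct timespec t0, t1;
    off_t           off = 0;
    ssize_t         n;
    int             fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
    {
        perror("open");
        return 1;
    }

    getrusage(RUSAGE_SELF, &ru0);
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* read the whole file sequentially in block-sized (8kB) preads */
    while ((n = pread(fd, buf, sizeof(buf), off)) > 0)
        off += n;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    getrusage(RUSAGE_SELF, &ru1);

    printf("user: %.2f s, system: %.2f s, elapsed: %.2f s\n",
           tv2d(ru1.ru_utime) - tv2d(ru0.ru_utime),
           tv2d(ru1.ru_stime) - tv2d(ru0.ru_stime),
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    close(fd);
    return 0;
}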
According to various gdb backtraces, a lot of time is spent here:

#0  0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192, __buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84
#2  FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005", amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
#3  0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
#4  0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902, mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f) at bufmgr.c:897
#5  0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimized out>, strategy=<optimized out>) at bufmgr.c:665
#6  0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>, blockno=<optimized out>, bstrategy=<optimized out>) at heapam_handler.c:998
#7  0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>, blockno=<optimized out>, scan=0x10c8098) at ../../../src/include/access/tableam.h:1462
#8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, targrows=1500000, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at analyze.c:1048
#9  0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>, relpages=26763525, inh=false, in_outer_xact=false, elevel=13) at analyze.c:502
[..]
#12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, tab=<optimized out>) at autovacuum.c:3118
[..]

The interesting thing is that targrows=1.5 million, and that blocks are accessed (as expected) in sorted order:

Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, bstrategy=0x1102278) at heapam_handler.c:984
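One way to look at the readahead angle in isolation (again a rough standalone sketch, not part of the patch) would be to mimic what acquire_sample_rows() effectively does to the kernel: pread() 8kB blocks at sorted-but-sparse offsets of a big cold file, optionally calling posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) first to switch readahead off for that descriptor, and then compare elapsed time and kB_rd/s between the two modes. The file name, the "norandom" switch and the sampling step are all just made up for illustration:

/*
 * analyze_readahead_sketch.c -- illustrative sketch, not the attached patch.
 * Mimics ANALYZE's block sampling I/O pattern: sorted, sparse 8kB preads.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(int argc, char **argv)
{
    char        buf[BLCKSZ];
    struct stat st;
    long long   nblocks, blkno;
    int         fd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s bigfile [norandom]\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0)
    {
        perror("open");
        return 1;
    }

    /*
     * With POSIX_FADV_RANDOM the kernel's ondemand_readahead should stay
     * out of the way; pass "norandom" as the 2nd argument to keep the
     * default readahead behaviour for comparison.
     */
    if (argc < 3 || strcmp(argv[2], "norandom") != 0)
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    fstat(fd, &st);
    nblocks = st.st_size / BLCKSZ;

    /*
     * Visit roughly every 8th block in ascending order -- sparse but
     * sorted, similar to the blkno sequence seen in the breakpoints above.
     */
    srandom(0);
    for (blkno = random() % 8; blkno < nblocks; blkno += 1 + random() % 15)
    {
        if (pread(fd, buf, BLCKSZ, blkno * BLCKSZ) != BLCKSZ)
            break;
    }

    close(fd);
    return 0;
}

Running this under pidstat/iostat with a cold cache and comparing the two modes should show how much of the work visible in the perf profiles below is readahead pulling in pages the sampler never asks for (assuming the sketch is a fair stand-in for the real access pattern).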
And this probably explains the discrepancy: perf with CPU usage reporting shows a lot of frames waiting on I/O in readaheads done by ext4 (trimmed for clarity):

# Children  Self   sys    usr    Command   Shared Object      Symbol
    63.64%  0.00%  0.00%  0.00%  postgres  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
            ---entry_SYSCALL_64_after_hwframe
               do_syscall_64
               |--59.66%--sys_pread64
               |          vfs_read
               |          --59.09%--__vfs_read
               |                    --58.24%--generic_file_read_iter
               |                              |--47.44%--ondemand_readahead
               |                              |          --46.88%--__do_page_cache_readahead
               |                              |                    |--32.67%--ext4_mpage_readpages
               |                              |                    |          |--16.76%--submit_bio
               |                              |                    |--10.23%--blk_finish_plug
               [..]
    63.64%  1.99%  1.99%  0.00%  postgres  [kernel.kallsyms]  [k] do_syscall_64
            |--61.65%--do_syscall_64
            |          |--59.66%--sys_pread64
            |          |          vfs_read
            |          |          --59.09%--__vfs_read
            |          |                    --58.24%--generic_file_read_iter
            |          |                              |--47.44%--ondemand_readahead
            |          |                              |          --46.88%--__do_page_cache_readahead
    61.36%  0.00%  0.00%  0.00%  postgres  postgres           [.] FileRead
            ---FileRead
               __pread_nocancel
               --60.51%--entry_SYSCALL_64_after_hwframe
                         do_syscall_64
                         --59.66%--sys_pread64
                                   vfs_read
                                   --59.09%--__vfs_read
                                             --58.24%--generic_file_read_iter
                                                       |--47.44%--ondemand_readahead
                                                       |          --46.88%--__do_page_cache_readahead
    61.36%  0.85%  0.00%  0.85%  postgres  libpthread-2.17.so [.] __pread_nocancel
            |--60.51%--__pread_nocancel
            |          entry_SYSCALL_64_after_hwframe
            |          do_syscall_64
            |          --59.66%--sys_pread64
            |                    vfs_read
            |                    --59.09%--__vfs_read
            |                              --58.24%--generic_file_read_iter
            |                                        |--47.44%--ondemand_readahead
            |                                        |          --46.88%--__do_page_cache_readahead
    59.66%  0.00%  0.00%  0.00%  postgres  [kernel.kallsyms]  [k] sys_pread64
            ---sys_pread64
               vfs_read
               --59.09%--__vfs_read
                         --58.24%--generic_file_read_iter
                                   |--47.44%--ondemand_readahead
                                   |          --46.88%--__do_page_cache_readahead
                                   |                    |--32.67%--ext4_mpage_readpages
               [..]

perf --no-children also confirms, for the third time, that there isn't any single function burning a lot of CPU inside the worker:

# Overhead  sys    usr    Command   Shared Object      Symbol
     5.40%  0.00%  5.40%  postgres  [vdso]             [.] __vdso_clock_gettime
     5.11%  0.00%  5.11%  postgres  postgres           [.] acquire_sample_rows
            ---acquire_sample_rows
     3.98%  0.00%  3.98%  postgres  postgres           [.] heapam_scan_analyze_next_tuple
            ---heapam_scan_analyze_next_tuple
     3.69%  3.69%  0.00%  postgres  [kernel.kallsyms]  [k] pvclock_clocksource_read

My questions are:

a) Does anybody know whether it is expected that getrusage() doesn't include readahead time in the current thread's system time? (I don't know; maybe it is performed by other kernel threads?) ru_stime is defined as "This is the total amount of time spent executing in kernel mode". Maybe "executing" is the keyword here (waiting != executing)?

b) Initially I wanted to add a new pg_rusage_show_verbose() that would also print ru_inblock, but that wouldn't add much value to the end user. Adding another timing directly around table_scan_analyze_next_block() also seems like a bad idea, as it involves locking underneath. So I've tried the easiest approach: simply log pgStatBlockReadTime as strictly the I/O time spent in pread() (ReadBuffer_common() already measures it). The attached patch for PgSQL14-devel, under heavy I/O conditions (with track_io_timing=on), logs the following:

LOG: automatic analyze of table "test.public.t1_default" system usage: IO read time 0.69 s, CPU: user: 0.18 s, system: 0.13 s, elapsed: 0.92 s

My interpretation would be that I/O read time was the most limiting factor (0.69/0.92 = 75%), while *CPU* on the kernel side was just 0.13 s. It could give the end user/DBA the information needed to see where the bottleneck is, given autovacuum_vacuum_cost_delay=0. With autovacuum_vacuum_cost_delay>0 maybe it would also make sense to include the time spent sleeping?

c) I'm curious whether anybody has any I/O-related insights into analyze.c processing, especially related to readaheads. E.g. maybe disabling readahead (as in the standalone pread sketch above) would help for the PostgreSQL analyze.c use case on NVMe? Is it worth it, given that only x% of the blocks are needed? The only option I'm aware of would be to e.g. hash-partition the table (to introduce parallelism by autovacuums and enable even more workers).

Any hints or comments? All of the above observations are from PostgreSQL 12.4 on Linux kernel 4.14 with ext4 on a striped dm over 3x-4x NVMEs.

-Jakub Wartak
Attachment