Thread: automatic analyze: readahead - add "IO read time" log message
Greetings hackers,

I have, I hope, an interesting observation (and a nano patch proposal) from a system where statistics freshness is a critical factor. Autovacuum/automatic statistics gathering was tuned to be very aggressive:

autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
autovacuum_naptime=1s
autovacuum_max_workers=4

Some critical table partitions are configured with autovacuum_analyze_scale_factor=0.001 and autovacuum_analyze_threshold=50000 to force auto-analyze jobs pretty often. The interesting log entries look like this:

automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 s, elapsed: 252.14 s
automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 s, elapsed: 317.33 s
automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 s, elapsed: 213.58 s
automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 s, elapsed: 229.45 s

and this is the root cause of my question. As you can see there is a huge 3x-4x discrepancy between "elapsed" and user+system (e.g. 37.52 + 23.01 = 60.53 s of CPU versus 252.14 s elapsed), which is strange at least to me, as there should be no waiting (autovacuum_vacuum_cost_delay=0). According to various tools that is true: the time was wasted somewhere else, but not in the PostgreSQL analyze itself. ps(1) and pidstat(1) also report the same for the worker:

06:56:12 AM    PID     %usr  %system  %guest  %CPU   CPU  Command
06:56:13 AM    114774  8.00  10.00    0.00    18.00  18   postgres
06:56:14 AM    114774  8.00  11.00    0.00    19.00  15   postgres
06:56:15 AM    114774  5.00  13.00    0.00    18.00  18   postgres

06:56:17 AM    PID     kB_rd/s   kB_wr/s  kB_ccwr/s  Command
06:56:18 AM    114774  63746.53  0.00     0.00       postgres
06:56:19 AM    114774  62896.00  0.00     0.00       postgres
06:56:20 AM    114774  62920.00  0.00     0.00       postgres

One could argue that such an autoanalyze worker could perform 5x more work here (%CPU -> 100%). The I/O subsystem is not doing a lot (total = 242 MB/s reads @ 22k IOPS, 7 MB/s writes @ 7k IOPS, with a service time of 0.04 ms); although it reports high utilization, I'm pretty sure it could push much more. There can be up to 3-4 such 70-80 MB/s analyzes running in parallel (let's say 300 MB/s for statistics collection alone).
According to various gdb backtraces, a lot of time is spent here:

#0  0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192, __buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84
#2  FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005", amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
#3  0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
#4  0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902, mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f) at bufmgr.c:897
#5  0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimized out>, strategy=<optimized out>) at bufmgr.c:665
#6  0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>, blockno=<optimized out>, bstrategy=<optimized out>) at heapam_handler.c:998
#7  0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>, blockno=<optimized out>, scan=0x10c8098) at ../../../src/include/access/tableam.h:1462
#8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, targrows=1500000, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at analyze.c:1048
#9  0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>, relpages=26763525, inh=false, in_outer_xact=false, elevel=13) at analyze.c:502
[..]
#12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, tab=<optimized out>) at autovacuum.c:3118
[..]

The interesting thing is that targrows=1.5 million and that blocks are accessed (as expected) in sorted order:

Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, bstrategy=0x1102278) at heapam_handler.c:984

And probably this explains the discrepancy: perf with CPU usage reporting shows a lot of frames waiting on I/O in the readaheads done by ext4, trimmed for clarity:

# Children   Self   sys    usr    Command   Shared Object       Symbol
  63.64%     0.00%  0.00%  0.00%  postgres  [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
            ---entry_SYSCALL_64_after_hwframe
               do_syscall_64
               |--59.66%--sys_pread64
               |          vfs_read
               |          --59.09%--__vfs_read
               |                    --58.24%--generic_file_read_iter
               |                              |--47.44%--ondemand_readahead
               |                              |          --46.88%--__do_page_cache_readahead
               |                              |                    |--32.67%--ext4_mpage_readpages
               |                              |                    |          |--16.76%--submit_bio
               |                              |                    |--10.23%--blk_finish_plug
[..]

  63.64%     1.99%  1.99%  0.00%  postgres  [kernel.kallsyms]   [k] do_syscall_64
            |--61.65%--do_syscall_64
            |          |--59.66%--sys_pread64
            |          |          vfs_read
            |          |          --59.09%--__vfs_read
            |          |                    --58.24%--generic_file_read_iter
            |          |                              |--47.44%--ondemand_readahead
            |          |                              |          --46.88%--__do_page_cache_readahead

  61.36%     0.00%  0.00%  0.00%  postgres  postgres            [.] FileRead
            ---FileRead
               __pread_nocancel
               --60.51%--entry_SYSCALL_64_after_hwframe
                         do_syscall_64
                         --59.66%--sys_pread64
                                   vfs_read
                                   --59.09%--__vfs_read
                                             --58.24%--generic_file_read_iter
                                                       |--47.44%--ondemand_readahead
                                                       |          --46.88%--__do_page_cache_readahead

  61.36%     0.85%  0.00%  0.85%  postgres  libpthread-2.17.so  [.] __pread_nocancel
            |--60.51%--__pread_nocancel
            |          entry_SYSCALL_64_after_hwframe
            |          do_syscall_64
            |          --59.66%--sys_pread64
            |                    vfs_read
            |                    --59.09%--__vfs_read
            |                              --58.24%--generic_file_read_iter
            |                                        |--47.44%--ondemand_readahead
            |                                        |          --46.88%--__do_page_cache_readahead

  59.66%     0.00%  0.00%  0.00%  postgres  [kernel.kallsyms]   [k] sys_pread64
            ---sys_pread64
               vfs_read
               --59.09%--__vfs_read
                         --58.24%--generic_file_read_iter
                                   |--47.44%--ondemand_readahead
                                   |          --46.88%--__do_page_cache_readahead
                                   |          |--32.67%--ext4_mpage_readpages
[..]

Perf --no-children also triple-confirms that there isn't any function burning a lot of CPU inside the worker:

# Overhead   sys    usr    Command   Shared Object       Symbol
  5.40%      0.00%  5.40%  postgres  [vdso]              [.] __vdso_clock_gettime
  5.11%      0.00%  5.11%  postgres  postgres            [.] acquire_sample_rows
            ---acquire_sample_rows
  3.98%      0.00%  3.98%  postgres  postgres            [.] heapam_scan_analyze_next_tuple
            ---heapam_scan_analyze_next_tuple
  3.69%      3.69%  0.00%  postgres  [kernel.kallsyms]   [k] pvclock_clocksource_read

My questions are:

a) Does anybody know whether it is expected that getrusage() doesn't include readahead time as the current thread's system time? (I don't know; maybe it is performed by other kernel threads?) ru_stime is defined as "the total amount of time spent executing in kernel mode". Maybe "executing" is the keyword here? (waiting != executing?)

b) Initially I wanted to add a new pg_rusage_show_verbose() that would also add ru_inblock, but that wouldn't add much value for the end user. Also, adding another timing directly around table_scan_analyze_next_block() seems like a bad idea as it involves locking underneath. So I've tried the easiest approach: simply log pgStatBlockReadTime as strictly the I/O time spent in pread() (ReadBuffer_common() already measures it). The attached patch for PgSQL 14-devel, under heavy I/O conditions (with track_io_timing=on), logs the following:

"LOG: automatic analyze of table "test.public.t1_default" system usage: IO read time 0.69 s, CPU: user: 0.18 s, system: 0.13 s, elapsed: 0.92 s"

My interpretation would be that I/O read time was the most limiting factor (0.69/0.92 = 75%), while *CPU* on the kernel side was just 0.13 s. It could give the end user/DBA the information needed about where the bottleneck is, given autovacuum_vacuum_cost_delay=0. With autovacuum_vacuum_cost_delay>0 maybe it would also make sense to include the time spent sleeping?

c) I'm curious whether anybody has any I/O-related insights into analyze.c processing, especially related to readaheads. E.g. maybe disabling readahead would help the PostgreSQL analyze.c use case on NVMe? Is it worth it, given that only x% of the blocks are needed? The only option I'm aware of would be to e.g. hash-partition the table (to introduce parallelism by spreading the work over several autovacuum workers). Any hints or comments?

All of the above observations are from PostgreSQL 12.4 on Linux kernel 4.14 with ext4 on a striped dm over 3x-4x NVMes.

-Jakub Wartak.
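As a side note for question a): the log line above comes from pg_rusage_init()/pg_rusage_show(), which just diff getrusage(RUSAGE_SELF) against wall-clock timestamps, so anything the process spends blocked on disk shows up only in "elapsed". A minimal standalone sketch of the same measurement (nothing PostgreSQL-specific; the file argument is just a placeholder) reproduces the effect against any large uncached file:

    /*
     * rusage_demo.c -- compare getrusage() CPU times with wall-clock time
     * while reading a file.  Time spent waiting for the disk inside read()
     * shows up in "elapsed" but not in ru_utime/ru_stime, which is exactly
     * the shape of the autoanalyze log lines above.
     *
     * Build: cc -O2 -o rusage_demo rusage_demo.c
     * Run against a big file after dropping the page cache.
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/resource.h>
    #include <sys/time.h>

    static double
    tv_seconds(struct timeval tv)
    {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int
    main(int argc, char **argv)
    {
        char        buf[8192];
        struct rusage ru0, ru1;
        struct timeval t0, t1;
        int         fd;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        {
            fprintf(stderr, "usage: %s <big-uncached-file>\n", argv[0]);
            return 1;
        }

        gettimeofday(&t0, NULL);
        getrusage(RUSAGE_SELF, &ru0);

        while (read(fd, buf, sizeof(buf)) > 0)
            ;                       /* just pull the file through the kernel */

        getrusage(RUSAGE_SELF, &ru1);
        gettimeofday(&t1, NULL);

        printf("user: %.2f s, system: %.2f s, elapsed: %.2f s\n",
               tv_seconds(ru1.ru_utime) - tv_seconds(ru0.ru_utime),
               tv_seconds(ru1.ru_stime) - tv_seconds(ru0.ru_stime),
               tv_seconds(t1) - tv_seconds(t0));
        close(fd);
        return 0;
    }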
Attachment
Greetings, * Jakub Wartak (Jakub.Wartak@tomtom.com) wrote: > I have I hope interesting observation (and nano patch proposal) on system where statistics freshness is a critical factor.Autovacuum/autogathering statistics was tuned to be pretty very aggressive: > autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant) > autovacuum_naptime=1s > autovacuum_max_workers=4 > > Some critical table partitions are configured with: autovacuum_analyze_scale_factor=0.001, autovacuum_analyze_threshold=50000to force auto-analyze jobs pretty often. The interesting logs are like this: > automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 s, elapsed: 252.14 s > automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 s, elapsed: 317.33 s > automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 s, elapsed: 213.58 s > automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 s, elapsed: 229.45 s That's certainly pretty aggressive. :) > and this is root-cause of my question. As you can see there is huge 3x-4x time discrepancy between "elapsed" and user+systemwhich is strange at least for me as there should be no waiting (autovacuum_vacuum_cost_delay=0). According tovarious tools it is true: Time was wasted somewhere else, but not in the PostgreSQL analyze. The ps(1) and pidstat(1) alsoreport the same for the worker: The user/system time there is time-on-CPU (hence the 'CPU: ' prefix). > 06:56:12 AM PID %usr %system %guest %CPU CPU Command > 06:56:13 AM 114774 8.00 10.00 0.00 18.00 18 postgres > 06:56:14 AM 114774 8.00 11.00 0.00 19.00 15 postgres > 06:56:15 AM 114774 5.00 13.00 0.00 18.00 18 postgres > > 06:56:17 AM PID kB_rd/s kB_wr/s kB_ccwr/s Command > 06:56:18 AM 114774 63746.53 0.00 0.00 postgres > 06:56:19 AM 114774 62896.00 0.00 0.00 postgres > 06:56:20 AM 114774 62920.00 0.00 0.00 postgres > > One could argue that such autoanalyze worker could perform 5x more of work (%CPU -> 100%) here. The I/O system is not performinga lot (total = 242MB/s reads@22k IOPS, 7MB/s writes@7k IOPS, with service time 0.04ms), although reporting highutilization I'm pretty sure it could push much more. There can be up to 3x-4x of such 70-80MB/s analyzes in parallel(let's say 300MB/s for statistics collection alone). The analyze is doing more-or-less random i/o since it's skipping through the table picking out select blocks, not doing regular sequential i/o. 
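To make the access pattern concrete: acquire_sample_rows() draws its block sample with what is, if memory serves, Knuth's Algorithm S (BlockSampler_Next() in src/backend/utils/misc/sampling.c), so the chosen block numbers always come out ascending but with gaps. A simplified standalone sketch of just that selection rule, using the relpages/targrows figures from the backtrace above (an illustration, not the sampling.c code):

    /*
     * Algorithm S sketch: visit blocks 0..N-1 in order and keep each with
     * probability (wanted - chosen) / (blocks remaining).  The result is a
     * strictly increasing sequence of block numbers with gaps, i.e. neither
     * uniform sequential I/O nor fully random I/O.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        unsigned long N = 26763525;     /* relpages from the backtrace */
        unsigned long n = 1500000;      /* targrows from the backtrace */
        unsigned long chosen = 0;

        srand(0);
        for (unsigned long t = 0; t < N && chosen < n; t++)
        {
            double      u = (double) rand() / ((double) RAND_MAX + 1);

            if (u * (double) (N - t) < (double) (n - chosen))
            {
                if (chosen < 10)
                    printf("sampled block %lu\n", t);
                chosen++;
            }
        }
        printf("sampled %lu of %lu blocks\n", chosen, N);
        return 0;
    }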
> According to various gdb backtraces, a lot of time is spent here: > #0 0x00007f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0 > #1 0x0000000000741a16 in pread (__offset=811253760, __nbytes=8192, __buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84 > #2 FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005", amount=8192, offset=811253760, wait_event_info=167772173)at fd.c:1883 > #3 0x0000000000764b8f in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=19890902, buffer=0x7f9413ab7280"\037\005") at md.c:596 > #4 0x000000000073d69b in ReadBuffer_common (smgr=<optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902,mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f) > at bufmgr.c:897 > #5 0x000000000073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimizedout>, strategy=<optimized out>) at bufmgr.c:665 > #6 0x00000000004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>, blockno=<optimized out>, bstrategy=<optimizedout>) at heapam_handler.c:998 > #7 0x0000000000597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>, blockno=<optimized out>, scan=0x10c8098)at ../../../src/include/access/tableam.h:1462 > #8 acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, targrows=1500000, totalrows=0x7fffba7e3160,totaldeadrows=0x7fffba7e3158) at analyze.c:1048 > #9 0x0000000000596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>,relpages=26763525, inh=false, > in_outer_xact=false, elevel=13) at analyze.c:502 > [..] > #12 0x00000000006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, tab=<optimized out>) at autovacuum.c:3118 > [..] Sure, we're blocked on a read call trying to get the next block. > The interesting thing that targrows=1.5mlns and that blocks are accessed (as expected) in sorted order: > > Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984 > Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984 > Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984 > Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, bstrategy=0x1102278) at heapam_handler.c:984 > Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, bstrategy=0x1102278) at heapam_handler.c:984 Not really sure what's interesting here, but it does look like we're skipping through the table as expected. > And probably this explains the discrepancy, perf with CPU usage reporting shows a lot of frames waiting on I/O on readaheadsdone by ext4, trimmed for clarity: > > # Children Self sys usr Command Shared Object Symbol > 63.64% 0.00% 0.00% 0.00% postgres [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe > ---entry_SYSCALL_64_after_hwframe > do_syscall_64 > |--59.66%--sys_pread64 > | vfs_read > | --59.09%--__vfs_read > | --58.24%--generic_file_read_iter > | |--47.44%--ondemand_readahead > | | --46.88%--__do_page_cache_readahead > | | |--32.67%--ext4_mpage_readpages > | | | |--16.76%--submit_bio > | | |--10.23%--blk_finish_plug > [..] 
> 63.64% 1.99% 1.99% 0.00% postgres [kernel.kallsyms] [k] do_syscall_64 > |--61.65%--do_syscall_64 > | |--59.66%--sys_pread64 > | | vfs_read > | | --59.09%--__vfs_read > | | --58.24%--generic_file_read_iter > | | |--47.44%--ondemand_readahead > | | | --46.88%--__do_page_cache_readahead > > 61.36% 0.00% 0.00% 0.00% postgres postgres [.] FileRead > ---FileRead > __pread_nocancel > --60.51%--entry_SYSCALL_64_after_hwframe > do_syscall_64 > --59.66%--sys_pread64 > vfs_read > --59.09%--__vfs_read > --58.24%--generic_file_read_iter > |--47.44%--ondemand_readahead > | --46.88%--__do_page_cache_readahead > > 61.36% 0.85% 0.00% 0.85% postgres libpthread-2.17.so [.] __pread_nocancel > |--60.51%--__pread_nocancel > | entry_SYSCALL_64_after_hwframe > | do_syscall_64 > | --59.66%--sys_pread64 > | vfs_read > | --59.09%--__vfs_read > | --58.24%--generic_file_read_iter > | |--47.44%--ondemand_readahead > | | --46.88%--__do_page_cache_readahead > > > 59.66% 0.00% 0.00% 0.00% postgres [kernel.kallsyms] [k] sys_pread64 > ---sys_pread64 > vfs_read > --59.09%--__vfs_read > --58.24%--generic_file_read_iter > |--47.44%--ondemand_readahead > | --46.88%--__do_page_cache_readahead > | |--32.67%--ext4_mpage_readpages > With all those 'readahead' calls it certainly makes one wonder if the Linux kernel is reading more than just the block we're looking for because it thinks we're doing a sequential read and will therefore want the next few blocks when, in reality, we're going to skip past them, meaning that any readahead the kernel is doing is likely just wasted I/O. > [..] > Perf --no-children also triple confirms that there isn't any function that is burning a lot inside the worker: > > # Overhead sys usr Command Shared Object Symbol > 5.40% 0.00% 5.40% postgres [vdso] [.] __vdso_clock_gettime > 5.11% 0.00% 5.11% postgres postgres [.] acquire_sample_rows > ---acquire_sample_rows > 3.98% 0.00% 3.98% postgres postgres [.] heapam_scan_analyze_next_tuple > ---heapam_scan_analyze_next_tuple > 3.69% 3.69% 0.00% postgres [kernel.kallsyms] [k] pvclock_clocksource_read Sure, makes sense. > My questions are: > a) does anybody know if it is expected that getrusage() doesn't include readahead times as current thread system time? (I don't know by may be performed by other kernel threads?) ru_stime is defined as "This is the total amount of timespent executing in kernel mode". Maybe the "executing" is the keyword here? (waiting != executing?) getrusage()'s user/system CPU times are reporting time-on-CPU, not counting time blocking for i/o. Waiting isn't the same as executing, no. > b) initially I've wanted to add a new pg_rusage_show_verbose() that would also add ru_inblock, but that wouldn't add muchvalue to the end user. Also adding another timing directly around table_scan_analyze_next_block() seems like the badidea as it involves locking underneah. So I've tried the most easy approach to simply log $pgStatBlockReadTime as strictlyI/O time spent in pread() (ReadBuffer_common() already measures time). The attached patch for PgSQL14-devel in heavyI/O conditions (with track_io_timings=on) logs the following: > "LOG: automatic analyze of table "test.public.t1_default" system usage: IO read time 0.69 s, CPU: user: 0.18 s, system:0.13 s, elapsed: 0.92 s" That definitely seems like a useful thing to include and thanks for the patch! 
Please be sure to register it in the commitfest app: https://commitfest.postgresql.org > my interpretation would be that IO reading time was most limiting factor (69/92 = 75%), but *CPU* on kernel side was just13s. It could give the enduser/DBA the information needed, the information where's the bottleneck given the autovacuum_vacuum_cost_delay=0.In autovacuum_vacuum_cost_delay>0 maybe it would make sense to include also time spent onsleeping? Yeah, that would certainly be useful. > c) I'm curious if anybody has any I/O related insights into analyze.c processing especially related to readaheads? E.g.maybe disabling readahead would help for PostgreSQL analyze.c usecase on NVMe? Is it worth given that only x% of blocksare needed? The only option I'm aware would be to e.g. hash-partition the table (to introduce parallelism by autovacuumsand enable even workers). Any hints or comments? I would think that, ideally, we'd teach analyze.c to work in the same way that bitmap heap scans do- that is, use posix_fadvise to let the kernel know what pages we're going to want next instead of the kernel guessing (incorrectly) or not doing any pre-fetching. I didn't spend a lot of time poking, but it doesn't look like analyze.c tries to do any prefetching today. In a similar vein, I wonder if VACUUM should be doing prefetching too today, at least when it's skipping through the heap based on the visibility map and jumping over all-frozen pages. > All of the above observations from PostgreSQL 12.4 on Linux kernel 4.14 with ext4/striped dm with 3x-4x NVMEs. > > -Jakub Wartak. > diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c > index 8af12b5c6b..fea1bd6f44 100644 > --- a/src/backend/commands/analyze.c > +++ b/src/backend/commands/analyze.c > @@ -312,6 +312,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params, > Oid save_userid; > int save_sec_context; > int save_nestlevel; > + PgStat_Counter startblockreadtime = 0; > > if (inh) > ereport(elevel, > @@ -347,6 +348,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params, > if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0) > { > pg_rusage_init(&ru0); > + startblockreadtime = pgStatBlockReadTime; > if (params->log_min_duration > 0) > starttime = GetCurrentTimestamp(); > } > @@ -686,10 +688,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params, > TimestampDifferenceExceeds(starttime, GetCurrentTimestamp(), > params->log_min_duration)) > ereport(LOG, > - (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: %s", > + (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: IO read time %.2f s, %s", > get_database_name(MyDatabaseId), > get_namespace_name(RelationGetNamespace(onerel)), > RelationGetRelationName(onerel), > + (double) (pgStatBlockReadTime - startblockreadtime)/1000000, > pg_rusage_show(&ru0)))); > } > Haven't looked too closely at this but in general +1 on the idea and this approach looks pretty reasonable to me. Only thing I can think of off-hand is to check how it compares to other places where we report IO read time and make sure that it looks similar. Thanks, Stephen
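For context on what that counter captures: with track_io_timing=on, ReadBuffer_common() already brackets the actual smgrread() with timestamps and accumulates the difference into pgStatBlockReadTime, so the proposed "IO read time" is strictly time spent inside pread() for shared-buffer misses. Roughly (paraphrasing bufmgr.c from memory, not an exact copy):

    if (track_io_timing)
        INSTR_TIME_SET_CURRENT(io_start);

    smgrread(smgr, forkNum, blockNum, (char *) bufBlock);

    if (track_io_timing)
    {
        INSTR_TIME_SET_CURRENT(io_time);
        INSTR_TIME_SUBTRACT(io_time, io_start);
        /* this is what feeds pgStatBlockReadTime, in microseconds */
        pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
        INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
    }

So the delta logged by the patch only counts reads that actually missed shared_buffers, i.e. exactly the waiting that the rusage numbers cannot show.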
Attachment
Hi Stephen, hackers,

> The analyze is doing more-or-less random i/o since it's skipping through
> the table picking out select blocks, not doing regular sequential i/o.

vs.

>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984

> Not really sure what's interesting here, but it does look like we're
> skipping through the table as expected.

Yes, but not randomly in this case. I wanted to point out that the block number keeps incrementing, which is why I included this debug output: it might trigger the readahead heuristics. Perhaps this depends on how the table was built/vacuumed? (In this case it is pure INSERT-only; I would expect the same in time-series DBs and DWHs.)

> With all those 'readahead' calls it certainly makes one wonder if the
> Linux kernel is reading more than just the block we're looking for
> because it thinks we're doing a sequential read and will therefore want
> the next few blocks when, in reality, we're going to skip past them,
> meaning that any readahead the kernel is doing is likely just wasted
> I/O.

I've done some quick&dirty tests with blockdev --setra/--setfra 0, after spending time looking at the smgr/md/fd API changes required to find a shortcut, but I'm actually getting slightly worse timings, at least in "laptop DB tests". One thing I've noticed is that this would need to apply only to automatic analyze, not to automatic vacuum, where there apparently is some boost due to readahead.

> That definitely seems like a useful thing to include and thanks for the
> patch! Please be sure to register it in the commitfest app:
> https://commitfest.postgresql.org

Thank you! The thread is now registered.

> I would think that, ideally, we'd teach analyze.c to work in the same
> way that bitmap heap scans do- that is, use posix_fadvise to let the
> kernel know what pages we're going to want next instead of the kernel
> guessing (incorrectly) or not doing any pre-fetching. I didn't spend a
> lot of time poking, but it doesn't look like analyze.c tries to do any
> prefetching today. In a similar vein, I wonder if VACUUM should be
> doing prefetching too today, at least when it's skipping through the
> heap based on the visibility map and jumping over all-frozen pages.

My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO), with POSIX_FADV_RANDOM issued on the block range beforehand, so maybe the optimization is possible -- but not until we'll have AIO ;)

> Haven't looked too closely at this but in general +1 on the idea and
> this approach looks pretty reasonable to me. Only thing I can think of
> off-hand is to check how it compares to other places where we report IO
> read time and make sure that it looks similar.

OK, I've changed the output in the 002 version to include "avg read rate" just like in the autovacuum case, while still maintaining single-line output, e.g.:

automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s (read time: 2.52 s), system usage: CPU: user: 0.28 s, system: 0.26 s, elapsed: 2.94 s

-J.
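Incidentally, the per-file-descriptor equivalent of the blockdev --setra 0 experiment, which does not require touching the smgr/md/fd layers, is a single posix_fadvise() call on the open segment file. A minimal standalone sketch, purely for experimentation (the file argument is a placeholder, and this is not something the backend does today):

    /*
     * fadv_random.c -- switch off kernel readahead for one file only,
     * the per-fd analogue of "blockdev --setra 0" for the whole device.
     * Build: cc -O2 -o fadv_random fadv_random.c
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        int         fd;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        {
            fprintf(stderr, "usage: %s <relation-segment-file>\n", argv[0]);
            return 1;
        }

        /* offset = len = 0 means "apply to the whole file" */
        if (posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) != 0)
            perror("posix_fadvise");

        /* ... any pread()s issued on fd from here on skip readahead ... */
        close(fd);
        return 0;
    }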
Attachment
Greetings Jakub, * Jakub Wartak (Jakub.Wartak@tomtom.com) wrote: > > The analyze is doing more-or-less random i/o since it's skipping through > > the table picking out select blocks, not doing regular sequential i/o. > VS > >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984 > >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984 > >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984 > >Not really sure what's interesting here, but it does look like we're > >skipping through the table as expected. > > Yes, but not randomly in this case. I wanted to point out that this is incrementing block number, therefore I've includedthis debug output which might trigger readahead heuristics. Sure, it's incrementing, but it's skipping- this is very similar to what we do with Bitmap Heap Scans, and that makes it different from a typical sequential scan. > Perhaps this depends on how the table was built / vacuumed ? (in this case, pure-INSERT-only; I would expect the same intime series DBs and DWHs). No, this is how ANALYZE behaves and hasn't got much to do with how the table was built or vacuumed. > > With all those 'readahead' calls it certainly makes one wonder if the > > Linux kernel is reading more than just the block we're looking for > > because it thinks we're doing a sequential read and will therefore want > > the next few blocks when, in reality, we're going to skip past them, > > meaning that any readahead the kernel is doing is likely just wasted > > I/O. > > I've done some quick&dirty tests with blockdev --setra/setfra 0 after spending time looking at the smgr/md/fd API changesrequired to find shortcut, but I'm getting actually a little bit worse timings at least on "laptop DB tests". Onething that I've noticed is that needs to be only for automatic-analyze, but not for automatic-vacuum where apparentlythere is some boost due to readahead. Interesting that you weren't seeing any benefit to disabling readahead. Were you able to see where the time in the kernel was going when readahead was turned off for the ANALYZE? Note that you shouldn't need to make changes to smgr/md/fd to leverage posix_fadvise- what you would do is use PrefetchBuffer(), see BitmapPrefetch(). The VACUUM case is going to be complicated by what's in the visibility map. A VACUUM that isn't able to skip any pages yet is certainly going to benefit from the kernel's readahead, but a VACUUM that's able to skip over pages likely wouldn't benefit as much. The way to test this would look something like: - COPY a bunch of data into a table - make sure to commit that and any other ongoing transactions - VACUUM FREEZE the table - check the visibility map to make sure most of the pages are marked as all-frozen after the VACUUM FREEZE - randomly UPDATE the table, to really get the effect, maybe update 20% of the pages while leaving the rest alone (and therefore 80% of the table should still have the all-frozen bit set in the visibility map) - *then* do a VACUUM on the table and see what happens with different amounts of read-ahead (or, ideally, with posix_fadvise() being used to let the kernel know what pages we're going to actually want). > > That definitely seems like a useful thing to include and thanks for the > > patch! 
Please be sure to register it in the commitfest app: > > https://commitfest.postgresql.org > > Thank you! Thread is now registered. Great! > > I would think that, ideally, we'd teach analyze.c to work in the same > > way that bitmap heap scans do- that is, use posix_fadvise to let the > > kernel know what pages we're going to want next instead of the kernel > > guessing (incorrectly) or not doing any pre-fetching. I didn't spend a > > lot of time poking, but it doesn't look like analyze.c tries to do any > > prefetching today. In a similar vein, I wonder if VACUUM should be > > doing prefetching too today, at least when it's skipping through the > > heap based on the visibility map and jumping over all-frozen pages. > > My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM issuedon block-range before, so maybe the the optimization is possible, but not until we'll have AIO ;) Well, the idea is that posix_fadvise() usage through PrefetchBuffer() gets us some of that by letting the kernel know what we're going to ask for next. AIO is a whole other animal that's been discussed off and on around here but it's a much larger and more invasive change than just calling posix_fadvise(). > > Haven't looked too closely at this but in general +1 on the idea and > > this approach looks pretty reasonable to me. Only thing I can think of > > off-hand is to check how it compares to other places where we report IO > > read time and make sure that it looks similar. > > Ok, I've changed the output in 002 version to include "avg read rate" just like in the autovacuum case but still maintainingsingle line output, e.g: > automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s (read time: 2.52 s), system usage: CPU:user: 0.28 s, system: 0.26 s, elapsed: 2.94 s Ah, in this case I had been thinking actually about the other things that track_io_timing controls, as in: ExplainPropertyFloat("I/O Read Time", "ms", INSTR_TIME_GET_MILLISEC(usage->blk_read_time), 3, es); ExplainPropertyFloat("I/O Write Time", "ms", INSTR_TIME_GET_MILLISEC(usage->blk_write_time), 3, es); Seems we already aren't exactly consistent in how we report these things, unfortunately. That said, since this is ANALYZE, yeah, it probably makes more sense to have it look more like what VACUUM reports than what you'd get from EXPLAIN. As an interesting point also- what VACUUM reports isn't through using track_io_timing, which makes one wonder if we should be doing that here or not.. Thanks! Stephen
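For reference, the VACUUM numbers being compared against are indeed computed without track_io_timing at all: lazy_vacuum_rel() derives its rates from the buffer miss/dirty counters and wall-clock time, roughly like this (paraphrased from vacuumlazy.c, not an exact copy):

    TimestampDifference(starttime, endtime, &secs, &usecs);

    read_rate = 0;
    write_rate = 0;
    if ((secs > 0) || (usecs > 0))
    {
        read_rate = (double) BLCKSZ * VacuumPageMiss / (1024 * 1024) /
            (secs + usecs / 1000000.0);
        write_rate = (double) BLCKSZ * VacuumPageDirty / (1024 * 1024) /
            (secs + usecs / 1000000.0);
    }
    appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
                     read_rate, write_rate);

So an ANALYZE counterpart could either mirror that style or use the track_io_timing machinery as in the patch; the two measure slightly different things (pages missed versus time spent waiting).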
Attachment
Hi Stephen, hackers,

>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.

>> I've done some quick&dirty tests with blockdev --setra/setfra 0 after spending time looking at the smgr/md/fd API changes required to find a shortcut, but I'm getting actually a little bit worse timings at least on "laptop DB tests". One thing that I've noticed is that it needs to be only for automatic analyze, but not for automatic vacuum, where apparently there is some boost due to readahead.

> Interesting that you weren't seeing any benefit to disabling readahead.

I've got some free minutes and I have repeated the exercise in a more realistic and stricter environment than the previous one, to conclude that the current situation is preferable:

Analyzed table was 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 35GB, 65GB (as reported by \di+).
Linux kernel 4.14.x, 2x NVMe under dm-0 (it might matter, as /dev/dm-0 is a different layer and might have different storage settings), VG on top of dm-0, LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (minus ~30GB reserved for HugePages), typical output of PgSQL 12:

INFO: "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows and 0 dead rows; 1500000 rows in sample, 1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent in heapam_scan_analyze_next_block() -> .. -> pread(), which causes ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average request size per IO = 25 sectors (*512/1024 = ~12kB), readahead on, hot caches, total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, double-confirmed no readaheads [ pread()->generic_file_read_iter()->ext4_mpage_readpages()->bio.. ], total elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, total elapsed 5m55s

So the cold-cache ANALYZE takes ~3m28s with kernel readahead on versus ~5m55s with it off, about 1.7x faster with readahead. One thing not clear here is whether it would be worth measuring in the future how striped LVs are affected by readaheads.

> Were you able to see where the time in the kernel was going when
> readahead was turned off for the ANALYZE?

Yes, my interpretation is that the time goes directly into waiting in the block I/O layer:

 54.67% 1.33% postgres postgres [.]
FileRead ---FileRead --53.33%--__pread_nocancel --50.67%--entry_SYSCALL_64_after_hwframe do_syscall_64 sys_pread64 |--49.33%--vfs_read | --48.00%--__vfs_read | |--45.33%--generic_file_read_iter | | |--42.67%--ondemand_readahead | | | __do_page_cache_readahead | | | |--25.33%--ext4_mpage_readpages | | | | |--10.67%--submit_bio | | | | | generic_make_request | | | | | |--8.00%--blk_mq_make_request | | | | | | |--4.00%--blk_mq_get_request | | | | | | | |--1.33%--blk_mq_get_tag | | | | | | | --1.33%--sched_clock | | | | | | | xen_sched_clock | | | | | | | pvclock_clocksource_read | | | | | | |--1.33%--bio_integrity_prep | | | | | | --1.33%--blk_account_io_start | | | | | | part_round_stats | | | | | | blk_mq_in_flight | | | | | | blk_mq_queue_tag_busy_iter | | | | | --2.67%--dm_make_request | | | | | __split_and_process_bio | | | | | __split_and_process_non_flush | | | | | |--1.33%--__map_bio | | | | | | generic_make_request | | | | | | generic_make_request_checks | | | | | | percpu_counter_add_batch | | | | | --1.33%--bio_alloc_bioset | | | | | mempool_alloc | | | | | kmem_cache_alloc | | | | |--6.67%--ext4_map_blocks | | | | | |--4.00%--ext4_es_lookup_extent | | | | | | --2.67%--_raw_read_lock | | | | | --2.67%--__check_block_validity.constprop.81 | | | | | ext4_data_block_valid | | | | --6.67%--add_to_page_cache_lru | | | | |--4.00%--__add_to_page_cache_locked | | | | | --1.33%--mem_cgroup_try_charge | | | | | get_mem_cgroup_from_mm | | | | --2.67%--__lru_cache_add | | | | pagevec_lru_move_fn | | | | __lock_text_start | | | |--12.00%--blk_finish_plug | | | | blk_flush_plug_list | | | | blk_mq_flush_plug_list | | | | |--10.67%--__blk_mq_delay_run_hw_queue | | | | | __blk_mq_run_hw_queue | | | | | blk_mq_sched_dispatch_requests | | | | | --9.33%--blk_mq_dispatch_rq_list | | | | | nvme_queue_rq | | | | | --1.33%--blk_mq_start_request >The VACUUM case is going to be complicated by what's in the visibility >map. (..) After observing the ANALYZE readahead behavior benefit I've abandoned the case of testing much more advanced VACUUM processing, clearly Linux read-ahead is beneficial in even simple cases. >> My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOMissued on block-range before, so maybe the the optimization is possible, but not until we'll have AIO ;) > > (..)AIO is a whole other animal that's been discussed off and on >around here but it's a much larger and more invasive change than just >calling posix_fadvise(). Yes, I'm aware and I'm keeping my fingers crossed that maybe some day.... The ANALYZE just seem fit to be natural candidate to use it. The only easy chance of acceleration of stats gathering - at least to me and enduser point of view - is to have more parallel autoanalyze workers running to drive more I/O concurrency (by e.g. partitioning the data), both in readahead and non-readahead scenarios. Which is a pity because 70-80% of such process sits idle. The readahead might read 10x more unnecessary data, but pread() doesn't have to wait. <speculation>Once AIO would be it could throw thousands of requests without readahead and achieve much better efficiency probably</speculation> I hope the previous simple patch goes into master and helps other people understand the picture more easily. -J.
Greetings, * Jakub Wartak (Jakub.Wartak@tomtom.com) wrote: > >Interesting that you weren't seeing any benefit to disabling readahead. > > I've got some free minutes and I have repeated the exercise in more realistic > and strict environment that previous one to conclude that the current situation is preferable: Thanks for spending time on this! > Analyzed table was having 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 35GB, 65GB (as reported by \di+) > Linux kernel 4.14.x, 2x NVME under dm-0 (it might matter as /dev/dm-0 might is different layer and might have differentstorage settings), VG on top of dm-0, LV with stripe-size 8kB, ext4. > s_b=128MB, RAM=128GB (- ~30GB which were reserved for HugePages), typical output of PgSQL12: > INFO: "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows and 0 dead rows; 1500000 rows in sample,1678321053 estimated total rows > > Hot VFS cache: > Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent heapam_scan_analyze_next_block() -> .. ->pread() which causes ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average request sizeper IO=25 sectors(*512/1024 = ~12kB), readahead on, hot caches, total elapsed ~3m > Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s > Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s > Run3: Defaults, miliaria as above (hot VFS cache), total elapsed 2m:40s > > No VFS cache: > Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s read, ondemand_readahead visible in perf,total elapsed 3m30s > Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m29s > Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m28s > > No VFS cache, readahead off: > Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read IOPS@ avgrq-sz = 11 sectors (*512=5.5kB), total elapsed 5m59s > Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, double-confirmed no readaheads[ pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total elapsed 5m56s > Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, total elapsed 5m55s [ ... ] > >The VACUUM case is going to be complicated by what's in the visibility > >map. (..) > > After observing the ANALYZE readahead behavior benefit I've abandoned > the case of testing much more advanced VACUUM processing, clearly Linux > read-ahead is beneficial in even simple cases. This seems to be indicating that while the Linux kernel may end up reading pages we don't end up needing, it's much more often the case that it's ending up reading *some* pages that we do need, and that's happening often enough that it more than makes up for the extra reads being done. Instead of having these guessing games between the kernel and what PG's doing, however, we could potentially do better using posix_fadvise() to tell the kernel, up front, exactly what blocks we are going to ask for, and perhaps that would end up improving things. Attached is a very much rough-n-ready patch for doing that, using effective_io_concurrency to control how many blocks to pre-fetch for ANALYZE (0 meaning 'none'). If you've got a chance to test with different settings for effective_io_concurrency with the patch applied to see what impact posix_fadvise() has on these ANALYZE runs, that would be very cool to see. 
Going between effective_io_concurrency = 0 and effective_io_concurrency = 10 with this patch, in some very quick testing on a laptop NVMe, while making sure to drop caches and restart PG in between to clear out shared_buffers, definitely shows that prefetching done this way is an improvement over letting the kernel's normal read ahead handle it.

> >> My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with POSIX_FADV_RANDOM issued on block-range before, so maybe the optimization is possible, but not until we'll have AIO ;)
> >
> > (..)AIO is a whole other animal that's been discussed off and on
> > around here but it's a much larger and more invasive change than just
> > calling posix_fadvise().
>
> Yes, I'm aware and I'm keeping my fingers crossed that maybe some day....

I don't think we should throw out the idea of using PrefetchBuffer() here. "Real" AIO would certainly be good to have one of these days, but until then, posix_fadvise() could net us some of those gains in the meantime.

> The ANALYZE just seem fit to be natural candidate to use it. The only easy chance
> of acceleration of stats gathering - at least to me and enduser point of view -
> is to have more parallel autoanalyze workers running to drive more I/O concurrency
> (by e.g. partitioning the data), both in readahead and non-readahead scenarios.
> Which is a pity because 70-80% of such process sits idle. The readahead might read
> 10x more unnecessary data, but pread() doesn't have to wait. <speculation>Once AIO
> would be it could throw thousands of requests without readahead and achieve much
> better efficiency probably</speculation>

Actual parallel ANALYZE could be interesting, of course, but that's also quite a bit more complicated to make happen, and even in that case it would seem like we'd probably want to use posix_fadvise() anyway since it'd hopefully help.

> I hope the previous simple patch goes into master and helps other people understand
> the picture more easily.

I agree that we should also be providing users with more information about what's going on, but maybe we can do that and improve things by leveraging posix_fadvise().

Thanks,

Stephen
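For anyone following along at the syscall level, PrefetchBuffer() is the existing hook being leveraged here: when the block is not already in shared_buffers it goes through smgrprefetch() and mdprefetch(), and fd.c ends up issuing roughly the following (paraphrasing md.c/FilePrefetch(), not an exact copy; "fd" stands for the descriptor of the 1GB segment file):

    /* md.c: block number -> byte offset within the current 1GB segment */
    off_t       seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));

    /* fd.c, FilePrefetch(): hand the hint to the kernel */
    (void) posix_fadvise(fd, seekpos, BLCKSZ, POSIX_FADV_WILLNEED);

These are exactly the fadvise64(..., POSIX_FADV_WILLNEED) calls visible in the strace output in the next message.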
Attachment
Greetings Stephen,

I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's a huge ~5.2x improvement, which is amazing for such a simple patch.

The system and data were identical to last time, so the results are directly comparable to the previous post. The only change is that I've applied your patch on top of REL12_STABLE, as this is the cluster where I had the original data. I think it deserves its own commitfest entry, right? Or do we combine those two?

Each test was conducted with:
1. alter system set effective_io_concurrency=$N;
2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart;
3. analyze verbose t;
4. the measurements were observed via pidstat -d and iostat -dxm.

no readahead (blockdev --setra 0), cold cache:
effective_io_concurrency=64; => ~26MB/s
effective_io_concurrency=32; => ~29MB/s
effective_io_concurrency=8;  => ~31MB/s
effective_io_concurrency=4;  => ~32MB/s
effective_io_concurrency=1;  => ~32MB/s
effective_io_concurrency=0;  => ~32MB/s

readahead=256 (128kB, Linux defaults), cold cache:
effective_io_concurrency=0;  => ~60..68MB/s (analyze total elapsed time 3min 22s, baseline)
effective_io_concurrency=1;  => ~62MB/s
effective_io_concurrency=4;  => ~370MB/s
effective_io_concurrency=8;  => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100%
effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)

I think one needs to be sure to restart, so that the autovacuum workers get the new effective_io_concurrency; maybe that's a documentation thing only.

I have just one question: with this patch the strace I/O of analyze looks like below:

pread(72, ..., 8192, 738181120) = 8192
fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, "..., 8192, 738500608) = 8192
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, .., 8192, 738607104) = 8192
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, ..., 8192, 738754560) = 8192
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0

If you highlight "738754560" in the
output it appears to duplicate the syscalls issued until it preads() - in case of "738754560" offset it was asked for 3 times. Also I wouldn't imagine in wildest dreams that posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.

-J.
On Wed, Nov 04, 2020 at 09:07:59AM +0000, Jakub Wartak wrote: > >Greetings Stephen, > >I saw up 410MB/s for a few seconds with this patch on NVMe, and that's >huge ~5.2x improvement which is amazing for a such simple patch. > >The system and data was identical like last time, so results are directly comparable >to the previous post. The only change is that I've applied Yours patch on top of >REL12_STABLE as this is cluster where I was having original data. I think it deserves > it's own commitfest entry, right? Or do we combine those two? > >Each test was conducted with >1. alter system set effective_io_concurrency=$N; >2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart; >3. analyze verbose t; >4. the measurements were observed by pidstat -d and iostat -dxm. > >no readahead (blockdev --setra 0), cold cache: >effective_io_concurrency=64; => ~26MB/s >effective_io_concurrency=32; => ~29MB/s >effective_io_concurrency=8; => ~31MB/s >effective_io_concurrency=4; => ~32MB/s >effective_io_concurrency=1; => ~32MB/s >effective_io_concurrency=0; => ~32MB/s > >readahead=256 (128kB, Linux defaults), cold cache: >effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s , baseline) >effective_io_concurrency=1; => ~62MB/s >effective_io_concurrency=4; => ~370MB/s >effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100% >effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s) > Thanks for the testing, those are indeed quite promising numbers for such a simple patch. I wonder how would this behave on a different storage - I'll see if I can do some testing, but I guess it should be the same as for other places that do prefetching. >I think one needs to be sure to restart, so that autovacuum >workers get the new effective_io_concurrency, maybe that's a >documentation thing only. 
> >I have just one question, with this patch the strace I/O of analyze >look like below: > >pread(72, ..., 8192, 738181120) = 8192 >fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0 >pread(72, "..., 8192, 738500608) = 8192 >fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0 >pread(72, .., 8192, 738607104) = 8192 >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0 >pread(72, ..., 8192, 738754560) = 8192 >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0 >fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0 > >If you highlight "738754560" in the output it appears to duplicate the >syscalls issued until it preads() - in case of "738754560" offset it was >asked for 3 times. Also I wouldn't imagine in wildest dreams that >posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall. > IMHO that'a a bug in the patch, which always tries to prefetch all "future" blocks, including those that were already prefetched. It probably needs to do something like bitmap heap scan where we track what was already prefetched and only issue the new blocks. BTW it seems your e-mail client does something funny, stripping the "references" headers, which breaks threading and makes following the discussion very hard (and will confuse the CF app too). If you look at pgsql-hackers archives, each of your responses starts a new thread. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On Wed, Nov 04, 2020 at 09:07:59AM +0000, Jakub Wartak wrote: > >I saw up 410MB/s for a few seconds with this patch on NVMe, and that's > >huge ~5.2x improvement which is amazing for a such simple patch. Nice! > >The system and data was identical like last time, so results are directly comparable > >to the previous post. The only change is that I've applied Yours patch on top of > >REL12_STABLE as this is cluster where I was having original data. I think it deserves > >it's own commitfest entry, right? Or do we combine those two? > > > >Each test was conducted with > >1. alter system set effective_io_concurrency=$N; > >2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart; > >3. analyze verbose t; > >4. the measurements were observed by pidstat -d and iostat -dxm. > > > >no readahead (blockdev --setra 0), cold cache: > >effective_io_concurrency=64; => ~26MB/s > >effective_io_concurrency=32; => ~29MB/s > >effective_io_concurrency=8; => ~31MB/s > >effective_io_concurrency=4; => ~32MB/s > >effective_io_concurrency=1; => ~32MB/s > >effective_io_concurrency=0; => ~32MB/s I'm speculating here, but it seems likely that by disabling read-ahead, the posix_fadvise() calls are basically getting ignored and that's why there's not much difference here...? > >readahead=256 (128kB, Linux defaults), cold cache: > >effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s , baseline) > >effective_io_concurrency=1; => ~62MB/s > >effective_io_concurrency=4; => ~370MB/s > >effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100% > >effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s) Very nice. > Thanks for the testing, those are indeed quite promising numbers for > such a simple patch. I wonder how would this behave on a different > storage - I'll see if I can do some testing, but I guess it should be > the same as for other places that do prefetching. That would certainly be cool. > >I think one needs to be sure to restart, so that autovacuum > >workers get the new effective_io_concurrency, maybe that's a > >documentation thing only. 
> > > >I have just one question, with this patch the strace I/O of analyze > >look like below: > > > >pread(72, ..., 8192, 738181120) = 8192 > >fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0 > >pread(72, "..., 8192, 738500608) = 8192 > >fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0 > >pread(72, .., 8192, 738607104) = 8192 > >fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0 > >pread(72, ..., 8192, 738754560) = 8192 > >fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0 > >fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0 > > > >If you highlight "738754560" in the output it appears to duplicate the > >syscalls issued until it preads() - in case of "738754560" offset it was > >asked for 3 times. Also I wouldn't imagine in wildest dreams that > >posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall. > > IMHO that'a a bug in the patch, which always tries to prefetch all > "future" blocks, including those that were already prefetched. It > probably needs to do something like bitmap heap scan where we track > what was already prefetched and only issue the new blocks. Yeah, this was just something quickly thrown together to see if it'd help. I'll clean it up by, as you say, doing something similar to what we do with bitmap heap scan, by having a separate BlockSampler that's set up early on and then have the pre-fetching keep pace with the actual scan instead of posix_fadvise()'ing the same pages over and over. I'll also look at doing something for VACUUM too, as that could also benefit from posix_fadvise() calls, when we're skipping over all-frozen pages. Thanks! Stephen
Attachment
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> > If you highlight "738754560" in the output it appears to duplicate the syscalls issued until it preads() - in the case of the "738754560" offset it was asked for 3 times. Also, I wouldn't have imagined in my wildest dreams that posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
>
> IMHO that's a bug in the patch, which always tries to prefetch all "future" blocks, including those that were already prefetched. It probably needs to do something like bitmap heap scan, where we track what was already prefetched and only issue the new blocks.

Updated patch attached, which:

- Starts out by pre-fetching the first effective_io_concurrency number of blocks we are going to want (a rough sketch of this priming step follows below), hopefully making it so the kernel will trust our fadvises over its own read-ahead, right from the start.
- Makes sure the prefetch iterator is pushed forward whenever the regular iterator is moved forward.
- After each page read, issues a prefetch, similar to BitmapHeapScan, to hopefully avoid having the prefetching get in the way of the regular I/O.
- Added some comments, ran pgindent, added a commit message.

I do think we should also include the patch that Jakub wrote previously, which adds information about the read rate of ANALYZE.

I'll look at integrating that into this patch and then look at a new patch to do something similar for VACUUM in a bit.

If you're doing further benchmarking of ANALYZE though, this would probably be the better patch to use. Certainly improved performance here quite a bit with effective_io_concurrency set to 16.

Thanks, Stephen
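As a rough illustration of the priming step in the first bullet above (a sketch under assumptions, not the patch itself: prefetch_bs is a second BlockSampler mirroring the main one, and totalblocks, targrows and randseed are assumed to be the values already used for the main sampler in acquire_sample_rows()):

    #ifdef USE_PREFETCH
        /* second sampler, seeded identically, so it yields the same block sequence */
        BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);

        /* announce the first effective_io_concurrency sample blocks up front */
        for (int i = 0;
             i < effective_io_concurrency && BlockSampler_HasMore(&prefetch_bs);
             i++)
            PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM,
                           BlockSampler_Next(&prefetch_bs));
    #endif

After this, the main sampling loop keeps advancing prefetch_bs in lockstep with the main sampler, so the kernel always has roughly effective_io_concurrency blocks' worth of hints outstanding.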
Attachment
Hi, On 11/4/20 5:02 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >>> If you highlight "738754560" in the output it appears to duplicate the >>> syscalls issued until it preads() - in case of "738754560" offset it was >>> asked for 3 times. Also I wouldn't imagine in wildest dreams that >>> posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall. >> >> IMHO that'a a bug in the patch, which always tries to prefetch all >> "future" blocks, including those that were already prefetched. It >> probably needs to do something like bitmap heap scan where we track >> what was already prefetched and only issue the new blocks. > > Updated patch attached which: > > - Starts out by pre-fetching the first effective_io_concurrency number > of blocks we are going to want, hopefully making it so the kernel will > trust our fadvise's over its own read-ahead, right from the start. > - Makes sure the prefetch iterator is pushed forward whenever the > regular interator is moved forward. > - After each page read, issues a prefetch, similar to BitmapHeapScan, to > hopefully avoiding having the prefetching get in the way of the > regular i/o. > - Added some comments, ran pgindent, added a commit message. > Nice, that was quick ;-) > I do think we should also include patch that Jakub wrote previously > which adds information about the read rate of ANALYZE. > +1 > I'll look at integrating that into this patch and then look at a new > patch to do something similar for VACUUM in a bit. > +1 > If you're doing further benchmarking of ANALYZE though, this would > probably be the better patch to use. Certainly improved performance > here quite a bit with effective_io_concurrency set to 16. > Yeah. I'd expect this to be heavily dependent on hardware. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 11/4/20 5:02 PM, Stephen Frost wrote: > >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > >>>If you highlight "738754560" in the output it appears to duplicate the > >>>syscalls issued until it preads() - in case of "738754560" offset it was > >>>asked for 3 times. Also I wouldn't imagine in wildest dreams that > >>>posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall. > >> > >>IMHO that'a a bug in the patch, which always tries to prefetch all > >>"future" blocks, including those that were already prefetched. It > >>probably needs to do something like bitmap heap scan where we track > >>what was already prefetched and only issue the new blocks. > > > >Updated patch attached which: > > > >- Starts out by pre-fetching the first effective_io_concurrency number > > of blocks we are going to want, hopefully making it so the kernel will > > trust our fadvise's over its own read-ahead, right from the start. > >- Makes sure the prefetch iterator is pushed forward whenever the > > regular interator is moved forward. > >- After each page read, issues a prefetch, similar to BitmapHeapScan, to > > hopefully avoiding having the prefetching get in the way of the > > regular i/o. > >- Added some comments, ran pgindent, added a commit message. > > Nice, that was quick ;-) :) > >I do think we should also include patch that Jakub wrote previously > >which adds information about the read rate of ANALYZE. > > +1 Attached is an updated patch which updates the documentation and integrates Jakub's initial work on improving the logging around auto-analyze (and I made the logging in auto-vacuum more-or-less match it). > >I'll look at integrating that into this patch and then look at a new > >patch to do something similar for VACUUM in a bit. > > +1 I spent some time looking into this but it's a bit complicated.. For some sound reasons, VACUUM will avoid skipping through a table when there's only a few pages that it could skip (not skipping allows us to move forward the relfrozenxid). That said, perhaps we could start doing prefetching once we've decided that we're skipping. We'd need to think about if we have to worry about the VM changing between the pre-fetching and the time when we're actually going to ask for the page.. I don't *think* that's an issue because only VACUUM would be changing the pages to be all-frozen or all-visible, and so if we see a page that isn't one of those then we're going to want to visit that page and that's not going to change, and we probably don't need to care about a page that used to be all-frozen and now isn't during this run- but if the prefetch went ahead and got page 10, and now page 8 is not-all-frozen and the actual scan is at page 5, then maybe it wants page 8 next and that isn't what we pre-fetched... Anyhow, all-in-all, definitely more complicated and probably best considered and discussed independently. > >If you're doing further benchmarking of ANALYZE though, this would > >probably be the better patch to use. Certainly improved performance > >here quite a bit with effective_io_concurrency set to 16. > > Yeah. I'd expect this to be heavily dependent on hardware. Sure, I agree with that too. Thanks, Stephen
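To make the VACUUM idea above slightly more concrete, here is a purely speculative sketch (not a patch; the prefetch distance handling and the names are assumptions) of prefetching ahead of a skipping VACUUM by consulting the visibility map, so only pages the scan will actually visit get announced to the kernel:

    #include "postgres.h"
    #include "access/visibilitymap.h"   /* visibilitymap_get_status(), VISIBILITYMAP_ALL_FROZEN */
    #include "storage/bufmgr.h"         /* PrefetchBuffer() */
    #include "utils/rel.h"

    /*
     * Look at most prefetch_distance blocks ahead of next_block and prefetch
     * those that are not all-frozen, i.e. the ones the skipping scan would
     * still read.  vmbuffer is the caller's visibility map buffer, as used
     * in the heap vacuum code.
     */
    static void
    vacuum_prefetch_ahead(Relation onerel, BlockNumber next_block,
                          BlockNumber nblocks, int prefetch_distance,
                          Buffer *vmbuffer)
    {
        for (BlockNumber blk = next_block;
             blk < nblocks && blk < next_block + prefetch_distance;
             blk++)
        {
            uint8       vmstatus = visibilitymap_get_status(onerel, blk, vmbuffer);

            /* the scan will skip all-frozen pages, so don't prefetch them */
            if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) != 0)
                continue;

            PrefetchBuffer(onerel, MAIN_FORKNUM, blk);
        }
    }

Whether the visibility map can change between the prefetch and the actual read is exactly the question raised above; this sketch simply assumes that a stale prefetch costs nothing worse than one wasted fadvise.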
Attachment
On 11/9/20 7:06 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> On 11/4/20 5:02 PM, Stephen Frost wrote: >>> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >>>>> If you highlight "738754560" in the output it appears to duplicate the >>>>> syscalls issued until it preads() - in case of "738754560" offset it was >>>>> asked for 3 times. Also I wouldn't imagine in wildest dreams that >>>>> posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall. >>>> >>>> IMHO that'a a bug in the patch, which always tries to prefetch all >>>> "future" blocks, including those that were already prefetched. It >>>> probably needs to do something like bitmap heap scan where we track >>>> what was already prefetched and only issue the new blocks. >>> >>> Updated patch attached which: >>> >>> - Starts out by pre-fetching the first effective_io_concurrency number >>> of blocks we are going to want, hopefully making it so the kernel will >>> trust our fadvise's over its own read-ahead, right from the start. >>> - Makes sure the prefetch iterator is pushed forward whenever the >>> regular interator is moved forward. >>> - After each page read, issues a prefetch, similar to BitmapHeapScan, to >>> hopefully avoiding having the prefetching get in the way of the >>> regular i/o. >>> - Added some comments, ran pgindent, added a commit message. >> >> Nice, that was quick ;-) > > :) > >>> I do think we should also include patch that Jakub wrote previously >>> which adds information about the read rate of ANALYZE. >> >> +1 > > Attached is an updated patch which updates the documentation and > integrates Jakub's initial work on improving the logging around > auto-analyze (and I made the logging in auto-vacuum more-or-less match > it). > Thanks. I'll do some testing/benchmarking once my machines are free, in a couple days perhaps. But as I said before, I don't expect this to behave very differently from other places that already do prefetching. >>> I'll look at integrating that into this patch and then look at a new >>> patch to do something similar for VACUUM in a bit. >> >> +1 > > I spent some time looking into this but it's a bit complicated.. For > some sound reasons, VACUUM will avoid skipping through a table when > there's only a few pages that it could skip (not skipping allows us to > move forward the relfrozenxid). That said, perhaps we could start doing > prefetching once we've decided that we're skipping. We'd need to think > about if we have to worry about the VM changing between the pre-fetching > and the time when we're actually going to ask for the page.. I don't > *think* that's an issue because only VACUUM would be changing the pages > to be all-frozen or all-visible, and so if we see a page that isn't one > of those then we're going to want to visit that page and that's not > going to change, and we probably don't need to care about a page that > used to be all-frozen and now isn't during this run- but if the prefetch > went ahead and got page 10, and now page 8 is not-all-frozen and the > actual scan is at page 5, then maybe it wants page 8 next and that isn't > what we pre-fetched... > > Anyhow, all-in-all, definitely more complicated and probably best > considered and discussed independently> +1 FWIW I wonder if this should be tracked separately in the CF app, as it's very different from the original "add some logging" patch, which makes the CF entry rather misleading. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > Thanks. I'll do some testing/benchmarking once my machines are free, in > a couple days perhaps. But as I said before, I don't expect this to > behave very differently from other places that already do prefetching. Agreed, but would still be nice to see test results beyond just what I've done. > FWIW I wonder if this should be tracked separately in the CF app, as > it's very different from the original "add some logging" patch, which > makes the CF entry rather misleading. I've gone ahead and updated the CF entry for this to hopefully make it clearer for those interested in looking at it. I'll try to come back to this in the next CF, ideally we'd at least get someone else to take a look at the code beyond me. :) (Obviously, you looked at it some, but wasn't really clear if you were alright with it or if you felt it needed more review.) Thanks! Stephen
Attachment
Greetings,

* Stephen Frost (sfrost@snowman.net) wrote:
> * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
> > Thanks. I'll do some testing/benchmarking once my machines are free, in a couple days perhaps. But as I said before, I don't expect this to behave very differently from other places that already do prefetching.
>
> Agreed, but would still be nice to see test results beyond just what I've done.

Any chance you were able to run those tests?

> > FWIW I wonder if this should be tracked separately in the CF app, as it's very different from the original "add some logging" patch, which makes the CF entry rather misleading.
>
> I've gone ahead and updated the CF entry for this to hopefully make it clearer for those interested in looking at it. I'll try to come back to this in the next CF; ideally we'd at least get someone else to take a look at the code beyond me. :) (Obviously, you looked at it some, but it wasn't really clear if you were alright with it or if you felt it needed more review.)

I've updated the patch to leverage Tom's introduction of TimestampDifferenceMilliseconds, which simplifies things a bit (and I don't think we need to worry about an analyze taking over 25 days...), and generally rebased this up to current HEAD.

Would be great to get a review / comments from others as to whether there are any concerns. I'll admit that it seems reasonably straight-forward to me, but hey, I wrote most of it, so that's not really a fair assessment... ;)

Thanks, Stephen
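For context, a sketch of how the read-rate logging can use TimestampDifferenceMilliseconds(). This is an illustration rather than the patch itself: starttime is assumed to be a TimestampTz captured before sampling, and total_blks_read an assumed long counter taken from a pgBufferUsage delta around the sampling loop.

    TimestampTz endtime = GetCurrentTimestamp();
    long        delay_in_ms;
    double      read_rate = 0;

    /* wall-clock time spent in the ANALYZE, in milliseconds */
    delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime);

    if (delay_in_ms > 0)
        read_rate = (double) BLCKSZ * total_blks_read / (1024 * 1024) /
            (delay_in_ms / 1000.0);

    ereport(LOG,
            (errmsg("automatic analyze of table \"%s\": avg read rate: %.3f MB/s (elapsed: %ld ms)",
                    RelationGetRelationName(onerel),
                    read_rate, delay_in_ms)));

Since TimestampDifferenceMilliseconds() returns a long number of milliseconds, the overflow concern mentioned above only arises for runs longer than roughly 25 days on platforms with 32-bit long.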
Attachment
On 13/01/2021 23:17, Stephen Frost wrote:
> Would be great to get a review / comments from others as to whether there are any concerns. I'll admit that it seems reasonably straight-forward to me, but hey, I wrote most of it, so that's not really a fair assessment... ;)

Looks good overall. A few minor comments:

The patch consists of two parts: add stats to the log for auto-analyze, and implement prefetching. They seem like independent features; consider splitting into two patches.

It's a bit weird that you get more stats in the log for autovacuum/autoanalyze than you get with VACUUM/ANALYZE VERBOSE. Not really this patch's fault, though.

This conflicts with the patch at https://commitfest.postgresql.org/31/2907/, to refactor the table AM analyze API. That's OK, it's straightforward to resolve regardless of which patch is committed first.

>     /* Outer loop over blocks to sample */
>     while (BlockSampler_HasMore(&bs))
>     {
> #ifdef USE_PREFETCH
>         BlockNumber prefetch_targblock = InvalidBlockNumber;
> #endif
>         BlockNumber targblock = BlockSampler_Next(&bs);
>
> #ifdef USE_PREFETCH
>
>         /*
>          * Make sure that every time the main BlockSampler is moved forward
>          * that our prefetch BlockSampler also gets moved forward, so that we
>          * always stay out ahead.
>          */
>         if (BlockSampler_HasMore(&prefetch_bs))
>             prefetch_targblock = BlockSampler_Next(&prefetch_bs);
> #endif
>
>         vacuum_delay_point();
>
>         if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
>             continue;
>
> #ifdef USE_PREFETCH
>
>         /*
>          * When pre-fetching, after we get a block, tell the kernel about the
>          * next one we will want, if there's any left.
>          */
>         if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber)
>             PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
> #endif
>
>         while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
>         {
>             ...
>         }
>
>         pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
>                                      ++blksdone);
>     }

If effective_io_concurrency == 0, this calls BlockSampler_Next(&prefetch_bs) anyway, which is a waste of cycles.

If table_scan_analyze_next_block() returns false, we skip the PrefetchBuffer() call. That seems wrong.

Is there any potential harm from calling PrefetchBuffer() on a page that table_scan_analyze_next_block() later deems unsuitable for sampling and returns false? That's theoretical at the moment, because heapam_scan_analyze_next_block() always returns true. (The tableam ANALYZE API refactor patch will make this moot, as it moves this logic into the tableam's implementation, so the implementation can do whatever makes sense for the particular AM.)

- Heikki
Greetings, * Heikki Linnakangas (hlinnaka@iki.fi) wrote: > On 13/01/2021 23:17, Stephen Frost wrote: > >Would be great to get a review / comments from others as to if there's > >any concerns. I'll admit that it seems reasonably straight-forward to > >me, but hey, I wrote most of it, so that's not really a fair > >assessment... ;) > > Look good overall. A few minor comments: Thanks a lot for the review! > The patch consists of two part: add stats to the log for auto-analyze, and > implement prefetching. They seem like independent features, consider > splitting into two patches. Yeah, that's a good point. I had anticipated that there would be overlap but in the end there really wasn't. Done in the attached. > It's a bit weird that you get more stats in the log for > autovacuum/autoanalyze than you get with VACUUM/ANALYZE VERBOSE. Not really > this patch's fault though. Agreed. > This conflicts with the patch at https://commitfest.postgresql.org/31/2907/, > to refactor the table AM analyze API. That's OK, it's straightforward to > resolve regardless of which patch is committed first. Agreed. > > /* Outer loop over blocks to sample */ > > while (BlockSampler_HasMore(&bs)) > > { > >#ifdef USE_PREFETCH > > BlockNumber prefetch_targblock = InvalidBlockNumber; > >#endif > > BlockNumber targblock = BlockSampler_Next(&bs); > > > >#ifdef USE_PREFETCH > > > > /* > > * Make sure that every time the main BlockSampler is moved forward > > * that our prefetch BlockSampler also gets moved forward, so that we > > * always stay out ahead. > > */ > > if (BlockSampler_HasMore(&prefetch_bs)) > > prefetch_targblock = BlockSampler_Next(&prefetch_bs); > >#endif > > > > vacuum_delay_point(); > > > > if (!table_scan_analyze_next_block(scan, targblock, vac_strategy)) > > continue; > > > >#ifdef USE_PREFETCH > > > > /* > > * When pre-fetching, after we get a block, tell the kernel about the > > * next one we will want, if there's any left. > > */ > > if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber) > > PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock); > >#endif > > > > while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot)) > > { > > ... > > } > > > > pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE, > > ++blksdone); > > } > > If effective_io_concurrency == 0, this calls BlockSampler_Next(&prefetch_bs) > anyway, which is a waste of cycles. Good point, fixed. > If table_scan_analyze_next_block() returns false, we skip the > PrefetchBuffer() call. That seem wrong. Agreed, fixed. > Is there any potential harm from calling PrefetchBuffer() on a page that > table_scan_analyze_next_block() later deems as unsuitable for smapling and > returns false? That's theoretical at the moment, because > heapam_scan_analyze_next_block() always returns true. (The tableam ANALYZE > API refactor patch will make this moot, as it moves this logic into the > tableam's implementation, so the implementation can do whatever make sense > for the particular AM.) I can't see any potential harm and it seems pretty likely that if an heapam_scan_analyze_next_block()-equivilant were to decide that a block isn't appropriate to analyze it'd have to do so after reading that block anyway, making the prefetch still useful. 
Perhaps there'll be a case in the future where a given AM would know, based on just the block number, that it isn't useful to analyze, in which case it'd make sense to adjust the code to skip that block for both prefetching and actually reading, but I don't think that would be too hard to do. It doesn't seem sensible to invent that in advance of actually having such a case, though; it's certainly not the case for the heap AM today, at least, as you say.

Unless there's anything else on this, I'll commit these sometime next week.

Thanks again for the review!

Stephen
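For reference, a rough sketch (not the committed code; prefetch_maximum and the surrounding variables are assumed names) of how the loop body could look once both of Heikki's points are addressed: the prefetch sampler is only advanced when prefetching is enabled, and the prefetch is issued before the next_block check so a rejected block can no longer skip it.

    #ifdef USE_PREFETCH
            /* only touch the prefetch sampler when prefetching is enabled */
            if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
                prefetch_targblock = BlockSampler_Next(&prefetch_bs);
    #endif

            vacuum_delay_point();

    #ifdef USE_PREFETCH
            /* announce the upcoming block before deciding anything about this one */
            if (prefetch_targblock != InvalidBlockNumber)
                PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
    #endif

            if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
                continue;

When prefetch_maximum is zero, prefetch_targblock simply stays InvalidBlockNumber, so neither the sampler advance nor the PrefetchBuffer() call happens.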
Attachment
On 05/02/2021 23:22, Stephen Frost wrote:
> Unless there's anything else on this, I'll commit these sometime next week.

One more thing: instead of using the 'effective_io_concurrency' GUC directly, this should call get_tablespace_maintenance_io_concurrency().

- Heikki
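Concretely, the change being suggested presumably boils down to something like the following (sketch only; prefetch_maximum is an assumed variable name):

    /* get_tablespace_maintenance_io_concurrency() is declared in utils/spccache.h */
    prefetch_maximum =
        get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);

That way a per-tablespace maintenance_io_concurrency setting is honoured, and in its absence the function falls back to the maintenance_io_concurrency GUC, which is the more appropriate default for maintenance work than effective_io_concurrency.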
Greetings, * Heikki Linnakangas (hlinnaka@iki.fi) wrote: > On 05/02/2021 23:22, Stephen Frost wrote: > >Unless there's anything else on this, I'll commit these sometime next > >week. > > One more thing: Instead of using 'effective_io_concurrency' GUC directly, > should call get_tablespace_maintenance_io_concurrency(). Ah, yeah, of course. Updated patch attached. Thanks! Stephen
Attachment
Hi, On 11.02.2021 01:10, Stephen Frost wrote: > Greetings, > > * Heikki Linnakangas (hlinnaka@iki.fi) wrote: >> On 05/02/2021 23:22, Stephen Frost wrote: >>> Unless there's anything else on this, I'll commit these sometime next >>> week. >> One more thing: Instead of using 'effective_io_concurrency' GUC directly, >> should call get_tablespace_maintenance_io_concurrency(). > Ah, yeah, of course. > > Updated patch attached. I happened to notice that get_tablespace_io_concurrency() is called instead of get_tablespace_maintenance_io_concurrency(). It doesn't look right, no? Regards, Egor Rogov.
Greetings, * Egor Rogov (e.rogov@postgrespro.ru) wrote: > On 11.02.2021 01:10, Stephen Frost wrote: > >* Heikki Linnakangas (hlinnaka@iki.fi) wrote: > >>On 05/02/2021 23:22, Stephen Frost wrote: > >>>Unless there's anything else on this, I'll commit these sometime next > >>>week. > >>One more thing: Instead of using 'effective_io_concurrency' GUC directly, > >>should call get_tablespace_maintenance_io_concurrency(). > >Ah, yeah, of course. > > > >Updated patch attached. > > I happened to notice that get_tablespace_io_concurrency() is called instead > of get_tablespace_maintenance_io_concurrency(). It doesn't look right, no? Hah, indeed you're right. Will fix. Thanks! Stephen
Attachment
Greetings, * Stephen Frost (sfrost@snowman.net) wrote: > * Egor Rogov (e.rogov@postgrespro.ru) wrote: > > On 11.02.2021 01:10, Stephen Frost wrote: > > >* Heikki Linnakangas (hlinnaka@iki.fi) wrote: > > >>On 05/02/2021 23:22, Stephen Frost wrote: > > >>>Unless there's anything else on this, I'll commit these sometime next > > >>>week. > > >>One more thing: Instead of using 'effective_io_concurrency' GUC directly, > > >>should call get_tablespace_maintenance_io_concurrency(). > > >Ah, yeah, of course. > > > > > >Updated patch attached. > > > > I happened to notice that get_tablespace_io_concurrency() is called instead > > of get_tablespace_maintenance_io_concurrency(). It doesn't look right, no? > > Hah, indeed you're right. Will fix. Found this under a bit of a pile in my todo list. :) Fix pushed. Thanks! Stephen