
From Jakub Wartak
Subject Re: automatic analyze: readahead - add "IO read time" log message
Date
Msg-id VI1PR0701MB696040EA07BC742E18313060F6110@VI1PR0701MB6960.eurprd07.prod.outlook.com
In response to automatic analyze: readahead - add "IO read time" log message  (Jakub Wartak <Jakub.Wartak@tomtom.com>)
Responses Re: automatic analyze: readahead - add "IO read time" log message  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
Hi Stephen, hackers,

>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.
>> I've done some quick&dirty tests with blockdev --setra/setfra 0 after spending time looking at the smgr/md/fd API
>> changes required to find a shortcut, but I'm actually getting a little bit worse timings, at least on "laptop DB tests".
>> One thing that I've noticed is that it needs to be done only for automatic-analyze, but not for automatic-vacuum,
>> where apparently there is some boost due to readahead.

>Interesting that you weren't seeing any benefit to disabling readahead.

I've got some free minutes and have repeated the exercise in a more realistic
and stricter environment than the previous one, to conclude that the current situation is preferable:

The analyzed table was 171GB (as reported by \dt+) + indexes: 35GB, 147GB, 35GB, 65GB (as reported by \di+).
Linux kernel 4.14.x, 2x NVMe under dm-0 (it might matter, as /dev/dm-0 is a different layer and might have different
storage settings), VG on top of dm-0, LV with stripe-size 8kB, ext4.
s_b=128MB, RAM=128GB (minus ~30GB reserved for HugePages), typical output of PgSQL12:
INFO:  "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows and 0 dead rows; 1500000 rows in sample,
1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB); most of the time is spent in heapam_scan_analyze_next_block() -> ..
-> pread(), which gives ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average request
size per IO = 25 sectors (*512/1024 = ~12kB), readahead on, hot caches, total elapsed ~3m
Run1: Defaults, similar to the above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar to the above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, similar to the above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s read, ondemand_readahead visible in perf,
total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read
IOPS @ avgrq-sz = 11 sectors (*512 = 5.5kB), total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, double-confirmed no
readaheads [ pread() -> generic_file_read_iter() -> ext4_mpage_readpages() -> bio.. ], total elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: as above, total elapsed 5m55s
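
For what it's worth, the per-fd equivalent of blockdev --setra 0 can be tried outside PostgreSQL with a trivial
standalone pread() loop over every Nth 8kB block of a large file: posix_fadvise(POSIX_FADV_RANDOM) switches kernel
readahead off for just that file descriptor, which is roughly what a change at the smgr/md/fd level would amount to.
This is only a sketch for experimenting - the stride and timing below are made up and are not what ANALYZE does
internally:

/* skipread.c - pread() every Nth 8kB block of a file, optionally with
 * POSIX_FADV_RANDOM to switch kernel readahead off for this fd only.
 * Build: gcc -O2 -o skipread skipread.c
 * Usage: ./skipread /path/to/bigfile [norandom]
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL block size */
#define STRIDE 15               /* made up: read every 15th block */

int
main(int argc, char **argv)
{
    int         fd;
    struct stat st;
    char       *buf = malloc(BLCKSZ);
    struct timespec t0, t1;
    long long   bytes = 0;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s file [norandom]\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0)
    {
        perror(argv[1]);
        return 1;
    }

    /* Unless "norandom" is given, declare the access pattern random,
     * which disables kernel readahead for this file descriptor. */
    if (argc < 3 || strcmp(argv[2], "norandom") != 0)
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (off_t blk = 0; blk < st.st_size / BLCKSZ; blk += STRIDE)
    {
        ssize_t     n = pread(fd, buf, BLCKSZ, blk * BLCKSZ);

        if (n < 0)
        {
            perror("pread");
            return 1;
        }
        bytes += n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("read %lld bytes in %.2f s\n", bytes,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    free(buf);
    close(fd);
    return 0;
}

Running it against a copy of a large relation file after echo 3 > drop_caches, once with and once without "norandom",
should make it easy to compare cold-cache timings with and without readahead, similar in spirit to Run4-Run9 above.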

One thing that is not clear here, and might be worth measuring in the future, is how striped LVs are
affected by readahead.

>Were you able to see where the time in the kernel was going when
>readahead was turned off for the ANALYZE?

Yes, my interpretation is that the time is spent waiting directly in the block I/O layer.

    54.67%     1.33%  postgres  postgres            [.] FileRead
            ---FileRead
                --53.33%--__pread_nocancel
                           --50.67%--entry_SYSCALL_64_after_hwframe
                                     do_syscall_64
                                     sys_pread64
                                     |--49.33%--vfs_read
                                     |   --48.00%--__vfs_read
                                     |     |--45.33%--generic_file_read_iter
                                     |     |  |--42.67%--ondemand_readahead
                                     |     |  |  __do_page_cache_readahead
                                     |     |  |  |--25.33%--ext4_mpage_readpages
                                     |     |  |  |  |--10.67%--submit_bio
                                     |     |  |  |  |  generic_make_request
                                     |     |  |  |  |  |--8.00%--blk_mq_make_request
                                     |     |  |  |  |  |  |--4.00%--blk_mq_get_request
                                     |     |  |  |  |  |  |  |--1.33%--blk_mq_get_tag
                                     |     |  |  |  |  |  |   --1.33%--sched_clock
                                     |     |  |  |  |  |  |     xen_sched_clock
                                     |     |  |  |  |  |  |     pvclock_clocksource_read
                                     |     |  |  |  |  |  |--1.33%--bio_integrity_prep
                                     |     |  |  |  |  |   --1.33%--blk_account_io_start
                                     |     |  |  |  |  |     part_round_stats
                                     |     |  |  |  |  |     blk_mq_in_flight
                                     |     |  |  |  |  |     blk_mq_queue_tag_busy_iter
                                     |     |  |  |  |   --2.67%--dm_make_request
                                     |     |  |  |  |     __split_and_process_bio
                                     |     |  |  |  |     __split_and_process_non_flush
                                     |     |  |  |  |     |--1.33%--__map_bio
                                     |     |  |  |  |     |  generic_make_request
                                     |     |  |  |  |     |  generic_make_request_checks
                                     |     |  |  |  |     |  percpu_counter_add_batch
                                     |     |  |  |  |      --1.33%--bio_alloc_bioset
                                     |     |  |  |  |        mempool_alloc
                                     |     |  |  |  |        kmem_cache_alloc
                                     |     |  |  |  |--6.67%--ext4_map_blocks
                                     |     |  |  |  |  |--4.00%--ext4_es_lookup_extent
                                     |     |  |  |  |  |   --2.67%--_raw_read_lock
                                     |     |  |  |  |   --2.67%--__check_block_validity.constprop.81
                                     |     |  |  |  |     ext4_data_block_valid
                                     |     |  |  |   --6.67%--add_to_page_cache_lru
                                     |     |  |  |     |--4.00%--__add_to_page_cache_locked
                                     |     |  |  |     |   --1.33%--mem_cgroup_try_charge
                                     |     |  |  |     |     get_mem_cgroup_from_mm
                                     |     |  |  |      --2.67%--__lru_cache_add
                                     |     |  |  |        pagevec_lru_move_fn
                                     |     |  |  |        __lock_text_start
                                     |     |  |  |--12.00%--blk_finish_plug
                                     |     |  |  |  blk_flush_plug_list
                                     |     |  |  |  blk_mq_flush_plug_list
                                     |     |  |  |  |--10.67%--__blk_mq_delay_run_hw_queue
                                     |     |  |  |  |  __blk_mq_run_hw_queue
                                     |     |  |  |  |  blk_mq_sched_dispatch_requests
                                     |     |  |  |  |   --9.33%--blk_mq_dispatch_rq_list
                                     |     |  |  |  |     nvme_queue_rq
                                     |     |  |  |   --1.33%--blk_mq_start_request




>The VACUUM case is going to be complicated by what's in the visibility
>map. (..)

After observing the readahead benefit for ANALYZE I've abandoned the idea of testing the much more
complex VACUUM processing; clearly Linux read-ahead is beneficial even in simple cases.

>> My only idea would be that a lot of those blocks could be read asynchronously in batches (AIO) with
>> POSIX_FADV_RANDOM issued on the block range beforehand, so maybe the optimization is possible, but not until
>> we'll have AIO ;)
>
> (..)AIO is a whole other animal that's been discussed off and on
>around here but it's a much larger and more invasive change than just
>calling posix_fadvise().

Yes, I'm aware and I'm keeping my fingers crossed that maybe some day....

ANALYZE just seems to be a natural candidate to use it. The only easy way to accelerate stats gathering -
at least from my and the end user's point of view - is to have more parallel autoanalyze workers running to drive
more I/O concurrency (e.g. by partitioning the data), in both the readahead and non-readahead scenarios.
Which is a pity, because 70-80% of such a process sits idle. Readahead might read 10x more unnecessary data,
but pread() doesn't have to wait. <speculation>Once AIO is there, it could issue thousands of requests without
readahead and probably achieve much better efficiency.</speculation>
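
To make that speculation a little more concrete, below is a rough userspace-only sketch (not a patch; the stride and
window size are invented) of the posix_fadvise() middle ground: advise a window of upcoming sample blocks with
POSIX_FADV_WILLNEED before pread()ing them, so the device sees many requests in flight even with readahead disabled.
It is essentially what effective_io_concurrency/PrefetchBuffer() already does for bitmap heap scans, just done by hand:

/* prefetch_sample.c - sketch: advise a window of upcoming sample blocks
 * with POSIX_FADV_WILLNEED before pread()ing them, so the device sees
 * many requests in flight even though readahead is disabled.
 * Build: gcc -O2 -o prefetch_sample prefetch_sample.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192
#define STRIDE 15               /* made up: every 15th block stands in for the sample */
#define WINDOW 64               /* made up: keep 64 sample blocks advised ahead */

int
main(int argc, char **argv)
{
    int         fd;
    off_t       nblocks;
    off_t       next_advise = 0;    /* next sample block to hand to posix_fadvise() */
    char       *buf = malloc(BLCKSZ);

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0)
    {
        perror(argv[1]);
        return 1;
    }

    /* Keep kernel readahead out of the picture for this fd. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    nblocks = lseek(fd, 0, SEEK_END) / BLCKSZ;

    for (off_t blk = 0; blk < nblocks; blk += STRIDE)
    {
        /* Top up the advise window so it stays WINDOW sample blocks ahead. */
        while (next_advise < nblocks &&
               next_advise < blk + (off_t) WINDOW * STRIDE)
        {
            posix_fadvise(fd, next_advise * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
            next_advise += STRIDE;
        }

        /* With luck the block is already in the page cache by now, so the
         * pread() below waits far less than it would cold. */
        if (pread(fd, buf, BLCKSZ, blk * BLCKSZ) < 0)
        {
            perror("pread");
            return 1;
        }
    }

    free(buf);
    close(fd);
    return 0;
}

Whether batching the advice like this would actually beat plain readahead for ANALYZE's sampling pattern is exactly
the thing that would have to be measured; the point is only that the kernel can keep the NVMe queues busy without
speculatively reading the skipped blocks.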

I hope the previous simple patch goes into master and helps other people understand
the picture more easily.

-J.

