Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: BitmapHeapScan streaming read user and prelim refactoring |
| Date | |
| Msg-id | nyzjwfntxx7lcdvzdac2bdz4rhy5cd553mpva2zfqejedu6qvx@66rrj6ob3iyy |
| In response to | Re: BitmapHeapScan streaming read user and prelim refactoring (Dilip Kumar <dilipbalaut@gmail.com>) |
| Responses | Re: BitmapHeapScan streaming read user and prelim refactoring |
| List | pgsql-hackers |
Hi,

Thomas, there's a bit relevant to you at the bottom.

Melanie chatted with me about the performance regressions in Tomas' benchmarks of the patch. I experimented some and I think I found a few interesting pieces.

I looked solely at cyclic, wm=4096, matches=8, eic=16, as I wanted to narrow down what I was looking at as much as possible, and that seemed to be a significant regression.

I was able to reproduce the regression with the patch, although the noise was very high.

My first attempt at reducing the noise was using MAP_POPULATE, as I was seeing a lot of time spent in page-fault-related code, and that helped stabilize the output some.

After that I still couldn't really make sense of the differing performance characteristics from the explain analyze or iostat output, so I switched to profiling. As part of my profiling steps, I:

- bind postgres to a single core
- set the cpu governor to performance
- disable CPU idle just for that core (disabling it for all cores sometimes leads to slowness due to less clock boost headroom)
- try both with turbo boost on/off

With all of those applied, the performance difference almost vanished. Weird, huh!

Normally CPUs only clock down when idle. That includes waiting for IO, if that wait period is long enough. That made me look more at strace, which shows this interesting difference between master and the patch. The chosen excerpts are randomly picked, but the pattern is similar throughout execution of the query.

master:

pread64(52, "\0\0\0\0\260\7|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339271680) = 8192 <0.000010>
fadvise64(52, 339402752, 8192, POSIX_FADV_WILLNEED) = 0 <0.000008>
pread64(52, "\0\0\0\0h\10|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339279872) = 8192 <0.000011>
fadvise64(52, 339410944, 8192, POSIX_FADV_WILLNEED) = 0 <0.000008>
pread64(52, "\0\0\0\0 \t|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339288064) = 8192 <0.000011>
fadvise64(52, 339419136, 8192, POSIX_FADV_WILLNEED) = 0 <0.000011>
pread64(52, "\0\0\0\0\330\t|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339296256) = 8192 <0.000011>
fadvise64(52, 339427328, 8192, POSIX_FADV_WILLNEED) = 0 <0.000008>

With the patch:

fadvise64(52, 332365824, 131072, POSIX_FADV_WILLNEED) = 0 <0.000021>
pread64(52, "\0\0\0\0(\223x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 329220096) = 131072 <0.000161>
pread64(52, "\0\0\0\0\250\236x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 329351168) = 131072 <0.000401>
pread64(52, "\0\0\0\0@\252x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 65536, 329482240) = 65536 <0.000044>
pread64(52, "\0\0\0\0\0\250y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332365824) = 131072 <0.000079>
pread64(52, "\0\0\0\0\200\263y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332496896) = 131072 <0.000336>
fadvise64(52, 335781888, 131072, POSIX_FADV_WILLNEED) = 0 <0.000021>
pread64(52, "\0\0\0\0\0\277y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332627968) = 131072 <0.000091>
pread64(52, "\0\0\0\0\230\312y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332759040) = 131072 <0.000399>
pread64(52, "\0\0\0\0\30\326y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 65536, 332890112) = 65536 <0.000046>
pread64(52, "\0\0\0\0\220\324z\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 335781888) = 131072 <0.000081>
pread64(52, "\0\0\0\0(\340z\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 335912960) = 131072 <0.000335>
fadvise64(52, 339197952, 131072, POSIX_FADV_WILLNEED) = 0 <0.000021>

There are a few interesting observations here:

- The patch does allow us to make larger reads, nice!
- With the patch we do *not* fadvise some of the blocks; the read stream's sequentialness logic prevents it.
- With the patch there are occasional IOs that are *much* slower. E.g. the last pread64 above is a lot slower than the two preceding ones.

Which I think explains what's happening: because we don't fadvise all the time, we have to synchronously wait for some IOs. Those slower IOs give the CPU time to go into an idle state, slowing the whole query down.

If I disable that:

diff --git i/src/backend/storage/aio/read_stream.c w/src/backend/storage/aio/read_stream.c
index e4414b2e915..e58585c4e02 100644
--- i/src/backend/storage/aio/read_stream.c
+++ w/src/backend/storage/aio/read_stream.c
@@ -242,8 +242,9 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 	 * isn't a strictly sequential pattern, then we'll issue advice.
 	 */
 	if (!suppress_advice &&
-		stream->advice_enabled &&
-		stream->pending_read_blocknum != stream->seq_blocknum)
+		stream->advice_enabled
+		// && stream->pending_read_blocknum != stream->seq_blocknum
+		)
 		flags = READ_BUFFERS_ISSUE_ADVICE;
 	else
 		flags = 0;

I see *much* improved times. Without cpu idle tuning:

master:                               111ms
patch:                                128ms
patch, with disabled seq detection:    68ms

Suddenly making the patch a huge win.

ISTM that disabling posix_fadvise() just because of a single sequential IO is too aggressive. A pattern of two sequential reads, a gap of more than a few blocks, two sequential reads, ... will afaict never do useful readahead on Linux, because the kernel needs a page cache access near already-read data to start doing readahead. But that means we encounter synchronous, blocking reads.

Thomas, what led to disabling advice this aggressively? Particularly because heapam.c disables advice altogether, it isn't entirely obvious why that's good. I suspect this might also affect the behaviour in the vacuum read stream case.

I think we'll need to add some logic to the read stream that only disables advice after a longer sequential sequence. Writing logic for that shouldn't be too hard, I think? Determining the concrete cutoffs is probably harder, although I think even fairly simplistic logic will be "good enough".
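
To sketch the shape of what I mean (this is not a tested patch, just an illustration; the seq_run_len field and the cutoff of 4 consecutive sequential reads are made up), read_stream_start_pending_read() could count how long the current sequential run is and keep issuing advice until the run is long enough for kernel readahead to have taken over:

	/*
	 * Illustrative sketch only: seq_run_len would be a new ReadStream field
	 * counting how many consecutive reads continued the previous read
	 * sequentially; the cutoff of 4 is a placeholder that would need
	 * benchmarking.
	 */
	if (stream->pending_read_blocknum == stream->seq_blocknum)
		stream->seq_run_len++;
	else
		stream->seq_run_len = 0;

	/*
	 * Keep issuing advice for random reads and for short sequential runs,
	 * and only rely on kernel readahead once a longer run has been seen.
	 */
	if (!suppress_advice &&
		stream->advice_enabled &&
		stream->seq_run_len < 4)
		flags = READ_BUFFERS_ISSUE_ADVICE;
	else
		flags = 0;

With something like that, the two-reads-then-a-gap pattern above would keep getting fadvises, while a genuinely sequential stream would still stop issuing them after a short ramp-up.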

Greetings,

Andres Freund