From: Andres Freund
Subject: Re: BitmapHeapScan streaming read user and prelim refactoring
Msg-id: nyzjwfntxx7lcdvzdac2bdz4rhy5cd553mpva2zfqejedu6qvx@66rrj6ob3iyy
In response to: Re: BitmapHeapScan streaming read user and prelim refactoring (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers
Hi,

Thomas, there's a bit relevant to you at the bottom.


Melanie chatted with me about the performance regressions in Tomas' benchmarks
of the patch.  I experimented some and I think I found a few interesting
pieces.

I looked solely at cyclic, wm=4096, matches=8, eic=16, as I wanted to
narrow down what I was looking at as much as possible, and because that
seemed to be a significant regression.

I was able to reproduce the regression with the patch, although the noise was
very high.

My first attempt at reducing the noise was using MAP_POPULATE, as I was
seeing a lot of time spent in page-fault related code, and that helped
stabilize the output somewhat.
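
For reference, a minimal sketch of what that flag does (not how the
benchmark actually allocates its memory): MAP_POPULATE makes mmap()
pre-fault the whole mapping up front, so later accesses don't take soft
page faults that would otherwise pollute the profile.

#define _GNU_SOURCE
#include <sys/mman.h>

/* map anonymous shared memory, pre-faulted at mmap() time */
static void *
map_populated(size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
}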

Even with that, I couldn't really make sense of the different perf
characteristics from looking at EXPLAIN ANALYZE or iostat, so I switched
to profiling. As part of my profiling steps, I:

- bind postgres to a single core
- set the cpu governor to performance
- disable CPU idle just for that core (disabling it for all cores sometimes
  leads to slowness due to less clock boost headroom)
- try both with turbo boost on/off

With all of those applied, the performance difference almost vanished. Weird,
huh!


Normally CPUs only clock down when idle. That includes waiting for IO if that
wait period is long enough.


That made me look more closely at strace, which shows an interesting
difference between master and the patch.

The excerpts below are randomly picked, but the pattern is similar
throughout the execution of the query:

master:

pread64(52, "\0\0\0\0\260\7|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339271680) = 8192
<0.000010>
fadvise64(52, 339402752, 8192, POSIX_FADV_WILLNEED) = 0 <0.000008>
pread64(52, "\0\0\0\0h\10|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339279872) = 8192
<0.000011>
fadvise64(52, 339410944, 8192, POSIX_FADV_WILLNEED) = 0 <0.000008>
pread64(52, "\0\0\0\0 \t|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339288064) = 8192
<0.000011>
fadvise64(52, 339419136, 8192, POSIX_FADV_WILLNEED) = 0 <0.000011>
pread64(52, "\0\0\0\0\330\t|\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 8192, 339296256) = 8192
<0.000011>
fadvise64(52, 339427328, 8192, POSIX_FADV_WILLNEED) = 0 <0.000008>


With the patch:

fadvise64(52, 332365824, 131072, POSIX_FADV_WILLNEED) = 0 <0.000021>
pread64(52, "\0\0\0\0(\223x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 329220096) = 131072 <0.000161>
pread64(52, "\0\0\0\0\250\236x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 329351168) = 131072 <0.000401>
pread64(52, "\0\0\0\0@\252x\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 65536, 329482240) = 65536 <0.000044>
pread64(52, "\0\0\0\0\0\250y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332365824) = 131072 <0.000079>
pread64(52, "\0\0\0\0\200\263y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332496896) = 131072 <0.000336>
fadvise64(52, 335781888, 131072, POSIX_FADV_WILLNEED) = 0 <0.000021>
pread64(52, "\0\0\0\0\0\277y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332627968) = 131072 <0.000091>
pread64(52, "\0\0\0\0\230\312y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 332759040) = 131072 <0.000399>
pread64(52, "\0\0\0\0\30\326y\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 65536, 332890112) = 65536 <0.000046>
pread64(52, "\0\0\0\0\220\324z\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 335781888) = 131072 <0.000081>
pread64(52, "\0\0\0\0(\340z\17\0\0\4\0x\0\200\30\0 \4 \0\0\0\0\260\237\222\0`\237\222\0"..., 131072, 335912960) = 131072 <0.000335>
fadvise64(52, 339197952, 131072, POSIX_FADV_WILLNEED) = 0 <0.000021>


There are a few interesting observations here:
- The patch does allow us to make larger reads, nice!

- With the patch we do *not* fadvise some of the blocks; the read
  stream's sequentialness detection prevents it.

- With the patch there are occasional IOs that are *much* slower. E.g. the
  last pread64 is a lot slower than the two preceding ones.

I think that explains what's happening: because we don't fadvise all the
time, we have to synchronously wait for some IOs. During those slower
IOs the CPU has time to go into an idle state, slowing the whole query
down.
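
To spell out the pattern master shows above, here's a rough sketch
(simplified; this is not the actual PostgreSQL prefetch code): advice
for a block farther ahead is issued before each synchronous pread, so
the kernel already has the IO in flight by the time we block on it.

#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* read one block, while hinting the kernel about a block we'll need soon */
static ssize_t
read_with_prefetch(int fd, char *buf, off_t blockno, off_t distance)
{
    (void) posix_fadvise(fd, (blockno + distance) * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);

    /* synchronous read; cheap if the advised IO has already completed */
    return pread(fd, buf, BLCKSZ, blockno * BLCKSZ);
}

Once the stream stops issuing the fadvise64() calls, each pread64() of a
not-yet-cached range has to pay the full device latency, which is what
the occasional slow reads above show.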


If I disable that sequentialness detection:

diff --git i/src/backend/storage/aio/read_stream.c w/src/backend/storage/aio/read_stream.c
index e4414b2e915..e58585c4e02 100644
--- i/src/backend/storage/aio/read_stream.c
+++ w/src/backend/storage/aio/read_stream.c
@@ -242,8 +242,9 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
      * isn't a strictly sequential pattern, then we'll issue advice.
      */
     if (!suppress_advice &&
-        stream->advice_enabled &&
-        stream->pending_read_blocknum != stream->seq_blocknum)
+        stream->advice_enabled
+        // && stream->pending_read_blocknum != stream->seq_blocknum
+        )
         flags = READ_BUFFERS_ISSUE_ADVICE;
     else
         flags = 0;

I see *much* improved times.


Without cpu idle tuning:

master:                                 111ms
patch:                                  128ms
patch, with disabled seq detection:      68ms


Suddenly making the patch a huge win.


ISTM that disabling posix_fadvise() just because of a single sequential
IO is too aggressive. A pattern of two sequential reads, a gap of more
than a few blocks, two sequential reads, ... will afaict never trigger
useful readahead in Linux, because Linux needs a page cache access near
already-read data to start doing readahead. But that means we end up
with synchronous blocking reads.


Thomas, what led to disabling advice this aggressively? Particularly
because heapam.c disables advice altogether, it isn't entirely obvious
why that's good.


I suspect this might also affect the behaviour in the vacuum read stream case.


I think we'll need to add some logic to the read stream that only
disables advice after a longer sequential run, roughly along the lines
sketched below. Writing that logic shouldn't be too hard, I think?
Determining the concrete cutoffs is probably harder, although I think
even fairly simplistic logic will be "good enough".
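
Something along these lines, sketched against the code quoted above
(sequential_run_length and the cutoff value are invented here, and the
concrete cutoff would need benchmarking):

/* hypothetical: stop advising only after a longer strictly-sequential run */
#define SEQ_ADVICE_CUTOFF 4

if (stream->pending_read_blocknum == stream->seq_blocknum)
    stream->sequential_run_length++;    /* invented field */
else
    stream->sequential_run_length = 0;

if (!suppress_advice &&
    stream->advice_enabled &&
    stream->sequential_run_length < SEQ_ADVICE_CUTOFF)
    flags = READ_BUFFERS_ISSUE_ADVICE;
else
    flags = 0;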

Greetings,

Andres Freund


