Re: index prefetching - Mailing list pgsql-hackers
| From | Konstantin Knizhnik |
|---|---|
| Subject | Re: index prefetching |
| Date | |
| Msg-id | dcbc802c-97c8-48ea-a40f-358482b9def6@garret.ru |
| In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
| Responses | Re: index prefetching |
| List | pgsql-hackers |
> Even a small fixed overhead can be quite visible for extremely short queries that can't possibly benefit from prefetching (like LIMIT 1). That's what we were concerned about when we invented this heuristic (which only prefetches from the 2nd batch). I agree it's crude, no argument there. I'm not against having something better, as long as we can make it reliable.
I completely agree that we should not create the read stream from the very beginning, despite the small overhead.
My concern is that the second batch is not a good criterion: depending on the key size and the position of the first key on the page, the first batch can be too short.
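To make the two criteria concrete, here is a minimal sketch (the scan fields and `heap_block_cb` are invented names, not the actual patch code; `read_stream_threshold` is the knob discussed below):

```c
#include "postgres.h"
#include "storage/read_stream.h"

/* ... inside the scan code, before fetching the next heap tuple ... */
if (scan->rs == NULL)
{
    /*
     * v5 crude heuristic: prefetch only from the 2nd batch, i.e. once
     * the scan moves past the first leaf page:
     *
     *     enable = (scan->batchno >= 1);
     *
     * item-count alternative: count returned items instead of leaf
     * pages, so a short first leaf no longer matters:
     */
    bool enable = (scan->nitems >= read_stream_threshold);

    if (enable)
        scan->rs = read_stream_begin_relation(READ_STREAM_DEFAULT,
                                              NULL,
                                              scan->heapRelation,
                                              MAIN_FORKNUM,
                                              heap_block_cb,
                                              scan,
                                              0);
}
```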
> One of the problems with read_stream is that it may look very far ahead, much further than the query will need. Imagine an index-only scan with 99.99% of pages all-visible. At some point the scan hits an item that requires reading the heap page, which triggers the look-ahead and a search for the *next* heap page. But that next index entry may be many leaf pages ahead, and maybe the scan never even gets to actually need it, which means the work will be wasted. If the query only needs a couple of tuples, this may cause a bad regression.
Sorry, I do not completely understand how this can happen.
The read stream can only fetch heap pages referenced by TIDs on leaf pages (leaf = batch).
So the read stream cannot advance beyond the TIDs of the currently processed leaf page, can it?
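To illustrate, a toy sketch of such a batch-driven callback (hypothetical names and structures, not the patch code): it can only return heap blocks for TIDs already taken from the current leaf, and stops when the batch is exhausted:

```c
#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/read_stream.h"

/* invented per-scan state; not the actual patch structures */
typedef struct IndexPrefetchState
{
    ItemPointerData *batch_tids;   /* TIDs copied from the current leaf */
    int              batch_nitems; /* number of TIDs in the batch */
    int              next_item;    /* next TID to hand to the stream */
    BlockNumber      last_block;   /* last heap block returned */
} IndexPrefetchState;

static BlockNumber
index_prefetch_next_block(ReadStream *stream,
                          void *callback_private_data,
                          void *per_buffer_data)
{
    IndexPrefetchState *ps = (IndexPrefetchState *) callback_private_data;

    while (ps->next_item < ps->batch_nitems)
    {
        BlockNumber blkno =
            ItemPointerGetBlockNumber(&ps->batch_tids[ps->next_item++]);

        /* skip consecutive TIDs pointing at the same heap page */
        if (blkno == ps->last_block)
            continue;

        ps->last_block = blkno;
        return blkno;
    }

    /* batch exhausted: the stream cannot run ahead into later leaves */
    return InvalidBlockNumber;
}
```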
> IIRC the same thing can happen for plain index scans. You may need to try with correlated indexes (I think the read_stream callback skips duplicate TIDs).
Another problem is that different TIDs can refer to the same heap page (which happens with a clustered index or a table populated in key-ascending order).
But the number-of-SMGR-reads criterion should also work well in this case.
>> When all data is cached in shared buffers, we do not perform any I/O at all. That means it doesn't matter whether and when we initialize the read_stream: we can do it after processing 10 items (the current default) or immediately - it should not affect performance. And this is what I have tested: performance does not actually depend on `read_stream_threshold` (if the data fits in shared buffers). At least it is within a few percent, and maybe that is just random fluctuation. There is obviously no 25% degradation.

> I'm not sure it's this simple. Even if we perform no physical I/O, the read_stream callback will need to go through the whole batch and pass all the items to the read_stream. The data may be cached, but not in shared buffers, in which case the read_stream will do the IPC etc.
You mean that the data is present in the OS file cache, so reading it is fast and will not benefit from AIO?
That can certainly happen, but it seems to be the well-known double buffering problem.
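For completeness, a sketch of the consuming side, assuming the standard read stream API: even on a full shared-buffers hit, each block still goes through the block-number callback and a buffer lookup/pin, which is the fixed overhead in question (error handling omitted):

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/read_stream.h"

/* ... "stream" was created by read_stream_begin_relation() ... */
Buffer  buf;

while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
{
    /* ... process the heap tuples on this page ... */
    ReleaseBuffer(buf);
}
read_stream_end(stream);
```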
>> It definitely doesn't mean that it is impossible to find a scenario where this approach (enabling prefetch after processing N items) shows worse performance than master or v5. We just need to properly choose the cache hit rate. But the same is true IMHO for v5 itself: it is possible to find a workload where it shows the same degradation compared with master.

> I'm sure such "adversarial" data sets exist, even if we don't have a good example at hand. The question is how plausible/realistic such data sets are, though. I'm not sure about v5, though. The assumption was that the overhead is most visible for short queries, and once we get to the 2nd batch the query is expensive enough to not be affected too much. When you say "we just need to properly choose cache hit rate", how would we do that? And how would we know/determine the optimal threshold?
Sorry, maybe I was unclear.
By "choosing the cache hit rate" I did not mean using it to determine the proper threshold.
I was thinking about the best and worst scenarios for index prefetch.
Originally, in my "benchmark" I considered the case where no heap pages are present in shared buffers.
In your scenario, shared buffers are larger than the table, so all pages are present in shared buffers.
The first case is certainly the best scenario for index prefetch, where we can expect the largest improvement.
In your case there is certainly no improvement, but as we can see, the overhead is also not that large (simply because we do not read any pages).
I expect the worst result in some intermediate case - when some pages are present in shared buffers and some are not.
This is especially true for the "second batch" criterion, because it can happen that we need to read just a single heap page for the whole batch, and using the read stream for it just adds extra overhead.
>> A more precise heuristic should IMHO take into account the actual number of performed disk reads.

> I'm not sure "disk reads" is the right term. The data may be in the page cache, and we just need to copy it into shared buffers.
I agree, but there seems to be no way to determine whether a page was actually read from disk or from the OS cache.
But as I wrote above, that is the double buffering problem; if we can avoid it (i.e. by using direct I/O), the problem disappears.
>> Please note that I do not want to predict the number of disk reads - i.e. to check whether the prefetch candidates are present in shared buffers. That would really add significant overhead. I think it is better to use the number of already performed reads as the threshold.

> Not sure I understand what this says. Can you elaborate?
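As a rough sketch of the idea (using the existing `pgBufferUsage` counter that EXPLAIN (BUFFERS) relies on; the scan fields are invented names): do not probe the buffer pool for future blocks, just watch how many reads the scan has already caused, and start the stream once that count crosses the threshold.

```c
#include "postgres.h"
#include "executor/instrument.h"   /* pgBufferUsage */
#include "storage/read_stream.h"

/* ... scan->reads_at_start was snapshotted from
 * pgBufferUsage.shared_blks_read when the scan began ... */
if (scan->rs == NULL)
{
    /* buffer reads this scan has already performed; no prediction,
     * no probing of shared buffers for future blocks */
    int64 reads_done =
        pgBufferUsage.shared_blks_read - scan->reads_at_start;

    if (reads_done >= read_stream_threshold)
        scan->rs = read_stream_begin_relation(...);  /* as sketched above */
}
```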
I sent the proposed patch in the previous mail.
Did you have a chance to look at it?