Re: index prefetching - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: index prefetching
Date:
Msg-id: 20240214234302.hk5e3otfssasxnac@awork3.anarazel.de
In response to: Re: index prefetching (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
Hi,

On 2024-02-14 16:45:57 -0500, Melanie Plageman wrote:
> > > The LIMIT problem is not very clear to me either. Yes, if we get close
> > > to the end of the leaf page, we may need to visit the next leaf page.
> > > But that's kinda the whole point of prefetching - reading stuff ahead,
> > > and reading too far ahead is an inherent risk. Isn't that a problem we
> > > have even without LIMIT? The prefetch distance ramp up is meant to
> > > limit the impact.
> >
> > Right now, the index AM doesn't know anything about LIMIT at all. That
> > doesn't matter, since the index AM can only read/scan one full leaf
> > page before returning control back to the executor proper. The
> > executor proper can just shut down the whole index scan upon finding
> > that we've already returned N tuples for a LIMIT N.
> >
> > We don't do prefetching right now, but we also don't risk reading a
> > leaf page that'll just never be needed. Those two things are in
> > tension, but I don't think that that's quite the same thing as the
> > usual standard prefetching tension/problem. Here there is uncertainty
> > about whether what we're prefetching will *ever* be required -- not
> > uncertainty about when exactly it'll be required. (Perhaps this
> > distinction doesn't mean much to you. I'm just telling you how I think
> > about it, in case it helps move the discussion forward.)
>
> I don't think that the LIMIT problem is too different for index scans
> than heap scans. We will need some advice from the planner to come down
> to prevent over-eager prefetching in all cases.

I'm not sure that that's really true. I think the more common and more
problematic case for partially executing a sub-tree of a query is nested
loops (worse because that happens many times within a query).
Particularly for anti-joins, prefetching too aggressively could lead to
significant IO amplification.
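To make the "ramp up" idea quoted above concrete, here is a minimal
sketch (not PostgreSQL code; the function name and the cap parameter are
assumptions) of an exponential prefetch-distance ramp. Because the
distance starts at 1 and only doubles as the scan keeps consuming rows,
a scan that terminates early, e.g. due to a LIMIT or a satisfied
anti-join, has issued at most roughly as many speculative prefetches as
reads it actually needed:

```c
#include <assert.h>

/*
 * Hypothetical ramp-up policy, not PostgreSQL code: the prefetch
 * distance starts at 1 and doubles on each step, capped at
 * max_distance (think effective_io_concurrency).  The slow start
 * bounds how much IO is wasted when the consumer stops early.
 */
static int
prefetch_next_distance(int current, int max_distance)
{
	int		next = (current == 0) ? 1 : current * 2;

	return (next > max_distance) ? max_distance : next;
}
```

With a cap of 64, the distance would go 1, 2, 4, ..., 64 as the scan
proceeds, so an anti-join that is satisfied by the first matching row
has prefetched at most one block speculatively.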
At the same time it's IMO more important to ramp up prefetching distance
fairly aggressively for index scans than it is for sequential scans. For
sequential scans it's quite likely that either the whole scan takes
quite a while (thus slowly ramping doesn't affect overall time that
much) or that the data is cached anyway because the tables are small and
frequently used (in which case we don't need to ramp). And even if
smaller tables aren't cached, the IOs are cheaper because they're
sequential. Contrast that with index scans, where it's much more likely
that you have cache misses in queries that do an overall fairly small
number of IOs, and where that IO is largely random.

I think we'll need some awareness at ExecInitNode() time about how the
results of the nodes are used. I see a few "classes":

1) All rows are needed, because the node is below an Agg, Hash,
   Materialize, Sort, .... Can be determined purely by the plan shape.

2) All rows are needed, because the node is completely consumed by the
   top-level (i.e. no limit, anti-joins or such in between) and the
   top-level wants to run the whole query. Unfortunately I don't think
   we know this at plan time at the moment (it's just determined by
   what's passed to ExecutorRun()).

3) Some rows are needed, but it's hard to know the precise number,
   e.g. because of a LIMIT further up.

4) Only a single row is going to be needed, albeit possibly after
   filtering at the node level, e.g. the anti-join case.

There are different times at which we could determine how each node is
consumed:

a) Determine the node consumption "class" purely within ExecInit*, via
   different eflags. Today that couldn't deal with 2), but I think it'd
   not be too hard to modify callers that consume query results
   completely to tell ExecutorStart() that, not just ExecutorRun(). A
   disadvantage would be that this prevents us from taking IO depth
   into account during costing.
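The four classes above, and the eflags idea in a), could be sketched
roughly like this (every name here is hypothetical; none of these
identifiers exist in PostgreSQL today):

```c
#include <assert.h>

/* Hypothetical classification, mirroring the four classes above. */
typedef enum ScanConsumption
{
	CONSUME_ALL_PLAN_SHAPE,		/* 1) below Agg/Hash/Materialize/Sort */
	CONSUME_ALL_TOP_LEVEL,		/* 2) fully consumed by the top level */
	CONSUME_SOME,				/* 3) e.g. a LIMIT further up */
	CONSUME_ONE					/* 4) e.g. inner side of an anti-join */
} ScanConsumption;

/*
 * Hypothetical eflags bit that a caller which will consume the whole
 * result set could pass to ExecutorStart(), addressing case 2).
 */
#define EXEC_FLAG_CONSUME_ALL	0x0100

/*
 * Sketch of how the class could seed the prefetch ramp: scans whose
 * output is fully consumed can start at the maximum distance, while
 * single-row consumers avoid speculative IO entirely.
 */
static int
initial_prefetch_target(ScanConsumption c, int max_distance)
{
	switch (c)
	{
		case CONSUME_ALL_PLAN_SHAPE:
		case CONSUME_ALL_TOP_LEVEL:
			return max_distance;	/* all rows needed, ramp aggressively */
		case CONSUME_SOME:
			return 1;				/* ramp up from the bottom */
		case CONSUME_ONE:
			return 0;				/* no speculative prefetching */
	}
	return 1;
}
```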
There very well might be plans that are cheaper than others because the
plan shape allows more concurrent IO.

b) Determine the node consumption class at plan time. This also
   couldn't deal with 2), and fixing that would probably be harder,
   because we'll often not know at plan time how the query will be
   executed. In fact the same plan might be executed in multiple ways,
   e.g. in the case of prepared statements. The obvious advantage is of
   course that we can influence the choice of paths.

I suspect we'd eventually want a mix of both: plan time to be able to
influence the plan shape, ExecInit* to deal with not knowing at plan
time how the query will be consumed. Which suggests that we could start
with whichever is easier and extend later.

Greetings,

Andres Freund