Re: proposal: be smarter about i/o patterns in index scan - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: proposal: be smarter about i/o patterns in index scan
Date:
Msg-id: 19335.1084992985@sss.pgh.pa.us
In response to: Re: proposal: be smarter about i/o patterns in index scan ("Jeffrey W. Baker" <jwbaker@acm.org>)
Responses: Re: proposal: be smarter about i/o patterns in index scan
           Re: proposal: be smarter about i/o patterns in index scan
List: pgsql-hackers
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > Are you saying that index scan results are sorted by something other > than the index key? Because in my scheme they would still be sorted by > index key. Not unless you add yet another sort step after the fetch step, that is the idea becomes1. read index to get candidate TIDs2. sort TIDs into physical order3. read heap in physical order, checkrow visibility4. sort selected rows into index order That would start to look like an unpleasant amount of overhead, since the sort would have to be over the entire output; you couldn't divide the scan into reasonable-size batches, whereas steps 1-3 can easily be run in batched style. Moreover, this'd not be any win compared to just dropping the assumption of sorted output; the planner could stick in the same sort for itself if it decided the cost was worth it. What you probably want to do is support both our traditional indexscan style and an "unsorted indexscan" plan type that delivers unsorted output (implementing steps 1-3 above). With appropriate cost estimation the planner could make an intelligent choice between the two. I'm too lazy to look in the archives right now, but my private notes about this say: We could improve indexscan performance if we were to read a bunch of TIDs from the index, sort in page number order, and access the heap tuples in page number order. But probably need a batch size in the thousands of TIDs to get any really useful effect. This would not work very well given the present assumption that we hold a pin on the index page until we have fetched the tuple (cf lazy VACUUM). Pinning lots of index pages is no good for concurrency or buffer usage. QUESTION: is that assumption really necessary? Can't we just make the code ignore TIDs that lead to deleted tuples? Worst case is that we fetch a newly-inserted tuple that does not actually have the expected index value, but won't we ignore it because of MVCC tqual rules? (This does NOT work for dirty read or SnapshotNow cases, but we could probably set things up to only try this optimization for standard snapshot reads.) An even nicer property is that we could loop inside the index AM to scoop up tuples, saving locking overhead. This might be worth doing even when we want to keep returning the tuples in index order. and there is also something about a longer-range plan involving the same ideas: Idea: invent two new plan node types: * IndexOnlyScanreads an index, does NOT visit heap. Returns consed-up tuplesthat contain the heap ctid and (some of) theindex columns. * TidExpandGiven an input tuplestream containing CTIDs, visit the heap.Discard input tuple if heap tuple is missing or notvisible tocurrent xact. Otherwise, generate output tuple consisting ofrequired fields from input tuple and heap tuple. We'd put IndexOnlyScan below, and TidExpand above, a join node. The transformation is legal if the join condition does not require a heap column that's not available from the index. (If both join inputs are indexscanable then we'd consider plans involving two IndexOnlyScans, a join, and two TidExpand nodes above the join.) This is a win if the join eliminates many source tuples from consideration. (Probably it never makes sense for the outer side of an outer join...) Also, it's not legal for a rel that is on the nullable side of an outer join, since we might have bogus matches that would incorrectly prevent a null-extended row from being emitted. 
Note that any restriction quals on the original index node or the join
node would have to be moved up to the TidExpand node, if they refer to
columns not available from the index.  (This would of course affect the
row count and cost estimates, possibly causing us to decide the
transformation is a bad idea.  Note also that the Expand node would fetch
the same heap tuple multiple times if the join is not one-to-one; this
could again cause the cost to come out worse than the regular style.)

The only part of this that we can't cost-estimate very accurately is the
fraction of dead index entries that will fail the time qual; but really we
aren't estimating the costs for regular indexscans very well either, when
the fraction of dead entries is high.  We could probably just make this a
GUC parameter with a default setting of 0.1 or so.  (NOTE: ANALYZE without
VACUUM could compute info about the current dead-tuple fraction, I think;
worth doing?)

The work required would mostly be to figure out how to construct such
plans efficiently in the planner.  Should we try to take already-built
join pathtrees and modify them, or should we stick to the existing
bottom-up approach?  In multiple-join cases, there might be many places
where the required TidExpand could be inserted; should we try to cost them
all, or just use a heuristic about where to place TidExpand?

Issue: compatibility with concurrent VACUUM.  We can't expect any
protection from index locking when the heap visit is delayed like this.
I think this is okay as long as TidExpand is careful not to assume the
CTID is still valid (it could be pointing at a removed tuple).  The CTID
could also be pointing at a tuple slot that has been emptied by VACUUM and
then refilled by another process --- but in that case the new tuple will
fail the time qual test and will be ignored anyway.  (Note that this
*only* works for a snapshot time qual, not for SnapshotNow; so we can get
away with it in user queries even though it'd be unsafe in the general
case.)

			regards, tom lane
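For concreteness, here is a minimal, self-contained C sketch of the
batched fetch in steps 1-3 above.  This is not PostgreSQL code: the Tid
struct, the callbacks, and the 4096-entry batch are illustrative stand-ins
for the real ItemPointer, index AM, and heap/tqual machinery.  The point
is only that TIDs are gathered in index order, sorted by heap block, and
fetched in that order, with dead or invisible tuples simply skipped.

    #include <stdlib.h>

    /* Stand-in for an ItemPointer: heap block number + line pointer. */
    typedef struct
    {
        unsigned int   block;
        unsigned short offset;
    } Tid;

    /* qsort comparator: physical (block, then offset) order. */
    static int
    tid_physical_cmp(const void *a, const void *b)
    {
        const Tid *ta = (const Tid *) a;
        const Tid *tb = (const Tid *) b;

        if (ta->block != tb->block)
            return (ta->block < tb->block) ? -1 : 1;
        if (ta->offset != tb->offset)
            return (ta->offset < tb->offset) ? -1 : 1;
        return 0;
    }

    #define BATCH_SIZE 4096     /* "thousands of TIDs" per batch */

    /*
     * One batch of an "unsorted indexscan".  next_index_tid() stands in
     * for the index AM (returns 0 when the scan is exhausted);
     * fetch_if_visible() stands in for the heap access plus the snapshot
     * check, and is expected to ignore TIDs pointing at dead or recycled
     * tuples.  Returns the number of TIDs gathered in this batch.
     */
    static int
    unsorted_indexscan_batch(int (*next_index_tid)(Tid *out),
                             void (*fetch_if_visible)(Tid tid))
    {
        Tid batch[BATCH_SIZE];
        int n = 0;

        /* 1. read index to get candidate TIDs (one batch's worth) */
        while (n < BATCH_SIZE && next_index_tid(&batch[n]))
            n++;

        /* 2. sort TIDs into physical order */
        qsort(batch, n, sizeof(Tid), tid_physical_cmp);

        /* 3. read heap in physical order, checking row visibility there */
        for (int i = 0; i < n; i++)
            fetch_if_visible(batch[i]);

        return n;
    }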