Thread: proposal: be smarter about i/o patterns in index scan
Greetings all,

We have noticed a way to make a major improvement in Pg's performance on our workload, and I would like to get your thoughts before I go off to work up a patch.

The basic problem is that Pg seeks far too much when performing an index scan. I have saved an strace of a backend which is selecting 170,000 rows from a 240,000,000 row table via index scan. The strace shows 134,000 seeks and reads, or almost one seek for every tuple in the result set. This would be fine normally, except the seeks are in a *very* suboptimal pattern, swinging back and forth across the device. The query requires 16 minutes, 30 seconds on our test hardware.

The proposal is to sort the block requests before they are issued. Because Pg only issues single seek/read pairs synchronously, the kernel has no chance to apply its elevator, and hence every seek/read Pg issues results in actual movement of the disk head. Pg's random pattern of seeking also makes the kernel's readahead fairly useless.

To prove the proposal I did a simulation, using the recorded seek positions in the strace. We noted that Pg issued 403 seek/read pairs for every 8192 bytes read from the index. So we performed four simulations: a straight playback of Pg's seek pattern, a playback where each list of 403 seeks is sorted by byte offset, a playback of all the seeks sorted by offset, and a sequential scan of the 13GB data file.

    PostgreSQL 4.2.1:   16m30s
    Sorted in chunks:   10m6s
    Sorted altogether:    4m19s
    Sequential scan:      6m7s

As you can see, there's a lot to be gained here. If Pg were able to issue its seeks in the optimal order, the query would return in 1/4th the time. Even the chunk-sorted scheme is a big win.

So the proposal is this: the offsets stored in the index ought to be sorted. Either A) they can be stored sorted in the first place (sorted in VACUUM ANALYZE, or always sorted), or B) the backend can sort each list of tuples it reads from the index, or C) the backend can read the entire index result and sort that (this would be the best).

From just paging through the code, it doesn't seem terribly hard. B seems the easiest to implement but is also the least improvement; even a seqscan is better than B. One improvement to B would be to read much more than 8K from the index; reading e.g. 256KB would improve things dramatically. A might be easy but would either degrade over time or cause slower inserts. C is the best and hardest to implement (I am guessing, because the size of the sorted index subset is unbounded).

Your thoughts are appreciated.

-jwb
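For concreteness, a minimal sketch of the "sorted in chunks" replay described above, assuming the byte offsets have already been extracted from the strace. The names (replay_sorted, BATCH) are made up for illustration, and pread() stands in for the separate seek/read pairs the backend actually issues.

#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BATCH   403         /* seek/read pairs observed per 8K index page */
#define BLOCKSZ 8192

static int
cmp_off(const void *a, const void *b)
{
    off_t x = *(const off_t *) a, y = *(const off_t *) b;

    return (x > y) - (x < y);
}

/* Replay n recorded offsets against fd, sorting each BATCH of them first,
 * so the disk head sweeps in one direction per batch and the kernel's
 * elevator and readahead have something to work with. */
static void
replay_sorted(int fd, off_t *offsets, size_t n)
{
    char    buf[BLOCKSZ];

    for (size_t i = 0; i < n; i += BATCH)
    {
        size_t  len = (n - i < BATCH) ? n - i : BATCH;

        qsort(offsets + i, len, sizeof(off_t), cmp_off);
        for (size_t j = 0; j < len; j++)
            if (pread(fd, buf, BLOCKSZ, offsets[i + j]) < 0)
                perror("pread");
    }
}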
Yes, fetching a RID list from an index scan, sorting 'em, and then fetching from the table would be a very appreciable speedup for many queries. I would imagine that all the commercial engines do this (DB2 calls it a sorted RID-list fetch) .. and this has in fact been discussed on -hackers before.

One issue for this is that there could be a slip between the cup and the lip .. i.e., between the fetch from the index, the sort, and the fetch from the table, a vacuum could have taken place, rendering TIDs invalid. This should however not be a show-stopper .. surely all that's needed is a lock against vacuuming in the presence of a TID-list fetch.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
> The basic problem is that Pg seeks far too much when performing an index scan. I have saved an strace of a backend which is selecting 170,000 rows from a 240,000,000 row table via index scan. The strace shows 134,000 seeks and reads, or almost one seek for every tuple in the result set. This would be fine normally except the seeks are in a *very* suboptimal pattern, swinging back and forth across the device. The query requires 16 minutes, 30 seconds on our test hardware.

Yes yes yes *please* fix this :-)  This performance bottleneck in PG affects us all the time.

> The proposal is to sort the block requests before they are issued. Because Pg only issues single seek/read pairs synchronously, the kernel has no chance to apply its elevator and hence every seek/read Pg issues results in actual movement of the disk head. Pg's random pattern of seeking also makes the kernel's readahead fairly useless.
>
> To prove the proposal I did a simulation, using the recorded seek positions in the strace. We noted that Pg issued 403 seek/read pairs for every 8192 bytes read from the index. So we performed four simulations: a straight playback of Pg's seek pattern, a playback where each list of 403 seeks is sorted by byte offset, a playback of all the seeks sorted by offset, and a sequential scan of the 13GB data file.
>
> PostgreSQL 4.2.1:   16m30s
> Sorted in chunks:   10m6s
> Sorted altogether:    4m19s
> Sequential scan:      6m7s
>
> As you can see, there's a lot to be gained here. If Pg was able to issue its seeks in the optimal order, the query would return in 1/4th the time. Even the chunk-sorted scheme is a big win.

I think your worst case may not be as bad as it gets. Nothing scientific, but my experience tells me that it's common to get even worse performance than that. I've had queries that would take seemingly forever, but then after a fresh cluster and cache flush, the same query would take almost no time at all.

I also think that with a multi-mirrored RAID setup and your proposed IO sorting, the mirroring would be more likely to overcome seek overhead. With the current implementation, it seems almost useless to throw more hardware at the problem. I think the improvement would be even 'huger' than your numbers show :-)

> So the proposal is this: the offsets stored in the index ought to be sorted. Either A) they can be stored sorted in the first place (sorted in VACUUM ANALYZE, or always sorted), or B) the backend can sort each list of tuples it reads from the index, or C) the backend can read the entire index result and sort that (this would be the best).
>
> From just paging through the code, it doesn't seem terribly hard. B seems the easiest to implement but is also the least improvement. Even seqscan is better than B. One improvement to B would be to read much more than 8K from the index. Reading e.g. 256KB would improve things dramatically. A might be easy but would either degrade over time or cause slower inserts. C is the best and hardest to implement (I am guessing, because the size of sorted index subset is unbounded).

IMO if you're going to do it, do it right. That means A is pretty much out, again, IMO. It would seem that B (improved) would be the way to go because, as you say, C would produce unbounded subsets. I would propose to read very large sections of the index subset though, more in the order of several MB (configurable, of course).
With that much space to play with, most queries would require only one pass at the index and therefore show performance equal to C. Larger queries would suffer a bit, but then we don't usually have such speedy expectations of large expected result sets, and again, with enough sort space to play with, the performance could approach that of C.

I would think that you could also sort the index on index order (during vacuum) to improve index scan performance. This could be a completely separate implementation task. With clean sequential index access *and* clean sequential heap access, it might just prove to be the single largest performance boost PG has seen, well, ever.

My .02...

Glen Parker
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > We have noticed a way to make a major improvement in Pg's performance on > our workload, and I would like to get your thoughts before I go off to > work up a patch. For starters, read the previous discussions of this in the archives. Two questions you should have answers to before starting to implement, rather than after, are how to deal with locking considerations and what will be the implications of giving up the property that indexscans deliver sorted output. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

  Tom> For starters, read the previous discussions of this in the archives.

  Tom> Two questions you should have answers to before starting to implement, rather than after, are how to deal with locking considerations and what will be the implications of giving up the property that indexscans deliver sorted output.

I don't know about the former, but as to the latter, we should certainly have the ability for both output sorted by key and TID-List-Fetch ...

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
On Wed, 2004-05-19 at 07:58, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > We have noticed a way to make a major improvement in Pg's performance on our workload, and I would like to get your thoughts before I go off to work up a patch.
>
> For starters, read the previous discussions of this in the archives.

If you are referring to archives.postgresql.org, all I have to say is: "An error occured! Can not connect to search daemon". And as far as I have seen, it's been like that for years.

> Two questions you should have answers to before starting to implement, rather than after, are how to deal with locking considerations and what will be the implications of giving up the property that indexscans deliver sorted output.

Are you saying that index scan results are sorted by something other than the index key? Because in my scheme they would still be sorted by index key.

-jwb
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > If you are referring to archives.postgresql.org, all I have to say is: > "An error occured! Can not connect to search daemon" > And as far as I have seen, it's been like that for years. I do seem to be getting that today (Marc?) but it's hardly been "like that for years". See also http://www.pgsql.ru/db/pgsearch/ which provides an alternative search engine. regards, tom lane
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > Are you saying that index scan results are sorted by something other > than the index key? Because in my scheme they would still be sorted by > index key. Not unless you add yet another sort step after the fetch step, that is the idea becomes1. read index to get candidate TIDs2. sort TIDs into physical order3. read heap in physical order, checkrow visibility4. sort selected rows into index order That would start to look like an unpleasant amount of overhead, since the sort would have to be over the entire output; you couldn't divide the scan into reasonable-size batches, whereas steps 1-3 can easily be run in batched style. Moreover, this'd not be any win compared to just dropping the assumption of sorted output; the planner could stick in the same sort for itself if it decided the cost was worth it. What you probably want to do is support both our traditional indexscan style and an "unsorted indexscan" plan type that delivers unsorted output (implementing steps 1-3 above). With appropriate cost estimation the planner could make an intelligent choice between the two. I'm too lazy to look in the archives right now, but my private notes about this say: We could improve indexscan performance if we were to read a bunch of TIDs from the index, sort in page number order, and access the heap tuples in page number order. But probably need a batch size in the thousands of TIDs to get any really useful effect. This would not work very well given the present assumption that we hold a pin on the index page until we have fetched the tuple (cf lazy VACUUM). Pinning lots of index pages is no good for concurrency or buffer usage. QUESTION: is that assumption really necessary? Can't we just make the code ignore TIDs that lead to deleted tuples? Worst case is that we fetch a newly-inserted tuple that does not actually have the expected index value, but won't we ignore it because of MVCC tqual rules? (This does NOT work for dirty read or SnapshotNow cases, but we could probably set things up to only try this optimization for standard snapshot reads.) An even nicer property is that we could loop inside the index AM to scoop up tuples, saving locking overhead. This might be worth doing even when we want to keep returning the tuples in index order. and there is also something about a longer-range plan involving the same ideas: Idea: invent two new plan node types: * IndexOnlyScanreads an index, does NOT visit heap. Returns consed-up tuplesthat contain the heap ctid and (some of) theindex columns. * TidExpandGiven an input tuplestream containing CTIDs, visit the heap.Discard input tuple if heap tuple is missing or notvisible tocurrent xact. Otherwise, generate output tuple consisting ofrequired fields from input tuple and heap tuple. We'd put IndexOnlyScan below, and TidExpand above, a join node. The transformation is legal if the join condition does not require a heap column that's not available from the index. (If both join inputs are indexscanable then we'd consider plans involving two IndexOnlyScans, a join, and two TidExpand nodes above the join.) This is a win if the join eliminates many source tuples from consideration. (Probably it never makes sense for the outer side of an outer join...) Also, it's not legal for a rel that is on the nullable side of an outer join, since we might have bogus matches that would incorrectly prevent a null-extended row from being emitted. 
Note that any restriction quals on the original index node or the join node would have to be moved up to the TidExpand node, if they refer to columns not available from the index. (This would of course affect the row count and cost estimates, possibly causing us to decide the transformation is a bad idea. Note also that the Expand node would fetch the same heap tuple multiple times if the join is not one-to-one; this could again cause the cost to come out worse than the regular style.)

The only part of this that we can't cost-estimate very accurately is the fraction of dead index entries that will fail the time qual; but really we aren't estimating the costs for regular indexscans very well either, when the fraction of dead entries is high. We could probably just make this a GUC parameter with a default setting of 0.1 or so. (NOTE: ANALYZE without VACUUM could compute info about the current dead-tuple fraction, I think; worth doing?)

The work required would mostly be to figure out how to construct such plans efficiently in the planner. Should we try to take already-built join pathtrees and modify them, or should we stick to the existing bottom-up approach? In multiple-join cases, there might be many places where the required TidExpand could be inserted; should we try to cost them all, or just use a heuristic about where to place TidExpand?

Issue: compatibility with concurrent VACUUM. We can't expect any protection from index locking when the heap visit is delayed like this. I think this is okay as long as TidExpand is careful not to assume the CTID is still valid (it could be pointing at a removed tuple). The CTID could also be pointing at a tuple slot that has been emptied by VACUUM and then refilled by another process --- but in that case the new tuple will fail the time qual test and will be ignored anyway. (Note that this *only* works for a snapshot time qual, not for SnapshotNow; so we can get away with it in user queries even though it'd be unsafe in the general case.)

			regards, tom lane
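As a rough illustration of the batched steps 1-3 above (not actual backend code: the Tid struct merely mirrors the shape of an ItemPointer, and the fetch_* routines are stand-ins for the real index and heap access paths):

#include <stdint.h>
#include <stdlib.h>

/* Stand-in for an ItemPointer: heap block number plus line offset. */
typedef struct Tid
{
    uint32_t    block;
    uint16_t    offset;
} Tid;

/* Sort TIDs into physical (block, offset) order so each batch reads the
 * heap in a single forward sweep. */
static int
tid_physical_cmp(const void *a, const void *b)
{
    const Tid  *x = a, *y = b;

    if (x->block != y->block)
        return (x->block > y->block) ? 1 : -1;
    if (x->offset != y->offset)
        return (x->offset > y->offset) ? 1 : -1;
    return 0;
}

/* Placeholder for "read a bunch of TIDs from the index"; a real version
 * would come from the index AM.  This stub just ends the scan. */
static size_t
fetch_index_batch(Tid *tids, size_t max)
{
    (void) tids; (void) max;
    return 0;
}

/* Placeholder for fetching one heap tuple and checking its visibility. */
static void
fetch_heap_tuple(Tid tid)
{
    (void) tid;
}

/* Steps 1-3 of the "unsorted indexscan": batch, sort, fetch.  Output is
 * no longer in index key order, which is the trade-off discussed above. */
static void
unordered_indexscan_batch(size_t batch_size)
{
    Tid    *tids = malloc(batch_size * sizeof(Tid));
    size_t  n;

    while ((n = fetch_index_batch(tids, batch_size)) > 0)  /* step 1 */
    {
        qsort(tids, n, sizeof(Tid), tid_physical_cmp);     /* step 2 */
        for (size_t i = 0; i < n; i++)
            fetch_heap_tuple(tids[i]);                     /* step 3 */
    }
    free(tids);
}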
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
>
> For starters, read the previous discussions of this in the archives.

Search doesn't work. Even if it did, I'm not sure what I would search on. Do you remember the time frame of the discussion? If I could narrow it down to a few months, I could possibly find it the slow way. I'd be very interested to read it.

Anyway, this subject seems to get precious little attention, and I don't understand why. PG's index scan implementation is so bad at some types of queries, I find it surprising that there aren't weekly discussions about it, and developers lined up around the block to work on it. I would be one of those developers, but frankly the learning curve is pretty steep and I don't have any smaller complaints to use as learning tools.

What am I missing? Why is a performance bottleneck of this magnitude not on the same list of priorities as PITR, replication, and Win32?

> Two questions you should have answers to before starting to implement, rather than after, are how to deal with locking considerations and what will be the implications of giving up the property that indexscans deliver sorted output.

Here's one answer: If you had to sort every result set, even when an index could have been used, overall performance would still improve by a very large margin. I'd bet money on it.

Actually, the index would still have to be scanned in sort order. Only the fetch order for heap tuples would be sorted differently.

Glen Parker
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Wednesday, May 19, 2004 11:56 AM
>
> Not unless you add yet another sort step after the fetch step, that is the idea becomes
> 1. read index to get candidate TIDs
> 2. sort TIDs into physical order
> 3. read heap in physical order, check row visibility
> 4. sort selected rows into index order
>
> That would start to look like an unpleasant amount of overhead, since the sort would have to be over the entire output; you couldn't divide the scan into reasonable-size batches, whereas steps 1-3 can easily be run in batched style.

Or:

2. Sort AND COPY TID's into physical order.
3. Read heap in phy. order, match results to un-sorted TID list.

That sounds quite cheap. Then you get to drop step 4 (which would *usually* be quite a bit more expensive than a TID sort and copy?).

The cost of the proposed implementation would be higher *when everything is in the cache*, granted. As a user, I will take that cost 10 times over in return for such a large improvement in the no-cache situation.

-Glen
"Glen Parker" <glenebob@nwlink.com> writes: > What am I missing? Why is a performance bottle neck of this magnitude not > on the same list of priorities as PITR, replication, and Win32? It's higher on my personal to-do list than most of those ;-). But those things are getting done because there are other developers with other priorities. I suspect also that the set of people competent to make this change is much smaller than the set of people able to work on the other points. In my mind, most of the issue is in the planner (figuring out what to do with an unsorted-indexscan option) and relatively few people have wanted to touch the planner. > Here's one answer: If you had to sort every result set, even when an index > could have been used, overall performance would still improve by a very > large margin. I'd bet money on it. For a counterexample I refer you to our standard solution for MAX-using-an-index: SELECT ... FROM table ORDER BY foo DESC LIMIT 1; which would become truly spectacularly bad without an ordered index scan. A more general point is that for any indexscan that returns only a small number of index entries (eg, any unique-key search) worrying about physical-order access will be wasted effort. The best you could hope for is not to be significantly worse than the existing code in such cases, and I'm unconvinced you could achieve even that. I can assure you that any patch that completely removes the existing behavior will be rejected, because there are plenty of cases where it's the right thing. The main thing that unordered indexscan could do for us is extend the usefulness of indexscan plans into relatively-poor-selectivity cases where we presently tend to drop back to seqscans. There would still be a selectivity threshold above which you might as well use seqscan, but it ought to be higher than the fraction-of-a-percent that we currently see for indexscans. What is unknown, and will be unknown until someone tries it, is just what range of selectivity this technique might win for. I think such a range exists; I am not real certain that it is wide enough to justify a lot of effort in making the idea a reality. regards, tom lane
On Wed, 2004-05-19 at 11:56, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > Are you saying that index scan results are sorted by something other than the index key? Because in my scheme they would still be sorted by index key.
>
> Not unless you add yet another sort step after the fetch step, that is the idea becomes
> 1. read index to get candidate TIDs
> 2. sort TIDs into physical order
> 3. read heap in physical order, check row visibility
> 4. sort selected rows into index order
>
> That would start to look like an unpleasant amount of overhead, since the sort would have to be over the entire output; you couldn't divide the scan into reasonable-size batches, whereas steps 1-3 can easily be run in batched style.
>
> Moreover, this'd not be any win compared to just dropping the assumption of sorted output; the planner could stick in the same sort for itself if it decided the cost was worth it.
>
> What you probably want to do is support both our traditional indexscan style and an "unsorted indexscan" plan type that delivers unsorted output (implementing steps 1-3 above). With appropriate cost estimation the planner could make an intelligent choice between the two.
>
> I'm too lazy to look in the archives right now, but my private notes about this say:

Thanks for the notes. Introducing a new query execution step sounds like a better/easier idea than what I was going to do, which was to rework some of the access methods to operate on vectors instead of scalars. That idea looks increasingly difficult to implement.

> We could improve indexscan performance if we were to read a bunch of TIDs from the index, sort in page number order, and access the heap tuples in page number order. But probably need a batch size in the thousands of TIDs to get any really useful effect.

Depends on the query, but I got an improvement by half from batches of 400.

> This would not work very well given the present assumption that we hold a pin on the index page until we have fetched the tuple (cf lazy VACUUM). Pinning lots of index pages is no good for concurrency or buffer usage. QUESTION: is that assumption really necessary? Can't we just make the code ignore TIDs that lead to deleted tuples? Worst case is that we fetch a newly-inserted tuple that does not actually have the expected index value, but won't we ignore it because of MVCC tqual rules? (This does NOT work for dirty read or SnapshotNow cases, but we could probably set things up to only try this optimization for standard snapshot reads.)
>
> An even nicer property is that we could loop inside the index AM to scoop up tuples, saving locking overhead. This might be worth doing even when we want to keep returning the tuples in index order.

I think this is *definitely* worth doing. The current implementation in the vicinity of index_getnext wastes valuable opportunities to optimize the I/O and/or to let the kernel do a better job. Descending into index_getnext -> heap_getnext for every tuple looks expensive in my application.

Thanks again for your notes; I'll revisit the source code and see if I can come up with a plan.

-jwb

PS Regarding the archive, I've never seen it work. I use MARC instead.
"Glen Parker" <glenebob@nwlink.com> writes: >> Not unless you add yet another sort step after the fetch step, that is >> the idea becomes >> 1. read index to get candidate TIDs >> 2. sort TIDs into physical order >> 3. read heap in physical order, check row visibility >> 4. sort selected rows into index order > Or: > 2. Sort AND COPY TID's into physical order. > 3. Read heap in phy. order, match results to un-sorted TID list. > That sounds quite cheap. No, it sounds like handwaving. What's your "match results" step, and does it take less than O(N^2) time? How do you get to *return* the tuples in index order, without having read them all before you return the first one? (Which is really what the problem is with the sort alternative, at least for fast-start cases such as the LIMIT 1 example.) What happens when you run out of memory for the list of TIDs ... er, make that two lists of TIDs? regards, tom lane
On Wed, 2004-05-19 at 13:10, Tom Lane wrote:
> "Glen Parker" <glenebob@nwlink.com> writes:
> >> Not unless you add yet another sort step after the fetch step, that is the idea becomes
> >> 1. read index to get candidate TIDs
> >> 2. sort TIDs into physical order
> >> 3. read heap in physical order, check row visibility
> >> 4. sort selected rows into index order
>
> > Or:
> > 2. Sort AND COPY TID's into physical order.
> > 3. Read heap in phy. order, match results to un-sorted TID list.
> > That sounds quite cheap.
>
> No, it sounds like handwaving. What's your "match results" step, and does it take less than O(N^2) time? How do you get to *return* the tuples in index order, without having read them all before you return the first one? (Which is really what the problem is with the sort alternative, at least for fast-start cases such as the LIMIT 1 example.) What happens when you run out of memory for the list of TIDs ... er, make that two lists of TIDs?

No doubt, you can't just naively create giant vectors of TIDs and expect to sort them.

Is there any concept in Pg of an unrealized result? If you scanned an index building up a result set that was totally unrealized, except for the TID and the index columns, you could cheaply join two such results without ever touching the heap. You could also use the existing Sort execution step to sort such a result. Then don't touch the heap until something accesses a non-index column, or because you are returning the result somewhere and need to satisfy MVCC visibility limits. This would lay down the foundation needed by your TIDExpand, and could also be a useful concept in other areas.

I acknowledge my own significant handwaving on this subject :)  From your point of view everyone is going to be handwaving, because we don't have your depth of experience with this code.

-jwb
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > No doubt, you can't just naively create giant vectors of TIDs and expect > to sort them. Is there any concept in Pg of an unrealized result? Only for the case of a partially-read plan result. That is, we do this for rowsets, but not for scalars or arrays. A lot of the point of the LIMIT 1 example is that it is exploiting the fact that we won't ever materialize the full output of the indexscan. > If you scanned an index building up a result set that was totally > unrealized, except for the TID and the index columns, you could > cheaply join two such results without ever touching the heap. You > could also use the existing Sort execution step to sort such a result. > Then don't touch the heap something accesses a non-index column, or > because you are returning the result somewhere and need to satisfy > MVCC visibility limits. This is basically what I was talking about with IndexOnlyScan/TidExpand. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> The main thing that unordered indexscan could do for us is extend the usefulness of indexscan plans into relatively-poor-selectivity cases where we presently tend to drop back to seqscans.

Well I'm sure Tom remembers this since he's mentioned it in the past. But for the sake of completeness I'll remind others of the other big benefit a heap-ordered indexscan could provide: merging multiple index scans.

If you have something like

    select * from a where x>? and (y=? or z=?)

and you have separate indexes on x, y, and z, then having a heap-ordered index scan would allow the optimizer to do three independent scans of the three indexes and only have to fetch the tuples that fit the entire boolean clause from the heap. In the case of expressions where any individual column is very non-selective but the combination is very selective the result can be a big improvement.

(Actually in the special case of x=? and y=? and z=? you could do the same without any special sorting step if the index tuples were kept in heap order within each index key. I think that would be an interesting optimization to try too.)

--
greg
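To make the combination step concrete, a hedged sketch of how physically-sorted TID lists from the three indexes could be merged before any heap page is fetched. The PackedTid encoding and the helper names are invented for illustration; they are not backend code.

#include <stddef.h>
#include <stdint.h>

/* For brevity, a TID is packed into one integer, (block << 16) | offset,
 * which also happens to sort in physical heap order. */
typedef uint64_t PackedTid;

/* out = a AND b; inputs sorted ascending; returns result length. */
static size_t
tid_intersect(const PackedTid *a, size_t na,
              const PackedTid *b, size_t nb, PackedTid *out)
{
    size_t  i = 0, j = 0, k = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else
        {
            out[k++] = a[i];
            i++, j++;
        }
    }
    return k;
}

/* out = a OR b; inputs sorted ascending; returns result length. */
static size_t
tid_union(const PackedTid *a, size_t na,
          const PackedTid *b, size_t nb, PackedTid *out)
{
    size_t  i = 0, j = 0, k = 0;

    while (i < na || j < nb)
    {
        if (j >= nb || (i < na && a[i] < b[j]))
            out[k++] = a[i++];
        else if (i >= na || b[j] < a[i])
            out[k++] = b[j++];
        else
        {
            out[k++] = a[i];
            i++, j++;
        }
    }
    return k;
}

For the example query, the heap would then be visited only for tid_intersect(x_tids, tid_union(y_tids, z_tids)), in physical order.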
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
>
> > Or:
> > 2. Sort AND COPY TID's into physical order.
> > 3. Read heap in phy. order, match results to un-sorted TID list.
> > That sounds quite cheap.
>
> No, it sounds like handwaving. What's your "match results" step, and

You got me :-)

> does it take less than O(N^2) time? How do you get to *return* the

Less than O(N^2)??? Geez I think so!

> tuples in index order, without having read them all before you return the first one? (Which is really what the problem is with the sort alternative, at least for fast-start cases such as the LIMIT 1 example.)

(You'll have to pardon me for thinking as I type... and being too long winded :-)

It takes O(1) time. If you have a hard maximum of TIDs you'll grab from the index (which sounds right), you could store the TID list in a vector. The IO-order-sorted list could be a singly-linked list OR a vector, but either way, each entry would have to point back to an offset in the index-ordered list. The index-ordered list would be (sizeof(TID) * max_TID_count) bytes, and the IO-ordered list would be ((sizeof(TID) + sizeof(int)) * max_TID_count) bytes. If a TID is 8 bytes and we're going to fetch 2048 index entries per iteration, that gets us (*counting on fingers*) 16K for the index-ordered list + 24K for the IO-ordered list.

The index-ordered list can also be a singly-linked list, in which case the mapping would be by pointer. If both lists are singly-linked lists, then the size overhead (for a fully realized 2048 entry scan) would be ((sizeof(TID) + sizeof(void*)) * max_TID_count) + ((sizeof(TID) + sizeof(void*) + sizeof(void*)) * max_TID_count) = (*more counting on fingers*) ... 24K + 32K = 56K.

Hmm, the vector style would be crazy expensive for small queries. The linked-list approach sounds pretty good though. I assume no one thinks doing this would be completely free :-)  And, if you did the IO-ordered list as a vector, you'd save the linked-list overhead, and you would know exactly how big to make it, so it wouldn't have to hurt you on small queries. 2048 entries seems pretty darn generous actually.

> How do you get to *return* the tuples in index order, without having read them all before you return

Hmm, worst case scenario would be to scan the heap in exactly-reverse order of index order. With a 2048 entry batch size, you'd have a fair amount of overhead on a LIMIT 1 :-)  Unless the index scanner could be given a maximum tuple count, rather than just 'knowing' it from a GUC value. Would this dirty up the plan tree beyond all recognition?

> What happens when you run out of memory for the list of TIDs ... er, make that two lists of TIDs?

Now you really got me. Hand waving: the same thing you do when you run out of memory anywhere else :-)  By configuring the server to fetch index entries in 1-entry batches, you'd be right back to the overhead (or very nearly so) of the current implementation, so it would only hurt people who chose to be hurt by it :-)

-Glen
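One reading of the back-pointer scheme above, sketched in C with made-up names: the physically-sorted copy carries each TID's position in the index-ordered batch, so every fetched tuple is written straight back into its original slot and the "match results" step is O(1) per tuple, with no final sort.

#include <stdint.h>
#include <stdlib.h>

typedef struct Tid
{
    uint32_t    block;
    uint16_t    offset;
} Tid;

/* A TID plus its slot in the index-ordered batch, so a tuple fetched in
 * physical order can be matched back to index order in O(1). */
typedef struct TidMap
{
    Tid     tid;
    size_t  index_pos;
} TidMap;

static int
tidmap_physical_cmp(const void *a, const void *b)
{
    const TidMap *x = a, *y = b;

    if (x->tid.block != y->tid.block)
        return (x->tid.block > y->tid.block) ? 1 : -1;
    return (int) x->tid.offset - (int) y->tid.offset;
}

/* Placeholder for the real heap fetch plus visibility check. */
static void *
fetch_one(Tid tid)
{
    (void) tid;
    return NULL;
}

/* Visit the heap in physical order, but store each result at its original
 * index-ordered position, so the batch comes back in index order without
 * a step-4 sort. */
static void
fetch_batch(const Tid *index_order, void **results, size_t n)
{
    TidMap *map = malloc(n * sizeof(TidMap));

    for (size_t i = 0; i < n; i++)
    {
        map[i].tid = index_order[i];
        map[i].index_pos = i;
    }
    qsort(map, n, sizeof(TidMap), tidmap_physical_cmp);

    for (size_t i = 0; i < n; i++)
        results[map[i].index_pos] = fetch_one(map[i].tid);

    free(map);
}

This addresses the O(N^2) question but not the fast-start one: the whole batch is still read before the first tuple can be returned in index order, which is the LIMIT 1 overhead acknowledged above.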
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > ... Introducing a new query execution step sounds > like a better/easier idea than was I was going to do, which was to > rework some of the access methods to operate on vectors instead of > scalars. That idea looks increasingly difficult to implement. One thing that I think is worth doing in any case is to alter the API for the index AMs' getnext functions so that they can return multiple TIDs per call (probably into an array supplied by the caller). We could for example return all matching tuples from a single index page in one call, and then let index_getnext iterate through them without re-calling the index AM. (Making it work per-index-page would allow us to keep the current VACUUM interlocking conventions in place exactly, so that there would be no risk of breaking anything. All tuples returned in a given call would still be protected by an index-page lock.) We'd have to make such an API change anyway to support unordered indexscan, but it should be a benefit even for ordered scans, because it should considerably reduce the locking overhead for indexscans that fetch multiple tuples. In particular it might alleviate the BufMgrLock contention problems that were under discussion last month. (The test case we were using to illustrate that problem fetched several hundred tuples per indexscan, so it clearly could benefit. Extent of benefit unknown at this time, though.) The tricky part of this is figuring out how to handle mark/restore and scan direction switching in a way that doesn't complicate the code to the point of unmaintainability. I think it may be possible to keep all the extra complication within indexam.c, but haven't thought through the details. This seems like a relatively localized change, and might be a good place for you to start. regards, tom lane
On Wed, 2004-05-19 at 12:55, Tom Lane wrote:
> "Glen Parker" <glenebob@nwlink.com> writes:
> > What am I missing? Why is a performance bottleneck of this magnitude not on the same list of priorities as PITR, replication, and Win32?
>
> It's higher on my personal to-do list than most of those ;-). But those things are getting done because there are other developers with other priorities. I suspect also that the set of people competent to make this change is much smaller than the set of people able to work on the other points. In my mind, most of the issue is in the planner (figuring out what to do with an unsorted-indexscan option) and relatively few people have wanted to touch the planner.
>
> > Here's one answer: If you had to sort every result set, even when an index could have been used, overall performance would still improve by a very large margin. I'd bet money on it.
>
> For a counterexample I refer you to our standard solution for MAX-using-an-index:
>
>     SELECT ... FROM table ORDER BY foo DESC LIMIT 1;

Yes, that would stink with an unordered index result.

> which would become truly spectacularly bad without an ordered index scan. A more general point is that for any indexscan that returns only a small number of index entries (eg, any unique-key search) worrying about physical-order access will be wasted effort. The best you could hope for is not to be significantly worse than the existing code in such cases, and I'm unconvinced you could achieve even that.

I don't think a TID-sorted index scan would be worse than the current unsorted scan. You will produce a list of 1, sort that, and fetch the tuple. In the current implementation, you only omit the sort, which is basically free.

> I can assure you that any patch that completely removes the existing behavior will be rejected, because there are plenty of cases where it's the right thing.

Okay.

> The main thing that unordered indexscan could do for us is extend the usefulness of indexscan plans into relatively-poor-selectivity cases where we presently tend to drop back to seqscans. There would still be a selectivity threshold above which you might as well use seqscan, but it ought to be higher than the fraction-of-a-percent that we currently see for indexscans. What is unknown, and will be unknown until someone tries it, is just what range of selectivity this technique might win for. I think such a range exists; I am not real certain that it is wide enough to justify a lot of effort in making the idea a reality.

I don't think selectivity is the main issue. I think the absolute size of the result is the issue. The current implementation fairly well guarantees one seek for each tuple of the result, which means that the cost to access the result is something on the order of 10ms * rows. So for any significant result that needs to be fetched from the heap, the seek cost is going to overwhelm all other costs quickly. Even though my index is selective, retrieving only 200k rows of a 240m row table, the cost of the sort is justified.

-jwb
Is there a TODO here?

---------------------------------------------------------------------------

Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > ... Introducing a new query execution step sounds like a better/easier idea than what I was going to do, which was to rework some of the access methods to operate on vectors instead of scalars. That idea looks increasingly difficult to implement.
>
> One thing that I think is worth doing in any case is to alter the API for the index AMs' getnext functions so that they can return multiple TIDs per call (probably into an array supplied by the caller). We could for example return all matching tuples from a single index page in one call, and then let index_getnext iterate through them without re-calling the index AM. (Making it work per-index-page would allow us to keep the current VACUUM interlocking conventions in place exactly, so that there would be no risk of breaking anything. All tuples returned in a given call would still be protected by an index-page lock.)
>
> We'd have to make such an API change anyway to support unordered indexscan, but it should be a benefit even for ordered scans, because it should considerably reduce the locking overhead for indexscans that fetch multiple tuples. In particular it might alleviate the BufMgrLock contention problems that were under discussion last month. (The test case we were using to illustrate that problem fetched several hundred tuples per indexscan, so it clearly could benefit. Extent of benefit unknown at this time, though.)
>
> The tricky part of this is figuring out how to handle mark/restore and scan direction switching in a way that doesn't complicate the code to the point of unmaintainability. I think it may be possible to keep all the extra complication within indexam.c, but haven't thought through the details.
>
> This seems like a relatively localized change, and might be a good place for you to start.
>
> 			regards, tom lane

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073