Thread: Planner enhancement suggestion.
Bitmap index scan is bliss. Many thanks to the postgres team! Searching in tables with many fields and conditions is no longer a pain.

And just a thought:

	SELECT * FROM table WHERE category IN (1,2,3) ORDER BY price LIMIT 10;

Suppose you have an index on category and another index on price. Depending on the stats postgres has about the values, you'll get one of:

0- seq scan + sort
1- Plain or Bitmap Index Scan using "category", then sort by "price"
2- Index scan on "price", Filter on "category IN (1,2,3)", no sort.

Plan 1 is efficient if the category is rare. Postgres knows this and uses the plan well. Without a LIMIT, plan 1 should be preferred.

Plan 2 is efficient if the items in categories 1,2,3 are cheap (close to the start of the index on price). However, if the items in question are at the other end of the index, it will index-scan a large part of the table, which can be a big hit. Postgres has no stats about the correlation of "category" and "price", so it won't know when there is going to be a problem.

Another option would be interesting. It has two steps:

- Build a bitmap using the index on "category" (just like in case 1), so we know which heap pages contain relevant rows.
- Index scan on "price", but only look in the heap for pages which are flagged in the bitmap, then "Recheck Cond" on "category".

In other words, do an index scan to get the rows in the right order, but don't bother to check the heap for pages where the bitmap says there are no matching rows. In the worst case you still have to run through the entire index, but at least not through the entire table!

It can also speed up some merge joins.

What do you think?
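The two steps above can be sketched as a toy simulation (not PostgreSQL code; the data, page layout, and helper names are all made-up stand-ins for btrees and heap pages):

```python
# Each row: (page_no, category, price). The "indexes" are sorted lists
# of (key, page_no) pairs, standing in for btree indexes.
rows = [
    (0, 1, 50), (0, 4, 5), (1, 2, 70), (1, 5, 10),
    (2, 3, 90), (2, 1, 60), (3, 6, 20), (3, 7, 80),
]

category_index = sorted((cat, page) for page, cat, price in rows)
price_index = sorted((price, page) for page, cat, price in rows)

def bitmap_ordered_scan(categories, limit):
    # Step 1: scan the category index only (no heap lookup) and flag
    # pages that may contain matching rows.
    page_bitmap = {page for cat, page in category_index if cat in categories}

    # Step 2: walk the price index in price order; fetch the heap page
    # only if it is flagged, then recheck the category condition.
    result = []
    for price, page in price_index:
        if page not in page_bitmap:
            continue  # bitmap says: no matching row on this page, skip heap
        for p, cat, pr in rows:  # simulated heap fetch + recheck
            if p == page and pr == price:
                if cat in categories:
                    result.append((cat, pr))
                break
        if len(result) == limit:
            break
    return result
```

With `bitmap_ordered_scan({1, 2, 3}, 3)`, page 3 (which holds only categories 6 and 7) is never fetched from the heap, yet the results come out in price order with no sort step, which is the point of the proposal.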
On Sun, Mar 05, 2006 at 10:00:25PM +0100, PFC wrote:
> Another option would be interesting. It has two steps :
>
> - Build a bitmap using the index on "category" (just like in case 1)
> so we know which pages on the table have relevant rows
>
> - Index scan on "price", but only looking in the heap for pages
> which are flagged in the bitmap, and then "Recheck Cond" on "category".
> [...]
> It can also speed up some merge joins.

The problem is that you're now talking about doing 2 index scans instead of just one index scan and a sort. If the correlation on price is high, it could still win.
As the cost estimator for index scan stands right now, there's no way such a plan would be chosen unless correlation was extremely high, however.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
> The problem is that you're now talking about doing 2 index scans instead
> of just one and a sort.

It depends on what you call an index scan:

a- Scanning just the index (no heap lookup) to create a bitmap
b- Scanning the index and hitting the heap in index order to retrieve the rows

(a) should be quite fast, because indexes generally use less space than the main table and have good locality of reference. (b) is OK if the table fits in memory, but if it has to seek on every row from the heap...

So, when doing:

	SELECT * FROM products WHERE category=C ORDER BY price LIMIT 20;

If the category contains few products, using the index on category and then sorting is good. However, if the category contains many items, postgres is likely to use the index on price to avoid the sort. It then wastes time fetching many rows from the heap which are not in category C. In that case, I guess it would be a win to build a bitmap of the pages containing rows which belong to category C, and only do the heap lookup on those pages.

I have a query like that. When category C contains cheap products, the index scan on price finds them pretty quickly. However, if it is a category containing mostly expensive products, the index scan will have to hit most of the table in order to find them. The time needed for the query at these two extremes varies from 1 ms to about 20 ms (and that's because the table is fully cached, or else the worst case would be a lot slower). I would definitely prefer a constant 2 ms.

The other solution is to create an index on (category, price), but this path leads to lots and lots of indexes, given all the combinations.

The bitmap trick I proposed in my previous post would be even more interesting if the table is clustered on category (which seems a reasonable thing to do).

> If the correlation on price is high, it could still win.
> As the cost estimator for index scan stands right now, there's no way
> such a plan would be chosen unless correlation was extremely high,
> however.

Does the cost estimator know about this kind of correlation?
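The composite-index alternative works because a btree on the compound key (category, price) keeps each category's rows contiguous and price-sorted, so the top N for one category is just a prefix of that run. A toy model (made-up data, not Postgres internals):

```python
from bisect import bisect_left

# Toy stand-in for a composite index on (category, price): a sorted
# list of compound keys. In a real btree the same ordering property
# holds, which is what makes the LIMIT query cheap.
index = sorted([
    (1, 60), (2, 70), (1, 50), (3, 90), (2, 80), (1, 55),
])

def top_n_in_category(category, n):
    # Seek to the first entry of the category, then read forward until
    # the category changes or we have n rows -- no sort needed.
    start = bisect_left(index, (category, float("-inf")))
    run = []
    for cat, price in index[start:]:
        if cat != category or len(run) == n:
            break
        run.append(price)
    return run
```

The downside is exactly the one noted above: one such index per (filter column, sort column) combination.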
On Tue, Mar 07, 2006 at 07:09:15PM +0100, PFC wrote:
> > The problem is that you're now talking about doing 2 index scans
> > instead of just one and a sort.
>
> It depends on what you call an index scan :
> a- Scanning just the index (no heap lookup) to create a bitmap

Sure, and then you scan the other index and read the heap at the same time (b). Your plan requires doing both. The question is: in what cases will it be faster to scan the extra index and build the bitmap vs. just doing a sort?

> b- Scanning the index and hitting the heap in index order to
> retrieve the rows
>
> (a) should be quite fast, because indexes generally use less space
> than the main table, and have good locality of reference. (b) is OK if
> the table fits in memory, but if it has to seek on every row from the
> heap...

If the table fits in memory, who cares? A sort should be damn fast at that point, because you're dealing with a small set of data.

> So, when doing :
> SELECT * FROM products WHERE category=C ORDER BY price LIMIT 20;
>
> If the category contains few products, using the index on category
> then sorting is good.
> However, if the category contains many items, postgres is likely to
> use the index on price to avoid the sort.

Have you actually seen this behavior? My experience is that you have to have a correlation somewhere over 80-90% before an index scan is favored over a seqscan + sort (which, as I mentioned before, appears to be broken).

> The bitmap trick I proposed in my previous post would be even more
> interesting if the table is clustered on category (which seems a
> reasonable thing to do).

In which case it's highly unlikely that using the price index will buy you anything.

> Does the cost estimator know about this kind of correlation ?

Yes. The problem is that the index scan cost estimator figures out a best-case and a worst-case cost, and then interpolates between the two using correlation^2.
IMO it should be using abs(correlation) instead, and there's some data at http://stats.distributed.net/~decibel/ that backs this up. There have also been some discussions on -hackers (search the archives for "index cost correlation nasby"), but I've not had time to follow up on this. If you wanted to test a new index cost formula, it would be a one-line change to the code.
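The interpolation being discussed can be sketched as follows. This is a simplified model of the idea, not the actual cost-estimator code; the cost numbers in the comments are illustrative only:

```python
def index_io_cost(min_io_cost, max_io_cost, correlation, use_abs=False):
    # Interpolate between the best case (heap visited nearly in
    # physical order) and the worst case (a random heap fetch per
    # row). Stock behavior weights by correlation squared; the tweak
    # proposed above uses abs(correlation) instead.
    c = abs(correlation) if use_abs else correlation ** 2
    return max_io_cost + c * (min_io_cost - max_io_cost)
```

With min=100, max=1000, and a moderate correlation of 0.5, the squared weighting lands only a quarter of the way toward the best case (775), while abs() lands halfway (550). That is why, with the squared formula, moderately correlated index scans are costed close to the worst case and rarely win against seqscan + sort.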