Thread: Re: [GENERAL] Yet Another (Simple) Case of Index not used
> -----Original Message-----
> From: Denis [mailto:denis@next2me.com]
> Sent: Tuesday, April 08, 2003 12:57 PM
> To: pgsql-performance@postgresql.org;
> pgsql-general@postgresql.org; pgsql-sql@postgresql.org
> Subject: [GENERAL] Yet Another (Simple) Case of Index not used
>
>
> Hi there,
> I'm running into a quite puzzling simple example where the
> index I've created on a fairly big table (465K entries) is
> not used, against all common sense expectations: The query I
> am trying to do (fast) is:
>
> select count(*) from addresses;
>
> This takes more than a second to complete, because, as the
> 'explain' command shows me, the index created on 'addresses'
> is not used, and a seq scan is being used.

As well it should be.

> One would assume
> that the creation of an index would allow the counting of the
> number of entries in a table to be instantaneous?

Traversing the index to perform the count will definitely make the query
many times slower.

A general rule of thumb (not sure if it is true with PostgreSQL) is that
if you have to traverse more than 10% of the data with an index then a
full table scan will be faster. This is especially true when there is
highly redundant data in the index fields. If there were an index on
bit data type, and you have half and half 1 and 0, an index scan of the
table will be disastrous.

To simply scan the table, we will just sequentially read pages until the
data is exhausted. If we follow the index, we will randomly jump from
page to page, defeating the read buffering.

[snip]
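The sequential-versus-random argument above can be put into rough numbers. The following is only a toy cost model, not PostgreSQL's planner: the cost constants, the B-tree height, and the function names are all assumptions made up for illustration.

```python
# Toy I/O cost model. Every constant here is an illustrative assumption,
# not a value taken from the thread or from PostgreSQL's planner.
SEQ_PAGE_COST = 1.0     # reading the next page in physical order
RANDOM_PAGE_COST = 4.0  # jumping to an arbitrary page
BTREE_HEIGHT = 3        # pages read while descending to the first leaf


def seq_scan_cost(table_pages: int) -> float:
    """Read every heap page once, in physical order."""
    return table_pages * SEQ_PAGE_COST


def index_scan_cost(rows_fetched: int, leaf_pages_read: int) -> float:
    """Worst case: one random heap page read per row reached via the index."""
    index_io = (BTREE_HEIGHT + leaf_pages_read) * SEQ_PAGE_COST
    heap_io = rows_fetched * RANDOM_PAGE_COST
    return index_io + heap_io


def cheaper_plan(table_pages: int, total_rows: int,
                 index_leaf_pages: int, fraction: float) -> str:
    """Return "index" or "seq" for a query touching `fraction` of the rows."""
    rows = int(total_rows * fraction)
    leaves = max(1, int(index_leaf_pages * fraction))
    if index_scan_cost(rows, leaves) < seq_scan_cost(table_pages):
        return "index"
    return "seq"
```

Taking the 465K rows from the original post and assuming 100 rows per page (4,650 heap pages) and 1,300 index leaf pages, `cheaper_plan(4650, 465000, 1300, 1.0)` picks the sequential scan for the full count, while `cheaper_plan(4650, 465000, 1300, 0.0001)` picks the index for a lookup of a few dozen rows.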
from mysql manual:
-------------------------------------------------------------
"COUNT(*) is optimized to return very quickly if the SELECT retrieves from one
table, no other columns are retrieved, and there is no WHERE clause. For example:

mysql> select COUNT(*) from student;"
-------------------------------------------------------------

A nice little optimization, maybe not possible in a MVCC system.

Dann Corbit wrote:
>>-----Original Message-----
>>From: Denis [mailto:denis@next2me.com]
>>Sent: Tuesday, April 08, 2003 12:57 PM
>>To: pgsql-performance@postgresql.org;
>>pgsql-general@postgresql.org; pgsql-sql@postgresql.org
>>Subject: [GENERAL] Yet Another (Simple) Case of Index not used
>>
>>
>>Hi there,
>>I'm running into a quite puzzling simple example where the
>>index I've created on a fairly big table (465K entries) is
>>not used, against all common sense expectations: The query I
>>am trying to do (fast) is:
>>
>>select count(*) from addresses;
>>
>>This takes more than a second to complete, because, as the
>>'explain' command shows me, the index created on 'addresses'
>>is not used, and a seq scan is being used.
>
>
> As well it should be.
>
>
>>One would assume
>>that the creation of an index would allow the counting of the
>>number of entries in a table to be instantaneous?
>
>
> Traversing the index to perform the count will definitely make the query
> many times slower.
>
> A general rule of thumb (not sure if it is true with PostgreSQL) is that
> if you have to traverse more than 10% of the data with an index then a
> full table scan will be faster. This is especially true when there is
> highly redundant data in the index fields. If there were an index on
> bit data type, and you have half and half 1 and 0, an index scan of the
> table will be disastrous.
>
> To simply scan the table, we will just sequentially read pages until the
> data is exhausted. If we follow the index, we will randomly jump from
> page to page, defeating the read buffering.
> [snip]
Dennis Gearon wrote:
> from mysql manual:
> -------------------------------------------------------------
> "COUNT(*) is optimized to return very quickly if the SELECT retrieves from one
> table, no other columns are retrieved, and there is no WHERE clause. For example:
>
> mysql> select COUNT(*) from student;"
> -------------------------------------------------------------
>
> A nice little optimization, maybe not possible in a MVCC system.

I think the only thing you can do with MVCC is to cache the value and
transaction ID for "SELECT AGG(*) FROM tab" and make the cached value
visible to transaction IDs greater than the one that executed the
query, and invalidate the cache every time the table is modified.

In fact, don't clear the cache, just record the transaction ID of the
table modification command so we can use standard visibility routines to
make the cache usable as long as possible.

The cleanest way would probably be to create an aggregate cache system
table, and to insert into it when someone does an unqualified aggregate,
and to delete from it when someone modifies the table --- the MVCC tuple
visibility rules are handled automatically. Queries can look in there
to see if a visible cached value already exists. Of course, the big
question is whether this would be a big win, and whether the cost of
upkeep would justify it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
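A rough sketch of the caching idea described above may help make the visibility rule concrete. It is a deliberate simplification and not how PostgreSQL decides visibility: it assumes transaction IDs are plain integers, that every transaction commits instantly and in ID order, and that the class and method names (`AggCache`, `store`, `note_modification`, `lookup`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedAgg:
    value: int
    computed_by: int                      # xid of the transaction that ran the aggregate
    invalidated_by: Optional[int] = None  # xid of the first later table modification


class AggCache:
    """Cache of unqualified aggregates, keyed by table name (hypothetical)."""

    def __init__(self) -> None:
        self._entries: Dict[str, CachedAgg] = {}

    def store(self, table: str, value: int, xid: int) -> None:
        """Remember the aggregate result computed by transaction `xid`."""
        self._entries[table] = CachedAgg(value, xid)

    def note_modification(self, table: str, xid: int) -> None:
        """Record the first modification instead of clearing the entry."""
        entry = self._entries.get(table)
        if entry is not None and entry.invalidated_by is None:
            entry.invalidated_by = xid

    def lookup(self, table: str, reader_xid: int) -> Optional[int]:
        """Return the cached value if it is visible to `reader_xid`, else None."""
        entry = self._entries.get(table)
        if entry is None or reader_xid < entry.computed_by:
            return None  # nothing cached, or cached by a later transaction
        if entry.invalidated_by is not None and reader_xid >= entry.invalidated_by:
            return None  # reader can see the modification, so the value is stale
        return entry.value
```

Recording `invalidated_by` rather than deleting the entry is what lets an older reader keep using the cached value after a newer transaction has modified the table, which is the "usable as long as possible" point.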
On Tuesday 15 Apr 2003 3:23 pm, Bruce Momjian wrote:
> Dennis Gearon wrote:
> > from mysql manual:
> > -------------------------------------------------------------
> > "COUNT(*) is optimized to return very quickly if the SELECT retrieves
> > from one table, no other columns are retrieved, and there is no WHERE
> > clause. For example:
> >
> > mysql> select COUNT(*) from student;"
> > -------------------------------------------------------------

> The cleanest way would probably be to create an aggregate cache system
> table, and to insert into it when someone does an unqualified aggregate,
> and to delete from it when someone modifies the table --- the MVCC tuple
> visibility rules are handled automatically. Queries can look in there
> to see if a visible cached value already exists. Of course, the big
> question is whether this would be a big win, and whether the cost of
> upkeep would justify it.

If the rule system could handle something like:

CREATE RULE quick_foo_count AS ON SELECT count(*) FROM foo
DO INSTEAD
SELECT quick_count FROM agg_cache WHERE tbl_name='foo';

The whole thing could be handled by user-space triggers/rules and still
be invisible to the end-user.

--
  Richard Huxton
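The trigger half of this suggestion can be sketched without the hypothetical ON SELECT rule. The example below uses Python's sqlite3 module purely as a convenient stand-in, so the trigger syntax is SQLite's rather than PostgreSQL's. The `foo` and `agg_cache` names come from the message above; the column layout of `foo`, the trigger names, and the helper functions are assumptions. Because no rule rewrites count(*), the caller has to read `agg_cache` explicitly.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create foo plus a trigger-maintained row count in agg_cache."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE foo (id INTEGER PRIMARY KEY, val TEXT);
        CREATE TABLE agg_cache (
            tbl_name    TEXT PRIMARY KEY,
            quick_count INTEGER NOT NULL
        );
        INSERT INTO agg_cache VALUES ('foo', 0);

        -- Keep the cached count in step with every inserted row.
        CREATE TRIGGER foo_count_ins AFTER INSERT ON foo
        BEGIN
            UPDATE agg_cache SET quick_count = quick_count + 1
            WHERE tbl_name = 'foo';
        END;

        -- Keep the cached count in step with every deleted row.
        CREATE TRIGGER foo_count_del AFTER DELETE ON foo
        BEGIN
            UPDATE agg_cache SET quick_count = quick_count - 1
            WHERE tbl_name = 'foo';
        END;
    """)
    return conn


def quick_count(conn: sqlite3.Connection, tbl_name: str) -> int:
    """Read the cached count instead of scanning the table."""
    row = conn.execute(
        "SELECT quick_count FROM agg_cache WHERE tbl_name = ?", (tbl_name,)
    ).fetchone()
    return row[0]
```

A single counter row like this is also where the upkeep cost shows up: every insert or delete on `foo` turns into an extra update of the same `agg_cache` row, which concurrent writers would have to take turns on.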
Added to TODO:

	* Consider using MVCC to cache count(*) queries with no WHERE
	  clause

---------------------------------------------------------------------------

Bruce Momjian wrote:
> Dennis Gearon wrote:
> > from mysql manual:
> > -------------------------------------------------------------
> > "COUNT(*) is optimized to return very quickly if the SELECT retrieves from one
> > table, no other columns are retrieved, and there is no WHERE clause. For example:
> >
> > mysql> select COUNT(*) from student;"
> > -------------------------------------------------------------
> >
> > A nice little optimization, maybe not possible in a MVCC system.
>
> I think the only thing you can do with MVCC is to cache the value and
> transaction ID for "SELECT AGG(*) FROM tab" and make the cached value
> visible to transaction IDs greater than the one that executed the
> query, and invalidate the cache every time the table is modified.
>
> In fact, don't clear the cache, just record the transaction ID of the
> table modification command so we can use standard visibility routines to
> make the cache usable as long as possible.
>
> The cleanest way would probably be to create an aggregate cache system
> table, and to insert into it when someone does an unqualified aggregate,
> and to delete from it when someone modifies the table --- the MVCC tuple
> visibility rules are handled automatically. Queries can look in there
> to see if a visible cached value already exists. Of course, the big
> question is whether this would be a big win, and whether the cost of
> upkeep would justify it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073