Thread: index does not improve performance
Hi all,

I have a table with 253.380 records. It's a firewall log. I thought that creating an index would improve queries, but the results so far were disappointing.

The table has about 20 columns, most of them of type "text". There's a text field "ip_type" that has one of three values: TCP, UDP, ICMP. I tried two selects, with and without an index on "ip_type".

Without index:

"select count(*) from log where ip_type='udp';" takes 3.0 seconds (this query evaluates to zero rows).

"select count(*) from log where ip_type='UDP';" takes 4.5 seconds (this query evaluates to 245.182 rows).

With index:

"select count(*) from log where ip_type='udp';" takes 0.0 seconds.

"select count(*) from log where ip_type='UDP';" takes 5.0 seconds.

It looks like creating an index degrades performance if the result set is similar in size to the whole table (I had much better results when the condition was met by only two thousand records). Is this normal?

--
Milos Prudek
Unfortunately, Milos, an index isn't likely to help on this type of query. It would appear that the value 'UDP' accounts for the vast majority of the rows, and so an index loses most of its value. You see, it actually takes *longer* to answer queries using the index if a significant portion of the table is being touched, because the database has to check both the index and the tuple. One of the new features of 7.2 is better statistics gathering, so that PostgreSQL can opt *against* using an index scan in precisely this case.

You basically want to return the entire table, so an index scan only adds to the query time. For example, your query where you searched for 'udp' returned very quickly, because PostgreSQL was able to use the index to verify that the value 'udp' didn't exist. When you searched for 'UDP', PostgreSQL probably used an index scan as well, and so your query took longer than before you added the index (that's the overhead of actually looking at the index instead of just getting down to business and scanning the table).

In short, if your query only matches a small percentage of the records, then an index scan is a win. Otherwise it's just an extra step.

Jason
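P.S. If you want to watch the planner's decision yourself, something along these lines should show it (the index name is just an example; adjust to your schema):

    -- create the index (name is illustrative)
    CREATE INDEX log_ip_type_idx ON log (ip_type);

    -- refresh the planner's statistics first
    VACUUM ANALYZE log;

    -- matches no rows at all: expect an Index Scan in the plan
    EXPLAIN SELECT count(*) FROM log WHERE ip_type = 'udp';

    -- matches ~97% of the rows: a Seq Scan is the right choice here,
    -- though on pre-7.2 statistics the planner may still pick the index
    -- (which would be exactly the slowdown you measured)
    EXPLAIN SELECT count(*) FROM log WHERE ip_type = 'UDP';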
On Thu, 2002-01-31 at 14:25, Milos Prudek wrote:
> It looks like creating an index degrades performance if the result set
> is similar in size to the whole table (I had much better results when
> the condition was met by only two thousand records). Is this normal?

Yes -- for very large result sets, you will get worse performance using an index scan than not using one. I *believe* that in 7.2, Pg keeps good enough statistics to figure this out most of the time (so it would use the index for the first query but not the second); however, I may be wrong.

BTW, have you VACUUM ANALYZE'd this table? What version of Pg are you running? The output of EXPLAIN for these queries (with and without the index) would be helpful.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
> BTW, have you VACUUM ANALYZE'd this table? What version of Pg are you

On your advice I did VACUUM ANALYZE, and it did help! From 5.5 seconds I'm down to 3.4 seconds.

I'm using PostgreSQL 7.1.3, RPM binary.

> running? The output of EXPLAIN for these queries (with and without the
> index) would be helpful.

With index:

Aggregate  (cost=11292.20..11292.20 rows=1 width=0)
  ->  Seq Scan on log  (cost=0.00..10679.25 rows=245182 width=0)

Without index:

Aggregate  (cost=11292.20..11292.20 rows=1 width=0)
  ->  Seq Scan on log  (cost=0.00..10679.25 rows=245182 width=0)

--
Milos Prudek
> Unfortunately, Milos, an index isn't likely to help on this type of
> query. It would appear that the value 'UDP' accounts for the vast
> majority of the rows, and so an index loses most of its value.

Jason and Neil,

Thank you very much for the detailed answers. I'm relatively new to SQL. My background is a much more modest system, FoxPro for DOS, which in ancient times used what they called "patented Rushmore technology". I think FoxPro was able to return COUNT immediately if an index was used, no matter how many records matched the condition. I wonder if this is something that PostgreSQL will eventually be able to do, or if it is simply not technically possible for some reason.

--
Milos Prudek
Milos Prudek <milos.prudek@tiscali.cz> writes:
> I think FoxPro was able to return COUNT immediately if an index was
> used, no matter how many records matched the condition. I wonder if
> this is something that PostgreSQL will eventually be able to do, or if
> it is simply not technically possible for some reason.

It is not going to happen in the foreseeable future, for the simple reason that there isn't necessarily a uniquely correct COUNT value in Postgres' view of the world, and thus no prospect of maintaining a single counter that would be of any use. You may care to read the MVCC chapter of the manual:

http://developer.postgresql.org/docs/postgres/mvcc.html

Also see my talk on Postgres transactions, a PDF available at

http://developer.postgresql.org/osdn.php

This is a bit out of date (the comments about crash-safeness predate the addition of WAL), but the discussion of tuple visibility is still very relevant.

			regards, tom lane
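P.S. To make the "no uniquely correct COUNT" point concrete, consider two concurrent sessions (a contrived sketch):

    -- Session 1:
    BEGIN;
    DELETE FROM log WHERE ip_type = 'ICMP';
    SELECT count(*) FROM log;  -- sees the table without the ICMP rows

    -- Session 2, at the same instant:
    SELECT count(*) FROM log;  -- still sees the ICMP rows

    -- Until session 1 commits or aborts, both answers are correct,
    -- so there is no single number a cached counter could hold.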
The "VACUUM" part of "VACUUM ANALYZE" is like "PACK" from your FoxPro days. The extra speed came from making the physical files smaller. A significant difference from FoxPro is that with PG, an UPDATE actually does a INSERT of a new row, and a DELETE of old row. You will want to perform VACUUM ANALYZE regularly. The "Seq Scan" part of explain says that PG decided to do a sequential read of your table, ignoring the index even when it did exist. There must have been other factors invloved to cause these queries to run in 4.5 and 5.0 seconds respectively. They should have run in the same amount of time. Frank At 04:02 PM 2/2/02 +0100, Milos Prudek wrote: >> BTW, have you VACUUM ANALYZE'd this table? What version of Pg are you > >On your advice I did VACUUM ANALYZE, and it did help! From 5.5 seconds >I'm down to 3.4 seconds. > >I'm using PostgreSQL 7.1.3, RPM binary. > > >> running? The output for 'explain' for these queries (with and without >> index) would be helpful. > >With index: >Aggregate (cost=11292.20..11292.20 rows=1 width=0) > -> Seq Scan on log (cost=0.00..10679.25 rows=245182 width=0) > >Without index: >Aggregate (cost=11292.20..11292.20 rows=1 width=0) > -> Seq Scan on log (cost=0.00..10679.25 rows=245182 width=0) > > >-- >Milos Prudek > > >---------------------------(end of broadcast)--------------------------- >TIP 6: Have you searched our list archives? > >http://archives.postgresql.org >
Sometimes it's horses for courses. I am currently replacing a DOS Paradox application, and I have run up against some of the same obstacles.

With PostgreSQL, getting a count of the rows in a table is an expensive operation (requiring a sequential scan). That's just the nature of the beast. If you really need to return the total number of rows, then about the only way to do it is to maintain that information in another table. Of course, that requires creating custom triggers for insert and delete, and you get to worry about contention for this new table (and deadlock, concurrency, and a list of other problems). There's a rough sketch of this at the end of this message.

I have a similar problem with a set of tables that store caseweights. These tables are all fairly large (the smallest has 16 million rows), and I am using commodity hardware. This makes sequential scans very painful. Queries that return a small portion of the total rows, however, return very fast. So I simply created a set of summary tables that hold 15-minute summaries of the statistics I need. I populate these tables with small Python scripts launched automatically by cron. Now the tables that I actually query are several orders of magnitude smaller than my raw data tables, and my queries return almost instantly.

Since it would appear that you are primarily logging firewall data (which shouldn't change after the fact), your application is a prime candidate for this type of optimization.

Hope this was helpful,

Jason
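P.S. Here's the rough sketch of the trigger-maintained count (all names here are invented, and it assumes plpgsql is installed -- "createlang plpgsql yourdb" if it isn't). Treat it as a starting point, not production code:

    -- one-row table holding the current count
    CREATE TABLE log_rowcount (n int8);
    INSERT INTO log_rowcount VALUES (0);

    -- trigger function: bump the counter on INSERT, drop it on DELETE
    CREATE FUNCTION log_count_trig() RETURNS opaque AS '
    BEGIN
        IF TG_OP = ''INSERT'' THEN
            UPDATE log_rowcount SET n = n + 1;
        ELSE
            UPDATE log_rowcount SET n = n - 1;
        END IF;
        RETURN NULL;
    END;
    ' LANGUAGE 'plpgsql';

    CREATE TRIGGER log_count AFTER INSERT OR DELETE ON log
        FOR EACH ROW EXECUTE PROCEDURE log_count_trig();

    -- "SELECT n FROM log_rowcount" now replaces the expensive count(*)

And a periodic rollup can be as simple as a cron'd psql running something like this (column names invented; a real one would only scan rows added since the last run):

    INSERT INTO log_summary
    SELECT now(), ip_type, count(*)
    FROM log
    GROUP BY ip_type;

Keep in mind the caveat above: every insert into "log" now serializes on that single counter row, so under heavy concurrent load the counter table itself can become the bottleneck.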
Frank Bax wrote:
>
> The "VACUUM" part of "VACUUM ANALYZE" is like "PACK" from your FoxPro days.

I knew that, since I read the fine manual :-) No harm in pointing it out, though.

> The extra speed came from making the physical files smaller. A
> significant difference from FoxPro is that with PG, an UPDATE actually
> does an INSERT of a new row and a DELETE of the old row. You will want
> to perform VACUUM ANALYZE regularly.

But I did not know this. Thanks to you, Frank and Darko!

--
Milos Prudek