Thread: index does not improve performance

From
Milos Prudek
Date:
Hi all,

I have a table with 253.380 records. It's a firewall log. I thought that
creating an index would improve queries, but the results so far have
been disappointing.

The table has about 20 columns, most of them of type "text". There's a
text field "ip_type" that has one of three values: TCP, UDP, ICMP. I
tried to do two selects with and without an index on "ip_type".


Without index:

"select count(*) from log where ip_type='udp';" takes 3.0 seconds (this
query evaluates to zero rows).

"select count(*) from log where ip_type='UDP';" takes 4.5 seconds (this
query evaluates to 245.182 rows).



With index:

"select count(*) from log where ip_type='udp';" takes 0.0 seconds.

"select count(*) from log where ip_type='UDP';" takes 5.0 seconds.


It looks like creating an index degrades performance if the result set
is similar in size to the whole table (I had much better results when
the condition matched only about two thousand records). Is this
normal?
--
Milos Prudek

Re: index does not improve performance

From
Jason Earl
Date:
Unfortunately, Milos, an index isn't likely to help on this type of
query.  It would appear that the value 'UDP' accounts for the vast
majority of the rows, and so an index loses most of its value.  You see,
it actually takes *longer* to return queries using the index if a
significant portion of the table is being touched, because the database
has to check both the index and the tuple.  One of the new features of
7.2 is better statistics gathering, so that PostgreSQL can opt
*against* using an index scan in precisely this case.  You basically
want to return the entire table, so an index scan only adds to the query
time.

For example, your query where you searched for 'udp' returned very
quickly, because PostgreSQL was able to use the index to verify that the
value 'udp' didn't exist.  When you searched for 'UDP', PostgreSQL
probably used an index scan as well, and so your query took longer than
before you added the index (that's the overhead of actually looking at
the index instead of just getting down to business and scanning the
table).

In short, if your query only matches a small percentage of the records,
then an index scan is a win.  Otherwise it's just an extra step.
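You can watch the planner make this choice by putting EXPLAIN in front
of the query. The sketch below is hypothetical: the table, column, and
index names are assumed from Milos's description, and it needs a live
PostgreSQL server to run against.

```sql
-- Hypothetical sketch against the firewall log table described above;
-- "log", "ip_type", and the index name are assumptions for illustration.
CREATE INDEX log_ip_type_idx ON log (ip_type);
VACUUM ANALYZE log;  -- refresh the planner's statistics

-- A rare value: only a small fraction of rows should match, so an index
-- scan ought to win (with 7.2's statistics the planner can see this).
EXPLAIN SELECT count(*) FROM log WHERE ip_type = 'ICMP';

-- The dominant value: nearly every row matches, so a sequential scan is
-- cheaper -- an index scan would read the index *and* almost every tuple.
EXPLAIN SELECT count(*) FROM log WHERE ip_type = 'UDP';
```

If the planner is working from good statistics, the two EXPLAIN outputs
should show different plan choices for the rare and common values.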

Jason

On Thu, 2002-01-31 at 12:25, Milos Prudek wrote:
> Hi all,
>
> I have a table with 253.380 records. It's a firewall log. I thought that
> creating an index would improve queries, but the results so far have
> been disappointing.
>
> The table has about 20 columns, most of them of type "text". There's a
> text field "ip_type" that has one of three values: TCP, UDP, ICMP. I
> tried to do two selects with and without an index on "ip_type".
>
>
> Without index:
>
> "select count(*) from log where ip_type='udp';" takes 3.0 seconds (this
> query evaluates to zero rows).
>
> "select count(*) from log where ip_type='UDP';" takes 4.5 seconds (this
> query evaluates to 245.182 rows).
>
>
>
> With index:
>
> "select count(*) from log where ip_type='udp';" takes 0.0 seconds.
>
> "select count(*) from log where ip_type='UDP';" takes 5.0 seconds.
>
>
> It looks like creating an index degrades performance if the result set
> is similar in size to the whole table (I had much better results when
> the condition matched only about two thousand records). Is this
> normal?
> --
> Milos Prudek


Re: index does not improve performance

From
Neil Conway
Date:
On Thu, 2002-01-31 at 14:25, Milos Prudek wrote:
> It looks like creating an index degrades performance if the result set
> is similar in size to the whole table (I had much better results when
> the condition matched only about two thousand records). Is this
> normal?

Yes -- for very large result sets, you will get worse performance using
an index scan than not using one.

I *believe* in 7.2 that Pg keeps good enough statistics to figure this
out most of the time (so it would use the index for the first query but
not the second); however, I may be wrong.

BTW, have you VACUUM ANALYZE'd this table? What version of Pg are you
running? The output for 'explain' for these queries (with and without
index) would be helpful.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC


Re: index does not improve performance

From
Milos Prudek
Date:
> BTW, have you VACUUM ANALYZE'd this table? What version of Pg are you

On your advice I did VACUUM ANALYZE, and it did help! From 5.5 seconds
I'm down to 3.4 seconds.

I'm using PostgreSQL 7.1.3, RPM binary.


> running? The output for 'explain' for these queries (with and without
> index) would be helpful.

With index:
Aggregate  (cost=11292.20..11292.20 rows=1 width=0)
  ->  Seq Scan on log  (cost=0.00..10679.25 rows=245182 width=0)

Without index:
Aggregate  (cost=11292.20..11292.20 rows=1 width=0)
  ->  Seq Scan on log  (cost=0.00..10679.25 rows=245182 width=0)


--
Milos Prudek


Re: index does not improve performance

From
Milos Prudek
Date:
> Unfortunately, Milos, an index isn't likely to help on this type of
> query.  It would appear that the value 'UDP' accounts for the vast
> majority of the rows, and so an index loses most of its value.  You see,
> it actually takes *longer* to return queries using the index if a
> significant portion of the table is being touched, because the database
> has to check both the index and the tuple.  One of the new features of

Jason and Neil,

Thank you very much for the detailed answer. I'm relatively new to SQL.
My background is a much more modest system, FoxPro for DOS, which in
ancient times used what they called "patented Rushmore technology". I
think that FoxPro was able to return a COUNT immediately if an index was
used, no matter how many records matched the condition. I wonder if this
is something that PostgreSQL will eventually be able to do, or if it is
simply not technically possible for some reason.


--
Milos Prudek



Re: index does not improve performance

From
Tom Lane
Date:
Milos Prudek <milos.prudek@tiscali.cz> writes:
> I think that FoxPro was able to return a COUNT immediately if an index
> was used, no matter how many records matched the condition. I wonder if
> this is something that PostgreSQL will eventually be able to do, or if
> it is simply not technically possible for some reason.

It is not going to happen in the foreseeable future, for the simple
reason that there isn't necessarily a uniquely correct COUNT value
in Postgres' view of the world, and thus no prospect of maintaining
a single counter that would be of any use.  You may care to read the
MVCC chapter of the manual,

http://developer.postgresql.org/docs/postgres/mvcc.html

Also see my talk on Postgres transactions, a PDF available at

http://developer.postgresql.org/osdn.php

This is a bit out of date (the comments about crash-safeness predate
the addition of WAL) but the discussion of tuple visibility is still
very relevant.
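To make the point concrete, here is a hypothetical two-session
transcript (it assumes the "log" table from this thread): under MVCC,
each transaction sees its own snapshot, so two simultaneous counts can
legitimately differ.

```sql
-- Hypothetical illustration; "log" and "ip_type" are assumed names.

-- Session A:
BEGIN;
INSERT INTO log (ip_type) VALUES ('TCP');
SELECT count(*) FROM log;  -- A's count includes its own uncommitted row

-- Session B, running at the same moment:
SELECT count(*) FROM log;  -- B's count does not see A's row until COMMIT

-- Both answers are correct for the transaction that asked, which is why
-- no single global counter could serve them both.
```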

            regards, tom lane

Re: index does not improve performance

From
Frank Bax
Date:
The "VACUUM" part of "VACUUM ANALYZE" is like "PACK" from your FoxPro
days.  The extra speed came from making the physical files smaller.  A
significant difference from FoxPro is that with PG, an UPDATE actually
does an INSERT of a new row and a DELETE of the old row.  You will want
to perform VACUUM ANALYZE regularly.

The "Seq Scan" part of the explain output says that PG decided to do a
sequential read of your table, ignoring the index even when it did
exist.  There must have been other factors involved to cause these
queries to run in 4.5 and 5.0 seconds respectively.  They should have
run in the same amount of time.

Frank

At 04:02 PM 2/2/02 +0100, Milos Prudek wrote:
>> BTW, have you VACUUM ANALYZE'd this table? What version of Pg are you
>
>On your advice I did VACUUM ANALYZE, and it did help! From 5.5 seconds
>I'm down to 3.4 seconds.
>
>I'm using PostgreSQL 7.1.3, RPM binary.
>
>
>> running? The output for 'explain' for these queries (with and without
>> index) would be helpful.
>
>With index:
>Aggregate  (cost=11292.20..11292.20 rows=1 width=0)
>  ->  Seq Scan on log  (cost=0.00..10679.25 rows=245182 width=0)
>
>Without index:
>Aggregate  (cost=11292.20..11292.20 rows=1 width=0)
>  ->  Seq Scan on log  (cost=0.00..10679.25 rows=245182 width=0)
>
>
>--
>Milos Prudek

Re: index does not improve performance

From
Jason Earl
Date:
Sometimes it's horses for courses.  I am currently replacing a DOS
Paradox application, and I have run up against some of the same
obstacles.  With PostgreSQL, getting a count of the rows in a table is
an expensive operation (requiring a sequential scan).  That's just the
nature of the beast.  If you really need to return the total number of
rows, then about the only way to do it is to maintain that information
in another table.  Of course, that requires creating custom triggers for
insert and delete, and you get to worry about contention for this new
table (and deadlock, concurrency, and a list of other problems).
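A sketch of what such a counter table and its triggers might look like
(every name here is made up, and the syntax is aimed at plpgsql as it
existed around the 7.1 era, so treat it as an outline rather than tested
code):

```sql
-- Hypothetical: a single-row table holding the current row count of "log".
CREATE TABLE log_rowcount (n bigint NOT NULL);
INSERT INTO log_rowcount SELECT count(*) FROM log;

-- A trigger function keeps the counter in step with inserts and deletes.
CREATE FUNCTION log_count_trig() RETURNS opaque AS '
BEGIN
    IF TG_OP = ''INSERT'' THEN
        UPDATE log_rowcount SET n = n + 1;
    ELSE
        UPDATE log_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER log_count AFTER INSERT OR DELETE ON log
    FOR EACH ROW EXECUTE PROCEDURE log_count_trig();

-- Now the total is one cheap read:
SELECT n FROM log_rowcount;
```

Note that the single-row table is exactly where the contention mentioned
above shows up: every writer must update the same row.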

I have a similar problem with a set of tables that store caseweights.
These tables are all fairly large (the smallest has 16 million rows),
and I am using commodity hardware.  This makes sequential scans very
painful.  Queries that return a small portion of the total rows,
however, return very quickly.

So I simply created a set of summary tables that hold 15 minute
summaries of the statistics I need.  I populate these tables with small
Python scripts launched automatically using cron.  Now the tables that I
actually query are several orders of magnitude smaller than my raw data
tables, and my queries return almost instantly.
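A minimal sketch of that approach (the table, the "stamp" timestamp
column, and hourly buckets in place of 15-minute ones are all
assumptions for illustration):

```sql
-- Hypothetical summary table for the firewall log.
CREATE TABLE log_summary (
    period   timestamp,
    ip_type  text,
    hits     bigint
);

-- Run periodically (e.g. from cron) to roll up the previous hour.
INSERT INTO log_summary
SELECT date_trunc('hour', stamp), ip_type, count(*)
FROM log
WHERE stamp >= date_trunc('hour', now()) - interval '1 hour'
  AND stamp <  date_trunc('hour', now())
GROUP BY 1, 2;

-- Reports now scan the small summary table, not the raw log.
SELECT sum(hits) FROM log_summary WHERE ip_type = 'UDP';
```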

Since it would appear that you are primarily logging firewall data
(which shouldn't change after the fact), your application is a prime
candidate for this type of optimization.

Hope this was helpful,
Jason

On Sat, 2002-02-02 at 07:44, Milos Prudek wrote:
> > Unfortunately, Milos, an index isn't likely to help on this type of
> > query.  It would appear that the value 'UDP' accounts for the vast
> > majority of the rows, and so an index loses most of its value.  You see,
> > it actually takes *longer* to return queries using the index if a
> > significant portion of the table is being touched, because the database
> > has to check both the index and the tuple.  One of the new features of
>
> Jason and Neil,
>
> Thank you very much for the detailed answer. I'm relatively new to SQL.
> My background is a much more modest system, FoxPro for DOS, which in
> ancient times used what they called "patented Rushmore technology". I
> think that FoxPro was able to return a COUNT immediately if an index
> was used, no matter how many records matched the condition. I wonder if
> this is something that PostgreSQL will eventually be able to do, or if
> it is simply not technically possible for some reason.
>
>
> --
> Milos Prudek



Re: index does not improve performance

From
Milos Prudek
Date:
Frank Bax wrote:
>
> The "VACUUM" part of "VACUUM ANALYZE" is like "PACK" from your FoxPro days.

I knew that, since I read the fine manual :-) No harm in pointing it
out, though.

>  The extra speed came from making the physical files smaller.  A
> significant difference from FoxPro is that with PG, an UPDATE actually
> does an INSERT of a new row and a DELETE of the old row.  You will want
> to perform

But I did not know this. Thanks to you, Frank and Darko!

--
Milos Prudek