Thread: improve select performance...

improve select performance...

From
Steve Holdoway
Date:
Can anyone out there help me? I'm having a bit of a problem trying to improve the performance of a PHP script that does
a data load.

At the moment, the script checks the database to see if an entry is present and, if not, adds it. The table has
about 400,000 rows, and I'm checking a bulk load of about 50,000 entries. The table is defined as follows (not my
design!):

Welcome to psql 8.2.4, the PostgreSQL interactive terminal.


db=> \d badurls
                     Table "public.badurls"
   Column    |            Type             |     Modifiers
-------------+-----------------------------+--------------------
 url         | character varying           | not null
 ip          | character(20)               |
 dateadded   | timestamp without time zone | not null
 type        | character(1)                |
 publishdate | timestamp without time zone |
 status      | integer                     | not null default 1
 version     | integer                     | default 0
 edited      | integer                     | default 0
 category    | character(4)[]              |
Indexes:
    "idxbadurls_url" hash (url)
    "idxbadurls_version" btree (version)


and I'm accessing the url column. (I've tried a btree index as well; it made little difference.)

In an attempt to improve the performance, I'm fetching the list of entries that are already there in a single hit, and not
attempting to insert them (I'm still checking the rest first, just in case...).

So I'm trying to run a "select url from badurls where url in ( .... )" to get this list. However, this is also
desperately slow...

1000 in the list:  14 sec
5000 in the list:  2 min 40 sec
10000 in the list: 5 min 50 sec

(Run from a client machine over a 100 Mbit LAN.)
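In rough SQL terms, the pattern is something like this (a sketch only: the "incoming" staging table and the file path are illustrative, not my actual code):

```sql
-- Load the ~50,000-entry batch into a temporary staging table first.
CREATE TEMP TABLE incoming (LIKE badurls);
COPY incoming FROM '/tmp/batch.csv' WITH CSV;

-- Which of the incoming urls are already present?
SELECT url FROM badurls WHERE url IN (SELECT url FROM incoming);

-- Insert only the ones that aren't there yet.
INSERT INTO badurls
SELECT i.*
FROM incoming i
WHERE NOT EXISTS (SELECT 1 FROM badurls b WHERE b.url = i.url);
```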

Anyway, that's enough to prove to me that I need to rejig stuff to get this to work!!!

The server is a dual Xeon machine with hardware RAID, running Debian. It has 4GB of memory; 1GB is currently not even
used for IO buffers.

The postgres instance is configured as standard, except for increasing the shared_buffers to 200MB.

The explain plan for about 50 entries in the IN clause returns:

Bitmap Heap Scan on badurls  (cost=4531.22..15748.67 rows=60462 width=32)
   Recheck Cond: ((url)::text = ANY (('{.....}'::character varying[])::text[]))
   ->  Bitmap Index Scan on idxbadurls_url  (cost=0.00..4516.10 rows=60462 width=0)
         Index Cond: ((url)::text = ANY (('{.....}'::character varying[])::text[]))
(4 rows)
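For comparison, the same lookup can also be written as a join against a VALUES list, which the planner treats as a small relation (VALUES lists are new in 8.2; the urls below are placeholders):

```sql
-- Join the table against an inline list of urls instead of a long IN (...).
SELECT b.url
FROM badurls b
JOIN (VALUES ('http://example.com/a'),
             ('http://example.com/b')) AS v(url)
  ON b.url = v.url;
```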


Can anyone suggest what I should be reconfiguring???

Cheers,


Steve

Re: improve select performance...

From
"Phillip Smith"
Date:
Can you post the actual queries and the full EXPLAIN ANALYZE for the
queries?

Are you running 64bit Debian and PostgreSQL?

Cheers,
~p

-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Steve Holdoway
Sent: Wednesday, 16 May 2007 09:08
To: pgsql-admin@postgresql.org
Subject: [ADMIN] improve select performance...




Re: improve select performance...

From
Steve Holdoway
Date:
I managed to get this working properly by dropping the index on the url column and making it the primary key instead.
Can anyone point me at docs that define the difference between a primary key and a unique index on the same field? I'm
a bit confused here!

Cheers,  Steve

On Wed, 16 May 2007 11:08:10 +1200
Steve Holdoway <steve.holdoway@firetrust.com> wrote:


Re: improve select performance...

From
Tom Lane
Date:
Steve Holdoway <steve.holdoway@firetrust.com> writes:
> I managed to get this working properly by dropping the index on the
> url column, and making it the primary key instead. Can anyone point me
> at docs that define the difference between a primary key and a unique
> index on the same field? I'm a bit confused here!

A primary key is a unique index plus a not-null constraint on its
column(s), plus it's the default reference target for FOREIGN KEY
references to the table (which is why there can be only one per table).
AFAIK that's it.  For the purposes of SELECTs there really is no
difference.
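That is, for lookup purposes these two declarations are roughly interchangeable (a sketch; note your url column is already NOT NULL, and the index name in option 2 is made up):

```sql
-- Option 1: declare a primary key (this creates a unique btree index).
ALTER TABLE badurls ADD PRIMARY KEY (url);

-- Option 2: a unique btree index plus an explicit NOT NULL constraint.
CREATE UNIQUE INDEX idxbadurls_url_uniq ON badurls (url);
ALTER TABLE badurls ALTER COLUMN url SET NOT NULL;
```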

>> db=> \d badurls
...
>> Indexes:
>> "idxbadurls_url" hash (url)
>> "idxbadurls_version" btree (version)

One problem here is you used a hash index.  Postgres' hash index
implementation is pretty poor.  I suspect that the explicit marking
of the index as unique might've changed the plan too; it's fairly
obvious from your EXPLAIN that the planner didn't previously know
the column was unique.  (Which suggests that you oughta ANALYZE
more often.)
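Concretely, the two fixes suggested above might look like this (a sketch, reusing your existing index name):

```sql
DROP INDEX idxbadurls_url;                            -- drop the hash index
CREATE UNIQUE INDEX idxbadurls_url ON badurls (url);  -- btree is the default
ANALYZE badurls;                                      -- refresh planner statistics
```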

            regards, tom lane