Thread: Can this query go faster???

Can this query go faster???

From
Joost Kraaijeveld
Date:
Hi,

Is it possible to make this query run faster than it does now, for example
by adding indexes or changing the query?

SELECT customers.objectid FROM prototype.customers, prototype.addresses
WHERE
customers.contactaddress = addresses.objectid
ORDER BY zipCode asc, housenumber asc
LIMIT 1 OFFSET 283745

Explain:

Limit  (cost=90956.71..90956.71 rows=1 width=55)
  ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
        Sort Key: addresses.zipcode, addresses.housenumber
        ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
              Hash Cond: ("outer".contactaddress = "inner".objectid)
              ->  Seq Scan on customers  (cost=0.00..31392.15
rows=368915 width=80)
              ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
                    ->  Seq Scan on addresses  (cost=0.00..13675.15
rows=369315 width=55)

The customers table has an index on contactaddress and objectid.
The addresses table has an index on zipcode+housenumber and objectid.

TIA

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl



Re: Can this query go faster???

From
Michael Riess
Date:
> Hi,
>
> Is it possible to make this query run faster than it does now, for example
> by adding indexes or changing the query?
>
> SELECT customers.objectid FROM prototype.customers, prototype.addresses
> WHERE
> customers.contactaddress = addresses.objectid
> ORDER BY zipCode asc, housenumber asc
> LIMIT 1 OFFSET 283745
>
> Explain:
>
> Limit  (cost=90956.71..90956.71 rows=1 width=55)
>   ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
>         Sort Key: addresses.zipcode, addresses.housenumber
>         ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
>               Hash Cond: ("outer".contactaddress = "inner".objectid)
>               ->  Seq Scan on customers  (cost=0.00..31392.15
> rows=368915 width=80)
>               ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
>                     ->  Seq Scan on addresses  (cost=0.00..13675.15
> rows=369315 width=55)
>
> The customers table has an index on contactaddress and objectid.
> The addresses table has an index on zipcode+housenumber and objectid.

When the resulting relation contains all the info from both tables,
indexes won't help; a seq scan is inevitable.

Re: Can this query go faster???

From
Csaba Nagy
Date:
Joost,

Why do you use an offset here? I guess you're traversing the table
somehow; in this case it would be better to remember the last zipcode +
housenumber and add a condition to get the next one bigger than
the last one you've got... that would use the index on
zipcode+housenumber and be very fast. The big offset forces postgres to
traverse that many entries until it's able to pick the one row for the
result...
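
Something like this, as a sketch (the literal values for the last row
you've seen are made up, and the column types are guesses):

SELECT c.objectid
FROM prototype.customers c
JOIN prototype.addresses a ON c.contactaddress = a.objectid
-- keyset condition instead of OFFSET: resume right after the last
-- (zipcode, housenumber) pair already fetched
WHERE (a.zipcode, a.housenumber) > ('1234AB', 10)
ORDER BY a.zipcode ASC, a.housenumber ASC
LIMIT 1;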

On Tue, 2005-12-06 at 10:43, Joost Kraaijeveld wrote:
> Hi,
>
> Is it possible to make this query run faster than it does now, for example
> by adding indexes or changing the query?
>
> SELECT customers.objectid FROM prototype.customers, prototype.addresses
> WHERE
> customers.contactaddress = addresses.objectid
> ORDER BY zipCode asc, housenumber asc
> LIMIT 1 OFFSET 283745
>
> Explain:
>
> Limit  (cost=90956.71..90956.71 rows=1 width=55)
>   ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
>         Sort Key: addresses.zipcode, addresses.housenumber
>         ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
>               Hash Cond: ("outer".contactaddress = "inner".objectid)
>               ->  Seq Scan on customers  (cost=0.00..31392.15
> rows=368915 width=80)
>               ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
>                     ->  Seq Scan on addresses  (cost=0.00..13675.15
> rows=369315 width=55)
>
> The customers table has an index on contactaddress and objectid.
> The addresses table has an index on zipcode+housenumber and objectid.


Re: Can this query go faster???

From
"Markus Wollny"
Date:
Hi,

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> Joost Kraaijeveld
> Sent: Tuesday, 6 December 2005 10:44
> To: Pgsql-Performance
> Subject: [PERFORM] Can this query go faster???

> SELECT customers.objectid FROM prototype.customers,
> prototype.addresses WHERE customers.contactaddress =
> addresses.objectid ORDER BY zipCode asc, housenumber asc
> LIMIT 1 OFFSET 283745
>
> Explain:
>
> Limit  (cost=90956.71..90956.71 rows=1 width=55)
>   ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
>         Sort Key: addresses.zipcode, addresses.housenumber
>         ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
>               Hash Cond: ("outer".contactaddress = "inner".objectid)
>               ->  Seq Scan on customers  (cost=0.00..31392.15
> rows=368915 width=80)
>               ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
>                     ->  Seq Scan on addresses  (cost=0.00..13675.15
> rows=369315 width=55)
>
> The customers table has an index on contactaddress and objectid.
> The addresses table has an index on zipcode+housenumber and objectid.

The planner chooses sequential scans on customers.contactaddress and
addresses.objectid instead of using the indices. In order to determine
whether this is a sane decision, you should run EXPLAIN ANALYZE on this
query, once with SET enable_seqscan = on; and once with SET
enable_seqscan = off;. If the query is significantly faster with
seqscan off, then something is amiss - either you haven't run analyze
often enough, so the stats are out of date, or you have random_page_cost
set too high (look for the setting in postgresql.conf) - these two are
the "usual suspects".
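
For example (a sketch; both settings are session-local):

SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT customers.objectid FROM prototype.customers, prototype.addresses
WHERE customers.contactaddress = addresses.objectid
ORDER BY zipCode asc, housenumber asc
LIMIT 1 OFFSET 283745;

-- then run the same EXPLAIN ANALYZE again after resetting the planner:
SET enable_seqscan = on;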

Kind regards

   Markus

Re: Can this query go faster???

From
Joost Kraaijeveld
Date:
On Tue, 2005-12-06 at 10:52 +0100, Csaba Nagy wrote:
> Joost,
>
> Why do you use an offset here? I guess you're traversing the table
> somehow; in this case it would be better to remember the last zipcode +
> housenumber and add a condition to get the next one bigger than
> the last one you've got... that would use the index on
> zipcode+housenumber and be very fast. The big offset forces postgres to
> traverse that many entries until it's able to pick the one row for the
I am forced to translate a sorting-dependent record number to a record
in the database. The GUI (a Java JTable) works with record/row numbers,
which is handy if one has an ISAM database, but not if one uses
PostgreSQL.

I wonder if using a forward-scrolling cursor would be faster.

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl



Re: Can this query go faster???

From
Tino Wildenhain
Date:
Joost Kraaijeveld wrote:
> On Tue, 2005-12-06 at 10:52 +0100, Csaba Nagy wrote:
>
>>Joost,
>>
>>Why do you use an offset here? I guess you're traversing the table
>>somehow; in this case it would be better to remember the last zipcode +
>>housenumber and add a condition to get the next one bigger than
>>the last one you've got... that would use the index on
>>zipcode+housenumber and be very fast. The big offset forces postgres to
>>traverse that many entries until it's able to pick the one row for the
>
> I am forced to translate a sorting-dependent record number to a record
> in the database. The GUI (a Java JTable) works with record/row numbers,
> which is handy if one has an ISAM database, but not if one uses
> PostgreSQL.

You can have a row number in postgres easily too. For example, you could
just include a serial column for the row number.
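
A sketch of that idea (hypothetical names; note the serial numbers rows
in insert order only):

-- add a stable row number and index it
ALTER TABLE prototype.customers ADD COLUMN rownum serial;
CREATE UNIQUE INDEX customers_rownum_idx ON prototype.customers (rownum);

-- jumping to "row" 283746 then becomes a plain index lookup
SELECT objectid FROM prototype.customers WHERE rownum = 283746;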

A cursor would work too, but you would need to have a persistent connection.

Regards
Tino

Re: Can this query go faster???

From
Joost Kraaijeveld
Date:
Hi Tino,

On Tue, 2005-12-06 at 11:32 +0100, Tino Wildenhain wrote:
> You can have a row number in postgres easily too. For example, you could
> just include a serial column for the row number.
Not if the order of things is determined at runtime and not at insert time...

> A cursor would work too, but you would need to have a persistent connection.
I just tried it: a cursor is not faster (which does not surprise me at
all, as the amount of work looks the same to me).

I guess there is no solution.

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl



Re: Can this query go faster???

From
Tino Wildenhain
Date:
Joost Kraaijeveld wrote:
> Hi Tino,
>
..
>
>>A cursor would work too, but you would need to have a persistent connection.
>
> I just tried it: a cursor is not faster (which does not surprise me at
> all, as the amount of work looks the same to me).

Actually no: if you scroll forward, you just ask the database for the
next rows to materialize. So if you are already ahead in your result set
and ask for the next rows, it should be faster than working with an
offset from the start each time.
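
As a sketch (cursor name made up):

BEGIN;
DECLARE cust CURSOR FOR
    SELECT c.objectid
    FROM prototype.customers c
    JOIN prototype.addresses a ON c.contactaddress = a.objectid
    ORDER BY a.zipcode, a.housenumber;
MOVE FORWARD 283745 IN cust;  -- position once
FETCH 1 FROM cust;            -- the wanted row
FETCH 10 FROM cust;           -- later fetches continue from the current
                              -- position; no new scan from the start
COMMIT;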



Re: Can this query go faster???

From
Joost Kraaijeveld
Date:
On Tue, 2005-12-06 at 12:36 +0100, Tino Wildenhain wrote:
> >
> > I just tried it: a cursor is not faster (which does not surprise me at
> > all, as the amount of work looks the same to me).
>
> Actually no: if you scroll forward, you just ask the database for the
> next rows to materialize. So if you are already ahead in your result set
> and ask for the next rows, it should be faster than working with an
> offset from the start each time.
Ah, a misunderstanding: I only need to calculate an index if the user
wants a record that is not in or adjacent to the cache (in which case I
can do a "select values > last value in the cache"). So I must always
materialize all rows below the wanted index.

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl



Re: Can this query go faster???

From
Tino Wildenhain
Date:
Joost Kraaijeveld wrote:
> On Tue, 2005-12-06 at 12:36 +0100, Tino Wildenhain wrote:
>
>>>I just tried it: a cursor is not faster (which does not surprise me at
>>>all, as the amount of work looks the same to me).
>>
>>Actually no: if you scroll forward, you just ask the database for the
>>next rows to materialize. So if you are already ahead in your result set
>>and ask for the next rows, it should be faster than working with an
>>offset from the start each time.
>
> Ah, a misunderstanding: I only need to calculate an index if the user
> wants a record that is not in or adjacent to the cache (in which case I
> can do a "select values > last value in the cache"). So I must always
> materialize all rows below the wanted index.
>
Yes, but still, advancing a few blocks from where the cursor is
should be faster than re-issuing the query and scrolling through
the whole result set to where you want to go.



Re: Can this query go faster???

From
Csaba Nagy
Date:
On Tue, 2005-12-06 at 13:20, Joost Kraaijeveld wrote:
[snip]
> Ah, a misunderstanding: I only need to calculate an index if the user
> wants a record that is not in or adjacent to the cache (in which case I
> can do a "select values > last value in the cache"). So I must always
> materialize all rows below the wanted index.

In this case the query will very likely not get faster. It must always
visit all the records up to the required offset. If the plan should be
faster using the index, then you probably need to analyze (I don't
recall from your former posts whether you did that recently); in any
case you could check an "explain analyze" to see whether the planner is
mistaken - you might already know this.

Cheers,
Csaba.



Re: Can this query go faster???

From
Ron
Date:
At 04:43 AM 12/6/2005, Joost Kraaijeveld wrote:
>Hi,
>
>Is it possible to make this query run faster than it does now, for example
>by adding indexes or changing the query?
>
>SELECT customers.objectid FROM prototype.customers, prototype.addresses
>WHERE
>customers.contactaddress = addresses.objectid
>ORDER BY zipCode asc, housenumber asc
>LIMIT 1 OFFSET 283745
>
>Explain:
>
>Limit  (cost=90956.71..90956.71 rows=1 width=55)
>   ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
>         Sort Key: addresses.zipcode, addresses.housenumber
>         ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
>               Hash Cond: ("outer".contactaddress = "inner".objectid)
>               ->  Seq Scan on customers  (cost=0.00..31392.15
>rows=368915 width=80)
>               ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
>                     ->  Seq Scan on addresses  (cost=0.00..13675.15
>rows=369315 width=55)
>
>The customers table has an index on contactaddress and objectid.
>The addresses table has an index on zipcode+housenumber and objectid.
Customer names, customers.objectid, addresses, and addresses.objectid
should all be static (addresses do not change, just the customers
associated with them; and once a customer has been assigned an id,
that had better never change...).

To me, this sounds like the addresses and customers tables should be
duplicated and then physically laid out in sorted order by
<tablename>.objectid in one set and by the "human friendly"
associated string in the other set.
Then finding a specific <tablename>.objectid or its associated
string can be done in at worst O(lg n) time, assuming binary search,
instead of O(n) time for a sequential scan.  If pg is clever enough,
it might be able to do better than that.

IOW, I'd try duplicating the addresses and customers tables and using
the appropriate CLUSTERed Index on each.
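
A sketch of the CLUSTER part (index name made up; current PostgreSQL
spells it CLUSTER table USING index, while the 8.1-era syntax was
CLUSTER index ON table):

CREATE INDEX addresses_zip_hnr
    ON prototype.addresses (zipcode, housenumber);
CLUSTER prototype.addresses USING addresses_zip_hnr;
ANALYZE prototype.addresses;
-- CLUSTER is a one-off physical rewrite: rows inserted or updated
-- afterwards are not kept in order, so re-run it periodically.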

I know this breaks Normal Form.  OTOH, this kind of thing is common
practice for data mining problems on static or almost static data.

Hope this is helpful,
Ron



Re: Can this query go faster???

From
"Merlin Moncure"
Date:
> On Tue, 2005-12-06 at 11:32 +0100, Tino Wildenhain wrote:
> > You can have a row number in postgres easily too. For example, you could
> > just include a serial column for the row number.
> Not if the order of things is determined at runtime and not at insert
> time...
>
> > A cursor would work too, but you would need to have a persistent
> > connection.
> I just tried it: a cursor is not faster (which does not surprise me at
> all, as the amount of work looks the same to me).
>
> I guess there is no solution.
>

Sure there is. This begs the question: why do you want to read exactly
283745 rows ahead of row 'x'? :)

If you are scrolling forwards in a set, just pull in, say, 100-1000 rows
at a time, ordered, and grab the next 1000 based on the highest value
read previously.

You can do this on the server side (cursor) or the client side
(parameterized query). There are advantages and disadvantages to each.
If you are looping over this set and doing processing, a cursor would
be ideal (try out pl/pgsql).
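
The client-side flavor could look like this (a sketch; the parameter
types are guesses, and the values come from the last row of the
previous batch):

PREPARE next_page(text, int) AS
    SELECT c.objectid, a.zipcode, a.housenumber
    FROM prototype.customers c
    JOIN prototype.addresses a ON c.contactaddress = a.objectid
    WHERE (a.zipcode, a.housenumber) > ($1, $2)
    ORDER BY a.zipcode, a.housenumber
    LIMIT 1000;

EXECUTE next_page('1234AB', 10);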

Welcome to PostgreSQL! :)

Merlin

Re: Can this query go faster???

From
Bruno Wolff III
Date:
On Tue, Dec 06, 2005 at 10:52:57 +0100,
  Csaba Nagy <nagy@ecircle-ag.com> wrote:
> Joost,
>
> Why do you use an offset here? I guess you're traversing the table
> somehow; in this case it would be better to remember the last zipcode +
> housenumber and add a condition to get the next one bigger than
> the last one you've got... that would use the index on
> zipcode+housenumber and be very fast. The big offset forces postgres to
> traverse that many entries until it's able to pick the one row for the
> result...

The other problem with saving an offset is that unless the data isn't
changing, or you are doing all of the searches in one serialized
transaction, the fixed offset might not put you back where you left off.
Using the last key, instead of counting records, is normally a better
way to do this.