Thread: Can this query go faster???
Hi,

Is it possible to make this query run faster than it does now, by adding
indexes or changing the query?

SELECT customers.objectid FROM prototype.customers, prototype.addresses
WHERE
customers.contactaddress = addresses.objectid
ORDER BY zipCode asc, housenumber asc
LIMIT 1 OFFSET 283745

Explain:

Limit  (cost=90956.71..90956.71 rows=1 width=55)
  ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
        Sort Key: addresses.zipcode, addresses.housenumber
        ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
              Hash Cond: ("outer".contactaddress = "inner".objectid)
              ->  Seq Scan on customers  (cost=0.00..31392.15 rows=368915 width=80)
              ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
                    ->  Seq Scan on addresses  (cost=0.00..13675.15 rows=369315 width=55)

The customers table has an index on contactaddress and objectid.
The addresses table has an index on zipcode+housenumber and objectid.

TIA

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl
> Hi,
>
> Is it possible to make this query run faster than it does now, by adding
> indexes or changing the query?
>
> SELECT customers.objectid FROM prototype.customers, prototype.addresses
> WHERE
> customers.contactaddress = addresses.objectid
> ORDER BY zipCode asc, housenumber asc
> LIMIT 1 OFFSET 283745
>
> Explain:
>
> Limit  (cost=90956.71..90956.71 rows=1 width=55)
>   ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
>         Sort Key: addresses.zipcode, addresses.housenumber
>         ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
>               Hash Cond: ("outer".contactaddress = "inner".objectid)
>               ->  Seq Scan on customers  (cost=0.00..31392.15 rows=368915 width=80)
>               ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
>                     ->  Seq Scan on addresses  (cost=0.00..13675.15 rows=369315 width=55)
>
> The customers table has an index on contactaddress and objectid.
> The addresses table has an index on zipcode+housenumber and objectid.

When the resulting relation contains all the info from both tables, indexes
won't help; a sequential scan is inevitable.
Joost,

Why do you use an offset here ? I guess you're traversing the table
somehow, in this case it would be better to remember the last zipcode +
housenumber and put an additional condition to get the next bigger than
the last one you've got... that would go for the index on
zipcode+housenumber and be very fast. The big offset forces postgres to
traverse that many entries until it's able to pick the one row for the
result...

On Tue, 2005-12-06 at 10:43, Joost Kraaijeveld wrote:
> Hi,
>
> Is it possible to make this query run faster than it does now, by adding
> indexes or changing the query?
>
> SELECT customers.objectid FROM prototype.customers, prototype.addresses
> WHERE
> customers.contactaddress = addresses.objectid
> ORDER BY zipCode asc, housenumber asc
> LIMIT 1 OFFSET 283745
>
> Explain:
>
> Limit  (cost=90956.71..90956.71 rows=1 width=55)
>   ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
>         Sort Key: addresses.zipcode, addresses.housenumber
>         ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
>               Hash Cond: ("outer".contactaddress = "inner".objectid)
>               ->  Seq Scan on customers  (cost=0.00..31392.15
> rows=368915 width=80)
>               ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
>                     ->  Seq Scan on addresses  (cost=0.00..13675.15
> rows=369315 width=55)
>
> The customers table has an index on contactaddress and objectid.
> The addresses table has an index on zipcode+housenumber and objectid.
>
> TIA
>
> --
> Groeten,
>
> Joost Kraaijeveld
> Askesis B.V.
> Molukkenstraat 14
> 6524NB Nijmegen
> tel: 024-3888063 / 06-51855277
> fax: 024-3608416
> e-mail: J.Kraaijeveld@Askesis.nl
> web: www.askesis.nl
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
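A minimal sketch of the keyset approach Csaba describes, assuming the
application remembers the last (zipcode, housenumber) it read; the literal
values below are only placeholders for that last row, and the column types
are guesses since the thread doesn't show them:

    -- Sketch only: '6524NB' and 14 stand in for the last (zipcode, housenumber) read.
    SELECT customers.objectid
    FROM prototype.customers, prototype.addresses
    WHERE customers.contactaddress = addresses.objectid
      AND (addresses.zipcode > '6524NB'
           OR (addresses.zipcode = '6524NB' AND addresses.housenumber > 14))
    ORDER BY addresses.zipcode ASC, addresses.housenumber ASC
    LIMIT 1;

On more recent PostgreSQL releases the same condition can also be written as
a row comparison, (addresses.zipcode, addresses.housenumber) > ('6524NB', 14),
which the planner can match directly against the composite
zipcode+housenumber index.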
Hi,

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> Joost Kraaijeveld
> Sent: Tuesday, 6 December 2005 10:44
> To: Pgsql-Performance
> Subject: [PERFORM] Can this query go faster???

> SELECT customers.objectid FROM prototype.customers,
> prototype.addresses WHERE customers.contactaddress =
> addresses.objectid ORDER BY zipCode asc, housenumber asc
> LIMIT 1 OFFSET 283745
>
> Explain:
>
> Limit  (cost=90956.71..90956.71 rows=1 width=55)
>   ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
>         Sort Key: addresses.zipcode, addresses.housenumber
>         ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
>               Hash Cond: ("outer".contactaddress = "inner".objectid)
>               ->  Seq Scan on customers  (cost=0.00..31392.15
> rows=368915 width=80)
>               ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
>                     ->  Seq Scan on addresses  (cost=0.00..13675.15
> rows=369315 width=55)
>
> The customers table has an index on contactaddress and objectid.
> The addresses table has an index on zipcode+housenumber and objectid.

The planner chooses sequential scans on customers.contactaddress and
addresses.objectid instead of using the indices. In order to determine
whether this is a sane decision, you should run EXPLAIN ANALYZE on this
query, once with SET enable_seqscan = on; and once with
SET enable_seqscan = off;. If the query is significantly faster with
seqscan off, then something is amiss - either you haven't run ANALYZE
often enough, so the stats are out of date, or you have random_page_cost
set too high (look for the setting in postgresql.conf) - these two are
the "usual suspects".

Kind regards

Markus
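A sketch of the check Markus suggests, run in a psql session against the
original query. Note that enable_seqscan only affects the current session,
and "off" does not forbid sequential scans outright, it just makes the
planner price them as very expensive:

    SET enable_seqscan = on;
    EXPLAIN ANALYZE
    SELECT customers.objectid
    FROM prototype.customers, prototype.addresses
    WHERE customers.contactaddress = addresses.objectid
    ORDER BY zipCode ASC, housenumber ASC
    LIMIT 1 OFFSET 283745;

    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT customers.objectid
    FROM prototype.customers, prototype.addresses
    WHERE customers.contactaddress = addresses.objectid
    ORDER BY zipCode ASC, housenumber ASC
    LIMIT 1 OFFSET 283745;

    -- Restore the default afterwards.
    RESET enable_seqscan;

If the second run is much faster in actual time, compare the row estimates
in the plans with reality and look at how often ANALYZE runs and at
random_page_cost, as Markus says.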
On Tue, 2005-12-06 at 10:52 +0100, Csaba Nagy wrote:
> Joost,
>
> Why do you use an offset here ? I guess you're traversing the table
> somehow, in this case it would be better to remember the last zipcode +
> housenumber and put an additional condition to get the next bigger than
> the last one you've got... that would go for the index on
> zipcode+housenumber and be very fast. The big offset forces postgres to
> traverse that many entries until it's able to pick the one row for the

I am forced to translate a sorting-dependent record number to a record in
the database. The GUI (a Java JTable) works with record/row numbers, which
is handy if one has an ISAM database, but not if one uses PostgreSQL.

I wonder if using a forward scrolling cursor would be faster.

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl
Joost Kraaijeveld wrote:
> On Tue, 2005-12-06 at 10:52 +0100, Csaba Nagy wrote:
>
>> Joost,
>>
>> Why do you use an offset here ? I guess you're traversing the table
>> somehow, in this case it would be better to remember the last zipcode +
>> housenumber and put an additional condition to get the next bigger than
>> the last one you've got... that would go for the index on
>> zipcode+housenumber and be very fast. The big offset forces postgres to
>> traverse that many entries until it's able to pick the one row for the
>
> I am forced to translate a sorting-dependent record number to a record in
> the database. The GUI (a Java JTable) works with record/row numbers, which
> is handy if one has an ISAM database, but not if one uses PostgreSQL.

You can have a row number in postgres easily too. For example if you
just include a serial for the row number.

Cursor would work too but you would need to have a persistent connection.

Regards
Tino
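A minimal sketch of the serial-column idea Tino mentions. The table, column,
and index names here are purely illustrative, and, as Joost points out in
his reply, a serial assigns numbers in insert order, not in whatever sort
order the GUI happens to be using at runtime:

    -- Hypothetical table with an insert-time row number.
    CREATE TABLE prototype.customers_numbered (
        rownum         serial,      -- assigned automatically at insert time
        objectid       varchar,
        contactaddress varchar
    );
    CREATE INDEX customers_numbered_rownum_idx
        ON prototype.customers_numbered (rownum);

    -- Jumping to "row 283746" then becomes a cheap index lookup:
    SELECT objectid
    FROM prototype.customers_numbered
    WHERE rownum = 283746;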
Hi Tino,

On Tue, 2005-12-06 at 11:32 +0100, Tino Wildenhain wrote:
> You can have a row number in postgres easily too. For example if you
> just include a serial for the row number.

Not if the order of things is determined at runtime and not at insert time...

> Cursor would work too but you would need to have a persistent connection.

I just tried it: a cursor is not faster (which does not surprise me at
all, as the amount of work looks the same to me).

I guess there is no solution.

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl
Joost Kraaijeveld wrote:
> Hi Tino,
> ..
>
>> Cursor would work too but you would need to have a persistent connection.
>
> I just tried it: a cursor is not faster (which does not surprise me at
> all, as the amount of work looks the same to me).

Actually no: if you scroll forward, you just ask the database for the
next rows to materialize. So if you are already ahead in your result set
and ask for the next rows, it should be faster than working with an
offset from the start each time.
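A sketch of the scrolling-cursor pattern Tino describes. The cursor name is
illustrative, and everything has to happen on one connection inside one
transaction unless the cursor is declared WITH HOLD:

    BEGIN;

    DECLARE customer_cur CURSOR FOR
        SELECT customers.objectid
        FROM prototype.customers, prototype.addresses
        WHERE customers.contactaddress = addresses.objectid
        ORDER BY zipCode ASC, housenumber ASC;

    -- The first jump still has to skip past all the earlier rows...
    MOVE FORWARD 283745 IN customer_cur;
    FETCH 1 FROM customer_cur;

    -- ...but later requests only advance from the current position,
    -- instead of sorting and skipping from the start again.
    MOVE FORWARD 100 IN customer_cur;
    FETCH 1 FROM customer_cur;

    COMMIT;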
On Tue, 2005-12-06 at 12:36 +0100, Tino Wildenhain wrote:
> > I just tried it: a cursor is not faster (which does not surprise me at
> > all, as the amount of work looks the same to me).
>
> Actually no: if you scroll forward, you just ask the database for the
> next rows to materialize. So if you are already ahead in your result set
> and ask for the next rows, it should be faster than working with an
> offset from the start each time.

Ah, a misunderstanding: I only need to calculate an index if the user
wants a record that is not in or adjacent to the cache (in which case I
can do a "select values > last value in the cache"). So I must always
materialize all rows below the wanted index.

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl
Joost Kraaijeveld wrote:
> On Tue, 2005-12-06 at 12:36 +0100, Tino Wildenhain wrote:
>
>>> I just tried it: a cursor is not faster (which does not surprise me at
>>> all, as the amount of work looks the same to me).
>>
>> Actually no: if you scroll forward, you just ask the database for the
>> next rows to materialize. So if you are already ahead in your result set
>> and ask for the next rows, it should be faster than working with an
>> offset from the start each time.
>
> Ah, a misunderstanding: I only need to calculate an index if the user
> wants a record that is not in or adjacent to the cache (in which case I
> can do a "select values > last value in the cache"). So I must always
> materialize all rows below the wanted index.

Yes, but still, advancing a few blocks from where the cursor is should be
faster than re-issuing the query and scrolling through the whole result
set to where you want to go.
On Tue, 2005-12-06 at 13:20, Joost Kraaijeveld wrote:
[snip]
> Ah, a misunderstanding: I only need to calculate an index if the user
> wants a record that is not in or adjacent to the cache (in which case I
> can do a "select values > last value in the cache"). So I must always
> materialize all rows below the wanted index.

In this case the query will very likely not get faster: it must always
visit all the records up to the required offset. If the plan should be
faster using the index, then you probably need to run ANALYZE (I don't
recall from your former posts if you did it recently or not); in any case
you could check an "explain analyze" to see whether the planner is
mistaken or not - you might already know this.

Cheers,
Csaba.
At 04:43 AM 12/6/2005, Joost Kraaijeveld wrote:
>Hi,
>
>Is it possible to make this query run faster than it does now, by adding
>indexes or changing the query?
>
>SELECT customers.objectid FROM prototype.customers, prototype.addresses
>WHERE
>customers.contactaddress = addresses.objectid
>ORDER BY zipCode asc, housenumber asc
>LIMIT 1 OFFSET 283745
>
>Explain:
>
>Limit  (cost=90956.71..90956.71 rows=1 width=55)
>  ->  Sort  (cost=90247.34..91169.63 rows=368915 width=55)
>        Sort Key: addresses.zipcode, addresses.housenumber
>        ->  Hash Join  (cost=14598.44..56135.75 rows=368915 width=55)
>              Hash Cond: ("outer".contactaddress = "inner".objectid)
>              ->  Seq Scan on customers  (cost=0.00..31392.15
>rows=368915 width=80)
>              ->  Hash  (cost=13675.15..13675.15 rows=369315 width=55)
>                    ->  Seq Scan on addresses  (cost=0.00..13675.15
>rows=369315 width=55)
>
>The customers table has an index on contactaddress and objectid.
>The addresses table has an index on zipcode+housenumber and objectid.
>
>TIA

Customer names, customers.objectid, addresses, and addresses.objectid
should all be static (addresses do not change, just the customers
associated with them; and once a customer has been assigned an id, that
had better never change...).

To me, this sounds like the addresses and customers tables should be
duplicated and then physically laid out in sorted order: by
<tablename>.objectid in one set and by the "human friendly" associated
string in the other set.

Then finding a specific <tablename>.objectid or its associated string can
be done in at worst O(lg n) time, assuming binary search, instead of O(n)
time for a sequential scan. If pg is clever enough, it might be able to do
better than that.

IOW, I'd try duplicating the addresses and customers tables and using the
appropriate CLUSTERed index on each.

I know this breaks Normal Form. OTOH, this kind of thing is common
practice for data mining problems on static or almost static data.

Hope this is helpful,
Ron
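A sketch of Ron's CLUSTER idea for the addresses side, assuming the
zipcode+housenumber index mentioned in the thread. The index name is
illustrative; on releases before 8.3 the syntax is "CLUSTER indexname ON
tablename", and the physical ordering is not maintained automatically, so
CLUSTER has to be re-run after substantial changes:

    -- Illustrative index name; the thread only says such an index exists.
    CREATE INDEX addresses_zip_housenr_idx
        ON prototype.addresses (zipcode, housenumber);

    -- Rewrite the table in index order so scans in that order read
    -- mostly contiguous pages.
    CLUSTER prototype.addresses USING addresses_zip_housenr_idx;
    ANALYZE prototype.addresses;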
> On Tue, 2005-12-06 at 11:32 +0100, Tino Wildenhain wrote:
> > You can have a row number in postgres easily too. For example if you
> > just include a serial for the row number.
>
> Not if the order of things is determined at runtime and not at insert time...
>
> > Cursor would work too but you would need to have a persistent connection.
>
> I just tried it: a cursor is not faster (which does not surprise me at
> all, as the amount of work looks the same to me).
>
> I guess there is no solution.

Sure there is. This begs the question: why do you want to read exactly
283745 rows ahead of row 'x'? :)

If you are scrolling forwards in a set, just pull in, say, 100-1000 rows
at a time, ordered, and grab the next batch based on the highest value
read previously. You can do this on the server side (cursor) or the
client side (parameterized query). There are advantages and disadvantages
to each. If you are looping over this set and doing processing, a cursor
would be ideal (try out pl/pgsql).

Welcome to PostgreSQL! :)

Merlin
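A sketch of the client-side, parameterized variant Merlin describes: fetch
the next batch of rows after the last (zipcode, housenumber) the
application has seen. The statement name and the parameter types are
guesses, since the column types aren't shown in the thread:

    -- $1 = last zipcode read, $2 = last housenumber read.
    PREPARE next_batch(varchar, integer) AS
    SELECT customers.objectid, addresses.zipcode, addresses.housenumber
    FROM prototype.customers, prototype.addresses
    WHERE customers.contactaddress = addresses.objectid
      AND (addresses.zipcode > $1
           OR (addresses.zipcode = $1 AND addresses.housenumber > $2))
    ORDER BY addresses.zipcode ASC, addresses.housenumber ASC
    LIMIT 1000;

    EXECUTE next_batch('6524NB', 14);  -- placeholder values for the last row seen

The application keeps the zipcode and housenumber of the last row of each
batch and passes them as the parameters of the next call.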
On Tue, Dec 06, 2005 at 10:52:57 +0100,
  Csaba Nagy <nagy@ecircle-ag.com> wrote:
> Joost,
>
> Why do you use an offset here ? I guess you're traversing the table
> somehow, in this case it would be better to remember the last zipcode +
> housenumber and put an additional condition to get the next bigger than
> the last one you've got... that would go for the index on
> zipcode+housenumber and be very fast. The big offset forces postgres to
> traverse that many entries until it's able to pick the one row for the
> result...

The other problem with saving an offset is that, unless the data isn't
changing or you are doing all of the searches in one serialized
transaction, the fixed offset might not put you back where you left off.
Using the last key, instead of counting records, is normally a better way
to do this.