Re: BUG #5294: Sorts on more than just the order-by clause - Mailing list pgsql-bugs

From Allen Johnson
Subject Re: BUG #5294: Sorts on more than just the order-by clause
Date
Msg-id 6786ed4f1001221307g3f34f41cn7a1aa8113886dbd2@mail.gmail.com
Whole thread Raw
In response to Re: BUG #5294: Sorts on more than just the order-by clause  (Greg Stark <gsstark@mit.edu>)
Responses Re: BUG #5294: Sorts on more than just the order-by clause  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Ok, I've generated a test database with:
  * 20,000 users
  * 250,000 contacts
  * 1,124,700 attachments

The summary of the results is that the normal query takes about 32sec
on my machine. The hack query takes about  13sec.

Below are the queries and their `explain analyze` outputs. I've
attached the test layout as well as a zip file containing the ruby
scripts that generate the data in the default format that the 'copy'
command expects.

If anyone else wants to give it a try this is the procedure. I wrote
these scripts in a hurry so I'm sure there could be many improvements
:)

1. Apply the tables.sql file to a test database (might want to run the
create index commands after the data load)
2. Create the data files using scripts in the create-scripts.zip like this:
      ruby create-user.rb ; creates /tmp/users
      ruby create-contacts.rb ; creates /tmp/contacts
      ruby create-attachments.rb ; creates /tmp/attachments
3. Load data into the test database
      copy users from '/tmp/users';
      copy contacts from '/tmp/contacts';
      copy attachments from '/tmp/attachments';
4. Run the `create index` statements in the tables.sql file

I also have a pg_dump version if anyone wants it (~6MB gzipped).

/* NORMAL QUERY */
select
  users.id, users.full_name, users.username,
  ct.name as type_name,
  c.lname, c.fname, c.mname, c.email,
  c.address1, c.city, c.state, c.zip_code,
  c.created_date,
  count(a.id) as attachment_count

from
  contacts c
  inner join users on (users.id = c.user_id)
  inner join contact_types ct on (ct.code = c.contact_type_code)
  left join attachments a on (a.contact_id = c.id)

where
  users.company_id = 1
  and c.contact_type_code in ('BOSS', 'EMP', 'WORK')

group by
  users.id, users.full_name, users.username,
  ct.name, c.lname, c.fname, c.mname, c.email,
  c.address1, c.city, c.state, c.zip_code,
  c.created_date

order by
  users.full_name,
  ct.name, c.lname, c.fname, c.mname,
  c.created_date

EXPLAIN ANALYZE OUTPUT:

        QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=102724.80..122894.80 rows=403400 width=124)
(actual time=30683.912..32431.153 rows=75228 loops=1)
   ->  Sort  (cost=102724.80..103733.30 rows=403400 width=124) (actual
time=30683.869..31236.760 rows=345543 loops=1)
         Sort Key: users.full_name, ct.name, c.lname, c.fname,
c.mname, c.created_date, users.id, users.username, c.email,
c.address1, c.city, c.state, c.zip_code
         Sort Method:  quicksort  Memory: 96578kB
         ->  Merge Right Join  (cost=16571.91..65164.53 rows=403400
width=124) (actual time=1946.609..7523.831 rows=345543 loops=1)
               Merge Cond: (a.contact_id = c.id)
               ->  Index Scan using attachments_contact on attachments
a  (cost=0.00..39729.87 rows=1124700 width=12) (actual
time=0.082..2150.242 rows=1124662 loops=1)
               ->  Sort  (cost=16571.91..16732.30 rows=64157
width=124) (actual time=1946.463..2384.232 rows=345537 loops=1)
                     Sort Key: c.id
                     Sort Method:  quicksort  Memory: 21439kB
                     ->  Hash Join  (cost=553.92..11449.19 rows=64157
width=124) (actual time=81.120..1584.615 rows=75228 loops=1)
                           Hash Cond: ((c.contact_type_code)::text =
(ct.code)::text)
                           ->  Hash Join  (cost=552.81..10565.92
rows=64157 width=121) (actual time=81.063..1285.727 rows=75228
loops=1)
                                 Hash Cond: (c.user_id = users.id)
                                 ->  Seq Scan on contacts c
(cost=0.00..7534.50 rows=122469 width=96) (actual time=0.023..526.259
rows=150077 loops=1)
                                       Filter:
((contact_type_code)::text = ANY ('{BOSS,EMP,WORK}'::text[]))
                                 ->  Hash  (cost=427.00..427.00
rows=10065 width=33) (actual time=80.974..80.974 rows=10065 loops=1)
                                       ->  Seq Scan on users
(cost=0.00..427.00 rows=10065 width=33) (actual time=0.022..37.797
rows=10065 loops=1)
                                             Filter: (company_id = 1)
                           ->  Hash  (cost=1.05..1.05 rows=5 width=12)
(actual time=0.037..0.037 rows=5 loops=1)
                                 ->  Seq Scan on contact_types ct
(cost=0.00..1.05 rows=5 width=12) (actual time=0.018..0.024 rows=5
loops=1)
 Total runtime: 32551.132 ms
(22 rows)


/* HACK QUERY */
select * from (
  select
    users.id, users.full_name, users.username,
    ct.name as type_name,
    c.lname, c.fname, c.mname, c.email,
    c.address1, c.city, c.state, c.zip_code,
    c.created_date,
    count(a.id) as attachment_count

  from
    contacts c
    inner join users on (users.id = c.user_id)
    inner join contact_types ct on (ct.code = c.contact_type_code)
    left join attachments a on (a.contact_id = c.id)

  where
    users.company_id = 1
    and c.contact_type_code in ('BOSS', 'EMP', 'WORK')

  group by
    users.id, users.full_name, users.username,
    ct.name, c.lname, c.fname, c.mname, c.email,
    c.address1, c.city, c.state, c.zip_code,
    c.created_date
) as results
order by
  full_name,
  type_name, lname, fname, mname,
  created_date

/* EXPLAIN ANALYZE OUTPUT */

        QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=420987.80..421996.30 rows=403400 width=1688) (actual
time=12579.266..12663.383 rows=75228 loops=1)
   Sort Key: users.full_name, ct.name, c.lname, c.fname, c.mname, c.created_date
   Sort Method:  quicksort  Memory: 21435kB
   ->  HashAggregate  (cost=79283.53..84326.03 rows=403400 width=124)
(actual time=9546.773..9721.322 rows=75228 loops=1)
         ->  Merge Right Join  (cost=16571.91..65164.53 rows=403400
width=124) (actual time=1857.597..7428.718 rows=345543 loops=1)
               Merge Cond: (a.contact_id = c.id)
               ->  Index Scan using attachments_contact on attachments
a  (cost=0.00..39729.87 rows=1124700 width=12) (actual
time=0.862..2298.099 rows=1124662 loops=1)
               ->  Sort  (cost=16571.91..16732.30 rows=64157
width=124) (actual time=1856.666..2279.646 rows=345537 loops=1)
                     Sort Key: c.id
                     Sort Method:  quicksort  Memory: 21439kB
                     ->  Hash Join  (cost=553.92..11449.19 rows=64157
width=124) (actual time=54.465..1500.539 rows=75228 loops=1)
                           Hash Cond: ((c.contact_type_code)::text =
(ct.code)::text)
                           ->  Hash Join  (cost=552.81..10565.92
rows=64157 width=121) (actual time=54.375..1210.697 rows=75228
loops=1)
                                 Hash Cond: (c.user_id = users.id)
                                 ->  Seq Scan on contacts c
(cost=0.00..7534.50 rows=122469 width=96) (actual time=0.035..492.531
rows=150077 loops=1)
                                       Filter:
((contact_type_code)::text = ANY ('{BOSS,EMP,WORK}'::text[]))
                                 ->  Hash  (cost=427.00..427.00
rows=10065 width=33) (actual time=54.277..54.277 rows=10065 loops=1)
                                       ->  Seq Scan on users
(cost=0.00..427.00 rows=10065 width=33) (actual time=0.032..27.676
rows=10065 loops=1)
                                             Filter: (company_id = 1)
                           ->  Hash  (cost=1.05..1.05 rows=5 width=12)
(actual time=0.055..0.055 rows=5 loops=1)
                                 ->  Seq Scan on contact_types ct
(cost=0.00..1.05 rows=5 width=12) (actual time=0.026..0.035 rows=5
loops=1)
 Total runtime: 12752.540 ms
(22 rows)

Attachment

pgsql-bugs by date:

Previous
From: "Milen A. Radev"
Date:
Subject: Re: BUG #5290: Simple loop with insert into and check to avoid duplicate values fails
Next
From: Tom Lane
Date:
Subject: Re: BUG #5294: Sorts on more than just the order-by clause