Thread: Index help


From
A Gilmore
Date:
Hello,

I have a query that looks similar to this:

SELECT appointments.id,
    recur.id AS recur_id,
    recur.limitType,
    recur.limitDate,
    calendars_permission.perm_read,
    calendars_permission.perm_write
FROM appointments LEFT JOIN calendars_permission
    ON appointments.cal_id = calendars_permission.cal_id
    AND calendars_permission.user_id = '1'
    LEFT JOIN recur
    ON appointments.id = recur.appt_id
WHERE appointments.cal_id in ('82')
    AND appointments.start_date <= '2005-12-31'
    AND appointments.start_date >= '2004-01-01'
    AND appointments.modified >= '2005-01-01';

This query is run a lot so I'd like to make it as fast as possible.  I
believe my problem is that it's always doing a seq scan of the
appointments table; I've tried creating multicolumn indexes and such but
it still does a seq scan.  Switching off seq scans will have it do an
index scan (but not the multicolumn index), but it's more costly than the
seq scan.  Below is the explain analyze output:

---
With seq scan on :

Hash Left Join  (cost=3.13..123.32 rows=742 width=42) (actual time=2.572..64.782 rows=742 loops=1)
    Hash Cond: ("outer".id = "inner".appt_id)
    ->  Hash Left Join  (cost=1.55..117.89 rows=742 width=26) (actual time=0.964..46.467 rows=742 loops=1)
          Hash Cond: ("outer".cal_id = "inner".cal_id)
          ->  Seq Scan on appointments  (cost=0.00..109.62 rows=742 width=24) (actual time=0.365..32.246 rows=742 loops=1)
                Filter: ((cal_id = 82) AND (start_date <= '2005-12-31 00:00:00+00'::timestamp with time zone) AND (start_date >= '2004-01-01 00:00:00+00'::timestamp with time zone) AND (modified >= '2005-01-01 00:00:00+00'::timestamp with time zone))
          ->  Hash  (cost=1.55..1.55 rows=2 width=6) (actual time=0.426..0.426 rows=0 loops=1)
                ->  Seq Scan on calendars_permission  (cost=0.00..1.55 rows=2 width=6) (actual time=0.153..0.396 rows=3 loops=1)
                      Filter: (user_id = 1)
    ->  Hash  (cost=1.46..1.46 rows=46 width=20) (actual time=1.440..1.440 rows=0 loops=1)
          ->  Seq Scan on recur  (cost=0.00..1.46 rows=46 width=20) (actual time=0.100..1.131 rows=46 loops=1)
  Total runtime: 68.321 ms

---
With seq scan off :

Hash Left Join  (cost=9.51..166.96 rows=742 width=42) (actual time=11.049..162.821 rows=742 loops=1)
    Hash Cond: ("outer".id = "inner".appt_id)
    ->  Hash Left Join  (cost=4.99..158.60 rows=742 width=26) (actual time=3.806..131.755 rows=742 loops=1)
          Hash Cond: ("outer".cal_id = "inner".cal_id)
          ->  Index Scan using appointments_id_index on appointments  (cost=0.00..146.89 rows=742 width=24) (actual time=1.925..109.835 rows=742 loops=1)
                Filter: ((cal_id = 82) AND (start_date <= '2005-12-31 00:00:00+00'::timestamp with time zone) AND (start_date >= '2004-01-01 00:00:00+00'::timestamp with time zone) AND (modified >= '2005-01-01 00:00:00+00'::timestamp with time zone))
          ->  Hash  (cost=4.99..4.99 rows=2 width=6) (actual time=1.585..1.585 rows=0 loops=1)
                ->  Index Scan using calendars_permission_userid_index on saa_calendars_permission  (cost=0.00..4.99 rows=2 width=6) (actual time=1.333..1.485 rows=3 loops=1)
                      Index Cond: (user_id = 1)
    ->  Hash  (cost=4.40..4.40 rows=46 width=20) (actual time=4.138..4.138 rows=0 loops=1)
          ->  Index Scan using recur_apptid_index on saa_recur  (cost=0.00..4.40 rows=46 width=20) (actual time=1.208..3.565 rows=46 loops=1)
  Total runtime: 168.144 ms


If someone could provide some insight into speeding this query up I
would greatly appreciate it.

Thank you,
-A Gilmore

Re: Index help

From
Date:
before the experts chime in...  i read that it is
fastest to sort your where clause conditions with the
fewest records returned first.

using this...

> WHERE appointments.cal_id in ('82')
>     AND appointments.start_date <= '2005-12-31'
>     AND appointments.start_date >= '2004-01-01'
>     AND appointments.modified >= '2005-01-01';

use pgadmin to query your db 4 times - using a
different where clause statement each time.

put the statement that returns the fewest
rows first.

then use pgadmin to query your db 3 times - using a
different where clause statement (of three remaining)
each time.

repeat until you have the clauses in order of the
fewest records returned.

you still have to think through it, though, since
records returned can change over time.  also, you'll
want to verify that fewer records actually corresponds
with less query time.
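
for example, you could count each condition separately with something
like this (table and column names taken from the query above):

```sql
-- count how many rows each where condition matches on its own;
-- the condition with the smallest count is the most selective
SELECT count(*) FROM appointments WHERE cal_id IN ('82');
SELECT count(*) FROM appointments WHERE start_date <= '2005-12-31';
SELECT count(*) FROM appointments WHERE start_date >= '2004-01-01';
SELECT count(*) FROM appointments WHERE modified   >= '2005-01-01';
```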

specific to your case,

>     AND appointments.modified >= '2005-01-01';

would probably return fewer records than either

>     AND appointments.start_date <= '2005-12-31'
>     AND appointments.start_date >= '2004-01-01'

and should be placed before them in the where clause
(if it does return fewer records in the production
environment).

i can't figure out what

> WHERE appointments.cal_id in ('82')

does so i can't comment on it.

you also may want to google "sql query optimization."

i hope this information is accurate and i'm interested
to hear what the real experts have to say on the
subject.




Re: Index help

From
Tom Lane
Date:
A Gilmore <agilmore@shaw.ca> writes:
> I have a query that looks similar to this:

> SELECT appointments.id,
>     recur.id AS recur_id,
>     recur.limitType,
>     recur.limitDate,
>     calendars_permission.perm_read,
>     calendars_permission.perm_write
> FROM appointments LEFT JOIN calendars_permission
>     ON appointments.cal_id = calendars_permission.cal_id
>     AND calendars_permission.user_id = '1'
>     LEFT JOIN recur
>     ON appointments.id = recur.appt_id
> WHERE appointments.cal_id in ('82')
>     AND appointments.start_date <= '2005-12-31'
>     AND appointments.start_date >= '2004-01-01'
>     AND appointments.modified >= '2005-01-01';

> This query is run a lot so I'd like to make it as fast as possible.  I
> believe my problem is that it's always doing a seq scan of the
> appointments table; I've tried creating multicolumn indexes and such but
> it still does a seq scan.

Uh, what multicolumn indexes did you try, exactly?

If this is the standard form of the query, I'd think that an index on
(cal_id, start_date, modified) --- in that order --- would be a good
bet.  It's also possible that indexing only (cal_id, start_date), or
even just (cal_id), would be the winner.  With no info about the
statistics of your database, it's hard to tell which.
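
For reference, a sketch of the suggested three-column index (the index
name here is illustrative, not from the thread):

```sql
-- equality column first, then the range columns, per the suggestion above
CREATE INDEX appointments_cal_start_mod_idx
    ON appointments (cal_id, start_date, modified);

-- refresh planner statistics afterwards so the new index gets costed fairly
ANALYZE appointments;
```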

            regards, tom lane

Re: Index help

From
A Gilmore
Date:
Tom Lane wrote:
> A Gilmore <agilmore@shaw.ca> writes:
>
>>I have a query that looks similar to this:
>
>
>>SELECT appointments.id,
>>    recur.id AS recur_id,
>>    recur.limitType,
>>    recur.limitDate,
>>    calendars_permission.perm_read,
>>    calendars_permission.perm_write
>>FROM appointments LEFT JOIN calendars_permission
>>    ON appointments.cal_id = calendars_permission.cal_id
>>    AND calendars_permission.user_id = '1'
>>    LEFT JOIN recur
>>    ON appointments.id = recur.appt_id
>>WHERE appointments.cal_id in ('82')
>>    AND appointments.start_date <= '2005-12-31'
>>    AND appointments.start_date >= '2004-01-01'
>>    AND appointments.modified >= '2005-01-01';
>
>
>>This query is run a lot so I'd like to make it as fast as possible.  I
>>believe my problem is that it's always doing a seq scan of the
>>appointments table; I've tried creating multicolumn indexes and such but
>>it still does a seq scan.
>
>
> Uh, what multicolumn indexes did you try, exactly?
>
> If this is the standard form of the query, I'd think that an index on
> (cal_id, start_date, modified) --- in that order --- would be a good
> bet.  It's also possible that indexing only (cal_id, start_date), or
> even just (cal_id), would be the winner.  With no info about the
> statistics of your database, it's hard to tell which.
>

I tried a multicolumn index on (cal_id, start_date, modified), one on
(cal_id, start_date), and a single-column index on (cal_id); none are used.
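
For reference, creating those indexes would look something like this
(the index names are illustrative):

```sql
CREATE INDEX appt_cal_start_mod_idx ON appointments (cal_id, start_date, modified);
CREATE INDEX appt_cal_start_idx     ON appointments (cal_id, start_date);
CREATE INDEX appt_cal_idx           ON appointments (cal_id);
```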

The (cal_id) is a reference to another table with about 30 rows.  I use
the IN () form because in practice I'll often want to match several
cal_ids; this is set up in php ahead of the query.

The (appointments) table only contains about 2500 rows on the test db I'm
working with.  Typically the two (start_date) WHERE clauses are going to
be 3 months apart, and (modified) is a timestamp of row insert/update.
In the scenario I'm most concerned about, the (modified) WHERE clause
will match everything, and probably should have been left out of my
example for clarity.

- A Gilmore

Re: Index help

From
A Gilmore
Date:
operationsengineer1@yahoo.com wrote:
> before the experts chime in...  i read that it is
> fastest to sort your where clause conditions with the
> fewest records returned first.
>
> using this...
>
>
>>WHERE appointments.cal_id in ('82')
>>    AND appointments.start_date <= '2005-12-31'
>>    AND appointments.start_date >= '2004-01-01'
>>    AND appointments.modified >= '2005-01-01';
>
>
> use pgadmin to query your db 4 times - using a
> different where clause statement each time.
>
> put the statement that returns the fewest
> rows first.
>
> then use pgadmin to query your db 3 times - using a
> different where clause statement (of three remaining)
> each time.
>
> repeat until you have the clauses in order of the
> fewest records returned.
>
> you still have to think through it, though, since
> records returned can change over time.  also, you'll
> want to verify that fewer records actually corresponds
> with less query time.
>
> specific to your case,
>
>
>>    AND appointments.modified >= '2005-01-01';
>
>
> would probably return fewer records than either
>
>
>>    AND appointments.start_date <= '2005-12-31'
>>    AND appointments.start_date >= '2004-01-01'
>
>
> and should be placed before them in the where clause
> (if it does return fewer records in the production
> environment).
>
> i can't figure out what
>
>
>>WHERE appointments.cal_id in ('82')
>
>
> does so i can't comment on it.
>
> you also may want to google "sql query optimization."
>
> i hope this information is accurate and i'm interested
> to hear what the real experts have to say on the
> subject.
>
>

Switching around the two (start_date)s may help in practice; I'll try
that out.  In testing, the only condition that won't match everything is
the (cal_id).

-A Gilmore


Table Design Issue & PGSQL Performance

From
Date:
i have a notes table that records notes for different
areas.  since i didn't plan on different area notes
initially, each area has its own note table.

i know, bad, bad boy!  i have three data entry pages
instead of a single one and i'm already tired of
maintaining multiple pages when it isn't necessary!

anyway, i want to correct this situation so that i
have a single table and add an area column to
differentiate the area to which the note belongs.

will pgsql insert and query faster if i use int2
compared to char(2)?  the reason i'm thinking of using
char(2) is b/c it makes the table self documenting.
for example, "pn" would obviously mean "production"
and "qa" would obviously mean "quality" in the
table's context.  having a 1 and a 2 represent
production and quality, respectively, is a lot less
self documenting.

i would appreciate a few people with experience
chiming in and providing their opinion on this.

as always, thanks for sharing - it is much appreciated.




Re: Table Design Issue & PGSQL Performance

From
Tom Lane
Date:
<operationsengineer1@yahoo.com> writes:
> will pgsql insert and query faster if i use int2
> compared to char(2)?

Yes, but whether the difference will be perceptible in the context of
your application is difficult to guess.  Try testing with some
quick-hack changes to your app.

            regards, tom lane

Re: Table Design Issue & PGSQL Performance

From
"Keith Worthington"
Date:
On Thu, 28 Jul 2005 10:33:09 -0700 (PDT), operationsengineer1 wrote
> i have a notes table that records notes for different
> areas.  since i didn't plan on different area notes
> initially, each area has its own note table.
>
> i know, bad, bad boy!  i have three data entry pages
> instead of a single one and i'm already tired of
> maintaining multiple pages when it isn't necessary!
>
> anyway, i want to correct this situation so that i
> have a single table and add an area column to
> differentiate the area to which the note belongs.
>
> will pgsql insert and query faster if i use int2
> compared to char(2)?  the reason i'm thinking of using
> char(2) is b/c it makes the table self documenting.
> for example, "pn" would obviously mean "production"
> and "qa" would obviously mean "quality" in the
> table's context.  having a 1 and a 2 represent
> production and quality, respectively, is a lot less
> self documenting.
>
> i would appreciate a few people with experience
> chiming in and providing their opinion on this.
>
> as always, thanks for sharing - it is much appreciated.

I went through similar contortions when designing some new tables in our
database.  Basically I was considering emulating an enum type.  Then I thought:
why bother?  I calculated the space requirements for each type: int, char(1)
and char(20).  Then I looked at these in the context of the total record size.
Since I was dealing with a 512-character description, the size of the
identifier was almost irrelevant.  As you point out, the self-documenting issue
warrants consideration.  If you use an int then somewhere you will have to
create and maintain some translation code, be it a SQL CASE or a switch or
whatever.  Not only that, but I would suggest you consider that what is obvious
to you may not be to someone three generations of developers removed.  Is the
space difference between 'pn' and 'production' significant in the context of
the record size?
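
A quick way to see the per-value sizes for yourself, assuming a
PostgreSQL version that provides pg_column_size():

```sql
-- on-disk size (in bytes) of a single value of each candidate type
SELECT pg_column_size(1::int2)                 AS int2_bytes,
       pg_column_size('pn'::char(2))           AS char2_bytes,
       pg_column_size('production'::char(20))  AS char20_bytes;
```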

HTH

Kind Regards,
Keith

Re: Table Design Issue & PGSQL Performance

From
Date:
keith and tom, thanks for the responses.  as per a
suggestion i received via email only, i think i will
be best served by creating a t_note_area table and
linking its primary key, note_area_id into the notes
table -> t_notes.note_area_id.

that way i get keith's benefit of full nomenclature
and i also don't have to limit myself by hardcoding
the area names.
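
a sketch of that layout (the column types and the area_name column are
assumptions, not from the thread):

```sql
-- lookup table holding the full area names
CREATE TABLE t_note_area (
    note_area_id serial PRIMARY KEY,
    area_name    varchar(50) NOT NULL   -- e.g. 'production', 'quality'
);

-- notes reference an area instead of hardcoding it
CREATE TABLE t_notes (
    note_id      serial PRIMARY KEY,
    note_area_id integer NOT NULL REFERENCES t_note_area (note_area_id),
    note_text    text
);
```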

i do exactly this for note_contact, but somehow i
wasn't thinking through this situation the way i
should have been.

thanks to everyone for getting me back on the right track.


