Thread: Index help
Hello,

I have a query that looks similar to this:

SELECT appointments.id,
       recur.id AS recur_id,
       recur.limitType,
       recur.limitDate,
       calendars_permission.perm_read,
       calendars_permission.perm_write
FROM appointments LEFT JOIN calendars_permission
       ON appointments.cal_id = calendars_permission.cal_id
       AND calendars_permission.user_id = '1'
     LEFT JOIN recur
       ON appointments.id = recur.appt_id
WHERE appointments.cal_id IN ('82')
  AND appointments.start_date <= '2005-12-31'
  AND appointments.start_date >= '2004-01-01'
  AND appointments.modified >= '2005-01-01';

This query is run a lot, so I'd like to make it as fast as possible.  I
believe my problem is that it's always doing a seq scan of the
appointments table.  I've tried creating multicolumn indexes and such,
but it still does a seq scan.  Switching off seq scans makes it do an
index scan (though not via the multicolumn index), but that's more
costly than the seq scan.

Below is the EXPLAIN ANALYZE output:

---
With seq scan on:

Hash Left Join  (cost=3.13..123.32 rows=742 width=42) (actual time=2.572..64.782 rows=742 loops=1)
  Hash Cond: ("outer".id = "inner".appt_id)
  ->  Hash Left Join  (cost=1.55..117.89 rows=742 width=26) (actual time=0.964..46.467 rows=742 loops=1)
        Hash Cond: ("outer".cal_id = "inner".cal_id)
        ->  Seq Scan on appointments  (cost=0.00..109.62 rows=742 width=24) (actual time=0.365..32.246 rows=742 loops=1)
              Filter: ((cal_id = 82) AND (start_date <= '2005-12-31 00:00:00+00'::timestamp with time zone) AND (start_date >= '2004-01-01 00:00:00+00'::timestamp with time zone) AND (modified >= '2005-01-01 00:00:00+00'::timestamp with time zone))
        ->  Hash  (cost=1.55..1.55 rows=2 width=6) (actual time=0.426..0.426 rows=0 loops=1)
              ->  Seq Scan on calendars_permission  (cost=0.00..1.55 rows=2 width=6) (actual time=0.153..0.396 rows=3 loops=1)
                    Filter: (user_id = 1)
  ->  Hash  (cost=1.46..1.46 rows=46 width=20) (actual time=1.440..1.440 rows=0 loops=1)
        ->  Seq Scan on recur  (cost=0.00..1.46 rows=46 width=20) (actual time=0.100..1.131 rows=46 loops=1)
Total runtime: 68.321 ms

---
With seq scan off:

Hash Left Join  (cost=9.51..166.96 rows=742 width=42) (actual time=11.049..162.821 rows=742 loops=1)
  Hash Cond: ("outer".id = "inner".appt_id)
  ->  Hash Left Join  (cost=4.99..158.60 rows=742 width=26) (actual time=3.806..131.755 rows=742 loops=1)
        Hash Cond: ("outer".cal_id = "inner".cal_id)
        ->  Index Scan using appointments_id_index on appointments  (cost=0.00..146.89 rows=742 width=24) (actual time=1.925..109.835 rows=742 loops=1)
              Filter: ((cal_id = 82) AND (start_date <= '2005-12-31 00:00:00+00'::timestamp with time zone) AND (start_date >= '2004-01-01 00:00:00+00'::timestamp with time zone) AND (modified >= '2005-01-01 00:00:00+00'::timestamp with time zone))
        ->  Hash  (cost=4.99..4.99 rows=2 width=6) (actual time=1.585..1.585 rows=0 loops=1)
              ->  Index Scan using calendars_permission_userid_index on saa_calendars_permission  (cost=0.00..4.99 rows=2 width=6) (actual time=1.333..1.485 rows=3 loops=1)
                    Index Cond: (user_id = 1)
  ->  Hash  (cost=4.40..4.40 rows=46 width=20) (actual time=4.138..4.138 rows=0 loops=1)
        ->  Index Scan using recur_apptid_index on saa_recur  (cost=0.00..4.40 rows=46 width=20) (actual time=1.208..3.565 rows=46 loops=1)
Total runtime: 168.144 ms

If someone could provide some insight into speeding this query up, I
would greatly appreciate it.

Thank you,
-A Gilmore
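(For reference, "switching off seq scans" above refers to the planner
setting enable_seqscan, which can be toggled per session for testing.
A minimal sketch of that comparison, using a simplified form of the
query above; this is a diagnostic knob, not something to leave off in
production:)

    -- Discourage sequential scans for this session only, so the planner
    -- shows its best index-based plan for comparison.
    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT appointments.id
    FROM appointments
    WHERE appointments.cal_id IN ('82')
      AND appointments.start_date BETWEEN '2004-01-01' AND '2005-12-31';
    -- Restore the default before doing anything else.
    SET enable_seqscan = on;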
before the experts chime in... i read that it is fastest to sort your
where clause statements with the one returning the fewest records
first.

using this...

> WHERE appointments.cal_id in ('82')
>   AND appointments.start_date <= '2005-12-31'
>   AND appointments.start_date >= '2004-01-01'
>   AND appointments.modified >= '2005-01-01';

use pgadmin to query your db 4 times, using a different where clause
statement each time.

put the statement that returns the fewest rows first.

then use pgadmin to query your db 3 times, using a different where
clause statement (of the three remaining) each time.

repeat until you have the clauses in order of fewest records returned.

you still have to think it through, though, since the number of records
returned can change over time.  also, you'll want to verify that fewer
records actually corresponds with less query time.

specific to your case,

> AND appointments.modified >= '2005-01-01';

would probably return fewer records than either

> AND appointments.start_date <= '2005-12-31'
> AND appointments.start_date >= '2004-01-01'

and should be placed before them in the where clause (if it does return
fewer records in the production environment).

i can't figure out what

> WHERE appointments.cal_id in ('82')

does, so i can't comment on it.

you also may want to google "sql query optimization."

i hope this information is accurate and i'm interested to hear what the
real experts have to say on the subject.

--- A Gilmore <agilmore@shaw.ca> wrote:
> Hello,
>
> I have a query that looks similar to this:
> [query and EXPLAIN ANALYZE output snipped; see above]
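(A concrete way to run the four per-clause row counts described above;
a minimal sketch using the tables and predicates from the original
query:)

    -- Count how many rows each predicate matches on its own.
    SELECT count(*) FROM appointments WHERE cal_id IN ('82');
    SELECT count(*) FROM appointments WHERE start_date <= '2005-12-31';
    SELECT count(*) FROM appointments WHERE start_date >= '2004-01-01';
    SELECT count(*) FROM appointments WHERE modified   >= '2005-01-01';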
A Gilmore <agilmore@shaw.ca> writes:
> I have a query that looks similar to this:

> SELECT appointments.id,
>        recur.id AS recur_id,
>        recur.limitType,
>        recur.limitDate,
>        calendars_permission.perm_read,
>        calendars_permission.perm_write
> FROM appointments LEFT JOIN calendars_permission
>        ON appointments.cal_id = calendars_permission.cal_id
>        AND calendars_permission.user_id = '1'
>      LEFT JOIN recur
>        ON appointments.id = recur.appt_id
> WHERE appointments.cal_id IN ('82')
>   AND appointments.start_date <= '2005-12-31'
>   AND appointments.start_date >= '2004-01-01'
>   AND appointments.modified >= '2005-01-01';

> This query is run a lot, so I'd like to make it as fast as possible.  I
> believe my problem is that it's always doing a seq scan of the
> appointments table.  I've tried creating multicolumn indexes and such,
> but it still does a seq scan.

Uh, what multicolumn indexes did you try, exactly?

If this is the standard form of the query, I'd think that an index on
(cal_id, start_date, modified) --- in that order --- would be a good
bet.  It's also possible that indexing only (cal_id, start_date), or
even just (cal_id), would be the winner.  With no info about the
statistics of your database, it's hard to tell which.

			regards, tom lane
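(Spelled out, the three candidate indexes described above; the index
names here are illustrative:)

    -- Most specific first: the equality column, then the two range columns.
    CREATE INDEX appointments_cal_start_mod_idx
        ON appointments (cal_id, start_date, modified);
    -- Narrower alternatives worth comparing with EXPLAIN ANALYZE.
    CREATE INDEX appointments_cal_start_idx
        ON appointments (cal_id, start_date);
    CREATE INDEX appointments_cal_idx
        ON appointments (cal_id);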
Tom Lane wrote:
> A Gilmore <agilmore@shaw.ca> writes:
>> I have a query that looks similar to this:
>> [query snipped; see above]
>
>> This query is run a lot, so I'd like to make it as fast as possible.  I
>> believe my problem is that it's always doing a seq scan of the
>> appointments table.  I've tried creating multicolumn indexes and such,
>> but it still does a seq scan.
>
> Uh, what multicolumn indexes did you try, exactly?
>
> If this is the standard form of the query, I'd think that an index on
> (cal_id, start_date, modified) --- in that order --- would be a good
> bet.  It's also possible that indexing only (cal_id, start_date), or
> even just (cal_id), would be the winner.  With no info about the
> statistics of your database, it's hard to tell which.

I tried a multicolumn index on (cal_id, start_date, modified), one on
(cal_id, start_date), and a single-column index on (cal_id); none are
used.

The (cal_id) is a reference to another table with about 30 rows.  I use
the IN (...) form because in practice I'll often want to match several
cal_ids; the list is set up in PHP ahead of the query.

The appointments table only contains about 2500 rows on the test db I'm
working with.  Typically the two (start_date) WHERE clauses are going
to be 3 months apart, and (modified) is a timestamp of row
insert/update.  In the scenario I'm most concerned about, the
(modified) WHERE clause will match everything, and probably should have
been left out of my example for clarity.

- A Gilmore
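(Two things worth checking when indexes go unused on a table this
small; a sketch, noting that on ~2500 rows a sequential scan is often
genuinely cheaper than an index scan, which is consistent with the
timings earlier in the thread:)

    -- Make sure the planner's statistics are current before judging the plan.
    ANALYZE appointments;
    -- Then recheck; the Filter line shows which predicates were applied.
    EXPLAIN ANALYZE
    SELECT id FROM appointments
    WHERE cal_id IN ('82')
      AND start_date BETWEEN '2004-01-01' AND '2005-12-31';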
operationsengineer1@yahoo.com wrote:
> before the experts chime in... i read that it is
> fastest to sort your where clause statements with the
> one returning the fewest records first.
> [snip]
> specific to your case,
>
>> AND appointments.modified >= '2005-01-01';
>
> would probably return fewer records than either
>
>> AND appointments.start_date <= '2005-12-31'
>> AND appointments.start_date >= '2004-01-01'
>
> and should be placed before them in the where clause
> (if it does return fewer records in the production
> environment).

Switching around the two (start_date)s may help in practice; I'll try
that out.  In testing, the only condition that won't match everything
is the (cal_id).

-A Gilmore
i have a notes table that records notes for different areas.  since i
didn't plan on different area notes initially, each area has its own
note table.

i know, bad, bad boy!  i have three data entry pages instead of a
single one and i'm already tired of maintaining multiple pages when it
isn't necessary!

anyway, i want to correct this situation so that i have a single table,
and add an area column to differentiate the area to which each note
belongs.

will pgsql insert and query faster if i use int2 compared to char(2)?
the reason i'm thinking of using char(2) is b/c it makes the table
self-documenting.  for example, "pn" would obviously mean "production"
and "qa" would obviously mean "quality" in the table's context.  having
a 1 and a 2 represent production and quality, respectively, is a lot
less self-documenting.

i would appreciate a few people with experience chiming in and
providing their opinions on this.

as always, thanks for sharing - it is much appreciated.
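(A minimal sketch of the consolidated table being described, under the
char(2) approach; the table and column names are illustrative, and only
two of the three area codes are named in the post:)

    -- Single notes table with an area discriminator column.
    CREATE TABLE t_notes (
        note_id   serial PRIMARY KEY,
        note_area char(2) NOT NULL
            CHECK (note_area IN ('pn', 'qa')),  -- extend with the third area's code
        note_text text NOT NULL
    );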
<operationsengineer1@yahoo.com> writes:
> will pgsql insert and query faster if i use int2
> compared to char(2)?

Yes, but whether the difference will be perceptible in the context of
your application is difficult to guess.  Try testing with some
quick-hack changes to your app.

			regards, tom lane
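(One way to run such a quick-hack test; a sketch assuming psql,
throwaway tables, and a postgres version that has generate_series:)

    -- Throwaway tables for comparing the two column types.
    CREATE TABLE notes_int2 (area int2,    body text);
    CREATE TABLE notes_char (area char(2), body text);

    -- In psql, \timing prints the elapsed time of each statement.
    \timing on
    INSERT INTO notes_int2 SELECT 1,    'x' FROM generate_series(1, 100000);
    INSERT INTO notes_char SELECT 'pn', 'x' FROM generate_series(1, 100000);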
On Thu, 28 Jul 2005 10:33:09 -0700 (PDT), operationsengineer1 wrote
> i have a notes table that records notes for different
> areas.  since i didn't plan on different area notes
> initially, each area has its own note table.
> [snip]
> will pgsql insert and query faster if i use int2
> compared to char(2)?  the reason i'm thinking of using
> char(2) is b/c it makes the table self-documenting.

I went through similar contortions when designing some new tables in
our database.  Basically I was considering emulating an enum type.
Then I thought, why bother?

I calculated the space requirements for each type: int, char(1) and
char(20).  Then I looked at these in the context of the total record
size.  Since I was dealing with a 512-character description, the size
of the identifier was almost irrelevant.

As you point out, the self-documenting issue warrants consideration.
If you use an int, then somewhere you will have to create and maintain
some translation code, be it a SQL CASE or a switch or whatever.  Not
only that, but I would suggest you consider that what is obvious to you
may not be to a developer three generations removed.

Is the space difference between 'pn' and 'production' significant in
the context of the record size?

HTH

Kind Regards,
Keith
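(For illustration, the kind of translation code an int code obliges
every report to carry; the table and column names here are
hypothetical:)

    -- Mapping int area codes back to readable names in a query.
    SELECT note_id,
           CASE area
               WHEN 1 THEN 'production'
               WHEN 2 THEN 'quality'
               ELSE 'unknown'
           END AS area_name
    FROM notes;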
keith and tom, thanks for the responses.

as per a suggestion i received via email only, i think i will be best
served by creating a t_note_area table and linking its primary key,
note_area_id, into the notes table -> t_notes.note_area_id.

that way i get keith's benefit of full nomenclature, and i also don't
limit myself by hardcoding the area names.

i do exactly this for note_contact, but somehow i wasn't thinking
through this situation the way i should have been.

thanks to everyone for getting me back on the right track.
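(A minimal sketch of that lookup-table design, revising the earlier
char(2) sketch; t_note_area, note_area_id and t_notes come from the
post above, while the remaining column names and types are
assumptions:)

    -- Lookup table carrying the full, non-hardcoded area names.
    CREATE TABLE t_note_area (
        note_area_id serial PRIMARY KEY,
        area_name    text NOT NULL UNIQUE  -- e.g. 'production', 'quality'
    );

    -- Notes table referencing the lookup table.
    CREATE TABLE t_notes (
        note_id      serial PRIMARY KEY,
        note_area_id integer NOT NULL REFERENCES t_note_area (note_area_id),
        note_text    text NOT NULL
    );

    -- Readable names come back via a join, with nothing hardcoded:
    SELECT n.note_id, a.area_name
    FROM t_notes n JOIN t_note_area a USING (note_area_id);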