Thread: Slow query on OS X box
I have a table that contains over 13 million rows. This query takes an extremely long time to return. I've vacuum full, analyzed, and re-indexed the table. Still the results are the same. Any ideas?
TIA
Patrick

mdc_oz=# explain analyze select wizard from search_log where wizard
='Keyword' and sdate between '2002-12-01' and '2003-01-15';
                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on search_log  (cost=0.00..609015.34 rows=3305729 width=10)
   (actual time=99833.83..162951.25 rows=3280573 loops=1)
   Filter: ((wizard = 'Keyword'::character varying) AND (sdate >= '2002-12-01'::date) AND (sdate <= '2003-01-15'::date))
 Total runtime: 174713.25 msec
(3 rows)

My box I'm running PG on:
Dual 500 Mac OS X
1 GB RAM
Pg 7.3.0

Conf settings
max_connections = 200
shared_buffers = 15200
#max_fsm_relations = 100        # min 10, fsm is free space map, ~40 bytes
#max_fsm_pages = 10000          # min 1000, fsm is free space map, ~6 bytes
#max_locks_per_transaction = 64 # min 10
#wal_buffers = 8                # min 4, typically 8KB each

CREATE TABLE public.search_log (
    wizard varchar(50) NOT NULL,
    sub_wizard varchar(50),
    timestamp varchar(75),
    department int4,
    gender varchar(25),
    occasion varchar(50),
    age varchar(25),
    product_type varchar(2000),
    price_range varchar(1000),
    brand varchar(2000),
    keyword varchar(1000),
    result_count int4,
    html_count int4,
    fragrance_type varchar(50),
    frag_type varchar(50),
    frag_gender char(1),
    trip_length varchar(25),
    carry_on varchar(25),
    suiter varchar(25),
    expandable varchar(25),
    wheels varchar(25),
    style varchar(1000),
    heel_type varchar(25),
    option varchar(50),
    metal varchar(255),
    gem varchar(255),
    bra_size varchar(25),
    feature1 varchar(50),
    feature2 varchar(50),
    feature3 varchar(50),
    sdate date,
    stimestamp timestamptz,
    file_name text
) WITH OIDS;

CREATE INDEX date_idx ON search_log USING btree (sdate);
CREATE INDEX slog_wizard_idx ON search_log USING btree (wizard);
On Wed, 2003-01-22 at 13:26, Patrick Hatcher wrote:
> I have a table that contains over 13 million rows. This query takes an
> extremely long time to return. I've vacuum full, analyzed, and re-indexed
> the table. Still the results are the same. Any ideas?

Yeah, you're pulling out 3.2 million rows from a (possibly) wide table, byte by byte. Do all of those fields actually have data?

That's always going to take a while -- and I find it hard to believe you're actually doing something with all of those rows that runs regularly. If every one of those rows was maxed out (ignoring the text field at the end) you could have ~15 GB of data to pull out. Without knowing the type of data actually in the table, I'm going to bet it's a hard-drive limitation.

The index on 'wizard' is next to useless, as at least 1/4 of the data in the table is under the same key. You might try a partial index on 'wizard' (skip the value 'Keyword'). It won't help this query, but it'll help ones looking for values other than 'Keyword'.

Anyway, you might try a CURSOR. Fetch rows out 5000 at a time, do some work with them, then grab some more. This -- more or less -- will allow you to process the rows already received while the remaining ones are still being produced by the database. Depending on what you're doing with them, it'll give the disk drive a chance to catch up. If the kernel's smart, it'll read ahead of the scan. This doesn't remove the read time, but hides it while you're transferring the data out (from the db to your client) or processing it.
> mdc_oz=# explain analyze select wizard from search_log where wizard
> ='Keyword' and sdate between '2002-12-01' and '2003-01-15';
>                               QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on search_log  (cost=0.00..609015.34 rows=3305729 width=10)
>    (actual time=99833.83..162951.25 rows=3280573 loops=1)
>    Filter: ((wizard = 'Keyword'::character varying) AND (sdate >= '2002-12-01'::date) AND (sdate <= '2003-01-15'::date))
>  Total runtime: 174713.25 msec
> (3 rows)

--
Rod Taylor <rbt@rbt.ca>
PGP Key: http://www.rbt.ca/rbtpub.asc
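Rod's batched-CURSOR approach can be sketched as follows. This is illustrative, not from the thread: the cursor name is made up, the query and the 5000-row batch size are his, and cursors must run inside a transaction in PostgreSQL.

```sql
BEGIN;

-- Declare a cursor over the same scan; rows are produced as they are fetched.
DECLARE slog_cur CURSOR FOR
    SELECT wizard
    FROM search_log
    WHERE wizard = 'Keyword'
      AND sdate BETWEEN '2002-12-01' AND '2003-01-15';

-- Pull 5000 rows at a time; repeat the FETCH (processing each batch
-- client-side) until it returns no rows.
FETCH 5000 FROM slog_cur;
FETCH 5000 FROM slog_cur;

CLOSE slog_cur;
COMMIT;
```

The sequential scan still happens, but the client overlaps its own processing with the database's reads instead of waiting for the full 3.2-million-row result.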
What about creating a multi-segment index on wizard/sdate?

On a side note: that record is ~8 KB long, which is kinda big. You could split those columns into a separate table (or tables), so that when you want to query, say, gender, department & trip_length, you won't have to read in *so*much* extra data, slowing the query down.

Also, these column sizes seem kind of excessive, and allow bad data to seep into the table:
    timestamp varchar(75),
    age varchar(25),
    metal varchar(255),
    gem varchar(255),
    bra_size varchar(25),

On Wed, 2003-01-22 at 12:26, Patrick Hatcher wrote:
> I have a table that contains over 13 million rows. This query takes an
> extremely long time to return. I've vacuum full, analyzed, and re-indexed
> the table. Still the results are the same. Any ideas?
> [EXPLAIN ANALYZE output, conf settings, and table definition quoted above trimmed]

--
+---------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net          |
| Jefferson, LA  USA      http://members.cox.net/ron.l.johnson  |
|                                                               |
| "My advice to you is to get married: If you find a good wife, |
| you will be happy; if not, you will become a philosopher."    |
|    Socrates                                                   |
+---------------------------------------------------------------+
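Ron's multi-segment (multi-column) index suggestion would look something like this sketch; the index name is illustrative:

```sql
-- One index covering both predicates. With the leading column (wizard)
-- constrained by equality, the sdate range can be scanned within it.
CREATE INDEX slog_wizard_sdate_idx
    ON search_log USING btree (wizard, sdate);
```

Whether the planner would actually prefer it over a sequential scan when the result is 3.2 million rows is a separate question, as later replies in the thread point out.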
Sorry I'm being really dense today. I didn't even notice the 3.2 million rows being returned. :(

To answer your question, no, all fields would not have data. The data we receive is from a Web log file. It's parsed and then uploaded to this table.

I guess the bigger issue is that when trying to do aggregates, grouping by the wizard field, it takes just as long.

Ex:
mdc_oz=# explain analyze select wizard,count(wizard) from search_log where
sdate between '2002-12-01' and '2003-01-15' group by wizard;
                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1083300.43..1112411.55 rows=388148 width=10) (actual time=229503.85..302617.75 rows=14 loops=1)
   ->  Group  (cost=1083300.43..1102707.84 rows=3881482 width=10) (actual time=229503.60..286014.83 rows=3717161 loops=1)
         ->  Sort  (cost=1083300.43..1093004.14 rows=3881482 width=10) (actual time=229503.57..248415.81 rows=3717161 loops=1)
               Sort Key: wizard
               ->  Seq Scan on search_log  (cost=0.00..575217.57 rows=3881482 width=10) (actual time=91235.76..157559.58 rows=3717161 loops=1)
                     Filter: ((sdate >= '2002-12-01'::date) AND (sdate <= '2003-01-15'::date))
 Total runtime: 302712.48 msec
(7 rows)

Thanks again for the help
Patrick Hatcher

Rod Taylor <rbt@rbt.ca> wrote on 01/22/2003 12:02 PM, Re: [PERFORM] Slow query on OS X box:
[Rod's reply quoted in full above trimmed]
Patrick Hatcher wrote:
> I have a table that contains over 13 million rows. This query takes an
> extremely long time to return. I've vacuum full, analyzed, and re-indexed
> the table. Still the results are the same. Any ideas?
> [EXPLAIN ANALYZE output, conf settings, and table definition quoted above trimmed]
> CREATE INDEX date_idx ON search_log USING btree (sdate);
> CREATE INDEX slog_wizard_idx ON search_log USING btree (wizard);

Did you try changing these two indexes into one?

CREATE INDEX date_wizard_idx ON search_log USING btree (wizard, sdate);

How selective are these fields?
- If you ask about wizard = 'Keyword', is the answer 0.1%, 5%, or 50% of rows?
- If you ask about (sdate >= '2002-12-01'::date) AND (sdate <= '2003-01-15'::date), what is the answer?

Consider creating a table "wizards", and changing the field "wizard" in table "search_log" into an integer field "wizardid". Searching by integer is faster than by varchar.

Regards,
Tomasz Myrta
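Tomasz's lookup-table idea could be sketched like this. The names `wizards` and `wizardid` follow his suggestion, but the exact DDL is illustrative, and migrating the existing 13-million-row table would need a backfill UPDATE not shown here:

```sql
-- Small dimension table holding each distinct wizard name once.
CREATE TABLE wizards (
    wizardid serial PRIMARY KEY,
    name     varchar(50) NOT NULL UNIQUE
);

-- search_log then carries a 4-byte integer instead of a varchar(50):
-- ALTER TABLE search_log ADD COLUMN wizardid int4 REFERENCES wizards;

-- Queries compare integers and join back only when the name is needed:
SELECT s.sdate
FROM search_log s
JOIN wizards w ON w.wizardid = s.wizardid
WHERE w.name = 'Keyword'
  AND s.sdate BETWEEN '2002-12-01' AND '2003-01-15';
```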
"Patrick Hatcher" <PHatcher@macys.com> writes:
> I have a table that contains over 13 million rows. This query takes an
> extremely long time to return. I've vacuum full, analyzed, and re-indexed
> the table. Still the results are the same. Any ideas?

> mdc_oz=# explain analyze select wizard from search_log where wizard
> ='Keyword' and sdate between '2002-12-01' and '2003-01-15';
>  Seq Scan on search_log  (cost=0.00..609015.34 rows=3305729 width=10)
>    (actual time=99833.83..162951.25 rows=3280573 loops=1)
>    Filter: ((wizard = 'Keyword'::character varying) AND (sdate >= '2002-12-01'::date) AND (sdate <= '2003-01-15'::date))
>  Total runtime: 174713.25 msec
> (3 rows)

This query is selecting 3280573 rows out of your 13 million. I'd say the machine is doing the best it can. Returning 19000 rows per second is not all that shabby.

Perhaps you should rethink what you're doing. Do you actually need to return 3 million rows to the client?

			regards, tom lane
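Following Tom's point: if the client only needs totals rather than the rows themselves, the counting can be pushed into the database. A minimal sketch, using the thread's own predicates:

```sql
-- Returns one row instead of shipping 3.2 million rows to the client;
-- the sequential scan still runs, but the transfer cost disappears.
SELECT count(*)
FROM search_log
WHERE wizard = 'Keyword'
  AND sdate BETWEEN '2002-12-01' AND '2003-01-15';
```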
Yup, since you still need to pull everything off the disk (the slowest part), which is quite a bit of data. You're simply dealing with a lot of data for a single query -- not much you can do.

Is this a dedicated box -- one client doing big selects like this? Knock your shared_buffers down to about 2000, and bump your sort_mem up to around 32MB (128MB or so if it's a dedicated box where the vast majority of queries look like the one below).

Okay, next you need to do something about the rest of the data. 13 million * 2k is a big number. Do you have a set of columns that are rarely used? If so, toss them into a separate table and link via a unique identifier (int4). It'll cost extra when you do hit them, but pulling a few of the information-wise large ones out of the main table would buy quite a bit.

Now, wizard. For that particular query it would be best if entries were made for all the values of wizard in a lookup table, and search_log.wizard were changed into a reference to that entry in the lookup. Index the lookup table well (one index on the int4 primary key, and a unique index on the 'wizard' varchar). Group by the number, and join to the lookup table for the name.

Any other columns with highly repetitive data? You might want to consider doing the same for them.

In search_log, index the numeric representation of 'wizard' (the key from the lookup table), but don't bother indexing values that occur frequently -- look up how to create a partial index. I.e. the value 'Keyword' could be skipped, as it occurs in roughly one of every four tuples -- too often for an index to be useful.

On Wed, 2003-01-22 at 15:49, Patrick Hatcher wrote:
> Sorry I'm being really dense today. I didn't even notice the 3.2 million
> rows being returned. :(
>
> To answer your question, no, all fields would not have data. The data we
> receive is from a Web log file. It's parsed and then uploaded to this
> table.
>
> I guess the bigger issue is that when trying to do aggregates, grouping by
> the wizard field, it takes just as long.
> [GROUP BY plan and earlier quotes trimmed]

--
Rod Taylor <rbt@rbt.ca>
PGP Key: http://www.rbt.ca/rbtpub.asc
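The partial index Rod describes -- one that simply leaves out the overwhelmingly common value -- might look like this (the index name is illustrative):

```sql
-- Skips 'Keyword' (roughly 1 row in 4), keeping the index smaller
-- and useful for queries on the rarer wizard values.
CREATE INDEX slog_wizard_rare_idx
    ON search_log (wizard)
    WHERE wizard <> 'Keyword';
```

As Rod notes for his original suggestion, this won't help the 'Keyword' query itself; the planner can only use the index for queries whose WHERE clause implies the index predicate.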
Thanks everyone. I'll give your suggestions a try and report back.

Patrick Hatcher
Macys.Com
Legacy Integration Developer
415-422-1610 office
HatcherPT - AIM

Rod Taylor <rbt@rbt.ca> wrote on 01/22/2003 02:54 PM, Re: [PERFORM] Slow query on OS X box:
[Rod's reply and the earlier quoted messages, reproduced in full above, trimmed]