Thread: Slow sub-selects, max and count(*)
I have a procedure ( included below with table definition) to import end of day quotes into a table fsechist.
The issue is with the speed ( or lackof it) that the procedure proceeds. Apparently it is the subselects that are the worst issue and I have tried to replace these. Also max() and count() refuse to use indexes. I have replaced max() with (select ... as mx from ... order by ... desc limit 1) and this is much faster but appear to have to avoid count() as I can not find a way to point this at an index.
Please, I am sure that there is a better way to do this. And 5 to 7 minutes to insert a day is really too slow. Each days import table contains about 3200 records. Total table size for fsechist is about 2.5 million records.
If you could point me in the right direction here it would be much appreciated.
thanks
Richard Sydney-Smith
--------------------------------------------------------------------------------------------------------
CREATE OR REPLACE FUNCTION public.import_sharedata(varchar)
RETURNS bool AS
'
RETURNS bool AS
'
declare
filnam alias for $1;
alldone bool := true;
/* cnt integer := 0; */ /* count of number inserted not used */
filnam alias for $1;
alldone bool := true;
/* cnt integer := 0; */ /* count of number inserted not used */
begin
/*
Imports a text file containing end of day price quotes
filnam should be in double backslash format e.g. c:\\data\\sharequotes\\quotes.dat
Ensures that it only brings in quotes for companies we have listed in our database
and ensures only the latest copy of the quote is kept for each day
/*
Imports a text file containing end of day price quotes
filnam should be in double backslash format e.g. c:\\data\\sharequotes\\quotes.dat
Ensures that it only brings in quotes for companies we have listed in our database
and ensures only the latest copy of the quote is kept for each day
Import table has 7 columns in CSV tab delimited ascii format.
EzyChart format has prices in cents,
METASTOCK in dollars
MetaStock +100 has the volume divided by 100.
Depending on your import data source you may need to process the temp_shareprices so that the
values are in Dollars , not cents and the volume is per unit not per 100 units.
EzyChart format has prices in cents,
METASTOCK in dollars
MetaStock +100 has the volume divided by 100.
Depending on your import data source you may need to process the temp_shareprices so that the
values are in Dollars , not cents and the volume is per unit not per 100 units.
Present calculation is for EzyChart format.
Both tables have to have a 4 digit year yyyymmdd
Warning : If you have the wrong format this procedure will import garbage into your database !!!
Warning : If you have the wrong format this procedure will import garbage into your database !!!
*/
perform drop_table(\'temp_shareprices\');
create table temp_shareprices(
ticker char(10) null,
dte char(8),
v1 numeric(19,4) not null default 0,
v2 numeric(19,4) not null default 0,
v3 numeric(19,4) not null default 0,
v4 numeric(19,4) not null default 0,
v5 numeric(19,4) not null default 0
);
execute \'copy temp_shareprices from \'||quote_literal(filnam)||\' delimiter \'||quote_literal(\',\');
create table temp_shareprices(
ticker char(10) null,
dte char(8),
v1 numeric(19,4) not null default 0,
v2 numeric(19,4) not null default 0,
v3 numeric(19,4) not null default 0,
v4 numeric(19,4) not null default 0,
v5 numeric(19,4) not null default 0
);
execute \'copy temp_shareprices from \'||quote_literal(filnam)||\' delimiter \'||quote_literal(\',\');
-- extend yr to 4 digits, alter quotes to dollars
update temp_shareprices set dte = \'20\'||dte,v1=v1/100,v2=v2/100,v3=v3/100,v4=v4/100;
-- add in any new company tickers
insert into fsecurity (sec_cde) select ticker from temp_shareprices where length(trim(ticker))=3 and ticker not in ( select sec_cde from fsecurity);
insert into fsecurity (sec_cde) select ticker from temp_shareprices where length(trim(ticker))=3 and ticker not in ( select sec_cde from fsecurity);
-- ====================
-- remove rows from temp_shareprices where we do not have a corresponding security
-- using sub-select is slow
-- delete from temp_shareprices where ticker not in (select sec_cde from fsecurity);
-- delete from temp_shareprices where ticker not in (select sec_cde from fsecurity);
-- can just delete ones where length<>3 as insert above ensures all length 3 are inserted. All ASX codes are 3char long
delete from temp_shareprices where length(trim(ticker))<>3;
delete from temp_shareprices where length(trim(ticker))<>3;
-- Attempt to remove without limiting to length and not using sub select does not work
--update temp_shareprices set ticker = s.sec_cde from temp_shareprices t left outer join fsecurity s on t.ticker = s.sec_cde;
--delete from temp_shareprices where ticker is null;
--update temp_shareprices set ticker = s.sec_cde from temp_shareprices t left outer join fsecurity s on t.ticker = s.sec_cde;
--delete from temp_shareprices where ticker is null;
-- ====================
-- put indexes on temp table
CREATE INDEX temphist_tick ON temp_shareprices USING btree (ticker);
CREATE INDEX temphist_dte ON temp_shareprices USING btree (dte);
CREATE INDEX temphist_tick ON temp_shareprices USING btree (ticker);
CREATE INDEX temphist_dte ON temp_shareprices USING btree (dte);
-- ====================
-- Scan the input table and delete as appropriate from fsechist-- Using sub select to remove any pre-existing records in fsechist very slow
-- delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date);
-- delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date);
-- Tried to replace subselect
update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and h.hist_date = t.dte;
delete from fsechist where hist_tick = \'@@\';
update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and h.hist_date = t.dte;
delete from fsechist where hist_tick = \'@@\';
-- ====================
-- insert newvalues
insert into fsechist(hist_tick,hist_date,hist_v1,hist_v2,hist_v3,hist_v4,hist_vol) select ticker,dte,v1,v2,v3,v4,v5 from temp_shareprices;
insert into fsechist(hist_tick,hist_date,hist_v1,hist_v2,hist_v3,hist_v4,hist_vol) select ticker,dte,v1,v2,v3,v4,v5 from temp_shareprices;
-- count the number inserted
-- select count(*) into cnt from temp_shareprices;
-- select count(*) into cnt from temp_shareprices;
-- perform drop_table(\'temp_shareprices\');
return alldone;
end;'
LANGUAGE 'plpgsql' VOLATILE;
LANGUAGE 'plpgsql' VOLATILE;
Note : drop_table is a simple procedure that drops a procedure by name iff it exists.
==============================================
Definition of Fsechist
CREATE TABLE public.fsechist
(
hist_q int4 NOT NULL DEFAULT nextval('hist_q'::text),
hist_v1 numeric(19,4),
hist_v2 numeric(19,4),
hist_v3 numeric(19,4),
hist_v4 numeric(19,4),
hist_vol numeric(19,4),
hist_deltar numeric(19,4),
hist_deltag numeric(19,4),
hist_date char(8),
hist_tick char(10),
hist_announce char(1),
CONSTRAINT fsechist_pkey PRIMARY KEY (hist_q)
) WITH OIDS;
(
hist_q int4 NOT NULL DEFAULT nextval('hist_q'::text),
hist_v1 numeric(19,4),
hist_v2 numeric(19,4),
hist_v3 numeric(19,4),
hist_v4 numeric(19,4),
hist_vol numeric(19,4),
hist_deltar numeric(19,4),
hist_deltag numeric(19,4),
hist_date char(8),
hist_tick char(10),
hist_announce char(1),
CONSTRAINT fsechist_pkey PRIMARY KEY (hist_q)
) WITH OIDS;
CREATE INDEX fsechist_date ON public.fsechist USING btree (hist_date);
CREATE UNIQUE INDEX fsechist_q ON public.fsechist USING btree (hist_q);
CREATE INDEX fsechist_tick ON public.fsechist USING btree (hist_tick);
Richard, > The issue is with the speed ( or lackof it) that the procedure proceeds. Apparently it is the subselects that are the worst issue and I have tried to replace these. Also max() and count() refuse to use indexes. Max() and Count() cannot use indexes for technical reasons. Browse through the archives of SQL, PERFORM, and HACKERS for about 12,000 discussions on the subject. > Please, I am sure that there is a better way to do this. And 5 to 7 minutes to insert a day is really too slow. Each days import table contains about 3200 records. Total table size for fsechist is about 2.5 million records. Standard advice: 1) Where possible, use COPY and not INSERT for bulk imports. 2) Where COPY is not possible, group inserts into 1000-statement blocks and wrap them in a transaction. 3) Where safe, suspend all triggers, foriegn keys, and constraints on the table while inserting and re-apply them afterward. -- -Josh BerkusAglio Database SolutionsSan Francisco
Hi,
I don't think thatyou gave enough information to get any direct help, for example, what are these "sub-selects"?
I often see performance problems arising from procedural code that puts selects that don't use indexes inside loops. That's a plain programming issue, and understanding your system.
Another situation I encountered recently was using "in" sub selects. Recoding them as "exists" or "not exists" as the case may be turned a query that ran for 18hours with no sign of ending (before I killed it) into a query that ran in a couple of minutes.
I always try to avoid procedural code in such batch updates. If it can be encoded into one insert then it is usually better, It just depends on what you are doing.
----- Original Message -----From: Richard Sydney-SmithSent: Thursday, February 05, 2004 10:48 AMSubject: [SQL] Slow sub-selects, max and count(*)I have a procedure ( included below with table definition) to import end of day quotes into a table fsechist.The issue is with the speed ( or lackof it) that the procedure proceeds. Apparently it is the subselects that are the worst issue and I have tried to replace these. Also max() and count() refuse to use indexes. I have replaced max() with (select ... as mx from ... order by ... desc limit 1) and this is much faster but appear to have to avoid count() as I can not find a way to point this at an index.Please, I am sure that there is a better way to do this. And 5 to 7 minutes to insert a day is really too slow. Each days import table contains about 3200 records. Total table size for fsechist is about 2.5 million records.If you could point me in the right direction here it would be much appreciated.thanksRichard Sydney-Smith--------------------------------------------------------------------------------------------------------CREATE OR REPLACE FUNCTION public.import_sharedata(varchar)
RETURNS bool AS
'declare
filnam alias for $1;
alldone bool := true;
/* cnt integer := 0; */ /* count of number inserted not used */begin
/*
Imports a text file containing end of day price quotes
filnam should be in double backslash format e.g. c:\\data\\sharequotes\\quotes.dat
Ensures that it only brings in quotes for companies we have listed in our database
and ensures only the latest copy of the quote is kept for each dayImport table has 7 columns in CSV tab delimited ascii format.
EzyChart format has prices in cents,
METASTOCK in dollars
MetaStock +100 has the volume divided by 100.
Depending on your import data source you may need to process the temp_shareprices so that the
values are in Dollars , not cents and the volume is per unit not per 100 units.Present calculation is for EzyChart format.Both tables have to have a 4 digit year yyyymmdd
Warning : If you have the wrong format this procedure will import garbage into your database !!!*/perform drop_table(\'temp_shareprices\');
create table temp_shareprices(
ticker char(10) null,
dte char(8),
v1 numeric(19,4) not null default 0,
v2 numeric(19,4) not null default 0,
v3 numeric(19,4) not null default 0,
v4 numeric(19,4) not null default 0,
v5 numeric(19,4) not null default 0
);
execute \'copy temp_shareprices from \'||quote_literal(filnam)||\' delimiter \'||quote_literal(\',\');
-- extend yr to 4 digits, alter quotes to dollarsupdate temp_shareprices set dte = \'20\'||dte,v1=v1/100,v2=v2/100,v3=v3/100,v4=v4/100;-- add in any new company tickers
insert into fsecurity (sec_cde) select ticker from temp_shareprices where length(trim(ticker))=3 and ticker not in ( select sec_cde from fsecurity);-- ====================-- remove rows from temp_shareprices where we do not have a corresponding security-- using sub-select is slow
-- delete from temp_shareprices where ticker not in (select sec_cde from fsecurity);-- can just delete ones where length<>3 as insert above ensures all length 3 are inserted. All ASX codes are 3char long
delete from temp_shareprices where length(trim(ticker))<>3;-- Attempt to remove without limiting to length and not using sub select does not work
--update temp_shareprices set ticker = s.sec_cde from temp_shareprices t left outer join fsecurity s on t.ticker = s.sec_cde;
--delete from temp_shareprices where ticker is null;-- ====================-- put indexes on temp table
CREATE INDEX temphist_tick ON temp_shareprices USING btree (ticker);
CREATE INDEX temphist_dte ON temp_shareprices USING btree (dte);-- ====================-- Scan the input table and delete as appropriate from fsechist-- Using sub select to remove any pre-existing records in fsechist very slow
-- delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date);-- Tried to replace subselect
update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and h.hist_date = t.dte;
delete from fsechist where hist_tick = \'@@\';-- ====================-- insert newvalues
insert into fsechist(hist_tick,hist_date,hist_v1,hist_v2,hist_v3,hist_v4,hist_vol) select ticker,dte,v1,v2,v3,v4,v5 from temp_shareprices;-- count the number inserted
-- select count(*) into cnt from temp_shareprices;-- perform drop_table(\'temp_shareprices\');return alldone;end;'
LANGUAGE 'plpgsql' VOLATILE;Note : drop_table is a simple procedure that drops a procedure by name iff it exists.==============================================Definition of FsechistCREATE TABLE public.fsechist
(
hist_q int4 NOT NULL DEFAULT nextval('hist_q'::text),
hist_v1 numeric(19,4),
hist_v2 numeric(19,4),
hist_v3 numeric(19,4),
hist_v4 numeric(19,4),
hist_vol numeric(19,4),
hist_deltar numeric(19,4),
hist_deltag numeric(19,4),
hist_date char(8),
hist_tick char(10),
hist_announce char(1),
CONSTRAINT fsechist_pkey PRIMARY KEY (hist_q)
) WITH OIDS;CREATE INDEX fsechist_date ON public.fsechist USING btree (hist_date);CREATE UNIQUE INDEX fsechist_q ON public.fsechist USING btree (hist_q);CREATE INDEX fsechist_tick ON public.fsechist USING btree (hist_tick);
Thanks Josh and Ian,
narrowing the problem down. The really slow line is the one where I try and remove potential duplicates. It does not look at the indexes.
Point on using copy rather than insert is taken. I use copy to load to a temp file so I can test the data and alter certain values. I suppose I could: copy to temp / alter/ save/ copy into fsechist but that is rather ugly and as it is not the insert that is the real bug bear I would like to fix this first.
Ian you suggested a simular problem was solved using "exists" rather than "in". I am not sure how the use differs. I have tried to include it in option 4 below.
Trial Solutions:
1) delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date);
far too slow
2)
update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and h.hist_date = t.dte;
delete from fsechist where hist_tick = \'@@\';
delete from fsechist where hist_tick = \'@@\';
really , really slow. Thought this would use the indexes but "explain" ( being my friend) still gives the query plan as a sequential scan
3)
-- does not allow insert of a single company data
delete from fsechist where hist_date in (select distinct dte from temp_shareprices);
delete from fsechist where hist_date in (select distinct dte from temp_shareprices);
Works quick enough but has limited functionality. ( does not cope with imports other than a full end of trading)
4)
delete from fsechist where exists(select 1 from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date);
Tried and still took over 60 secs before I cancelled the request.
Indexes
Both tables have indexes defined for tick and date. tick and date of the same data type in both cases.
On Thu, Feb 05, 2004 at 15:53:08 +0800, Richard Sydney-Smith <richard@ibisaustralia.com> wrote: > Trial Solutions: > > 1) delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and dte= hist_date); Don't you want: delete from fsechist where hist_q in (select hist_q from temp_shareprices where hist_tick = ticker and dte = hist_date); You should be able to refer to the table being deleted instead of joining to it again in the subselect. As long as you are using 7.4.x, this should be pretty fast. > > far too slow Since this is slower than the exists solutions, my guess is that you are using a version prior to 7.4. > > 2) > > update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and h.hist_date= t.dte; > delete from fsechist where hist_tick = \'@@\'; > > really , really slow. Thought this would use the indexes but "explain" ( being my friend) still gives the query plan asa sequential scan Again don't add fsechist to the from list. This joins with fsechist twice. > > 3) > > -- does not allow insert of a single company data > delete from fsechist where hist_date in (select distinct dte from temp_shareprices); > > Works quick enough but has limited functionality. ( does not cope with imports other than a full end of trading) I think using the distinct clause is going to slow this down, especially if you are using a version prior to 7.4 since a sort is used instead of a hash table to get unique values. > > 4) > > delete from fsechist where exists(select 1 from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date); > > Tried and still took over 60 secs before I cancelled the request. > Indexes Again, drop fsechist from the exists subselect. > > Both tables have indexes defined for tick and date. tick and date of the same data type in both cases. You can also do a join using the nonstandard implied from syntax. You can do something like: delete from fsechist where hist_tick = temp_shareprices.ticker and temp_shareprices.dte = hist_date;
Richard, > Ian you suggested a simular problem was solved using "exists" rather than "in". I am not sure how the use differs. I have tried to include it in option 4 below. Hmmm ... this piece of advice is dated; with PG 7.4, IN() queries should be plenty fast. If you're using 7.3 or less, though watch out! > 1) delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date); I can't help you with this unless you attribute column names to their tables. Please use full table.column syntax. > update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and h.hist_date = t.dte; Put an ANALYZE fsechist here. > delete from fsechist where hist_tick = \'@@\'; Is there a reason why you are doing this in two steps? > -- does not allow insert of a single company data > delete from fsechist where hist_date in (select distinct dte from temp_shareprices); The DISTINCT is completely unnecessary. > delete from fsechist where exists(select 1 from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date); This is not the same query as #1. -- -Josh BerkusAglio Database SolutionsSan Francisco
Josh Berkus <josh@agliodbs.com> writes: > Max() and Count() cannot use indexes for technical reasons. Browse through > the archives of SQL, PERFORM, and HACKERS for about 12,000 discussions on the > subject. Please don't confuse the issue by throwing Max() and Count() into the same basket. The issue with Min()/Max() is that the result could be generated efficiently by scanning indexes but it's just hard, especially given generic aggregate functions, and the work simply hasn't been done, or even started, yet. The issue with Count() is that people want the result to be cached in a single per-table counter, but that can't be done as simply as that because of transactions. People have discussed complex solutions to this but it's a much more complex problem than it appears. They're really two entirely separate issues. -- greg
Greg, > Please don't confuse the issue by throwing Max() and Count() into the same > basket. When on earth is that post of mine from? Seems like it's several months, if not a couple of years, old. -- -Josh BerkusAglio Database SolutionsSan Francisco