Thread: Slow sub-selects, max and count(*)

Slow sub-selects, max and count(*)

From

"Richard Sydney-Smith"

Date:

04 February 2004, 21:46:16

I have a procedure ( included below with table definition) to import end of day quotes into a table fsechist.

The issue is with the speed ( or lackof it) that the procedure proceeds. Apparently it is the subselects that are the worst issue and I have tried to replace these. Also max() and count() refuse to use indexes. I have replaced max() with (select ... as mx from ... order by ... desc limit 1) and this is much faster but appear to have to avoid count() as I can not find a way to point this at an index.

Please, I am sure that there is a better way to do this. And 5 to 7 minutes to insert a day is really too slow. Each days import table contains about 3200 records. Total table size for fsechist is about 2.5 million records.

If you could point me in the right direction here it would be much appreciated.

thanks

Richard Sydney-Smith

--------------------------------------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION public.import_sharedata(varchar)
RETURNS bool AS
'

declare
   filnam alias for $1;
   alldone bool := true;
   /* cnt integer := 0; */ /* count of number inserted not used */

begin
/*
   Imports a text file containing end of day price quotes
   filnam should be in double backslash format e.g. c:\\data\\sharequotes\\quotes.dat

   Ensures that it only brings in quotes for companies we have listed in our database
   and ensures only the latest copy of the quote is kept for each day

   Import table has 7 columns in CSV tab delimited ascii format.
      EzyChart format has prices in cents,
      METASTOCK in dollars
      MetaStock +100 has the volume divided by 100.

   Depending on your import data source you may need to process the temp_shareprices so that the
   values are in Dollars , not cents and the volume is per unit not per 100 units.

Present calculation is for EzyChart format.

Both tables have to have a 4 digit year yyyymmdd

Warning : If you have the wrong format this procedure will import garbage into your database !!!

perform drop_table(\'temp_shareprices\');

create table temp_shareprices(
    ticker char(10) null,
    dte char(8),
    v1 numeric(19,4) not null default 0,
    v2 numeric(19,4) not null default 0,
    v3 numeric(19,4) not null default 0,
    v4 numeric(19,4) not null default 0,
    v5 numeric(19,4) not null default 0
    );
execute \'copy temp_shareprices from \'||quote_literal(filnam)||\' delimiter \'||quote_literal(\',\');

-- extend yr to 4 digits, alter quotes to dollars

update temp_shareprices set dte = \'20\'||dte,v1=v1/100,v2=v2/100,v3=v3/100,v4=v4/100;

-- add in any new company tickers
insert into fsecurity (sec_cde) select ticker from temp_shareprices where length(trim(ticker))=3 and ticker not in ( select sec_cde from fsecurity);

-- ====================

-- remove rows from temp_shareprices where we do not have a corresponding security

-- using sub-select is slow
-- delete from temp_shareprices where ticker not in (select sec_cde from fsecurity);

-- can just delete ones where length<>3 as insert above ensures all length 3 are inserted. All ASX codes are 3char long
delete from temp_shareprices where length(trim(ticker))<>3;

-- Attempt to remove without limiting to length and not using sub select does not work
--update temp_shareprices set ticker = s.sec_cde from temp_shareprices t left outer join fsecurity s on t.ticker = s.sec_cde;
--delete from temp_shareprices where ticker is null;

-- ====================

-- put indexes on temp table
CREATE INDEX temphist_tick ON temp_shareprices USING btree (ticker);
CREATE INDEX temphist_dte ON temp_shareprices USING btree (dte);

-- ====================

-- Scan the input table and delete as appropriate from fsechist

-- Using sub select to remove any pre-existing records in fsechist very slow
-- delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date);

-- Tried to replace subselect
update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and h.hist_date = t.dte;
delete from fsechist where hist_tick = \'@@\';

-- ====================

-- insert newvalues
insert into fsechist(hist_tick,hist_date,hist_v1,hist_v2,hist_v3,hist_v4,hist_vol) select ticker,dte,v1,v2,v3,v4,v5 from temp_shareprices;

-- count the number inserted
-- select count(*) into cnt from temp_shareprices;

-- perform drop_table(\'temp_shareprices\');

return alldone;

end;'
LANGUAGE 'plpgsql' VOLATILE;

Note : drop_table is a simple procedure that drops a procedure by name iff it exists.

==============================================

Definition of Fsechist

CREATE TABLE public.fsechist
(
hist_q int4 NOT NULL DEFAULT nextval('hist_q'::text),
hist_v1 numeric(19,4),
hist_v2 numeric(19,4),
hist_v3 numeric(19,4),
hist_v4 numeric(19,4),
hist_vol numeric(19,4),
hist_deltar numeric(19,4),
hist_deltag numeric(19,4),
hist_date char(8),
hist_tick char(10),
hist_announce char(1),
CONSTRAINT fsechist_pkey PRIMARY KEY (hist_q)
) WITH OIDS;

CREATE INDEX fsechist_date ON public.fsechist USING btree (hist_date);

CREATE UNIQUE INDEX fsechist_q ON public.fsechist USING btree (hist_q);

CREATE INDEX fsechist_tick ON public.fsechist USING btree (hist_tick);

Re: Slow sub-selects, max and count(*)

From

Josh Berkus

Date:

05 February 2004, 01:13:58

Richard,

> The issue is with the speed ( or lackof it) that the procedure proceeds. 
Apparently it is the subselects that are the worst issue and I have tried to 
replace these. Also max() and count() refuse to use indexes. 

Max() and Count() cannot use indexes for technical reasons.   Browse through 
the archives of SQL, PERFORM, and HACKERS for about 12,000 discussions on the 
subject.

> Please, I am sure that there is a better way to do this. And 5 to 7 minutes 
to insert a day is really too slow. Each days import table contains about 
3200 records. Total table size for fsechist is about 2.5 million records.

Standard advice:
1) Where possible, use COPY and not INSERT for bulk imports.
2) Where COPY is not possible, group inserts into 1000-statement blocks and 
wrap them in a transaction.
3) Where safe, suspend all triggers, foriegn keys, and constraints on the 
table while inserting and re-apply them afterward.

-- 
-Josh BerkusAglio Database SolutionsSan Francisco

Re: Slow sub-selects, max and count(*)

From

"Iain"

Date:

05 February 2004, 02:42:12

Hi,

I don't think thatyou gave enough information to get any direct help, for example, what are these "sub-selects"?

I often see performance problems arising from procedural code that puts selects that don't use indexes inside loops. That's a plain programming issue, and understanding your system.

Another situation I encountered recently was using "in" sub selects. Recoding them as "exists" or "not exists" as the case may be turned a query that ran for 18hours with no sign of ending (before I killed it) into a query that ran in a couple of minutes.

I always try to avoid procedural code in such batch updates. If it can be encoded into one insert then it is usually better, It just depends on what you are doing.

----- Original Message -----
From: Richard Sydney-Smith
To: pgsql-sql@postgresql.org
Sent: Thursday, February 05, 2004 10:48 AM
Subject: [SQL] Slow sub-selects, max and count(*)

I have a procedure ( included below with table definition) to import end of day quotes into a table fsechist.

The issue is with the speed ( or lackof it) that the procedure proceeds. Apparently it is the subselects that are the worst issue and I have tried to replace these. Also max() and count() refuse to use indexes. I have replaced max() with (select ... as mx from ... order by ... desc limit 1) and this is much faster but appear to have to avoid count() as I can not find a way to point this at an index.

Please, I am sure that there is a better way to do this. And 5 to 7 minutes to insert a day is really too slow. Each days import table contains about 3200 records. Total table size for fsechist is about 2.5 million records.

If you could point me in the right direction here it would be much appreciated.

thanks

Richard Sydney-Smith

--------------------------------------------------------------------------------------------------------

CREATE OR REPLACE FUNCTION public.import_sharedata(varchar)
RETURNS bool AS
'

declare
   filnam alias for $1;
   alldone bool := true;
   /* cnt integer := 0; */ /* count of number inserted not used */

begin
/*
   Imports a text file containing end of day price quotes
   filnam should be in double backslash format e.g. c:\\data\\sharequotes\\quotes.dat

   Ensures that it only brings in quotes for companies we have listed in our database
   and ensures only the latest copy of the quote is kept for each day

   Import table has 7 columns in CSV tab delimited ascii format.
      EzyChart format has prices in cents,
      METASTOCK in dollars
      MetaStock +100 has the volume divided by 100.

   Depending on your import data source you may need to process the temp_shareprices so that the
   values are in Dollars , not cents and the volume is per unit not per 100 units.

   Present calculation is for EzyChart format.

   Both tables have to have a 4 digit year yyyymmdd

   Warning : If you have the wrong format this procedure will import garbage into your database !!!

*/

perform drop_table(\'temp_shareprices\');

create table temp_shareprices(
    ticker char(10) null,
    dte char(8),
    v1 numeric(19,4) not null default 0,
    v2 numeric(19,4) not null default 0,
    v3 numeric(19,4) not null default 0,
    v4 numeric(19,4) not null default 0,
    v5 numeric(19,4) not null default 0
    );
execute \'copy temp_shareprices from \'||quote_literal(filnam)||\' delimiter \'||quote_literal(\',\');

-- extend yr to 4 digits, alter quotes to dollars

update temp_shareprices set dte = \'20\'||dte,v1=v1/100,v2=v2/100,v3=v3/100,v4=v4/100;

-- add in any new company tickers
insert into fsecurity (sec_cde) select ticker from temp_shareprices where length(trim(ticker))=3 and ticker not in ( select sec_cde from fsecurity);

-- ====================
-- remove rows from temp_shareprices where we do not have a corresponding security
-- using sub-select is slow
-- delete from temp_shareprices where ticker not in (select sec_cde from fsecurity);
-- can just delete ones where length<>3 as insert above ensures all length 3 are inserted. All ASX codes are 3char long
delete from temp_shareprices where length(trim(ticker))<>3;

-- Attempt to remove without limiting to length and not using sub select does not work
--update temp_shareprices set ticker = s.sec_cde from temp_shareprices t left outer join fsecurity s on t.ticker = s.sec_cde;
--delete from temp_shareprices where ticker is null;
-- ====================

-- put indexes on temp table
CREATE INDEX temphist_tick ON temp_shareprices USING btree (ticker);
CREATE INDEX temphist_dte ON temp_shareprices USING btree (dte);

-- ====================
-- Scan the input table and delete as appropriate from fsechist

-- Using sub select to remove any pre-existing records in fsechist very slow
-- delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date);

-- Tried to replace subselect
update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and h.hist_date = t.dte;
delete from fsechist where hist_tick = \'@@\';
-- ====================

-- insert newvalues
insert into fsechist(hist_tick,hist_date,hist_v1,hist_v2,hist_v3,hist_v4,hist_vol) select   ticker,dte,v1,v2,v3,v4,v5 from temp_shareprices;

-- count the number inserted
-- select   count(*) into cnt from temp_shareprices;

-- perform drop_table(\'temp_shareprices\');

return alldone;

end;'
LANGUAGE 'plpgsql' VOLATILE;

Note : drop_table is a simple procedure that drops a procedure by name iff it exists.

==============================================
Definition of Fsechist

CREATE TABLE public.fsechist
(
hist_q int4 NOT NULL DEFAULT nextval('hist_q'::text),
hist_v1 numeric(19,4),
hist_v2 numeric(19,4),
hist_v3 numeric(19,4),
hist_v4 numeric(19,4),
hist_vol numeric(19,4),
hist_deltar numeric(19,4),
hist_deltag numeric(19,4),
hist_date char(8),
hist_tick char(10),
hist_announce char(1),
CONSTRAINT fsechist_pkey PRIMARY KEY (hist_q)
) WITH OIDS;

CREATE INDEX fsechist_date ON public.fsechist USING btree (hist_date);

CREATE UNIQUE INDEX fsechist_q ON public.fsechist USING btree (hist_q);
CREATE INDEX fsechist_tick ON public.fsechist USING btree (hist_tick);

Re: Slow sub-selects, max and count(*)

From

"Richard Sydney-Smith"

Date:

05 February 2004, 23:16:30

Thanks Josh and Ian,

narrowing the problem down. The really slow line is the one where I try and remove potential duplicates. It does not look at the indexes.

Point on using copy rather than insert is taken. I use copy to load to a temp file so I can test the data and alter certain values. I suppose I could: copy to temp / alter/ save/ copy into fsechist but that is rather ugly and as it is not the insert that is the real bug bear I would like to fix this first.

Ian you suggested a simular problem was solved using "exists" rather than "in". I am not sure how the use differs. I have tried to include it in option 4 below.

Trial Solutions:

1) delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date);

far too slow

update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and h.hist_date = t.dte;
delete from fsechist where hist_tick = \'@@\';

really , really slow. Thought this would use the indexes but "explain" ( being my friend) still gives the query plan as a sequential scan

-- does not allow insert of a single company data
delete from fsechist where hist_date in (select distinct dte from temp_shareprices);

Works quick enough but has limited functionality. ( does not cope with imports other than a full end of trading)

delete from fsechist where exists(select 1 from fsechist, temp_shareprices where hist_tick = ticker and dte = hist_date);

Tried and still took over 60 secs before I cancelled the request.

Indexes

Both tables have indexes defined for tick and date. tick and date of the same data type in both cases.

Re: Slow sub-selects, max and count(*)

From

Bruno Wolff III

Date:

05 February 2004, 23:38:13

On Thu, Feb 05, 2004 at 15:53:08 +0800, Richard Sydney-Smith <richard@ibisaustralia.com> wrote:
> Trial Solutions:
> 
> 1) delete from fsechist where hist_q in (select hist_q from fsechist, temp_shareprices where hist_tick = ticker and
dte= hist_date);

Don't you want:
delete from fsechist where hist_q in (select hist_q from temp_shareprices where hist_tick = ticker and dte =
hist_date);

You should be able to refer to the table being deleted instead of joining
to it again in the subselect. As long as you are using 7.4.x, this should
be pretty fast.

> 
> far too slow

Since this is slower than the exists solutions, my guess is that you are
using a version prior to 7.4.

> 
> 2) 
> 
>  update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t where t.ticker = h.hist_tick and
h.hist_date= t.dte;

>  delete from fsechist where hist_tick = \'@@\';
> 
> really , really slow. Thought this would use the indexes but "explain" ( being my friend) still gives the query plan
asa sequential scan

Again don't add fsechist to the from list. This joins with fsechist twice.

> 
> 3) 
> 
> -- does not allow insert of a single company data
> delete from fsechist where hist_date in (select distinct dte from temp_shareprices);
> 
> Works quick enough but has limited functionality. ( does not cope with imports other than a full end of trading)

I think using the distinct clause is going to slow this down, especially
if you are using a version prior to 7.4 since a sort is used instead of
a hash table to get unique values.

> 
> 4) 
> 
> delete from fsechist where exists(select 1 from fsechist, temp_shareprices where hist_tick = ticker and dte =
hist_date);
> 
> Tried and still took over 60 secs before I cancelled the request.
> Indexes

Again, drop fsechist from the exists subselect.

> 
> Both tables have indexes defined for tick and date. tick and date of the same data type in both cases.

You can also do a join using the nonstandard implied from syntax.
You can do something like:
delete from fsechist where hist_tick = temp_shareprices.ticker and temp_shareprices.dte = hist_date;

Re: Slow sub-selects, max and count(*)

From

Josh Berkus

Date:

06 February 2004, 00:01:04

Richard,

> Ian you suggested a simular problem was solved using "exists" rather than 
"in". I am not sure how the use differs. I have tried to include it in option 
4 below. 

Hmmm ... this piece of advice is dated; with PG 7.4, IN() queries should be 
plenty fast.  If you're using 7.3 or less, though watch out!


> 1) delete from fsechist where hist_q in (select hist_q from fsechist, 
temp_shareprices where hist_tick = ticker and dte = hist_date);

I can't help you with this unless you attribute column names to their tables.  
Please use full table.column syntax.


>  update fsechist set hist_tick = \'@@\' from fsechist h, temp_shareprices t 
where t.ticker = h.hist_tick and h.hist_date = t.dte;

Put an ANALYZE fsechist here.   
>  delete from fsechist where hist_tick = \'@@\';

Is there a reason why you are doing this in two steps?

> -- does not allow insert of a single company data
> delete from fsechist where hist_date in (select distinct dte from 
temp_shareprices);

The DISTINCT is completely unnecessary.

> delete from fsechist where exists(select 1 from fsechist, temp_shareprices 
where hist_tick = ticker and dte = hist_date);

This is not the same query as #1.

-- 
-Josh BerkusAglio Database SolutionsSan Francisco

Re: Slow sub-selects, max and count(*)

From

Greg Stark

Date:

28 March 2004, 03:53:38

Josh Berkus <josh@agliodbs.com> writes:

> Max() and Count() cannot use indexes for technical reasons.   Browse through 
> the archives of SQL, PERFORM, and HACKERS for about 12,000 discussions on the 
> subject.

Please don't confuse the issue by throwing Max() and Count() into the same
basket.

The issue with Min()/Max() is that the result could be generated efficiently
by scanning indexes but it's just hard, especially given generic aggregate
functions, and the work simply hasn't been done, or even started, yet.

The issue with Count() is that people want the result to be cached in a single
per-table counter, but that can't be done as simply as that because of
transactions. People have discussed complex solutions to this but it's a much
more complex problem than it appears.

They're really two entirely separate issues.

-- 
greg

Re: Slow sub-selects, max and count(*)

From

Josh Berkus

Date:

29 March 2004, 17:48:27

Greg,

> Please don't confuse the issue by throwing Max() and Count() into the same
> basket.

When on earth is that post of mine from?  Seems like it's several months, if 
not a couple of years, old.

-- 
-Josh BerkusAglio Database SolutionsSan Francisco