Thread: Is there a faster way to do this?
pg: 7.4.2
RedHat 7.2

Can I get some advice on a possible faster way of doing this?

Scenario: Each day I update a column in a table with an internal percentile value. To arrive at this value, I get a count of records with sales > 0 and then divide that count by the number of tile groups I want. For example:

  Total records w/ sales > 0: 730,000
  tile#: 100
  rows per percentile group (730,000 / 100): 7,300

Now that I have the group size, I cycle through my recordset, grab the key field and the current percentile number, and stuff the values into a temp table. I mod the current row counter against the group size; if the result is 0, I add 1 to the current percentile number. After inserting the records into the temp table I update the main table.

Using the example above, the first 7,300 records get a percentile number of 1, the next 7,300 records get a percentile number of 2, then 3, 4, 5, etc.

Unfortunately, I am going record by record in a loop, and the process takes upwards of 20 minutes. Is there a faster way to do this? I thought about using LIMIT and OFFSET, but I'm not sure how I would accomplish it. Below is the function I currently use.
Thanks for any help provided.

CREATE OR REPLACE FUNCTION cdm.percentile_calc()
RETURNS text AS
'DECLARE
    v_interval   int4;
    v_tile       int4;
    v_percentile int4;
    v_check      int4;
    v_count      int4;
    v_rowcount   int4;
    myRec        RECORD;
BEGIN
    v_count := 0;
    v_tile := 100;
    v_percentile := 1;
    v_rowcount := 1;
    v_check := 0;

    /* Get count of records with val_purch_com > 0 */
    SELECT INTO v_count count(*) FROM cdm.cdm_indiv_mast
        WHERE val_purch_com > 0;

    /* this number will be used as part of our MOD to tell when to
       add one to our percentile */
    v_interval := v_count / v_tile;

    CREATE TEMP TABLE cdmperct (f1 int8, f2 int2);

    FOR myRec IN SELECT indiv_key FROM cdm.cdm_indiv_mast
        WHERE val_purch_com > 0 ORDER BY val_purch_com DESC LOOP
        INSERT INTO cdmperct VALUES (myRec.indiv_key, v_percentile);
        v_check := mod(v_rowcount, v_interval);
        IF v_check = 0 THEN
            v_percentile := v_percentile + 1;
        END IF;
        v_rowcount := v_rowcount + 1;
    END LOOP;

    UPDATE cdm.cdm_indiv_mast SET percentiler = f2
        FROM cdmperct WHERE indiv_key = f1;

    DROP TABLE cdmperct;
    RETURN \'DONE\';
END;'
LANGUAGE 'plpgsql' VOLATILE;  -- modifies data, so it cannot be IMMUTABLE

Patrick Hatcher
Macys.Com
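For reference, the loop's bucketing arithmetic can be sketched outside SQL. This is a minimal Python illustration of the mod-counter logic above; the function name and data are invented, not from the original table:

```python
def assign_buckets(keys, n_tiles=100):
    """Assign 1-based percentile buckets to keys already sorted by sales desc.

    Mirrors the plpgsql loop: every `interval` rows, mod(rowcount, interval)
    hits 0 and the percentile number advances by one.
    """
    interval = len(keys) // n_tiles  # v_interval := v_count / v_tile
    buckets = []
    percentile = 1
    for rowcount, key in enumerate(keys, start=1):
        buckets.append((key, percentile))
        if interval and rowcount % interval == 0:  # v_check = 0
            percentile += 1
    return buckets
```

With 730,000 keys and 100 tiles, interval is 7,300: the first 7,300 keys land in bucket 1, the next 7,300 in bucket 2, and so on. When the count is not an exact multiple of the interval, the leftover rows spill into bucket 101, exactly as in the original function.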
Patrick Hatcher wrote:
> Scenario: Each day I update a column in a table with an internal
> percentile value. [...]
>
> Using the example above, the first 7300 records would get a percentile
> number of 1, the next 7300 records would get a percentile number of 2,
> then 3,4,5,etc.
>
> Unfortunately, I am going record by record in a loop and the process
> takes upwards of 20mins. Is there a faster way to do this? I thought
> about using limit and offset, but I'm not sure how I would accomplish it.

Idea #1: If you order by something unique, you could use ORDER BY, LIMIT and OFFSET in a subquery (UPDATE itself takes no ORDER BY or LIMIT):

  UPDATE my_table SET my_group = 1
   WHERE my_id_column IN
         (SELECT my_id_column FROM my_table WHERE val > 0
           ORDER BY val, my_id_column LIMIT 7300 OFFSET 0);

  UPDATE my_table SET my_group = 2
   WHERE my_id_column IN
         (SELECT my_id_column FROM my_table WHERE val > 0
           ORDER BY val, my_id_column LIMIT 7300 OFFSET 7300);

  etc.

You'll need to make sure the order is guaranteed for all rows when "val" is the same, though. Now, the problem with this is that by the time you get to group 99 you've had to scan over group 1 ninety-eight times - not good. If you knew how many rows there were for any given "val", you could restrict it much more, though.

Idea #2: Turn your function around. Declare a cursor on the sorted SELECT of target rows. Skip 7300 rows at a time, reading the primary key at each boundary, and use those boundary values to issue one UPDATE covering all rows between two of them.
That's only 100 updates issued rather than one for each target row.

Idea #3: Someone might well be able to come up with a clever idea involving a join against a set-returning function, but I'm not sure about guaranteeing the order of the join vs the returned set (and it's getting late here). Any ideas, people?

Maybe one of those is some use.

--
  Richard Huxton
  Archonet Ltd
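A rough sketch of Idea #2's boundary approach, with Python's sqlite3 standing in for PostgreSQL. Table and column names here are made up, and it assumes the sort values are unique, per the ordering caveat in Idea #1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, val REAL, grp INTEGER)")
conn.executemany("INSERT INTO sales (val) VALUES (?)",
                 [(i + 1,) for i in range(1000)])

n_tiles = 10
# One sorted scan to collect the group-boundary values...
vals = [r[0] for r in conn.execute(
    "SELECT val FROM sales WHERE val > 0 ORDER BY val DESC")]
interval = len(vals) // n_tiles

# ...then one ranged UPDATE per group: 10 statements instead of 1,000.
for g in range(n_tiles):
    hi = vals[g * interval]       # top value of this group
    nxt = (g + 1) * interval      # index of the next group's top value
    if nxt < len(vals):
        conn.execute("UPDATE sales SET grp = ? WHERE val <= ? AND val > ?",
                     (g + 1, hi, vals[nxt]))
    else:
        conn.execute("UPDATE sales SET grp = ? WHERE val <= ? AND val > 0",
                     (g + 1, hi))
```

The point is the statement count: the number of UPDATEs issued equals the number of tiles, not the number of rows.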
Perhaps something along the lines of:

  CREATE TEMP SEQUENCE pctile_seq;

  SELECT ..., (nextval('pctile_seq') / (rec_per_group + 1)) + 1
    FROM (SELECT ... WHERE sales > 0 ORDER BY sales) sales;

  DROP SEQUENCE pctile_seq;

This doesn't handle the 'sales = 0 rows go in percentile 0' rule, but that shouldn't be too hard to add. I think, with some playing around, one might be able to do an UPDATE from the above SELECT, which would eliminate the read/update loop.

Patrick Hatcher wrote:
> pg: 7.4.2
> RedHat 7.2
>
> Can I get some advice on a possible faster way of doing this:
>
> Scenario: Each day I update a column in a table with an internal
> percentile value. [...] Unfortunately, I am going record by record in a
> loop and the process takes upwards of 20mins. Is there a faster way to
> do this?
> [snip - function quoted in full upthread]

--
Edmund Bacon <ebacon@onesystem.com>
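To make the sequence-division trick concrete, here is a minimal Python sketch. An itertools counter stands in for the temp sequence, the data and names are invented, and the counter starts at 0 so that integer division yields even 1-based groups:

```python
from itertools import count
from collections import Counter

data = [("key%d" % i, i * 10) for i in range(1, 31)]   # (key, sales)
n_tiles = 3
rows = sorted(data, key=lambda r: r[1], reverse=True)  # ORDER BY sales DESC
rec_per_group = len(rows) // n_tiles                   # 10 rows per tile

seq = count(0)  # plays the role of nextval('pctile_seq')
tiles = {key: next(seq) // rec_per_group + 1 for key, _ in rows}
```

A real UPDATE ... FROM against such a SELECT would then apply all the assignments in one statement, which is the point of the suggestion.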
One way to do it would be to:

- Not put the percentile in the sales table.

- Create a percentiles table with a foreign key to the sales table
  primary key, and the percentile itself:

    CREATE TABLE percentiles(
        fkey <whatever> PRIMARY KEY REFERENCES sales(<primary key>),
        percentile INT4
    );

- Create a sequence for that ancillary table:

    CREATE SEQUENCE percent_seq;

- When ready to create the percentiles, truncate the percentiles table
  and reset the sequence:

    TRUNCATE percentiles;
    SELECT setval('percent_seq', 1, false);

- Now query your sales with the proper ORDER BY:

    INSERT INTO percentiles
    SELECT pk_sales, (nextval('percent_seq') - 1) / tile_size + 1
      FROM sales
     ORDER BY sales_value;

HTH

Patrick Hatcher wrote:
> pg: 7.4.2
> RedHat 7.2
>
> Can I get some advice on a possible faster way of doing this:
> [snip - question and function quoted in full upthread]
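A sketch of this ancillary-table rebuild, again using Python's sqlite3 in place of PostgreSQL. ROW_NUMBER() stands in for the sequence (it needs SQLite 3.25+ for window functions), and all names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (pk_sales INTEGER PRIMARY KEY, sales_value REAL);
CREATE TABLE percentiles (
    fkey INTEGER PRIMARY KEY REFERENCES sales(pk_sales),
    percentile INTEGER
);
""")
conn.executemany("INSERT INTO sales (sales_value) VALUES (?)",
                 [(i,) for i in range(1, 201)])

tile_size = 20
conn.execute("DELETE FROM percentiles")  # the TRUNCATE step
# One set-based INSERT ... SELECT replaces the row-by-row loop entirely.
conn.execute("""
    INSERT INTO percentiles (fkey, percentile)
    SELECT pk_sales,
           (ROW_NUMBER() OVER (ORDER BY sales_value) - 1) / ? + 1
      FROM sales
""", (tile_size,))
```

Keeping the percentiles out of the main table also means the big daily UPDATE never bloats the sales table itself.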
Hard to imagine it taking that long, even for a table that large, but we don't know the details of your system. I got a sample table with 800,000 records down to 92 seconds using the function below. See how it does for you. Notes follow.

CREATE OR REPLACE FUNCTION percentile_calc(INTEGER)
RETURNS text AS
'DECLARE
    v_tile ALIAS FOR $1;
    v_interval   integer;
    v_record     RECORD;
    v_rowcount   integer := 0;
    v_percentile integer := 1;
BEGIN
    SELECT COUNT(*) / v_tile FROM cdm_indiv_mast
        WHERE val_purch_com > 0 INTO v_interval;

    FOR v_record IN SELECT ctid FROM cdm_indiv_mast
        WHERE val_purch_com > 0 ORDER BY val_purch_com DESC LOOP
        v_rowcount := v_rowcount + 1;
        UPDATE cdm_indiv_mast SET percentiler = v_percentile
            WHERE ctid = v_record.ctid;
        IF v_rowcount >= v_interval THEN
            v_percentile := v_percentile + 1;
            v_rowcount := 0;
        END IF;
    END LOOP;
    RETURN \'DONE\';
END;'
LANGUAGE plpgsql VOLATILE STRICT;

CREATE OR REPLACE FUNCTION percentile_calc() RETURNS text AS
'SELECT percentile_calc(100);' LANGUAGE SQL;

Notes:

Since 100 seemed to be hard-coded into the original function, there was no need for v_count. Instead, I made "v_tile" an argument, with the no-argument wrapper supplying a default of 100.

There may be a false assumption here. If the values of val_purch_com are not unique, then two items with the same val_purch_com may end up with different percentiler values. If this matters, you may want to at least enforce some ordering of these values by adding more columns to the ORDER BY clause. Without knowing the full table structure, I can't recommend which columns to add.

To really speed this up, make sure that you do not have any indexes on the table. By using ctids, we neatly avoid having to use any indexes in the function itself.
Unless you are using oids and really need them (highly unlikely because of the "indiv_key" column), you should remove them:

  ALTER TABLE cdm_indiv_mast SET WITHOUT OIDS;

Of course, vacuuming completely and often is recommended for a table this size as well, especially when updating this many rows at once. I'd recommend a VACUUM FULL immediately before running it.

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200406162303
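As a side note, the counter-reset restructure in this version produces exactly the same grouping as the original mod() test. A small Python sketch checking the equivalence (illustrative only, not the original data):

```python
def buckets_mod(n_rows, interval):
    """Original style: mod(rowcount, interval) = 0 advances the percentile."""
    out, percentile = [], 1
    for rowcount in range(1, n_rows + 1):
        out.append(percentile)
        if rowcount % interval == 0:
            percentile += 1
    return out

def buckets_reset(n_rows, interval):
    """Restructured style: a counter that resets once it reaches the interval."""
    out, percentile, rowcount = [], 1, 0
    for _ in range(n_rows):
        out.append(percentile)
        rowcount += 1
        if rowcount >= interval:
            percentile += 1
            rowcount = 0
    return out
```

The reset form saves the mod() call on every row, which adds up over 730,000 iterations.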