Thread: Is there a faster way to do this?
pg: 7.4.2
RedHat 7.2

Can I get some advice on a possible faster way of doing this?

Scenario: Each day I update a column in a table with an internal percentile value. To arrive at this value, I get a count of records with sales > 0 and then divide that count by the number of tile groups I want. For example:

  Total records w/ sales > 0: 730,000
  tile#: 100
  rows per percentile group (730,000 / 100): 7,300

Now that I have the group size, I cycle through my recordset, grab the key field and the current percentile number, and stuff the values into a temp table. I mod the current row counter against the group size; if the result is 0, I add 1 to the current percentile number. After inserting the records into the temp table I update the main table.

Using the example above, the first 7,300 records get a percentile number of 1, the next 7,300 records get a percentile number of 2, then 3, 4, 5, etc.

Unfortunately, I am going record by record in a loop, and the process takes upwards of 20 minutes. Is there a faster way to do this? I thought about using LIMIT and OFFSET, but I'm not sure how I would accomplish it. Below is the function I currently use.
Thanks for any help provided.

CREATE OR REPLACE FUNCTION cdm.percentile_calc()
RETURNS text AS
'DECLARE
    v_interval   int4;
    v_tile       int4;
    v_percentile int4;
    v_check      int4;
    v_count      int4;
    v_rowcount   int4;
    myRec        RECORD;
BEGIN
    v_count := 0;
    v_tile := 100;
    v_percentile := 1;
    v_rowcount := 1;
    v_check := 0;

    /* Get count of records with val_purch_com > 0 */
    SELECT INTO v_count count(*) FROM cdm.cdm_indiv_mast
        WHERE val_purch_com > 0;

    /* this number will be used as part of our MOD to tell when to
       add one to our percentile */
    v_interval := v_count / v_tile;

    CREATE TEMP TABLE cdmperct (f1 int8, f2 int2);

    FOR myRec IN SELECT indiv_key FROM cdm.cdm_indiv_mast
        WHERE val_purch_com > 0 ORDER BY val_purch_com DESC LOOP
        INSERT INTO cdmperct VALUES (myRec.indiv_key, v_percentile);
        v_check := mod(v_rowcount, v_interval);
        IF v_check = 0 THEN
            v_percentile := v_percentile + 1;
        END IF;
        v_rowcount := v_rowcount + 1;
    END LOOP;

    UPDATE cdm.cdm_indiv_mast SET percentiler = f2
        FROM cdmperct WHERE indiv_key = f1;

    DROP TABLE cdmperct;
    RETURN \'DONE\';
END;'
LANGUAGE 'plpgsql' VOLATILE;  -- modifies data, so it cannot be IMMUTABLE

Patrick Hatcher
Macys.Com
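For reference, the loop's bucketing arithmetic can be sketched outside SQL. This is a minimal Python illustration of the mod-counter logic above; the function name and data are invented, not from the original table:

```python
def assign_buckets(keys, n_tiles=100):
    """Assign 1-based percentile buckets to keys already sorted by sales desc.

    Mirrors the plpgsql loop: every `interval` rows, mod(rowcount, interval)
    hits 0 and the percentile number advances by one.
    """
    interval = len(keys) // n_tiles  # v_interval := v_count / v_tile
    buckets = []
    percentile = 1
    for rowcount, key in enumerate(keys, start=1):
        buckets.append((key, percentile))
        if interval and rowcount % interval == 0:  # v_check = 0
            percentile += 1
    return buckets
```

With 730,000 keys and 100 tiles, interval is 7,300: the first 7,300 keys land in bucket 1, the next 7,300 in bucket 2, and so on. When the count is not an exact multiple of the interval, the leftover rows spill into bucket 101, exactly as in the original function.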
Patrick Hatcher wrote:
> Scenario: Each day I update a column in a table with an internal
> percentile value. [...]
>
> Using the example above, the first 7300 records would get a percentile
> number of 1, the next 7300 records would get a percentile number of 2,
> then 3,4,5,etc.
>
> Unfortunately, I am going record by record in a loop and the process
> takes upwards of 20mins. Is there a faster way to do this? I thought
> about using limit and offset, but I'm not sure how I would accomplish it.

Idea #1: If you order by something unique, you could use ORDER BY, LIMIT and OFFSET in a subquery (UPDATE itself takes no ORDER BY or LIMIT):

  UPDATE my_table SET my_group = 1
   WHERE my_id_column IN
         (SELECT my_id_column FROM my_table WHERE val > 0
           ORDER BY val, my_id_column LIMIT 7300 OFFSET 0);

  UPDATE my_table SET my_group = 2
   WHERE my_id_column IN
         (SELECT my_id_column FROM my_table WHERE val > 0
           ORDER BY val, my_id_column LIMIT 7300 OFFSET 7300);

  etc.

You'll need to make sure the order is guaranteed for all rows when "val" is the same, though. Now, the problem with this is that by the time you get to group 99 you've had to scan over group 1 ninety-eight times - not good. If you knew how many rows there were for any given "val", you could restrict it much more, though.

Idea #2: Turn your function around. Declare a cursor on the sorted SELECT of target rows. Skip 7300 rows at a time, reading the primary key at each boundary, and use those boundary values to issue one UPDATE covering all rows between two of them.
That's only 100 updates issued rather than one for each target row.

Idea #3: Someone might well be able to come up with a clever idea involving a join against a set-returning function, but I'm not sure about guaranteeing the order of the join vs the returned set (and it's getting late here). Any ideas, people?

Maybe one of those is some use.

--
  Richard Huxton
  Archonet Ltd
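A rough sketch of Idea #2's boundary approach, with Python's sqlite3 standing in for PostgreSQL. Table and column names here are made up, and it assumes the sort values are unique, per the ordering caveat in Idea #1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, val REAL, grp INTEGER)")
conn.executemany("INSERT INTO sales (val) VALUES (?)",
                 [(i + 1,) for i in range(1000)])

n_tiles = 10
# One sorted scan to collect the group-boundary values...
vals = [r[0] for r in conn.execute(
    "SELECT val FROM sales WHERE val > 0 ORDER BY val DESC")]
interval = len(vals) // n_tiles

# ...then one ranged UPDATE per group: 10 statements instead of 1,000.
for g in range(n_tiles):
    hi = vals[g * interval]       # top value of this group
    nxt = (g + 1) * interval      # index of the next group's top value
    if nxt < len(vals):
        conn.execute("UPDATE sales SET grp = ? WHERE val <= ? AND val > ?",
                     (g + 1, hi, vals[nxt]))
    else:
        conn.execute("UPDATE sales SET grp = ? WHERE val <= ? AND val > 0",
                     (g + 1, hi))
```

The point is the statement count: the number of UPDATEs issued equals the number of tiles, not the number of rows.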
Perhaps something along the lines of:

  CREATE TEMP SEQUENCE pctile_seq;

  SELECT ..., (nextval('pctile_seq') / (rec_per_group + 1)) + 1
    FROM (SELECT ... WHERE sales > 0 ORDER BY sales) sales;

  DROP SEQUENCE pctile_seq;

This doesn't handle the 'sales = 0 rows go in percentile 0' rule, but that shouldn't be too hard to add. I think, with some playing around, one might be able to do an UPDATE from the above SELECT, which would eliminate the read/update loop.

Patrick Hatcher wrote:
> pg: 7.4.2
> RedHat 7.2
>
> Can I get some advice on a possible faster way of doing this:
>
> Scenario: Each day I update a column in a table with an internal
> percentile value. [...] Unfortunately, I am going record by record in a
> loop and the process takes upwards of 20mins. Is there a faster way to
> do this?
> [snip - function quoted in full upthread]

--
Edmund Bacon <ebacon@onesystem.com>
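To make the sequence-division trick concrete, here is a minimal Python sketch. An itertools counter stands in for the temp sequence, the data and names are invented, and the counter starts at 0 so that integer division yields even 1-based groups:

```python
from itertools import count
from collections import Counter

data = [("key%d" % i, i * 10) for i in range(1, 31)]   # (key, sales)
n_tiles = 3
rows = sorted(data, key=lambda r: r[1], reverse=True)  # ORDER BY sales DESC
rec_per_group = len(rows) // n_tiles                   # 10 rows per tile

seq = count(0)  # plays the role of nextval('pctile_seq')
tiles = {key: next(seq) // rec_per_group + 1 for key, _ in rows}
```

A real UPDATE ... FROM against such a SELECT would then apply all the assignments in one statement, which is the point of the suggestion.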
One way to do it would be to:

- Not put the percentile in the sales table.

- Create a percentiles table with a foreign key to the sales table
  primary key, and the percentile itself:

    CREATE TABLE percentiles(
        fkey <whatever> PRIMARY KEY REFERENCES sales(<primary key>),
        percentile INT4
    );

- Create a sequence for that ancillary table:

    CREATE SEQUENCE percent_seq;

- When ready to create the percentiles, truncate the percentiles table
  and reset the sequence:

    TRUNCATE percentiles;
    SELECT setval('percent_seq', 1, false);

- Now query your sales with the proper ORDER BY:

    INSERT INTO percentiles
    SELECT pk_sales, (nextval('percent_seq') - 1) / tile_size + 1
      FROM sales
     ORDER BY sales_value;

HTH

Patrick Hatcher wrote:
> pg: 7.4.2
> RedHat 7.2
>
> Can I get some advice on a possible faster way of doing this:
> [snip - question and function quoted in full upthread]
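A sketch of this ancillary-table rebuild, again using Python's sqlite3 in place of PostgreSQL. ROW_NUMBER() stands in for the sequence (it needs SQLite 3.25+ for window functions), and all names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (pk_sales INTEGER PRIMARY KEY, sales_value REAL);
CREATE TABLE percentiles (
    fkey INTEGER PRIMARY KEY REFERENCES sales(pk_sales),
    percentile INTEGER
);
""")
conn.executemany("INSERT INTO sales (sales_value) VALUES (?)",
                 [(i,) for i in range(1, 201)])

tile_size = 20
conn.execute("DELETE FROM percentiles")  # the TRUNCATE step
# One set-based INSERT ... SELECT replaces the row-by-row loop entirely.
conn.execute("""
    INSERT INTO percentiles (fkey, percentile)
    SELECT pk_sales,
           (ROW_NUMBER() OVER (ORDER BY sales_value) - 1) / ? + 1
      FROM sales
""", (tile_size,))
```

Keeping the percentiles out of the main table also means the big daily UPDATE never bloats the sales table itself.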
Hard to imagine it taking that long, even for a table that large, but we don't know the details of your system. I got a sample table with 800,000 records down to 92 seconds using the function below. See how it does for you. Notes follow.

CREATE OR REPLACE FUNCTION percentile_calc(INTEGER)
RETURNS text AS
'DECLARE
    v_tile ALIAS FOR $1;
    v_interval   integer;
    v_record     RECORD;
    v_rowcount   integer := 0;
    v_percentile integer := 1;
BEGIN
    SELECT COUNT(*) / v_tile FROM cdm_indiv_mast
        WHERE val_purch_com > 0 INTO v_interval;

    FOR v_record IN SELECT ctid FROM cdm_indiv_mast
        WHERE val_purch_com > 0 ORDER BY val_purch_com DESC LOOP
        v_rowcount := v_rowcount + 1;
        UPDATE cdm_indiv_mast SET percentiler = v_percentile
            WHERE ctid = v_record.ctid;
        IF v_rowcount >= v_interval THEN
            v_percentile := v_percentile + 1;
            v_rowcount := 0;
        END IF;
    END LOOP;
    RETURN \'DONE\';
END;'
LANGUAGE plpgsql VOLATILE STRICT;

CREATE OR REPLACE FUNCTION percentile_calc() RETURNS text AS
'SELECT percentile_calc(100);' LANGUAGE SQL;

Notes:

Since 100 seemed to be hard-coded into the original function, there was no need for v_count. Instead, I made "v_tile" an argument, with the no-argument wrapper supplying a default of 100.

There may be a false assumption here. If the values of val_purch_com are not unique, then two items with the same val_purch_com may end up with different percentiler values. If this matters, you may want to at least enforce some ordering of these values by adding more columns to the ORDER BY clause. Without knowing the full table structure, I can't recommend which columns to add.

To really speed this up, make sure that you do not have any indexes on the table. By using ctids, we neatly avoid having to use any indexes in the function itself.
Unless you are using oids and really need them (highly unlikely because of the "indiv_key" column), you should remove them:

  ALTER TABLE cdm_indiv_mast SET WITHOUT OIDS;

Of course, vacuuming completely and often is recommended for a table this size as well, especially when updating this many rows at once. I'd recommend a VACUUM FULL immediately before running it.

--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200406162303
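As a side note, the counter-reset restructure in this version produces exactly the same grouping as the original mod() test. A small Python sketch checking the equivalence (illustrative only, not the original data):

```python
def buckets_mod(n_rows, interval):
    """Original style: mod(rowcount, interval) = 0 advances the percentile."""
    out, percentile = [], 1
    for rowcount in range(1, n_rows + 1):
        out.append(percentile)
        if rowcount % interval == 0:
            percentile += 1
    return out

def buckets_reset(n_rows, interval):
    """Restructured style: a counter that resets once it reaches the interval."""
    out, percentile, rowcount = [], 1, 0
    for _ in range(n_rows):
        out.append(percentile)
        rowcount += 1
        if rowcount >= interval:
            percentile += 1
            rowcount = 0
    return out
```

The reset form saves the mod() call on every row, which adds up over 730,000 iterations.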