Re: Is there a faster way to do this? - Mailing list pgsql-sql

From Richard Huxton
Subject Re: Is there a faster way to do this?
Msg-id 40CF5941.3090301@archonet.com
In response to Is there a faster way to do this?  (Patrick Hatcher <PHatcher@macys.com>)
List pgsql-sql
Patrick Hatcher wrote:
> 
> Scenario:  Each day I update a column in a table with an internal
> percentile value.  To arrive at  this value, I'll get a count of records
> with sales > 0 and then divide this count by the total number of tile
> groups I want.  So for example:
> Total records w/sales > 0 = 730,000
>  tile# = 100
> rows per percentile group (730,000 / 100): 7,300
> 
> Now that I have the total number of groups I need, I cycle through my
> recordset, grab the key field and the current percentile number and stuff
> the values into a temp table. I mod the current row counter against the
> total percentile group number.  If it is 0 then I add 1 to the  current
> percentile number .  After inserting records into the temp file I then
> update the main table.
> 
> Using the example above, the first 7300 records would get a  percentile
> number of 1, the next 7300  records would get  a percentile number of 2,
> then 3,4,5,etc.
> 
> Unfortunately, I am going record by record in a loop and the process takes
> upwards of 20mins.  Is there a faster way to do this?  I thought about
> using limit and offset, but I'm not sure how I would accomplish it.

Idea #1:
Well, if you order by something unique, you can use ORDER BY, LIMIT and 
OFFSET. Note that UPDATE itself won't accept ORDER BY/LIMIT/OFFSET, so 
push them down into a sub-select:

UPDATE my_table SET my_group=1 WHERE my_id_column IN
   (SELECT my_id_column FROM my_table WHERE val > 0
    ORDER BY val, my_id_column OFFSET 0 LIMIT 7300);
UPDATE my_table SET my_group=2 WHERE my_id_column IN
   (SELECT my_id_column FROM my_table WHERE val > 0
    ORDER BY val, my_id_column OFFSET 7300 LIMIT 7300);
etc.

You'll need to make sure the order is guaranteed for all rows when "val" 
is the same though.
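Purely as a cross-check of the bucketing arithmetic, here's a small 
sketch of the same idea in Python with sqlite3 (which also accepts 
LIMIT/OFFSET in a sub-select). Table and column names are invented for 
illustration; scaled down to 10 rows in groups of 5 rather than 730,000 
in groups of 7,300:

```python
import sqlite3

# Toy table standing in for the real one: 10 rows, all with sales > 0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_id_column INTEGER PRIMARY KEY,"
             " val REAL, my_group INTEGER)")
conn.executemany("INSERT INTO my_table (my_id_column, val) VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 11)])

group_size = 5   # the 7,300 of the original example
n_groups = 2     # the tile# of the original example

# One LIMIT/OFFSET update per group, each wrapped in a sub-select.
for g in range(1, n_groups + 1):
    conn.execute(
        """UPDATE my_table SET my_group = ? WHERE my_id_column IN
               (SELECT my_id_column FROM my_table WHERE val > 0
                ORDER BY val, my_id_column LIMIT ? OFFSET ?)""",
        (g, group_size, (g - 1) * group_size))

groups = [row[0] for row in conn.execute(
    "SELECT my_group FROM my_table ORDER BY val, my_id_column")]
print(groups)  # first 5 rows get group 1, next 5 get group 2
```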

Now, the problem with this is that by the time you get to group 99, 
you've had to scan over group 1 98 times - not good. If you knew how 
many rows there were for any given "val" then you could restrict it much 
more though.

Idea #2:
Turn your function around. Declare a cursor on the sorted SELECT of 
target rows. MOVE forward 7300 rows at a time and FETCH the primary key 
at each boundary. Use those boundary keys to issue an update of all rows 
between two values. That's only 100 updates issued rather than one for 
each target row.
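A rough sketch of this boundary-key approach, again in Python/sqlite3 
with invented names (in PostgreSQL you'd use DECLARE/MOVE/FETCH rather 
than pulling the sorted keys client-side, and it assumes the sort order 
lines up with the key ranges, as the toy data here does):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_id_column INTEGER PRIMARY KEY,"
             " val REAL, my_group INTEGER)")
conn.executemany("INSERT INTO my_table (my_id_column, val) VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 11)])

group_size = 5

# Walking the sorted keys stands in for the cursor: record the key at
# every group_size-th row (the first key of each group).
keys = [r[0] for r in conn.execute(
    "SELECT my_id_column FROM my_table WHERE val > 0"
    " ORDER BY val, my_id_column")]
boundaries = keys[::group_size]

# One ranged UPDATE per group instead of one per row.
for g, lo in enumerate(boundaries, start=1):
    hi = boundaries[g] if g < len(boundaries) else None
    if hi is None:
        conn.execute("UPDATE my_table SET my_group = ?"
                     " WHERE my_id_column >= ?", (g, lo))
    else:
        conn.execute("UPDATE my_table SET my_group = ?"
                     " WHERE my_id_column >= ? AND my_id_column < ?",
                     (g, lo, hi))

groups2 = [r[0] for r in conn.execute(
    "SELECT my_group FROM my_table ORDER BY my_id_column")]
print(groups2)  # first 5 rows get group 1, next 5 get group 2
```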

Idea #3:
Someone might well be able to come up with a clever idea involving a 
join against a set-returning function, but I'm not sure about 
guaranteeing the order of the join vs the returned set (and it's getting 
late here). Any ideas people?

Maybe one of those is of some use.
-- 
   Richard Huxton
   Archonet Ltd

