Thread: “Loop” in plpgsql Function - PostgreSQL 9.2
I've created a function that allows me to do a huge update.
But I need to limit this function. I need it to do 50k rows (for example) and then stop. After that I need to continue from the rows where I stopped... I'll have to modify the calling statement, select batch_number(), as well.
How can I do that? Using FOR?
The function below is already working, but the table has 40m rows, and it's taking ages.
FUNCTION:
CREATE OR REPLACE FUNCTION batch_number()
RETURNS integer AS $$
declare
    batch_num  integer;
    offset_num integer;
begin
    offset_num := 0;
    batch_num  := 1;

    -- Keep looping while any row is still missing a batch number.
    while (select true from gorfs.nfs_data where batch_number is null limit 1) loop
        -- Stamp the next 1000 rows (in file_id order) with the current batch.
        with ids(id) as (
            select file_id
            from gorfs.nfs_data
            order by file_id
            offset offset_num
            limit 1000
        )
        update gorfs.nfs_data
        set batch_number = batch_num
        from ids
        where file_id = ids.id;

        offset_num := offset_num + 1000;
        batch_num  := batch_num + 1;
    end loop;

    return batch_num;
end
$$ language plpgsql;
It is customary to restrict emails to a single list, or at least to make others aware when you do have a legitimate need to cross-post. Specifically, your post on -admin: <“Break” in plpgsql Function - PostgreSQL 9.2>. Yes, it didn't belong on -admin in the first place, but posting it here with a different title, and not indicating on either thread that the other exists and/or is the valid one, just makes it harder for others to follow along.
As for your general question, I try to approach this problem in the following manner:

1. SELECT however many of something you need (FOR UPDATE).
2. UPDATE those selected to indicate that they have been seen.
3. PROCESS them as needed.
4. Repeat step 1 until it returns no records.

It doesn't always work - and given a sufficiently large number of records it may be inadvisable - but it is set-oriented, which is generally a plus in SQL.

The other way to assign batches is to use the integer modulus operator (e.g., 10 % 3 = 1: read 3, remainder 1) or integer division (10 / 3 = 3) to derive the batch number from an attribute of the data itself, as opposed to its order of appearance in a result set.

David J.
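A minimal sketch of that select-and-mark loop against the OP's table (the batch number 42 and the transaction wrapper are illustrative assumptions, not part of David J.'s message):

BEGIN;

-- Steps 1 and 2 combined: lock the next slice of unprocessed rows and
-- mark them as seen by stamping a batch number on them.
WITH picked AS (
    SELECT file_id
    FROM gorfs.nfs_data
    WHERE batch_number IS NULL
    ORDER BY file_id
    LIMIT 1000
    FOR UPDATE
)
UPDATE gorfs.nfs_data AS d
SET batch_number = 42   -- hypothetical batch number for this pass
FROM picked
WHERE d.file_id = picked.file_id;

-- Step 3: process the rows of batch 42 here, then:
COMMIT;

-- Step 4: repeat with the next batch number until the UPDATE touches 0 rows.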
Hello,

On Mon, Jan 18, 2016 at 1:37 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:
> I've created a function that allows me to do a huge update.
>
> But I need to limit this function. I need it to do 50k rows (for example)
> and then stop. After that I need to continue from the rows where I
> stopped... I'll have to modify the calling statement, select
> batch_number(), as well.
>
> How can I do that? Using FOR?
>
> The function below is already working, but the table has 40m rows, and
> it's taking ages.
Do you need to run the function somewhere other than production, e.g. staging? I mean, do you want to run these batch processes in a single transaction?
If not, I had a similar problem, where I needed to implement a function which we could run in multiple sessions. I ran this function on one of the BI servers, where we have around 5 million records. Find this link about the implementation details.

If your question was about "Using Loops", then please ignore my comments.
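For reference, a minimal sketch of such a resumable, per-call-limited function against the OP's table (the name batch_number_step is hypothetical; note that 9.2 has no SKIP LOCKED, so concurrent sessions may block on the same rows):

-- Number up to p_limit unprocessed rows with p_batch and report how many
-- rows were touched; call repeatedly until it returns 0.
CREATE OR REPLACE FUNCTION batch_number_step(p_batch integer, p_limit integer)
RETURNS integer AS $$
declare
    rows_done integer;
begin
    with picked as (
        select file_id
        from gorfs.nfs_data
        where batch_number is null
        order by file_id
        limit p_limit
        for update
    )
    update gorfs.nfs_data d
    set batch_number = p_batch
    from picked
    where d.file_id = picked.file_id;

    get diagnostics rows_done = row_count;
    return rows_done;
end
$$ language plpgsql;

-- Usage: SELECT batch_number_step(1, 50000); then 2, 3, ... until it returns 0.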
"drum.lucas@gmail.com" <drum.lucas@gmail.com> writes: > So, the new plan is: > > 1 - Select 50.000 rows and gives it a batch number. > 2 - Select *MORE* 50,000 rows and gives it a *NEW* batch number. > 3 - Select *MORE* 50,000 rows and gives it a *NEW* batch number. Why so complicated? Here's a simplified example: CREATE TABLE mytable ( id serial NOT NULL, payload int NOT NULL, batch_number int NULL, PRIMARY KEY (id) ); INSERT INTO mytable (payload) SELECT x FROM generate_series(1, 2000) AS g(x); \set batchSize 600 UPDATE mytable SET batch_number = (id % (SELECT count(*) FROM mytable) / :batchSize) + 1; SELECT batch_number, count(*) AS cnt FROM mytable GROUP BY batch_number ORDER BY batch_number;
Hello,

On Tue, 2016-01-19 at 10:37 +1300, drum.lucas@gmail.com wrote:
> I've created a function that allows me to do a huge update.
>
> But I need to limit this function. I need it to do 50k rows (example)
> and then stop it. After that I need to continue from the rows where I
> stopped... I'll have to modify the calling statement, select
> batch_number(), as well.
>
> How can I do that? Using FOR?
>
> The function below is already working, but the table has 40m rows, and
> it's taking ages.
>
> [function snipped; quoted in full above]

Do I read correctly that you want to do something like this?

WITH b(n, i) AS (
    SELECT row_number() OVER (ORDER BY file_id), file_id
    FROM gorfs.nfs_data
)
UPDATE gorfs.nfs_data
SET batch_number = ((b.n - 1) / 1000) + 1
FROM b
WHERE file_id = b.i;

Adjust the math to match your desired result; this one is based on your function. It should not scan the table that many times checking for batch_number being null. I don't know the implementation details, and I don't know whether you have a partial index WHERE batch_number IS NULL that could be used to speed this up. Of course, this is for one run that changes everything at once.

But if you really want multiple transactions for this, how about:

CREATE INDEX ON gorfs.nfs_data (file_id) WHERE batch_number IS NULL;

Find the next batch number to use, e.g. max(batch_number)+1 from gorfs.nfs_data, or some sequence - with locking or whatever, if concurrency might bite you. Let it be YourNextBN:

WITH b(i) AS (
    SELECT file_id
    FROM gorfs.nfs_data
    WHERE batch_number IS NULL
    ORDER BY file_id
    LIMIT 1000
    FOR UPDATE
)
UPDATE gorfs.nfs_data
SET batch_number = YourNextBN
FROM b
WHERE batch_number IS NULL
  AND file_id = b.i;

Now, wouldn't that be a little simpler than your loop, which scans the same data over and over and throws most of it away via OFFSET? On the other hand, if you think creating a simple, small index on file_id WHERE batch_number IS NULL is not an option, this may not give you a real speedup.

Of course, your real-life case might be more complicated than a one-column update, but looking at your function scared me: it updates every row in the table anyway, so why do it so many times, with an additional scan just to check the loop condition? It gets worse if there is no partial index WHERE batch_number IS NULL.

Regards,
mariusz
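One way to make the YourNextBN placeholder above concrete is to draw it from a sequence, so concurrent sessions never pick the same batch number. This is a sketch layered on mariusz's suggestion, not something from the thread; the sequence name batch_number_seq is invented:

CREATE SEQUENCE batch_number_seq;

-- The nb CTE draws exactly one batch number per statement; the b CTE locks
-- the next 1000 unnumbered rows, exactly as in the sketch above.
WITH nb AS (
    SELECT nextval('batch_number_seq') AS bn
), b AS (
    SELECT file_id
    FROM gorfs.nfs_data
    WHERE batch_number IS NULL
    ORDER BY file_id
    LIMIT 1000
    FOR UPDATE
)
UPDATE gorfs.nfs_data d
SET batch_number = nb.bn
FROM nb, b
WHERE d.batch_number IS NULL
  AND d.file_id = b.file_id;

Run the statement repeatedly, from one or several sessions, until it reports UPDATE 0.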