Thread: “Loop” in plpgsql Function - PostgreSQL 9.2

“Loop” in plpgsql Function - PostgreSQL 9.2

From
"drum.lucas@gmail.com"
Date:

I've created a function that allows me to do an huge update.

But I need to limit this function. I need to do 50k rows (example) and then stop it. After that I need to continue from the rows that I've stopped... I'll have to modify the call function select batch_number() as well.

How can I do that? Using for?

The function below is already working, but, the table has 40m rows. And it's taking years.

FUNCTION:

CREATE or REPLACE FUNCTION batch_number()
RETURNS INTEGER AS $$

declare  batch_num integer;  offset_num integer;
begin   offset_num = 0;   batch_num = 1;
   while (select true from gorfs.nfs_data where batch_number is null limit 1) loop       with ids(id) as       (           select               file_id           from               gorfs.nfs_data           order by               file_id           offset offset_num           limit 1000       )       update           gorfs.nfs_data       set           batch_number = batch_num       from ids       where file_id = ids.id;
       offset_num = offset_num + 1000;       batch_num = batch_num + 1;   end loop;
   return batch_num;
end

$$ language 'plpgsql';

Re: [GENERAL] “Loop” in plpgsql Function - PostgreSQL 9.2

From
"David G. Johnston"
Date:
On Mon, Jan 18, 2016 at 2:37 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:

I've created a function that allows me to do an huge update.

But I need to limit this function. I need to do 50k rows (example) and then stop it. After that I need to continue from the rows that I've stopped... I'll have to modify the call function select batch_number() as well.

How can I do that? Using for?

It is customary to restrict emails to a single list or at least make others aware when you do have a legitmate need to cross-post.

Specifically your post on -admin <“Break” in plpgsql Function - PostgreSQL 9.2>

Yes, it didn't belong on -admin in the first place but posting it here with a different title and not indicating on either thread that the other exists and/or is the valid one just makes it harder for others to follow along.


As for your general question I try to approach this problem in the following manner:

SELECT however many of something that you need (FOR UPDATE)
UPDATE those selected to indicate that they have been seen
PROCESS them as needed
​repeat step 1 until it returns no records

​It doesn't always work - and given a sufficiently large number of records it may be unadvisable - but it is set-oriented which is generally a plus in SQL.

The other way to assign batches is to use the integer modulus operator (e.g., 10 % 3 = 1 : read 3 remainder of 1) or integer division (10 / 3 = 3) to derive the batch number based upon an attribute of the data itself as opposed to its order of appearance in a result set.

David J.


 

Re: [GENERAL] “Loop” in plpgsql Function - PostgreSQL 9.2

From
"drum.lucas@gmail.com"
Date:
It is customary to restrict emails to a single list or at least make others aware when you do have a legitmate need to cross-post.

Specifically your post on -admin <“Break” in plpgsql Function - PostgreSQL 9.2>

Yes, it didn't belong on -admin in the first place but posting it here with a different title and not indicating on either thread that the other exists and/or is the valid one just makes it harder for others to follow along.


Sorry about that. It wasn't on purpose... It won't happen again.
 

As for your general question I try to approach this problem in the following manner:

SELECT however many of something that you need (FOR UPDATE)
UPDATE those selected to indicate that they have been seen
PROCESS them as needed
​repeat step 1 until it returns no records

​It doesn't always work - and given a sufficiently large number of records it may be unadvisable - but it is set-oriented which is generally a plus in SQL.

The other way to assign batches is to use the integer modulus operator (e.g., 10 % 3 = 1 : read 3 remainder of 1) or integer division (10 / 3 = 3) to derive the batch number based upon an attribute of the data itself as opposed to its order of appearance in a result set.

David J.


 

Re: [GENERAL] “Loop” in plpgsql Function - PostgreSQL 9.2

From
dinesh kumar
Date:
Hello,

On Mon, Jan 18, 2016 at 1:37 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:

I've created a function that allows me to do an huge update.

But I need to limit this function. I need to do 50k rows (example) and then stop it. After that I need to continue from the rows that I've stopped... I'll have to modify the call function select batch_number() as well.

How can I do that? Using for?

The function below is already working, but, the table has 40m rows. And it's taking years.


Do you need to run the function on any Staging(Not Production).  I mean, do you want to run this batch processes on a single transaction.

If not, I had the similar problem, where I needed to implement a function, which we can run in multiple sessions. I ran this function in one of the BI servers, where we have around 5 Million records.

Find this link about the implementation details.

If your question was about "Using Loops", then please ignore my comments.

FUNCTION:

CREATE or REPLACE FUNCTION batch_number()
RETURNS INTEGER AS $$

declare  batch_num integer;  offset_num integer;
begin   offset_num = 0;   batch_num = 1;
   while (select true from gorfs.nfs_data where batch_number is null limit 1) loop       with ids(id) as       (           select               file_id           from               gorfs.nfs_data           order by               file_id           offset offset_num           limit 1000       )       update           gorfs.nfs_data       set           batch_number = batch_num       from ids       where file_id = ids.id;
       offset_num = offset_num + 1000;       batch_num = batch_num + 1;   end loop;
   return batch_num;
end

$$ language 'plpgsql';



--

Re: [GENERAL] “Loop” in plpgsql Function - PostgreSQL 9.2

From
"drum.lucas@gmail.com"
Date:
On 19 January 2016 at 11:44, dinesh kumar <dineshkumar02@gmail.com> wrote:
Hello,

On Mon, Jan 18, 2016 at 1:37 PM, drum.lucas@gmail.com <drum.lucas@gmail.com> wrote:

I've created a function that allows me to do an huge update.

But I need to limit this function. I need to do 50k rows (example) and then stop it. After that I need to continue from the rows that I've stopped... I'll have to modify the call function select batch_number() as well.

How can I do that? Using for?

The function below is already working, but, the table has 40m rows. And it's taking years.


Do you need to run the function on any Staging(Not Production).  I mean, do you want to run this batch processes on a single transaction.

I don't want to run this batch in a single transaction. I'm already doing that and, as it has 40 million rows, it's taking years.

So, the new plan is:

1 - Select 50.000 rows and gives it a batch number.
2 - Select MORE 50,000 rows and gives it a NEW batch number.
3 - Select MORE 50,000 rows and gives it a NEW batch number.
4 - etc  etc etc
 

If not, I had the similar problem, where I needed to implement a function, which we can run in multiple sessions. I ran this function in one of the BI servers, where we have around 5 Million records.

Find this link about the implementation details.

If your question was about "Using Loops", then please ignore my comments.

Re: “Loop” in plpgsql Function -PostgreSQL 9.2

From
Harald Fuchs
Date:
"drum.lucas@gmail.com" <drum.lucas@gmail.com> writes:

> So, the new plan is:
>
> 1 - Select 50.000 rows and gives it a batch number.
> 2 - Select *MORE* 50,000 rows and gives it a *NEW* batch number.
> 3 - Select *MORE* 50,000 rows and gives it a *NEW* batch number.

Why so complicated?  Here's a simplified example:

  CREATE TABLE mytable (
    id serial NOT NULL,
    payload int NOT NULL,
    batch_number int NULL,
    PRIMARY KEY (id)
  );

  INSERT INTO mytable (payload)
  SELECT x FROM generate_series(1, 2000) AS g(x);

  \set batchSize 600

  UPDATE mytable
  SET batch_number = (id % (SELECT count(*) FROM mytable) / :batchSize) + 1;

  SELECT batch_number, count(*) AS cnt
  FROM mytable
  GROUP BY batch_number
  ORDER BY batch_number;

Re: “Loop” in plpgsql Function - PostgreSQL 9.2

From
mariusz
Date:
hello,


On Tue, 2016-01-19 at 10:37 +1300, drum.lucas@gmail.com wrote:
> I've created a function that allows me to do an huge update.
>
> But I need to limit this function. I need to do 50k rows (example) and
> then stop it. After that I need to continue from the rows that I've
> stopped... I'll have to modify the call function select
> batch_number() as well.
>
> How can I do that? Using for?
>
>         The function below is already working, but, th
>         e table has 40m rows. And it's taking years.
>
>
> FUNCTION:
>
> CREATE or REPLACE FUNCTION batch_number()
> RETURNS INTEGER AS $$
>
> declare
>    batch_num integer;
>    offset_num integer;
> begin
>     offset_num = 0;
>     batch_num = 1;
>
>     while (select true from gorfs.nfs_data where batch_number is null limit 1) loop
>         with ids(id) as
>         (
>             select
>                 file_id
>             from
>                 gorfs.nfs_data
>             order by
>                 file_id
>             offset offset_num
>             limit 1000
>         )
>         update
>             gorfs.nfs_data
>         set
>             batch_number = batch_num
>         from ids
>         where file_id = ids.id;
>
>         offset_num = offset_num + 1000;
>         batch_num = batch_num + 1;
>     end loop;
>
>     return batch_num;
> end
>
> $$ language 'plpgsql';

do i read correctly you want to do something like that:

WITH b(n,i) AS
  (SELECT row_number() OVER (ORDER BY file_id), file_id
     FROM gorfs.nfs_data)
UPDATE gorfs.nfs_data SET batch_number = ((b.n-1)/1000)+1
  FROM b WHERE file_id = b.i

apply math to match your desired result, this one based on your fun.

which should not scan that many times for batch_number being null. i
don't know implementation details and i do not know if you have any
partial index where batch_number is null which could be used to speed
this up.

of cource this is for one run to change all at once.
but if you really want multiple tx on this how about that:

CREATE INDEX ON gorfs.nfs_data (file_id) WHERE batch_number IS NULL;

find next batchnum to use eg max(batch_number)+1 from gorfs.nfs_data,
or some sequence, locking or whatever if concurency migth byte you.
let it be YourNextBN

WITH b(i) AS
  (SELECT file_id
     FROM gorfs.nfs_data
     WHERE batch_number IS NULL ORDER BY file_id LIMIT 1000
     FOR UPDATE)
UPDATE gorfs.nfs_data SET batch_number = YourNextBN
  FROM b WHERE batch_number IS NULL AND file_id = b.i

now, wouldn't it be little simpler then your loop with multiple scanning
same data and throwing it out via offset?

on the other hand, if you think creating simple small index on file_id
where batch_number is null is not an option, then this may not give you
real speedup

of cource your real life example might be more complicated than one
column update, but looking at your fun, it scared me, i mean, your fun
updates every one row in table anyway, so why so many times with
additional scan to check loop condition? it gets worse if there is no
partial index where batch_number is null.

regards, mariusz