Thread: processing large amount of rows with plpgsql

processing large amount of rows with plpgsql

From: Geert Mak

hello everybody,

we are trying to move the data from table1 into table2 using a plpgsql stored procedure which is performing a simple
data conversion

there are about 50 million rows

the tables are relatively simple, less than a dozen columns, most are integer, a couple are char(32) and one is varchar
holding URLs

what happens is that when we execute the stored procedure, the execution eats up the 17 GB free space on the server and
the server crashes

if we apply a limit of 1 million rows, the execution completes successfully in about a minute

we understand, by now, that plpgsql functions are executed in their own transaction, which commits when the function
comes to an end

the question is -

are plpgsql functions suitable for such massive operations on data, and can this work without so much disk space
being eaten for something which should simply be "read-change-write, read-change-write, read-change-write, …"? i.e.
is there any way to force a commit inside, or so?

or should we rather implement this operation in some external scripting language (speed is not that important, this is
a one time conversion)

thank you,
geert

Re: processing large amount of rows with plpgsql

From: Merlin Moncure

On Wed, Aug 8, 2012 at 2:41 PM, Geert Mak <pobox@verysmall.org> wrote:
> hello everybody,
>
> we are trying to move the data from table1 into table2 using a plpgsql stored procedure which is performing a simple
> data conversion
>
> there are about 50 million rows
>
> the tables are relatively simple, less than a dozen columns, most are integer, a couple are char(32) and one is
> varchar holding URLs
>
> what happens is that when we execute the stored procedure, the execution eats up the 17 GB free space on the server
> and the server crashes
>
> if we apply a limit of 1 million rows, the execution completes successfully in about a minute
>
> we understand, by now, that plpgsql functions are executed in their own transaction, which commits when the function
> comes to an end
>
> the question is -
>
> are plpgsql functions suitable for such massive operations on data, and can this work without so much disk space
> being eaten for something which should simply be "read-change-write, read-change-write, read-change-write, …"? i.e.
> is there any way to force a commit inside, or so?
>
> or should we rather implement this operation in some external scripting language (speed is not that important, this
> is a one time conversion)

What is the general structure of the procedure?  In particular, how
are you browsing and updating the rows?  There is (almost) no way to
force commit inside a function -- there has been some discussion about
stored procedure and/or autonomous transaction feature in terms of
getting there.

I say 'almost' because you can emulate some aspects of autonomous
transactions with dblink, but that may not be a very good fit for your
particular case.
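
To illustrate, here is a minimal sketch of the dblink emulation (the function name, the connection string, the
destination table "table2", the integer key "id" and the batch size are all assumptions for illustration;
dblink_exec() runs each command on its own connection, so every batch commits independently of the calling
function's transaction):

-- requires the dblink extension: CREATE EXTENSION dblink;
CREATE OR REPLACE FUNCTION convert_in_batches() RETURNS void AS $$
DECLARE
    lo    bigint := 0;
    hi    bigint;
    batch bigint := 1000000;  -- rows per committed batch, arbitrary choice
BEGIN
    SELECT max(id) INTO hi FROM statistics;  -- "id" stands in for your key column
    WHILE lo <= hi LOOP
        -- executed and committed on a separate connection, outside this transaction;
        -- SELECT * is a placeholder for the query that applies the minimal transformation
        PERFORM dblink_exec('dbname=mydb',
            format('INSERT INTO table2 SELECT * FROM statistics WHERE id > %s AND id <= %s',
                   lo, lo + batch));
        lo := lo + batch;
    END LOOP;
END;
$$ LANGUAGE plpgsql;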

merlin

Re: processing large amount of rows with plpgsql

From: Geert Mak

On 08.08.2012, at 22:04, Merlin Moncure wrote:

> What is the general structure of the procedure?  In particular, how
> are you browsing and updating the rows?

Here it is -

BEGIN
for statistics_row in SELECT * FROM statistics ORDER BY time ASC
LOOP
    ...
    ... here some very minimal transformation is done
    ... and the row is written into the second table
    ...
END LOOP;
RETURN 1;
END;

> There is (almost) no way to
> force commit inside a function --

So what you are saying is that this behavior is normal and we should either equip ourselves with enough disk space
(which I am trying now, it is a cloud server, which I am resizing to gain more disk space and see what will happen) or
do it with an external (scripting) language?

> there has been some discussion about
> stored procedure and/or autonomous transaction feature in terms of
> getting there.
>
> I say 'almost' because you can emulate some aspects of autonomous
> transactions with dblink, but that may not be a very good fit for your
> particular case.

I have already seen dblink mentioned in this context somewhere... Though if plpgsql performs well with more disk space,
I'll leave it for now. It is a one-time operation, this one.

Thank you,
Geert

Re: processing large amount of rows with plpgsql

From: "Marc Mamin"

> > There is (almost) no way to
> > force commit inside a function --
>
> So what you are saying is that this behavior is normal and we should
> either equip ourselves with enough disk space (which I am trying now,
> it is a cloud server, which I am resizing to gain more disk space and
> see what will happen) or do it with an external (scripting) language?
>

Hello,

a relatively simple way to work around your performance/resource problem
is to slice the update.

e.g.:

create function myupdate(slice int) ...

for statistics_row in
   SELECT * FROM statistics
   WHERE id % 16 = slice
   or:
   WHERE hashtext(id::text) % 16 = slice
   ...

and then call your function with the values 1..15 (when using 16 slices)

Use a power of 2 for the number of slices.

It may be faster to use many slices, and this also lets you do the job
in parallel on a few threads.
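
Putting that together, a minimal self-contained sketch could look like the following (the destination table
"table2" is an assumption, "id" is assumed to be an integer key, and the transformation itself is left as a
placeholder); each call then processes one slice in its own transaction:

CREATE OR REPLACE FUNCTION myupdate(slice int) RETURNS bigint AS $$
DECLARE
    statistics_row statistics%ROWTYPE;
    n bigint := 0;
BEGIN
    FOR statistics_row IN
        SELECT * FROM statistics
        WHERE id % 16 = slice   -- assumes an integer key; otherwise use the hashtext() variant above
        ORDER BY time ASC
    LOOP
        -- minimal transformation goes here, then the row is written into the second table
        INSERT INTO table2 SELECT (statistics_row).*;
        n := n + 1;
    END LOOP;
    RETURN n;
END;
$$ LANGUAGE plpgsql;

-- one slice per call, i.e. one transaction per slice; the calls can also be
-- run from several sessions in parallel:
SELECT myupdate(0);
SELECT myupdate(1);
-- ... and so on for each remaining slice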

HTH,

Marc Mamin

Re: processing large amount of rows with plpgsql

From: "Marc Mamin"

oops,

> and then call your function with the values 1..15 (when using 16

it should of course be 0..15

Marc Mamin


