Thread: ToDo: fast update of arrays with fixed length fields for PL/pgSQL
Hello
this proposal is related to the reported issue
http://www.postgresql.org/message-id/E1VQuta-0007Y4-Ip@wrigleys.postgresql.org

We can directly modify any field of int, float, or double arrays (when the result size is immutable). This trick is already used to accelerate some aggregates.
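For concreteness, this is the kind of statement the optimization targets - a minimal sketch of mine, not taken from the referenced report (the variable name and sizes are made up):

DO $$
DECLARE
  a double precision[] := array_fill(0.0::double precision, ARRAY[1000]);
BEGIN
  -- Updating one element of a fixed-length-element array: the array's overall
  -- size cannot change, so (for a local copy) the element could be patched in
  -- place instead of rebuilding the whole array.
  a[500] := 3.14;
END;
$$;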
Ideas, comments?
Regards
Pavel
Hi,

On 2013-10-02 18:56:51 +0200, Pavel Stehule wrote:
> Hello
>
> this proposal is related to the reported issue
> http://www.postgresql.org/message-id/E1VQuta-0007Y4-Ip@wrigleys.postgresql.org
>
> We can directly modify any field of int, float, or double arrays (when the
> result size is immutable). This trick is already used to accelerate some
> aggregates.
>
> Ideas, comments?

A specific array might be located directly on a buffer - starting to
manipulate them in place will lead to havoc in such scenarios. I don't
think we have the infrastructure for detecting such cases.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2013/10/2 Andres Freund <andres@2ndquadrant.com>

> On 2013-10-02 18:56:51 +0200, Pavel Stehule wrote:
> > this proposal is related to the reported issue
> > http://www.postgresql.org/message-id/E1VQuta-0007Y4-Ip@wrigleys.postgresql.org
> >
> > We can directly modify any field of int, float, or double arrays (when the
> > result size is immutable). This trick is already used to accelerate some
> > aggregates.
> >
> > Ideas, comments?
>
> A specific array might be located directly on a buffer - starting to
> manipulate them in place will lead to havoc in such scenarios. I don't
> think we have the infrastructure for detecting such cases.
If you can update an array in plpgsql now, you are already working with a local copy only. That is a necessary precondition, and I think it holds.
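A minimal sketch (the table and names here are mine, not from the thread) of what "local copy only" means in practice - an element assignment inside PL/pgSQL touches only the variable, never the value it was copied from:

CREATE TABLE arr_demo (a int[]);            -- hypothetical demo table
INSERT INTO arr_demo VALUES (ARRAY[1,2,3]);

DO $$
DECLARE
  v int[];
BEGIN
  SELECT a INTO v FROM arr_demo;   -- v is a private copy of the stored value
  v[1] := 100;                     -- modifies only the local copy
  RAISE NOTICE 'local copy: %', v; -- {100,2,3}
END;
$$;

SELECT a FROM arr_demo;            -- still {1,2,3}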
Pavel
Pavel Stehule <pavel.stehule@gmail.com> writes:
> If you can update an array in plpgsql now, you are already working with a
> local copy only. That is a necessary precondition, and I think it holds.

If the proposal only relates to assignments to elements of plpgsql local
variables, it's probably safe, but it's also probably not of much value.
plpgsql has enough overhead that I'm doubting you'd get much real-world
speedup. I'm also not very excited about putting even more low-level
knowledge about array representation into plpgsql.

			regards, tom lane
2013/10/3 Tom Lane <tgl@sss.pgh.pa.us>
Pavel Stehule <pavel.stehule@gmail.com> writes:
> If you can update an array in plpgsql now, you are already working with a
> local copy only. That is a necessary precondition, and I think it holds.

If the proposal only relates to assignments to elements of plpgsql local
variables, it's probably safe, but it's also probably not of much value.
plpgsql has enough overhead that I'm doubting you'd get much real-world
speedup. I'm also not very excited about putting even more low-level
knowledge about array representation into plpgsql.
I looked at the code, and I think this can be done inside the array-related routines. We just have to signal a request for an in-place update (when we hold a local copy).
I have no idea how significant the speedup will be (if any), but the current behaviour is not friendly (and for multidimensional arrays there is no workaround), so it is an interesting direction - I have been thinking about a similar optimization for a long time.
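For illustration (a sketch of mine, not from the original mail): a genuinely multidimensional update like the one below has no unnest/re-aggregate style workaround in PL/pgSQL, and today every single assignment rebuilds the whole array.

DO $$
DECLARE
  m double precision[][] := array_fill(0.0::double precision, ARRAY[100, 100]);
  i int;
  j int;
BEGIN
  FOR i IN 1..100 LOOP
    FOR j IN 1..100 LOOP
      m[i][j] := (i * j)::double precision;  -- full copy of m on each assignment
    END LOOP;
  END LOOP;
END;
$$;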
Regards
Pavel
Hello
a very ugly test shows the possibility of roughly a 100% speedup on the reported example (on small arrays; the patch is still buggy and doesn't work for larger arrays). I updated the code to be read-only:
CREATE OR REPLACE FUNCTION public.fill_2d_array(rows integer, cols integer)
RETURNS integer
LANGUAGE plpgsql
AS $function$
DECLARE
img double precision[][];
i integer; j integer;
cont integer; r double precision;
BEGIN
img := ARRAY( SELECT 0 FROM generate_series(1, rows * cols) ) ;
cont := 0;
FOR i IN 1..rows LOOP
  FOR j IN 1..cols LOOP
    r := img[i * cols + j];
    r := (i * cols + j)::double precision;
    cont := cont + 1;   --raise notice '%', img;
  END LOOP;
END LOOP;
return cont;
END;
$function$
It still executes all the expressions - only the assignment into the array element is removed.
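For reference, the "original" timed below is not included in this mail; presumably it differs only in that the inner loop assigns into the array element, roughly like this (my reconstruction, not copied from the report):

FOR i IN 1..rows LOOP
  FOR j IN 1..cols LOOP
    img[i * cols + j] := (i * cols + j)::double precision;  -- rebuilds img on every assignment
    cont := cont + 1;
  END LOOP;
END LOOP;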
-- original
postgres=# select fill_2d_array(200,200);
fill_2d_array
---------------
40000
(1 row)
Time: 12726.117 ms
-- read only version
postgres=# select fill_2d_array(200,200);
fill_2d_array
---------------
40000
(1 row)
Time: 245.894 ms
so the array assignment accounts for about a 50x slowdown relative to the read-only version
Hello
I am sending a slightly cleaned-up patch.

postgres=# select fill_2d_array(300,300,1);
fill_2d_array
───────────────
90000
(1 row)
Time: 751.572 ms -- patched
postgres=# \q
bash-4.1$ psql postgres
psql (9.4devel)
Type "help" for help.
postgres=# select fill_2d_array(300,300,2);
fill_2d_array
───────────────
90000
(1 row)
Time: 87453.351 ms -- original
Hello
I fixed the patch - there was a missing cast to domain when one was used (and all regression tests pass now).

postgres=# set array_fast_update to off;
postgres=# select fill_2d_array(300,300);
fill_2d_array
───────────────
90000
(1 row)
Time: 33570.087 ms
postgres=# set array_fast_update to on;
SET
Time: 0.279 ms
postgres=# select fill_2d_array(300,300);
fill_2d_array
───────────────
90000
(1 row)
Time: 505.202 ms
CREATE OR REPLACE FUNCTION public.quicksort(l integer, r integer, a integer[])
RETURNS integer[]
LANGUAGE plpgsql
IMMUTABLE STRICT
AS $function$
DECLARE
  akt int[] := a;
  i integer := l;
  j integer := r;
  x integer := akt[(l + r) / 2];
  w integer;
BEGIN
  LOOP
    WHILE akt[i] < x LOOP i := i + 1; END LOOP;
    WHILE x < akt[j] LOOP j := j - 1; END LOOP;
    IF i <= j THEN
      w := akt[i];
      akt[i] := akt[j];
      akt[j] := w;
      i := i + 1;
      j := j - 1;
    END IF;
    EXIT WHEN i > j;
  END LOOP;
  IF l < j THEN akt := quicksort(l, j, akt); END IF;
  IF i < r THEN akt := quicksort(i, r, akt); END IF;
  RETURN akt;
END;
$function$
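The "test" table used in the timings below is not shown in the thread; a setup along these lines (my guess - 21000 random integers in a column named a) would reproduce the shape of the test:

CREATE TABLE test AS
  SELECT (random() * 1000000)::int AS a
  FROM generate_series(1, 21000);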
postgres=# set array_fast_update to off;
SET
Time: 0.282 ms
postgres=# SELECT array_upper(quicksort(1,21000,array_agg(a)),1) FROM test;
array_upper
─────────────
21000
(1 row)
Time: 5086.781 ms
postgres=# set array_fast_update to on;
SET
Time: 0.702 ms
postgres=# SELECT array_upper(quicksort(1,21000,array_agg(a)),1) FROM test;
array_upper
─────────────
21000
(1 row)
Time: 1751.920 ms
This code can still be cleaned up (domains could be optimized better in general - an I/O cast can be expensive for large arrays, and domain_check could be used there instead), but it is good enough for discussing whether we want this optimization or not.
Regards
Pavel
On 2013-10-07 16:00:54 +0200, Pavel Stehule wrote:
> /*
>  * We need to do subscript evaluation, which might require
> @@ -4321,6 +4322,14 @@ exec_assign_value(PLpgSQL_execstate *estate,
> 	oldarrayval = (ArrayType *) DatumGetPointer(oldarraydatum);
>
> 	/*
> +	 * support fast update for array scalar variable is enabled only
> +	 * when target is a scalar variable and variable holds a local
> +	 * copy of some array.
> +	 */
> +	inplace_update = (((PLpgSQL_datum *) target)->dtype == PLPGSQL_DTYPE_VAR
> +					  && ((PLpgSQL_var *) target)->freeval);
> +
> +	/*
> 	 * Build the modified array value.
> 	 */

Will this recognize if the local Datum is just a reference to an
external toast Datum (i.e. varattrib_1b_e)? I don't know much about
plpgsql's implementation, so please excuse if the question is stupid.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2013/10/7 Andres Freund <andres@2ndquadrant.com>
On 2013-10-07 16:00:54 +0200, Pavel Stehule wrote:
> /*
> * We need to do subscript evaluation, which might require
> @@ -4321,6 +4322,14 @@ exec_assign_value(PLpgSQL_execstate *estate,
> oldarrayval = (ArrayType *) DatumGetPointer(oldarraydatum);
>
> /*
> + * support fast update for array scalar variable is enabled only
> + * when target is a scalar variable and variable holds a local
> + * copy of some array.
> + */
> + inplace_update = (((PLpgSQL_datum *) target)->dtype == PLPGSQL_DTYPE_VAR
> + && ((PLpgSQL_var *) target)->freeval);
> +
> + /*
> * Build the modified array value.
> */
Will this recognize if the local Datum is just a reference to an
external toast Datum (i.e. varattrib_1b_e)?
this works on plpgsql local copies only (where the copy's cleanup is controlled by plpgsql) - so it should already be detoasted every time.
btw, when I tested this patch I found a second plpgsql array issue - repeated detoasting :( - and we should not forget it.
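A hypothetical illustration of that second issue (the function and names below are mine, not from the thread): when the array value passed in is toasted, every element access in a loop like this can end up detoasting the whole value again, which dominates the runtime for large arrays.

CREATE OR REPLACE FUNCTION sum_elems(a numeric[])
RETURNS numeric
LANGUAGE plpgsql
AS $$
DECLARE
  s numeric := 0;
  i int;
BEGIN
  FOR i IN 1 .. coalesce(array_upper(a, 1), 0) LOOP
    s := s + a[i];   -- each subscript read works against the (possibly toasted) datum
  END LOOP;
  RETURN s;
END;
$$;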
Hi,
I started reviewing this patch. I have gone through the mail thread discussion to
understand the need for the patch. With the patch there is a significant performance
improvement for updates of array scalar variables.
- Patch applies cleanly on the master branch
- make, make install and make check work fine
I did a code walk-through; the code change looks good to me. I did some testing
in the area of array scalar variables and domain types. For big array sizes
I noticed a significant performance improvement. So far I haven't found any issues
with the code.
Patch looks good to me.
Just FYI: the array_fast_update GUC needs to be removed in the final version
before commit.
Regards,
Rushabh
Rushabh Lathia
(sorry for possible duplicates, used wrong account on first try)

On 07.10.2013 17:00, Pavel Stehule wrote:
> Hello
>
> I fixed the patch - there was a missing cast to domain when one was used
> (and all regression tests pass now).

This doesn't work correctly for varlen datatypes. I modified your quicksort
example to work with text[], and got this:

postgres=# SELECT array_upper(quicksort(1,20000,array_agg((g::text))),1) FROM generate_series(1,20000) g;
ERROR: invalid input syntax for integer: ""
CONTEXT: PL/pgSQL function quicksort(integer,integer,text[]) line 10 at assignment
PL/pgSQL function quicksort(integer,integer,text[]) line 16 at assignment

The handling of array domains is also wrong. You put the new value into the
array and only check the domain constraint after that. If the constraint is
violated, it's too late to roll that back. For example:

postgres=# create domain posarray as int[] check (value[1] > 0);
CREATE DOMAIN
postgres=# do $$
declare
  iarr posarray;
begin
  begin
    iarr[1] := 1;
    iarr[1] := -1;
  exception when others then
    raise notice 'got error: %', sqlerrm;
  end;
  raise notice 'value: %', iarr[1];
end;
$$;
NOTICE: got error: value for domain posarray violates check constraint "posarray_check"
NOTICE: value: -1
DO

Without the patch, it prints 1, but with the patch, -1.

In general, I'm not convinced this patch is worth the trouble. The speedup
isn't all that great; manipulating large arrays in PL/pgSQL is still so slow
that if you care about performance you'll want to rewrite your function in
some other language or use temporary tables. And you only get a gain with
arrays of fixed-length element type with no NULLs.

So I think we should drop this patch in its current form. If we want to make
array manipulation in PL/pgSQL faster, I think we'll have to do something
similar to how we handle "row" variables, or something else entirely. But
that's a much bigger patch, and I'm not sure it's worth the trouble either.

Looking at the perf profile and the functions involved, though, I see some
low-hanging fruit: in array_set, the new array is allocated with palloc0, but
we only need to zero out a few alignment bytes where the new value is copied.
Replacing it with palloc() will save some cycles, per the attached patch.
Nowhere near as much as your patch, but this is pretty uncontroversial.

- Heikki
Heikki Linnakangas <heikki@localhost.vmware.com> writes:
> In general, I'm not convinced this patch is worth the trouble. The
> speedup isn't all that great; manipulating large arrays in PL/pgSQL is
> still so slow that if you care about performance you'll want to rewrite
> your function in some other language or use temporary tables. And you
> only get a gain with arrays of fixed-length element type with no NULLs.

> So I think we should drop this patch in its current form. If we want to
> make array manipulation in PL/pgSQL faster, I think we'll have to do
> something similar to how we handle "row" variables, or something else
> entirely.

I think that this area would be a fruitful place to make use of the
noncontiguous datatype storage ideas that we were discussing with the
PostGIS guys recently. I agree that tackling it in the context of plpgsql
alone is not a good way to go at it.

I'm not saying this in a vacuum of information, either. Some of the guys
at Salesforce have been poking at noncontiguous storage for arrays and
have gotten nice speedups --- but their patch is for plpgsql only and
only addresses arrays, which makes it enough of a kluge that I've not
wanted to bring it to the community. I think we should work towards
a general solution instead.

			regards, tom lane
2013/11/25 Heikki Linnakangas <heikki@localhost.vmware.com>
On 07.10.2013 17:00, Pavel Stehule wrote:
> Hello
>
> I fixed the patch - there was a missing cast to domain when one was used
> (and all regression tests pass now).

This doesn't work correctly for varlen datatypes. I modified your quicksort
example to work with text[], and got this:
postgres=# SELECT array_upper(quicksort(1,20000,array_agg((g::text))),1) FROM generate_series(1,20000) g;
ERROR: invalid input syntax for integer: " "
CONTEXT: PL/pgSQL function quicksort(integer,integer,text[]) line 10 at assignment
PL/pgSQL function quicksort(integer,integer,text[]) line 16 at assignment
The handling of array domains is also wrong. You put the new value into the array and only check the domain constraint after that. If the constraint is violated, it's too late to roll that back. For example:
Both are bugs, and should be fixed.
postgres=# create domain posarray as int[] check (value[1] > 0);
CREATE DOMAIN
postgres=# do $$
declare
iarr posarray;
begin
begin
iarr[1] := 1;
iarr[1] := -1;
exception when others then raise notice 'got error: %', sqlerrm; end;
raise notice 'value: %', iarr[1];
end;
$$;
NOTICE: got error: value for domain posarray violates check constraint "posarray_check"
NOTICE: value: -1
DO
Without the patch, it prints 1, but with the patch, -1.
In general, I'm not convinced this patch is worth the trouble. The speedup isn't all that great; manipulating large arrays in PL/pgSQL is still so slow that if you care about performance you'll want to rewrite your function in some other language or use temporary tables. And you only get a gain with arrays of fixed-length element type with no NULLs.
I understand your objection, but I disagree that a particular solution is not worthwhile. An update of a fixed-length-element array should be fast - there is no workaround for this task in plpgsql, so using an array as a fast cache for a calculation has some (sometimes very high) hidden costs. Very advanced users can use a different PL or prepare a C extension, but for common users this cost is hidden and unsolvable.
So I think we should drop this patch in its current form. If we want to make array manipulation in PL/pgSQL faster, I think we'll have to do something similar to how we handle "row" variables, or something else entirely. But that's a much bigger patch, and I'm not sure it's worth the trouble either.
Sorry, I disagree again - using the same mechanism for arrays of fixed-length types and arrays of varlena types would be ineffective. The slow part of this area is the manipulation of large memory blocks - there we are fighting physical CPU and memory-access limits.
Now, I know about a few significant PL/pgSQL issues related to the handling of variables:
* repeated detoasting
* missing preallocation (see the sketch below)
* low effectiveness when manipulating large arrays
* low effectiveness of casting (I/O casting is used instead of binary casting)
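A sketch of the "missing preallocation" item (mine, not from the thread): growing an array element by element copies the whole array on every iteration, so the loop below is quadratic in the final array size.

DO $$
DECLARE
  arr int[] := '{}';
  i int;
BEGIN
  FOR i IN 1..100000 LOOP
    arr := arr || i;   -- each append re-copies the whole of arr
  END LOOP;
  RAISE NOTICE 'len = %', array_length(arr, 1);
END;
$$;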
A final solution to these issues needs a redesign of how PL/pgSQL works with variables - maybe somewhat in the Perl style. It would be nice, but I don't know of anybody who will work on this soon.
I would like to concentrate on getting plpgsql_check integrated into core - and maybe later I can work on this topic (maybe in two years).
So any particular solution (it need not be mine) can serve well for two or three years. This code is 100% hidden from the surface, and we can change it safely, so a particular solution can be enough.
Regards
Pavel
2013/11/25 Tom Lane <tgl@sss.pgh.pa.us>
Heikki Linnakangas <heikki@localhost.vmware.com> writes:
> In general, I'm not convinced this patch is worth the trouble. The
> speedup isn't all that great; manipulating large arrays in PL/pgSQL is
> still so slow that if you care about performance you'll want to rewrite
> your function in some other language or use temporary tables. And you
> only get a gain with arrays of fixed-length element type with no NULLs.
> So I think we should drop this patch in its current form. If we want to
> make array manipulation in PL/pgSQL faster, I think we'll have to do
> something similar to how we handle "row" variables, or something else
> entirely.

I think that this area would be a fruitful place to make use of the
noncontiguous datatype storage ideas that we were discussing with the
PostGIS guys recently. I agree that tackling it in the context of plpgsql
alone is not a good way to go at it.
I'm not saying this in a vacuum of information, either. Some of the guys
at Salesforce have been poking at noncontiguous storage for arrays and
have gotten nice speedups --- but their patch is for plpgsql only and
only addresses arrays, which makes it enough of a kluge that I've not
wanted to bring it to the community. I think we should work towards
a general solution instead.
I am for a general solution (because these issues are nasty performance traps), but an early particular solution can be valuable for a lot of users too - mainly if the general solution is on a two- to three-year horizon.
Regards
Pavel