Thread: Fix for PL/Python slow input arrays traversal issue

Fix for PL/Python slow input arrays traversal issue

From
Alexey Grishchenko
Date:
Hi

The following issue exists in PL/Python: when a function takes an array as an input parameter, processing an array of fixed-size elements that contains null values is many times slower than processing the same array without nulls. Here is an example:
-- Function
create or replace function test(a int8[]) returns int8 as $BODY$
return sum([x for x in a if x is not None])
$BODY$ language plpythonu volatile;

pl_regression=# select test(array_agg(a)::int8[])
pl_regression-#     from (
pl_regression(#         select generate_series(1,100000) as a
pl_regression(#         ) as q;
    test    
------------
 5000050000
(1 row)

Time: 22.248 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-#     from (
pl_regression(#         select generate_series(1,100000) as a
pl_regression(#         union all
pl_regression(#         select null::int8 as a
pl_regression(#         ) as q;
    test    
------------
 5000050000
(1 row)

Time: 7179.921 ms

As you can see, a single null in the array introduces a roughly 320x slowdown. The reason is as follows:
The original implementation calls array_ref for each element of the array. Each call to array_ref in turn calls array_seek. array_seek has a shortcut for arrays of fixed-size elements with no nulls, but if the array's elements are not fixed-size, or if the array contains nulls, each call to array_seek computes the offset of the Kth element by walking from the first element. This makes the traversal an O(N^2) algorithm, which results in long processing times for arrays of variable-size elements and for arrays with nulls.
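
To make the quadratic cost concrete, here is a rough Python model of the per-element lookup. It is not the PostgreSQL source: the function names, the list-of-flags null bitmap, and the fixed 8-byte element size are assumptions made for illustration only. It counts how many elements have to be walked when every lookup restarts from the first element.

```python
def seek_offset(nulls, elem_size, k):
    """Hypothetical stand-in for a seek that restarts from element 0.

    nulls is a list of booleans (True means the element is null and
    occupies no space in the data area). Returns the byte offset of
    element k and the number of elements walked to find it.
    """
    offset = 0
    scanned = 0
    for i in range(k):
        if not nulls[i]:
            offset += elem_size  # only non-null elements take up space
        scanned += 1
    return offset, scanned


def total_scans_per_element_lookup(nulls, elem_size=8):
    """Look up every element independently and sum the work done."""
    total = 0
    for k in range(len(nulls)):
        _, scanned = seek_offset(nulls, elem_size, k)
        total += scanned
    return total


if __name__ == "__main__":
    n = 1000
    # In this model the total work is n * (n - 1) / 2 element visits.
    print(total_scans_per_element_lookup([False] * n))
```

In this model the total number of visited elements for N lookups is N(N-1)/2, which is where the O(N^2) behaviour comes from.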

The fix I propose applies the same logic that the array_out function uses for efficient array traversal: it keeps a pointer to the offset of the last fetched element, which results in a dramatic performance improvement for the affected cases. With this implementation, arrays of fixed-size elements without nulls, arrays of fixed-size elements with nulls, and arrays of variable-size elements are all processed at the same speed. Here is the test after this fix is applied:
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-#     from (
pl_regression(#         select generate_series(1,100000) as a
pl_regression(#         ) as q;
    test    
------------
 5000050000
(1 row)

Time: 21.056 ms
pl_regression=# select test(array_agg(a)::int8[])
pl_regression-#     from (
pl_regression(#         select generate_series(1,100000) as a
pl_regression(#         union all
pl_regression(#         select null::int8 as a
pl_regression(#         ) as q;
    test    
------------
 5000050000
(1 row)

Time: 22.839 ms
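
The idea behind the fix, a single pass that remembers the offset of the last fetched element, can be sketched in Python as follows. This is only an illustration under assumed conditions, not the actual patch: the packed layout (a list of null flags plus a buffer holding only the non-null little-endian int64 values) and both function names are hypothetical.

```python
import struct


def pack_int8_array(values):
    """Build a hypothetical packed array: null flags plus a data buffer
    that stores only the non-null values as little-endian int64."""
    nulls = [v is None for v in values]
    data = b"".join(struct.pack("<q", v) for v in values if v is not None)
    return nulls, data


def traverse_once(nulls, data, elem_size=8):
    """Walk the array one time, carrying the running data offset forward
    instead of recomputing it from the start for every element."""
    offset = 0
    out = []
    for is_null in nulls:
        if is_null:
            out.append(None)  # nulls consume no bytes in the data buffer
        else:
            (value,) = struct.unpack_from("<q", data, offset)
            out.append(value)
            offset += elem_size  # advance past the element just read
    return out


if __name__ == "__main__":
    nulls, data = pack_int8_array([1, None, 3])
    print(traverse_once(nulls, data))
```

Each element is visited exactly once, so the cost is linear in the array length whether or not nulls are present.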

--
Best regards,
Alexey Grishchenko

Re: Fix for PL/Python slow input arrays traversal issue

From
Pavel Stehule
Date:
This entry should be closed, because this patch is part of another patch.

The new status of this patch is: Waiting on Author

Re: Fix for PL/Python slow input arrays traversal issue

From
Dave Cramer
Date:
Pavel,

I will pick these up.

Re: Fix for PL/Python slow input arrays traversal issue

From
Dave Cramer
Date:
Yes, this should be closed as it is contained in https://commitfest.postgresql.org/10/697/