Thread: Performance bottleneck due to array manipulation

Performance bottleneck due to array manipulation

From
Genc, Ömer
Date:

Hey,

 

i have a very long running stored procedure, due to array manipulation in a stored procedure. The following procedure takes 13 seconds to finish.

 

BEGIN

    point_ids_older_than_one_hour := '{}';

    object_ids_to_be_invalidated := '{}';

 

    select ARRAY(SELECT

                    point_id

                from ONLY

                    public.ims_point as p

                where

                    p.timestamp < m_before_one_hour

                )

    into point_ids_older_than_one_hour ; -- this array has a size of 20k

 

    select ARRAY(SELECT

                        object_id

                  from

                        public.ims_object_header h

                  WHERE

                        h.last_point_id= ANY(point_ids_older_than_one_hour)

                 )

    into object_ids_to_be_invalidated; -- this array has a size of 100

 

    --    current_last_point_ids will have a size of 100k

    current_last_point_ids := ARRAY( SELECT

                                            last_point_id

                                      from

                                            public.ims_object_header h

                                     );                                    

    -- START OF PERFORMANCE BOTTLENECK

    IF(array_length(current_last_point_ids, 1) > 0)

    THEN   

        FOR i IN 0 .. array_upper(current_last_point_ids, 1)

        LOOP

            point_ids_older_than_one_hour = array_remove(point_ids_older_than_one_hour, current_last_point_ids[i]::bigint);

        END LOOP;

    END IF;

    -- END OF PERFORMANCE BOTTLENECK

END;

 

The array manipulation part is the performance bottleneck. I am pretty sure, that there is a better way of doing this, however I couldn’t find one.

What I have is two table, lets call them ims_point and ims_object_header. ims_object_header references some entries of ims_point in the column last_point_id.

Now I want to delete all entries from ims_point, where the timestamp is older than one hour. The currently being referenced ids of the table ims_object_header should be excluded from this deletion. Therefore I stored the ids in arrays and iterate over those arrays to exclude the referenced values from being deleted.

 

However, I not sure if using an array for an operation like this is the best approach.

 

Can anyone give me some advice how this could be enhanced.

 

Thanks in advance.

 

Re: Performance bottleneck due to array manipulation

From
Tom Lane
Date:
=?iso-8859-1?Q?Genc=2C_=D6mer?= <Oemer.Genc@iais.fraunhofer.de> writes:
> i have a very long running stored procedure, due to array manipulation in a stored procedure. The following procedure
takes13 seconds to finish. 

> BEGIN
>     point_ids_older_than_one_hour := '{}';
>     object_ids_to_be_invalidated := '{}';

>     select ARRAY(SELECT
>                     point_id
>                 from ONLY
>                     public.ims_point as p
>                 where
>                     p.timestamp < m_before_one_hour
>                 )
>     into point_ids_older_than_one_hour ; -- this array has a size of 20k

>     select ARRAY(SELECT
>                         object_id
>                   from
>                         public.ims_object_header h
>                   WHERE
>                         h.last_point_id= ANY(point_ids_older_than_one_hour)
>                  )
>     into object_ids_to_be_invalidated; -- this array has a size of 100

>     --    current_last_point_ids will have a size of 100k
>     current_last_point_ids := ARRAY( SELECT
>                                             last_point_id
>                                       from
>                                             public.ims_object_header h
>                                      );
>     -- START OF PERFORMANCE BOTTLENECK
>     IF(array_length(current_last_point_ids, 1) > 0)
>     THEN
>         FOR i IN 0 .. array_upper(current_last_point_ids, 1)
>         LOOP
>             point_ids_older_than_one_hour = array_remove(point_ids_older_than_one_hour,
current_last_point_ids[i]::bigint);
>         END LOOP;
>     END IF;
>     -- END OF PERFORMANCE BOTTLENECK
> END;

Well, in the first place, this is the cardinal sin of working with SQL
databases: doing procedurally that which should be done relationally.
Forget the arrays entirely and use EXCEPT, ie

  SELECT point_id FROM ...
  EXCEPT
  SELECT last_point_id FROM ...

Or maybe you want EXCEPT ALL, depending on whether duplicates are possible
and what you want to do with them if so.

Having said that, the way you have this is necessarily O(N^2) because
array_remove has to search the entire array on each call, and then
reconstruct the entire array less any removed element(s).  The new
"expanded array" infrastructure in 9.5 could perhaps reduce some of the
constant factor, but the array search loop remains so it would still be
O(N^2); and anyway array_remove has not been taught about expanded arrays
(which means this example is probably even slower in 9.5 :-().

If you were using integers, you could possibly get decent performance from
contrib/intarray's int[] - int[] operator (which I think does a sort and
merge internally); but I gather that these are bigints, so that won't
work.

            regards, tom lane








>                                       from






> The array manipulation part is the performance bottleneck. I am pretty sure, that there is a better way of doing
this,however I couldn't find one. 
> What I have is two table, lets call them ims_point and ims_object_header. ims_object_header references some entries
ofims_point in the column last_point_id. 
> Now I want to delete all entries from ims_point, where the timestamp is older than one hour. The currently being
referencedids of the table ims_object_header should be excluded from this deletion. Therefore I stored the ids in
arraysand iterate over those arrays to exclude the referenced values from being deleted. 

> However, I not sure if using an array for an operation like this is the best approach.

> Can anyone give me some advice how this could be enhanced.

> Thanks in advance.




Re: Performance bottleneck due to array manipulation

From
Félix GERZAGUET
Date:
Hello,

On Fri, Aug 21, 2015 at 2:48 PM, Genc, Ömer <Oemer.Genc@iais.fraunhofer.de> wrote:

Now I want to delete all entries from ims_point, where the timestamp is older than one hour. The currently being referenced ids of the table ims_object_header should be excluded from this deletion.

 


delete from public.ims_point ip
  where ip.timestamp < current_timestamp - interval '1 hour'
    and not exists ( select 'reference exists'
                       from public.ims_object_header ioh
                      where ioh.last_point_id = ip.point_id
                     )
;

Does this works for you ?

Re: Performance bottleneck due to array manipulation

From
Igor Neyman
Date:

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Genc, Ömer
Sent: Friday, August 21, 2015 8:49 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Performance bottleneck due to array manipulation

 

Hey,

 

i have a very long running stored procedure, due to array manipulation in a stored procedure. The following procedure takes 13 seconds to finish.

 

BEGIN

    point_ids_older_than_one_hour := '{}';

    object_ids_to_be_invalidated := '{}';

 

    select ARRAY(SELECT

                    point_id

                from ONLY

                    public.ims_point as p

                where

                    p.timestamp < m_before_one_hour

                )

    into point_ids_older_than_one_hour ; -- this array has a size of 20k

 

    select ARRAY(SELECT

                        object_id

                  from

                        public.ims_object_header h

                  WHERE

                        h.last_point_id= ANY(point_ids_older_than_one_hour)

                 )

    into object_ids_to_be_invalidated; -- this array has a size of 100

 

    --    current_last_point_ids will have a size of 100k

    current_last_point_ids := ARRAY( SELECT

                                            last_point_id

                                      from

                                            public.ims_object_header h

                                     );                                    

    -- START OF PERFORMANCE BOTTLENECK

    IF(array_length(current_last_point_ids, 1) > 0)

    THEN   

        FOR i IN 0 .. array_upper(current_last_point_ids, 1)

        LOOP

            point_ids_older_than_one_hour = array_remove(point_ids_older_than_one_hour, current_last_point_ids[i]::bigint);

        END LOOP;

    END IF;

    -- END OF PERFORMANCE BOTTLENECK

END;

 

The array manipulation part is the performance bottleneck. I am pretty sure, that there is a better way of doing this, however I couldn’t find one.

What I have is two table, lets call them ims_point and ims_object_header. ims_object_header references some entries of ims_point in the column last_point_id.

Now I want to delete all entries from ims_point, where the timestamp is older than one hour. The currently being referenced ids of the table ims_object_header should be excluded from this deletion. Therefore I stored the ids in arrays and iterate over those arrays to exclude the referenced values from being deleted.

 

However, I not sure if using an array for an operation like this is the best approach.

 

Can anyone give me some advice how this could be enhanced.

 

Thanks in advance.

 

 

I think in this case (as is in many other cases) “pure” SQL does the job much better than procedural language:

 

DELETE FROM public.ims_point as P

WHERE  P.timestamp < m_before_one_hour

     AND NOT EXISTS (SELECT 1 FROM  public.ims_object_header OH

                                                WHERE OH.last_point_id = P.object_id);

 

Is that what you are trying to accomplish?

 

Regards,

Igor Neyman

 

 

 

 

Re: Performance bottleneck due to array manipulation

From
Genc, Ömer
Date:

Thanks a lot,

 

The mentioned advices helped me a lot. I used an approach similar to the one mentioned by Igor and Felix and now the stored procedure runs fast.

 

Kind regards,

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Genc, Ömer
Sent: Friday, August 21, 2015 8:49 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Performance bottleneck due to array manipulation

 

Hey,

 

i have a very long running stored procedure, due to array manipulation in a stored procedure. The following procedure takes 13 seconds to finish.

 

BEGIN

    point_ids_older_than_one_hour := '{}';

    object_ids_to_be_invalidated := '{}';

 

    select ARRAY(SELECT

                    point_id

                from ONLY

                    public.ims_point as p

                where

                    p.timestamp < m_before_one_hour

                )

    into point_ids_older_than_one_hour ; -- this array has a size of 20k

 

    select ARRAY(SELECT

                        object_id

                  from

                        public.ims_object_header h

                  WHERE

                        h.last_point_id= ANY(point_ids_older_than_one_hour)

                 )

    into object_ids_to_be_invalidated; -- this array has a size of 100

 

    --    current_last_point_ids will have a size of 100k

    current_last_point_ids := ARRAY( SELECT

                                            last_point_id

                                      from

                                            public.ims_object_header h

                                     );                                    

    -- START OF PERFORMANCE BOTTLENECK

    IF(array_length(current_last_point_ids, 1) > 0)

    THEN   

        FOR i IN 0 .. array_upper(current_last_point_ids, 1)

        LOOP

            point_ids_older_than_one_hour = array_remove(point_ids_older_than_one_hour, current_last_point_ids[i]::bigint);

        END LOOP;

    END IF;

    -- END OF PERFORMANCE BOTTLENECK

END;

 

The array manipulation part is the performance bottleneck. I am pretty sure, that there is a better way of doing this, however I couldn’t find one.

What I have is two table, lets call them ims_point and ims_object_header. ims_object_header references some entries of ims_point in the column last_point_id.

Now I want to delete all entries from ims_point, where the timestamp is older than one hour. The currently being referenced ids of the table ims_object_header should be excluded from this deletion. Therefore I stored the ids in arrays and iterate over those arrays to exclude the referenced values from being deleted.

 

However, I not sure if using an array for an operation like this is the best approach.

 

Can anyone give me some advice how this could be enhanced.

 

Thanks in advance.

 

 

I think in this case (as is in many other cases) “pure” SQL does the job much better than procedural language:

 

DELETE FROM public.ims_point as P

WHERE  P.timestamp < m_before_one_hour

     AND NOT EXISTS (SELECT 1 FROM  public.ims_object_header OH

                                                WHERE OH.last_point_id = P.object_id);

 

Is that what you are trying to accomplish?

 

Regards,

Igor Neyman