Thread: =ANY(ARRAY) vs =ANY(ARRAY(expr)) performance

=ANY(ARRAY) vs =ANY(ARRAY(expr)) performance

From

Ramdip Gill

Date:

14 November 2022, 01:49:13

Hello!

I have been struggling with finding a proper solution for this query for some time and wanted to ask if someone here knows how to approach this?

I have a table named "report" which has an index on report.reporter_id. This column consists of IDs which are grouped together using a table named "group_links".

So for every reporter id which is part of the same group, there is a row in "group_links" with the same group_id.

Now, I noticed that I can select reports for a group in two ways. Both queries return the same but one is using =ANY(ARRAY(expr)) ("subselect") and one is using =ANY(ARRAY) ("static array") with the same array as the expression would return.

The static array query is running very fast for small selections and where not a lot of rows match the condition. It uses a bitmap index scan.

The subselect is running very slow and uses an index scan. However, it is particularly slow if not many rows match the condition and thus a lot of rows are filtered while scanning the index.

I was able to reproduce a similar issue with using `= ANY(VALUES)` instead of `= ANY(ARRAY)`:

1. fast query using =ANY(ARRAY): https://explain.depesz.com/s/dwP8

2. slow query using =ANY(ARRAY(expr)): https://explain.depesz.com/s/3hGb

3. slow query using =ANY(VALUES): https://explain.depesz.com/s/cYrn

I guess the difference comes from the query planner not being able to know the exact values for the WHERE condition beforehand. But how should cases like this be best handled?

Should I denormalize the data such that I have a table with columns report.id and group_id and report.created such that I can create an index on (created, group_id)? Then I don't have to do a subselect anymore.

I would be very glad for any help regarding this!

Postgres version: PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

------------------------------------

\d report
Table "public.report"
Column | Type | Collation | Nullable | Default
---------------+--------------------------+-----------+----------+---------
reporter_id | uuid | | not null |
parsed | boolean | | |
id | text | | not null |
request_id | uuid | | |
created | timestamp with time zone | | not null | now()
customer | text | | |
subject | text | | |
parser_result | text | | not null |
parser | text | | |
event_types | jsonb | | |
event_count | integer | | |
account_id | integer | | |
reviewable | boolean | | not null | false
reviewed | boolean | | not null | false
Indexes:
"PK_99e4d0bea58cba73c57f935a546" PRIMARY KEY, btree (id)
"idx_report_created_desc_id_asc" btree (created DESC, id)
"idx_report_created_desc_reporter_id_asc" btree (created DESC, reporter_id)
"idx_report_event_types" gin (event_types)
"idx_report_parser_gin" gin (parser gin_trgm_ops)
"idx_report_parser_result_created_desc" btree (parser_result, created DESC)
"idx_report_reporter_id_asc_created_desc" btree (reporter_id, created DESC)
"idx_report_request_id_asc_created_desc" btree (request_id, created DESC)
"idx_report_subject_gin" gin (subject gin_trgm_ops)
Check constraints:
"report_parser_result_constraint" CHECK (parser_result = ANY (ARRAY['PARSED'::text, 'UNPARSED'::text, 'REJECTED'::text]))
Foreign-key constraints:
"FK_5b809608bb38d119333b69f65f9" FOREIGN KEY (request_id) REFERENCES request(id)
"FK_d41df66b60944992386ed47cf2e" FOREIGN KEY (reporter_id) REFERENCES reporter(id)
Referenced by:
TABLE "event" CONSTRAINT "event_report_id_foreign" FOREIGN KEY (report_id) REFERENCES report(id)

------------------------------------

SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname='report';
relname | relpages | reltuples | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size
---------+----------+---------------+---------------+---------+----------+----------------+------------+---------------
report | 2062252 | 8.5893344e+07 | 2062193 | r | 22 | f | | 16898801664
(1 row)

------------------------------------

\d group_links
Table "public.group_links"
Column | Type | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+-------------------
rule_id | uuid | | not null |
reporter_id | uuid | | not null |
group_id | uuid | | not null |
exclusion | boolean | | | false
last_update_time | timestamp with time zone | | | CURRENT_TIMESTAMP
Indexes:
"group_rules_matches_pkey" PRIMARY KEY, btree (rule_id, reporter_id)
"idx_group_rules_matches_group_id" btree (group_id)
"idx_group_rules_matches_group_id_reporter_id_exclusion" btree (group_id, reporter_id, exclusion)
"idx_group_rules_matches_reporter_id" btree (reporter_id)
Foreign-key constraints:
"group_rules_matches_group_id_foreign" FOREIGN KEY (group_id) REFERENCES "group"(id) ON DELETE CASCADE
"group_rules_matches_reporter_id_foreign" FOREIGN KEY (reporter_id) REFERENCES reporter(id)

"group_rules_matches_rule_id_foreign" FOREIGN KEY (rule_id) REFERENCES group_rules(id) ON DELETE CASCADE

------------------------------------

Re: =ANY(ARRAY) vs =ANY(ARRAY(expr)) performance

From

Ramdip Gill

Date:

14 November 2022, 04:17:17

Okay, increasing the collection of statistics seems to have helped. I used `ALTER TABLE report ALTER COLUMN reporter_id SET STATISTICS 10000` and now queries which previously didn't finish at all now finish in < 1 ms.

The following gave me the hint:

“The amount of information stored in `pg_statistic` by `ANALYZE`, in particular the maximum number of entries in the `most_common_vals` and `histogram_bounds` arrays for each column, can be set on a column-by-column basis using the `ALTER TABLE SET STATISTICS` command, or globally by setting the default_statistics_target configuration variable. The default limit is presently 100 entries. Raising the limit might allow more accurate planner estimates to be made, particularly for columns with irregular data distributions, at the price of consuming more space in `pg_statistic` and slightly more time to compute the estimates. Conversely, a lower limit might be sufficient for columns with simple data distributions.”

— https://www.postgresql.org/docs/current/planner-stats.html

Re: =ANY(ARRAY) vs =ANY(ARRAY(expr)) performance

From

Rick Otten

Date:

14 November 2022, 14:11:18

I was able to reproduce a similar issue with using `= ANY(VALUES)` instead of `= ANY(ARRAY)`:

1. fast query using =ANY(ARRAY): https://explain.depesz.com/s/dwP8
2. slow query using =ANY(ARRAY(expr)): https://explain.depesz.com/s/3hGb
3. slow query using =ANY(VALUES): https://explain.depesz.com/s/cYrn

I have found the "ANY" operator to be slow in general. It is almost always faster to use the "<@" operator:

```

-- more intuitive:
select
count(*)
from
testarray
where
'test' = ANY (myarray)
;

-- faster:
select
count(*)
from
testarray
where
ARRAY['test'::varchar] <@ myarray
;

```

It is just one of those things, like replacing "OR" with "UNION ALL" whenever possible too, that just make queries faster in PostgreSQL without a ton of effort or fuss.

Re: =ANY(ARRAY) vs =ANY(ARRAY(expr)) performance

From

Guillaume Cottenceau

Date:

14 November 2022, 14:22:27

Rick Otten <rottenwindfish 'at' gmail.com> writes:

>  I was able to reproduce a similar issue with using `= ANY(VALUES)` instead of `= ANY(ARRAY)`:
>
>  1. fast query using =ANY(ARRAY): https://explain.depesz.com/s/dwP8
>  2. slow query using =ANY(ARRAY(expr)): https://explain.depesz.com/s/3hGb
>  3. slow query using =ANY(VALUES): https://explain.depesz.com/s/cYrn
>
>  I have found the "ANY" operator to be slow in general.  It is almost always faster to use the "<@" operator:
> ```
> -- more intuitive:
> select
>   count(*)
> from
>   testarray
> where
>   'test' = ANY (myarray)
> ;
>
> -- faster:
> select
>   count(*)
> from
>   testarray
> where
>   ARRAY['test'::varchar] <@ myarray
> ;
> ```
> It is just one of those things, like replacing "OR" with "UNION ALL" whenever possible too, that just make queries
fasterin PostgreSQL without a
 
> ton of effort or fuss.

depends^^

db=> select count(*) from table where uid = any( string_to_array('11290331,11290332,11290333,11290431',',')::int[]);
 count 
-------
     4
(1 row)

Time: 0.837 ms
db=> select count(*) from table where uid = any( string_to_array('11290331,11290332,11290333,11290431',',')::int[]);
 count 
-------
     4
(1 row)

Time: 0.854 ms
db=> select count(*) from table where array[uid] <@ string_to_array('11290331,11290332,11290333,11290431',',')::int[];
 count 
-------
     4
(1 row)

Time: 52.335 ms
db=> select count(*) from table where array[uid] <@ string_to_array('11290331,11290332,11290333,11290431',',')::int[];
 count 
-------
     4
(1 row)

Time: 44.176 ms


-- 
Guillaume Cottenceau