I'm working with large tables containing arrays of integers, indexed with "gin__int_ops" GIN indexes offered by the "intarray" extension. The goal I'm trying to achieve is to do a "nested loop semi join" using the array inclusion operation (@>) as join condition but in an indexed way. (Basically an INNER JOIN without the duplicate rows and without needing to use columns from the joined table.)
2) Configuration
The queries are run on a PostgreSQL v14 server with 32GB RAM and 8 vCPUs on a 64 bit ARM Neoverse architecture (m6g.2xlarge AWS RDS instance). PostgreSQL's configuration uses the following key values:
work_mem = 8GB (only set for this query)
shared_buffers = 8GB
effective_cache_size = 22GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
jit = on
3) Tables
The "light_pages_attributes" contains about 2 million rows, each with an "attributes" column containing on average 20 integers.
CREATE TABLE light_pages_attributes ( id INTEGER NOT NULL, "attributes" INTEGER[] NOT NULL ) ; CREATE INDEX light_pages_attributes_attributes ON light_pages_attributes USING gin ( attributes gin__int_ops ) ;
The "light_pages_views" contains about 25 million rows, each with a "page_ids" column containing on average 20 integers as well.
CREATE TABLE light_pages_views ( vector_id BIGINT NOT NULL, page_ids INTEGER[] NOT NULL ) ; CREATE INDEX light_pages_views_page_ids ON light_pages_views USING gin ( page_ids gin__int_ops ) ;
4) Query
The query I'm trying to optimise is the following:
BEGIN;
SET LOCAL work_mem = '8GB';
CREATE TEMPORARY VIEW urls AS ( SELECT ARRAY[lpa.id] AS page_id FROM light_pages_attributes AS lpa WHERE lpa."attributes" @> ARRAY[189376] ); EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS, TIMING ) SELECT COUNT(*) FROM light_pages_views AS lpv WHERE EXISTS ( SELECT 1 FROM urls AS u WHERE lpv.page_ids @> u.page_id ) ;
COMMIT;
The last query does not finish after waiting for more than 15 minutes. (The temporary view creation is very fast and required due to the same query in a CTE greatly reducing performance (by more than 5 min.) due to the optimisation barrier I'm guessing.) This alternative query, which should be far slower due to the fact that it generates duplicate lines through the INNER JOIN, is in fact much faster, 1 min. and 39 s.:
EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS, TIMING ) SELECT COUNT(*) FROM ( SELECT 1 FROM light_pages_views AS lpv INNER JOIN urls AS u ON lpv.page_ids @> u.page_id GROUP BY lpv.vector_id ) AS t ;
From:
David Rowley Date: Subject:
Re: Performance differential when 0 values present vs when 1 values present. Planner return 52k rows when 0 expected.