Thread: Optimizing `WHERE x IN` query
Hi! I am attempting to replicate YouTube's subscription feed. Each user has a list of channel IDs (as text[]) that they are subscribed to. The `users` table looks like: ``` =# \d users Table "public.users" Column | Type | Collation | Nullable | Default -------------------+--------------------------+-----------+---------- +--------- email | text | | not null | subscriptions | text[] | | | feed_needs_update | boolean | | | Indexes: "users_pkey" PRIMARY KEY, btree (email) "email_unique_idx" UNIQUE, btree (lower(email)) "users_email_key" UNIQUE CONSTRAINT, btree (email) ``` For storing videos, I have a table `channel_videos` from which I want to select all videos where the video's `ucid` (channel ID) is in the user's subscriptions. `channel_videos` looks like: ``` =# \d channel_videos Table "public.channel_videos" Column | Type | Collation | Nullable | Default --------------------+--------------------------+-----------+---------- +--------- id | text | | not null | title | text | | | published | timestamp with time zone | | | updated | timestamp with time zone | | | ucid | text | | | author | text | | | views | bigint | | | Indexes: "channel_videos_id_key" UNIQUE CONSTRAINT, btree (id) "channel_videos_published_idx" btree (published) "channel_videos_ucid_idx" btree (ucid) ``` In case it might help with indexing, a UCID always begins with `UC`, then 22 random characters in [a-zA-Z0-9_-] (Base64-urlsafe). Currently, the query I'm using to generate a user's feed is: ``` SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM users WHERE email = $1) ORDER BY published DESC; ``` This works great when `subscriptions` and `channel_videos` are both fairly small. Unfortunately, at this point `channel_videos` contains roughly 28M rows. Attempting to generate a feed for a user with 177 subscriptions takes around 40-70s, here's the relevant `EXPLAIN (ANALYZE, BUFFERS)`: ``` =# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM users WHERE email = 'omarroth') ORDER BY published DESC; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------ Sort (cost=478207.59..478506.73 rows=119656 width=142) (actual time=68599.562..68613.824 rows=46456 loops=1) Sort Key: channel_videos.published DESC Sort Method: external merge Disk: 6352kB Buffers: shared hit=456 read=13454 dirtied=13 written=456, temp read=794 written=797 -> Nested Loop (cost=55.94..459526.49 rows=119656 width=142) (actual time=0.555..68531.550 rows=46456 loops=1) Buffers: shared hit=453 read=13454 dirtied=13 written=456 -> HashAggregate (cost=10.18..11.18 rows=100 width=32) (actual time=0.417..0.850 rows=177 loops=1) Group Key: unnest(users.subscriptions) Buffers: shared hit=39 read=7 -> ProjectSet (cost=0.41..8.93 rows=100 width=32) (actual time=0.305..0.363 rows=177 loops=1) Buffers: shared hit=39 read=7 -> Index Scan using users_email_key on users (cost=0.41..8.43 rows=1 width=127) (actual time=0.049..0.087 rows=1 loops=1) Index Cond: (email = 'omarroth'::text) Buffers: shared hit=5 read=4 -> Bitmap Heap Scan on channel_videos (cost=45.76..4583.18 rows=1197 width=142) (actual time=28.835..387.068 rows=262 loops=177) Recheck Cond: (ucid = (unnest(users.subscriptions))) Heap Blocks: exact=12808 Buffers: shared hit=414 read=13447 dirtied=13 written=456 -> Bitmap Index Scan on channel_videos_ucid_idx (cost=0.00..45.46 rows=1197 width=0) (actual time=22.255..22.255 rows=263 loops=177) Index Cond: (ucid = (unnest(users.subscriptions))) Buffers: shared hit=291 read=762 written=27 Planning Time: 1.463 ms Execution Time: 68619.316 ms (23 rows) ``` Since a feed is expected to be fairly consistent to what would appear on YouTube, `channel_videos` receives fairly frequent UPDATEs and INSERTs and is vacuumed fairly frequently: ``` =# SELECT * FROM pg_stat_user_tables WHERE relname = 'channel_videos'; relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | vacuum_count | autovacuum_count | analyze_count | autoanalyze_count -------+------------+----------------+----------+--------------+---------- +---------------+-----------+-----------+-----------+--------------- +------------+------------+--------------------- +------------------------------+----------------- +-------------------------------+------------------------------- +--------------+------------------+---------------+------------------- 16444 | public | channel_videos | 3 | 6346042 | 28294649 | 6605201260 | 218485 | 13280741 | 2413 | 11241541 | 24601363 | 585451 | 763200 | 2019-07-05 17:57:21.34215+00 | | 2019-07-05 17:59:21.013308+00 | 2019-07-04 14:54:02.845591+00 | 4 | 0 | 4 | 2 (1 row) The machine running Postgres has 4 vCPUs, 8GB of RAM (which I expect is the problem), and 160GB SSD. `uname -srv`: # uname -srv Linux 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC 2019 ``` I am running Postgres 11.4, `SELECT version()`: PostgreSQL 11.4 (Ubuntu 11.4-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit (1 row)
Omar Roth schrieb am 07.07.2019 um 15:43: > Currently, the query I'm using to generate a user's feed is: > > ``` > SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM > users WHERE email = $1) ORDER BY published DESC; > ``` You could try an EXISTS query without unnest: select cv.* from channel_videos cv where exists ucid (select * from users u where cv.ucid = any(u.subscriptions) and u.email = $1); Did you try if a properly normalized model performs better?
The suggested query indeed appears to be faster. Thank you. > Did you try if a properly normalized model performs better? I've tested the below schema, which doesn't appear to perform much better but has a couple other advantages for my application: ``` create table subscriptions ( email text references users(email), ucid text, primary key (email, ucid) ); explain (analyze, buffers) select cv.* from channel_videos cv, subscriptions s where cv.ucid = s.ucid and s.email = $1; ``` Is there something else here I'm missing?
Le 07/07/2019 à 16:33, Thomas Kellerer a écrit : > Omar Roth schrieb am 07.07.2019 um 15:43: >> Currently, the query I'm using to generate a user's feed is: >> >> ``` >> SELECT * FROM channel_videos WHERE ucid IN (SELECT >> unnest(subscriptions) FROM >> users WHERE email = $1) ORDER BY published DESC; >> ``` > > You could try an EXISTS query without unnest: > > select cv.* > from channel_videos cv > where exists ucid (select * > from users u > where cv.ucid = any(u.subscriptions) > and u.email = $1); > > Did you try if a properly normalized model performs better? > > Hi We had big performance issues with queries like that, and we modified them to use && (see https://www.postgresql.org/docs/current/functions-array.html ), resulting in a big perf boost so, with your model, the query could be ``` select cv.* from channel_videos cv inner join user u on cv.ucid && u.subscription where u.email = $1; ``` or ``` select cv.* from channel_videos cv inner join ( select subscription from user where email = $1) as u on cv.ucid && u.subscription ; ``` (disclaimer, I didn't try this queries, they may contain typos) Regards Nicolas
> We had big performance issues with queries like that, and we modified > them to use && (see > https://www.postgresql.org/docs/current/functions-array.html ), > resulting in a big perf boost Much appreciated! Unfortunately I'm having trouble turning your suggestions into a working query. `cv.ucid && u.subscriptions` doesn't appear to work in my case since the comparison is `text && text[]`. Something like: ``` explain (analyze, buffers) select cv.* from channel_videos cv inner join (select subscriptions from users where email = $1) as u on array[cv.ucid] && u.subscriptions; ``` does work but unfortunately doesn't appear to be faster than the current query. Cheers
Did you create a GIN index on subscriptions column to support the && operator?