Thread: Optimizing `WHERE x IN` query

Optimizing `WHERE x IN` query

From
Omar Roth
Date:
Hi!

I am attempting to replicate YouTube's subscription feed. Each user has a list
of channel IDs (as text[]) that they are subscribed to. The `users` table
looks like:

```
=# \d users
                             Table "public.users"
      Column       |           Type           | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------
+---------
 email             | text                     |           | not null |
 subscriptions     | text[]                   |           |          |
 feed_needs_update | boolean                  |           |          |
Indexes:
    "users_pkey" PRIMARY KEY, btree (email)
    "email_unique_idx" UNIQUE, btree (lower(email))
    "users_email_key" UNIQUE CONSTRAINT, btree (email)
```

For storing videos, I have a table `channel_videos` from which I want to
select all videos where the video's `ucid` (channel ID) is in the user's
subscriptions. `channel_videos` looks like:

```
=# \d channel_videos
                         Table "public.channel_videos"
       Column       |           Type           | Collation | Nullable |
Default
--------------------+--------------------------+-----------+----------
+---------
 id                 | text                     |           | not null |
 title              | text                     |           |          |
 published          | timestamp with time zone |           |          |
 updated            | timestamp with time zone |           |          |
 ucid               | text                     |           |          |
 author             | text                     |           |          |
 views              | bigint                   |           |          |
Indexes:
    "channel_videos_id_key" UNIQUE CONSTRAINT, btree (id)
    "channel_videos_published_idx" btree (published)
    "channel_videos_ucid_idx" btree (ucid)
```

In case it might help with indexing, a UCID always begins with `UC`, then 22
random characters in [a-zA-Z0-9_-] (Base64-urlsafe).

Currently, the query I'm using to generate a user's feed is:

```
SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM
users WHERE email = $1) ORDER BY published DESC;
```

This works great when `subscriptions` and `channel_videos` are both fairly
small. Unfortunately, at this point `channel_videos` contains roughly 28M
rows. Attempting to generate a feed for a user with 177 subscriptions takes
around 40-70s, here's the relevant `EXPLAIN (ANALYZE, BUFFERS)`:

```
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM channel_videos WHERE ucid IN
(SELECT unnest(subscriptions) FROM users WHERE email = 'omarroth') ORDER BY
published DESC;
                                                                      QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=478207.59..478506.73 rows=119656 width=142) (actual
time=68599.562..68613.824 rows=46456 loops=1)
   Sort Key: channel_videos.published DESC
   Sort Method: external merge  Disk: 6352kB
   Buffers: shared hit=456 read=13454 dirtied=13 written=456, temp read=794
written=797
   ->  Nested Loop  (cost=55.94..459526.49 rows=119656 width=142) (actual
time=0.555..68531.550 rows=46456 loops=1)
         Buffers: shared hit=453 read=13454 dirtied=13 written=456
         ->  HashAggregate  (cost=10.18..11.18 rows=100 width=32) (actual
time=0.417..0.850 rows=177 loops=1)
               Group Key: unnest(users.subscriptions)
               Buffers: shared hit=39 read=7
               ->  ProjectSet  (cost=0.41..8.93 rows=100 width=32) (actual
time=0.305..0.363 rows=177 loops=1)
                     Buffers: shared hit=39 read=7
                     ->  Index Scan using users_email_key on users
(cost=0.41..8.43 rows=1 width=127) (actual time=0.049..0.087 rows=1 loops=1)
                           Index Cond: (email = 'omarroth'::text)
                           Buffers: shared hit=5 read=4
         ->  Bitmap Heap Scan on channel_videos  (cost=45.76..4583.18
rows=1197 width=142) (actual time=28.835..387.068 rows=262 loops=177)
               Recheck Cond: (ucid = (unnest(users.subscriptions)))
               Heap Blocks: exact=12808
               Buffers: shared hit=414 read=13447 dirtied=13 written=456
               ->  Bitmap Index Scan on channel_videos_ucid_idx
(cost=0.00..45.46 rows=1197 width=0) (actual time=22.255..22.255 rows=263
loops=177)
                     Index Cond: (ucid = (unnest(users.subscriptions)))
                     Buffers: shared hit=291 read=762 written=27
 Planning Time: 1.463 ms
 Execution Time: 68619.316 ms
(23 rows)

```

Since a feed is expected to be fairly consistent to what would appear on
YouTube, `channel_videos` receives fairly frequent UPDATEs and INSERTs and is
vacuumed fairly frequently:

```
=# SELECT * FROM pg_stat_user_tables WHERE relname = 'channel_videos';
 relid | schemaname |    relname     | seq_scan | seq_tup_read | idx_scan |
idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup
| n_dead_tup | n_mod_since_analyze |         last_vacuum          |
last_autovacuum |         last_analyze          |       last_autoanalyze
| vacuum_count | autovacuum_count | analyze_count | autoanalyze_count
-------+------------+----------------+----------+--------------+----------
+---------------+-----------+-----------+-----------+---------------
+------------+------------+---------------------
+------------------------------+-----------------
+-------------------------------+-------------------------------
+--------------+------------------+---------------+-------------------
 16444 | public     | channel_videos |        3 |      6346042 | 28294649 |
6605201260 |    218485 |  13280741 |      2413 |      11241541 |   24601363 |
585451 |              763200 | 2019-07-05 17:57:21.34215+00 |
| 2019-07-05 17:59:21.013308+00 | 2019-07-04 14:54:02.845591+00 |            4
|                0 |             4 |                 2
(1 row)
The machine running Postgres has 4 vCPUs, 8GB of RAM (which I expect is the
problem), and 160GB SSD. `uname -srv`:

# uname -srv
Linux 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC 2019
```
I am running Postgres 11.4, `SELECT version()`:

 PostgreSQL 11.4 (Ubuntu 11.4-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled
by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
(1 row)





Re: Optimizing `WHERE x IN` query

From
Thomas Kellerer
Date:
Omar Roth schrieb am 07.07.2019 um 15:43:
> Currently, the query I'm using to generate a user's feed is:
>
> ```
> SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM
> users WHERE email = $1) ORDER BY published DESC;
> ```

You could try an EXISTS query without unnest:

select cv.*
from channel_videos cv
where exists ucid (select *
                    from users u
                    where cv.ucid = any(u.subscriptions)
                      and u.email = $1);

Did you try if a properly normalized model performs better?




Re: Optimizing `WHERE x IN` query

From
Omar Roth
Date:
The suggested query indeed appears to be faster. Thank you.

> Did you try if a properly normalized model performs better?

I've tested the below schema, which doesn't appear to perform much better but
has a couple other advantages for my application:

```
create table subscriptions (
    email text references users(email),
    ucid text,
    primary key (email, ucid)
);

explain (analyze, buffers) select cv.* from channel_videos cv, subscriptions s
where cv.ucid = s.ucid and s.email = $1;

```

Is there something else here I'm missing?





Re: Optimizing `WHERE x IN` query

From
Nicolas Charles
Date:
Le 07/07/2019 à 16:33, Thomas Kellerer a écrit :
> Omar Roth schrieb am 07.07.2019 um 15:43:
>> Currently, the query I'm using to generate a user's feed is:
>>
>> ```
>> SELECT * FROM channel_videos WHERE ucid IN (SELECT 
>> unnest(subscriptions) FROM
>> users WHERE email = $1) ORDER BY published DESC;
>> ```
>
> You could try an EXISTS query without unnest:
>
> select cv.*
> from channel_videos cv
> where exists ucid (select *
>                    from users u
>                    where cv.ucid = any(u.subscriptions)
>                      and u.email = $1);
>
> Did you try if a properly normalized model performs better?
>
>
Hi


We had big performance issues with queries like that, and we modified 
them to use && (see 
https://www.postgresql.org/docs/current/functions-array.html ), 
resulting in a big perf boost

so, with your model, the query could be

```
select cv.*
from channel_videos cv

inner join user u on cv.ucid && u.subscription

where u.email = $1;
```

or

```
select cv.*
from channel_videos cv

inner join ( select subscription  from user where email = $1) as u on 
cv.ucid && u.subscription ;

```

(disclaimer, I didn't try this queries, they may contain typos)


Regards

Nicolas




Re: Optimizing `WHERE x IN` query

From
Omar Roth
Date:
> We had big performance issues with queries like that, and we modified
> them to use && (see
> https://www.postgresql.org/docs/current/functions-array.html ),
> resulting in a big perf boost

Much appreciated! Unfortunately I'm having trouble turning your suggestions
into a working query.

`cv.ucid && u.subscriptions`  doesn't appear to work in my case since the
comparison is `text && text[]`. Something like:

```
explain (analyze, buffers) select cv.* from channel_videos cv inner join
(select subscriptions from users where email = $1) as u on array[cv.ucid] &&
u.subscriptions;
```

does work but unfortunately doesn't appear to be faster than the current
query.

Cheers





Re: Optimizing `WHERE x IN` query

From
Michael Lewis
Date:
Did you create a GIN index on subscriptions column to support the && operator?