Thread: Optimize query for listing un-read messages

Optimize query for listing un-read messages

From

Andreas Joseph Krogh

Date:

02 May 2014, 18:49:51

Hi all,

I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry i a table message_property holding the property (is_read) for that user.

The schema is as follows:

drop table if exists message_property;
drop table if exists message;
drop table if exists person;

create table person(
id serial primary key,
username varchar not null unique
);

create table message(
id serial primary key,
subject varchar
);

create table message_property(
    message_id integer not null references message(id),
    person_id integer not null references person(id),
    is_read boolean not null default false,
    unique(message_id, person_id)
);

insert into person(username) values('user_' || generate_series(0, 999));

insert into message(subject) values('Subject ' || random() || generate_series(0, 999999));

insert into message_property(message_id, person_id, is_read) select id, 1, true from message order by id limit 999990;
insert into message_property(message_id, person_id, is_read) select id, 1, false from message order by id limit 5 offset 999990;

So, for person 1 there are 10 unread messages, out of a total 1mill. 5 of those unread does not have an entry in message_property and 5 have an entry and is_read set to FALSE.

I have the following query to list all un-read messages for person with id=1:

SELECT
    m.id                          AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
    LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE 1 = 1
      AND NOT EXISTS(SELECT
                         *
                     FROM message_property pr
                     WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
    ;

The problem is that it's not quite efficient and performs badly, explain analyze shows:

                                                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Anti Join (cost=1.27..148784.09 rows=5 width=40) (actual time=907.412..907.419 rows=10 loops=1)
   Merge Cond: (m.id = pr.message_id)
   Join Filter: (prop.is_read AND (pr.person_id = prop.person_id))
   Rows Removed by Join Filter: 5
   -> Merge Left Join (cost=0.85..90300.76 rows=1000000 width=40) (actual time=0.039..524.070 rows=1000000 loops=1)
         Merge Cond: (m.id = prop.message_id)
         -> Index Scan using message_pkey on message m (cost=0.42..34317.43 rows=1000000 width=35) (actual time=0.014..112.840 rows=1000000 loops=1)
         -> Index Scan using message_property_message_id_person_id_key on message_property prop (cost=0.42..40983.40 rows=999995 width=9) (actual time=0.019..129.035 rows=999995 loops=1)
               Index Cond: (person_id = 1)
   -> Index Only Scan using message_property_message_id_person_id_key on message_property pr (cost=0.42..40983.40 rows=999995 width=8) (actual time=0.023..138.327 rows=999995 loops=1)
         Index Cond: (person_id = 1)
         Heap Fetches: 999995
Total runtime: 907.480 ms
(13 rows)

Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entry in message_property for a user is considered un-read.

Thanks!

Andreas Jospeh Krogh

CTO / Partner - Visena AS

Mobile: +47 909 56 963

andreas@visena.com

www.visena.com

Attachment

msg-1549446-3.png

Re: Optimize query for listing un-read messages

From

Alban Hertroys

Date:

03 May 2014, 09:49:12

On 01 May 2014, at 13:06, Andreas Joseph Krogh <andreas@visena.com> wrote:

> I have the following query to list all un-read messages for person with id=1:
>
> SELECT
>     m.id                          AS message_id,
>     prop.person_id,
>     coalesce(prop.is_read, FALSE) AS is_read,
>     m.subject
> FROM message m
>     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
> WHERE 1 = 1
>       AND NOT EXISTS(SELECT
>                          *
>                      FROM message_property pr
>                      WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
>     ;

Since most messages will have prop.is_read = TRUE, that part of the query suffers from low selectivity. Querying for
theopposite is probably much faster, which you may even be able to speed up more with a partial index on is_read =
FALSE.

> Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an
entryin message_property for a user is considered un-read. 

Do you really need to query message_property twice? I would think this would give the same results:

SELECT
    m.id                          AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
    LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
;


Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

Re: Optimize query for listing un-read messages

From

Andreas Joseph Krogh

Date:

03 May 2014, 10:48:11

På lørdag 03. mai 2014 kl. 11:51:08, skrev Alban Hertroys <haramrae@gmail.com>:

On 01 May 2014, at 13:06, Andreas Joseph Krogh <andreas@visena.com> wrote:

> I have the following query to list all un-read messages for person with id=1:
>
> SELECT
> m.id AS message_id,
> prop.person_id,
> coalesce(prop.is_read, FALSE) AS is_read,
> m.subject
> FROM message m
> LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
> WHERE 1 = 1
> AND NOT EXISTS(SELECT
> *
> FROM message_property pr
> WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
> ;

Since most messages will have prop.is_read = TRUE, that part of the query suffers from low selectivity. Querying for the opposite is probably much faster, which you may even be able to speed up more with a partial index on is_read = FALSE.

> Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entry in message_property for a user is considered un-read.

Do you really need to query message_property twice? I would think this would give the same results:

SELECT
m.id AS message_id,
prop.person_id,
coalesce(prop.is_read, FALSE) AS is_read,
m.subject
FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
;

That query doesn't produce the same reesult.

Andreas Jospeh Krogh

CTO / Partner - Visena AS

Mobile: +47 909 56 963

andreas@visena.com

www.visena.com

Attachment

msg-1549449-3.png

Re: Optimize query for listing un-read messages

From

Alban Hertroys

Date:

03 May 2014, 21:19:07

On 03 May 2014, at 12:45, Andreas Joseph Krogh <andreas@visena.com> wrote:

> Do you really need to query message_property twice? I would think this would give the same results:
>
> SELECT
>     m.id                          AS message_id,
>     prop.person_id,
>     coalesce(prop.is_read, FALSE) AS is_read,
>     m.subject
> FROM message m
>     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
> ;

Ah yes, of course that would match a bit too much. This however does give the same results:

SELECT
   m.id                          AS message_id,
   prop.person_id,
   coalesce(prop.is_read, FALSE) AS is_read,
   m.subject
FROM message m
   LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE prop.is_read IS NULL OR prop.is_read = FALSE
;

That shaves off half the time of the query here, namely one indexscan.

The remaining time appears to be spent finding the rows in “message" that do not have a corresponding
“message_property"for the given (message_id, person_id) tuple. It’s basically trying to find no needle in a haystack,
youwon’t know that there is no needle until you’ve searched the entire haystack. 

It does seem to help a bit to create separate indexes on message_property.message_id and  message_property.person_id;
thatreduces the sizes of the indexes that the database needs to match and merge other in order to find the missing
message_id’s.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

Re: Optimize query for listing un-read messages

From

Andreas Joseph Krogh

Date:

03 May 2014, 21:31:38

På lørdag 03. mai 2014 kl. 23:21:21, skrev Alban Hertroys <haramrae@gmail.com>:

On 03 May 2014, at 12:45, Andreas Joseph Krogh <andreas@visena.com> wrote:

> Do you really need to query message_property twice? I would think this would give the same results:
>
> SELECT
> m.id AS message_id,
> prop.person_id,
> coalesce(prop.is_read, FALSE) AS is_read,
> m.subject
> FROM message m
> LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
> ;

Ah yes, of course that would match a bit too much. This however does give the same results:

SELECT
m.id AS message_id,
prop.person_id,
coalesce(prop.is_read, FALSE) AS is_read,
m.subject
FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE prop.is_read IS NULL OR prop.is_read = FALSE
;

That shaves off half the time of the query here, namely one indexscan.

The remaining time appears to be spent finding the rows in “message" that do not have a corresponding “message_property" for the given (message_id, person_id) tuple. It’s basically trying to find no needle in a haystack, you won’t know that there is no needle until you’ve searched the entire haystack.

It does seem to help a bit to create separate indexes on message_property.message_id and message_property.person_id; that reduces the sizes of the indexes that the database needs to match and merge other in order to find the missing message_id’s.

I think the consesus here is to create a caching-table, there's no way around it as PG is unable to index the difference between two sets.

Andreas Jospeh Krogh

CTO / Partner - Visena AS

Mobile: +47 909 56 963

andreas@visena.com

www.visena.com

Attachment

msg-1549452-3.png