Thread: Optimize query for listing un-read messages

Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
Hi all,
 
I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry i a table message_property holding the property (is_read) for that user.
 
The schema is as follows:
 
drop table if exists message_property;
drop table if exists message;
drop table if exists person;
 
create table person(
    id serial primary key,
    username varchar not null unique
);
 
create table message(
    id serial primary key,
    subject varchar
);
 
create table message_property(
    message_id integer not null references message(id),
    person_id integer not null references person(id),
    is_read boolean not null default false,
    unique(message_id, person_id)
);
 
insert into person(username) values('user_' || generate_series(0, 999));
insert into message(subject) values('Subject ' || random() || generate_series(0, 999999));
insert into message_property(message_id, person_id, is_read) select id, 1, true from message order by id limit 999990;
insert into message_property(message_id, person_id, is_read) select id, 1, false from message order by id limit 5 offset 999990;
 
 
So, for person 1 there are 10 unread messages, out of a total 1mill. 5 of those unread does not have an entry in message_property and 5 have an entry and is_read set to FALSE.
 
I have the following query to list all un-read messages for person with id=1:
 
SELECT
    m.id                          AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
    LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE 1 = 1
      AND NOT EXISTS(SELECT
                         *
                     FROM message_property pr
                     WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
    ;
 
The problem is that it's not quite efficient and performs badly, explain analyze shows:
                                                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=1.27..148784.09 rows=5 width=40) (actual time=907.412..907.419 rows=10 loops=1)
   Merge Cond: (m.id = pr.message_id)
   Join Filter: (prop.is_read AND (pr.person_id = prop.person_id))
   Rows Removed by Join Filter: 5
   ->  Merge Left Join  (cost=0.85..90300.76 rows=1000000 width=40) (actual time=0.039..524.070 rows=1000000 loops=1)
         Merge Cond: (m.id = prop.message_id)
         ->  Index Scan using message_pkey on message m  (cost=0.42..34317.43 rows=1000000 width=35) (actual time=0.014..112.840 rows=1000000 loops=1)
         ->  Index Scan using message_property_message_id_person_id_key on message_property prop  (cost=0.42..40983.40 rows=999995 width=9) (actual time=0.019..129.035 rows=999995 loops=1)
               Index Cond: (person_id = 1)
   ->  Index Only Scan using message_property_message_id_person_id_key on message_property pr  (cost=0.42..40983.40 rows=999995 width=8) (actual time=0.023..138.327 rows=999995 loops=1)
         Index Cond: (person_id = 1)
         Heap Fetches: 999995
 Total runtime: 907.480 ms
(13 rows)
 
 
Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entry in message_property for a user is considered un-read.
 
Thanks!
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Attachment

Re: Optimize query for listing un-read messages

From
Alban Hertroys
Date:
On 01 May 2014, at 13:06, Andreas Joseph Krogh <andreas@visena.com> wrote:

> I have the following query to list all un-read messages for person with id=1:
>
> SELECT
>     m.id                          AS message_id,
>     prop.person_id,
>     coalesce(prop.is_read, FALSE) AS is_read,
>     m.subject
> FROM message m
>     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
> WHERE 1 = 1
>       AND NOT EXISTS(SELECT
>                          *
>                      FROM message_property pr
>                      WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
>     ;

Since most messages will have prop.is_read = TRUE, that part of the query suffers from low selectivity. Querying for
theopposite is probably much faster, which you may even be able to speed up more with a partial index on is_read =
FALSE.

> Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an
entryin message_property for a user is considered un-read. 

Do you really need to query message_property twice? I would think this would give the same results:

SELECT
    m.id                          AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
    LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
;


Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På lørdag 03. mai 2014 kl. 11:51:08, skrev Alban Hertroys <haramrae@gmail.com>:
On 01 May 2014, at 13:06, Andreas Joseph Krogh <andreas@visena.com> wrote:

> I have the following query to list all un-read messages for person with id=1:

> SELECT
>     m.id                          AS message_id,
>     prop.person_id,
>     coalesce(prop.is_read, FALSE) AS is_read,
>     m.subject
> FROM message m
>     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
> WHERE 1 = 1
>       AND NOT EXISTS(SELECT
>                          *
>                      FROM message_property pr
>                      WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
>     ;

Since most messages will have prop.is_read = TRUE, that part of the query suffers from low selectivity. Querying for the opposite is probably much faster, which you may even be able to speed up more with a partial index on is_read = FALSE.

> Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entry in message_property for a user is considered un-read.

Do you really need to query message_property twice? I would think this would give the same results:

SELECT
    m.id                          AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
    LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
;
 
That query doesn't produce the same reesult.
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
Alban Hertroys
Date:
On 03 May 2014, at 12:45, Andreas Joseph Krogh <andreas@visena.com> wrote:

> Do you really need to query message_property twice? I would think this would give the same results:
>
> SELECT
>     m.id                          AS message_id,
>     prop.person_id,
>     coalesce(prop.is_read, FALSE) AS is_read,
>     m.subject
> FROM message m
>     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
> ;

Ah yes, of course that would match a bit too much. This however does give the same results:

SELECT
   m.id                          AS message_id,
   prop.person_id,
   coalesce(prop.is_read, FALSE) AS is_read,
   m.subject
FROM message m
   LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE prop.is_read IS NULL OR prop.is_read = FALSE
;

That shaves off half the time of the query here, namely one indexscan.

The remaining time appears to be spent finding the rows in “message" that do not have a corresponding
“message_property"for the given (message_id, person_id) tuple. It’s basically trying to find no needle in a haystack,
youwon’t know that there is no needle until you’ve searched the entire haystack. 

It does seem to help a bit to create separate indexes on message_property.message_id and  message_property.person_id;
thatreduces the sizes of the indexes that the database needs to match and merge other in order to find the missing
message_id’s.


Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På lørdag 03. mai 2014 kl. 23:21:21, skrev Alban Hertroys <haramrae@gmail.com>:

On 03 May 2014, at 12:45, Andreas Joseph Krogh <andreas@visena.com> wrote:

> Do you really need to query message_property twice? I would think this would give the same results:
>
> SELECT
>     m.id                          AS message_id,
>     prop.person_id,
>     coalesce(prop.is_read, FALSE) AS is_read,
>     m.subject
> FROM message m
>     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
> ;

Ah yes, of course that would match a bit too much. This however does give the same results:

SELECT
   m.id                          AS message_id,
   prop.person_id,
   coalesce(prop.is_read, FALSE) AS is_read,
   m.subject
FROM message m
   LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE prop.is_read IS NULL OR prop.is_read = FALSE
;

That shaves off half the time of the query here, namely one indexscan.

The remaining time appears to be spent finding the rows in “message" that do not have a corresponding “message_property" for the given (message_id, person_id) tuple. It’s basically trying to find no needle in a haystack, you won’t know that there is no needle until you’ve searched the entire haystack.

It does seem to help a bit to create separate indexes on message_property.message_id and  message_property.person_id; that reduces the sizes of the indexes that the database needs to match and merge other in order to find the missing message_id’s.
 
I think the consesus here is to create a caching-table, there's no way around it as PG is unable to index the difference between two sets.
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment