Thread: Optimize query for listing un-read messages
Hi all,
I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry i a table message_property holding the property (is_read) for that user.
The schema is as follows:
drop table if exists message_property;
drop table if exists message;
drop table if exists person;
drop table if exists message;
drop table if exists person;
create table person(
id serial primary key,
username varchar not null unique
);
id serial primary key,
username varchar not null unique
);
create table message(
id serial primary key,
subject varchar
);
id serial primary key,
subject varchar
);
create table message_property(
message_id integer not null references message(id),
person_id integer not null references person(id),
is_read boolean not null default false,
unique(message_id, person_id)
);
message_id integer not null references message(id),
person_id integer not null references person(id),
is_read boolean not null default false,
unique(message_id, person_id)
);
insert into person(username) values('user_' || generate_series(0, 999));
insert into message(subject) values('Subject ' || random() || generate_series(0, 999999));
insert into message_property(message_id, person_id, is_read) select id, 1, true from message order by id limit 999990;
insert into message_property(message_id, person_id, is_read) select id, 1, false from message order by id limit 5 offset 999990;
insert into message_property(message_id, person_id, is_read) select id, 1, false from message order by id limit 5 offset 999990;
So, for person 1 there are 10 unread messages, out of a total 1mill. 5 of those unread does not have an entry in message_property and 5 have an entry and is_read set to FALSE.
I have the following query to list all un-read messages for person with id=1:
SELECT
m.id AS message_id,
prop.person_id,
coalesce(prop.is_read, FALSE) AS is_read,
m.subject
FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE 1 = 1
AND NOT EXISTS(SELECT
*
FROM message_property pr
WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
;
m.id AS message_id,
prop.person_id,
coalesce(prop.is_read, FALSE) AS is_read,
m.subject
FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE 1 = 1
AND NOT EXISTS(SELECT
*
FROM message_property pr
WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
;
The problem is that it's not quite efficient and performs badly, explain analyze shows:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Anti Join (cost=1.27..148784.09 rows=5 width=40) (actual time=907.412..907.419 rows=10 loops=1)
Merge Cond: (m.id = pr.message_id)
Join Filter: (prop.is_read AND (pr.person_id = prop.person_id))
Rows Removed by Join Filter: 5
-> Merge Left Join (cost=0.85..90300.76 rows=1000000 width=40) (actual time=0.039..524.070 rows=1000000 loops=1)
Merge Cond: (m.id = prop.message_id)
-> Index Scan using message_pkey on message m (cost=0.42..34317.43 rows=1000000 width=35) (actual time=0.014..112.840 rows=1000000 loops=1)
-> Index Scan using message_property_message_id_person_id_key on message_property prop (cost=0.42..40983.40 rows=999995 width=9) (actual time=0.019..129.035 rows=999995 loops=1)
Index Cond: (person_id = 1)
-> Index Only Scan using message_property_message_id_person_id_key on message_property pr (cost=0.42..40983.40 rows=999995 width=8) (actual time=0.023..138.327 rows=999995 loops=1)
Index Cond: (person_id = 1)
Heap Fetches: 999995
Total runtime: 907.480 ms
(13 rows)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Anti Join (cost=1.27..148784.09 rows=5 width=40) (actual time=907.412..907.419 rows=10 loops=1)
Merge Cond: (m.id = pr.message_id)
Join Filter: (prop.is_read AND (pr.person_id = prop.person_id))
Rows Removed by Join Filter: 5
-> Merge Left Join (cost=0.85..90300.76 rows=1000000 width=40) (actual time=0.039..524.070 rows=1000000 loops=1)
Merge Cond: (m.id = prop.message_id)
-> Index Scan using message_pkey on message m (cost=0.42..34317.43 rows=1000000 width=35) (actual time=0.014..112.840 rows=1000000 loops=1)
-> Index Scan using message_property_message_id_person_id_key on message_property prop (cost=0.42..40983.40 rows=999995 width=9) (actual time=0.019..129.035 rows=999995 loops=1)
Index Cond: (person_id = 1)
-> Index Only Scan using message_property_message_id_person_id_key on message_property pr (cost=0.42..40983.40 rows=999995 width=8) (actual time=0.023..138.327 rows=999995 loops=1)
Index Cond: (person_id = 1)
Heap Fetches: 999995
Total runtime: 907.480 ms
(13 rows)
Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entry in message_property for a user is considered un-read.
Thanks!
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Attachment
On 01 May 2014, at 13:06, Andreas Joseph Krogh <andreas@visena.com> wrote: > I have the following query to list all un-read messages for person with id=1: > > SELECT > m.id AS message_id, > prop.person_id, > coalesce(prop.is_read, FALSE) AS is_read, > m.subject > FROM message m > LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 > WHERE 1 = 1 > AND NOT EXISTS(SELECT > * > FROM message_property pr > WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE) > ; Since most messages will have prop.is_read = TRUE, that part of the query suffers from low selectivity. Querying for theopposite is probably much faster, which you may even be able to speed up more with a partial index on is_read = FALSE. > Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entryin message_property for a user is considered un-read. Do you really need to query message_property twice? I would think this would give the same results: SELECT m.id AS message_id, prop.person_id, coalesce(prop.is_read, FALSE) AS is_read, m.subject FROM message m LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE ; Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll find there is no forest.
På lørdag 03. mai 2014 kl. 11:51:08, skrev Alban Hertroys <haramrae@gmail.com>:
On 01 May 2014, at 13:06, Andreas Joseph Krogh <andreas@visena.com> wrote:
> I have the following query to list all un-read messages for person with id=1:
>
> SELECT
> m.id AS message_id,
> prop.person_id,
> coalesce(prop.is_read, FALSE) AS is_read,
> m.subject
> FROM message m
> LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
> WHERE 1 = 1
> AND NOT EXISTS(SELECT
> *
> FROM message_property pr
> WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
> ;
Since most messages will have prop.is_read = TRUE, that part of the query suffers from low selectivity. Querying for the opposite is probably much faster, which you may even be able to speed up more with a partial index on is_read = FALSE.
> Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entry in message_property for a user is considered un-read.
Do you really need to query message_property twice? I would think this would give the same results:
SELECT
m.id AS message_id,
prop.person_id,
coalesce(prop.is_read, FALSE) AS is_read,
m.subject
FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
;
That query doesn't produce the same reesult.
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Attachment
On 03 May 2014, at 12:45, Andreas Joseph Krogh <andreas@visena.com> wrote: > Do you really need to query message_property twice? I would think this would give the same results: > > SELECT > m.id AS message_id, > prop.person_id, > coalesce(prop.is_read, FALSE) AS is_read, > m.subject > FROM message m > LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE > ; Ah yes, of course that would match a bit too much. This however does give the same results: SELECT m.id AS message_id, prop.person_id, coalesce(prop.is_read, FALSE) AS is_read, m.subject FROM message m LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 WHERE prop.is_read IS NULL OR prop.is_read = FALSE ; That shaves off half the time of the query here, namely one indexscan. The remaining time appears to be spent finding the rows in “message" that do not have a corresponding “message_property"for the given (message_id, person_id) tuple. It’s basically trying to find no needle in a haystack, youwon’t know that there is no needle until you’ve searched the entire haystack. It does seem to help a bit to create separate indexes on message_property.message_id and message_property.person_id; thatreduces the sizes of the indexes that the database needs to match and merge other in order to find the missing message_id’s. Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll find there is no forest.
På lørdag 03. mai 2014 kl. 23:21:21, skrev Alban Hertroys <haramrae@gmail.com>:
On 03 May 2014, at 12:45, Andreas Joseph Krogh <andreas@visena.com> wrote:
> Do you really need to query message_property twice? I would think this would give the same results:
>
> SELECT
> m.id AS message_id,
> prop.person_id,
> coalesce(prop.is_read, FALSE) AS is_read,
> m.subject
> FROM message m
> LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1 AND prop.is_read = FALSE
> ;
Ah yes, of course that would match a bit too much. This however does give the same results:
SELECT
m.id AS message_id,
prop.person_id,
coalesce(prop.is_read, FALSE) AS is_read,
m.subject
FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE prop.is_read IS NULL OR prop.is_read = FALSE
;
That shaves off half the time of the query here, namely one indexscan.
The remaining time appears to be spent finding the rows in “message" that do not have a corresponding “message_property" for the given (message_id, person_id) tuple. It’s basically trying to find no needle in a haystack, you won’t know that there is no needle until you’ve searched the entire haystack.
It does seem to help a bit to create separate indexes on message_property.message_id and message_property.person_id; that reduces the sizes of the indexes that the database needs to match and merge other in order to find the missing message_id’s.
I think the consesus here is to create a caching-table, there's no way around it as PG is unable to index the difference between two sets.
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963