Thread: Optimize query for listing un-read messages
Hi all,
I'm using PostgreSQL 9.3.2 on x86_64-unknown-linux-gnu
I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry in a table message_property holding the property (is_read) for that user.
The schema is as follows:
drop table if exists message_property;
drop table if exists message;
drop table if exists person;
create table person(
id serial primary key,
username varchar not null unique
);
create table message(
id serial primary key,
subject varchar
);
create table message_property(
message_id integer not null references message(id),
person_id integer not null references person(id),
is_read boolean not null default false,
unique(message_id, person_id)
);
insert into person(username) values('user_' || generate_series(0, 999));
insert into message(subject) values('Subject ' || random() || generate_series(0, 999999));
insert into message_property(message_id, person_id, is_read) select id, 1, true from message order by id limit 999990;
insert into message_property(message_id, person_id, is_read) select id, 1, false from message order by id limit 5 offset 999990;
analyze;
So, for person 1 there are 10 unread messages out of a total of 1 million: 5 of them have no entry in message_property at all, and 5 have an entry with is_read set to FALSE.
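A quick sanity check of that setup (assuming the inserts above ran against an empty database, so the serial ids start at 1; expected counts shown as comments):

select count(*) from message;                                               -- 1000000
select count(*) from message_property where person_id = 1;                  -- 999995
select count(*) from message_property where person_id = 1 and not is_read;  -- 5
-- the remaining 5 messages have no message_property row at all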
I have the following query to list all un-read messages for person with id=1:
SELECT
    m.id AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE 1 = 1
  AND NOT EXISTS(SELECT *
                 FROM message_property pr
                 WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
;
The problem is that it's not very efficient and performs badly; EXPLAIN ANALYZE shows:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=1.27..148784.09 rows=5 width=40) (actual time=918.906..918.913 rows=10 loops=1)
   Merge Cond: (m.id = pr.message_id)
   Join Filter: (prop.is_read AND (pr.person_id = prop.person_id))
   Rows Removed by Join Filter: 5
   ->  Merge Left Join  (cost=0.85..90300.76 rows=1000000 width=40) (actual time=0.040..530.748 rows=1000000 loops=1)
         Merge Cond: (m.id = prop.message_id)
         ->  Index Scan using message_pkey on message m  (cost=0.42..34317.43 rows=1000000 width=35) (actual time=0.014..115.829 rows=1000000 loops=1)
         ->  Index Scan using message_property_message_id_person_id_key on message_property prop  (cost=0.42..40983.40 rows=999995 width=9) (actual time=0.020..130.728 rows=999995 loops=1)
               Index Cond: (person_id = 1)
   ->  Index Only Scan using message_property_message_id_person_id_key on message_property pr  (cost=0.42..40983.40 rows=999995 width=8) (actual time=0.024..140.349 rows=999995 loops=1)
         Index Cond: (person_id = 1)
         Heap Fetches: 999995
 Total runtime: 918.975 ms
(13 rows)
Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entry in message_property for a user is considered un-read.
Thanks!
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Dear Andreas,
For me, putting both "LEFT OUTER JOIN" and "NOT EXISTS" is a bad idea. As the "LEFT OUTER JOIN" will set the fields of a non-existing right-hand row to null, I would simply rewrite it:
SELECT ... FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE prop.is_read = TRUE
I would also ensure that an efficient index is used for the outer join. I would probably try at least a multi-column index on (message_id, person_id) for the property table. I would also maybe give a try to an index on (message_id, person_id, is_read), just to see if it improves performance.
Regards,
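For reference, the suggested indexes would look roughly like this (index names are illustrative; note that the unique(message_id, person_id) constraint in the schema above already creates an equivalent two-column index, so only the three-column variant would add anything):

-- two-column index: already provided implicitly by the unique constraint,
-- shown only for completeness
create index message_property_msg_person_idx
    on message_property (message_id, person_id);

-- three-column variant including is_read, to test whether it helps
create index message_property_msg_person_read_idx
    on message_property (message_id, person_id, is_read);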
On Sunday 4 May 2014 at 14:06:35, Brice André <brice@famille-andre.be> wrote:
For me, putting both "LEFT OUTER JOIN" and "NOT EXISTS" is a bad idea. As the "LEFT OUTER JOIN" will set the fields of a non-existing right-hand row to null, I would simply rewrite it:
SELECT ... FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE prop.is_read = TRUE
The problem is that your suggested query doesn't return the desired results as it effectively is an INNER JOIN because you have "WHERE prop.is_read=TRUE", defeating the whole purpose of a LEFT OUTER JOIN.
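As an aside, the filter can be kept without defeating the outer join by letting the NULLs from missing message_property rows survive the WHERE clause; a sketch against the schema above, using IS DISTINCT FROM:

SELECT
    m.id AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
-- both "no row" (NULL) and is_read = FALSE count as unread
WHERE prop.is_read IS DISTINCT FROM TRUE;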
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Yes, I was a bit too fast. But replace it with
WHERE NOT prop.is_read = TRUE
and it should be OK.
On Sunday 4 May 2014 at 18:49:43, Brice André <brice@famille-andre.be> wrote:
Yes, I was a bit too fast. But replace it with WHERE NOT prop.is_read = TRUE and it should be OK.
No, that also will be treated as an INNER JOIN, because it kills tuples where prop is null. I need entries where prop IS NULL (hence the LEFT OUTER JOIN) because messages without an entry in message_property must be treated as unread, the same as messages with an entry in message_property where is_read=FALSE.
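The three-valued logic behind this can be checked directly: a comparison against NULL yields NULL, which WHERE treats the same as FALSE, so rows without a message_property entry get dropped:

select not (null::boolean = true);             -- NULL, so the row is filtered out
select null::boolean is distinct from true;    -- true, so the row would be kept
select coalesce(null::boolean, false) = false; -- true, an equivalent workaround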
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Forget my last answer: it was a stupid one... I tried to answer quickly, but with tiredness it does not give good results.
For me, your performance problem comes from the "WHERE NOT EXISTS (query)", because that subquery is executed for each result of the outer join.
I tried to figure out how you can avoid this with your current database design, but I did not find any solution. Maybe someone on the forum will have an idea.
If not, what I can propose is to arrange things so that, for each couple (message, user) in your database, you have a corresponding entry in message_property, so that the first solution I proposed (with an inner join) will work. And with multi-column indexes, it should be fast.
To do so, you can use the trigger mechanism on both message insertion, to create all message_property entries for that message, and on user insertion, to create all message_property entries for the user, so that you do not need to change anything outside your SQL design.
The disadvantages of this solution are that the insertion of a new message or of a new user will be slower, and that your database size will be greater, but it should solve the problem of quickly determining all read or unread messages of a given user.
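A rough sketch of the message-side trigger described above (function and trigger names are illustrative; the person-insertion side would mirror it; 9.3 syntax):

-- pre-create one message_property row per person whenever a message is inserted
create function message_property_fanout() returns trigger as $$
begin
    insert into message_property (message_id, person_id, is_read)
    select new.id, p.id, false
    from person p;
    return new;
end;
$$ language plpgsql;

create trigger message_property_fanout_trg
    after insert on message
    for each row execute procedure message_property_fanout();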
On Sunday 4 May 2014 at 19:43:11, Brice André <brice@famille-andre.be> wrote:
What I can propose is to arrange things so that, for each couple (message, user) in your database, you have a corresponding entry in message_property, so that the first solution I proposed (with an inner join) will work. And with multi-column indexes, it should be fast.
Yes, the reason it cannot be fast is that PG is unable to index the difference between two sets, so my schema, although a correct one, isn't index-friendly; a caching mechanism must be used for fast, indexed access. The solution is to redesign and have an entry in message_property for each combination of user/message.
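With a row for every (message, person) pair, the unread lookup reduces to a plain indexed scan; a sketch of what that could look like, with an illustrative index name:

-- supports "unread messages for a given person" directly
create index message_property_person_unread_idx
    on message_property (person_id, is_read, message_id);

SELECT mp.message_id, mp.person_id, mp.is_read, m.subject
FROM message_property mp
JOIN message m ON m.id = mp.message_id
WHERE mp.person_id = 1
  AND mp.is_read = FALSE;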
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963