Thread: Optimize query for listing un-read messages

Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
Hi all,
 
I'm using PostgreSQL 9.3.2 on x86_64-unknown-linux-gnu

I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry i a table message_property holding the property (is_read) for that user.
 
The schema is as follows:
 
drop table if exists message_property;
drop table if exists message;
drop table if exists person;
 
create table person(
    id serial primary key,
    username varchar not null unique
);
 
create table message(
    id serial primary key,
    subject varchar
);
 
create table message_property(
    message_id integer not null references message(id),
    person_id integer not null references person(id),
    is_read boolean not null default false,
    unique(message_id, person_id)
);
 
insert into person(username) values('user_' || generate_series(0, 999));
insert into message(subject) values('Subject ' || random() || generate_series(0, 999999));
insert into message_property(message_id, person_id, is_read) select id, 1, true from message order by id limit 999990;
insert into message_property(message_id, person_id, is_read) select id, 1, false from message order by id limit 5 offset 999990;
analyze;
 
So, for person 1 there are 10 unread messages, out of a total 1mill. 5 of those unread does not have an entry in message_property and 5 have an entry and is_read set to FALSE.
 
I have the following query to list all un-read messages for person with id=1:
 
SELECT
    m.id                          AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
    LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE 1 = 1
      AND NOT EXISTS(SELECT
                         *
                     FROM message_property pr
                     WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
    ;

The problem is that it's not quite efficient and performs badly, explain analyze shows:
                                                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=1.27..148784.09 rows=5 width=40) (actual time=918.906..918.913 rows=10 loops=1)
   Merge Cond: (m.id = pr.message_id)
   Join Filter: (prop.is_read AND (pr.person_id = prop.person_id))
   Rows Removed by Join Filter: 5
   ->  Merge Left Join  (cost=0.85..90300.76 rows=1000000 width=40) (actual time=0.040..530.748 rows=1000000 loops=1)
         Merge Cond: (m.id = prop.message_id)
         ->  Index Scan using message_pkey on message m  (cost=0.42..34317.43 rows=1000000 width=35) (actual time=0.014..115.829 rows=1000000 loops=1)
         ->  Index Scan using message_property_message_id_person_id_key on message_property prop  (cost=0.42..40983.40 rows=999995 width=9) (actual time=0.020..130.728 rows=999995 loops=1)
               Index Cond: (person_id = 1)
   ->  Index Only Scan using message_property_message_id_person_id_key on message_property pr  (cost=0.42..40983.40 rows=999995 width=8) (actual time=0.024..140.349 rows=999995 loops=1)
         Index Cond: (person_id = 1)
         Heap Fetches: 999995
 Total runtime: 918.975 ms
(13 rows)
 

Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entry in message_property for a user is considered un-read.

Thanks!
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Attachment

Re: Optimize query for listing un-read messages

From
Jochem Berndsen
Date:
Hi Andreas,

[New to this list, forgive my ignorance.]

On 05/01/2014 01:26 PM, Andreas Joseph Krogh wrote:
> I'm using PostgreSQL 9.3.2 on x86_64-unknown-linux-gnu
My machine has PostgreSQL 9.1.13 on x86_64-unknown-linux-gnu.
> I have a schema where I have lots of messages and some users who might
> have read some of them. When a message is read by a user I create an
> entry i a table message_property holding the property (is_read) for
> that user.
> The schema is as follows:
> drop table if exists message_property;
> drop table if exists message;
> drop table if exists person;
> create table person(
>     id serial primary key,
>     username varchar not null unique
> );
> create table message(
>     id serial primary key,
>     subject varchar
> );
> create table message_property(
>     message_id integer not null references message(id),
>     person_id integer not null references person(id),
>     is_read boolean not null default false,
>     unique(message_id, person_id)
> );
[snip]
> So, for person 1 there are 10 unread messages, out of a total 1mill. 5
> of those unread does not have an entry in message_property and 5 have
> an entry and is_read set to FALSE.
> I have the following query to list all un-read messages for person
> with id=1:
> SELECT
>     m.id                          AS message_id,
>     prop.person_id,
>     coalesce(prop.is_read, FALSE) AS is_read,
>     m.subject
> FROM message m
>     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id
> AND prop.person_id = 1
> WHERE 1 = 1
>       AND NOT EXISTS(SELECT
>                          *
>                      FROM message_property pr
>                      WHERE pr.message_id = m.id AND pr.person_id =
> prop.person_id AND prop.is_read = TRUE)
>     ;
>
> The problem is that it's not quite efficient and performs badly,
> explain analyze shows:
[snip]

> Does anyone have suggestions on how to optimize the query or schema?

I'm getting better performance with:

SELECT
m.id AS message_id,
1 AS person_id,
FALSE AS is_read,
m.subject
FROM message m
WHERE 1 = 1
AND NOT EXISTS(SELECT
     *
     FROM message_property pr
     WHERE pr.message_id = m.id AND pr.person_id = 1 AND pr.is_read);

You then lose the distinction between message_property with is_read =
FALSE, and nonexistent message_property for the message row.

If that is essential, I'm getting a roughly 2x speedup on my non-tuned
PostgreSQL with:
  SELECT
     m.id                          AS message_id,
     prop.person_id,
     coalesce(prop.is_read, FALSE) AS is_read,
     m.subject
FROM message m
     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND
prop.person_id = 1
WHERE not coalesce(prop.is_read, false);

HTH,
Jochem

--
Jochem Berndsen | jochem@functor.nl



Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På torsdag 01. mai 2014 kl. 20:35:07, skrev Jochem Berndsen <jochem@functor.nl>:

Hi Andreas,

[New to this list, forgive my ignorance.]
[snip]
I'm getting better performance with:

SELECT
m.id AS message_id,
1 AS person_id,
FALSE AS is_read,
m.subject
FROM message m
WHERE 1 = 1
AND NOT EXISTS(SELECT
     *
     FROM message_property pr
     WHERE pr.message_id = m.id AND pr.person_id = 1 AND pr.is_read);

You then lose the distinction between message_property with is_read =
FALSE, and nonexistent message_property for the message row.

If that is essential, I'm getting a roughly 2x speedup on my non-tuned
PostgreSQL with:
  SELECT
     m.id                          AS message_id,
     prop.person_id,
     coalesce(prop.is_read, FALSE) AS is_read,
     m.subject
FROM message m
     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND
prop.person_id = 1
WHERE not coalesce(prop.is_read, false);
 
 
Hi Jochem,
 
Thansk for looking at it. I'm still seing ~500ms being spent and I was hoping for a way to do this using index so one could achieve 1-10ms, but maybe that's impossible given the schema?
 
Is there a way to design an equivalent  schema to achieve <10ms execution-time?
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
Pavel Stehule
Date:
Hello


2014-05-01 21:17 GMT+02:00 Andreas Joseph Krogh <andreas@visena.com>:
På torsdag 01. mai 2014 kl. 20:35:07, skrev Jochem Berndsen <jochem@functor.nl>:

Hi Andreas,

[New to this list, forgive my ignorance.]
[snip]
I'm getting better performance with:

SELECT
m.id AS message_id,
1 AS person_id,
FALSE AS is_read,
m.subject
FROM message m
WHERE 1 = 1
AND NOT EXISTS(SELECT
     *
     FROM message_property pr
     WHERE pr.message_id = m.id AND pr.person_id = 1 AND pr.is_read);

You then lose the distinction between message_property with is_read =
FALSE, and nonexistent message_property for the message row.

If that is essential, I'm getting a roughly 2x speedup on my non-tuned
PostgreSQL with:
  SELECT
     m.id                          AS message_id,
     prop.person_id,
     coalesce(prop.is_read, FALSE) AS is_read,
     m.subject
FROM message m
     LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND
prop.person_id = 1
WHERE not coalesce(prop.is_read, false);
 
 
Hi Jochem,
 
Thansk for looking at it. I'm still seing ~500ms being spent and I was hoping for a way to do this using index so one could achieve 1-10ms, but maybe that's impossible given the schema?
 
Is there a way to design an equivalent  schema to achieve <10ms execution-time?

I had a perfect success on similar use case with descent ordered partial index

http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html

Regards

Pavel
 
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
 

Attachment

Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På torsdag 01. mai 2014 kl. 21:30:39, skrev Pavel Stehule <pavel.stehule@gmail.com>:
Hello
[snip]
 
I had a perfect success on similar use case with descent ordered partial index

http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html
 
I'm not getting good performance. Are you able to craft an example using my schema and partial index?
 
Thanks.
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
Pavel Stehule
Date:



2014-05-01 21:39 GMT+02:00 Andreas Joseph Krogh <andreas@visena.com>:
På torsdag 01. mai 2014 kl. 21:30:39, skrev Pavel Stehule <pavel.stehule@gmail.com>:
Hello
[snip]
 
I had a perfect success on similar use case with descent ordered partial index

http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html
 
I'm not getting good performance. Are you able to craft an example using my schema and partial index?

maybe some like

CREATE INDEX ON message_property (person_id, message_id) WHERE pr.is_read

When I am thinking about your schema, it is designed well, but it is not index friendly, so for some fast access you should to hold a cache (table) of unread messages.

Regards

Pavel
 
 
Thanks.
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
 

Attachment

Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På torsdag 01. mai 2014 kl. 21:53:32, skrev Pavel Stehule <pavel.stehule@gmail.com>:
 
 
2014-05-01 21:39 GMT+02:00 Andreas Joseph Krogh <andreas@visena.com>:
På torsdag 01. mai 2014 kl. 21:30:39, skrev Pavel Stehule <pavel.stehule@gmail.com>:
Hello
[snip]
 
I had a perfect success on similar use case with descent ordered partial index

http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html
 
I'm not getting good performance. Are you able to craft an example using my schema and partial index?
 
maybe some like
 
CREATE INDEX ON message_property (person_id, message_id) WHERE pr.is_read
 
When I am thinking about your schema, it is designed well, but it is not index friendly, so for some fast access you should to hold a cache (table) of unread messages
 
Ah, that's what I was hoping to not having to do. In my system, messages arrive all the time and having to update a cache for all new messages for all users seems messy... Seems I could just as well create a message_property for all users when a new message arrives, so I can INNER JOIN it and get good performance. But that table will quickly grow *very* large...
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
Pavel Stehule
Date:



2014-05-01 22:30 GMT+02:00 Andreas Joseph Krogh <andreas@visena.com>:
På torsdag 01. mai 2014 kl. 21:53:32, skrev Pavel Stehule <pavel.stehule@gmail.com>:
 
 
2014-05-01 21:39 GMT+02:00 Andreas Joseph Krogh <andreas@visena.com>:
På torsdag 01. mai 2014 kl. 21:30:39, skrev Pavel Stehule <pavel.stehule@gmail.com>:
Hello
[snip]
 
I had a perfect success on similar use case with descent ordered partial index

http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html
 
I'm not getting good performance. Are you able to craft an example using my schema and partial index?
 
maybe some like
 
CREATE INDEX ON message_property (person_id, message_id) WHERE pr.is_read
 
When I am thinking about your schema, it is designed well, but it is not index friendly, so for some fast access you should to hold a cache (table) of unread messages
 
Ah, that's what I was hoping to not having to do. In my system, messages arrive all the time and having to update a cache for all new messages for all users seems messy... Seems I could just as well create a message_property for all users when a new message arrives, so I can INNER JOIN it and get good performance. But that table will quickly grow *very* large...

What you need is a JOIN index, that is not possible in Postgres.

I afraid so some "ugly" solutions is necessary (when you require extra fast access). You need a index (small index) and it require some existing set - you cannot do index on the difference two sets.

I expect so some flag on the relation "message" - like "it should not be not read" can helps little bit - and can be used in partial index as conditions. Other possibility is some variant of partitioning - you can divide a messages and users to distinct sets and then you decrease a number of possible combinations.

Regards

Pavel
 
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
 

Attachment

Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På torsdag 01. mai 2014 kl. 23:02:13, skrev Pavel Stehule <pavel.stehule@gmail.com>:
 
 
2014-05-01 22:30 GMT+02:00 Andreas Joseph Krogh <andreas@visena.com>:
På torsdag 01. mai 2014 kl. 21:53:32, skrev Pavel Stehule <pavel.stehule@gmail.com>:
 
 
2014-05-01 21:39 GMT+02:00 Andreas Joseph Krogh <andreas@visena.com>:
På torsdag 01. mai 2014 kl. 21:30:39, skrev Pavel Stehule <pavel.stehule@gmail.com>:
Hello
[snip]
 
I had a perfect success on similar use case with descent ordered partial index

http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html
 
I'm not getting good performance. Are you able to craft an example using my schema and partial index?
 
maybe some like
 
CREATE INDEX ON message_property (person_id, message_id) WHERE pr.is_read
 
When I am thinking about your schema, it is designed well, but it is not index friendly, so for some fast access you should to hold a cache (table) of unread messages
 
Ah, that's what I was hoping to not having to do. In my system, messages arrive all the time and having to update a cache for all new messages for all users seems messy... Seems I could just as well create a message_property for all users when a new message arrives, so I can INNER JOIN it and get good performance. But that table will quickly grow *very* large...
 
What you need is a JOIN index, that is not possible in Postgres.
 
I afraid so some "ugly" solutions is necessary (when you require extra fast access). You need a index (small index) and it require some existing set - you cannot do index on the difference two sets.
 
I expect so some flag on the relation "message" - like "it should not be not read" can helps little bit - and can be used in partial index as conditions. Other possibility is some variant of partitioning - you can divide a messages and users to distinct sets and then you decrease a number of possible combinations.
 
Just curious:
Is such a JOIN index possible in other DBs, if so - which?
Can other DBs do index on difference between two sets?
Will PG ever have this, is it even possible?
 
In my real system the message_property holds other properties for a message also, so I have to LEFT OUTER JOIN with it to get the properties where they exist when listing messages. The system works by assuming that when an entry in message_property does not exist for a given message for a given user then the property is equal to "false".
 
I don't quite see how maintaining a message_properrty entry for all users, for all messages, is a good idea. That's quite some work to be done when adding/removing users etc.
 
Thanks for having this discussion.
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
David G Johnston
Date:
How does something like:

WITH unreads AS (
SELECT messageid FROM message
EXCEPT
SELECT messageid FROM message_property WHERE personid=1 AND has_read
)
SELECT ...
FROM unreads
JOIN messages USING (messageid)
;

perform?

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Optimize-query-for-listing-un-read-messages-tp5802097p5802157.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På torsdag 01. mai 2014 kl. 23:19:55, skrev David G Johnston <david.g.johnston@gmail.com>:
How does something like:

WITH unreads AS (
SELECT messageid FROM message
EXCEPT
SELECT messageid FROM message_property WHERE personid=1 AND has_read
)
SELECT ...
FROM unreads
JOIN messages USING (messageid)
;

perform?
 
It actually performs worse.
 
The best query so far is:
 
SELECT
    m.id                          AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
    LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND
                                             prop.person_id = 1
WHERE coalesce(prop.is_read, false) = false;
 
Giving the plan:
                                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=4.20..90300.76 rows=500000 width=40) (actual time=445.021..445.025 rows=10 loops=1)
   Merge Cond: (m.id = prop.message_id)
   Filter: (NOT COALESCE(prop.is_read, false))
   Rows Removed by Filter: 999990
   ->  Index Scan using message_pkey on message m  (cost=0.42..34317.43 rows=1000000 width=35) (actual time=0.014..113.314 rows=1000000 loops=1)
   ->  Index Scan using message_property_message_id_person_id_key on message_property prop  (cost=0.42..40983.40 rows=999995 width=9) (actual time=0.018..115.019 rows=999995 loops=1)
         Index Cond: (person_id = 1)
 Total runtime: 445.076 ms
(8 rows)
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
Tomas Vondra
Date:
On 1.5.2014 23:19, Andreas Joseph Krogh wrote:
> På torsdag 01. mai 2014 kl. 23:02:13, skrev Pavel Stehule
> <pavel.stehule@gmail.com <mailto:pavel.stehule@gmail.com>>:
>
>
>
>     2014-05-01 22:30 GMT+02:00 Andreas Joseph Krogh <andreas@visena.com
>     <mailto:andreas@visena.com>>:
>
>         På torsdag 01. mai 2014 kl. 21:53:32, skrev Pavel Stehule
>         <pavel.stehule@gmail.com <mailto:pavel.stehule@gmail.com>>:
>
>
>
>             2014-05-01 21:39 GMT+02:00 Andreas Joseph Krogh
>             <andreas@visena.com <mailto:andreas@visena.com>>:
>
>                 På torsdag 01. mai 2014 kl. 21:30:39, skrev Pavel
>                 Stehule <pavel.stehule@gmail.com
>                 <mailto:pavel.stehule@gmail.com>>:
>
>                     Hello
>                     [snip]
>
>                     I had a perfect success on similar use case with
>                     descent ordered partial index
>
>                     http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html
>
>
>                 I'm not getting good performance. Are you able to craft
>                 an example using my schema and partial index?
>
>
>             maybe some like
>
>             CREATE INDEX ON message_property (person_id, message_id)
>             WHERE pr.is_read
>
>             When I am thinking about your schema, it is designed well,
>             but it is not index friendly, so for some fast access you
>             should to hold a cache (table) of unread messages
>
>
>         Ah, that's what I was hoping to not having to do. In my system,
>         messages arrive all the time and having to update a cache for
>         all new messages for all users seems messy... Seems I could just
>         as well create a message_property for all users when a new
>         message arrives, so I can INNER JOIN it and get good
>         performance. But that table will quickly grow *very* large...
>
>
>     What you need is a JOIN index, that is not possible in Postgres.
>
>     I afraid so some "ugly" solutions is necessary (when you require
>     extra fast access). You need a index (small index) and it require
>     some existing set - you cannot do index on the difference two sets.
>
>     I expect so some flag on the relation "message" - like "it should
>     not be not read" can helps little bit - and can be used in partial
>     index as conditions. Other possibility is some variant of
>     partitioning - you can divide a messages and users to distinct sets
>     and then you decrease a number of possible combinations.
>
>
> Just curious:
> Is such a JOIN index possible in other DBs, if so - which?
> Can other DBs do index on difference between two sets?
> Will PG ever have this, is it even possible?

I'm not aware of such database, but maybe it's possible at least for
some cases. But I'd expect that to significantly depend on the schema.
And I'm not aware of any such effort in case of PostgreSQL, do don't
hold your breath.

IMHO the problem with your schema is that while each 'read' message has
a matching row in message_property, 'undread' messages may or may not
have a matching row. Is there a particular reason for that?

If you could get rid of this, i.e. require that each pair (message,
recipient) has a row in message_propery (irrespectedly whether the
message was read or not), you can do this:

CREATE INDEX message_unread_idx
    ON message_property(message_id, person_id) WHERE (NOT is_read)

and then simply do the query like this:

SELECT
    m.id,
    prop.person_id,
    prop.is_read,
    m.subject
FROM messages m JOIN message_property p ON (m.id = p.message_id)
WHERE (NOT is_read) AND person_id = :pid

and I'd expect this to use the partial index, and being much faster.

regards
Tomas



Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På torsdag 01. mai 2014 kl. 23:45:49, skrev Tomas Vondra <tv@fuzzy.cz>:
On 1.5.2014 23:19, Andreas Joseph Krogh wrote:
> På torsdag 01. mai 2014 kl. 23:02:13, skrev Pavel Stehule
> <pavel.stehule@gmail.com <mailto:pavel.stehule@gmail.com>>:
>
>     
>     
>     2014-05-01 22:30 GMT+02:00 Andreas Joseph Krogh <andreas@visena.com
>     <mailto:andreas@visena.com>>:
>
>         På torsdag 01. mai 2014 kl. 21:53:32, skrev Pavel Stehule
>         <pavel.stehule@gmail.com <mailto:pavel.stehule@gmail.com>>:
>
>             
>             
>             2014-05-01 21:39 GMT+02:00 Andreas Joseph Krogh
>             <andreas@visena.com <mailto:andreas@visena.com>>:
>
>                 På torsdag 01. mai 2014 kl. 21:30:39, skrev Pavel
>                 Stehule <pavel.stehule@gmail.com
>                 <mailto:pavel.stehule@gmail.com>>:
>
>                     Hello
>                     [snip]
>                     
>                     I had a perfect success on similar use case with
>                     descent ordered partial index
>
>                     http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html
>
>                 
>                 I'm not getting good performance. Are you able to craft
>                 an example using my schema and partial index?
>
>             
>             maybe some like
>             
>             CREATE INDEX ON message_property (person_id, message_id)
>             WHERE pr.is_read
>             
>             When I am thinking about your schema, it is designed well,
>             but it is not index friendly, so for some fast access you
>             should to hold a cache (table) of unread messages
>
>         
>         Ah, that's what I was hoping to not having to do. In my system,
>         messages arrive all the time and having to update a cache for
>         all new messages for all users seems messy... Seems I could just
>         as well create a message_property for all users when a new
>         message arrives, so I can INNER JOIN it and get good
>         performance. But that table will quickly grow *very* large...
>
>     
>     What you need is a JOIN index, that is not possible in Postgres.
>     
>     I afraid so some "ugly" solutions is necessary (when you require
>     extra fast access). You need a index (small index) and it require
>     some existing set - you cannot do index on the difference two sets.
>     
>     I expect so some flag on the relation "message" - like "it should
>     not be not read" can helps little bit - and can be used in partial
>     index as conditions. Other possibility is some variant of
>     partitioning - you can divide a messages and users to distinct sets
>     and then you decrease a number of possible combinations.
>

> Just curious:
> Is such a JOIN index possible in other DBs, if so - which?
> Can other DBs do index on difference between two sets?
> Will PG ever have this, is it even possible?

I'm not aware of such database, but maybe it's possible at least for
some cases. But I'd expect that to significantly depend on the schema.
And I'm not aware of any such effort in case of PostgreSQL, do don't
hold your breath.

IMHO the problem with your schema is that while each 'read' message has
a matching row in message_property, 'undread' messages may or may not
have a matching row. Is there a particular reason for that?
 
 
Yes. The point is that maintaining a message_property pair for all messages for all users in the system imposes quite a maintainance-headache. As the schema is now any new message is per definition un-read, and when a user reads it then it gets an entry with is_read=true in message_property. This table holds other properties too. This way I'm avoiding having to book-keep so much when a new message arrives and when a new user is created. A message in my system does not necessarily have only one recipient, it might have one, many or none, and might be visible to many.
 
If you could get rid of this, i.e. require that each pair (message,
recipient) has a row in message_propery (irrespectedly whether the
message was read or not), you can do this:

CREATE INDEX message_unread_idx
    ON message_property(message_id, person_id) WHERE (NOT is_read)

and then simply do the query like this:

SELECT
    m.id,
    prop.person_id,
    prop.is_read,
    m.subject
FROM messages m JOIN message_property p ON (m.id = p.message_id)
WHERE (NOT is_read) AND person_id = :pid

and I'd expect this to use the partial index, and being much faster.
 
I'm aware of the performance-gain using such a plain JOIN-query.
 
Thanks for your feedback.
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
Tomas Vondra
Date:
On 1.5.2014 23:58, Andreas Joseph Krogh wrote:
> På torsdag 01. mai 2014 kl. 23:45:49, skrev Tomas Vondra <tv@fuzzy.cz
> <mailto:tv@fuzzy.cz>>:
>
>     On 1.5.2014 23:19, Andreas Joseph Krogh wrote:
>     > Just curious:
>     > Is such a JOIN index possible in other DBs, if so - which?
>     > Can other DBs do index on difference between two sets?
>     > Will PG ever have this, is it even possible?
>
>     I'm not aware of such database, but maybe it's possible at least for
>     some cases. But I'd expect that to significantly depend on the schema.
>     And I'm not aware of any such effort in case of PostgreSQL, do don't
>     hold your breath.
>
>     IMHO the problem with your schema is that while each 'read' message has
>     a matching row in message_property, 'undread' messages may or may not
>     have a matching row. Is there a particular reason for that?
>
>
>
> Yes. The point is that maintaining a message_property pair for all
> messages for all users in the system imposes quite a
> maintainance-headache. As the schema is now any new message is per
> definition un-read, and when a user reads it then it gets an entry with
> is_read=true in message_property. This table holds other properties too.
> This way I'm avoiding having to book-keep so much when a new message
> arrives and when a new user is created. A message in my system does not
> necessarily have only one recipient, it might have one, many or none,
> and might be visible to many.

So how do you determine who's the recipient of a message? Or is that the
case that everyone can read everything (which is why you're displaying
them the unread messages, right)?

I understand you're trying to solve this without storing a row for each
possible message-person combination, but won't this eventually happen
anyway (with is_read=true for all rows)?

Tomas


Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På fredag 02. mai 2014 kl. 00:34:34, skrev Tomas Vondra <tv@fuzzy.cz>:
On 1.5.2014 23:58, Andreas Joseph Krogh wrote:
> På torsdag 01. mai 2014 kl. 23:45:49, skrev Tomas Vondra <tv@fuzzy.cz
> <mailto:tv@fuzzy.cz>>:
>
>     On 1.5.2014 23:19, Andreas Joseph Krogh wrote:
>     > Just curious:
>     > Is such a JOIN index possible in other DBs, if so - which?
>     > Can other DBs do index on difference between two sets?
>     > Will PG ever have this, is it even possible?
>
>     I'm not aware of such database, but maybe it's possible at least for
>     some cases. But I'd expect that to significantly depend on the schema.
>     And I'm not aware of any such effort in case of PostgreSQL, do don't
>     hold your breath.
>
>     IMHO the problem with your schema is that while each 'read' message has
>     a matching row in message_property, 'undread' messages may or may not
>     have a matching row. Is there a particular reason for that?
>


> Yes. The point is that maintaining a message_property pair for all
> messages for all users in the system imposes quite a
> maintainance-headache. As the schema is now any new message is per
> definition un-read, and when a user reads it then it gets an entry with
> is_read=true in message_property. This table holds other properties too.
> This way I'm avoiding having to book-keep so much when a new message
> arrives and when a new user is created. A message in my system does not
> necessarily have only one recipient, it might have one, many or none,
> and might be visible to many.

So how do you determine who's the recipient of a message? Or is that the
case that everyone can read everything (which is why you're displaying
them the unread messages, right)?
 
 
A message might have a recipient and might be read by others.
 
I understand you're trying to solve this without storing a row for each
possible message-person combination, but won't this eventually happen
anyway (with is_read=true for all rows)?
 
 
I will end up with that only if all users read all messages, which is not nearly the case.
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
David G Johnston
Date:
Andreas Joseph Krogh-2 wrote
> I will end up with that only if
> all users read all messages, which is not nearly the case.

These observations probably won't help but...

You have what amounts to a mathematical "spare matrix" problem on your
hands...

Is there any way to expire messages so that dimension does not grow
unbounded?

Per-User caching does seem to be something that is going to be needed...

Depending on how many users are being tracked would storing the "reader_id"
in an indexed array improve matters?  " SELECT ... FROM message WHERE NOT (1
= ANY(reader_ids)) ; UPDATE message SET reader_ids = reader_ids || 1 WHERE
messageid = ..."  I'm not that familiar with how well indexes over arrays
work or which kind is needed (i.e. gin/gist).

HTH

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Optimize-query-for-listing-un-read-messages-tp5802097p5802170.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På fredag 02. mai 2014 kl. 00:55:25, skrev David G Johnston <david.g.johnston@gmail.com>:
Andreas Joseph Krogh-2 wrote
> I will end up with that only if
> all users read all messages, which is not nearly the case.

These observations probably won't help but...

You have what amounts to a mathematical "spare matrix" problem on your
hands...

Is there any way to expire messages so that dimension does not grow
unbounded?
 
 
No, unfortunately...
 
Per-User caching does seem to be something that is going to be needed...

Depending on how many users are being tracked would storing the "reader_id"
in an indexed array improve matters?  " SELECT ... FROM message WHERE NOT (1
= ANY(reader_ids)) ; UPDATE message SET reader_ids = reader_ids || 1 WHERE
messageid = ..."  I'm not that familiar with how well indexes over arrays
work or which kind is needed (i.e. gin/gist).
 
 
"is_read" is one of many properties being tracked for a message...
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
David G Johnston
Date:

Per-User caching does seem to be something that is going to be needed...

Depending on how many users are being tracked would storing the "reader_id"
in an indexed array improve matters?  " SELECT ... FROM message WHERE NOT (1
= ANY(reader_ids)) ; UPDATE message SET reader_ids = reader_ids || 1 WHERE
messageid = ..."  I'm not that familiar with how well indexes over arrays
work or which kind is needed (i.e. gin/gist).
 
 
"is_read" is one of many properties being tracked for a message...


​But you don't have to have all of them on the same table.  Once you've identified the messages in question performing a standard join onto a supplemental detail table should be fairly straight-forward.

Do these other properties have values when "is_read" is false or only when "is_read" is true?  Since you already allow for the possibility of a missing record (giving it the meaning of "not read")​ these other properties cannot currently exist in that situation.

David J.


View this message in context: Re: Optimize query for listing un-read messages
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På fredag 02. mai 2014 kl. 01:58:04, skrev David G Johnston <david.g.johnston@gmail.com>:

Per-User caching does seem to be something that is going to be needed...

Depending on how many users are being tracked would storing the "reader_id"
in an indexed array improve matters?  " SELECT ... FROM message WHERE NOT (1
= ANY(reader_ids)) ; UPDATE message SET reader_ids = reader_ids || 1 WHERE
messageid = ..."  I'm not that familiar with how well indexes over arrays
work or which kind is needed (i.e. gin/gist).
 
 
"is_read" is one of many properties being tracked for a message...
 
 
​But you don't have to have all of them on the same table.  Once you've identified the messages in question performing a standard join onto a supplemental detail table should be fairly straight-forward.
 
Do these other properties have values when "is_read" is false or only when "is_read" is true?  Since you already allow for the possibility of a missing record (giving it the meaning of "not read")​ these other properties cannot currently exist in that situation.
 
 
A message might hold a property (ie. is_important) when is_read is FALSE (it might be set back to is_read=FALSE after being read the first time).
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
Craig James
Date:
On Thu, May 1, 2014 at 4:26 AM, Andreas Joseph Krogh <andreas@visena.com> wrote:
I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry i a table message_property holding the property (is_read) for that user.
 
The schema is as follows:
[...]
 
create table person(
    id serial primary key,
    username varchar not null unique
);
 
create table message(
    id serial primary key,
    subject varchar
);
 
create table message_property(
    message_id integer not null references message(id),
    person_id integer not null references person(id),
    is_read boolean not null default false,
    unique(message_id, person_id)
);
[...]
 So, for person 1 there are 10 unread messages, out of a total 1mill. 5 of those unread does not have an entry in message_property and 5 have an entry and is_read set to FALSE.

Here's a possible enhancement: add two columns, an indexed timestamp to the message table, and a "timestamp of the oldest message this user has NOT read" on the person table. If most users read messages in a timely fashion, this would (in most cases) narrow down the portion of the messages table to a tiny fraction of the total -- just those messages newer than the oldest message this user has not read.

When you sign up a new user, you can set his timestamp to the time the account was created, since presumably messages before that time don't apply.

Whether this will help depends a lot on actual use patterns, i.e. do users typically read all messages or do they leave a bunch of unread messages sitting around forever?

Craig

Re: Optimize query for listing un-read messages

From
Andreas Joseph Krogh
Date:
På fredag 02. mai 2014 kl. 02:17:58, skrev Craig James <cjames@emolecules.com>:
On Thu, May 1, 2014 at 4:26 AM, Andreas Joseph Krogh <andreas@visena.com> wrote:
I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry i a table message_property holding the property (is_read) for that user.
 
The schema is as follows:
[...]
 
create table person(
    id serial primary key,
    username varchar not null unique
);
 
create table message(
    id serial primary key,
    subject varchar
);
 
create table message_property(
    message_id integer not null references message(id),
    person_id integer not null references person(id),
    is_read boolean not null default false,
    unique(message_id, person_id)
);
 
[...]
 So, for person 1 there are 10 unread messages, out of a total 1mill. 5 of those unread does not have an entry in message_property and 5 have an entry and is_read set to FALSE.
 
Here's a possible enhancement: add two columns, an indexed timestamp to the message table, and a "timestamp of the oldest message this user has NOT read" on the person table. If most users read messages in a timely fashion, this would (in most cases) narrow down the portion of the messages table to a tiny fraction of the total -- just those messages newer than the oldest message this user has not read.
 
When you sign up a new user, you can set his timestamp to the time the account was created, since presumably messages before that time don't apply.
 
Whether this will help depends a lot on actual use patterns, i.e. do users typically read all messages or do they leave a bunch of unread messages sitting around forever?
 
Thanks fort the suggestion. A user must be able to read arbitrary old messages, and messages don't expire.
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Optimize query for listing un-read messages

From
Vitalii Tymchyshyn
Date:
What statistics do you have on the data? I suppose most messages are read by low number of users, mostly 0 or one.
I can see two options to consider:
1) Use arrays to store information on which users have already read the message. You may need GIN/GIST index to search fast.
2) Introduce some kind of special column(s) for the cases when the message is unread by everybody or was read by at most one user. E.g. read_by columns with null value for unread, special value for read by many and real user if read by only one.
in this case your condition would be (read_by is null or read_by not in (current_user or special_value) or (read_by = special_value and not exists()). Note that optimizer may have problems with such a complex expression nd you may need to use "union all" instead on "or". Partial index(es) for null/special value may help.

Best regards, Vitalii Tymchyshyn


2014-05-02 10:20 GMT+03:00 Andreas Joseph Krogh <andreas@visena.com>:
På fredag 02. mai 2014 kl. 02:17:58, skrev Craig James <cjames@emolecules.com>:
On Thu, May 1, 2014 at 4:26 AM, Andreas Joseph Krogh <andreas@visena.com> wrote:
I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry i a table message_property holding the property (is_read) for that user.
 
The schema is as follows:
[...]
 
create table person(
    id serial primary key,
    username varchar not null unique
);
 
create table message(
    id serial primary key,
    subject varchar
);
 
create table message_property(
    message_id integer not null references message(id),
    person_id integer not null references person(id),
    is_read boolean not null default false,
    unique(message_id, person_id)
);
 
[...]
 So, for person 1 there are 10 unread messages, out of a total 1mill. 5 of those unread does not have an entry in message_property and 5 have an entry and is_read set to FALSE.
 
Here's a possible enhancement: add two columns, an indexed timestamp to the message table, and a "timestamp of the oldest message this user has NOT read" on the person table. If most users read messages in a timely fashion, this would (in most cases) narrow down the portion of the messages table to a tiny fraction of the total -- just those messages newer than the oldest message this user has not read.
 
When you sign up a new user, you can set his timestamp to the time the account was created, since presumably messages before that time don't apply.
 
Whether this will help depends a lot on actual use patterns, i.e. do users typically read all messages or do they leave a bunch of unread messages sitting around forever?
 
Thanks fort the suggestion. A user must be able to read arbitrary old messages, and messages don't expire.
 
--
Andreas Jospeh Krogh
CTO / Partner - Visena AS
 

Attachment