Thread: How to get good performance for very large lists/sets?

How to get good performance for very large lists/sets?

From: Richard Frith-Macdonald
I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.

I have a database which uses multiple  lists of items roughly like this:

CREATE TABLE List (
  ID SERIAL,
  Name VARCHAR ....
);

and a table containing individual entries in the lists:

CREATE TABLE ListEntry (
  ListID INT, /* Reference the List table */
  ItemID INT /* References an Item table */
) ;
CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);

Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.

Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like find all the items in list A but not list B, things can get very slow (particularly when both lists contain millions of common items).
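
For concreteness, the kind of query involved is roughly like this (the list IDs are invented for the example):

SELECT a.ItemID
FROM ListEntry a
WHERE a.ListID = 1
AND NOT EXISTS (
  SELECT 1 FROM ListEntry b
  WHERE b.ListID = 2 AND b.ItemID = a.ItemID
);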

I think the server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.
Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.

Is there a better way to manage list/set membership for many thousands of sets and many millions of items?



Re: How to get good performance for very large lists/sets?

From: Igor Neyman
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Richard Frith-Macdonald
Sent: Monday, October 06, 2014 4:02 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] How to get good performance for very large lists/sets?

I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.

I have a database which uses multiple  lists of items roughly like this:

CREATE TABLE List (
  ID SERIAL,
  Name VARCHAR ....
);

and a table containing individual entries in the lists:

CREATE TABLE ListEntry (
  ListID INT, /* Reference the List table */
  ItemID INT /* References an Item table */
) ;
CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);

Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.

Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like find all the items in list A but not list B, things can get very slow (particularly when both lists contain millions of common items).

I think the server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.
Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.

Is there a better way to manage list/set membership for many thousands of sets and many millions of items?

--

You mean you are getting sequential scans?
Index-only scans are not always quicker (you could try "turning off" seq scans by setting enable_seqscan=off).

Could you show your query, the corresponding plans, and what you don't like about them?
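
For example, something along these lines (substituting your real query for the placeholder):

SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT ... ;  -- your "A but not B" query here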

Regards,
Igor Neyman





Re: How to get good performance for very large lists/sets?

From: Andy Colson
On 10/6/2014 3:02 AM, Richard Frith-Macdonald wrote:
> I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.
>
> I have a database which uses multiple  lists of items roughly like this:
>
> CREATE TABLE List (
>    ID SERIAL,
>    Name VARCHAR ....
> );
>
> and a table containing individual entries in the lists:
>
> CREATE TABLE ListEntry (
>    ListID INT, /* Reference the List table */
>    ItemID INT /* References an Item table */
> ) ;
> CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);
>
> Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.
>
> Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like find all the items in list A but not list B, things can get very slow (particularly when both lists contain millions of common items).
>
> I think the server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.
> Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.
>
> Is there a better way to manage list/set membership for many thousands of sets and many millions of items?
>
>
>


I seem to recall something about NOT IN() and nulls, but I don't recall
the details.

Are you using:

select * where exists(select ...) and not exists(select ..)

or

select * where id in (select...) and id not in (select ...)
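
(The half-remembered NULL detail, for what it's worth, with table/column names guessed from the schema above: in something like

select * from Item where ID not in (select ItemID from ListEntry where ListID = 2)

the query returns no rows at all if the subquery produces even one NULL, because ID NOT IN (..., NULL) can never evaluate to true.)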



-Andy



Re: How to get good performance for very large lists/sets?

From: François Beausoleil
Le 2014-10-06 à 13:22, Andy Colson <andy@squeakycode.net> a écrit :

> On 10/6/2014 3:02 AM, Richard Frith-Macdonald wrote:
>> I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.
>>
>> I have a database which uses multiple  lists of items roughly like this:
>>
>> CREATE TABLE List (
>>   ID SERIAL,
>>   Name VARCHAR ....
>> );
>>
>> and a table containing individual entries in the lists:
>>
>> CREATE TABLE ListEntry (
>>   ListID INT, /* Reference the List table */
>>   ItemID INT /* References an Item table */
>> ) ;
>> CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);
>>
>> Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.
>>
>> Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like find all the items in list A but not list B, things can get very slow (particularly when both lists contain millions of common items).
>>
>> I think the server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.
>> Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.
>>
>> Is there a better way to manage list/set membership for many thousands of sets and many millions of items?
>
>
> I seem to recall something about NOT IN() and nulls, but I dont recall the details.
>
> are you using:
>
> select * where exists(select ...) and not exists(select ..)
>
> or
>
> select * where id in (select...) and id not in (select …)

Would

select * from …
except
select * from …

work better? The plan I get for SELECT EXCEPT SELECT ends up with a SetOp Except, while the SELECT WHERE exists() AND NOT exists() plan gives me a Nested Loop Semi Join. The in / not in case gives me a simple Hash Join.
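
With Richard's schema, that would be something like (list ids invented):

select ItemID from ListEntry where ListID = 1
except
select ItemID from ListEntry where ListID = 2;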

My dataset is a parent table partitioned by market and week, so not exactly the same as Richard’s original request, and it’s debatable how comparable my data set is.

Hope that helps!
François Beausoleil



Re: How to get good performance for very large lists/sets?

From: Igor Neyman

-----Original Message-----
From: Richard Frith-Macdonald [mailto:richard.frith-macdonald@brainstorm.co.uk]
Sent: Monday, October 06, 2014 1:53 PM
To: Igor Neyman
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] How to get good performance for very large lists/sets?

On 6 Oct 2014, at 17:54, Igor Neyman <ineyman@perceptron.com> wrote:
>
>> -----Original Message-----
>> From: pgsql-general-owner@postgresql.org
>> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Richard
>> Frith-Macdonald
>> Sent: Monday, October 06, 2014 4:02 AM
>> To: pgsql-general@postgresql.org
>> Subject: [GENERAL] How to get good performance for very large lists/sets?
>>
>> I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.
>>
>> I have a database which uses multiple  lists of items roughly like this:
>>
>> CREATE TABLE List (
>>  ID SERIAL,
>>  Name VARCHAR ....
>> );
>>
>> and a table containing individual entries in the lists:
>>
>> CREATE TABLE ListEntry (
>>  ListID INT, /* Reference the List table */
>>  ItemID INT /* References an Item table */
>> ) ;
>> CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);
>>
>> Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.
>>
>> Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like find all the items in list A but not list B, things can get very slow (particularly when both lists contain millions of common items).
>>
>> I think the server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.
>> Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.
>>
>> Is there a better way to manage list/set membership for many thousands of sets and many millions of items?
>
> --
>
> You mean you are getting sequential scans?
> Index-only scans are not always quicker (you could try "turning off" seq scans by setting enable_seqscan=off).
>
> Could you show your query, the corresponding plans, and what you don't like about them?

I guess I didn't express myself well.

No, I'm not particularly dissatisfied with any query plan; I have tried enabling/disabling different scan types to experiment, and have been able to get better results from the query planner with such tweaks in some cases (i.e. with specific datasets), but not consistently.  Certainly the index is used quite often, and when it isn't, the query planner seems to be making reasonable decisions.
I've tried NOT IN and NOT EXISTS for different situations ...

My fundamental problem is huge datasets; with hundreds of gigabytes of memory, I can have the lists basically in memory, and these queries seem to be cpu-limited ... so I'm searching for a way to minimise the work the cpu has to do.

So what I was wondering was whether this whole approach to set/list membership was the correct one to use, or if there's some other approach which can simply avoid the cpu having to look at so much data (which was why I wondered about index-only scans).
--

What is your RAM and what is your setting for effective_cache_size?
Oh, and PG version?

The way you write the query will probably affect the way tables are joined, but not the choice between sequential and index-only scans.
I was getting better performance when using NOT EXISTS, e.g.:

select ptip1.item_id
from list_item ptip1
where ptip1.list_id = 109774
and not exists (SELECT 1 from list_item ptip2 where ptip2.list_id = 124600 and ptip1.item_id = ptip2.item_id);

which produced a "Hash Anti Join" in the execution plan.

Regards,
Igor Neyman




Re: How to get good performance for very large lists/sets?

From: Richard Frith-Macdonald
On 6 Oct 2014, at 17:54, Igor Neyman <ineyman@perceptron.com> wrote:
>
>> -----Original Message-----
>> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Richard Frith-Macdonald
>> Sent: Monday, October 06, 2014 4:02 AM
>> To: pgsql-general@postgresql.org
>> Subject: [GENERAL] How to get good performance for very large lists/sets?
>>
>> I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.
>>
>> I have a database which uses multiple  lists of items roughly like this:
>>
>> CREATE TABLE List (
>>  ID SERIAL,
>>  Name VARCHAR ....
>> );
>>
>> and a table containing individual entries in the lists:
>>
>> CREATE TABLE ListEntry (
>>  ListID INT, /* Reference the List table */
>>  ItemID INT /* References an Item table */
>> ) ;
>> CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);
>>
>> Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.
>>
>> Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like find all the items in list A but not list B, things can get very slow (particularly when both lists contain millions of common items).
>>
>> I think the server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.
>> Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.
>>
>> Is there a better way to manage list/set membership for many thousands of sets and many millions of items?
>
> --
>
> You mean you are getting sequential scans?
> Index-only scans are not always quicker (you could try "turning off" seq scans by setting enable_seqscan=off).
>
> Could you show your query, the corresponding plans, and what you don't like about them?

I guess I didn't express myself well.

No, I'm not particularly dissatisfied with any query plan; I have tried enabling/disabling different scan types to experiment, and have been able to get better results from the query planner with such tweaks in some cases (i.e. with specific datasets), but not consistently.  Certainly the index is used quite often, and when it isn't, the query planner seems to be making reasonable decisions.
I've tried NOT IN and NOT EXISTS for different situations ...

My fundamental problem is huge datasets; with hundreds of gigabytes of memory, I can have the lists basically in memory, and these queries seem to be cpu-limited ... so I'm searching for a way to minimise the work the cpu has to do.

So what I was wondering was whether this whole approach to set/list membership was the correct one to use, or if there's some other approach which can simply avoid the cpu having to look at so much data (which was why I wondered about index-only scans).



Re: How to get good performance for very large lists/sets?

From: Andy Colson
On 10/6/2014 12:52 PM, Richard Frith-Macdonald wrote:
> On 6 Oct 2014, at 17:54, Igor Neyman <ineyman@perceptron.com> wrote:
>>
>>> -----Original Message-----
>>> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Richard Frith-Macdonald
>>> Sent: Monday, October 06, 2014 4:02 AM
>>> To: pgsql-general@postgresql.org
>>> Subject: [GENERAL] How to get good performance for very large lists/sets?
>>>
>>> I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.
>>>
>>> I have a database which uses multiple  lists of items roughly like this:
>>>
>>> CREATE TABLE List (
>>>   ID SERIAL,
>>>   Name VARCHAR ....
>>> );
>>>
>>> and a table containing individual entries in the lists:
>>>
>>> CREATE TABLE ListEntry (
>>>   ListID INT, /* Reference the List table */
>>>   ItemID INT /* References an Item table */
>>> ) ;
>>> CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);
>>>
>>> Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.
>>>
>>> Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like find all the items in list A but not list B, things can get very slow (particularly when both lists contain millions of common items).
>>>
>>> I think the server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.
>>> Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.
>>>
>>> Is there a better way to manage list/set membership for many thousands of sets and many millions of items?
>>
>> --
>>
>> You mean you are getting sequential scans?
>> Index-only scans are not always quicker (you could try "turning off" seq scans by setting enable_seqscan=off).
>>
>> Could you show your query, the corresponding plans, and what you don't like about them?
>
> I guess I didn't express myself well.
>
> No, I'm not particularly dissatisfied with any query plan; I have tried enabling/disabling different scan types to experiment, and have been able to get better results from the query planner with such tweaks in some cases (i.e. with specific datasets), but not consistently.  Certainly the index is used quite often, and when it isn't, the query planner seems to be making reasonable decisions.
> I've tried NOT IN and NOT EXISTS for different situations ...
>
> My fundamental problem is huge datasets; with hundreds of gigabytes of memory, I can have the lists basically in memory, and these queries seem to be cpu-limited ... so I'm searching for a way to minimise the work the cpu has to do.
>
> So what I was wondering was whether this whole approach to set/list membership was the correct one to use, or if there's some other approach which can simply avoid the cpu having to look at so much data (which was why I wondered about index-only scans).
>
>
>


Ohhh..

Um, completely left field, but, if your items are sequential in some
way, maybe there is some gross misuse of ranges you could use?

http://www.postgresql.org/docs/9.2/static/rangetypes.html
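
A totally made-up sketch of what I mean, assuming item ids really do come in contiguous runs (table, index and ids all invented):

CREATE TABLE listrange (
  listid INT,
  items  INT4RANGE  -- one contiguous run of item ids in this list
);
CREATE INDEX listrange_idx ON listrange USING gist (items);

-- runs of list 1 that don't overlap any run of list 2
SELECT r1.items
FROM listrange r1
WHERE r1.listid = 1
AND NOT EXISTS (
  SELECT 1 FROM listrange r2
  WHERE r2.listid = 2 AND r2.items && r1.items
);

Partially overlapping runs would need splitting, so this only gets you anywhere if memberships cluster into contiguous blocks.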


-Andy





Re: How to get good performance for very large lists/sets?

From: Jeff Janes
On Mon, Oct 6, 2014 at 1:02 AM, Richard Frith-Macdonald <richard.frith-macdonald@brainstorm.co.uk> wrote:
I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.

I have a database which uses multiple  lists of items roughly like this:

CREATE TABLE List (
  ID SERIAL,
  Name VARCHAR ....
);

and a table containing individual entries in the lists:

CREATE TABLE ListEntry (
  ListID INT, /* Reference the List table */
  ItemID INT /* References an Item table */
) ;
CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);

Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.

How is concurrency handled?  If one person adds something to a list at the same time as someone else removes something from the same list, is there any interaction or do they just pass through each other cleanly?  Are lists ever rewritten in their entirety, or just incrementally have items added and removed one at a time?

What is the distribution of the number of lists any given entry is in?
 

Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like find all the items in list A but not list B things can get very slow (particularly when both lists contain millions of common items).

One possibility would be to have an Entry table where each entry has an array of the lists which contain it, with a gin index.

Then "in A but not B" could use the gin index to visit each Item in A, and then could immediately see if it was also in B without needing to do additional index look ups.  However, maintaining the gin index could easily become a substantial bottleneck.

 

I think that server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.
Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.

Reasoning about this situation would be a lot easier if you gave example query, with the explain (analyze, buffers), and a description of what plan you want it to use instead.  It sounds like you might want a single index-only scan to supply both branches of a hash semi join simultaneously, which I don't think is implemented even if the visibility map is in good shape.

I don't think PostgreSQL will ever demote an index-only scan to a plain index scan just because it thinks the index-only part won't be useful enough.  But it might switch to an entirely different plan which is not eligible to use an index-only scan in the first place because, given the reduced usefulness of the index only scan, the other one looks cheaper.  

So what you could do is vacuum a quiescent test database (do it several times just for good measure, as sometimes on some versions the first pass isn't enough to get the visibility map all set), and then run the query.  If you now get an index-only scan, that means you might need to vacuum your production database more aggressively.
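
On the test copy, that is something like:

VACUUM ANALYZE listentry;  -- repeat a few times
EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;  -- the problem query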

Cheers,

Jeff

Re: How to get good performance for very large lists/sets?

From: Andy Colson
>>
>
>
> Ohhh..
>
> Um, completely left field, but, if your items are sequential in some
> way, maybe there is some gross misuse of ranges you could use?
>
> http://www.postgresql.org/docs/9.2/static/rangetypes.html
>
>
> -Andy
>
>

Another thought, for the "find all the items in list A but not list B"
case that can get very slow:

What if you selected everything from list B, created yourself a bloom
filter, then selected everything out of list A?  (Meaning, take it out
of PG and do the compare yourself.)

Or, maybe write yourself a stored proc that could do bloom filters.
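
A very rough, untested sketch of the stored-proc version (a single hash function, so really just a big bitmap; every name and size here is made up):

CREATE OR REPLACE FUNCTION list_filter(p_listid INT, p_bits INT DEFAULT 8388608)
RETURNS bytea AS $$
DECLARE
  f bytea := decode(repeat('00', p_bits / 8), 'hex');  -- all-zero bitmap
  r RECORD;
BEGIN
  FOR r IN SELECT ItemID FROM ListEntry WHERE ListID = p_listid LOOP
    f := set_bit(f, mod(hashint4(r.ItemID) & 2147483647, p_bits), 1);
  END LOOP;
  RETURN f;
END;
$$ LANGUAGE plpgsql STABLE;

-- bit = 0 means "definitely not in list 2" and skips the exact probe;
-- bit = 1 may be a false positive, so those rows get a recheck
SELECT a.ItemID
FROM ListEntry a,
     (SELECT list_filter(2) AS f) s  -- build list B's filter once
WHERE a.ListID = 1
AND ( get_bit(s.f, mod(hashint4(a.ItemID) & 2147483647, 8388608)) = 0
   OR NOT EXISTS (SELECT 1 FROM ListEntry b
                  WHERE b.ListID = 2 AND b.ItemID = a.ItemID) );

Building the filter row by row in plpgsql will be slow for millions of entries, which is why doing it client-side is probably more realistic.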

-Andy


Re: How to get good performance for very large lists/sets?

From: Jim Nasby
On 10/6/14, 3:02 AM, Richard Frith-Macdonald wrote:
> I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.
>
> I have a database which uses multiple  lists of items roughly like this:
>
> CREATE TABLE List (
>    ID SERIAL,
>    Name VARCHAR ....
> );
>
> and a table containing individual entries in the lists:
>
> CREATE TABLE ListEntry (
>    ListID INT, /* Reference the List table */
>    ItemID INT /* References an Item table */
> ) ;
> CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);
BTW, performance-wise your best bet might be to forget about using a listentry table (incidentally, I recommend not using CamelCase for database object naming) and instead put an array in the list table:

CREATE TABLE list(
     list_id         serial     PRIMARY KEY
     , list_name     varchar    NOT NULL UNIQUE
     , list_items    int[]      NOT NULL
);
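
For the "in A but not B" case, that would make the query something like (ids invented):

SELECT unnest(list_items) FROM list WHERE list_id = 1
EXCEPT
SELECT unnest(list_items) FROM list WHERE list_id = 2;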

I think there's an extension/add-on that would let you enforce referential integrity between list_items and the items table, but I can't find it now.
--
Jim Nasby, Data Architect, Blue Treble
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: How to get good performance for very large lists/sets?

From: Alban Hertroys
On 06 Oct 2014, at 10:02, Richard Frith-Macdonald <richard.frith-macdonald@brainstorm.co.uk> wrote:

> I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.
>
> I have a database which uses multiple  lists of items roughly like this:
>
> CREATE TABLE List (
>  ID SERIAL,
>  Name VARCHAR ....
> );
>
> and a table containing individual entries in the lists:
>
> CREATE TABLE ListEntry (
>  ListID INT, /* Reference the List table */
>  ItemID INT /* References an Item table */
> ) ;
> CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);

Don’t you have any PK’s? A UNIQUE INDEX is not the same as a PK; a PK does not allow NULLs, for example.

For that matter, I’d ditch the serial column in List - it contributes to a larger index size, which decreases the chances that the index will fit in memory, making it less attractive to the query planner. IMHO, natural keys are to be preferred here over surrogate keys. That is assuming that List.Name is supposed to be unique.
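
Roughly (a sketch, assuming Name really is unique):

CREATE TABLE List (
  Name VARCHAR PRIMARY KEY
);

CREATE TABLE ListEntry (
  ListName VARCHAR REFERENCES List (Name),
  ItemID   INT,  /* References an Item table */
  PRIMARY KEY (ListName, ItemID)
);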

> Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in an unpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer actions.


> I think the server won't use index-only scans because, even in cases where a particular list has not had any recent changes, the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum.
> Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be thousands (possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict with things like nightly backups.
>
> Is there a better way to manage list/set membership for many thousands of sets and many millions of items?

Another benefit of using natural keys is that you don’t need to fetch the actual List entries - the Names are right there in your ListEntry table. You only need to look records up in the List table when you want their details (columns other than Name).

A possible drawback in this case is that the PK index on ListEntry would probably be larger.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.