Home > mailing lists

Re: How to get good performance for very large lists/sets? - Mailing list pgsql-general

From	Igor Neyman
Subject	Re: How to get good performance for very large lists/sets?
Date	October 6, 2014 18:25:57
Msg-id	A76B25F2823E954C9E45E32FA49D70ECAB2F7F24@mail.corp.perceptron.com Whole thread Raw
In response to	How to get good performance for very large lists/sets? (Richard Frith-Macdonald <richard.frith-macdonald@brainstorm.co.uk>)
List	pgsql-general

Tree view


-----Original Message-----
From: Richard Frith-Macdonald [mailto:richard.frith-macdonald@brainstorm.co.uk]
Sent: Monday, October 06, 2014 1:53 PM
To: Igor Neyman
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] How to get good performance for very large lists/sets?

On 6 Oct 2014, at 17:54, Igor Neyman <ineyman@perceptron.com> wrote:
>
>> -----Original Message-----
>> From: pgsql-general-owner@postgresql.org
>> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Richard
>> Frith-Macdonald
>> Sent: Monday, October 06, 2014 4:02 AM
>> To: pgsql-general@postgresql.org
>> Subject: [GENERAL] How to get good performance for very large lists/sets?
>>
>> I'm wondering if anyone can help with advice on how to manage large lists/sets of items in a postgresql database.
>>
>> I have a database which uses multiple  lists of items roughly like this:
>>
>> CREATE TABLE List (
>>  ID SERIAL,
>>  Name VARCHAR ....
>> );
>>
>> and a table containing individual entries in the lists:
>>
>> CREATE TABLE ListEntry (
>>  ListID INT, /* Reference the List table */  ItemID INT /* References
>> an Item table */
>> ) ;
>> CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);
>>
>> Now, there are thousands of lists, many with millions of entries, and items are added to and removed from lists in
anunpredictable way (in response to our customer's actions, not under our control).  Lists are also created by customer
actions.
>>
>> Finding whether a particular item is in a particular list is reasonably fast, but when we need to do things like
findall the items in list A but not list B things can get very slow (particularly when both lists contain millions of
commonitems). 
>>
>> I think that server won't use index-only scans because, even in cases where a particular list has not had any recent
changes,the ListEntry table will almost always have had some change (for one of the other lists) since its last vacuum. 
>> Perhaps creating multiple ListEntry tables (one for each list) would allow better performance; but that would be
thousands(possibly tens of thousands) of tables, and allowing new tables to be created by our clients might conflict
withthings like nightly backups. 
>>
>> Is there a better way to manage list/set membership for many thousands of sets and many millions of items?
>
> --
>
> You mean you are get sequential scans?
> Index-only scans are not always quicker (you could try "turning off" seq scans by setting enable_seqscan=off).
>
> Could you show your query, corresponding plans, and what don't you like about them?

I guess I didn't express myself well.

No I'm not particularly dissatisfied with any query plan;  have tried enabling/disabling different scan types to
experiment,and have been able to get better results from the query planner with such tweaks in some cases (ie with
specificdatasets), but not consistently.  Certainly the index is used quite often, and when it isn't the query planner
seemsto be making reasonable decisions. 
I've tried NOT IN, and NOT EXISTS and NOT EXISTS for different situations ...

My fundamental problem is huge datasets;  with hundreds of gigabytes of memory, I can have the lists basically in
memoryand these queries seem to be cpu-limited ... so I'm searching for a way to minimise the work the cpu has to do. 

So what I was wondering was whether this whole approach to set/list membership was the correct one to use or if there's
someother approach which can simply avoid the cpu having to look at so much data (which was why I wondered about
index-onlyscans). 
--

What is your RAM and what is your setting for effective_cache_size?
Oh, and PG version?

The way you write the query will probably affect more the way tables are joined, but not the choice between sequensial
orindex-only scan. 
I was getting better performance when using NOT EXISTS, e.g.:

select ptip1.item_id
from list_item ptip1
where ptip1.list_id = 109774
and not exists (SELECT 1 from list_item ptip2 where ptip2.list_id = 124600 and ptip1.item_id = ptip2.item_id);

which caused " Hash Anti Join" in execution plan.

Regards,
Igor Neyman

pgsql-general by date:

From: François Beausoleil
Date: 06 October 2014, 17:52:45
Subject: Re: How to get good performance for very large lists/sets?

From: Richard Frith-Macdonald
Date: 06 October 2014, 18:51:49
Subject: Re: How to get good performance for very large lists/sets?

Re: How to get good performance for very large lists/sets? - Mailing list pgsql-general

Previous

Next