Re: Slow query with backwards index scan - Mailing list pgsql-performance

From Nis Jørgensen
Subject Re: Slow query with backwards index scan
Date
Msg-id f8lgac$t0b$1@sea.gmane.org
Whole thread Raw
In response to Re: Slow query with backwards index scan  (Tilmann Singer <tils-pgsql@tils.net>)
List pgsql-performance
Tilmann Singer skrev:
> * Nis Jørgensen <nis@superlativ.dk> [20070730 18:33]:
>> It seems to me the subselect plan would benefit quite a bit from not
>> returning all rows, but only the 10 latest for each user. I believe the
>> problem is similar to what is discussed for UNIONs here:
>>
>>
http://groups.google.dk/group/pgsql.general/browse_thread/thread/4f74d7faa8a5a608/367f5052b1bbf1c8?lnk=st&q=postgresql+limit++union&rnum=1&hl=en#367f5052b1bbf1c8
>>
>> which got implemented here:
>>
>>
http://groups.google.dk/group/pgsql.committers/browse_thread/thread/b1ac3c3026db096c/9b3e5bd2d612f565?lnk=st&q=postgresql+limit++union&rnum=26&hl=en#9b3e5bd2d612f565
>>
>> It seems to me the planner in this case would actually need to push the
>> limit into the nested loop, since we want the plan to take advantage of
>> the index (using a backwards index scan). I am ready to be corrected though.
>
> If I understand that correctly than this means that it would benefit
> the planning for something like
>
> SELECT FROM (q1 UNION ALL q2) ORDER BY ... LIMIT ...
>
> if any of q1 or q2 would satisfy the rows requested by limit early,
> instead of planning q1 and q2 without having the limit of the outer
> query an influence.
>
> Unfortunately I'm having problems making my q2 reasonably efficient in
> the first place, even before UNIONing it.

The "second branch" of your UNION is really equivalent to the following
pseudo_code:

contacts = SELECT contact_id FROM relations WHERE user_id = $id

sql = SELECT * FROM (
    SELECT * FROM lt WHERE user_id = contacts[0]
    UNION ALL
    SELECT * FROM lt WHERE user_id = contacts[1]
    .
    .
    .
) ORDER BY created_at LIMIT 10;

Currently, it seems the "subqueries" are fetching all rows.

Thus a plan which makes each of the subqueries aware of the LIMIT  might
be able to improve performance. Unlike the UNION case, it seems this
means making the subqueries aware that the plan is valid, not just
changing the cost estimate.


How does this "imperative approach" perform? I realize you probably
don't want to use this, but it would give us an idea whether we would be
able to get the speed we need by forcing this plan on pg.

>> If that doesn't work, you might have reached the point where you need to
>> use some kind of record-keeping system to keep track of which records to
>> look at.
>
> Yes, I'm considering that unfortunately.
>
> Seeing however that there are 2 different queries which result in very
> efficient plans for one of the 2 different cases, but not the other,
> makes me hope there is a way to tune the planner into always coming up
> with the right plan. I'm not sure if I explained the problem well
> enough and will see if I can come up with a reproduction case with
> generated data.

I think the problem is that Postgresql does not have the necessary
statistics to determine which of the two plans will perform well. There
are basically two unknowns in the query:

- How many uninteresting records do we have to scan through before get
to the interesting ones (if using plan 1).
- How many matching rows do we find in "relations"

The first one it is not surprising that pg cannot estimate.

I am a little surprised that pg is not able to estimate the second one
better. Increasing the statistics settings might help in getting a
different plan.

I am also slightly surprised that the two equivalent formulations of the
query yield such vastly different query plans. In my experience, pg is
quite good at coming up with similar query plans for equivalent queries.
 You might want to fiddle with DISTINCT or indexing to make sure that
they are indeed logically equivalent.

Anyway, it seems likely that you will at some point run into the query
for many matches at the end of the table - which none of your plans will
be good at supplying. So perhaps you can just as well prepare for it now.


Nis


pgsql-performance by date:

Previous
From: "Jignesh K. Shah"
Date:
Subject: Re: User concurrency thresholding: where do I look?
Next
From: Ron Mayer
Date:
Subject: Re: Questions on Tags table schema