Re: Query planner unaware of possibly best plan - Mailing list pgsql-performance

From Denes Daniel
Subject Re: Query planner unaware of possibly best plan
Date
Msg-id freemail.20070822143321.8034@fm03.freemail.hu
Whole thread Raw
In response to Re: Query planner unaware of possibly best plan  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The point here is that you've repeated the same example N times
> without actually making a case that it's interesting to support.  We
> have to think about the intellectual complexity that would be added
> to the planner to support this case, and the cycles that would be
> expended on every query (and wasted, for most queries) on trying to
> detect whether the case applies.  If it were simple and cheap to do,
> these arguments wouldn't hold much weight, but it doesn't look to
> me like either is the case.
>
> Another problem is that it's not clear there's much to be gained.
> Avoiding the sort step is only interesting if the query produces so
> many rows that a sort would be expensive ... but if that's the case, it
> seems unlikely that a nestloop indexscan plan would be the best
> choice anyway.
>
> So basically this looks like a lot of work for a narrow and questionable
> gain.  If you want it to happen you need to convince people that it's
> easier and more useful than it looks.
>
>             regards, tom lane




Well, I probably won't tell you that it's easy to do, because you know
the planner far far better than I do -- I'm just a user who knows almost
nothing about what's happening inside it. I just thought that if the
planner is examining a plan, where the outer scan is using a unique
index with all it's columns AND doing a nestloop join with the inner scan
returning the rows in some order, then the planner could say "this is
already ordered by (outer.unique, inner.some_order)" and wouldn't
make a sort at the end. Of course this is far from perfect, because what
about three/four/more table joins... maybe that can be derived from
this, I don't really know now.
Another situation that could be optimized this way is if I write "ORDER
BY outer.idxcol1, outer.idxcol2, outer.id, inner.something" where
 + there is a non-unique index on (outer.idxcol1, outer.idxcol2),
 + outer.id is a PRI KEY,
 + there is an index on (inner.outer_id, inner.something) too
but that's getting really complicated. Of course if the simpler case
would be working, then I could create a unique index on (outer.idxcol1,
outer.idxcol2, outer.id) and this would run optimized too.

I think what's common is:
 + outer scan produces uniquely sorted rows
 + nested loop
 + inner scan produces sorted rows
 + ORDER BY says outer.unique_sort, inner.sort
If this is the situation, the sort could be done separately. Even like:

->  Nested Loop
  -> Sort by outer.unique
     ->  Scan producing the required rows from outer
  -> Sort by inner.something
     ->  Scan producing the required rows from inner
         Filter or Index Cond for the join

And this is good for any query that needs data from two tables, but
wants to get the joined rows so that all rows for a single outer-row
are "packed together" (I mean they do not mix).


Now about if it's worth it. I tried a query on real data, with lots of rows.
Tables involved:
- threads: ~200K rows
- msgs: ~8M rows
I wanted to see the last msgs from the last threads, but did not want
to mix them. It's like PgSQL's mailing archive viewed by threads. I
wanted them paged, not all 8M msgs at once (LIMIT/OFFSET). Query is:

SELECT *
FROM threads AS thr JOIN msgs AS msg ON msg.thrid = thr.id
WHERE thr.forid = 1
ORDER BY thr.id DESC, msg.msgid DESC
LIMIT 100

thr.forid is a FKEY to forums. Say forum 1 is the pgsql-performance list.
Table msgs has only a PRI KEY: (thrid, msgid).
Table threads has a PRI KEY: (id) and a unique index: (forid, id).
First time I ran the query with seqscan, bitmapscan, hashjoin,
mergejoin disabled (just to get the nested loop plan with the needless
sort). Second time I ran it without disabling anything, but I modified
ORDER BY to only "thr.id DESC" (deleted "msg.msgid DESC"), to give me
the same plan as before, but without the sort.
PSQL input and output attached.

I think the 5000x speed would be worth it.


Regards,
Denes Daniel
---------------------------------




Játssz a megújult Kvízparton! Válassz több mint 400 kvíz közül minden témában!
________________________________________________________
http://cthandler.adverticum.net/?cturl=http%3A%2F%2Fkvizpart.hu%2F?fmalja


Attachment

pgsql-performance by date:

Previous
From: Greg Smith
Date:
Subject: Re: Low CPU Usage
Next
From: Carlos Moreno
Date:
Subject: Possible explanations for catastrophic performace deterioration?