Re: intermittant performance problem - Mailing list pgsql-general

From: Alban Hertroys
Subject: Re: intermittant performance problem
Date:
Msg-id: 8409C805-3A96-478D-BF57-BD0EE33F8A93@solfertje.student.utwente.nl
In response to: Re: intermittant performance problem (Mike Charnoky <noky@nextbus.com>)
List: pgsql-general
On Mar 27, 2009, at 8:38 PM, Mike Charnoky wrote:

> Hi Alban,
>
> I experimented with your sample() solution and was rather baffled by
> the results, as it was actually 2x-3x slower than doing an ORDER BY
> RANDOM() LIMIT n.  I even precalculated the size of the result set,
> so that only one sequential scan was required, but this only helped
> marginally.
>
> I tested on a table with 284 million rows, selecting 10 random
> samples from a subset which contained 25 million rows.  I tried 3
> tests 3 times:
> A: ORDER BY RANDOM() limit 10
> B: SAMPLE(query, 10)
> C: SAMPLE(query, 10, 25million)  (precalculate size)
>
> Here were my results (query times in millisec)
>
> Trial        1        2        3      avg   speedup
> ----------------------------------------------------
> A        63705    78561    85252    75839  baseline
> B       196057   200537   220437   205677     -270%
> C       122356   170480   190829   161221     -210%
>
> Have you done any similar sort of benchmarking?  I'm assuming you're
> getting much better performance...
>
>
> Mike

Hello Mike,

That is not quite the result I expected. I have a few ideas about what
may be causing this to perform so poorly, though they pretty much torch
this approach...
I CC-ed the list again so that people can chime in.

I think the root of the problem is that the random records we're trying
to fetch aren't being cached. I can think of a few causes for that.

Firstly, because SELECT COUNT(*) is used to determine the size of the
result set, none of the records in it are actually fetched, so they
don't end up in any of PostgreSQL's caches. Some probably do end up in
the OS's disk cache.
My original MOVE FORWARD ALL approach (written in SQL in a PHP
function) probably did cause records to be moved to the caches.
Unfortunately I don't have access to that code anymore (a former
employer), but it's not particularly hard to duplicate.
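
In plain SQL the approach looks something like this (a sketch only; the
table and filter are made up for the example):

    BEGIN;
    DECLARE sample_cur SCROLL CURSOR FOR
        SELECT * FROM events WHERE ts >= now() - interval '1 day';

    -- MOVE FORWARD ALL reports "MOVE n", where n is the size of the
    -- result set, and as a side effect reads the entire set, so
    -- (parts of) it should end up in the caches.
    MOVE FORWARD ALL IN sample_cur;

    -- Then, for each of the x samples, the client picks a random
    -- position in 1..n and fetches that row, for example:
    FETCH ABSOLUTE 12345 FROM sample_cur;

    CLOSE sample_cur;
    COMMIT;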

Secondly, I think your result set is larger than mine was (by about two
orders of magnitude, if memory serves), so even if records were moved
to the cache it's quite probable that only part of the set remains
cached, as it simply won't all fit in there.

And thirdly, as we are fetching random records, random records are
being cached. The chance that we hit them again in a subsequent query
gets smaller with increasing result set size. That explains why
subsequent queries of this kind are not showing any improvement.

So effectively we are requesting random records from disk, using
random seeks through the data files.
If parts of the result set did get cached, I think this approach would
speed up significantly. That would explain my own results, which did
show a significant improvement (down from several seconds to something
acceptable for a website front page) - provided my theory at point 1
is correct.

Now I don't know how a scrollable cursor fetches rows from a result
set when given a specific row number, but I wouldn't be surprised if
it has to scroll forward from the start of the result set to find
them. There may be a list of row numbers and pointers into the data
files (I don't know the internals of FETCH with respect to scrollable
cursors), but at best that list covers rows up to the highest row
number requested so far. If we fetch rows 10 and 5 in that order, we
know something about row 5 because we encountered it on the way to row
10, but we know nothing about rows 11 and up, as we haven't
encountered those yet. If we knew about rows beyond that point, we
would also know the number of rows in the result set - and we know we
don't, because that's why we query for it.

This makes looking up random records even worse: although we do know
their row numbers, we probably can't translate those into positions on
disk, so we need a sequential scan to fetch them.
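
In cursor terms, the access pattern I'm describing looks like this
(row numbers made up):

    BEGIN;
    DECLARE c SCROLL CURSOR FOR SELECT * FROM some_big_table;

    FETCH ABSOLUTE 1000 FROM c;  -- scans forward through rows 1..1000
    FETCH ABSOLUTE  500 FROM c;  -- row 500 was seen on the way to 1000
    FETCH ABSOLUTE 2000 FROM c;  -- rows 1001..2000 haven't been seen,
                                 -- so presumably another forward scan
    COMMIT;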

These are mostly educated guesses, of course; I hope someone on the
list can shed some light on these pessimistic-looking theories :)

> Alban Hertroys wrote:
>> On Mar 25, 2009, at 5:09 PM, Mike Charnoky wrote:
>>> Due to the nature of the sampling (need to limit using several
>>> parameters with a WHERE clause), I can't just generate random
>>> numbers to select data that I need.  Looks like I am stuck using
>>> ORDER BY RANDOM().  The only other option at this point seems to
>>> be to implement TABLESAMPLE, probably starting with the academic
>>> work that Neil Conway published (http://neilconway.org/talks/hacking/)
>> I'm not sure I replied to this thread before, but ORDER BY RANDOM()
>> is not the only way to get x random rows out of n rows.
>> Calculating random numbers isn't particularly cheap, so doing that
>> n times is going to cost a fair amount of CPU cycles and will
>> require a sequential scan over the table. If you search the
>> archives you're bound to stumble on a solution I suggested before
>> that only needs to call random() x times (instead of n times). It
>> still does a sequential scan (I think there's currently no way
>> around that unless quasi-random results are acceptable to you). My
>> solution involves a scrollable cursor that is used to navigate to x
>> random rows in the (otherwise unsorted) n rows in the result set.
>> I tried putting that functionality into a pl/pgsql function, but
>> pl/pgsql doesn't (yet) support the MOVE FORWARD ALL statement, which
>> you need to get the upper boundary of the random row number (equals
>> the number of rows in the result set).
>> An alternative solution is to wrap your query in SELECT COUNT(*)
>> FROM (...) AS resultset or something similar, but in that case the
>> query (which does a seqscan) has to be performed twice! Maybe other
>> PL languages fare better there, but from the looks of it not even
>> C functions can perform MOVE FORWARD ALL, so I guess they won't.
>> My last attempt used that approach, but it's obviously not optimal.
>> I'd much prefer to feed my function a query or a refcursor instead
>> of a string containing a query. Feeding it a string makes me itch.
>> Anyway, here's how far I got. It is in a usable state and I'm
>> interested in how it performs on a real data set compared to ordering
>> by random() or other solutions.
>> At the moment it's probably more efficient not to use a stored
>> procedure but to query the cursor from your application instead
>> (saves one of the two seqscans). That has its own disadvantages of
>> course. I've used something like that (as a function in our PHP
>> application) on a medium-sized data set before, and it performed
>> adequately.
>> Alban Hertroys
>> --
>> If you can't see the forest for the trees,
>> cut the trees and you'll see there is no forest.
>>
>
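For the archives, such a sample() function looks roughly like this.
This is a sketch from memory (the optional precalculated-size
parameter and the FETCH with a computed position in particular are
approximations, not necessarily the exact code Mike benchmarked):

    CREATE OR REPLACE FUNCTION sample(query text, n int,
                                      size bigint DEFAULT NULL)
    RETURNS SETOF record AS $$
    DECLARE
        cur   refcursor;
        total bigint := size;
        pos   int;
        rec   record;
    BEGIN
        -- Without a precalculated size we need an extra pass over the
        -- result set just to count it (the part MOVE FORWARD ALL
        -- would otherwise do for us).
        IF total IS NULL THEN
            EXECUTE 'SELECT count(*) FROM (' || query || ') AS resultset'
                INTO total;
        END IF;

        OPEN cur SCROLL FOR EXECUTE query;
        FOR i IN 1 .. n LOOP
            pos := 1 + floor(random() * total)::int;
            FETCH ABSOLUTE pos FROM cur INTO rec;
            RETURN NEXT rec;
        END LOOP;
        CLOSE cur;
    END;
    $$ LANGUAGE plpgsql;

Because it returns SETOF record, it has to be called with a column
definition list, e.g.:

    SELECT * FROM sample('SELECT id FROM events', 10) AS s(id int);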

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.




