Thread: intermittent performance problem

intermittent performance problem

From: Mike Charnoky
Hello,

I'm looking for some insight on an intermittent PostgreSQL performance
problem that has been very troublesome.  We are using PG 8.3.5 on a
CentOS 5 server (kernel 2.6.18-8.el5) with dual Xeon 2.00 GHz CPUs,
4 GB RAM and a 600 GB RAID-10 SCSI array.

The problem in a nutshell is this: on some days, a nightly sampling
process (which reads data from one very large table and writes to
another) runs about 2 orders of magnitude slower and disk IO goes
through the roof.  The sampling process (which starts at 1am and usually
takes ~30 minutes) takes many hours to complete and eventually ends up
interfering with other processes that need to access the database.
Other processes which need to write to the db get backed up and
eventually data gets dropped (i.e., in-memory queues waiting for db
writes fill up).

The problem only happens on maybe one or two random days during the
week.  There is generally no other db activity during this time
(pg_stat_activity shows mostly idle connections).  It seems as if db
cache is not being used properly and heavy disk usage results.  Not sure
how to test this assumption.

Details are as follows:

1) The db contains a "raw_data" table which stores on the order of 10-15
million rows per day.  A total of two weeks of data are stored; total
table size is about 40GB (with indices).
2) Every day, a process runs which samples data from the "raw_data"
table and stores it to the "sampled_data" table for long term storage.
The sampling is done across two strata: time of day (time range) and
name of item (text field).  That is, the day is divided into about 5
chunks to ensure that we sample enough data for each chunk of time, for
each item.
3) The sampling process happens in two passes.  The first pass looks at
the data in order to determine the sample size required for each (time
of day, item name).  This consists of running some aggregate queries
over the entire dataset to be sampled (std dev, count, etc.; see the
sketch after this list).  Sample sizes are calculated for all items at
once for a given chunk of time.
The second pass actually performs random sampling of the data and stores
the samples in the "sampled_data" table.  It is this second pass of the
sampling process that is running about 2 orders of magnitude slower!
4) After the sampling process finishes, the oldest day's worth of data
is deleted from the "raw_data" table.
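
The pass-1 aggregates boil down to something like the following (a
sketch only; the stddev over "some_data_field", the exact column list
and the literal timestamps for one of the day's chunks are guesses,
not the actual query):

SELECT item_name, count(*), stddev(some_data_field)
FROM raw_data
WHERE timestmp >= '2009-03-08 00:00' AND timestmp < '2009-03-08 04:48'
  AND some_data_field NOTNULL
GROUP BY item_name;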

The sampling query which runs really slow on some days looks something
like this:

INSERT INTO sampled_data
   (item_name, timestmp, ... )
   SELECT item_name, timestmp, ...
   FROM raw_data
   WHERE timestmp >= ? and timestmp < ?
   AND item_name=?
   AND some_data_field NOTNULL
   ORDER BY random()
   LIMIT ?;

We have done a great deal of PG tuning, including the autovacuum
settings for the "raw_data" table.  Autovacuum kicks in like clockwork
every day on that table after the sampling process finishes (after one
day's worth of data is deleted from the "raw_data" table, roughly a 7%
change in size).

Changes made to postgresql.conf include:
max_connections = 250
shared_buffers = 1024MB
work_mem = 32MB
maintenance_work_mem = 256MB
max_fsm_pages = 10000000
max_fsm_relations = 30000
checkpoint_segments = 64
checkpoint_timeout = 10min
checkpoint_warning = 1min

Any pointers on how to troubleshoot this?


Mike

Re: intermittent performance problem

From: Scott Marlowe
On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky@nextbus.com> wrote:
> Hello,
>
> I'm looking for some insight on an intermittent PostgreSQL performance
> problem that has been very troublesome.  Using PG 8.3.5 on a server
> running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
> SCSI 600GB array).
>
> The problem in a nutshell is this: on some days, a nightly sampling
> process (which reads data from one very large table and writes to
> another) runs about 2 orders of magnitude slower and disk IO goes
> through the roof.  The sampling process (which starts at 1am and usually
> takes ~30 minutes) takes many hours to complete and eventually ends up
> interfering with other processes that need to access the database.
> Other processes which need to write to the db get backed up and
> eventually data gets dropped (ie: in memory queues waiting for db writes
> get filled up).
>
> The problem only happens on maybe one or two random days during the
> week.  There is generally no other db activity during this time
> (pg_stat_activity shows mostly idle connections).  It seems as if db
> cache is not being used properly and heavy disk usage results.  Not sure
> how to test this assumption.
>
> Details are as follows:
>
> 1) The db contains a "raw_data" table which stores on the order of 10-15
> million rows per day.  A total of two weeks of data are stored, total
> table size is about 40GB (with indices).
> 2) Every day, a process runs which samples data from the "raw_data"
> table and stores it to the "sampled_data" table for long term storage.
> The sampling is done across two strata: time of day (time range) and
> name of item (text field).  That is, the day is divided into about 5
> chunks to ensure that we sample enough data for each chunk of time, for
> each item.
> 3) The sampling process happens in two passes.  The first pass looks at
> the data in order to determine the sample size required for each (time
> of day, item name).  This consists of running some aggregate queries
> over the entire dataset to be sampled (std dev, count, etc).  Sample
> sizes are calculated for all items at once for a given chunk of time.
> The second pass actually performs random sampling of the data and stores
> the samples in the "sampled_data" table.  It is this second pass of the
> sampling process that is running about 2 orders of magnitude slower!
> 4) After the sampling process finishes, the oldest day's worth of data
> is deleted from the "raw_data" table.
>
> The sampling query which runs really slow on some days looks something
> like this:
>
> INSERT INTO sampled_data
>  (item_name, timestmp, ... )
>  SELECT item_name, timestmp, ... )
>  FROM raw_data
>  WHERE timestmp >= ? and timestmp < ?
>  AND item_name=?
>  AND some_data_field NOTNULL
>  ORDER BY random()
>  LIMIT ?;

Have you got any other method for doing the sampling than order by
random()?  order by random() is the most inefficient way you could
possibly do this.  If you know the range of, say, ids, run:

select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?

to get it.  Then use a random number generator to generate a list of
ids between those two ids, and select x rows from the database:

select * from rawtable where id in (1000 ids go here);

That will be WAY faster than order by random().
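
For concreteness, that two-step approach looks something like this
(assuming raw_data has a sequential integer "id" primary key, which
the thread doesn't confirm; the literal timestamps and ids are made up):

-- step 1: find the id range for the time slice being sampled
SELECT min(id), max(id)
FROM raw_data
WHERE timestmp >= '2009-03-08 00:00' AND timestmp < '2009-03-08 04:48';

-- step 2: the application generates a few thousand random ids between
-- min and max and fetches only those rows through the primary-key index
SELECT *
FROM raw_data
WHERE id IN (1000231, 1004770, 1018822 /* ... more random ids ... */);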

> Changes made to postgresql.conf include:
> max_connections = 250
> shared_buffers = 1024MB
> work_mem = 32MB

If you are married to order by random() then you might wanna crank up
work_mem while running that query.  I'd try something in the 128 to
512M range to start with.
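
That can be done for just the sampling session rather than globally,
along these lines (the value is illustrative):

SET work_mem = '256MB';
-- run the INSERT ... SELECT ... ORDER BY random() LIMIT ? here
RESET work_mem;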

> Any pointers on how to troubleshoot this?

Try methods that don't involve order by random().

Re: intermittent performance problem

From: Tom Lane
Mike Charnoky <noky@nextbus.com> writes:
> The sampling query which runs really slow on some days looks something
> like this:

> INSERT INTO sampled_data
>    (item_name, timestmp, ... )
>    SELECT item_name, timestmp, ... )
>    FROM raw_data
>    WHERE timestmp >= ? and timestmp < ?
>    AND item_name=?
>    AND some_data_field NOTNULL
>    ORDER BY random()
>    LIMIT ?;

Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
There's no good way to optimize "ORDER BY random()".  However, it seems
like the first thing you should do is modify the program so that it
issues an EXPLAIN for that right before actually doing the query, and
then you could see if the plan is different on the slow days.
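
In other words, have the program log something like the following right
before the real INSERT, with the same parameter values substituted in
(the literal values here are only placeholders):

EXPLAIN
SELECT item_name, timestmp  -- plus the other sampled columns
FROM raw_data
WHERE timestmp >= '2009-03-08 00:00' AND timestmp < '2009-03-08 04:48'
  AND item_name = 'some_item'
  AND some_data_field NOTNULL
ORDER BY random()
LIMIT 2000;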

> We have done a great deal of PG tuning, including the autovacuum for the
> "raw_data" table.  Autovacuum kicks like clockwork every day on that
> table after the sampling process finishes (after one day's worth of data
> is deleted from "raw_data" table, a roughly 7% change in size).

Also, are you sure you have ruled out the possibility that the problem
comes from autovac kicking in *while* the update is running?

            regards, tom lane

Re: intermittent performance problem

From: Mike Charnoky
Yeah, I wish I didn't have to resort to using ORDER BY RANDOM().  I
really wanted to use something like TABLESAMPLE, but that is not
implemented in PostgreSQL.  Unfortunately, I cannot use the random
sampling technique you mentioned, since I need to select samples across
various strata of the data (in this case, where "item_name=something"),
not just between timestamp ranges.  Guess I'll just have to try kicking
up the work_mem for that query.

Thanks so much for your input.


Mike

Scott Marlowe wrote:
> On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky@nextbus.com> wrote:
>> Hello,
>>
>> I'm looking for some insight on an intermittent PostgreSQL performance
>> problem that has been very troublesome.  Using PG 8.3.5 on a server
>> running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
>> SCSI 600GB array).
>>
>> The problem in a nutshell is this: on some days, a nightly sampling
>> process (which reads data from one very large table and writes to
>> another) runs about 2 orders of magnitude slower and disk IO goes
>> through the roof.  The sampling process (which starts at 1am and usually
>> takes ~30 minutes) takes many hours to complete and eventually ends up
>> interfering with other processes that need to access the database.
>> Other processes which need to write to the db get backed up and
>> eventually data gets dropped (ie: in memory queues waiting for db writes
>> get filled up).
>>
>> The problem only happens on maybe one or two random days during the
>> week.  There is generally no other db activity during this time
>> (pg_stat_activity shows mostly idle connections).  It seems as if db
>> cache is not being used properly and heavy disk usage results.  Not sure
>> how to test this assumption.
>>
>> Details are as follows:
>>
>> 1) The db contains a "raw_data" table which stores on the order of 10-15
>> million rows per day.  A total of two weeks of data are stored, total
>> table size is about 40GB (with indices).
>> 2) Every day, a process runs which samples data from the "raw_data"
>> table and stores it to the "sampled_data" table for long term storage.
>> The sampling is done across two strata: time of day (time range) and
>> name of item (text field).  That is, the day is divided into about 5
>> chunks to ensure that we sample enough data for each chunk of time, for
>> each item.
>> 3) The sampling process happens in two passes.  The first pass looks at
>> the data in order to determine the sample size required for each (time
>> of day, item name).  This consists of running some aggregate queries
>> over the entire dataset to be sampled (std dev, count, etc).  Sample
>> sizes are calculated for all items at once for a given chunk of time.
>> The second pass actually performs random sampling of the data and stores
>> the samples in the "sampled_data" table.  It is this second pass of the
>> sampling process that is running about 2 orders of magnitude slower!
>> 4) After the sampling process finishes, the oldest day's worth of data
>> is deleted from the "raw_data" table.
>>
>> The sampling query which runs really slow on some days looks something
>> like this:
>>
>> INSERT INTO sampled_data
>>  (item_name, timestmp, ... )
>>  SELECT item_name, timestmp, ... )
>>  FROM raw_data
>>  WHERE timestmp >= ? and timestmp < ?
>>  AND item_name=?
>>  AND some_data_field NOTNULL
>>  ORDER BY random()
>>  LIMIT ?;
>
> Have you got any other method for doing the sampling that order by
> random()?  order by random() is the most inefficient way you could
> possibly do this.  If you know the range of say, ids:
>
> select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?
>
> to get it.  Then use a random number generator to generate a list of
> ids between those two ids, and select x rows from the database.
>
> select * from rawtable where id in (1000 ids be here);
>
> Will be WAY faster than order by random().
>
>> Changes made to postgresql.conf include:
>> max_connections = 250
>> shared_buffers = 1024MB
>> work_mem = 32MB
>
> If you are married to order by random() then you might wanna crank up
> work_mem while running that query.  I'd try something in the 128 to
> 512M range to start with.
>
>> Any pointers on how to troubleshoot this?
>
> Try methods that don't involve order by random().
>

Re: intermittent performance problem

From: Mike Charnoky
The random sampling query is normally pretty snappy.  It usually takes
on the order of 1 second to sample a few thousand rows of data out of a
few million.  The sampling is consistently quick, too.  However, on some
days, the sampling starts off quick, then when the process starts
sampling from a different subset of data (different range of times for
the same day), the sampling query takes a couple minutes.

Regarding the concurrent vacuuming, this is definitely not happening.  I
always check pg_stat_activity whenever the sampling process starts to
lag behind.  I have never seen a vacuum running during this time.

Interesting idea to issue the EXPLAIN first... I will see if I can
instrument the sampling program to do this.

Thanks for your help Tom.


Mike

Tom Lane wrote:
> Mike Charnoky <noky@nextbus.com> writes:
>> The sampling query which runs really slow on some days looks something
>> like this:
>
>> INSERT INTO sampled_data
>>    (item_name, timestmp, ... )
>>    SELECT item_name, timestmp, ... )
>>    FROM raw_data
>>    WHERE timestmp >= ? and timestmp < ?
>>    AND item_name=?
>>    AND some_data_field NOTNULL
>>    ORDER BY random()
>>    LIMIT ?;
>
> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
> There's no good way to optimize "ORDER BY random()".  However, it seems
> like the first thing you should do is modify the program so that it
> issues an EXPLAIN for that right before actually doing the query, and
> then you could see if the plan is different on the slow days.
>
>> We have done a great deal of PG tuning, including the autovacuum for the
>> "raw_data" table.  Autovacuum kicks like clockwork every day on that
>> table after the sampling process finishes (after one day's worth of data
>> is deleted from "raw_data" table, a roughly 7% change in size).
>
> Also, are you sure you have ruled out the possibility that the problem
> comes from autovac kicking in *while* the update is running?
>
>             regards, tom lane
>

Re: intermittent performance problem

From: Scott Marlowe
On Mon, Mar 9, 2009 at 8:21 PM, Mike Charnoky <noky@nextbus.com> wrote:
> The random sampling query is normally pretty snappy.  It usually takes on
> the order of 1 second to sample a few thousand rows of data out of a few
> million.  The sampling is consistently quick, too.  However, on some days,
> the sampling starts off quick, then when the process starts sampling from a
> different subset of data (different range of times for the same day), the
> sampling query takes a couple minutes.

Then definitely look at saving explain plans before execution to
compare fast and slow runs.  So far this sounds like occasional bad
query plans to me.

> Regarding the concurrent vacuuming, this is definitely not happening.  I
> always check pg_stat_activity whenever the sampling process starts to lag
> behind.  I have never seen a vacuum running during this time.

And if autovac is getting in the way, try adjusting the various
autovac options, specifically autovacuum_vacuum_cost_delay set to 10 or
20 ms.

>
> Interesting idea to issue the EXPLAIN first... I will see if I can
> instrument the sampling program to do this.
>
> Thanks for your help Tom.
>
>
> Mike
>
> Tom Lane wrote:
>>
>> Mike Charnoky <noky@nextbus.com> writes:
>>>
>>> The sampling query which runs really slow on some days looks something
>>> like this:
>>
>>> INSERT INTO sampled_data
>>>   (item_name, timestmp, ... )
>>>   SELECT item_name, timestmp, ... )
>>>   FROM raw_data
>>>   WHERE timestmp >= ? and timestmp < ?
>>>   AND item_name=?
>>>   AND some_data_field NOTNULL
>>>   ORDER BY random()
>>>   LIMIT ?;
>>
>> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
>> There's no good way to optimize "ORDER BY random()".  However, it seems
>> like the first thing you should do is modify the program so that it
>> issues an EXPLAIN for that right before actually doing the query, and
>> then you could see if the plan is different on the slow days.
>>
>>> We have done a great deal of PG tuning, including the autovacuum for the
>>> "raw_data" table.  Autovacuum kicks like clockwork every day on that
>>> table after the sampling process finishes (after one day's worth of data
>>> is deleted from "raw_data" table, a roughly 7% change in size).
>>
>> Also, are you sure you have ruled out the possibility that the problem
>> comes from autovac kicking in *while* the update is running?
>>
>>                        regards, tom lane
>>
>



--
When fascism comes to America, it will be the intolerant selling it as
diversity.

Re: intermittent performance problem

From: Gregory Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Mike Charnoky <noky@nextbus.com> writes:
>> The sampling query which runs really slow on some days looks something
>> like this:
>
>> INSERT INTO sampled_data
>>    (item_name, timestmp, ... )
>>    SELECT item_name, timestmp, ... )
>>    FROM raw_data
>>    WHERE timestmp >= ? and timestmp < ?
>>    AND item_name=?
>>    AND some_data_field NOTNULL
>>    ORDER BY random()
>>    LIMIT ?;
>
> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
> There's no good way to optimize "ORDER BY random()".

This seems kind of unlikely but does the parameter to the LIMIT vary a lot? If
it's small enough to fit all the chosen records in work_mem then you'll avoid
a disk-sort and do a top-k scan. If it overflows work_mem then it'll fail over
to do a full disk sort of all the records picked from raw_data.

It does seem much more likely that whatever index the query is using on
timestmp, item_name or some_data_field is sometimes being used and
sometimes not.  Perhaps it's switching from an index on one of those
columns to an index on some other column and that's what's throwing it
off.
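
8.3's EXPLAIN ANALYZE reports which of those happened, so running the
SELECT part on a fast day and a slow day should settle it.  The Sort
node prints something like "Sort Method: top-N heapsort ... Memory: ..."
when the LIMIT fits in work_mem, versus "Sort Method: external merge ...
Disk: ..." when it spills to disk (exact wording from memory):

EXPLAIN ANALYZE
SELECT item_name, timestmp  -- same SELECT as the sampling query
FROM raw_data
WHERE timestmp >= ? AND timestmp < ?
  AND item_name = ?
  AND some_data_field NOTNULL
ORDER BY random()
LIMIT ?;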

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: intermittent performance problem

From: Mike Charnoky
Gregory Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>> Mike Charnoky <noky@nextbus.com> writes:
>>> The sampling query which runs really slow on some days looks something
>>> like this:
>>> INSERT INTO sampled_data
>>>    (item_name, timestmp, ... )
>>>    SELECT item_name, timestmp, ... )
>>>    FROM raw_data
>>>    WHERE timestmp >= ? and timestmp < ?
>>>    AND item_name=?
>>>    AND some_data_field NOTNULL
>>>    ORDER BY random()
>>>    LIMIT ?;
>> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
>> There's no good way to optimize "ORDER BY random()".
>
> This seems kind of unlikely but does the parameter to the LIMIT vary a lot? If
> it's small enough to fit all the chosen records in work_mem then you'll avoid
> a disk-sort and do a top-k scan. If it overflows work_mem then it'll fail over
> to do a full disk sort of all the records picked from raw_data.
>
> It does seem much more likely that whatever index you have it using on
> timestmp or item_name or some_data_field is sometimes being used and sometimes
> not. Perhaps it's switching from an index on one of those columns to an index
> on some other column and that's what's throwing it off.

The parameter used for the LIMIT does not vary too much.  It is
typically a couple thousand records that are selected.

Judging by the disk IO monitoring we have in place, it does seem like a
full disk sort is being done when the query runs slowly.  Would you
expect that to totally hose overall database performance?

I'm instrumenting the EXPLAIN now; I'll see what this turns up over the
course of the week and will check back if I'm still stumped.  Thanks.


Mike

Re: intermittent performance problem

From: Mike Charnoky
Scott Marlowe wrote:
> On Mon, Mar 9, 2009 at 8:21 PM, Mike Charnoky <noky@nextbus.com> wrote:
>> The random sampling query is normally pretty snappy.  It usually takes on
>> the order of 1 second to sample a few thousand rows of data out of a few
>> million.  The sampling is consistently quick, too.  However, on some days,
>> the sampling starts off quick, then when the process starts sampling from a
>> different subset of data (different range of times for the same day), the
>> sampling query takes a couple minutes.
>
> Then definitely look at saving explain plans before execution to
> compare fast to slow runs.  This definitely sounds like ocassionally
> bad query plans to me so far.
>
>>
>> Tom Lane wrote:
>>> Mike Charnoky <noky@nextbus.com> writes:
>>>> The sampling query which runs really slow on some days looks something
>>>> like this:
>>>> INSERT INTO sampled_data
>>>>   (item_name, timestmp, ... )
>>>>   SELECT item_name, timestmp, ... )
>>>>   FROM raw_data
>>>>   WHERE timestmp >= ? and timestmp < ?
>>>>   AND item_name=?
>>>>   AND some_data_field NOTNULL
>>>>   ORDER BY random()
>>>>   LIMIT ?;
>>> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
>>> There's no good way to optimize "ORDER BY random()".  However, it seems
>>> like the first thing you should do is modify the program so that it
>>> issues an EXPLAIN for that right before actually doing the query, and
>>> then you could see if the plan is different on the slow days.

The problem came up over the weekend so I took a look at the info from
EXPLAIN.  The query plans were quite different on the days when the
problem happened.  I began to suspect that autoanalyze was not happening
daily like the autovacuums were, and sure enough it was only running
about every other day.  In fact, I saw that autoanalyze happened once
during the sampling process, and the sampling happened much faster
afterward.
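
A quick way to confirm that is to check the stats collector timestamps
for the table:

SELECT relname, last_autovacuum, last_autoanalyze, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'raw_data';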

We're tuning the autoanalyze parameters so it runs more frequently.  Is
it OK to run ANALYZE manually before I begin the sampling process?  Or
is there a possibility this will collide with an autoanalyze and result
in problems?  I seem to remember this was a problem in the past, though
it may have been before PG 8.3...


Mike

Re: intermittent performance problem

From: Mike Charnoky
Mike Charnoky wrote:
> Scott Marlowe wrote:
>> On Mon, Mar 9, 2009 at 8:21 PM, Mike Charnoky <noky@nextbus.com> wrote:
>>> The random sampling query is normally pretty snappy.  It usually takes on
>>> the order of 1 second to sample a few thousand rows of data out of a few
>>> million.  The sampling is consistently quick, too.  However, on some days,
>>> the sampling starts off quick, then when the process starts sampling from a
>>> different subset of data (different range of times for the same day), the
>>> sampling query takes a couple minutes.
>>
>> Then definitely look at saving explain plans before execution to
>> compare fast to slow runs.  This definitely sounds like ocassionally
>> bad query plans to me so far.
>>>
>>> Tom Lane wrote:
>>>> Mike Charnoky <noky@nextbus.com> writes:
>>>>> The sampling query which runs really slow on some days looks something
>>>>> like this:
>>>>> INSERT INTO sampled_data
>>>>>   (item_name, timestmp, ... )
>>>>>   SELECT item_name, timestmp, ... )
>>>>>   FROM raw_data
>>>>>   WHERE timestmp >= ? and timestmp < ?
>>>>>   AND item_name=?
>>>>>   AND some_data_field NOTNULL
>>>>>   ORDER BY random()
>>>>>   LIMIT ?;
>>>> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
>>>> There's no good way to optimize "ORDER BY random()".  However, it seems
>>>> like the first thing you should do is modify the program so that it
>>>> issues an EXPLAIN for that right before actually doing the query, and
>>>> then you could see if the plan is different on the slow days.

I'm at the end of my rope here.  Tried the following:

* Manually run ANALYZE on the table before running the sampling query.
This does not help; there are still times when the sampling runs slower
by two orders of magnitude and disk IO goes through the roof.
* Bump work_mem for the query to 128MB (up from 32MB).  This did not
help.  Also, no temp files were created in $PG/data/base/pgsql_tmp/, so
work_mem does not seem to be an issue.
* EXPLAINs look nearly identical whether the query runs quickly or slowly

The thing that gets me is, why does this query totally hose the entire
database?  Other clients have a very hard time writing to the db when
this sampling query is running slowly and disk IO is maxed out.  This just
doesn't seem right.  Why would a single pg backend strangle db
performance to such an extent?  Aren't there ways to throttle this back?

Due to the nature of the sampling (need to limit using several
parameters with a WHERE clause), I can't just generate random numbers to
select data that I need.  Looks like I am stuck using ORDER BY RANDOM().
The only other option at this point seems to be to implement
TABLESAMPLE, probably starting with the academic work that Neil Conway
published (http://neilconway.org/talks/hacking/).


Mike

Re: intermittent performance problem

From: Alban Hertroys
On Mar 25, 2009, at 5:09 PM, Mike Charnoky wrote:

> Due to the nature of the sampling (need to limit using several
> parameters with a WHERE clause), I can't just generate random
> numbers to select data that I need.  Looks like I am stuck using
> ORDER BY RANDOM().  The only other option at this point seems to be
> to implement TABLESAMPLE, probably starting with the academic work
> that Neil Conway published (http://neilconway.org/talks/hacking/)


I'm not sure I replied to this thread before, but ORDER BY RANDOM() is
not the only way to get x random rows out of n rows.
Calculating random numbers isn't particularly cheap, so doing that n
times is going to cost a fair amount of CPU cycles and will require a
sequential scan over the table. If you search the archives you're
bound to stumble on a solution I suggested before that only needs to
call random() x times (instead of n times). It still does a sequential
scan (I think there's currently no way around that unless quasi-random
results are acceptable to you). My solution involves a scrollable
cursor that is used to navigate to x random rows in the (otherwise
unsorted) n rows in the result set.

I tried putting that functionality into a pl/pgsql function, but pl/
pgsql doesn't (yet) support the MOVE FORWARD ALL statement, which you
need to get the upper boundary of the random row number (equals the
number of rows in the result set).
An alternative solution is to wrap your query in SELECT COUNT(*) FROM
(...) AS resultset or something similar, but in that case the query
(which does a seqscan) has to be performed twice! Maybe other PL
languages fare better there, but from the looks of it not even C
functions can perform MOVE FORWARD ALL, so I guess they won't.

My last attempt used that approach, but it's obviously not optimal.
I'd much prefer to feed my function a query or a refcursor instead of
a string containing a query; feeding it a string makes me itch.
Anyway, here's how far I got. It is in a usable state and I'm
interested in how it performs on a real data set compared to ordering
by random() or other solutions.

At the moment it's probably more efficient to not use a stored
procedure but to query the cursor from your application instead (that
saves one of the two seqscans). That has its own disadvantages of
course. I've used something like that (as a function in our PHP
application) on a medium-sized data set before, and it performed
adequately.
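
For reference, the client-driven variant boils down to plain SQL along
these lines (parameter values are placeholders; the application reads
the row count n from the MOVE command tag and then fetches x random
row numbers):

BEGIN;
DECLARE samp SCROLL CURSOR FOR
    SELECT item_name, timestmp  -- plus the other sampled columns
    FROM raw_data
    WHERE timestmp >= '2009-03-08 00:00' AND timestmp < '2009-03-08 04:48'
      AND item_name = 'some_item'
      AND some_data_field NOTNULL;
MOVE FORWARD ALL IN samp;        -- command tag "MOVE n" gives the row count
FETCH ABSOLUTE 123456 FROM samp; -- repeated once per random row number
CLOSE samp;
COMMIT;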

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.




Attachment

Re: intermittent performance problem

From: Alban Hertroys
On Mar 27, 2009, at 8:38 PM, Mike Charnoky wrote:

> Hi Alban,
>
> I experimented with your sample() solution and was rather baffled by
> the results, as it was actually 2x-3x slower than doing an ORDER BY
> RANDOM() LIMIT n.  I even precalculated the size of the result set,
> so that only one sequential scan was required, but this only helped
> marginally.
>
> I tested on a table with 284 million rows, selecting 10 random
> samples from a subset which contained 25 million rows.  I tried 3
> tests 3 times:
> A: ORDER BY RANDOM() limit 10
> B: SAMPLE(query, 10)
> C: SAMPLE(query, 10, 25million)  (precalculate size)
>
> Here were my results (query times in millisec)
>
> Trial   1        2       3     avg  speedup
> --------------------------------------------
> A.  63705    78561   85252   75839  baseline
> B  196057   200537  220437  205677    -270%
> C. 122356   170480  190829  161221    -210%
>
> Have you done any similar sort of benchmarking?  I'm assuming you're
> getting much better performance...
>
>
> Mike

Hello Mike,

That is not quite the result I expected. I think I have some ideas
about what may be causing this to perform so inadequately. Those pretty
much torch this approach though...
I CC-ed the list again so that people can chime in.

I think the base of the problem is that the random records we're trying
to fetch aren't being cached. I can think of a few causes for that.

Firstly, due to using SELECT COUNT(*) to determine the size of the
result set, none of the records therein are actually queried and thus
don't end up in any caches. Some probably do end up in the OS's disk
cache.
My original MOVE FORWARD ALL approach (written in SQL in a PHP
function) probably did cause records to be moved to the caches.
Unfortunately I don't have access to that code anymore (a former
employer), but it's not particularly hard to duplicate.

Secondly, I think your result set is larger than mine was (by a factor
of 100 or so, if memory serves me correctly), so even if records were
moved to the cache it's quite probable that only a part of the set
remains in cache, as it simply won't all fit in there.

And thirdly, as we are fetching random records, random records are
being cached. The chance that we hit them again in a subsequent query
gets smaller with increasing result set size. That explains why
subsequent queries of this kind are not showing any improvement.

So effectively we are requesting random records from disk, using
random seeks through the data files.
If parts of the result set were cached, I think that would result
in a significant speedup of this approach. That probably explains my
own results, which did show a significant improvement (down from
several seconds to something acceptable for a website front page),
provided my theory at point 1 is correct.

Now I don't know how a scrollable cursor fetches rows from a result
set when given a specific row number, but I wouldn't be surprised if
it has to look it up from the start of the result set, scrolling
forward. There might be a list of record numbers and pointers to disk
(I don't know the internals of FETCH w.r.t. scrollable cursors), but at
best that would cover rows up to the highest row number requested so
far: if we fetch rows 10 and 5 in that order, we do know something
about row 5 as we encountered it to get to row 10, but we know nothing
about row 11 and up as we haven't encountered those yet. If we knew
about rows beyond that we would also know the number of rows in a given
result set, and I know we don't - that's why we query for it.

This makes looking up random records even worse, as although we do
know their row numbers, we probably can't translate that to a disk
position and need a sequential scan to fetch it.

These are mostly educated guesses of course; I hope someone on the
list can shed some light on these pessimistic-looking theories :)

> Alban Hertroys wrote:
>> On Mar 25, 2009, at 5:09 PM, Mike Charnoky wrote:
>>> Due to the nature of the sampling (need to limit using several
>>> parameters with a WHERE clause), I can't just generate random
>>> numbers to select data that I need.  Looks like I am stuck using
>>> ORDER BY RANDOM().  The only other option at this point seems to
>>> be to implement TABLESAMPLE, probably starting with the academic
>>> work that Neil Conway published (http://neilconway.org/talks/hacking/
>>> )
>> I'm not sure I answered to this thread before, but ORDER BY RANDOM
>> is not the only way to get x random rows out of n rows.
>> Calculating random numbers isn't particularly cheap, so doing that
>> n times is going to cost a fair amount of CPU cycles and will
>> require a sequential scan over the table. If you search the
>> archives you're bound to stumble on a solution I suggested before
>> that only needs to call random() x times (instead of n times). It
>> still does a sequential scan (I think there's currently no way
>> around that unless quasi-random results are acceptable to you). My
>> solution involves a scrollable cursor that is used to navigate to x
>> random rows in the (otherwise unsorted) n rows in the result set.
>> I tried putting that functionality into a pl/pgsql function, but pl/
>> pgsql doesn't (yet) support the MOVE FORWARD ALL statement, which
>> you need to get the upper boundary of the random row number (equals
>> the number of rows in the result set).
>> An alternative solution is to wrap your query in SELECT COUNT(*)
>> FROM (...) AS resultset or something similar, but in that case the
>> query (which does a seqscan) has to be performed twice! Maybe other
>> PL-languages fair better there, but from the looks of it not even C-
>> functions can perform MOVE FORWARD ALL, so I guess they won't.
>> My last attempt used that approach, but it's obviously not optimal.
>> I'd much prefer to feed my function a query or a refcursor instead
>> of a string containing a query. Feeding it a string makes me itch.
>> Anyway, here's how far I got. It is in a usable state and I'm
>> interested how it performs on a real data set compared to ordering
>> by random() or other solutions.
>> It's at the moment probably more efficient to not use a stored
>> procedure but query the cursor from your application instead (saves
>> one of the two seqscans). That has it's own disadvantages of
>> course. I've used something like that (as a function in our PHP
>> application) on a medium-sized data set before, and it performed
>> adequately.
>> Alban Hertroys
>> --
>> If you can't see the forest for the trees,
>> cut the trees and you'll see there is no forest.
>>
>
>
>

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.





Re: intermittent performance problem

From: marcin mank
I think (part of) your problem is that order by random() is O(N log N)
complexity, while you are after O(N).

The solution (in pseudocode)

random_sample(resultset,K):
  result := first K rows from resultset
  resultset.scrollto(K+1)
  p = K+1
  while(resultset.hasMoreRows())
        row = resultset.current()
        resultset.advance()
        if(random() < K/p)
              replace a random element in result with row
        p = p+1
  return result


the invariant being that at each step result contains a random sample
of K elements.

It should be fairly easy to implement in plpgsql.
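
A minimal PL/pgSQL sketch of that idea, using a plain FOR loop instead
of a scrollable cursor (the function signature, the raw_data[] array
and the column names are assumptions based on the query earlier in the
thread, not tested code):

CREATE OR REPLACE FUNCTION sample_raw_data(t_from timestamp, t_to timestamp,
                                           item text, k integer)
RETURNS SETOF raw_data AS $$
DECLARE
    reservoir raw_data[] := '{}';   -- current random sample of up to k rows
    rec       raw_data;
    n         integer := 0;         -- candidate rows seen so far
    i         integer;
BEGIN
    FOR rec IN
        SELECT * FROM raw_data
        WHERE timestmp >= t_from AND timestmp < t_to
          AND item_name = item
          AND some_data_field IS NOT NULL
    LOOP
        n := n + 1;
        IF n <= k THEN
            reservoir[n] := rec;    -- fill the reservoir with the first k rows
        ELSIF random() < k::float8 / n THEN
            -- keep this row with probability k/n, evicting a random element
            reservoir[1 + floor(random() * k)::int] := rec;
        END IF;
    END LOOP;

    FOR i IN 1 .. least(n, k) LOOP
        rec := reservoir[i];
        RETURN NEXT rec;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Pass 2 would then become something like "INSERT INTO sampled_data (...)
SELECT ... FROM sample_raw_data(?, ?, ?, ?)": still one sequential scan
over the time slice, but with no sort at all.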

Greetings
Marcin

Re: intermittent performance problem

From: marcin mank
On Sun, Mar 29, 2009 at 10:24 AM, marcin mank <marcin.mank@gmail.com> wrote:
> I think (a part of) Your problem is that order by random() is O(N
> logN) complexity, while You are after O(N) .
>
> The solution (in pseudocode)
>

[snip]

OK, I may be guiding you the wrong way.

select g,g,g,g from generate_series(1,25000000) as g order by random() limit 10

executes in under thirty seconds, so I don't think the sort is a problem.

Greetings
Marcin