Re: intermittant performance problem - Mailing list pgsql-general

From Mike Charnoky
Subject Re: intermittant performance problem
Date
Msg-id 49B5CC2D.4070905@nextbus.com
In response to Re: intermittant performance problem  (Scott Marlowe <scott.marlowe@gmail.com>)
List pgsql-general
Yeah, I wish I didn't have to resort to using ORDER BY RANDOM().  I
really wanted to use something like TABLESAMPLE, but that is not
implemented in PostgreSQL.  Unfortunately, I cannot use the random
sampling technique you mentioned, since I need to select samples across
various strata of the data (in this case, where "item_name=something"),
not just between timestamp ranges.  Guess I'll just have to try kicking
up the work_mem for that query.
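
Concretely, I'm thinking of raising work_mem only around the sampling
query rather than globally, along these lines (the work_mem value,
timestamps, item name and limit below are just illustrative, and the
elided columns are left out):

  BEGIN;
  -- raise work_mem for this transaction only; 256MB is just a starting guess
  SET LOCAL work_mem = '256MB';

  INSERT INTO sampled_data (item_name, timestmp)      -- other columns elided
    SELECT item_name, timestmp                        -- other columns elided
    FROM raw_data
    WHERE timestmp >= '2009-03-10 00:00' AND timestmp < '2009-03-10 04:48'
      AND item_name = 'example_item'                  -- placeholder item
      AND some_data_field NOTNULL
    ORDER BY random()
    LIMIT 1000;                                       -- placeholder sample size
  COMMIT;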

Thanks so much for your input.


Mike

Scott Marlowe wrote:
> On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky@nextbus.com> wrote:
>> Hello,
>>
>> I'm looking for some insight on an intermittent PostgreSQL performance
>> problem that has been very troublesome.  Using PG 8.3.5 on a server
>> running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
>> SCSI 600GB array).
>>
>> The problem in a nutshell is this: on some days, a nightly sampling
>> process (which reads data from one very large table and writes to
>> another) runs about 2 orders of magnitude slower and disk IO goes
>> through the roof.  The sampling process (which starts at 1am and usually
>> takes ~30 minutes) takes many hours to complete and eventually ends up
>> interfering with other processes that need to access the database.
>> Other processes which need to write to the db get backed up and
>> eventually data gets dropped (ie: in memory queues waiting for db writes
>> get filled up).
>>
>> The problem only happens on maybe one or two random days during the
>> week.  There is generally no other db activity during this time
>> (pg_stat_activity shows mostly idle connections).  It seems as if db
>> cache is not being used properly and heavy disk usage results.  Not sure
>> how to test this assumption.
>>
>> Details are as follows:
>>
>> 1) The db contains a "raw_data" table which stores on the order of 10-15
>> million rows per day.  A total of two weeks of data are stored, total
>> table size is about 40GB (with indices).
>> 2) Every day, a process runs which samples data from the "raw_data"
>> table and stores it to the "sampled_data" table for long term storage.
>> The sampling is done across two strata: time of day (time range) and
>> name of item (text field).  That is, the day is divided into about 5
>> chunks to ensure that we sample enough data for each chunk of time, for
>> each item.
>> 3) The sampling process happens in two passes.  The first pass looks at
>> the data in order to determine the sample size required for each (time
>> of day, item name).  This consists of running some aggregate queries
>> over the entire dataset to be sampled (std dev, count, etc).  Sample
>> sizes are calculated for all items at once for a given chunk of time.
>> The second pass actually performs random sampling of the data and stores
>> the samples in the "sampled_data" table.  It is this second pass of the
>> sampling process that is running about 2 orders of magnitude slower!
>> 4) After the sampling process finishes, the oldest day's worth of data
>> is deleted from the "raw_data" table.
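
(For reference, the per-item aggregates in that first pass boil down to
something like the following for one time chunk -- the timestamps are
illustrative and this assumes some_data_field is numeric:)

  SELECT item_name,
         count(*)                 AS n,
         stddev(some_data_field)  AS sd
  FROM raw_data
  WHERE timestmp >= '2009-03-10 00:00' AND timestmp < '2009-03-10 04:48'
    AND some_data_field NOTNULL
  GROUP BY item_name;
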
>>
>> The sampling query which runs really slow on some days looks something
>> like this:
>>
>> INSERT INTO sampled_data
>>  (item_name, timestmp, ... )
>>  SELECT item_name, timestmp, ...
>>  FROM raw_data
>>  WHERE timestmp >= ? and timestmp < ?
>>  AND item_name=?
>>  AND some_data_field NOTNULL
>>  ORDER BY random()
>>  LIMIT ?;
>
> Have you got any other method for doing the sampling than order by
> random()?  order by random() is the most inefficient way you could
> possibly do this.  If you know the range of, say, ids:
>
> select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?
>
> to get it.  Then use a random number generator to generate a list of
> ids between those two ids, and select x rows from the database.
>
> select * from rawtable where id in (1000 ids go here);
>
> Will be WAY faster than order by random().
>
>> Changes made to postgresql.conf include:
>> max_connections = 250
>> shared_buffers = 1024MB
>> work_mem = 32MB
>
> If you are married to order by random() then you might wanna crank up
> work_mem while running that query.  I'd try something in the 128 to
> 512M range to start with.
>
>> Any pointers on how to troubleshoot this?
>
> Try methods that don't involve order by random().
>
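
For anyone following the thread, Scott's min(id)/max(id) suggestion above
would look roughly like this if raw_data had a sequential id column (the
id bounds and counts are illustrative, and as I mentioned it doesn't
directly handle the per-item strata):

  -- Step 1: find the id bounds for the time window
  SELECT min(id), max(id)
  FROM raw_data
  WHERE timestmp >= '2009-03-10 00:00' AND timestmp < '2009-03-10 04:48';

  -- Step 2: plug those bounds in (1000000/2000000 here are made up),
  -- generate random ids in that range and fetch the matching rows.
  -- Oversample a bit (1500 for a target of 1000) since some generated
  -- ids may be missing or duplicated; IN () collapses duplicates.
  SELECT *
  FROM raw_data
  WHERE id IN (
      SELECT (1000000 + floor(random() * (2000000 - 1000000)))::bigint
      FROM generate_series(1, 1500)
  )
    AND some_data_field NOTNULL
  LIMIT 1000;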
