Re: intermittent performance problem - Mailing list pgsql-general
From | Mike Charnoky |
---|---|
Subject | Re: intermittent performance problem |
Msg-id | 49B5CC2D.4070905@nextbus.com |
In response to | Re: intermittent performance problem (Scott Marlowe <scott.marlowe@gmail.com>) |
List | pgsql-general |
Yeah, I wish I didn't have to resort to using ORDER BY RANDOM(). I really
wanted to use something like TABLESAMPLE, but that is not implemented in
PostgreSQL. Unfortunately, I cannot use the random sampling technique you
mentioned, since I need to select samples across various strata of the data
(in this case, where "item_name=something"), not just between timestamp
ranges. Guess I'll just have to try kicking up the work_mem for that query.

Thanks so much for your input.

Mike

Scott Marlowe wrote:
> On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky@nextbus.com> wrote:
>> Hello,
>>
>> I'm looking for some insight on an intermittent PostgreSQL performance
>> problem that has been very troublesome. Using PG 8.3.5 on a server
>> running CentOS 5 2.6.18-8.el5 (Dual Xeon 2.00 GHz, 4 GB RAM, RAID-10
>> SCSI 600GB array).
>>
>> The problem in a nutshell is this: on some days, a nightly sampling
>> process (which reads data from one very large table and writes to
>> another) runs about 2 orders of magnitude slower and disk IO goes
>> through the roof. The sampling process (which starts at 1am and usually
>> takes ~30 minutes) takes many hours to complete and eventually ends up
>> interfering with other processes that need to access the database.
>> Other processes which need to write to the db get backed up and
>> eventually data gets dropped (ie: in-memory queues waiting for db
>> writes get filled up).
>>
>> The problem only happens on maybe one or two random days during the
>> week. There is generally no other db activity during this time
>> (pg_stat_activity shows mostly idle connections). It seems as if db
>> cache is not being used properly and heavy disk usage results. Not
>> sure how to test this assumption.
>>
>> Details are as follows:
>>
>> 1) The db contains a "raw_data" table which stores on the order of
>> 10-15 million rows per day. A total of two weeks of data are stored;
>> total table size is about 40GB (with indices).
>> 2) Every day, a process runs which samples data from the "raw_data"
>> table and stores it to the "sampled_data" table for long term storage.
>> The sampling is done across two strata: time of day (time range) and
>> name of item (text field). That is, the day is divided into about 5
>> chunks to ensure that we sample enough data for each chunk of time,
>> for each item.
>> 3) The sampling process happens in two passes. The first pass looks at
>> the data in order to determine the sample size required for each (time
>> of day, item name). This consists of running some aggregate queries
>> over the entire dataset to be sampled (std dev, count, etc). Sample
>> sizes are calculated for all items at once for a given chunk of time.
>> The second pass actually performs random sampling of the data and
>> stores the samples in the "sampled_data" table. It is this second pass
>> of the sampling process that is running about 2 orders of magnitude
>> slower!
>> 4) After the sampling process finishes, the oldest day's worth of data
>> is deleted from the "raw_data" table.
>>
>> The sampling query which runs really slow on some days looks something
>> like this:
>>
>> INSERT INTO sampled_data
>>   (item_name, timestmp, ... )
>> SELECT item_name, timestmp, ...
>> FROM raw_data
>> WHERE timestmp >= ? and timestmp < ?
>> AND item_name=?
>> AND some_data_field NOTNULL
>> ORDER BY random()
>> LIMIT ?;
>
> Have you got any other method for doing the sampling than order by
> random()? order by random() is the most inefficient way you could
> possibly do this.
> If you know the range of, say, ids:
>
> select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?
>
> to get it. Then use a random number generator to generate a list of
> ids between those two ids, and select x rows from the database:
>
> select * from rawtable where id in (1000 ids be here);
>
> This will be WAY faster than order by random().
>
>> Changes made to postgresql.conf include:
>> max_connections = 250
>> shared_buffers = 1024MB
>> work_mem = 32MB
>
> If you are married to order by random() then you might wanna crank up
> work_mem while running that query. I'd try something in the 128 to
> 512M range to start with.
>
>> Any pointers on how to troubleshoot this?
>
> Try methods that don't involve order by random().
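To make the id-range technique concrete, here is a minimal pure-SQL sketch of it, assuming "raw_data" has a sequential integer primary key named "id" (a column not shown in the original query); the literal bounds, time window, item name, and sample size are illustrative placeholders. Instead of sorting the whole window with ORDER BY random(), it draws random ids between the window's min and max and fetches only those rows:

```sql
-- Step 1: find the id range covered by the time window.
SELECT min(id), max(id)
FROM raw_data
WHERE timestmp >= '2009-03-09' AND timestmp < '2009-03-10';

-- Step 2: generate random ids inside that range and fetch the matching rows.
-- Scott suggests generating the id list client-side; this does the same thing
-- in SQL with generate_series() and random().  Some candidate ids will miss
-- (deleted rows, rows failing the filters), so over-generate and LIMIT to the
-- sample size actually needed.
SELECT r.*
FROM raw_data r
JOIN (
    SELECT DISTINCT
        (min_id + floor(random() * (max_id - min_id + 1)))::bigint AS id
    FROM (SELECT 1000000 AS min_id, 2000000 AS max_id) bounds,  -- from step 1
         generate_series(1, 2000)                                -- ~2x over-sample
) candidates USING (id)
WHERE r.timestmp >= '2009-03-09' AND r.timestmp < '2009-03-10'
  AND r.item_name = 'some_item'
  AND r.some_data_field NOTNULL
LIMIT 1000;
```

Mike's caveat still applies: when the sample must also satisfy item_name = ?, most randomly drawn ids will belong to other items, so the over-sampling factor has to grow accordingly for rare items.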
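If the ORDER BY random() query is kept, Scott's work_mem suggestion does not require a global change to postgresql.conf; it can be raised for just the sampling transaction. A minimal sketch, with an illustrative 256MB value and placeholder filter values standing in for the query's bind parameters and elided column list:

```sql
BEGIN;
-- SET LOCAL only lasts until the end of this transaction, so other sessions
-- keep the server-wide work_mem = 32MB setting.
SET LOCAL work_mem = '256MB';

INSERT INTO sampled_data (item_name, timestmp)
SELECT item_name, timestmp
FROM raw_data
WHERE timestmp >= '2009-03-09 00:00' AND timestmp < '2009-03-09 04:48'
  AND item_name = 'some_item'
  AND some_data_field NOTNULL
ORDER BY random()
LIMIT 1000;
COMMIT;
```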