Thread: intermittent performance problem
Hello,

I'm looking for some insight on an intermittent PostgreSQL performance problem that has been very troublesome. Using PG 8.3.5 on a server running CentOS 5, kernel 2.6.18-8.el5 (dual Xeon 2.00 GHz, 4 GB RAM, RAID-10 SCSI 600GB array).

The problem in a nutshell is this: on some days, a nightly sampling process (which reads data from one very large table and writes to another) runs about two orders of magnitude slower and disk IO goes through the roof. The sampling process (which starts at 1am and usually takes ~30 minutes) takes many hours to complete and eventually ends up interfering with other processes that need to access the database. Other processes which need to write to the db get backed up and eventually data gets dropped (i.e., in-memory queues waiting for db writes fill up).

The problem only happens on maybe one or two random days during the week. There is generally no other db activity during this time (pg_stat_activity shows mostly idle connections). It seems as if the db cache is not being used properly and heavy disk usage results, but I'm not sure how to test this assumption.

Details are as follows:

1) The db contains a "raw_data" table which stores on the order of 10-15 million rows per day. A total of two weeks of data are stored; total table size is about 40GB (with indices).

2) Every day, a process runs which samples data from the "raw_data" table and stores it in the "sampled_data" table for long-term storage. The sampling is done across two strata: time of day (time range) and name of item (text field). That is, the day is divided into about 5 chunks to ensure that we sample enough data for each chunk of time, for each item.

3) The sampling process happens in two passes. The first pass looks at the data in order to determine the sample size required for each (time of day, item name) pair. This consists of running some aggregate queries over the entire dataset to be sampled (std dev, count, etc). Sample sizes are calculated for all items at once for a given chunk of time. The second pass actually performs random sampling of the data and stores the samples in the "sampled_data" table. It is this second pass of the sampling process that is running about two orders of magnitude slower!

4) After the sampling process finishes, the oldest day's worth of data is deleted from the "raw_data" table.

The sampling query which runs really slowly on some days looks something like this:

INSERT INTO sampled_data
  (item_name, timestmp, ... )
SELECT item_name, timestmp, ...
FROM raw_data
WHERE timestmp >= ? AND timestmp < ?
  AND item_name = ?
  AND some_data_field NOTNULL
ORDER BY random()
LIMIT ?;

We have done a great deal of PG tuning, including the autovacuum settings for the "raw_data" table. Autovacuum kicks in like clockwork every day on that table after the sampling process finishes (after one day's worth of data is deleted from the "raw_data" table, a roughly 7% change in size).

Changes made to postgresql.conf include:

max_connections = 250
shared_buffers = 1024MB
work_mem = 32MB
maintenance_work_mem = 256MB
max_fsm_pages = 10000000
max_fsm_relations = 30000
checkpoint_segments = 64
checkpoint_timeout = 10min
checkpoint_warning = 1min

Any pointers on how to troubleshoot this?

Mike
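A useful first step toward testing the cache/vacuum assumption is to snapshot the table's vacuum/analyze history and size right before the nightly run. This is a minimal sketch relying only on statistics views present in 8.3 (it assumes track_counts is enabled, which is the default):

    SELECT relname,
           last_autovacuum,
           last_autoanalyze,
           n_live_tup,
           n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'raw_data';

    -- On-disk size including indexes
    SELECT pg_size_pretty(pg_total_relation_size('raw_data'));

If the slow nights correlate with a stale last_autoanalyze or a large n_dead_tup, that points at planner statistics or bloat rather than cache behavior.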
On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky@nextbus.com> wrote:
> The sampling query which runs really slowly on some days looks something
> like this:
>
> INSERT INTO sampled_data
>   (item_name, timestmp, ... )
> SELECT item_name, timestmp, ...
> FROM raw_data
> WHERE timestmp >= ? AND timestmp < ?
>   AND item_name = ?
>   AND some_data_field NOTNULL
> ORDER BY random()
> LIMIT ?;

Have you got any other method for doing the sampling than order by random()? order by random() is the most inefficient way you could possibly do this.

If you know the range of, say, ids, run:

select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?

to get it. Then use a random number generator to generate a list of ids between those two ids, and select x rows from the database:

select * from rawtable where id in (1000 ids be here);

Will be WAY faster than order by random().

> Changes made to postgresql.conf include:
> max_connections = 250
> shared_buffers = 1024MB
> work_mem = 32MB

If you are married to order by random() then you might wanna crank up work_mem while running that query. I'd try something in the 128 to 512M range to start with.

> Any pointers on how to troubleshoot this?

Try methods that don't involve order by random().
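For what it's worth, a concrete sketch of the id-range idea (it assumes rawtable has a reasonably dense serial "id" column, which is not stated in the original post, and the literal ids below are placeholders):

    -- Get the id range for the time window
    SELECT min(id), max(id)
    FROM rawtable
    WHERE timestmp >= ? AND timestmp < ?;

    -- Fetch rows at randomly chosen ids; over-sample a bit and LIMIT,
    -- since some generated ids may fall in gaps left by deleted rows.
    SELECT *
    FROM rawtable
    WHERE id IN (
        SELECT (1000000 + floor(random() * (2500000 - 1000000)))::bigint
        FROM generate_series(1, 2000)
    )
    LIMIT 1000;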
Mike Charnoky <noky@nextbus.com> writes:
> The sampling query which runs really slowly on some days looks something
> like this:
>
> INSERT INTO sampled_data
>   (item_name, timestmp, ... )
> SELECT item_name, timestmp, ...
> FROM raw_data
> WHERE timestmp >= ? AND timestmp < ?
>   AND item_name = ?
>   AND some_data_field NOTNULL
> ORDER BY random()
> LIMIT ?;

Hmph, I'd expect that that would run pretty slowly *all* the time :-(. There's no good way to optimize "ORDER BY random()". However, it seems like the first thing you should do is modify the program so that it issues an EXPLAIN for that right before actually doing the query, and then you could see if the plan is different on the slow days.

> We have done a great deal of PG tuning, including the autovacuum settings
> for the "raw_data" table. Autovacuum kicks in like clockwork every day on
> that table after the sampling process finishes (after one day's worth of
> data is deleted from the "raw_data" table, a roughly 7% change in size).

Also, are you sure you have ruled out the possibility that the problem comes from autovac kicking in *while* the update is running?

regards, tom lane
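In practice that instrumentation can be as simple as running the SELECT part of the statement under EXPLAIN with the same parameter values just before the real query (the literals below are placeholders for the ? parameters):

    EXPLAIN
    SELECT item_name, timestmp
    FROM raw_data
    WHERE timestmp >= '2009-03-08 00:00' AND timestmp < '2009-03-08 04:48'
      AND item_name = 'some_item'
      AND some_data_field NOTNULL
    ORDER BY random()
    LIMIT 5000;

Logging that output alongside the query's runtime makes it easy to diff the plans from fast and slow nights.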
Yeah, I wish I didn't have to resort to using ORDER BY RANDOM(). I really wanted to use something like TABLESAMPLE, but that is not implemented in PostgreSQL. Unfortunately, I cannot use the random sampling technique you mentioned, since I need to select samples across various strata of the data (in this case, where "item_name=something"), not just between timestamp ranges.

Guess I'll just have to try kicking up the work_mem for that query. Thanks so much for your input.

Mike

Scott Marlowe wrote:
> On Mon, Mar 9, 2009 at 1:55 PM, Mike Charnoky <noky@nextbus.com> wrote:
>> The sampling query which runs really slowly on some days looks something
>> like this:
>>
>> INSERT INTO sampled_data
>>   (item_name, timestmp, ... )
>> SELECT item_name, timestmp, ...
>> FROM raw_data
>> WHERE timestmp >= ? AND timestmp < ?
>>   AND item_name = ?
>>   AND some_data_field NOTNULL
>> ORDER BY random()
>> LIMIT ?;
>
> Have you got any other method for doing the sampling than order by
> random()? order by random() is the most inefficient way you could
> possibly do this.
>
> If you know the range of, say, ids, run:
>
> select max(id), min(id) from rawtable where timestmp >= ? and timestmp < ?
>
> to get it. Then use a random number generator to generate a list of
> ids between those two ids, and select x rows from the database:
>
> select * from rawtable where id in (1000 ids be here);
>
> Will be WAY faster than order by random().
>
> If you are married to order by random() then you might wanna crank up
> work_mem while running that query. I'd try something in the 128 to
> 512M range to start with.
>
> Try methods that don't involve order by random().
The random sampling query is normally pretty snappy. It usually takes on the order of 1 second to sample a few thousand rows of data out of a few million. The sampling is consistently quick, too. However, on some days the sampling starts off quick, then when the process starts sampling from a different subset of data (different range of times for the same day), the sampling query takes a couple of minutes.

Regarding the concurrent vacuuming, this is definitely not happening. I always check pg_stat_activity whenever the sampling process starts to lag behind. I have never seen a vacuum running during this time.

Interesting idea to issue the EXPLAIN first... I will see if I can instrument the sampling program to do this.

Thanks for your help Tom.

Mike

Tom Lane wrote:
> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
> There's no good way to optimize "ORDER BY random()". However, it seems
> like the first thing you should do is modify the program so that it
> issues an EXPLAIN for that right before actually doing the query, and
> then you could see if the plan is different on the slow days.
>
> Also, are you sure you have ruled out the possibility that the problem
> comes from autovac kicking in *while* the update is running?
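For completeness, the pg_stat_activity check can be automated and logged each time the sampling lags; this sketch uses the 8.3 column names (autovacuum workers show up with current_query starting with "autovacuum:"):

    SELECT procpid, usename, query_start, current_query
    FROM pg_stat_activity
    WHERE procpid <> pg_backend_pid()
      AND (current_query ILIKE '%vacuum%' OR current_query ILIKE '%analyze%');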
On Mon, Mar 9, 2009 at 8:21 PM, Mike Charnoky <noky@nextbus.com> wrote:
> The random sampling query is normally pretty snappy. It usually takes on
> the order of 1 second to sample a few thousand rows of data out of a few
> million. The sampling is consistently quick, too. However, on some days
> the sampling starts off quick, then when the process starts sampling from a
> different subset of data (different range of times for the same day), the
> sampling query takes a couple of minutes.

Then definitely look at saving explain plans before execution to compare fast to slow runs. This definitely sounds like occasionally bad query plans to me so far.

> Regarding the concurrent vacuuming, this is definitely not happening. I
> always check pg_stat_activity whenever the sampling process starts to lag
> behind. I have never seen a vacuum running during this time.

And if autovac is getting in the way, try adjusting the various autovac options, specifically autovacuum_vacuum_cost_delay set to 10 or 20 (ms).

> Interesting idea to issue the EXPLAIN first... I will see if I can
> instrument the sampling program to do this.
>
> Thanks for your help Tom.
>
> Mike

-- 
When fascism comes to America, it will be the intolerant selling it as diversity.
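If autovacuum I/O ever does turn out to overlap the sampling window, the throttling knobs in postgresql.conf look like this (values only illustrative; 20ms is already the 8.3 default delay for autovacuum):

    # Make autovacuum yield more to foreground I/O
    autovacuum_vacuum_cost_delay = 20ms    # sleep after each cost "budget" is spent
    autovacuum_vacuum_cost_limit = 200     # cost budget between sleeps (-1 = use vacuum_cost_limit)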
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Mike Charnoky <noky@nextbus.com> writes:
>> The sampling query which runs really slowly on some days looks something
>> like this:
>>
>> INSERT INTO sampled_data
>>   (item_name, timestmp, ... )
>> SELECT item_name, timestmp, ...
>> FROM raw_data
>> WHERE timestmp >= ? AND timestmp < ?
>>   AND item_name = ?
>>   AND some_data_field NOTNULL
>> ORDER BY random()
>> LIMIT ?;
>
> Hmph, I'd expect that that would run pretty slowly *all* the time :-(.
> There's no good way to optimize "ORDER BY random()".

This seems kind of unlikely, but does the parameter to the LIMIT vary a lot? If it's small enough to fit all the chosen records in work_mem then you'll avoid a disk sort and do a top-k scan. If it overflows work_mem then it'll fail over to a full disk sort of all the records picked from raw_data.

It does seem much more likely that whatever index it is using on timestmp or item_name or some_data_field is sometimes being used and sometimes not. Perhaps it's switching from an index on one of those columns to an index on some other column and that's what's throwing it off.

-- 
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
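This is checkable directly: 8.3's EXPLAIN ANALYZE reports the sort strategy, so running the slow statement's SELECT by hand shows whether the top-k path is taken (placeholder literals again):

    SET work_mem = '128MB';

    EXPLAIN ANALYZE
    SELECT item_name, timestmp
    FROM raw_data
    WHERE timestmp >= '2009-03-08 00:00' AND timestmp < '2009-03-08 04:48'
      AND item_name = 'some_item'
      AND some_data_field NOTNULL
    ORDER BY random()
    LIMIT 5000;

The Sort node will show something like "Sort Method: top-N heapsort Memory: ...kB" when the LIMIT fits in work_mem, or "Sort Method: external merge Disk: ...kB" when it spills.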
Gregory Stark wrote:
> This seems kind of unlikely, but does the parameter to the LIMIT vary a lot?
> If it's small enough to fit all the chosen records in work_mem then you'll
> avoid a disk sort and do a top-k scan. If it overflows work_mem then it'll
> fail over to a full disk sort of all the records picked from raw_data.
>
> It does seem much more likely that whatever index it is using on
> timestmp or item_name or some_data_field is sometimes being used and
> sometimes not. Perhaps it's switching from an index on one of those columns
> to an index on some other column and that's what's throwing it off.

The parameter used for the LIMIT does not vary too much. It is typically a couple of thousand records that are selected. Judging by the disk IO monitoring we have in place, it does seem like a full disk sort is being done when the query runs slow. Would you expect this action to totally hose overall database performance?

I'm instrumenting the EXPLAIN now. I'll see what this turns up over the course of the week and will check back if I'm still stumped. Thanks.

Mike
Scott Marlowe wrote:
> Then definitely look at saving explain plans before execution to
> compare fast to slow runs. This definitely sounds like occasionally
> bad query plans to me so far.

The problem came up over the weekend, so I took a look at the info from EXPLAIN. The query plans were quite different on the days when the problem happened. I began to suspect that autoanalyze was not happening daily like the autovacuums were, and sure enough it was only running about every other day. In fact, I saw that autoanalyze happened once during the sampling process, and the sampling happened much faster afterward. We're tuning the autoanalyze parameters so it runs more frequently.

Is it OK to run ANALYZE manually before I begin the sampling process? Or is there a possibility this will collide with an autoanalyze and result in problems? I seem to remember this was a problem in the past, though it may have been before PG 8.3...

Mike
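A manual ANALYZE of just that table is cheap compared to the sampling job itself, and the statistics collection can be focused on the columns the sampling query filters on (a sketch; the column list mirrors the query shown earlier):

    ANALYZE raw_data (item_name, timestmp, some_data_field);

Lowering autovacuum_analyze_scale_factor from its 0.1 default (globally in postgresql.conf, or per table via the pg_autovacuum catalog in 8.3) is the usual way to make autoanalyze fire after each day's churn.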
I'm at the end of my rope here. I have tried the following:

* Manually running ANALYZE on the table before running the sampling query. This does not help; there are still times when the sampling runs slower by two orders of magnitude and disk IO is through the roof.
* Bumping work_mem for the query to 128MB (up from 32MB). This did not help. Also, no temp files were created in $PG/data/base/pgsql_tmp/, so work_mem does not seem to be an issue.
* The EXPLAINs look nearly identical whether the query runs quickly or slowly.

The thing that gets me is: why does this query totally hose the entire database? Other clients have a very hard time writing to the db when this sampling query is running slow; disk IO is maxed out. This just doesn't seem right. Why would a single pg backend strangle db performance to such an extent? Aren't there ways to throttle this back?

Due to the nature of the sampling (I need to limit using several parameters with a WHERE clause), I can't just generate random numbers to select the data I need. Looks like I am stuck using ORDER BY RANDOM(). The only other option at this point seems to be to implement TABLESAMPLE, probably starting with the academic work that Neil Conway published (http://neilconway.org/talks/hacking/).

Mike
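One more diagnostic that may help here: 8.3 can log every temporary file a backend creates, which shows directly whether the slow runs are spilling sorts to disk without having to watch pgsql_tmp by hand, and pg_stat_bgwriter shows whether backends are doing their own buffer writes during the slow window (both are sketches, not a fix):

    # postgresql.conf: log all temp files (threshold in kB; 0 = log everything)
    log_temp_files = 0

    -- Checkpoint/buffer pressure snapshot; compare before and after a slow run
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_backend, buffers_alloc
    FROM pg_stat_bgwriter;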
On Mar 25, 2009, at 5:09 PM, Mike Charnoky wrote:
> Due to the nature of the sampling (I need to limit using several
> parameters with a WHERE clause), I can't just generate random
> numbers to select the data I need. Looks like I am stuck using
> ORDER BY RANDOM(). The only other option at this point seems to be
> to implement TABLESAMPLE, probably starting with the academic work
> that Neil Conway published (http://neilconway.org/talks/hacking/)

I'm not sure I answered to this thread before, but ORDER BY RANDOM is not the only way to get x random rows out of n rows. Calculating random numbers isn't particularly cheap, so doing that n times is going to cost a fair amount of CPU cycles, and it requires a sequential scan over the table. If you search the archives you're bound to stumble on a solution I suggested before that only needs to call random() x times (instead of n times). It still does a sequential scan (I think there's currently no way around that unless quasi-random results are acceptable to you). My solution involves a scrollable cursor that is used to navigate to x random rows in the (otherwise unsorted) n rows in the result set.

I tried putting that functionality into a pl/pgsql function, but pl/pgsql doesn't (yet) support the MOVE FORWARD ALL statement, which you need to get the upper boundary of the random row number (equal to the number of rows in the result set).

An alternative solution is to wrap your query in SELECT COUNT(*) FROM (...) AS resultset or something similar, but in that case the query (which does a seqscan) has to be performed twice! Maybe other PL-languages fare better there, but from the looks of it not even C functions can perform MOVE FORWARD ALL, so I guess they won't. My last attempt used that approach, but it's obviously not optimal. I'd much prefer to feed my function a query or a refcursor instead of a string containing a query. Feeding it a string makes me itch.

Anyway, here's how far I got. It is in a usable state and I'm interested how it performs on a real data set compared to ordering by random() or other solutions.

It's at the moment probably more efficient to not use a stored procedure but to query the cursor from your application instead (saves one of the two seqscans). That has its own disadvantages of course. I've used something like that (as a function in our PHP application) on a medium-sized data set before, and it performed adequately.

Alban Hertroys

-- 
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
Attachment
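The attached function isn't reproduced here, but the cursor technique described above looks roughly like this at the SQL level (a sketch only: the filters and the fetched row position are placeholders, and the total row count is read from the command tag that the MOVE returns):

    BEGIN;

    DECLARE samp SCROLL CURSOR FOR
        SELECT item_name, timestmp
        FROM raw_data
        WHERE timestmp >= '2009-03-08 00:00' AND timestmp < '2009-03-08 04:48'
          AND item_name = 'some_item'
          AND some_data_field NOTNULL;

    -- The command tag "MOVE n" tells the client how many rows the cursor holds.
    MOVE FORWARD ALL IN samp;

    -- For each of the x samples, pick a random position in [1, n] client-side
    -- and fetch that row directly.
    FETCH ABSOLUTE 123456 FROM samp;

    CLOSE samp;
    COMMIT;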
On Mar 27, 2009, at 8:38 PM, Mike Charnoky wrote:
> Hi Alban,
>
> I experimented with your sample() solution and was rather baffled by
> the results, as it was actually 2x-3x slower than doing an ORDER BY
> RANDOM() LIMIT n. I even precalculated the size of the result set,
> so that only one sequential scan was required, but this only helped
> marginally.
>
> I tested on a table with 284 million rows, selecting 10 random
> samples from a subset which contained 25 million rows. I tried 3
> tests 3 times:
>
> A: ORDER BY RANDOM() limit 10
> B: SAMPLE(query, 10)
> C: SAMPLE(query, 10, 25million)  (precalculate size)
>
> Here were my results (query times in milliseconds):
>
> Trial       1        2        3      avg     speedup
> -----------------------------------------------------
> A.      63705    78561    85252    75839    baseline
> B.     196057   200537   220437   205677    -270%
> C.     122356   170480   190829   161221    -210%
>
> Have you done any similar sort of benchmarking? I'm assuming you're
> getting much better performance...
>
> Mike

Hello Mike,

That is not quite the result I expected. I think I have some ideas about what may be causing this to perform so inadequately. Those pretty much torch this approach though... I CC-ed the list again so that people can chime in.

I think the base of the problem is that the random records we're trying to fetch aren't being cached. I can think of a few causes for that.

Firstly, due to using SELECT COUNT(*) to determine the size of the result set, none of the records therein are actually queried and thus don't end up in any caches. Some probably do end up in the OS's disk cache. My original MOVE FORWARD ALL approach (written in SQL in a PHP function) probably did cause records to be moved to the caches. Unfortunately I don't have access to that code anymore (a former employer), but it's not particularly hard to duplicate.

Secondly, I think your result set is larger than mine was (by a magnitude of 100 or so, if memory serves me correctly), so even if records were moved to the cache it's quite probable that only a part of the set remains in cache, as it simply won't all fit in there.

And thirdly, as we are fetching random records, random records are being cached. The chance that we hit them again in a subsequent query gets smaller with increasing result set size. That explains why subsequent queries of this kind are not showing any improvement.

So effectively we are requesting random records from disk, using random seeks through the data files. If parts of the result set would get cached, I think that would result in a significant speedup of this approach. That probably explains my own results, which did show a significant improvement (down from several seconds to something acceptable for a website front page), provided my theory at point 1 is correct.

Now, I don't know how a scrollable cursor fetches rows from a result set when given a specific row number, but I wouldn't be surprised if it has to look it up from the start of the result set, scrolling forward. There might be a list of record numbers and pointers to disk, I don't know the internals of FETCH w.r.t. scrollable cursors, but that would cover rows up to the highest row number requested at best (if we fetch rows 10 and 5 in that order, we do know something about row 5 as we encountered it to get to row 10, but we know nothing about row 11 and up as we haven't encountered those yet). If we would know about rows beyond that, we would also know the number of rows in a given result set, and I know we don't; that's why we query for it.

This makes looking up random records even worse, as although we do know their row numbers, we probably can't translate those to a disk position and need a sequential scan to fetch them.

These are mostly educated guesses, of course. I hope someone on the list can shed some light on these pessimistic-looking theories :)

> Alban Hertroys wrote:
>> On Mar 25, 2009, at 5:09 PM, Mike Charnoky wrote:
>>> Due to the nature of the sampling (I need to limit using several
>>> parameters with a WHERE clause), I can't just generate random
>>> numbers to select the data I need. Looks like I am stuck using
>>> ORDER BY RANDOM().
>> I'm not sure I answered to this thread before, but ORDER BY RANDOM
>> is not the only way to get x random rows out of n rows. [...]

Alban Hertroys

-- 
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
I think (a part of) your problem is that order by random() is O(N log N) complexity, while you are after O(N).

The solution (in pseudocode):

random_sample(resultset, K):
    result := first K rows from resultset
    resultset.scrollto(K+1)
    p := K+1
    while resultset.hasMoreRows():
        row := resultset.current()
        resultset.advance()
        if random() < K/p:
            replace a random element in result with row
        p := p+1
    return result

The invariant is that at each step, result contains a random sample of K elements. It should be fairly easy to implement in plpgsql.

Greetings
Marcin
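A minimal PL/pgSQL sketch of that pseudocode, restricted to one (time range, item name) stratum as in the earlier messages. It assumes raw_data has a unique "id" column (not stated elsewhere in the thread) and returns the sampled ids so the full rows can be fetched afterwards:

    CREATE OR REPLACE FUNCTION reservoir_sample_ids(
        t_from timestamp, t_to timestamp, item text, k integer)
    RETURNS bigint[] AS $$
    DECLARE
        reservoir bigint[] := '{}';
        rec record;
        p integer := 0;
        idx integer;
    BEGIN
        FOR rec IN
            SELECT id FROM raw_data
            WHERE timestmp >= t_from AND timestmp < t_to
              AND item_name = item
              AND some_data_field NOTNULL
        LOOP
            p := p + 1;
            IF p <= k THEN
                -- fill phase: keep the first k rows
                reservoir := array_append(reservoir, rec.id::bigint);
            ELSIF random() < k::float8 / p THEN
                -- replacement phase: overwrite a random slot with probability k/p
                idx := 1 + floor(random() * k)::integer;
                reservoir[idx] := rec.id;
            END IF;
        END LOOP;
        RETURN reservoir;
    END;
    $$ LANGUAGE plpgsql;

    -- Usage, e.g.:
    -- SELECT * FROM raw_data
    -- WHERE id = ANY (reservoir_sample_ids('2009-03-08', '2009-03-08 04:48', 'some_item', 5000));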
On Sun, Mar 29, 2009 at 10:24 AM, marcin mank <marcin.mank@gmail.com> wrote:
> I think (a part of) your problem is that order by random() is O(N
> log N) complexity, while you are after O(N).
>
> The solution (in pseudocode)
> [snip]

OK, I may be guiding you the wrong way:

select g, g, g, g from generate_series(1, 25000000) as g order by random() limit 10

executes in under thirty seconds, so I don't think the sort is a problem.

Greetings
Marcin