Re: tablesample performance - Mailing list pgsql-general
From | Andy Colson |
---|---|
Subject | Re: tablesample performance |
Date | |
Msg-id | 28f2c186-c5f6-3266-6527-eae6f96919e5@squeakycode.net Whole thread Raw |
In response to | Re: tablesample performance (Francisco Olarte <folarte@peoplecall.com>) |
Responses |
Re: tablesample performance
Re: tablesample performance |
List | pgsql-general |
On 10/18/2016 11:44 AM, Francisco Olarte wrote: > On Tue, Oct 18, 2016 at 5:06 PM, Andy Colson <andy@squeakycode.net> wrote: >> I wanted to report an awesome performance boost using tablesample. >> In my stored function I was getting a random row using: >> select one into x from ones order by random() limit 1; >> When the table was smaller it worked fine, but the performance has slowly >> gotten worse. This morning I was getting around 8 transactions a second. > > Which is not a surprise, as it has to at least read all the rows and > generate a random() for each one and keep track of the minimum. > >> I just replaced it with: >> select one into x from ones tablesample bernoulli(1) limit 1; > > This should be faster, but to me it seems it does a different thing. > This seems to select each row of the table with probability 1% and > return the first selected, i.e., something similar to > > select one into x from ones where random()>0.01 limit 1. > > Which has the ( diminishing with table size ) risk of selecting zero > rows and is going to select one of the first 100 or so rows with high > probability, unless I'm missing something. > > I say this because docs state ir returns a 'randomly chosen', sample, > not a 'randomly ORDERED' one, and the straightforward implementation > of sampling returns rows in the primitive scan order. I supose it > could be easily tested by selecting bernouilli(100), but have not > server access now to verify it. > > With a big table it seems: > > select one into x from ones where random()>0.01 order by random() limit 1 > or > select one into x from ones tablesample bernoulli(1) order by random() limit 1; > > Is more similar to what you originally did ( and the run time should > possibly be something in between ). > > > I would recomend you to execute the function and verify it does what > you want ( as you say it's fast, I would try selecting a several > thousands and eyeballing the result, if it does what I fear the > grouping should be obvious ). > > Maybe you do not mind it, in which case it's ok, but a one minute run > should let you know wahat you are exactly doing. > > Francisco Olarte. > Ah, yes, you're right, there is a bit of a difference there. Speed wise: 1) select one from ones order by random() limit 1; > about 360ms 2) select one from ones tablesample bernoulli(1) limit 1 ; > about 4ms 3) select one from ones tablesample bernoulli(1) order by random() limit 1; > about 80ms Using the third option in batch, I'm getting about 15 transactions a second. Oddly: select one from ones tablesample bernoulli(0.25) order by random() takes almost 80ms also. bernoulli(0.25) returns 3k rows bernoulli(1) returns 14k rows Thanks, -Andy
pgsql-general by date: