Thread: tablesample performance
I wanted to report an awesome performance boost using tablesample.

In my stored function I was getting a random row using:

    select one into x from ones order by random() limit 1;

When the table was smaller it worked fine, but the performance has slowly gotten worse. This morning I was getting around 8 transactions a second.

I just replaced it with:

    select one into x from ones tablesample bernoulli(1) limit 1;

And now I'm getting 376 transactions a second!  Thank you devs!  Thank you PG 9.5!

-Andy
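A minimal sketch of the kind of function being described, assuming a hypothetical pick_one() wrapper and a table ones(one bigint) (the names are illustrative, not from the thread):

    create or replace function pick_one() returns bigint
    language plpgsql as $$
    declare
      x bigint;
    begin
      -- original approach: read every row, sort by a random key, keep one
      -- select one into x from ones order by random() limit 1;

      -- replacement: sample roughly 1% of the rows and take the first one returned
      select one into x from ones tablesample bernoulli(1) limit 1;
      return x;
    end;
    $$;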
On Tue, Oct 18, 2016 at 5:06 PM, Andy Colson <andy@squeakycode.net> wrote:
> I wanted to report an awesome performance boost using tablesample.
> In my stored function I was getting a random row using:
> select one into x from ones order by random() limit 1;
> When the table was smaller it worked fine, but the performance has slowly
> gotten worse. This morning I was getting around 8 transactions a second.

Which is not a surprise, as it has to at least read all the rows, generate a random() for each one, and keep track of the minimum.

> I just replaced it with:
> select one into x from ones tablesample bernoulli(1) limit 1;

This should be faster, but to me it seems it does a different thing. This seems to select each row of the table with probability 1% and return the first one selected, i.e., something similar to:

    select one into x from ones where random() < 0.01 limit 1;

Which has the (diminishing with table size) risk of selecting zero rows, and is going to select one of the first 100 or so rows with high probability, unless I'm missing something.

I say this because the docs state it returns a 'randomly chosen' sample, not a 'randomly ORDERED' one, and the straightforward implementation of sampling returns rows in the primitive scan order. I suppose it could be easily tested by selecting bernoulli(100), but I don't have server access right now to verify it.

With a big table it seems:

    select one into x from ones where random() < 0.01 order by random() limit 1;
or
    select one into x from ones tablesample bernoulli(1) order by random() limit 1;

is more similar to what you originally did (and the run time should probably be something in between).

I would recommend you execute the function and verify it does what you want (as you say it's fast, I would try selecting several thousand rows and eyeballing the result; if it does what I fear, the grouping should be obvious).

Maybe you do not mind it, in which case it's OK, but a one-minute run should let you know exactly what you are doing.

Francisco Olarte.
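A quick way to check the ordering concern raised here (a sketch, assuming the same ones table): a 100% Bernoulli sample should return every row, so comparing its output against the plain heap-scan order shows whether LIMIT 1 on a sample is biased toward early blocks:

    -- rows as the sampling scan emits them
    select ctid, one from ones tablesample bernoulli(100) limit 10;

    -- rows in plain heap-scan order, for comparison
    select ctid, one from ones limit 10;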
On 10/18/2016 11:44 AM, Francisco Olarte wrote:
> On Tue, Oct 18, 2016 at 5:06 PM, Andy Colson <andy@squeakycode.net> wrote:
>> I wanted to report an awesome performance boost using tablesample.
>> In my stored function I was getting a random row using:
>> select one into x from ones order by random() limit 1;
>> When the table was smaller it worked fine, but the performance has slowly
>> gotten worse. This morning I was getting around 8 transactions a second.
>
> Which is not a surprise, as it has to at least read all the rows,
> generate a random() for each one, and keep track of the minimum.
>
>> I just replaced it with:
>> select one into x from ones tablesample bernoulli(1) limit 1;
>
> This should be faster, but to me it seems it does a different thing.
> This seems to select each row of the table with probability 1% and
> return the first one selected, i.e., something similar to:
>
>     select one into x from ones where random() < 0.01 limit 1;
>
> Which has the (diminishing with table size) risk of selecting zero
> rows, and is going to select one of the first 100 or so rows with
> high probability, unless I'm missing something.
>
> I say this because the docs state it returns a 'randomly chosen'
> sample, not a 'randomly ORDERED' one, and the straightforward
> implementation of sampling returns rows in the primitive scan order.
> I suppose it could be easily tested by selecting bernoulli(100), but
> I don't have server access right now to verify it.
>
> With a big table it seems:
>
>     select one into x from ones where random() < 0.01 order by random() limit 1;
> or
>     select one into x from ones tablesample bernoulli(1) order by random() limit 1;
>
> is more similar to what you originally did (and the run time should
> probably be something in between).
>
> I would recommend you execute the function and verify it does what
> you want (as you say it's fast, I would try selecting several
> thousand rows and eyeballing the result; if it does what I fear, the
> grouping should be obvious).
>
> Maybe you do not mind it, in which case it's OK, but a one-minute run
> should let you know exactly what you are doing.
>
> Francisco Olarte.

Ah, yes, you're right, there is a bit of a difference there.

Speed wise:

1) select one from ones order by random() limit 1;
   > about 360ms
2) select one from ones tablesample bernoulli(1) limit 1;
   > about 4ms
3) select one from ones tablesample bernoulli(1) order by random() limit 1;
   > about 80ms

Using the third option in batch, I'm getting about 15 transactions a second.

Oddly:

    select one from ones tablesample bernoulli(0.25) order by random()

takes almost 80ms also.

bernoulli(0.25) returns 3k rows
bernoulli(1) returns 14k rows

Thanks,

-Andy
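To see where the time goes in each variant, EXPLAIN ANALYZE on the same statements is a straightforward check (a sketch; actual plans and timings will vary):

    explain (analyze, buffers)
    select one from ones tablesample bernoulli(1) order by random() limit 1;

    explain (analyze, buffers)
    select one from ones order by random() limit 1;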
Andy Colson <andy@squeakycode.net> writes:
> On 10/18/2016 11:44 AM, Francisco Olarte wrote:
>> This should be faster, but to me it seems it does a different thing.

> Ah, yes, you're right, there is a bit of a difference there.

If you don't want to have an implicit bias towards earlier blocks,
I don't think that either standard tablesample method is really what
you want.

The contrib/tsm_system_rows tablesample method is a lot closer, in
that it will start at a randomly chosen block, but if you just do
"tablesample system_rows(1)" then you will always get the first row
in whichever block it lands on, so it's still not exactly unbiased.

Maybe you could select "tablesample system_rows(100)" or so and then
do the order-by-random trick on that sample. This would be a lot
faster than selecting 100 random rows with either built-in sample
method, since the rows it grabs will be consecutive.

			regards, tom lane
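A sketch of what is being suggested here, assuming the contrib module is installed: system_rows(100) grabs about 100 consecutive rows starting at a random block, and the ORDER BY then picks one of them at random.

    create extension if not exists tsm_system_rows;

    select one
      from ones tablesample system_rows(100)
     order by random()
     limit 1;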
Andy:

On Tue, Oct 18, 2016 at 7:17 PM, Andy Colson <andy@squeakycode.net> wrote:
> Ah, yes, you're right, there is a bit of a difference there.
>
> Speed wise:
> 1) select one from ones order by random() limit 1;
>    > about 360ms
> 2) select one from ones tablesample bernoulli(1) limit 1;
>    > about 4ms
> 3) select one from ones tablesample bernoulli(1) order by random() limit 1;
>    > about 80ms

Expected. It would be nice if you had provided some table structure / size data.

> Using the third option in batch, I'm getting about 15 transactions a second.
>
> Oddly:
> select one from ones tablesample bernoulli(0.25) order by random()
> takes almost 80ms also.

Mmm, it depends a lot on your total rows and average rows per page.

> bernoulli(0.25) returns 3k rows
> bernoulli(1) returns 14k rows

This hints at about 1.4M rows (14k / 1%). If your rows are small and you have more than 400 rows per page, I would expect that, as a 0.25% sample would hit every page.

Tom hinted at an extension for this. Also, if you are in a function (which can loop) you can do a little trick: instead of bernoulli(1) use bernoulli(N / table_size). This way you will select very few rows and speed up the last phase.

Anyway, I fear bernoulli must read the whole table too, to be able to discard rows randomly, so you may not gain anything (I would compare the query time against a simple count(one) query, to get a benchmark of how much time the server spends reading the table; I would bet on 'about 80 ms').

Francisco Olarte.
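A sketch of the scaling idea, using the planner's row estimate from pg_class (function and target names are illustrative, not from the thread; note that bernoulli() takes a percentage, so the target-row fraction is multiplied by 100, and that the sampling argument is assumed to accept an expression such as a plpgsql variable):

    create or replace function pick_one_scaled() returns bigint
    language plpgsql as $$
    declare
      x   bigint;
      pct float8;
    begin
      -- aim for roughly 100 sampled rows, whatever the table size;
      -- fall back to 1% if the table has never been analyzed
      select least(100, coalesce(100.0 * 100 / nullif(reltuples, 0), 1))
        into pct
        from pg_class
       where relname = 'ones';

      -- x stays NULL in the (unlikely) case the sample comes back empty
      select one into x
        from ones tablesample bernoulli(pct)
       order by random()
       limit 1;

      return x;
    end;
    $$;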
On 18 October 2016 at 19:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andy Colson <andy@squeakycode.net> writes:
>> On 10/18/2016 11:44 AM, Francisco Olarte wrote:
>>> This should be faster, but to me it seems it does a different thing.
>
>> Ah, yes, you're right, there is a bit of a difference there.
>
> If you don't want to have an implicit bias towards earlier blocks,
> I don't think that either standard tablesample method is really what
> you want.
>
> The contrib/tsm_system_rows tablesample method is a lot closer, in
> that it will start at a randomly chosen block, but if you just do
> "tablesample system_rows(1)" then you will always get the first row
> in whichever block it lands on, so it's still not exactly unbiased.

Is there a reason why we can't fix the behaviours of the three methods
mentioned above by making them all start at a random block and at a
random item between min and max? It was never intended to be biased,
and bernoulli in particular ought to have strictly no bias.

Happy to patch if we agree.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes:
> On 18 October 2016 at 19:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If you don't want to have an implicit bias towards earlier blocks,
>> I don't think that either standard tablesample method is really what
>> you want.
>>
>> The contrib/tsm_system_rows tablesample method is a lot closer, in
>> that it will start at a randomly chosen block, but if you just do
>> "tablesample system_rows(1)" then you will always get the first row
>> in whichever block it lands on, so it's still not exactly unbiased.

> Is there a reason why we can't fix the behaviours of the three methods
> mentioned above by making them all start at a random block and a
> random item between min and max?

The standard tablesample methods are constrained by other requirements,
such as repeatability. I am not sure that loading this one on top of
that is a good idea. The bias I referred to above is *not* the fault
of the sample methods, rather it's the fault of using "LIMIT 1".

It does seem like maybe it'd be nice for tsm_system_rows to start at a
randomly chosen entry in the first block it visits, rather than always
dumping that entire block. Then "tablesample system_rows(1)" would
actually give you a pretty random row, and I think we aren't giving up
any useful properties it has now.

			regards, tom lane
On 18 October 2016 at 22:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On 18 October 2016 at 19:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If you don't want to have an implicit bias towards earlier blocks,
>>> I don't think that either standard tablesample method is really what
>>> you want.
>>>
>>> The contrib/tsm_system_rows tablesample method is a lot closer, in
>>> that it will start at a randomly chosen block, but if you just do
>>> "tablesample system_rows(1)" then you will always get the first row
>>> in whichever block it lands on, so it's still not exactly unbiased.
>
>> Is there a reason why we can't fix the behaviours of the three methods
>> mentioned above by making them all start at a random block and a
>> random item between min and max?
>
> The standard tablesample methods are constrained by other requirements,
> such as repeatability. I am not sure that loading this one on top of
> that is a good idea. The bias I referred to above is *not* the fault
> of the sample methods, rather it's the fault of using "LIMIT 1".

Hmm, yeh, that would make it a little too much of a special case.

> It does seem like maybe it'd be nice for tsm_system_rows to start at a
> randomly chosen entry in the first block it visits, rather than always
> dumping that entire block. Then "tablesample system_rows(1)" would
> actually give you a pretty random row, and I think we aren't giving up
> any useful properties it has now.

OK, will patch that.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services