Thread: pgbench - allow to specify scale as a size
After Kaarel Moppel's piece on pgbench scale/size conversion, it occurred to me that having this as an option would be nice.

https://www.cybertec-postgresql.com/en/a-formula-to-calculate-pgbench-scaling-factor-for-target-db-size/

Here is an attempt at extending --scale so that it can be given a size:

  pgbench -i --scale=124G ...

The approximated database size is also shown in the end-of-run summary.

--
Fabien.
Attachment
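To illustrate the idea, here is a minimal hypothetical sketch (not taken from the attached patch) of turning a --scale argument that carries a size suffix into a scale factor. The ~16 MB-per-scale-step constant is only a placeholder, roughly the 9.5 MiB of row data plus the ~65% storage overhead discussed later in the thread; the patch's actual conversion, based on Kaarel's regression, differs.

/* Hypothetical sketch, not the actual patch: convert a --scale argument
 * that may carry a size suffix (k/M/G, powers of 1000 here) into a scale
 * factor.  Assumes ~16 MB of on-disk data per scale step, which is only
 * a rough placeholder. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BYTES_PER_SCALE_STEP (16 * 1000 * 1000)   /* assumption, not exact */

static int
scale_from_arg(const char *arg)
{
    char   *end;
    double  val = strtod(arg, &end);

    if (end == arg)
        return -1;                      /* not a number at all */

    if (*end == '\0')
        return (int) val;               /* plain scale, unchanged behaviour */

    /* otherwise interpret the suffix as a target database size */
    if (strcmp(end, "k") == 0 || strcmp(end, "kB") == 0)
        val *= 1000.0;
    else if (strcmp(end, "M") == 0 || strcmp(end, "MB") == 0)
        val *= 1000.0 * 1000.0;
    else if (strcmp(end, "G") == 0 || strcmp(end, "GB") == 0)
        val *= 1000.0 * 1000.0 * 1000.0;
    else
        return -1;                      /* unknown suffix */

    int scale = (int) (val / BYTES_PER_SCALE_STEP);
    return scale < 1 ? 1 : scale;
}

Under that placeholder, --scale=124G works out to a scale of about 7750.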
On 2018-02-17 10:20, Fabien COELHO wrote:
> After Kaarel Moppel's piece on pgbench scale/size conversion, it occurred
> to me that having this as an option would be nice.
>
> https://www.cybertec-postgresql.com/en/a-formula-to-calculate-pgbench-scaling-factor-for-target-db-size/
>
> Here is an attempt at extending --scale so that it can be given a size:
>
>   pgbench -i --scale=124G ...
>
> The approximated database size is also shown in the end-of-run summary.
>
> [pgbench-scale-size-1.patch]

Seems a nice addition, but something isn't quite right; with '-s 50' (no unit) I get 'scale 50 too small':

$ pgbench -is 50
scale 50 too small, rounded to 1
dropping old tables...
creating tables...
generating data...
100000 of 100000 tuples (100%) done (elapsed 0.13 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done.

echo '\dti+ pgbench_accounts*' | psql -qX
                              List of relations
 Schema |         Name          | Type  |  Owner   |      Table       |  Size   | Description
--------+-----------------------+-------+----------+------------------+---------+-------------
 public | pgbench_accounts      | table | aardvark |                  | 13 MB   |
 public | pgbench_accounts_pkey | index | aardvark | pgbench_accounts | 2208 kB |
(2 rows)


thanks,

Erik Rijkers
> Seems a nice addition, but something isn't quite right; with '-s 50' (no
> unit) I get 'scale 50 too small':

Sigh. Indeed, it seems that I forgot to test some cases... Thanks for the debugging.

Here is a hopefully better attempt. I also upgraded the regression test to test more than "-s 1". I also tried to improve the documentation to point out that the result is an approximation.

--
Fabien.
Attachment
Fabien COELHO <coelho@cri.ensmp.fr> writes:
> Here is an attempt at extending --scale so that it can be given a size.

I do not actually find this to be a good idea. It's going to be platform-dependent, or not very accurate, or both, and thereby contribute to confusion by making results less reproducible.

Plus, what do we do if the backend changes table representation in some way that invalidates Kaarel's formula altogether? More confusion would be inevitable.

			regards, tom lane
On 17/02/18 11:26, Tom Lane wrote:
> Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> Here is an attempt at extending --scale so that it can be given a size.
> I do not actually find this to be a good idea. It's going to be
> platform-dependent, or not very accurate, or both, and thereby
> contribute to confusion by making results less reproducible.
>
> Plus, what do we do if the backend changes table representation in
> some way that invalidates Kaarel's formula altogether? More confusion
> would be inevitable.

Why not then insert a "few" rows, measure the size, truncate the table, compute the formula, and then insert up to the size the user requested? (Or insert what should be the minimum, scale 1, measure, and extrapolate what's missing.) It doesn't sound too complicated to me, and targeting a size is something that I believe is quite useful for users.

Álvaro

--
Alvaro Hernandez
-----------
OnGres
Alvaro Hernandez <aht@ongres.com> writes:
> On 17/02/18 11:26, Tom Lane wrote:
>> Fabien COELHO <coelho@cri.ensmp.fr> writes:
>>> Here is an attempt at extending --scale so that it can be given a size.
>> I do not actually find this to be a good idea. It's going to be
>> platform-dependent, or not very accurate, or both, and thereby
>> contribute to confusion by making results less reproducible.
>>
>> Plus, what do we do if the backend changes table representation in
>> some way that invalidates Kaarel's formula altogether? More confusion
>> would be inevitable.

> Why not then insert a "few" rows, measure the size, truncate the table,
> compute the formula, and then insert up to the size the user requested?
> (Or insert what should be the minimum, scale 1, measure, and extrapolate
> what's missing.) It doesn't sound too complicated to me, and targeting a
> size is something that I believe is quite useful for users.

Then you'd *really* have irreproducible results.

			regards, tom lane
On 17/02/18 12:17, Tom Lane wrote:
> Alvaro Hernandez <aht@ongres.com> writes:
>> On 17/02/18 11:26, Tom Lane wrote:
>>> Fabien COELHO <coelho@cri.ensmp.fr> writes:
>>>> Here is an attempt at extending --scale so that it can be given a size.
>>> I do not actually find this to be a good idea. It's going to be
>>> platform-dependent, or not very accurate, or both, and thereby
>>> contribute to confusion by making results less reproducible.
>>>
>>> Plus, what do we do if the backend changes table representation in
>>> some way that invalidates Kaarel's formula altogether? More confusion
>>> would be inevitable.
>> Why not then insert a "few" rows, measure the size, truncate the table,
>> compute the formula, and then insert up to the size the user requested?
>> (Or insert what should be the minimum, scale 1, measure, and extrapolate
>> what's missing.) It doesn't sound too complicated to me, and targeting a
>> size is something that I believe is quite useful for users.
> Then you'd *really* have irreproducible results.
>
> 			regards, tom lane

You also have irreproducible results today, according to that criterion. Either you agree on the number of rows but may not agree on the size (today), or you agree on the size but may not agree on the number of rows. Right now you can only pick the former, while I think people would significantly appreciate the latter. If neither is perfect, let's at least provide the choice.

Regards,

Álvaro

--
Alvaro Hernandez
-----------
OnGres
Hello Tom,

>> Here is an attempt at extending --scale so that it can be given a size.
>
> I do not actually find this to be a good idea. It's going to be
> platform-dependent, or not very accurate, or both, and thereby
> contribute to confusion by making results less reproducible.

I have often wanted such an option for testing, with criteria like "within shared_buffers", "within memory", "twice the available memory", to look for behavioral changes in some performance tests.

If you want reproducible (for some definition of reproducible) and accurate, you can always give --scale a plain number. The report shows the actual scale used anyway, so providing a size is just a convenience for the initialization phase.

I agree that it cannot be really exact. Would it be more acceptable with some clear(er)/explicit caveat?

> Plus, what do we do if the backend changes table representation in
> some way that invalidates Kaarel's formula altogether?

Then the formula (a simple linear regression, really) would have to be updated.

> More confusion would be inevitable.

There is not much confusion when the "scale" is reported. As for confusion, a performance test is influenced by dozens of parameters anyway.

Now if you do not want such a feature, you can mark it as rejected, and we will keep on trying to guess or look for the formula till the end of time :-)

--
Fabien.
> Why not then insert a "few" rows, measure the size, truncate the table,
> compute the formula, and then insert up to the size the user requested?
> (Or insert what should be the minimum, scale 1, measure, and extrapolate
> what's missing.) It doesn't sound too complicated to me, and targeting a
> size is something that I believe is quite useful for users.

The formula I used approximates the whole database, not just one table. There was one for the table alone, but that is only part of the issue. In particular, ISTM that index sizes should be included when caching is considered.

Also, index sizes probably grow as n ln(n), so some level of approximation is inevitable.

Moreover, the intrinsic granularity of TPC-B as a multiple of 100,000 rows makes it not very precise wrt size anyway.

--
Fabien.
On 17/02/18 12:37, Fabien COELHO wrote:
>
>> Why not then insert a "few" rows, measure the size, truncate the
>> table, compute the formula, and then insert up to the size the user
>> requested? (Or insert what should be the minimum, scale 1, measure,
>> and extrapolate what's missing.) It doesn't sound too complicated to
>> me, and targeting a size is something that I believe is quite useful
>> for users.
>
> The formula I used approximates the whole database, not just one
> table. There was one for the table alone, but that is only part of
> the issue. In particular, ISTM that index sizes should be included
> when caching is considered.
>
> Also, index sizes probably grow as n ln(n), so some level of
> approximation is inevitable.
>
> Moreover, the intrinsic granularity of TPC-B as a multiple of 100,000
> rows makes it not very precise wrt size anyway.
>

Sure, makes sense, so my second suggestion seems more reasonable: insert at scale 1, measure there (OK, you might need to create indexes only to later drop them), and if the computed scale is > 1, insert whatever is left to insert. Doesn't seem like a big deal to me. I like the feature :)

Álvaro

--
Alvaro Hernandez
-----------
OnGres
Hello Alvaro & Tom,

>>> Why not then insert a "few" rows, measure the size, truncate the table,
>>> compute the formula, and then insert up to the size the user requested?
>>> (Or insert what should be the minimum, scale 1, measure, and extrapolate
>>> what's missing.) It doesn't sound too complicated to me, and targeting
>>> a size is something that I believe is quite useful for users.
>>
>> The formula I used approximates the whole database, not just one table.
>> There was one for the table alone, but that is only part of the issue.
>> In particular, ISTM that index sizes should be included when caching is
>> considered.
>>
>> Also, index sizes probably grow as n ln(n), so some level of
>> approximation is inevitable.
>>
>> Moreover, the intrinsic granularity of TPC-B as a multiple of 100,000
>> rows makes it not very precise wrt size anyway.
>
> Sure, makes sense, so my second suggestion seems more reasonable: insert
> at scale 1, measure there (OK, you might need to create indexes only to
> later drop them), and if the computed scale is > 1, insert whatever is
> left to insert. Doesn't seem like a big deal to me.

I could implement that, even if it would still lead to some approximation: ISTM that the very-large-scale regression performed by Kaarel is significantly more precise than testing with scale 1 (typically a few MiB) and extrapolating that to hundreds of GiB.

Maybe it could be done with a kind of open-ended dichotomy, but creating and recreating the index looks like an ugly solution, and what should be significant is the whole database size, including the tellers and branches tables and all indexes, so I'm not convinced. Now as the tellers and branches tables have basically the same structure as accounts, they could simply be scaled in by assuming they incur the same storage per row. Anyway, even if I do not like it, it could be better than nothing.

The key point for me is that if Tom is dead set against the feature, the patch is dead anyway. Tom, would Alvaro's approach be more acceptable to you than a fixed formula that would need updating, keeping in mind that such a feature implies some level of approximation?

--
Fabien.
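For concreteness, a rough libpq sketch of the measure-and-extrapolate idea being discussed (the function names are made up, and it assumes the scale-1 pgbench tables and indexes already exist in the connected database; none of this is from the patch):

/* Rough sketch of "measure scale 1, then extrapolate".  Not from the patch. */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static long long
measured_scale1_bytes(PGconn *conn)
{
    /* total size of the scale-1 pgbench tables, including their indexes */
    PGresult *res = PQexec(conn,
        "SELECT sum(pg_total_relation_size(oid)) "
        "FROM pg_class WHERE relkind = 'r' AND relname LIKE 'pgbench_%'");
    long long bytes = -1;

    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
        bytes = atoll(PQgetvalue(res, 0, 0));
    PQclear(res);
    return bytes;
}

static int
extrapolated_scale(PGconn *conn, long long target_bytes)
{
    long long per_step = measured_scale1_bytes(conn);

    if (per_step <= 0)
        return -1;

    /*
     * Crude linear extrapolation: assumes each further scale step costs
     * about as much as scale 1 did, which ignores the non-linear (roughly
     * n ln(n)) growth of the indexes mentioned above.
     */
    long long scale = target_bytes / per_step;
    return scale < 1 ? 1 : (int) scale;
}

The crude division at the end is exactly where the imprecision Fabien mentions comes in, since index size does not grow strictly linearly with the row count.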
On Sat, Feb 17, 2018 at 12:22:37PM -0500, Alvaro Hernandez wrote:
> On 17/02/18 12:17, Tom Lane wrote:
> > Alvaro Hernandez <aht@ongres.com> writes:
> >> On 17/02/18 11:26, Tom Lane wrote:
> >>> Fabien COELHO <coelho@cri.ensmp.fr> writes:
> >>>> Here is an attempt at extending --scale so that it can be given a size.
> >>> I do not actually find this to be a good idea. It's going to be
> >>> platform-dependent, or not very accurate, or both, and thereby
> >>> contribute to confusion by making results less reproducible.
> >>>
> >>> Plus, what do we do if the backend changes table representation in
> >>> some way that invalidates Kaarel's formula altogether? More confusion
> >>> would be inevitable.
> >> Why not then insert a "few" rows, measure the size, truncate the table,
> >> compute the formula, and then insert up to the size the user requested?
> >> (Or insert what should be the minimum, scale 1, measure, and extrapolate
> >> what's missing.) It doesn't sound too complicated to me, and targeting
> >> a size is something that I believe is quite useful for users.
> > Then you'd *really* have irreproducible results.
> >
> > 			regards, tom lane
>
> You also have irreproducible results today, according to that
> criterion. Either you agree on the number of rows but may not agree on
> the size (today), or you agree on the size but may not agree on the
> number of rows. Right now you can only pick the former, while I think
> people would significantly appreciate the latter. If neither is
> perfect, let's at least provide the choice.

What if we consider using ASCII (UTF-8?) text file sizes as a reference point, something independent from the database?

I realize that even if a flat file size can be used as a more consistent reference across platforms, it doesn't help with accurately determining the final data file sizes due to architecture-specific nuances or changes in the backend. But perhaps it might still offer a little more meaning to be able to say "50 gigabytes of bank account data" rather than "10 million rows of bank accounts" when thinking about size over cardinality.

Regards,
Mark

--
Mark Wong                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Hello Mark,

> What if we consider using ASCII (UTF-8?) text file sizes as a reference
> point, something independent from the database?

Why not. TPC-B basically specifies that rows (accounts, tellers, branches) are all padded to 100 bytes, so we could consider (i.e. document) that --scale=SIZE refers to the amount of useful data held, and warn that actual storage adds various overheads for page and row headers, free space at the end of pages, indexes...

Then one scale step is 100,000 accounts, 10 tellers and 1 branch, i.e. 100,011 * 100 ~ 9.5 MiB of useful data per scale step.

> I realize that even if a flat file size can be used as a more consistent
> reference across platforms, it doesn't help with accurately determining
> the final data file sizes due to architecture-specific nuances or changes
> in the backend. But perhaps it might still offer a little more meaning to
> be able to say "50 gigabytes of bank account data" rather than "10 million
> rows of bank accounts" when thinking about size over cardinality.

Yep. Now the overhead is really 60-65%. The specification is unambiguous, but we still need some maths to know whether it fits in buffers or memory... The point of Kaarel's regression is to take this into account.

Also, whether this option would be more acceptable to Tom is still an open question. Tom?

--
Fabien.
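Spelling that arithmetic out as a tiny sketch (the helper name is invented; the constants are the ones given above, and the 65% figure only appears as a comment):

/* Sketch of the "useful data" convention discussed here. */
#define ROWS_PER_SCALE_STEP  100011   /* 100,000 accounts + 10 tellers + 1 branch */
#define BYTES_PER_ROW        100      /* TPC-B pads every row to 100 bytes */

static int
scale_for_useful_bytes(double useful_bytes)
{
    double per_step = (double) ROWS_PER_SCALE_STEP * BYTES_PER_ROW; /* ~9.5 MiB */
    int    scale = (int) (useful_bytes / per_step);

    return scale < 1 ? 1 : scale;
}

/*
 * Example: asking for 50 GB of useful data gives
 *   50e9 / (100011 * 100) ~= 4999 scale steps,
 * and with the ~65% storage overhead quoted above the on-disk size
 * would then come out around 82 GB.
 */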
> Now the overhead is really 60-65%. The specification is unambiguous,
> but we still need some maths to know whether it fits in buffers or
> memory... The point of Kaarel's regression is to take this into account.
>
> Also, whether this option would be more acceptable to Tom is still an
> open question. Tom?

Here is a version with this approach: the documentation talks about "actual data size, without overheads", and points out that storage overheads are typically an additional 65%.

--
Fabien.
Attachment
On 2/20/18 05:06, Fabien COELHO wrote:
>> Now the overhead is really 60-65%. The specification is unambiguous,
>> but we still need some maths to know whether it fits in buffers or
>> memory... The point of Kaarel's regression is to take this into account.
>>
>> Also, whether this option would be more acceptable to Tom is still an
>> open question. Tom?
>
> Here is a version with this approach: the documentation talks about
> "actual data size, without overheads", and points out that storage
> overheads are typically an additional 65%.

I think that when deciding on a size for a test database for benchmarking, you want to size it relative to RAM or other storage layers. So a feature that lets you ask for a database of size N but actually creates one nowhere near N seems pretty useless for that.

(Also, we have, for better or worse, settled on a convention for byte unit prefixes in guc.c. Let's not introduce another one.)

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>> Now the overhead is really 60-65%. The specification is unambiguous,
>>> but we still need some maths to know whether it fits in buffers or
>>> memory... The point of Kaarel's regression is to take this into account.
>>>
>>> Also, whether this option would be more acceptable to Tom is still an
>>> open question. Tom?
>>
>> Here is a version with this approach: the documentation talks about
>> "actual data size, without overheads", and points out that storage
>> overheads are typically an additional 65%.
>
> I think that when deciding on a size for a test database for
> benchmarking, you want to size it relative to RAM or other storage
> layers. So a feature that lets you ask for a database of size N but
> actually creates one nowhere near N seems pretty useless for that.

Hmmm. At least the option says the size of the useful data, which should be what the user is really interested in :-)

You have a developer's point of view about the issue. From a performance point of view, ISTM that useful data storage size is an interesting measure, which allows comparisons between (future) storage engines and shows the impact of smaller overheads, for instance.

The other option can only be some kind of approximation, and would require some kind of maintenance (e.g. a future zheap overhead would be different from the heap overhead, the overhead depends on the size itself, and it could also depend on other options). This has been rejected, and I agree with the rejection (incredible :-). So ISTM that the patch is dead because it is somehow necessarily imprecise. People will continue to do some wild guessing on how to translate scale to anything related to size.

> (Also, we have, for better or worse, settled on a convention for byte
> unit prefixes in guc.c. Let's not introduce another one.)

Hmmm. Indeed, for worse, as it is soooo much better to invent our own units than to reuse existing ones which were not confusing enough :-)

 - SI units:    1 kB  = 1000 bytes (*small* k)
 - IEC units:   1 KiB = 1024 bytes
 - JEDEC units: 1 KB  = 1024 bytes (*capital* K)

But the postgres documentation uses "kB" for 1024 bytes, too bad :-(

The GUCs are about memory, which is measured in powers of 1024, but storage is usually measured in powers of 1000, and this option was about storage, hence I felt it better to avoid confusion.

Conclusion: mark the patch as rejected?

--
Fabien.
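For reference, the three conventions side by side (nothing PostgreSQL-specific in these defines; the last comment restates the documentation point made above):

/* The three competing byte-unit conventions. */
#define SI_kB      1000LL   /* "kB": SI, small k, powers of 1000      */
#define IEC_KiB    1024LL   /* "KiB": IEC, powers of 1024             */
#define JEDEC_KB   1024LL   /* "KB": JEDEC, capital K, powers of 1024 */

/* PostgreSQL's GUCs spell 1024 bytes as "kB", i.e. the SI spelling with
 * the JEDEC meaning -- exactly the confusion lamented above. */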
On 3/4/18 04:09, Fabien COELHO wrote:
> So ISTM that the patch is dead because it is somehow necessarily
> imprecise. People will continue to do some wild guessing on how to
> translate scale to anything related to size.

I think so.

> Conclusion: mark the patch as rejected?

OK

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services