Thread: pgbench - allow to specify scale as a size

pgbench - allow to specify scale as a size

From
Fabien COELHO
Date:
After Kaarel Moppel's piece on pgbench scale/size conversion, it occurred to me that having this as an option would be nice.

https://www.cybertec-postgresql.com/en/a-formula-to-calculate-pgbench-scaling-factor-for-target-db-size/

Here is an attempt at extending --scale so that it can be given a size.

   pgbench -i --scale=124G ...

The approximated database size is also shown in the end-of-run summary.
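For illustration, here is a minimal sketch of the size-to-scale conversion, hedged heavily: the actual patch relies on Kaarel's regression formula from the blog post above, whereas this sketch uses the rough rule of thumb of ~16 MiB on disk per scale step, and the function name is made up:

```python
# Hypothetical sketch of turning --scale=124G into a scale factor.
# NOT the patch's actual formula: one scale step is approximated here
# as ~16 MiB on disk, which is only a rough rule of thumb.
UNITS = {"k": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_scale(arg: str, bytes_per_step: int = 16 * 1024**2) -> int:
    """Return a scale factor: plain numbers pass through, sizes are converted."""
    if arg and arg[-1] in UNITS:
        nbytes = int(arg[:-1]) * UNITS[arg[-1]]
        return max(1, round(nbytes / bytes_per_step))
    return int(arg)  # a bare number is already a scale factor

print(parse_scale("124G"))  # -> 7936
print(parse_scale("50"))    # -> 50
```

Note that a bare number must keep behaving as a plain scale factor, which is exactly the corner the unit-suffix parsing has to leave intact.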

-- 
Fabien.
Attachment

Re: pgbench - allow to specify scale as a size

From
Erik Rijkers
Date:
On 2018-02-17 10:20, Fabien COELHO wrote:
> After Karel Moppel piece on pgbench scale/size conversion, it occured
> to me that having this as an option would be nice.
> 
> https://www.cybertec-postgresql.com/en/a-formula-to-calculate-pgbench-scaling-factor-for-target-db-size/
> 
> Here is a attempt at extending --scale so that it can be given a size.
> 
>   pgbench -i --scale=124G ...
> 
> The approximated database size is also shown in the end-of-run summary.

> [pgbench-scale-size-1.patch]

Seems a nice addition, but something isn't quite right; with '-s 50' (no 
unit) I get 'scale 50 too small':

$ pgbench -is 50
scale 50 too small, rounded to 1
dropping old tables...
creating tables...
generating data...
100000 of 100000 tuples (100%) done (elapsed 0.13 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done.

echo '\dti+ pgbench_accounts*' | psql -qX

                                      List of relations
 Schema |         Name          | Type  |  Owner   |      Table       |  Size   | Description
--------+-----------------------+-------+----------+------------------+---------+-------------
 public | pgbench_accounts      | table | aardvark |                  | 13 MB   |
 public | pgbench_accounts_pkey | index | aardvark | pgbench_accounts | 2208 kB |
(2 rows)


thanks,

Erik Rijkers


Re: pgbench - allow to specify scale as a size

From
Fabien COELHO
Date:
> Seem a nice addition but something isn't quite right; with '-s 50' (no unit) 
> I get: 'scale 50 too small':

Sigh. Indeed, it seems that I forgot to test some cases... Thanks for the 
debugging. Here is a hopefully better attempt.

I also upgraded the regression test to cover more than "-s 1", and tried 
to improve the documentation to point out that the result is an 
approximation.

-- 
Fabien.
Attachment

Re: pgbench - allow to specify scale as a size

From
Tom Lane
Date:
Fabien COELHO <coelho@cri.ensmp.fr> writes:
> Here is a attempt at extending --scale so that it can be given a size.

I do not actually find this to be a good idea.  It's going to be
platform-dependent, or not very accurate, or both, and thereby
contribute to confusion by making results less reproducible.

Plus, what do we do if the backend changes table representation in
some way that invalidates Kaarel's formula altogether?  More confusion
would be inevitable.

            regards, tom lane


Re: pgbench - allow to specify scale as a size

From
Alvaro Hernandez
Date:

On 17/02/18 11:26, Tom Lane wrote:
> Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> Here is a attempt at extending --scale so that it can be given a size.
> I do not actually find this to be a good idea.  It's going to be
> platform-dependent, or not very accurate, or both, and thereby
> contribute to confusion by making results less reproducible.
>
> Plus, what do we do if the backend changes table representation in
> some way that invalidates Kaarel's formula altogether?  More confusion
> would be inevitable.

     Why not then insert a "few" rows, measure the size, truncate the table, 
compute the formula, and then insert up to the user-requested size? 
(Or insert what should be the minimum, scale 1, measure, and extrapolate 
what's missing.) It doesn't sound too complicated to me, and targeting a 
size is something that I believe is quite useful for users.


     Álvaro


-- 

Alvaro Hernandez


-----------
OnGres



Re: pgbench - allow to specify scale as a size

From
Tom Lane
Date:
Alvaro Hernandez <aht@ongres.com> writes:
> On 17/02/18 11:26, Tom Lane wrote:
>> Fabien COELHO <coelho@cri.ensmp.fr> writes:
>>> Here is a attempt at extending --scale so that it can be given a size.

>> I do not actually find this to be a good idea.  It's going to be
>> platform-dependent, or not very accurate, or both, and thereby
>> contribute to confusion by making results less reproducible.
>> 
>> Plus, what do we do if the backend changes table representation in
>> some way that invalidates Kaarel's formula altogether?  More confusion
>> would be inevitable.

>      Why not then insert a "few" rows, measure size, truncate the table, 
> compute the formula and then insert to the desired user requested size? 
> (or insert what should be the minimum, scale 1, measure, and extrapolate 
> what's missing). It doesn't sound too complicated to me, and targeting a 
> size is something that I believe it's quite good for user.

Then you'd *really* have irreproducible results.

            regards, tom lane


Re: pgbench - allow to specify scale as a size

From
Alvaro Hernandez
Date:

On 17/02/18 12:17, Tom Lane wrote:
> Alvaro Hernandez <aht@ongres.com> writes:
>> On 17/02/18 11:26, Tom Lane wrote:
>>> Fabien COELHO <coelho@cri.ensmp.fr> writes:
>>>> Here is a attempt at extending --scale so that it can be given a size.
>>> I do not actually find this to be a good idea.  It's going to be
>>> platform-dependent, or not very accurate, or both, and thereby
>>> contribute to confusion by making results less reproducible.
>>>
>>> Plus, what do we do if the backend changes table representation in
>>> some way that invalidates Kaarel's formula altogether?  More confusion
>>> would be inevitable.
>>       Why not then insert a "few" rows, measure size, truncate the table,
>> compute the formula and then insert to the desired user requested size?
>> (or insert what should be the minimum, scale 1, measure, and extrapolate
>> what's missing). It doesn't sound too complicated to me, and targeting a
>> size is something that I believe it's quite good for user.
> Then you'd *really* have irreproducible results.
>
>             regards, tom lane

     You also have irreproducible results today, by your own 
criteria. Either you agree on the number of rows but may not agree on 
the size (today), or you agree on the size but may not agree on the 
number of rows. Right now you can only pick the former, while I think 
people would significantly appreciate the latter. If neither is perfect, 
let's at least provide the choice.

     Regards,

     Álvaro


-- 

Alvaro Hernandez


-----------
OnGres



Re: pgbench - allow to specify scale as a size

From
Fabien COELHO
Date:
Hello Tom,

>> Here is a attempt at extending --scale so that it can be given a size.
>
> I do not actually find this to be a good idea.  It's going to be
> platform-dependent, or not very accurate, or both, and thereby
> contribute to confusion by making results less reproducible.

I have often wanted such an option for testing, with criteria 
like "within shared_buffers", "within memory", or "twice the available 
memory", to look for behavioral changes in some performance tests.

If you want reproducible (for some definition of reproducible) and 
accurate, you can always give scale as a number. The report provides the 
actual scale used anyway, so providing the size is just a convenience for 
the initialization phase. I agree that it cannot be really exact.

Would it be more acceptable with some clearer/explicit caveat?

> Plus, what do we do if the backend changes table representation in
> some way that invalidates Kaarel's formula altogether?

Then the formula (a simple linear regression, really) would simply have to 
be updated?

> More confusion would be inevitable.

There is not much confusion when the "scale" is reported. As for confusion, 
a performance test is influenced by dozens of parameters anyway.

Now if you do not want such a feature, you can mark it as rejected, and we 
will keep on trying to guess or look up the formula till the end of 
time :-)

-- 
Fabien.


Re: pgbench - allow to specify scale as a size

From
Fabien COELHO
Date:
>     Why not then insert a "few" rows, measure size, truncate the table, 
> compute the formula and then insert to the desired user requested size? (or 
> insert what should be the minimum, scale 1, measure, and extrapolate what's 
> missing). It doesn't sound too complicated to me, and targeting a size is
> something that I believe it's quite good for user.

The formula I used approximates the whole database, not just one table. 
There was one for the table alone, but that is only part of the issue. In 
particular, ISTM that index sizes should be included when caching is 
considered.

Also, index sizes probably grow as n ln(n), so some level of approximation 
is inevitable.

Moreover, the intrinsic granularity of TPC-B, in multiples of 100,000 rows, 
makes it not very precise wrt size anyway.

-- 
Fabien.

Re: pgbench - allow to specify scale as a size

From
Alvaro Hernandez
Date:

On 17/02/18 12:37, Fabien COELHO wrote:
>
>>     Why not then insert a "few" rows, measure size, truncate the 
>> table, compute the formula and then insert to the desired user 
>> requested size? (or insert what should be the minimum, scale 1, 
>> measure, and extrapolate what's missing). It doesn't sound too 
>> complicated to me, and targeting a size is
>> something that I believe it's quite good for user.
>
> The formula I used approximates the whole database, not just one 
> table. There was one for the table, but this is only part of the 
> issue. In particular, ISTM that index sizes should be included when 
> caching is considered.
>
> Also, index sizes are probably in n ln(n), so some level of 
> approximation is inevitable.
>
> Moreover, the intrinsic granularity of TPC-B as multiple of 100,000 
> rows makes it not very precise wrt size anyway.
>

     Sure, makes sense, so my second suggestion seems more reasonable: 
insert at scale 1, measure there (OK, you might need to create indexes 
only to later drop them), and if the computed scale > 1, insert whatever 
is left to insert. Shouldn't be a big deal, to me.
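The arithmetic of that measure-and-extrapolate idea can be sketched as follows. This is illustrative only: the function name is made up, and in practice the scale-1 size would come from pg_database_size() after a real scale-1 initialization:

```python
# Illustrative linear extrapolation from a scale-1 measurement.
# The measured value used below (~15.5 MB: 13 MB table + ~2.2 MB index,
# roughly as in Erik's psql output earlier in the thread) is an example,
# not a constant.
def extrapolate_scale(target_bytes: int, scale1_bytes: int) -> int:
    """Scale needed to reach target_bytes, extrapolating linearly (min 1)."""
    return max(1, round(target_bytes / scale1_bytes))

scale = extrapolate_scale(50 * 1024**3, 15_500_000)  # target ~50 GiB
print(scale)  # -> 3464
```

As Fabien notes above, index growth is closer to n ln(n) than linear, so a linear extrapolation from a few MiB will somewhat underestimate large targets.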

     I like the feature :)

     Álvaro

-- 

Alvaro Hernandez


-----------
OnGres



Re: pgbench - allow to specify scale as a size

From
Fabien COELHO
Date:
Hello Alvaro & Tom,

>>> Why not then insert a "few" rows, measure size, truncate the table, 
>>> compute the formula and then insert to the desired user requested 
>>> size? (or insert what should be the minimum, scale 1, measure, and 
>>> extrapolate what's missing). It doesn't sound too complicated to me, 
>>> and targeting a size is something that I believe it's quite good for 
>>> user.
>> 
>> The formula I used approximates the whole database, not just one table. 
>> There was one for the table, but this is only part of the issue. In 
>> particular, ISTM that index sizes should be included when caching is 
>> considered.
>> 
>> Also, index sizes are probably in n ln(n), so some level of 
>> approximation is inevitable.
>> 
>> Moreover, the intrinsic granularity of TPC-B as multiple of 100,000 
>> rows makes it not very precise wrt size anyway.
>
> Sure, makes sense, so my second suggestion seems more reasonable: insert 
> with scale 1, measure there (ok, you might need to crete indexes only to 
> later drop them), and if computed scale > 1 then insert whatever is left 
> to insert. Shouldn't be a big deal to me.

I could implement that, even if it would still lead to some approximation: 
ISTM that the very-large-scale regression performed by Kaarel is 
significantly more precise than testing with scale 1 (typically a few MiB) 
and extrapolating that to hundreds of GiB.

Maybe it could be done with some kind of open-ended dichotomy, but creating 
and recreating indexes looks like an ugly solution, and what should be 
significant is the whole database size, including the tellers & branches 
tables and all indexes, so I'm not convinced. Now, as the tellers & 
branches tables have basically the same structure as accounts, they could 
be accounted for by assuming that they incur the same storage per row.

Anyway, even if I do not like it, it could be better than nothing. The key 
point for me is that if Tom is dead set against the feature, the patch is 
dead anyway.

Tom, would Alvaro's approach be more acceptable to you than a fixed formula 
that would need updating, keeping in mind that such a feature implies 
some level of approximation?

-- 
Fabien.


Re: pgbench - allow to specify scale as a size

From
Mark Wong
Date:
On Sat, Feb 17, 2018 at 12:22:37PM -0500, Alvaro Hernandez wrote:
> 
> 
> On 17/02/18 12:17, Tom Lane wrote:
> > Alvaro Hernandez <aht@ongres.com> writes:
> >> On 17/02/18 11:26, Tom Lane wrote:
> >>> Fabien COELHO <coelho@cri.ensmp.fr> writes:
> >>>> Here is a attempt at extending --scale so that it can be given a size.
> >>> I do not actually find this to be a good idea.  It's going to be
> >>> platform-dependent, or not very accurate, or both, and thereby
> >>> contribute to confusion by making results less reproducible.
> >>>
> >>> Plus, what do we do if the backend changes table representation in
> >>> some way that invalidates Kaarel's formula altogether?  More confusion
> >>> would be inevitable.
> >>       Why not then insert a "few" rows, measure size, truncate the table,
> >> compute the formula and then insert to the desired user requested size?
> >> (or insert what should be the minimum, scale 1, measure, and extrapolate
> >> what's missing). It doesn't sound too complicated to me, and targeting a
> >> size is something that I believe it's quite good for user.
> > Then you'd *really* have irreproducible results.
> >
> >             regards, tom lane
> 
>      You also have irreproducible results today, according to your 
> criteria. Either you agree on the number of rows but may not agree on 
> the size (today), or you agree on the size but may not agree on the 
> number of rows. Right now you can only pick the former, while I think 
> people would significantly appreciate the latter. If neither is correct, 
> let's at least provide the choice.

What if we consider using ASCII (UTF-8?) text file sizes as a reference
point, something independent from the database?

I realize that even if a flat file size can be used as a more consistent
reference across platforms, it doesn't help with accurately determining
the final data file sizes, due to architecture-specific nuances or
changes in the backend.  But perhaps it might still offer a little more
meaning to be able to say "50 gigabytes of bank account data" rather
than "10 million rows of bank accounts" when thinking about size over
cardinality.

Regards,
Mark

-- 
Mark Wong                                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services


Re: pgbench - allow to specify scale as a size

From
Fabien COELHO
Date:
Hello Mark,

> What if we consider using ascii (utf8?) text file sizes as a reference
> point, something independent from the database?

Why not.

TPC-B basically specifies that rows (accounts, tellers, branches) are all 
padded to 100 bytes, thus we could consider (i.e. document) that 
--scale=SIZE refers to the amount of useful data held, and warn that 
actual storage adds various overheads for page and row headers, free 
space at the end of pages, indexes...

Then one scale step is 100,000 accounts, 10 tellers and 1 branch, i.e.
100,011 * 100 ~ 9.5 MiB of useful data per scale step.
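As a quick check of that arithmetic (a throwaway calculation using the thread's own numbers; the ~65% overhead figure is the one discussed elsewhere in the thread, not a specification value):

```python
# One scale step: 100,000 accounts + 10 tellers + 1 branch, 100 bytes each.
rows_per_step = 100_000 + 10 + 1            # 100,011 rows
useful_bytes = rows_per_step * 100          # 10,001,100 bytes
print(useful_bytes / 1024**2)               # -> ~9.54 MiB of useful data

# Adding the ~65% storage overhead, one step lands around 15.7 MiB on
# disk, consistent with the ~15 MB scale-1 database Erik measured.
print(useful_bytes * 1.65 / 1024**2)
```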

> I realize even if a flat file size can be used as a more consistent
> reference across platforms, it doesn't help with accurately determining
> the final data file sizes due to any architecture specific nuances or
> changes in the backend.  But perhaps it might still offer a little more
> meaning to be able to say "50 gigabytes of bank account data" rather
> than "10 million rows of bank accounts" when thinking about size over
> cardinality.

Yep.

Now the overhead is really 60-65%. Although the specification is 
unambiguous, we still need some maths to know whether it fits in 
buffers or memory... The point of Kaarel's regression is to take this into 
account.

Also, whether this option would be more admissible to Tom is still an open 
question. Tom?

-- 
Fabien.


Re: pgbench - allow to specify scale as a size

From
Fabien COELHO
Date:
> Now the overhead is really 60-65%. Although the specification is unambiguous, 
> but we still need some maths to know whether it fits in buffers or memory... 
> The point of Karel regression is to take this into account.
>
> Also, whether this option would be more admissible to Tom is still an open 
> question. Tom?

Here is a version with this approach: the documentation talks about 
"actual data size, without overheads", and points out that storage 
overheads are typically an additional 65%.

-- 
Fabien.
Attachment

Re: pgbench - allow to specify scale as a size

From
Peter Eisentraut
Date:
On 2/20/18 05:06, Fabien COELHO wrote:
>> Now the overhead is really 60-65%. Although the specification is unambiguous, 
>> but we still need some maths to know whether it fits in buffers or memory... 
>> The point of Karel regression is to take this into account.
>>
>> Also, whether this option would be more admissible to Tom is still an open 
>> question. Tom?
> 
> Here is a version with this approach: the documentation talks about 
> "actual data size, without overheads", and points out that storage 
> overheads are typically an additional 65%.

I think when deciding on a size for a test database for benchmarking,
you want to size it relative to RAM or other storage layers.  So a
feature that lets you ask for a database of size N but actually creates
something nowhere near N seems pretty useless for that.

(Also, we have, for better or worse, settled on a convention for byte
unit prefixes in guc.c.  Let's not introduce another one.)

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgbench - allow to specify scale as a size

From
Fabien COELHO
Date:
>>> Now the overhead is really 60-65%. Although the specification is unambiguous,
>>> but we still need some maths to know whether it fits in buffers or memory...
>>> The point of Karel regression is to take this into account.
>>>
>>> Also, whether this option would be more admissible to Tom is still an open
>>> question. Tom?
>>
>> Here is a version with this approach: the documentation talks about
>> "actual data size, without overheads", and points out that storage
>> overheads are typically an additional 65%.
>
> I think when deciding on a size for a test database for benchmarking,
> you want to size it relative to RAM or other storage layers.  So a
> feature that allows you to create a database of size N but it's actually
> not going to be anywhere near N seems pretty useless for that.

Hmmm.

At least the option says the size of the useful data, which should be what 
the user is really interested in :-) You have a developer's point of view 
on the issue. From a performance point of view, ISTM that useful data 
storage size is an interesting measure, which allows comparing between 
(future) storage engines and showing the impact of smaller overheads, for 
instance.

The other option can only be some kind of approximation, and would require 
some kind of maintenance (e.g. a future zheap overhead would be different 
from the heap overhead, the overhead depends on the size itself, and it 
could also depend on other options). This has been rejected, and I agree 
with the rejection (incredible :-).

So ISTM that the patch is dead because it is somehow necessarily 
imprecise. People will continue to do some wild guessing on how to 
translate scale to anything related to size.

> (Also, we have, for better or worse, settled on a convention for byte
> unit prefixes in guc.c.  Let's not introduce another one.)

Hmmm. Indeed, for worse, as it is soooo much better to invent our own units 
than to reuse existing ones, which were not confusing enough :-)

  - SI units: 1 kB = 1000 bytes (*lowercase* k)
  - IEC units: 1 KiB = 1024 bytes
  - JEDEC units: 1 KB = 1024 bytes (*capital* K)

But the Postgres documentation uses "kB" for 1024 bytes, too bad :-(

The GUCs are about memory, which is measured in units of 1024, but storage 
is usually measured in units of 1000, and this option was about storage, 
hence I felt it better to avoid confusion.
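The practical gap between these conventions is easy to quantify (a quick illustration, nothing patch-specific):

```python
# How much the conventions disagree for the "124G" example above:
si_bytes  = 124 * 1000**3    # SI:    124 GB  = 124,000,000,000 bytes
iec_bytes = 124 * 1024**3    # IEC:   124 GiB = 133,143,986,176 bytes
                             # JEDEC: "124 GB" meaning the IEC value
print(iec_bytes / si_bytes)  # -> ~1.074, i.e. a ~7.4% spread at the G level
```

A ~7% discrepancy is well within the range that decides whether a dataset fits in shared_buffers or not, which is why the choice of convention matters here.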

Conclusion: mark the patch as rejected?

-- 
Fabien.


Re: pgbench - allow to specify scale as a size

From
Peter Eisentraut
Date:
On 3/4/18 04:09, Fabien COELHO wrote:
> So ISTM that the patch is dead because it is somehow necessarily 
> imprecise. People will continue to do some wild guessing on how to 
> translate scale to anything related to size.

I think so.

> Conclusion: mark the patch as rejected?

OK

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services