Thread: pgbench-ycsb

pgbench-ycsb

From

a.bykov@postgrespro.ru

Date:

19 July 2018, 15:46:59

Hello, hackers.

It might be a good idea to give users an opportunity to test their
applications with pgbench under different real-life-like loads, so that
they will be able to see what's going to happen in production.

YCSB (Yahoo! Cloud Serving Benchmark) was taken as a concept. YCSB tests
were originally designed to facilitate performance comparisons of
different cloud data serving systems, and they take into account different
application workloads, like:
workload A - assumes that the application does a mix of reads (50%) and
updates (50%).
workload B - the application does reads in 95% of cases and updates in
5% of cases.
workload C - models the behavior of a read-only application.
workload E - the application requests several neighboring tuples in 95%
of cases and does updates in 5% of cases.

In the patch those workloads were implemented to be executed by pgbench:
pgbench -b ycsb-A
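
For illustration, the operation mixes listed above can be sketched in Python as below. This is only a sketch of the percentages as stated in this message, not the patch's implementation; the operation names ("read", "update", "scan") and the function name are assumptions made for the example.

```python
import random

# Operation mixes as described in the message above, in percent.
# "scan" stands for a request for several neighboring tuples; all
# operation names here are illustrative, not taken from the patch.
WORKLOADS = {
    "A": {"read": 50, "update": 50},
    "B": {"read": 95, "update": 5},
    "C": {"read": 100},
    "E": {"scan": 95, "update": 5},
}


def pick_operation(workload, rng):
    """Draw one operation name according to the workload's percentages."""
    mix = WORKLOADS[workload]
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)
    counts = {}
    for _ in range(10_000):
        op = pick_operation("B", rng)
        counts[op] = counts.get(op, 0) + 1
    # Roughly 95% reads and 5% updates are expected for workload B.
    print(counts)
```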

--
Anthony Bykov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

0001-pgbench-ycsb-v3.patch

Re: pgbench-ycsb

From

Fabien COELHO

Date:

19 July 2018, 16:35:44

Hello Anthony,

> applications with pgbench under different real-life-like load. So that
> they will be able to see what's going to happen on production.
>
> YCSB (Yahoo! Cloud Serving Benchmark) was taken as a concept. YCSB tests
> were originally designed to facilitate performance comparisons of
> different cloud data serving systems and it takes into account different
> application workloads like:
> workload A - assumes that application do a lot of reads(50%) and
> updates(50%).
> workload B - case when application do 95% of cases reads
> and 5% updates
> workload C - models behavior of read-only application.
> workload E - the workload of the applications which in 95% of cases
> requests for several neighboring tuples and in 5% of cases - does
> updates.
>
> In the patch those workloads were implemented to be executed by pgbench:
> pgbench -b ycsb-A

Could you provide a link to the specification?

I cannot find something simple, and I was kind of hoping to avoid diving 
into the source code of the java tool on github:-) In particular, I'm 
looking for a description of the expected underlying schema and its size 
(scale) parameters.

Patch does not include any documentation, nor help, nor tests. It should.

+               "\\set write_weight 0\n"
+               "\\set operation random(1,:total_weight)\n"
+               "\\if (:operation < :write_weight)\n"

This is dead code:-( With write_weight set to 0, the \if condition can 
never be true. There is also a lot of copy-paste between the cases, which 
should be avoided if possible.

Note that pgbench already has builtin weight management. I'd suggest 
that the implementation reuse it instead of reimplementing it within 
these duplicated scripts.

Maybe add simple builtins (eg ycsb-read/write/...) for individual 
transactions and a new --load=ycsb-A which would set the various 
transactions with their expected weights.
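
As a rough illustration of the weight management mentioned above: pgbench accepts "-b name@weight" and picks each script with a probability proportional to its weight. The Python sketch below mimics that selection rule. The script names "ycsb-read" and "ycsb-write" are the hypothetical builtins suggested here, not existing ones, and the code is not pgbench's implementation.

```python
import random


def parse_script_spec(spec):
    """Split 'name@weight' into (name, weight); the weight defaults to 1."""
    name, sep, weight = spec.partition("@")
    return name, (int(weight) if sep else 1)


def choose_script(specs, rng):
    """Pick one script name with probability weight / sum(weights)."""
    parsed = [parse_script_spec(spec) for spec in specs]
    names = [name for name, _ in parsed]
    weights = [weight for _, weight in parsed]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    # A 95% read / 5% update mix expressed as two weighted scripts.
    # Both script names are hypothetical.
    specs = ["ycsb-read@95", "ycsb-write@5"]
    rng = random.Random(1)
    picks = [choose_script(specs, rng) for _ in range(10_000)]
    print(picks.count("ycsb-read"), picks.count("ycsb-write"))
```

With this approach each transaction type is written once, and a workload is only a list of weights.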

A, B, C, E... What is missing to get the D bench as well?

-- 
Fabien.

Re: pgbench-ycsb

From

Dmitry Dolgov

Date:

19 July 2018, 16:50:59

> On Thu, 19 Jul 2018 at 15:36, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello Anthony,
>
> > applications with pgbench under different real-life-like load. So that
> > they will be able to see what's going to happen on production.
> >
> > YCSB (Yahoo! Cloud Serving Benchmark) was taken as a concept. YCSB tests
> > were originally designed to facilitate performance comparisons of
> > different cloud data serving systems and it takes into account different
> > application workloads like:
> > workload A - assumes that application do a lot of reads(50%) and
> > updates(50%).
> > workload B - case when application do 95% of cases reads
> > and 5% updates
> > workload C - models behavior of read-only application.
> > workload E - the workload of the applications which in 95% of cases
> > requests for several neighboring tuples and in 5% of cases - does
> > updates.
> >
> > In the patch those workloads were implemented to be executed by pgbench:
> > pgbench -b ycsb-A
>
> Could you provide a link to the specification?
>
> I cannot find something simple, and I was kind of hoping to avoid diving
> into the source code of the java tool on github:-) In particular, I'm
> looking for a description of the expected underlying schema and its size
> (scale) parameters.

There are description files for the different workloads, like [1] (with a
custom number of records, of course), and the schema [2]. Would this
information be enough?

[1]: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloada
[2]: https://github.com/brianfrankcooper/YCSB/blob/master/jdbc/src/main/resources/sql/create_table.sql

Re: pgbench-ycsb

From

a.bykov@postgrespro.ru

Date:

19 July 2018, 17:24:10

On 2018-07-19 16:50, Dmitry Dolgov wrote:
>> On Thu, 19 Jul 2018 at 15:36, Fabien COELHO <coelho@cri.ensmp.fr> 
>> wrote:
>> 
>> 
>> Hello Anthony,
>> 
>> > applications with pgbench under different real-life-like load. So that
>> > they will be able to see what's going to happen on production.
>> >
>> > YCSB (Yahoo! Cloud Serving Benchmark) was taken as a concept. YCSB tests
>> > were originally designed to facilitate performance comparisons of
>> > different cloud data serving systems and it takes into account different
>> > application workloads like:
>> > workload A - assumes that application do a lot of reads(50%) and
>> > updates(50%).
>> > workload B - case when application do 95% of cases reads
>> > and 5% updates
>> > workload C - models behavior of read-only application.
>> > workload E - the workload of the applications which in 95% of cases
>> > requests for several neighboring tuples and in 5% of cases - does
>> > updates.
>> >
>> > In the patch those workloads were implemented to be executed by pgbench:
>> > pgbench -b ycsb-A
>> 
>> Could you provide a link to the specification?
>> 
>> I cannot find something simple, and I was kind of hoping to avoid 
>> diving
>> into the source code of the java tool on github:-) In particular, I'm
>> looking for a description of the expected underlying schema and its 
>> size
>> (scale) parameters.
> 
> There are the description files for different workloads, like [1], 
> (with the
> custom amount of records, of course) and the schema [2]. Would this
> information be enough?
> 
> [1]: 
> https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloada
> [2]:
> https://github.com/brianfrankcooper/YCSB/blob/master/jdbc/src/main/resources/sql/create_table.sql

Hi.
Thanks for your feedback, I'll fix it soon.
Actually I used the article "Brian F. Cooper, Adam Silberstein, Erwin Tam,
Raghu Ramakrishnan and Russell Sears. Benchmarking Cloud Serving Systems
with YCSB. ACM Symposium on Cloud Computing (SoCC), Indianapolis, IN, USA,
2010".
It is available here:
https://github.com/brianfrankcooper/YCSB/wiki/Papers-and-Presentations

But maybe the article is more complicated than your example.

--
Anthony Bykov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: pgbench-ycsb

From

Fabien COELHO

Date:

21 July 2018, 23:40:59

>> Could you provide a link to the specification?
>>
>> I cannot find something simple, and I was kind of hoping to avoid diving
>> into the source code of the java tool on github:-) In particular, I'm
>> looking for a description of the expected underlying schema and its size
>> (scale) parameters.
>
> There are the description files for different workloads, like [1], (with the
> custom amount of records, of course) and the schema [2]. Would this
> information be enough?
>
> [1]: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloada
> [2]: https://github.com/brianfrankcooper/YCSB/blob/master/jdbc/src/main/resources/sql/create_table.sql

The second link is a start.

I notice that the submitted patch transactions do not apply to this 
schema, which is significantly different from the pgbench TPC-B (like) 
benchmark.

The YCSB schema is key -> fields[0-9], all of them TEXT, somehow expected 
to be 100 bytes each, and update is expected to update one of these 
fields.

This suggests that maybe a -i extension would be in order. Possibly

    pgbench -i -s 1 --layout={tpcb,ycsb} (or schema ?)

where "tpcb" would be the default?

I'm sceptical about using a textual primary key as it corresponds more to 
NoSQL limitations than to an actual design choice. I'd be okay with INT8 
as a pkey.

I find the YCSB tablename "usertable" especially unhelpful. Maybe 
"pgbench_ycsb"?
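
To make the shape of such a table concrete, here is a small Python sketch that builds the corresponding CREATE TABLE statement: one INT8 key and ten TEXT fields. The table name follows the suggestion above; the column names "ycsb_key" and "field0".."field9" are assumptions for the example and should be checked against the linked schema file.

```python
def ycsb_table_ddl(table="pgbench_ycsb", n_fields=10):
    """Build a CREATE TABLE statement: an INT8 key plus n_fields TEXT columns.

    The column names are illustrative assumptions, not a specification.
    """
    columns = ["ycsb_key INT8 PRIMARY KEY"]
    columns += [f"field{i} TEXT" for i in range(n_fields)]
    return "CREATE TABLE {} (\n    {}\n);".format(table, ",\n    ".join(columns))


def filler_value(length=100):
    """Return a placeholder payload of the expected field size (100 bytes)."""
    return "x" * length


if __name__ == "__main__":
    print(ycsb_table_ddl())
    print(len(filler_value()))
```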

-- 
Fabien.

Re: pgbench-ycsb

From

Dmitry Dolgov

Date:

22 July 2018, 13:22:45

> On Sat, 21 Jul 2018 at 22:41, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> >> Could you provide a link to the specification?
> >>
> >> I cannot find something simple, and I was kind of hoping to avoid diving
> >> into the source code of the java tool on github:-) In particular, I'm
> >> looking for a description of the expected underlying schema and its size
> >> (scale) parameters.
> >
> > There are the description files for different workloads, like [1], (with the
> > custom amount of records, of course) and the schema [2]. Would this
> > information be enough?
> >
> > [1]: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloada
> > [2]: https://github.com/brianfrankcooper/YCSB/blob/master/jdbc/src/main/resources/sql/create_table.sql
>
> The second link is a start.
>
> I notice that the submitted patch transactions do not apply to this
> schema, which is significantly different from the pgbench TPC-B (like)
> benchmark.
>
> The YCSB schema is key -> fields[0-9], all of them TEXT, somehow expected
> to be 100 bytes each, and update is expected to update one of these
> fields.
>
> This suggests that maybe a -i extension would be in order. Possibly
>
>     pgbench -i -s 1 --layout={tpcb,ycsb} (or schema ?)
>
> where "tpcb" would be the default?
>
> I'm sceptical about using a textual primary key as it corresponds more to
> NoSQL limitations than to an actual design choice. I'd be okay with INT8
> as a pkey.
>
> I find the YCSB tablename "usertable" especially unhelpful. Maybe
> "pgbench_ycsb"?

Just to clarify - if I understand Anthony correctly, this proposal is not about
implementing YCSB exactly as it is, but more about using a zipfian distribution
for an id in the regular pgbench table structure, in conjunction with a
read/write balance, to simulate something similar to it.

And probably instead of implementing the exact YCSB workload inside pgbench, it
makes more sense to add PostgreSQL Jsonb as one of the options into the
framework itself (I was in the middle of it a few years ago, but then was
distracted by some interesting benchmarking results).

Re: pgbench-ycsb

From

Fabien COELHO

Date:

22 July 2018, 16:56:08

> Just to clarify - if I understand Anthony correctly, this proposal is not about
> implementing exactly YCSB as it is, but more about using zipfian distribution
> for an id in the regular pgbench table structure in conjunction with read/write
> balance to simulate something similar to it.

Ok, I misunderstood. My 0.02€: If it does not implement YCSB, and the 
point is not to implement YCSB, then do not call it YCSB:-)

Maybe there could be other simpler builtins to use non uniform 
distributions: {zipf,exp,...}-{simple,select} and default values 
(exp_param, zipf_param?) for the random distribution parameters.

   \set id random_zipfian(1, 100000*:scale, :zipf_param)
   \set val random(-5000, 5000)
   UPDATE pgbench_whatever ...;

Then

   pgbench -b zipf-se@1 -b zipf-si@1 [ -D zipf_param=1.1 ... ] -T 10000 ...
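
To show what a zipfian id distribution looks like, here is a minimal Python sketch that draws integers in [1, n] with probability proportional to 1/k^s. It uses an explicit table of cumulative weights, which is simple but only practical for moderate n; it is not the algorithm pgbench uses for random_zipfian, and the function name is made up for the example.

```python
import random
from itertools import accumulate


def make_zipfian_sampler(n, s, rng):
    """Return a function drawing k in [1, n] with P(k) proportional to 1 / k**s."""
    # Precompute cumulative weights once so that each draw is a binary search.
    cumulative = list(accumulate(1.0 / k ** s for k in range(1, n + 1)))
    population = range(1, n + 1)

    def draw():
        return rng.choices(population, cum_weights=cumulative, k=1)[0]

    return draw


if __name__ == "__main__":
    draw = make_zipfian_sampler(n=100_000, s=1.1, rng=random.Random(0))
    sample = [draw() for _ in range(10_000)]
    # Small ids dominate: id 1 is the most frequent, then 2, and so on.
    for k in (1, 2, 3, 10, 100):
        print(k, sample.count(k))
```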

> And probably instead of implementing the exact YCSB workload inside pgbench, it
> makes more sense to add PostgreSQL Jsonb as one of the options into the
> framework itself (I was in the middle of it few years ago, but then was
> distracted by some interesting benchmarking results).

Sure.

-- 
Fabien.

Re: pgbench-ycsb

From

a.bykov@postgrespro.ru

Date:

22 July 2018, 20:16:55

On 2018-07-22 16:56, Fabien COELHO wrote:
>> Just to clarify - if I understand Anthony correctly, this proposal is 
>> not about
>> implementing exactly YCSB as it is, but more about using zipfian 
>> distribution
>> for an id in the regular pgbench table structure in conjunction with 
>> read/write
>> balance to simulate something similar to it.
> 
> Ok, I misunderstood. My 0.02€: If it does not implement YCSB, and the
> point is not to implement YCSB, then do not call it YCSB:-)
> 
> Maybe there could be other simpler builtins to use non uniform
> distributions: {zipf,exp,...}-{simple,select} and default values
> (exp_param, zipf_param?) for the random distribution parameters.
> 
>   \set id random_zipfian(1, 100000*:scale, :zipf_param)
>   \set val random(-5000, 5000)
>   UPDATE pgbench_whatever ...;
> 
> Then
> 
>   pgbench -b zipf-se@1 -b zipf-si@1 [ -D zipf_param=1.1 ... ] -T 10000 
> ...
> 
>> And probably instead of implementing the exact YCSB workload inside 
>> pgbench, it
>> makes more sense to add PostgreSQL Jsonb as one of the options into 
>> the
>> framework itself (I was in the middle of it few years ago, but then 
>> was
>> distracted by some interesting benchmarking results).
> 
> Sure.

Hello,
thank you for your interest. I'm still improving this idea and the patch,
and I'm very happy about the discussion we are having. It really helps.

The idea was to implement the workloads as close to YCSB as possible
using pgbench.

So the schema it should be applied to is the default schema generated by
pgbench -i (pgbench_accounts).

--
Anthony Bykov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: pgbench-ycsb

From

Fabien COELHO

Date:

22 July 2018, 23:42:14

>>> Just to clarify - if I understand Anthony correctly, this proposal is 
>>> not about implementing exactly YCSB as it is, but more about using 
>>> zipfian distribution for an id in the regular pgbench table structure 
>>> in conjunction with read/write balance to simulate something similar 
>>> to it.
>> 
>> Ok, I misunderstood. My 0.02€: If it does not implement YCSB, and the
>> point is not to implement YCSB, then do not call it YCSB:-)
>> 
>> Maybe there could be other simpler builtins to use non uniform
>> distributions: {zipf,exp,...}-{simple,select} and default values
>> (exp_param, zipf_param?) for the random distribution parameters.
>>
>>   \set id random_zipfian(1, 100000*:scale, :zipf_param)
>>   \set val random(-5000, 5000)
>>   UPDATE pgbench_whatever ...;
>> 
>> Then
>>
>>   pgbench -b zipf-se@1 -b zipf-si@1 [ -D zipf_param=1.1 ... ] -T 10000 ...
>> 
>>> And probably instead of implementing the exact YCSB workload inside 
>>> pgbench, it makes more sense to add PostgreSQL Jsonb as one of the 
>>> options into the framework itself (I was in the middle of it few years 
>>> ago, but then was distracted by some interesting benchmarking 
>>> results).
>> 
>> Sure.
>
> Hello,
> thank you for your interest. I'm still improving this idea, the patch
> and I'm very happy about the discussion we have. It really helps.
>
> The idea was to implement the workloads as close to YCSB as possible
> using pgbench.

Basically I'm against having something called YCSB if it is not YCSB;-)

> So the schema it should be applied to is the default schema generated by
> pgbench -i (pgbench_accounts).

This is a contradiction, because pgbench_accounts table is in no way, even 
remotely, conformant to the YCSB benchmark test table.

So for me there are three possibilities:

(1) do nothing, always an option as committers may be against extending 
pgbench in this direction anyway. Personally I'm fine with having it.

(2) implement YCSB cleanly, i.e. both initialization and operations, at 
least if this is "reasonable" (i.e. it does not result in 2000 lines of 
new code). ISTM that it can be done, given that the YCSB schema is very 
simple, hence I suggested "pgbench -i --schema ycsb" to trigger a non 
default initialization.

(3) if you are interested in demonstrating non uniform distribution on 
pgbench_accounts, I'm also fine with it, just do so, but do *NOT* call it 
YCSB.

Also it seems that the YCSB bench uses some hashing to mix keys and avoid 
having 1 as the most frequent, 2 as the second, and so on. There is a hash 
function in pgbench which can be used (although the solution is not 
perfect, some values cannot be reached), but it is used by YCSB. Otherwise 
I'm planning to submit a pseudo-random permutation function to ease this 
some day, provided that the size of the table stays constant.
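
The caveat about hashing can be illustrated with a short Python sketch: mapping ranks 1..n to keys through a hash taken modulo n scatters the hot keys, but it is not a permutation, so some keys collide and others are never produced. SHA-256 is used here only as a stand-in hash; it is not the hash function available in pgbench, and the function names are invented for the example.

```python
import hashlib


def scatter(rank, n):
    """Map a rank in [1, n] to a key in [1, n] through a hash, modulo n."""
    digest = hashlib.sha256(str(rank).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n + 1


def coverage(n):
    """Fraction of the keys in [1, n] that are hit by at least one rank."""
    reached = {scatter(rank, n) for rank in range(1, n + 1)}
    return len(reached) / n


if __name__ == "__main__":
    n = 100_000
    # The most frequent ranks no longer map to neighboring keys...
    print([scatter(rank, n) for rank in (1, 2, 3)])
    # ...but a sizable share of the keys is never reached.
    print(coverage(n))
```

A true pseudo-random permutation would reach every key exactly once, which is what the permutation function mentioned above is meant to provide.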

-- 
Fabien.

Re: pgbench-ycsb

From

Robert Haas

Date:

23 July 2018, 18:34:48

On Sun, Jul 22, 2018 at 4:42 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Basically I'm against having something called YCSB if it is not YCSB;-)

Yep, that seems pretty clear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company