Thread: Sample data generator for performance testing
On 1/2/24 23:23, arun chirappurath wrote:
> Hi All,
>
> Do we have any open source tools which can be used to create sample data
> at scale from our postgres databases?
> Which considers data distribution and randomness

Is this for all tables in the database or a subset?

Does it need to deal with foreign key relationships?

What are the sizes of the existing data and what size sample data do you
want to produce?

>
> Regards,
> Arun

--
Adrian Klaver
adrian.klaver@aklaver.com
On 1/3/24 09:24, arun chirappurath wrote:
> Hi Adrian,
>
> Thanks for your mail.
>
> Is this for all tables in the database or a subset? Yes
Yes all tables or yes just some tables?
All tables, except some which have user details.
>
> Does it need to deal with foreign key relationships? No
>
> What are the sizes of the existing data and what size sample data do you
> want to produce? 1GB and 1GB test data.
If the source data is 1GB and the test data is 1GB then there is no
sampling, you are using the data population in its entirety.
Yes. Would like to double the load and test.
>
> On Wed, 3 Jan, 2024, 22:40 Adrian Klaver, <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
> On 1/2/24 23:23, arun chirappurath wrote:
> > Hi All,
> >
> > Do we have any open source tools which can be used to create
> sample data
> > at scale from our postgres databases?
> > Which considers data distribution and randomness
>
>
>
> >
> > Regards,
> > Arun
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>
--
Adrian Klaver
adrian.klaver@aklaver.com
On 1/2/24 11:23 PM, arun chirappurath wrote:
> Do we have any open source tools which can be used to create sample data
> at scale from our postgres databases?
> Which considers data distribution and randomness
I would suggest using the most common tools whenever possible, because
then if you want to discuss results with other people (for example, on
these mailing lists) you're working with data sets that are widely and
well understood.
The most common tool for PostgreSQL is pgbench, which uses a TPC-B-like
schema that you can scale to any size. It always has the same [small]
number of tables/columns and the same uniform data distribution, and
there are relationships between tables so you can create FKs if needed.
My second favorite tool is sysbench. Any number of tables, easily scaled
to any size, a standardized schema with a small number of columns and no
relationships/FKs. Data distribution is uniformly random; however, on the
query side it supports a bunch of different distribution models, not just
uniform random, as well as queries that process ranges of rows.
The other tool that I'm intrigued by these days is benchbase from CMU.
It can do TPC-C and a bunch of other schemas/workloads, and you can scale
the data sizes. If you're just looking at data generation and you're
going to write your own workloads, benchbase has a lot of different
schemas available out of the box.
You can always hand-roll your schema and data with scripts and SQL, but
the more complex and bespoke your performance test schema is, the more
work and explaining it takes to get lots of people to engage in a
discussion, since they need to take time to understand how the test is
engineered. For very narrowly targeted reproductions this is usually the
right approach, with a very simple schema and workload, but it is not
common for general performance testing.
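To give a flavor of the hand-rolled approach, here is a minimal sketch;
the table, columns, and distributions are all invented for illustration:

-- Create a throwaway table and fill it with one million synthetic rows.
CREATE TABLE sample_events (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id integer,
    amount      numeric,
    created_at  timestamptz
);

INSERT INTO sample_events (customer_id, amount, created_at)
SELECT
    (random() * 10000)::int,                          -- uniform over 10k customers
    round((random() * random() * 1000)::numeric, 2),  -- skewed toward small amounts
    now() - (interval '365 days' * random())          -- spread over the last year
FROM generate_series(1, 1000000);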
-Jeremy
--
http://about.me/jeremy_schneider
On Wed, 3 Jan, 2024, 23:03 Adrian Klaver, <adrian.klaver@aklaver.com> wrote:

On 1/3/24 09:24, arun chirappurath wrote:
> Hi Adrian,
>
> Thanks for your mail.
>
> Is this for all tables in the database or a subset? Yes
Yes all tables or yes just some tables?
All tables, except some which have user details.
>
> Does it need to deal with foreign key relationships? No
>
> What are the sizes of the existing data and what size sample data do you
> want to produce? 1GB and 1GB test data.
If the source data is 1GB and the test data is 1GB then there is no
sampling, you are using the data population in its entirety.
Yes. Would like to double the load and test.
Does that mean you want to take the 1GB of your existing data and double it to 2GB while maintaining
the data distribution from the original data?
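If so, one crude way to do it (this is just a sketch; your_table and its
columns are placeholders, and any unique keys would need to be left out
or regenerated) is to re-insert a table's own rows:

-- Double the row count while keeping the same value distribution.
INSERT INTO your_table (col1, col2, col3)
SELECT col1, col2, col3
FROM your_table;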
Also do we have any standard methods for sampling and generating test data?
Something like?:
https://www.postgresql.org/docs/current/sql-select.html
"TABLESAMPLE
sampling_method
( argument
[, ...] ) [ REPEATABLE ( seed
) ]
A
TABLESAMPLE
clause after atable_name
indicates that the specifiedsampling_method
should be used to retrieve a subset of the rows in that table. This sampling precedes the application of any other filters such asWHERE
clauses. The standard PostgreSQL distribution includes two sampling methods,BERNOULLI
andSYSTEM
, and other sampling methods can be installed in the database via extensions- ...
- "
Read the rest of the documentation for TABLESAMPLE to get the details.
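For example, a repeatable ~10% Bernoulli sample of a (hypothetical) table
could be pulled like this:

-- Roughly 10% of the rows, chosen row by row, with a fixed seed so the
-- same sample comes back on every run.
SELECT *
FROM your_table TABLESAMPLE BERNOULLI (10) REPEATABLE (42);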
>
> On Wed, 3 Jan, 2024, 22:40 Adrian Klaver, <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>> wrote:
>
> On 1/2/24 23:23, arun chirappurath wrote:
> > Hi All,
> >
> > Do we have any open source tools which can be used to create
> sample data
> > at scale from our postgres databases?
> > Which considers data distribution and randomness
>
>
>
> >
> > Regards,
> > Arun
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>
--
Adrian Klaver
adrian.klaver@aklaver.com