Thread: Performance of ByteA: ascii vs binary

Performance of ByteA: ascii vs binary

From
Thomas Güttler
Date:
I did some benchmarking and in my setup there was major
performance difference.

I tested a ByteA column.

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

binary data is (roughly) five times slower than ascii data?

Is this a know fact, or is there something wrong with my benchmark?

I used Python and psycopg2.

Regards,
   Thomas Güttler


-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: Performance of ByteA: ascii vs binary

From
Adrian Klaver
Date:
On 3/18/19 7:33 AM, Thomas Güttler wrote:
> I did some benchmarking and in my setup there was major
> performance difference.
> 
> I tested a ByteA column.

What was the test?

> 
> If I used ascii data the tests took 52 seconds.
> If I used random binary data the test took 250 seconds.
> 
> binary data is (roughly) five times slower than ascii data?
> 
> Is this a know fact, or is there something wrong with my benchmark?
> 
> I used Python and psycopg2.
> 
> Regards,
>    Thomas Güttler
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Performance of ByteA: ascii vs binary

From
Tom Lane
Date:
Adrian Klaver <adrian.klaver@aklaver.com> writes:
> On 3/18/19 7:33 AM, Thomas Güttler wrote:
>> If I used ascii data the tests took 52 seconds.
>> If I used random binary data the test took 250 seconds.

This doesn't seem terribly surprising in bytea_output = escape
mode.  Probably with bytea_output = hex the performance would
be less data-dependent.

            regards, tom lane


minio server. Was: Performance of ByteA: ascii vs binary

From
Thomas Güttler
Date:

Am 18.03.19 um 15:49 schrieb Tom Lane:
> Adrian Klaver <adrian.klaver@aklaver.com> writes:
>> On 3/18/19 7:33 AM, Thomas Güttler wrote:
>>> If I used ascii data the tests took 52 seconds.
>>> If I used random binary data the test took 250 seconds.
> 
> This doesn't seem terribly surprising in bytea_output = escape
> mode.  Probably with bytea_output = hex the performance would
> be less data-dependent.

Thank you for your fast reply.

I guess we will use a minio server to store the blobs.




-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: minio server. Was: Performance of ByteA: ascii vs binary

From
Adrian Klaver
Date:
On 3/18/19 7:59 AM, Thomas Güttler wrote:
> 
> 
> Am 18.03.19 um 15:49 schrieb Tom Lane:
>> Adrian Klaver <adrian.klaver@aklaver.com> writes:
>>> On 3/18/19 7:33 AM, Thomas Güttler wrote:
>>>> If I used ascii data the tests took 52 seconds.
>>>> If I used random binary data the test took 250 seconds.
>>
>> This doesn't seem terribly surprising in bytea_output = escape
>> mode.  Probably with bytea_output = hex the performance would
>> be less data-dependent.
> 
> Thank you for your fast reply.
> 
> I guess we will use a minio server to store the blobs.

According to this:

http://initd.org/psycopg/docs/usage.html?highlight=bytea#adapt-binary

psycopg2 will use 'hex' by default.

Have you tried the test outside Python/psycopg2 e.g in psql to see if 
the performance hit still exists?

> 
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: minio server. Was: Performance of ByteA: ascii vs binary

From
Thomas Güttler
Date:

Am 18.03.19 um 17:52 schrieb Adrian Klaver:
> On 3/18/19 7:59 AM, Thomas Güttler wrote:
>>
>>
>> Am 18.03.19 um 15:49 schrieb Tom Lane:
>>> Adrian Klaver <adrian.klaver@aklaver.com> writes:
>>>> On 3/18/19 7:33 AM, Thomas Güttler wrote:
>>>>> If I used ascii data the tests took 52 seconds.
>>>>> If I used random binary data the test took 250 seconds.
>>>
>>> This doesn't seem terribly surprising in bytea_output = escape
>>> mode.  Probably with bytea_output = hex the performance would
>>> be less data-dependent.
>>
>> Thank you for your fast reply.
>>
>> I guess we will use a minio server to store the blobs.
> 
> According to this:
> 
> http://initd.org/psycopg/docs/usage.html?highlight=bytea#adapt-binary
> 
> psycopg2 will use 'hex' by default.
> 
> Have you tried the test outside Python/psycopg2 e.g in psql to see if the performance hit still exists?

No, I did not test it outside Python/psycopg2.

Since our code will use Python/psycopg2 I did not run a test outside.

The decision was made. Now I am busy learning minio. I use it for the first time.

Regards,
   Thomas Güttler



-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: Performance of ByteA: ascii vs binary

From
"Peter J. Holzer"
Date:
On 2019-03-18 15:33:17 +0100, Thomas Güttler wrote:
> I did some benchmarking and in my setup there was major
> performance difference.
>
> I tested a ByteA column.
>
> If I used ascii data the tests took 52 seconds.
> If I used random binary data the test took 250 seconds.
>
> binary data is (roughly) five times slower than ascii data?
>
> Is this a know fact, or is there something wrong with my benchmark?
>
> I used Python and psycopg2.

I don't see this here (Debian 9, Python 3,5, psycopg2 2.7.6, PostgreSQL
9.5).

I modified my bench-bytea script (https://github.com/hjp/blob-bench) to
restrict the byte values to printable ASCII (32 .. 126). There was
absolutely no difference, as the attached graph shows.

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Attachment

Re: Performance of ByteA: ascii vs binary

From
Thomas Güttler
Date:

Am 19.03.19 um 20:37 schrieb Peter J. Holzer:
> On 2019-03-18 15:33:17 +0100, Thomas Güttler wrote:
>> I did some benchmarking and in my setup there was major
>> performance difference.
>>
>> I tested a ByteA column.
>>
>> If I used ascii data the tests took 52 seconds.
>> If I used random binary data the test took 250 seconds.
>>
>> binary data is (roughly) five times slower than ascii data?
>>
>> Is this a know fact, or is there something wrong with my benchmark?
>>
>> I used Python and psycopg2.
> 
> I don't see this here (Debian 9, Python 3,5, psycopg2 2.7.6, PostgreSQL
> 9.5).
> 
> I modified my bench-bytea script (https://github.com/hjp/blob-bench) to
> restrict the byte values to printable ASCII (32 .. 126). There was
> absolutely no difference, as the attached graph shows.

Strange. I saw a big difference.
What did you test?
I tested inserts.

Regards,
   Thomas Güttler



-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: Performance of ByteA: ascii vs binary

From
"Peter J. Holzer"
Date:
On 2019-03-20 13:20:57 +0100, Thomas Güttler wrote:
>
>
> Am 19.03.19 um 20:37 schrieb Peter J. Holzer:
> > On 2019-03-18 15:33:17 +0100, Thomas Güttler wrote:
> > > I did some benchmarking and in my setup there was major
> > > performance difference.
> > >
> > > I tested a ByteA column.
> > >
> > > If I used ascii data the tests took 52 seconds.
> > > If I used random binary data the test took 250 seconds.
> > >
> > > binary data is (roughly) five times slower than ascii data?
> > >
> > > Is this a know fact, or is there something wrong with my benchmark?
> > >
> > > I used Python and psycopg2.
> >
> > I don't see this here (Debian 9, Python 3,5, psycopg2 2.7.6, PostgreSQL
> > 9.5).
> >
> > I modified my bench-bytea script (https://github.com/hjp/blob-bench) to
> > restrict the byte values to printable ASCII (32 .. 126). There was
> > absolutely no difference, as the attached graph shows.
>
> Strange. I saw a big difference.
> What did you test?
> I tested inserts.

The graph with the quantiles was for selects. For inserts I made only a
scatterplot and there didn't seem to be any difference either. I'll
check the quantiles, too, but I expect to see at most a small
difference. A large difference as the one you saw would show up clearly
in a scatterplot, too.

My test ist in github (see URL above). Do see the same difference with
my test on your system?

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Attachment

Re: Performance of ByteA: ascii vs binary

From
Tom Lane
Date:
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> On 2019-03-20 13:20:57 +0100, Thomas Güttler wrote:
>> Strange. I saw a big difference.
>> What did you test?
>> I tested inserts.

> The graph with the quantiles was for selects.

Hmm, so there are two different code paths being considered here
-- the OP is apparently talking about the speed of bytea_in,
while the rest of us have been thinking about bytea_out.

The question about whether the text representation is hex or
"escape" style still applies, though.

            regards, tom lane


Re: Performance of ByteA: ascii vs binary

From
Adrian Klaver
Date:
On 3/21/19 6:49 AM, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
>> On 2019-03-20 13:20:57 +0100, Thomas Güttler wrote:
>>> Strange. I saw a big difference.
>>> What did you test?
>>> I tested inserts.
> 
>> The graph with the quantiles was for selects.
> 
> Hmm, so there are two different code paths being considered here
> -- the OP is apparently talking about the speed of bytea_in,
> while the rest of us have been thinking about bytea_out.
> 
> The question about whether the text representation is hex or
> "escape" style still applies, though.

Yes, it would be nice to see the test code.

> 
>             regards, tom lane
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Script which shows performance of ByteA: ascii vs binary

From
Thomas Güttler
Date:
Thank you for asking several times for a benchmark.

I wrote it now and it is visible: inserting random bytes into bytea is much slower,
if you use the psycopg2 defaults.

Here is the chart:
   https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png

And here is the script which creates the chart:
   https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py

Please let me know if there is something wrong or could get improved.

Regards,
   Thomas Güttler


Am 18.03.19 um 15:33 schrieb Thomas Güttler:
> I did some benchmarking and in my setup there was major
> performance difference.
> 
> I tested a ByteA column.
> 
> If I used ascii data the tests took 52 seconds.
> If I used random binary data the test took 250 seconds.
> 
> binary data is (roughly) five times slower than ascii data?
> 
> Is this a know fact, or is there something wrong with my benchmark?
> 
> I used Python and psycopg2.
> 
> Regards,
>    Thomas Güttler
> 
> 

-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: Script which shows performance of ByteA: ascii vs binary

From
Francisco Olarte
Date:
Thomas:

On Fri, Mar 22, 2019 at 11:22 AM Thomas Güttler
<guettliml@thomas-guettler.de> wrote:
> Thank you for asking several times for a benchmark.
> I wrote it now and it is visible: inserting random bytes into bytea is much slower,
> if you use the psycopg2 defaults.
> Here is the chart:
>    https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png
> And here is the script which creates the chart:
>    https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py

I'm not too sure, but I read ( in the code ) you are measuring a
nearly not compressible urandom data againtst a highly compressible (
'x'*i ) data,
are you sure the difference is not due to data being compressed and
generating much less disk usage in toast-tables/wal?

Francisco Olarte.


Re: Script which shows performance of ByteA: ascii vs binary

From
Kenneth Marshall
Date:
On Fri, Mar 22, 2019 at 01:40:28PM +0100, Francisco Olarte wrote:
> Thomas:
> 
> On Fri, Mar 22, 2019 at 11:22 AM Thomas Güttler
> <guettliml@thomas-guettler.de> wrote:
> > Thank you for asking several times for a benchmark.
> > I wrote it now and it is visible: inserting random bytes into bytea is much slower,
> > if you use the psycopg2 defaults.
> > Here is the chart:
> >    https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png
> > And here is the script which creates the chart:
> >    https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py
> 
> I'm not too sure, but I read ( in the code ) you are measuring a
> nearly not compressible urandom data againtst a highly compressible (
> 'x'*i ) data,
> are you sure the difference is not due to data being compressed and
> generating much less disk usage in toast-tables/wal?
> 
> Francisco Olarte.
> 
+1

Regards,
Ken


Re: Script which shows performance of ByteA: ascii vs binary

From
Thomas Güttler
Date:

Am 22.03.19 um 13:40 schrieb Francisco Olarte:
> Thomas:
> 
> On Fri, Mar 22, 2019 at 11:22 AM Thomas Güttler
> <guettliml@thomas-guettler.de> wrote:
>> Thank you for asking several times for a benchmark.
>> I wrote it now and it is visible: inserting random bytes into bytea is much slower,
>> if you use the psycopg2 defaults.
>> Here is the chart:
>>     https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png
>> And here is the script which creates the chart:
>>     https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py
> 
> I'm not too sure, but I read ( in the code ) you are measuring a
> nearly not compressible urandom data againtst a highly compressible (
> 'x'*i ) data,
> are you sure the difference is not due to data being compressed and
> generating much less disk usage in toast-tables/wal?

+1

for this case toast-tables/wal is a detail of the implementation.
This tests does not care about the "why it takes longer". It just generates
a performance chart.

Yes, it does exactly what you say: it compares
nearly not compressible urandom data against a highly compressible data.

In my case, will get nearly random data (binary PDF, JPG, ...). And that's why
I wanted to benchmark it.

Regards,
   Thomas


-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: Script which shows performance of ByteA: ascii vs binary

From
Adrian Klaver
Date:
On 3/22/19 6:04 AM, Thomas Güttler wrote:
> 
> 
> Am 22.03.19 um 13:40 schrieb Francisco Olarte:
>> Thomas:
>>
>> On Fri, Mar 22, 2019 at 11:22 AM Thomas Güttler
>> <guettliml@thomas-guettler.de> wrote:
>>> Thank you for asking several times for a benchmark.
>>> I wrote it now and it is visible: inserting random bytes into bytea 
>>> is much slower,
>>> if you use the psycopg2 defaults.
>>> Here is the chart:
>>>     
>>> https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png 
>>>
>>> And here is the script which creates the chart:
>>>     
>>> https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py 
>>>
>>
>> I'm not too sure, but I read ( in the code ) you are measuring a
>> nearly not compressible urandom data againtst a highly compressible (
>> 'x'*i ) data,
>> are you sure the difference is not due to data being compressed and
>> generating much less disk usage in toast-tables/wal?
> 
> +1
> 
> for this case toast-tables/wal is a detail of the implementation.
> This tests does not care about the "why it takes longer". It just generates
> a performance chart.

TOAST is tunable, might want to take a look at:

https://www.postgresql.org/docs/11/storage-toast.html

> 
> Yes, it does exactly what you say: it compares
> nearly not compressible urandom data against a highly compressible data.
> 
> In my case, will get nearly random data (binary PDF, JPG, ...). And 
> that's why
> I wanted to benchmark it.
> 
> Regards,
>    Thomas
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Script which shows performance of ByteA: ascii vs binary

From
Francisco Olarte
Date:
Thomas:

On Fri, Mar 22, 2019 at 2:03 PM Thomas Güttler
<guettliml@thomas-guettler.de> wrote:
> > I'm not too sure, but I read ( in the code ) you are measuring a
> > nearly not compressible urandom data againtst a highly compressible (
...
> for this case toast-tables/wal is a detail of the implementation.
> This tests does not care about the "why it takes longer". It just generates
> a performance chart.
> Yes, it does exactly what you say: it compares
> nearly not compressible urandom data against a highly compressible data.
> In my case, will get nearly random data (binary PDF, JPG, ...). And that's why
> I wanted to benchmark it.

Well, if all you wanted is to benchmark, a performance chart, you know
it. I assumed you wanted to know more, like where the bottleneck is
and how to try to avoid it. My fault.

I was specifically confused because, IIRC, you said ascii data took
much longer than binary, which is a completely different test ( you
can test ascii vs binary by generating
chunks of random numbers in say, the 0 to 127 and sending that data
once as is and once more shifting left by one every byte. That should
test ascii-binary differences, but
if you test random vs uniform I think the problem is in the
randomness, you could just test sending '\x92'*i, or something
similar, but if you are just interested in the benchmark, you have it
done.

Just one thing, a single graph with two labels "ascii" and "random" is
misleading, as constant/random is orthogonal with ascii/binary, but as
I said, my fault, all the text about psycopg and other stuff led me to
think you wanted some kind of diagnosis and improvements.

Regards.


Re: Script which shows performance of ByteA: ascii vs binary

From
"Peter J. Holzer"
Date:
On 2019-03-22 13:40:28 +0100, Francisco Olarte wrote:
> On Fri, Mar 22, 2019 at 11:22 AM Thomas Güttler
> <guettliml@thomas-guettler.de> wrote:
> > Thank you for asking several times for a benchmark.
> > I wrote it now and it is visible: inserting random bytes into bytea
> > is much slower, if you use the psycopg2 defaults.
> > Here is the chart:
> >    https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png
> > And here is the script which creates the chart:
> >    https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py
>
> I'm not too sure, but I read ( in the code ) you are measuring a
> nearly not compressible urandom data againtst a highly compressible (
> 'x'*i ) data,

Yes, that seems to be the main difference. My "ascii" test creates
random data in the range [32, 126], which is not very compressible, and
I didn't see much of a difference to the full range (10th percentile and
median were the same, 90th percentile was noticeably better). If I
create "random" data in the range [120, 120], I also get a large speedup
(about 3.5 times). Interestingly the difference vanishes for large (> 10
MB) blobs.

> are you sure the difference is not due to data being compressed and
> generating much less disk usage in toast-tables/wal?

Yes, I think that's it: He is basically measuring how fast his CPU
can compress a stream of constant bytes. Not very meaningful.

Another difference I noticed between our benchmarks is that I used a
plain bytes object while he used a psycopg2.Binary object. Those might
be serialized differently, but since the speed difference is adequately
explained by the (lack of) randomness, I am not going to investigate
this.

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Attachment

Re: Performance of ByteA: ascii vs binary

From
Олег Самойлов
Date:
Just theoretically assumption. PostgreSQL sometimes may optimise internal format of data and can sometimes zip data.
ASCIIdata can be zipped better, then binary random data. Also PostgreSQL sometimes take decision to keep a column in an
externalfile, if the column is still too big after zip. I don’t know what exactly happens in your case, but here can be
areason. 

> 18 марта 2019 г., в 17:33, Thomas Güttler <guettliml@thomas-guettler.de> написал(а):
>
> I did some benchmarking and in my setup there was major
> performance difference.
>
> I tested a ByteA column.
>
> If I used ascii data the tests took 52 seconds.
> If I used random binary data the test took 250 seconds.
>
> binary data is (roughly) five times slower than ascii data?
>
> Is this a know fact, or is there something wrong with my benchmark?
>
> I used Python and psycopg2.
>
> Regards,
>  Thomas Güttler
>
>
> --
> Thomas Guettler http://www.thomas-guettler.de/
> I am looking for feedback: https://github.com/guettli/programming-guidelines
>