Thread: bytea performance tweak
Hi,

I've noticed that fetching a bytea field produces a lot of overhead when converting it into a byte[] or InputStream. Looking at the code, I've produced a small patch (a lookup table for the hex codes) which makes the conversion faster. What do you think?

Best regards,
Andreas
Attachment
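The patch itself is only available as the attachment. As a rough illustration of the idea, a 2D lookup table maps a pair of hex characters straight to the decoded byte. This is a hypothetical sketch, not the attached patch: it assumes the lowercase hex that PostgreSQL emits and omits validation and uppercase handling.

    // Hypothetical sketch of a 2D lookup-table decoder for the bytea hex
    // wire format ("\x48656c6c6f"); not the actual attached patch.
    public final class ByteaHexDecoder {
        // Maps a pair of ASCII hex characters directly to the decoded byte.
        private static final byte[][] HEX_LOOKUP = new byte[128][128];
        static {
            String hex = "0123456789abcdef";
            for (int hi = 0; hi < 16; hi++)
                for (int lo = 0; lo < 16; lo++)
                    HEX_LOOKUP[hex.charAt(hi)][hex.charAt(lo)] =
                            (byte) ((hi << 4) | lo);
        }

        public static byte[] decode(byte[] text) {
            byte[] out = new byte[(text.length - 2) / 2]; // skip leading "\x"
            for (int i = 0, p = 2; i < out.length; i++, p += 2)
                out[i] = HEX_LOOKUP[text[p]][text[p + 1]];
            return out;
        }
    }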
On 22 June 2011 01:08, Andreas Schmitz <schmitz@occamlabs.de> wrote:
> Hi,
>
> I've noticed that fetching a bytea field produces a lot of overhead when
> converting into a byte[] or InputStream. Looking at the code, I've
> produced a small patch (a lookup table for the hex codes) which makes
> the conversion faster. What do you think?

Surprising. How did you benchmark it?

Oliver
On 22/06/11 07:59, Oliver Jowett wrote:
> On 22 June 2011 01:08, Andreas Schmitz <schmitz@occamlabs.de> wrote:
>> Hi,
>>
>> I've noticed that fetching a bytea field produces a lot of overhead when
>> converting into a byte[] or InputStream. Looking at the code, I've
>> produced a small patch (a lookup table for the hex codes) which makes
>> the conversion faster. What do you think?
>
> Surprising.

It's not at _all_ surprising that an array index lookup is much faster than two function calls and a bunch of branches. It'll probably JIT better and be easier on cache, too.

If Oliver is lobbing around mostly bytea data and/or using big bytea fields I'm not surprised he's seeing this as at least somewhat of a hotspot. I've seen comments on bytea performance with JDBC here before, though nobody seems to have dug into it enough to track things down before.

Thanks Oliver!

--
Craig Ringer
On 22 June 2011 15:27, Craig Ringer <craig@postnewspapers.com.au> wrote:
> On 22/06/11 07:59, Oliver Jowett wrote:
>> On 22 June 2011 01:08, Andreas Schmitz <schmitz@occamlabs.de> wrote:
>>> Hi,
>>>
>>> I've noticed that fetching a bytea field produces a lot of overhead when
>>> converting into a byte[] or InputStream. Looking at the code, I've
>>> produced a small patch (a lookup table for the hex codes) which makes
>>> the conversion faster. What do you think?
>>
>> Surprising.
>
> It's not at _all_ surprising that an array index lookup is much faster
> than two function calls and a bunch of branches. It'll probably JIT
> better and be easier on cache, too.
>
> If Oliver is lobbing around mostly bytea data and/or using big bytea
> fields I'm not surprised he's seeing this as at least somewhat of a
> hotspot. I've seen comments on bytea performance with JDBC here before,
> though nobody seems to have dug into it enough to track things down before.

(The patch was Andreas' work, not mine)

I'm interested in the benchmark results because

a) I would have thought that a smaller 1D array plus bitshift/or would be about as fast as a 2D array and probably cache-friendlier.

b) the JIT should be inlining the function calls in the original version anyway, and I'm surprised that the branches are that expensive.

I've seen enough weird performance behavior from the JIT in the past to be inherently suspicious of anything that claims to be a speedup without numbers to back it up. If we have a benchmark to work from, then I'm sure there's even more performance we can squeeze out of it.

Oliver
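For comparison, a rough sketch of the 1D variant Oliver describes (hypothetical code, not the benchmarked implementation): one small nibble table, with each output byte assembled by a shift and an or.

    // Hypothetical sketch of the 1D lookup variant: one nibble table plus
    // bitshift/or, instead of a 2D table indexed by both characters.
    private static final int[] NIBBLE = new int[128];
    static {
        java.util.Arrays.fill(NIBBLE, -1); // -1 marks non-hex characters
        for (int c = '0'; c <= '9'; c++) NIBBLE[c] = c - '0';
        for (int c = 'a'; c <= 'f'; c++) NIBBLE[c] = c - 'a' + 10;
        for (int c = 'A'; c <= 'F'; c++) NIBBLE[c] = c - 'A' + 10;
    }

    static byte[] decode(byte[] text) {
        byte[] out = new byte[(text.length - 2) / 2]; // skip leading "\x"
        for (int i = 0, p = 2; i < out.length; i++, p += 2)
            out[i] = (byte) ((NIBBLE[text[p]] << 4) | NIBBLE[text[p + 1]]);
        return out;
    }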
On 22 June 2011 16:24, Oliver Jowett <oliver@opencloud.com> wrote:
> I've seen enough weird performance behavior from the JIT in the past
> to be inherently suspicious of anything that claims to be a speedup
> without numbers to back it up.
> If we have a benchmark to work from, then I'm sure there's even more
> performance we can squeeze out of it.

FWIW here's a quick microbenchmark. Andreas' patch is about 5x faster than the current code. A version using a one-dimensional lookup array is marginally faster again (7%).

$ java -server -version; uname -a; time java -server -classpath . Benchmark 5
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
Linux downpour 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

original code,    1000     1161632 iterations in 5018679 us    4.32 us/iteration
2D array lookup,  1000     6025844 iterations in 4948308 us    0.82 us/iteration
1D array lookup,  1000     6361396 iterations in 4996605 us    0.79 us/iteration
original code,    100000      9196 iterations in 5019373 us  545.82 us/iteration
2D array lookup,  100000     57049 iterations in 4999200 us   87.63 us/iteration
1D array lookup,  100000     61162 iterations in 5000197 us   81.75 us/iteration

real    1m55.117s
user    1m55.400s
sys     0m0.210s

(I also tried some variations with induction variables but got no real speedup and enough weird results to make me think it's confusing the JIT)

There is also a bug in the original patch that I noticed along the way (the initialization of HEX_LOOKUP doesn't use lo2/hi2) - fixed in the benchmark code.

Oliver
Attachment
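The attached Benchmark class isn't reproduced here. A minimal harness in the same spirit might look like this (hypothetical: a fixed ~5 s wall-clock budget per decoder, with iterations counted in batches so the clock reads stay off the hot path):

    // Hypothetical timing harness, not the attached benchmark: run a decoder
    // against a fixed input for ~5 seconds and report the mean cost per call.
    interface Decoder { byte[] decode(byte[] text); }

    static void bench(String label, byte[] input, Decoder d) {
        for (int i = 0; i < 10000; i++) d.decode(input);   // JIT warm-up
        final long budgetUs = 5000000L;
        long start = System.nanoTime();
        long iterations = 0, elapsedUs;
        do {
            for (int i = 0; i < 1000; i++) d.decode(input); // batch the work,
            iterations += 1000;                             // then read the clock
            elapsedUs = (System.nanoTime() - start) / 1000;
        } while (elapsedUs < budgetUs);
        System.out.printf("%s %d iterations in %d us  %.2f us/iteration%n",
                label, iterations, elapsedUs, (double) elapsedUs / iterations);
    }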
Oliver Jowett wrote:

Hi,

> On 22 June 2011 01:08, Andreas Schmitz <schmitz@occamlabs.de> wrote:
> > I've noticed that fetching a bytea field produces a lot of overhead when
> > converting into a byte[] or InputStream. Looking at the code, I've
> > produced a small patch (a lookup table for the hex codes) which makes
> > the conversion faster. What do you think?
>
> Surprising. How did you benchmark it?

I did some local tests using a main function with a static array. I've seen times like ~1900 ms for decoding using the unpatched version, and ~1600 ms for the patched version. I've also benchmarked my application (which makes heavy use of bytea), and have seen some speedup there as well (around 200-300 ms faster when decoding ~23000 bytea rows with ~30 MB of data in total).

I was wondering why the protocol uses such an encoding anyway. Wouldn't it be a lot faster just sending the bytes as they are?

Best regards,
Andreas
Attachment
Oliver Jowett wrote:

Hi,

> On 22 June 2011 16:24, Oliver Jowett <oliver@opencloud.com> wrote:
> > I've seen enough weird performance behavior from the JIT in the past
> > to be inherently suspicious of anything that claims to be a speedup
> > without numbers to back it up.
> > If we have a benchmark to work from, then I'm sure there's even more
> > performance we can squeeze out of it.
>
> FWIW here's a quick microbenchmark. Andreas' patch is about 5x faster
> than the current code. A version using a one-dimensional lookup array
> is marginally faster again (7%).
>
> [...]
>
> (I also tried some variations with induction variables but got no real
> speedup and enough weird results to make me think it's confusing the
> JIT)
>
> There is also a bug in the original patch that I noticed along the way
> (the initialization of HEX_LOOKUP doesn't use lo2/hi2) - fixed in the
> benchmark code.

wow, I wouldn't have expected that much of a speedup. I've also tested using a one-dimensional array, but my crappy main method did not see much of a difference. Thanks for benchmarking in detail!

Best regards,
Andreas
Attachment
On 23 June 2011 02:17, Andreas Schmitz <schmitz@occamlabs.de> wrote:
> I was wondering why the protocol uses such an encoding anyway. Wouldn't
> it be a lot faster just sending the bytes as they are?

We can request binary-format data, which for bytea is indeed just the raw bytes, but the problem is that it's all or nothing: if we request binary data we have to handle binary results for all datatypes, not just bytea (because the request is per-column and generally the driver doesn't know the column type at the point it makes the request).

Some progress has been made with that, but it's not production-ready yet, I think.

Oliver
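For context, the all-or-nothing behaviour comes from the v3 Bind message: the client sends a count of result-column format codes, and sending a single code applies it to every result column. A simplified sketch of just that part of the message (not the driver's actual serialization code):

    // How Bind declares result formats in the v3 protocol (simplified).
    // With one code applied to all columns, binary is all-or-nothing
    // unless the driver already knows each column's type.
    void writeResultFormats(java.io.DataOutputStream out, boolean binary)
            throws java.io.IOException {
        out.writeShort(1);              // number of format codes; 1 = applies to all
        out.writeShort(binary ? 1 : 0); // 0 = text, 1 = binary
    }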
Oliver Jowett wrote:

Hi,

> On 23 June 2011 02:17, Andreas Schmitz <schmitz@occamlabs.de> wrote:
> > I was wondering why the protocol uses such an encoding anyway. Wouldn't
> > it be a lot faster just sending the bytes as they are?
>
> We can request binary-format data, which for bytea is indeed just the
> raw bytes, but the problem is that it's all or nothing: if we request
> binary data we have to handle binary results for all datatypes, not
> just bytea (because the request is per-column and generally the driver
> doesn't know the column type at the point it makes the request)
> Some progress has been made with that, but it's not production-ready
> yet I think.

ok, understood. Is there a timeline for fully supporting the binary protocol?

Best regards,
Andreas
Attachment
* Andreas Schmitz:

> Is there a timeline for fully supporting the binary protocol?

A PostgreSQL installation can contain user-defined types with a custom binary format, so this will not work with all databases anyway.

A way to tell PostgreSQL to always send some times in binary format would solve this, though.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
On 27.06.11 16:57, Florian Weimer wrote:
> * Andreas Schmitz:
>
>> Is there a timeline for fully supporting the binary protocol?
>
> A PostgreSQL installation can contain user-defined types with a custom
> binary format, so this will not work with all databases anyway.

I'd say it would be up to the Java developer to provide their own decoder for such types (in a PGObject?). If they want, they can do a server-level cast to string in the query text. Of course, I'd leave an option to make a non-binary connection.

Best regards,
Vitalii Tymchyshyn
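For reference, the PGobject escape hatch mentioned above looks roughly like this; the 'complex' type and its "(re,im)" text form are hypothetical, the driver simply hands over the raw value and the application parses it.

    import java.sql.SQLException;
    import org.postgresql.util.PGobject;

    // Hypothetical handler for a user-defined 'complex' type whose text
    // form is "(1.0,2.0)".
    public class ComplexNumber extends PGobject {
        public double re, im;

        public ComplexNumber() {
            setType("complex");
        }

        public void setValue(String value) throws SQLException {
            super.setValue(value);
            // strip the parentheses, then split "re,im"
            String[] parts = value.substring(1, value.length() - 1).split(",");
            re = Double.parseDouble(parts[0]);
            im = Double.parseDouble(parts[1]);
        }
    }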
On Mon, 2011-06-27 at 13:57 +0000, Florian Weimer wrote:
> * Andreas Schmitz:
>
>> Is there a timeline for fully supporting the binary protocol?
>
> A PostgreSQL installation can contain user-defined types with a custom
> binary format, so this will not work with all databases anyway.

Isn't the same also true for the _text_ representation of user-defined types?

> A way to tell PostgreSQL to always send some times in binary format
> would solve this, though.

--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
>> Is there a timeline for fully supporting the binary protocol?
>
> A PostgreSQL installation can contain user-defined types with a custom
> binary format, so this will not work with all databases anyway.

Having to read unknown types in text mode is not any easier if you're going to do something interesting with them. In either case, the protocol provides a way to read the type payload off the wire (so, e.g., a column with a type you can't read won't "mess up" the whole result set), but it's up to you to process those bytes. As Vitalii stated, something like PGObject is probably the right way to handle this.

> A way to tell PostgreSQL to always send some [types] in binary format
> would solve this, though.

(I assume you meant "types" not "times")

Yep, that would be nice. There are per-*column* output facilities in the protocol, but not per-type. It's possible to ask the server for the columns in a query before you commit to output types, but that involves an extra round trip which won't be worth it most of the time (still, it could be worth it if you're doing something that outputs a lot of rows with far leaner binary representations for some data types). Perhaps per-statement encoding choice might be the right compromise, although that smells like a niche use case.

What's the current state of the binary patch? The wiki ( http://wiki.postgresql.org/wiki/JDBC-BinaryTransfer ) doesn't say. With a brief look at the list archives, the last activity I see is a completely new patch submitted by Radosław Smogura, but that discussion peters out. That discussion implies we want a solution to the "which-types-to-encode-how" problem, but realistically, if we want that, we won't have binary support until the protocol changes to allow per-type encoding selection. Would a polished patch with all-or-nothing connection-flag-level binary support be acceptable?

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com
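The describe-first round trip is visible even at the JDBC level: calling getMetaData() on an unexecuted PreparedStatement lets the driver describe the statement before execution, so column types could in principle inform the format choice. A hypothetical sketch (the table and column names are made up):

    import java.sql.*;

    // Illustrative only: column types are known after the describe step,
    // before any result-format decision would have to be made.
    static void describeFirst(Connection conn) throws SQLException {
        PreparedStatement ps =
            conn.prepareStatement("SELECT id, payload FROM blobs"); // hypothetical table
        ResultSetMetaData meta = ps.getMetaData(); // extra round trip
        for (int i = 1; i <= meta.getColumnCount(); i++)
            System.out.println(meta.getColumnName(i) + " : "
                    + meta.getColumnTypeName(i));
        ps.close();
    }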
I started with a different assumption - encode everything in binary, and eventually fix the server. As a result, one huge barrier was closed: binary encoding for the void type. Actually only aclitem is left, but it's less important, so some 9.x servers should allow implementing binary. The problems are with timestamp encoding; there you can't send "unknown", and I don't think it will be fixed in any way. My work with binary has gone further, as I added JDBC4 exceptions, support for struct types, changed how input is handled to allow downloading large datums, moved to Maven, and removed some Java code.

Regards,
Radek

-----Original Message-----
From: Maciek Sakrejda
Sent: 27 June 2011 18:51
To: Florian Weimer
Cc: Andreas Schmitz; Oliver Jowett; pgsql-jdbc@postgresql.org
Subject: Re: [JDBC] bytea performance tweak

[quoted message trimmed]
> I started with different assumption - encode everything in binary,
> eventually fix server.

I didn't realize the server lacked binary support for some types. I guess the driver can disclose this as a caveat emptor when documenting the binary flag. If you know your particular query workload will not use certain types, it's not really an issue. If you can't be sure, that's pretty much a show-stopper unless there's some kind of automatic fallback to text mode at the driver level, which sounds much too crazy to be useful.

> Problems are with timestamp encoding, here you can't send unknown, and
> I don't think it will be fixed in any way.

What issue are you talking about here? Is it related to this: http://archives.postgresql.org/pgsql-jdbc/2006-12/msg00037.php ? If so, we don't necessarily need to implement binary parameter input to support binary result output.

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com
Maciek Sakrejda wrote:

Hi,

> Having to read unknown types in text mode is not any easier if you're
> going to do something interesting with them. In either case, the
> protocol provides a way to read the type payload off the wire (so,
> e.g., a column with a type you can't read won't "mess up" the whole
> result set), but it's up to you to process those bytes. As Vitalii
> stated, something like PGObject is probably the right way to handle
> this.
>
> [...]
>
> What's the current state of the binary patch? The wiki (
> http://wiki.postgresql.org/wiki/JDBC-BinaryTransfer ) doesn't say.
> With a brief look at the list archives, the last activity I see is a
> completely new patch submitted by Radosław Smogura, but that
> discussion peters out. That discussion implies we want a solution to
> the "which-types-to-encode-how" problem, but realistically, if we want
> that, we won't have binary support until the protocol changes to allow
> per-type encoding selection. Would a polished patch with
> all-or-nothing connection-flag-level binary support be acceptable?

so it seems to me that switching to the binary protocol is probably going to be hard without server-side support. What about just applying my patch in the meantime, including Oliver's fix, of course?

Best regards,
Andreas
Attachment
Nope, this is a problem with the type information implied by encoding to binary format. Currently the driver sends timestamp and offset as text of type "unknown". However, in binary mode you may only send a Julian local date or a Julian UTC date; this means that when you call setTimestamp you need to know the target type of the field. Sending timestamps as text "unknown" resolves this, but may add problems like those in the post you referenced.

Regards,
Radek

-----Original Message-----
From: Maciek Sakrejda
Sent: 27 June 2011 21:33
To: Radoslaw Smogura
Cc: Florian Weimer; Andreas Schmitz; Oliver Jowett; pgsql-jdbc@postgresql.org
Subject: Re: [JDBC] bytea performance tweak

[quoted message trimmed]
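A sketch of the ambiguity described above (hypothetical table; wire-format details simplified):

    import java.sql.*;

    // Simplified illustration: the driver sees only a Timestamp value,
    // not the type of the target column.
    static void store(Connection conn, Timestamp ts) throws SQLException {
        PreparedStatement ps =
            conn.prepareStatement("INSERT INTO events(at) VALUES (?)"); // hypothetical table
        // Text mode: the driver can send the literal with type "unknown"
        // and let the server coerce it to whatever type column "at" has.
        // Binary mode: there is no "unknown"; the driver must commit to
        // timestamp (zone-less) or timestamptz (UTC-based) before it knows
        // the column's type, and a wrong guess can shift the stored value
        // by the session's time-zone offset.
        ps.setTimestamp(1, ts);
        ps.executeUpdate();
        ps.close();
    }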