Thread: bytea performance tweak
Hi,

I've noticed that fetching a bytea field produces a lot of overhead when converting it into a byte[] or InputStream. Looking at the code, I've produced a small patch (a lookup table for the hex codes) which makes the conversion faster. What do you think?

Best regards,
Andreas
Attachment
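The patch itself is only available as the attachment. As a rough illustration of the idea, a 2D lookup table maps a pair of hex characters straight to the decoded byte. This is a hypothetical sketch, not the attached patch: it assumes the lowercase hex that PostgreSQL emits and omits validation and uppercase handling.

    // Hypothetical sketch of a 2D lookup-table decoder for the bytea hex
    // wire format ("\x48656c6c6f"); not the actual attached patch.
    public final class ByteaHexDecoder {
        // Maps a pair of ASCII hex characters directly to the decoded byte.
        private static final byte[][] HEX_LOOKUP = new byte[128][128];
        static {
            String hex = "0123456789abcdef";
            for (int hi = 0; hi < 16; hi++)
                for (int lo = 0; lo < 16; lo++)
                    HEX_LOOKUP[hex.charAt(hi)][hex.charAt(lo)] =
                            (byte) ((hi << 4) | lo);
        }

        public static byte[] decode(byte[] text) {
            byte[] out = new byte[(text.length - 2) / 2]; // skip leading "\x"
            for (int i = 0, p = 2; i < out.length; i++, p += 2)
                out[i] = HEX_LOOKUP[text[p]][text[p + 1]];
            return out;
        }
    }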
On 22 June 2011 01:08, Andreas Schmitz <schmitz@occamlabs.de> wrote:
> Hi,
>
> I've noticed that fetching a bytea field produces a lot of overhead when
> converting into a byte[] or InputStream. Looking at the code, I've
> produced a small patch (a lookup table for the hex codes) which makes
> the conversion faster. What do you think?

Surprising. How did you benchmark it?

Oliver
On 22/06/11 07:59, Oliver Jowett wrote:
> On 22 June 2011 01:08, Andreas Schmitz <schmitz@occamlabs.de> wrote:
>> Hi,
>>
>> I've noticed that fetching a bytea field produces a lot of overhead when
>> converting into a byte[] or InputStream. Looking at the code, I've
>> produced a small patch (a lookup table for the hex codes) which makes
>> the conversion faster. What do you think?
>
> Surprising.

It's not at _all_ surprising that an array index lookup is much faster than two function calls and a bunch of branches. It'll probably JIT better and be easier on cache, too.

If Oliver is lobbing around mostly bytea data and/or using big bytea fields I'm not surprised he's seeing this as at least somewhat of a hotspot. I've seen comments on bytea performance with JDBC here before, though nobody seems to have dug into it enough to track things down before.

Thanks Oliver!

--
Craig Ringer
On 22 June 2011 15:27, Craig Ringer <craig@postnewspapers.com.au> wrote:
> On 22/06/11 07:59, Oliver Jowett wrote:
>> On 22 June 2011 01:08, Andreas Schmitz <schmitz@occamlabs.de> wrote:
>>> Hi,
>>>
>>> I've noticed that fetching a bytea field produces a lot of overhead when
>>> converting into a byte[] or InputStream. Looking at the code, I've
>>> produced a small patch (a lookup table for the hex codes) which makes
>>> the conversion faster. What do you think?
>>
>> Surprising.
>
> It's not at _all_ surprising that an array index lookup is much faster
> than two function calls and a bunch of branches. It'll probably JIT
> better and be easier on cache, too.
>
> If Oliver is lobbing around mostly bytea data and/or using big bytea
> fields I'm not surprised he's seeing this as at least somewhat of a
> hotspot. I've seen comments on bytea performance with JDBC here before,
> though nobody seems to have dug into it enough to track things down before.

(The patch was Andreas' work, not mine)

I'm interested in the benchmark results because

a) I would have thought that a smaller 1D array plus bitshift/or would be about as fast as a 2D array and probably cache-friendlier.

b) the JIT should be inlining the function calls in the original version anyway, and I'm surprised that the branches are that expensive.

I've seen enough weird performance behavior from the JIT in the past to be inherently suspicious of anything that claims to be a speedup without numbers to back it up. If we have a benchmark to work from, then I'm sure there's even more performance we can squeeze out of it.

Oliver
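For comparison, a rough sketch of the 1D variant Oliver describes (hypothetical code, not the benchmarked implementation): one small nibble table, with each output byte assembled by a shift and an or.

    // Hypothetical sketch of the 1D lookup variant: one nibble table plus
    // bitshift/or, instead of a 2D table indexed by both characters.
    private static final int[] NIBBLE = new int[128];
    static {
        java.util.Arrays.fill(NIBBLE, -1); // -1 marks non-hex characters
        for (int c = '0'; c <= '9'; c++) NIBBLE[c] = c - '0';
        for (int c = 'a'; c <= 'f'; c++) NIBBLE[c] = c - 'a' + 10;
        for (int c = 'A'; c <= 'F'; c++) NIBBLE[c] = c - 'A' + 10;
    }

    static byte[] decode(byte[] text) {
        byte[] out = new byte[(text.length - 2) / 2]; // skip leading "\x"
        for (int i = 0, p = 2; i < out.length; i++, p += 2)
            out[i] = (byte) ((NIBBLE[text[p]] << 4) | NIBBLE[text[p + 1]]);
        return out;
    }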
On 22 June 2011 16:24, Oliver Jowett <oliver@opencloud.com> wrote:
> I've seen enough weird performance behavior from the JIT in the past
> to be inherently suspicious of anything that claims to be a speedup
> without numbers to back it up.
> If we have a benchmark to work from, then I'm sure there's even more
> performance we can squeeze out of it.

FWIW here's a quick microbenchmark. Andreas' patch is about 5x faster than the current code. A version using a one-dimensional lookup array is marginally faster again (7%).

$ java -server -version; uname -a; time java -server -classpath . Benchmark 5
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
Linux downpour 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

original code,    1000     1161632 iterations in 5018679 us    4.32 us/iteration
2D array lookup,  1000     6025844 iterations in 4948308 us    0.82 us/iteration
1D array lookup,  1000     6361396 iterations in 4996605 us    0.79 us/iteration
original code,    100000      9196 iterations in 5019373 us  545.82 us/iteration
2D array lookup,  100000     57049 iterations in 4999200 us   87.63 us/iteration
1D array lookup,  100000     61162 iterations in 5000197 us   81.75 us/iteration

real    1m55.117s
user    1m55.400s
sys     0m0.210s

(I also tried some variations with induction variables but got no real speedup and enough weird results to make me think it's confusing the JIT)

There is also a bug in the original patch that I noticed along the way (the initialization of HEX_LOOKUP doesn't use lo2/hi2) - fixed in the benchmark code.

Oliver
Attachment
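The attached Benchmark class isn't reproduced here. A minimal harness in the same spirit might look like this (hypothetical: a fixed ~5 s wall-clock budget per decoder, with iterations counted in batches so the clock reads stay off the hot path):

    // Hypothetical timing harness, not the attached benchmark: run a decoder
    // against a fixed input for ~5 seconds and report the mean cost per call.
    interface Decoder { byte[] decode(byte[] text); }

    static void bench(String label, byte[] input, Decoder d) {
        for (int i = 0; i < 10000; i++) d.decode(input);   // JIT warm-up
        final long budgetUs = 5000000L;
        long start = System.nanoTime();
        long iterations = 0, elapsedUs;
        do {
            for (int i = 0; i < 1000; i++) d.decode(input); // batch the work,
            iterations += 1000;                             // then read the clock
            elapsedUs = (System.nanoTime() - start) / 1000;
        } while (elapsedUs < budgetUs);
        System.out.printf("%s %d iterations in %d us  %.2f us/iteration%n",
                label, iterations, elapsedUs, (double) elapsedUs / iterations);
    }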
Oliver Jowett wrote:

Hi,

> On 22 June 2011 01:08, Andreas Schmitz <schmitz@occamlabs.de> wrote:
> > I've noticed that fetching a bytea field produces a lot of overhead when
> > converting into a byte[] or InputStream. Looking at the code, I've
> > produced a small patch (a lookup table for the hex codes) which makes
> > the conversion faster. What do you think?
>
> Surprising. How did you benchmark it?

I did some local tests using a main function with a static array. I've seen times like ~1900 ms for decoding using the unpatched version, and ~1600 ms for the patched version. I've also benchmarked my application (which makes heavy use of bytea), and have seen some speedup there as well (around 200-300 ms faster when decoding ~23000 bytea rows with ~30 MB of data in total).

I was wondering why the protocol uses such an encoding anyway. Wouldn't it be a lot faster just sending the bytes as they are?

Best regards,
Andreas
Attachment
Oliver Jowett wrote:

Hi,

> On 22 June 2011 16:24, Oliver Jowett <oliver@opencloud.com> wrote:
> > I've seen enough weird performance behavior from the JIT in the past
> > to be inherently suspicious of anything that claims to be a speedup
> > without numbers to back it up.
> > If we have a benchmark to work from, then I'm sure there's even more
> > performance we can squeeze out of it.
>
> FWIW here's a quick microbenchmark. Andreas' patch is about 5x faster
> than the current code. A version using a one-dimensional lookup array
> is marginally faster again (7%).
>
> [...]
>
> (I also tried some variations with induction variables but got no real
> speedup and enough weird results to make me think it's confusing the
> JIT)
>
> There is also a bug in the original patch that I noticed along the way
> (the initialization of HEX_LOOKUP doesn't use lo2/hi2) - fixed in the
> benchmark code.

wow, I wouldn't have expected that much of a speedup. I've also tested using a one-dimensional array, but my crappy main method did not see much of a difference. Thanks for benchmarking in detail!

Best regards,
Andreas
Attachment
On 23 June 2011 02:17, Andreas Schmitz <schmitz@occamlabs.de> wrote:
> I was wondering why the protocol uses such an encoding anyway. Wouldn't
> it be a lot faster just sending the bytes as they are?

We can request binary-format data, which for bytea is indeed just the raw bytes, but the problem is that it's all or nothing: if we request binary data we have to handle binary results for all datatypes, not just bytea (because the request is per-column and generally the driver doesn't know the column type at the point it makes the request).

Some progress has been made with that, but it's not production-ready yet, I think.

Oliver
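For context, the all-or-nothing behaviour comes from the v3 Bind message: the client sends a count of result-column format codes, and sending a single code applies it to every result column. A simplified sketch of just that part of the message (not the driver's actual serialization code):

    // How Bind declares result formats in the v3 protocol (simplified).
    // With one code applied to all columns, binary is all-or-nothing
    // unless the driver already knows each column's type.
    void writeResultFormats(java.io.DataOutputStream out, boolean binary)
            throws java.io.IOException {
        out.writeShort(1);              // number of format codes; 1 = applies to all
        out.writeShort(binary ? 1 : 0); // 0 = text, 1 = binary
    }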
Oliver Jowett wrote:

Hi,

> On 23 June 2011 02:17, Andreas Schmitz <schmitz@occamlabs.de> wrote:
> > I was wondering why the protocol uses such an encoding anyway. Wouldn't
> > it be a lot faster just sending the bytes as they are?
>
> We can request binary-format data, which for bytea is indeed just the
> raw bytes, but the problem is that it's all or nothing: if we request
> binary data we have to handle binary results for all datatypes, not
> just bytea (because the request is per-column and generally the driver
> doesn't know the column type at the point it makes the request)
> Some progress has been made with that, but it's not production-ready
> yet I think.

ok, understood. Is there a timeline for fully supporting the binary protocol?

Best regards,
Andreas
Attachment
* Andreas Schmitz:

> Is there a timeline for fully supporting the binary protocol?

A PostgreSQL installation can contain user-defined types with a custom binary format, so this will not work with all databases anyway.

A way to tell PostgreSQL to always send some times in binary format would solve this, though.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
On 27.06.11 16:57, Florian Weimer wrote:
> * Andreas Schmitz:
>
>> Is there a timeline for fully supporting the binary protocol?
>
> A PostgreSQL installation can contain user-defined types with a custom
> binary format, so this will not work with all databases anyway.

I'd say it would be up to the Java developer to provide their own decoder for such types (in a PGObject?). If they want, they can do a server-level cast to string in the query text. Of course, I'd leave an option to make a non-binary connection.

Best regards,
Vitalii Tymchyshyn
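For reference, the PGobject escape hatch mentioned above looks roughly like this; the 'complex' type and its "(re,im)" text form are hypothetical, the driver simply hands over the raw value and the application parses it.

    import java.sql.SQLException;
    import org.postgresql.util.PGobject;

    // Hypothetical handler for a user-defined 'complex' type whose text
    // form is "(1.0,2.0)".
    public class ComplexNumber extends PGobject {
        public double re, im;

        public ComplexNumber() {
            setType("complex");
        }

        public void setValue(String value) throws SQLException {
            super.setValue(value);
            // strip the parentheses, then split "re,im"
            String[] parts = value.substring(1, value.length() - 1).split(",");
            re = Double.parseDouble(parts[0]);
            im = Double.parseDouble(parts[1]);
        }
    }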
On Mon, 2011-06-27 at 13:57 +0000, Florian Weimer wrote:
> * Andreas Schmitz:
>
>> Is there a timeline for fully supporting the binary protocol?
>
> A PostgreSQL installation can contain user-defined types with a custom
> binary format, so this will not work with all databases anyway.

Isn't the same also true for the _text_ representation of user-defined types?

> A way to tell PostgreSQL to always send some times in binary format
> would solve this, though.

--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/
>> Is there a timeline for fully supporting the binary protocol?
>
> A PostgreSQL installation can contain user-defined types with a custom
> binary format, so this will not work with all databases anyway.

Having to read unknown types in text mode is not any easier if you're going to do something interesting with them. In either case, the protocol provides a way to read the type payload off the wire (so, e.g., a column with a type you can't read won't "mess up" the whole result set), but it's up to you to process those bytes. As Vitalii stated, something like PGObject is probably the right way to handle this.

> A way to tell PostgreSQL to always send some [types] in binary format
> would solve this, though.

(I assume you meant "types" not "times")

Yep, that would be nice. There are per-*column* output facilities in the protocol, but not per-type. It's possible to ask the server for the columns in a query before you commit to output types, but that involves an extra round trip which won't be worth it most of the time (still, it could be worth it if you're doing something that outputs a lot of rows with far leaner binary representations for some data types). Perhaps per-statement encoding choice might be the right compromise, although that smells like a niche use case.

What's the current state of the binary patch? The wiki ( http://wiki.postgresql.org/wiki/JDBC-BinaryTransfer ) doesn't say. With a brief look at the list archives, the last activity I see is a completely new patch submitted by Radosław Smogura, but that discussion peters out. That discussion implies we want a solution to the "which-types-to-encode-how" problem, but realistically, if we want that, we won't have binary support until the protocol changes to allow per-type encoding selection. Would a polished patch with all-or-nothing connection-flag-level binary support be acceptable?

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com
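The describe-first round trip is visible even at the JDBC level: calling getMetaData() on an unexecuted PreparedStatement lets the driver describe the statement before execution, so column types could in principle inform the format choice. A hypothetical sketch (the table and column names are made up):

    import java.sql.*;

    // Illustrative only: column types are known after the describe step,
    // before any result-format decision would have to be made.
    static void describeFirst(Connection conn) throws SQLException {
        PreparedStatement ps =
            conn.prepareStatement("SELECT id, payload FROM blobs"); // hypothetical table
        ResultSetMetaData meta = ps.getMetaData(); // extra round trip
        for (int i = 1; i <= meta.getColumnCount(); i++)
            System.out.println(meta.getColumnName(i) + " : "
                    + meta.getColumnTypeName(i));
        ps.close();
    }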
I started with a different assumption - encode everything in binary, and eventually fix the server. As a result, one huge barrier was closed: binary encoding for the void type. Actually only aclitem is left, but it's less important, so some 9.x servers should allow implementing binary. The problems are with timestamp encoding; there you can't send "unknown", and I don't think it will be fixed in any way. My work with binary has gone further, as I added JDBC4 exceptions, support for struct types, changed how input is handled to allow downloading large datums, moved to Maven, and removed some Java code.

Regards,
Radek

-----Original Message-----
From: Maciek Sakrejda
Sent: 27 June 2011 18:51
To: Florian Weimer
Cc: Andreas Schmitz; Oliver Jowett; pgsql-jdbc@postgresql.org
Subject: Re: [JDBC] bytea performance tweak

[quoted message trimmed]
> I started with different assumption - encode everything in binary,
> eventually fix server.

I didn't realize the server lacked binary support for some types. I guess the driver can disclose this as a caveat emptor when documenting the binary flag. If you know your particular query workload will not use certain types, it's not really an issue. If you can't be sure, that's pretty much a show-stopper unless there's some kind of automatic fallback to text mode at the driver level, which sounds much too crazy to be useful.

> Problems are with timestamp encoding, here you can't send unknown, and
> I don't think it will be fixed in any way.

What issue are you talking about here? Is it related to this: http://archives.postgresql.org/pgsql-jdbc/2006-12/msg00037.php ? If so, we don't necessarily need to implement binary parameter input to support binary result output.

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com
Maciek Sakrejda wrote:

Hi,

> Having to read unknown types in text mode is not any easier if you're
> going to do something interesting with them. In either case, the
> protocol provides a way to read the type payload off the wire (so,
> e.g., a column with a type you can't read won't "mess up" the whole
> result set), but it's up to you to process those bytes. As Vitalii
> stated, something like PGObject is probably the right way to handle
> this.
>
> [...]
>
> What's the current state of the binary patch? The wiki (
> http://wiki.postgresql.org/wiki/JDBC-BinaryTransfer ) doesn't say.
> With a brief look at the list archives, the last activity I see is a
> completely new patch submitted by Radosław Smogura, but that
> discussion peters out. That discussion implies we want a solution to
> the "which-types-to-encode-how" problem, but realistically, if we want
> that, we won't have binary support until the protocol changes to allow
> per-type encoding selection. Would a polished patch with
> all-or-nothing connection-flag-level binary support be acceptable?

so it seems to me that switching to the binary protocol is probably going to be hard without server-side support. What about just applying my patch in the meantime, including Oliver's fix, of course?

Best regards,
Andreas
Attachment
Nope, this is a problem with the type information implied by encoding to binary format. Currently the driver sends timestamp and offset as text of type "unknown". However, in binary mode you may only send a Julian local date or a Julian UTC date; this means that when you call setTimestamp you need to know the target type of the field. Sending timestamps as text "unknown" resolves this, but may add problems like those in the post you referenced.

Regards,
Radek

-----Original Message-----
From: Maciek Sakrejda
Sent: 27 June 2011 21:33
To: Radoslaw Smogura
Cc: Florian Weimer; Andreas Schmitz; Oliver Jowett; pgsql-jdbc@postgresql.org
Subject: Re: [JDBC] bytea performance tweak

[quoted message trimmed]
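A sketch of the ambiguity described above (hypothetical table; wire-format details simplified):

    import java.sql.*;

    // Simplified illustration: the driver sees only a Timestamp value,
    // not the type of the target column.
    static void store(Connection conn, Timestamp ts) throws SQLException {
        PreparedStatement ps =
            conn.prepareStatement("INSERT INTO events(at) VALUES (?)"); // hypothetical table
        // Text mode: the driver can send the literal with type "unknown"
        // and let the server coerce it to whatever type column "at" has.
        // Binary mode: there is no "unknown"; the driver must commit to
        // timestamp (zone-less) or timestamptz (UTC-based) before it knows
        // the column's type, and a wrong guess can shift the stored value
        // by the session's time-zone offset.
        ps.setTimestamp(1, ts);
        ps.executeUpdate();
        ps.close();
    }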