Thread: Binary Cursors, and the COPY command
OK, I wrote a utility for 7.3 that takes the output of a SELECT in a binary cursor and creates a binary "COPY" file. The premise of the utility is to take the results of two or more selects from external databases and create a single unified table.

Here are the issues: in 7.3, COPY BINARY was machine specific and so was the output of a binary cursor, so everything just worked. In 7.4, COPY BINARY uses "network byte order," i.e. native data types are converted to big endian if necessary. The documentation for binary cursors does not specify whether the "binary" data is in native or network byte order.

I have a few issues with COPY BINARY using network byte order.

First, it is pointless. The problem it intends to solve, i.e. transferring data across different machine types, is already answered by the tried and true ASCII method.

Second, it actually makes the COPY functionality less usable. You cannot create the data outside of the database, because all the data type definitions and manipulation functions are inside the database. (Unless you only use simple data types, of course.)

Third, if a binary cursor does not encode the binary data in network byte order, a binary COPY can ONLY communicate between two PostgreSQL databases, because the information required to go from native ordering to network ordering is only in the database.

Lastly, the vast majority of machines in use today are Intel, meaning that they are little endian. Except in very rare circumstances, two machines that would normally be able to communicate in native byte order will ALWAYS have to convert data.

The only use case network byte order fixes is a binary COPY between different machine types, but in doing that, it forces anyone trying to add value to PostgreSQL or create a utility that uses COPY to reimplement all the data type handlers outside of the database, even if they never need to interpret or inspect the values, because they have to do this to put them in network byte order.

I would say that the history of the word "BINARY" tends more to indicate incompatible, machine-specific data. I would submit that the 7.4 format of the data, i.e. one int32 data size instead of an int16 followed by an optional int32, is cleaner, but I would remove the network byte order requirement and put a byte order int32 back in the header for 7.5.
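For concreteness, here is a minimal sketch of what the complaint is about: a client producing a 7.4-style binary COPY file for a single non-null int4 column, with every multi-byte value written in network byte order. This is not the original utility; Java is used only because it is the client language that comes up later in the thread (java.io.DataOutputStream always writes big-endian), and the file name, class name and the value 42 are purely illustrative. The layout follows the documented 7.4 format: the PGCOPY signature, an int32 flags word, an int32 header-extension length, then per tuple an int16 field count and an int32 length before each field's data.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class CopyBinarySketch {
        public static void main(String[] args) throws IOException {
            DataOutputStream out =
                new DataOutputStream(new FileOutputStream("one_int4.copy"));

            // 11-byte signature, then int32 flags, then int32 extension length.
            out.writeBytes("PGCOPY\n");
            out.write(new byte[] { (byte) 0xFF, '\r', '\n', 0 });
            out.writeInt(0);        // flags word (all zero: no OIDs)
            out.writeInt(0);        // no header extension

            // One tuple with a single non-null int4 field.
            out.writeShort(1);      // field count
            out.writeInt(4);        // field length in bytes
            out.writeInt(42);       // the value itself, written big-endian

            out.writeShort(-1);     // file trailer
            out.close();
        }
    }

Loading such a file back should just be a matter of COPY ... FROM ... BINARY into a one-column int4 table; the point is that every writeInt/writeShort call above is already big-endian, which is exactly the conversion burden the post objects to.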
pgsql@mohawksoft.com writes:
> Lastly, the vast majority of machines in use today are Intel, meaning that they are little endian. Except in very rare circumstances, two machines that would normally be able to communicate in native byte order will ALWAYS have to convert data.

Quite honestly, that is exactly one of the reasons for using network order. If we standardized on little-endian then 90% of programmers would ignore the need to do anything about byte ordering issues, and their code would be unportable to big-endian machines.

Putting a byte order flag into the header doesn't improve matters; it just forces everybody to deal with *both* orders, which is not simpler. (Except for those who would like not to be bothered with portability, which is a position I have zero sympathy for.)

If you don't want to deal with this, don't use binary data. It's as simple as that.

BTW, for your stated use case of combining binary output from separate databases, I should think you'd be happy about having a not-so-machine-dependent data format. It would still work if the databases were running on servers of different endianness.

regards, tom lane
pgsql@mohawksoft.com wrote:
> The only use case network byte order fixes is a binary COPY between different machine types, but in doing that, it forces anyone trying to add value to PostgreSQL or create a utility that uses COPY to reimplement all the data type handlers outside of the database, even if they never need to interpret or inspect the values, because they have to do this to put them in network byte order.

This is not true if you happen to be using Java on the client side, which has no idea (unless you grot around in the guts of the JVM) what the native byte order is. This actually means that Java clients have the opposite problem -- it's a lot of work to try to use the 7.3-style binary formats.

> I would say that the history of the word "BINARY" tends more to indicate incompatible, machine-specific data.

"Binary" implies "not plaintext" to me.

What about binary parameters in Bind or binary result sets from Execute? They follow the same format as binary COPY values. Are you suggesting those should be changed too?

-O
> pgsql@mohawksoft.com writes:
>> Lastly, the vast majority of machines in use today are Intel, meaning that they are little endian. Except in very rare circumstances, two machines that would normally be able to communicate in native byte order will ALWAYS have to convert data.
>
> Quite honestly, that is exactly one of the reasons for using network order. If we standardized on little-endian then 90% of programmers would ignore the need to do anything about byte ordering issues, and their code would be unportable to big-endian machines.

So what you are saying is that you should inconvenience 90% of your users to make sure they do something "right"? When you *really* think about it, by forcing a big endian byte order over a machine-dependent byte order, you are inconveniencing 90% of the users while only helping the very small number of people who run in mixed environments where the server is Intel and the client is big endian.

> Putting a byte order flag into the header doesn't improve matters; it just forces everybody to deal with *both* orders, which is not simpler. (Except for those who would like not to be bothered with portability, which is a position I have zero sympathy for.)

While I understand your lack of sympathy, I think you are too removed from the trenches. There is a viable argument in many projects for machine-dependent constructs. Sometimes there are real-world performance and delivery goals, and an absolutely stated objective that the project is not heterogeneous.

> If you don't want to deal with this, don't use binary data. It's as simple as that.

That's not a good answer either. Come on, what the hell good is having a "BINARY" format if the data is not usable? Using ASCII isn't good either, because you need a text parser for every possible interpretation of the various simple types.

> BTW, for your stated use case of combining binary output from separate databases, I should think you'd be happy about having a not-so-machine-dependent data format. It would still work if the databases were running on servers of different endianness.

Few deployments will *ever* really have different endianness among their servers. 99% of all deployments will be on the same or compatible hardware, and probably Intel at that.

OK, so you like the idea of binary being big endian; what about the idea of adding a keyword, mutually exclusive with "binary", called "native" or something like that?

DECLARE fubar NATIVE CURSOR ...
COPY fubar {FROM | TO} {'filename' | STD..} WITH [NATIVE | BINARY] ...
Oliver Jowett wrote:
> This is not true if you happen to be using Java on the client side, which has no idea (unless you grot around in the guts of the JVM) what the native byte order is.

The method java.nio.ByteOrder.nativeOrder() will tell you what the native byte order is.

> This actually means that Java clients have the opposite problem -- it's a lot of work to try to use the 7.3-style binary formats.

The commonly used java.io.DataInput will always use network order, but it's easy enough to read/write little endian using the java.nio and java.nio.channels packages.

Regards, Thomas Hallgren
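As a small illustration of both points (a sketch assuming a JDK 1.4 class library, since that is where java.nio appears; the class name and the four sample bytes are made up): ByteOrder.nativeOrder() reports the platform's order, while java.io.DataInput always decodes in network order no matter what the platform is.

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.nio.ByteOrder;

    public class OrderCheck {
        public static void main(String[] args) throws IOException {
            // What the JVM reports as the platform's native byte order.
            System.out.println("native order: " + ByteOrder.nativeOrder());

            // java.io.DataInput is always network (big-endian) order,
            // whatever the platform: these four bytes decode as 42.
            byte[] wire = { 0, 0, 0, 0x2A };
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(wire));
            System.out.println("value: " + in.readInt());
        }
    }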
Thomas Hallgren wrote:
> Oliver Jowett wrote:
> > This is not true if you happen to be using Java on the client side, which has no idea (unless you grot around in the guts of the JVM) what the native byte order is.
>
> The method java.nio.ByteOrder.nativeOrder() will tell you what the native byte order is.

NIO is not present before JDK 1.4. The JDBC driver, at least, needs to support earlier JVMs.

> > This actually means that Java clients have the opposite problem -- it's a lot of work to try to use the 7.3-style binary formats.
>
> The commonly used java.io.DataInput will always use network order, but it's easy enough to read/write little endian using the java.nio and java.nio.channels packages.

That's no use if you don't know the endianness of the data you're receiving (which is what happens under <= 7.3 -- the data followed the server's byte ordering).

The problem with using native byte orderings is not the byte ordering itself, but that the order is unpredictable -- at best, you have to implement code to handle both orders, and at worst you have to just take a guess and hope you were right.

-O
"Oliver Jowett" <oliver@opencloud.com> wrote in message news:4105FF43.40508@opencloud.com... > NIO is not present before JDK 1.4. The JDBC driver, at least, needs to > support earlier JVMs. > Clients only capable of network order (such as a Java 1.3 based JDBC driver) must of course be supported still. No argument there. My objection was to your general statement that "Java has no idea what the native byte order is". Another more philosophical question (more suitable on the jdbc list) is when the Java 1.3 support should be limited (or perhaps discontinued altogether) so that further development can exploit everything that 1.4 provides. After all, it's been around for more than 2 years now. AFAIK, the early bugs forcing you to cling on to the 1.3 have been fixed a long time ago. Doesn't the current 3.0 driver make use of features from the Java 1.4 version of java.sql already? > The problem with using native byte orderings is not the byte ordering > itself, but that the order is unpredictable -- at best, you have to > implement code to handle both orders, and at worst you have to just take > a guess and hope you were right.. > Sure, but those problems are present regardless of implementation language. Regards, Thomas Hallgren
Thomas Hallgren wrote:
> "Oliver Jowett" <oliver@opencloud.com> wrote in message news:4105FF43.40508@opencloud.com...
>
>> NIO is not present before JDK 1.4. The JDBC driver, at least, needs to support earlier JVMs.
>
> Clients only capable of network order (such as a Java 1.3 based JDBC driver) must of course still be supported. No argument there. My objection was to your general statement that "Java has no idea what the native byte order is".

I suppose so. I'd point out that the NIO byte order info is really just an optimization hint -- the rest of the NIO API is byte-order agnostic, regardless of what the native byte order is.

>> The problem with using native byte orderings is not the byte ordering itself, but that the order is unpredictable -- at best, you have to implement code to handle both orders, and at worst you have to just take a guess and hope you were right.
>
> Sure, but those problems are present regardless of implementation language.

That was my real argument.

-O
Oliver Jowett wrote:
> I suppose so. I'd point out that the NIO byte order info is really just an optimization hint -- the rest of the NIO API is byte-order agnostic, regardless of what the native byte order is.

The rest of the NIO API is not agnostic. A java.nio.ByteBuffer is big endian by default and can be changed using the order method. This is the mechanism you'd use to implement I/O that can handle either endian type.

Regards, Thomas Hallgren
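Concretely, something along these lines (again only a sketch, with a made-up class name and sample bytes) shows both the default and the switch: the same buffer decodes its bytes as big endian until order() is called.

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class BufferOrder {
        public static void main(String[] args) {
            byte[] wire = { 0, 0, 0, 0x2A };   // 42 when read big-endian

            // A freshly wrapped ByteBuffer decodes big endian by default...
            ByteBuffer buf = ByteBuffer.wrap(wire);
            System.out.println(buf.order() + ": " + buf.getInt());   // BIG_ENDIAN: 42

            // ...and order() is the switch you flip to consume
            // little-endian (e.g. machine-dependent 7.3-style) data.
            buf.rewind();
            buf.order(ByteOrder.LITTLE_ENDIAN);
            System.out.println(buf.order() + ": " + buf.getInt());   // LITTLE_ENDIAN: 704643072
        }
    }

Of course, as Oliver points out, you still have to know which order the producer used before you flip that switch.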
On Mon, Jul 26, 2004 at 10:06:28PM -0400, pgsql@mohawksoft.com wrote:
> So what you are saying is that you should inconvenience 90% of your users to make sure they do something "right"?

I would say that was pretty solid reasoning. Exposing 10% of users to a high data corruption risk just to get slightly better CPU performance on I/O-bound operations (think Amdahl's Law!) does not sound all that sensible to me.

Remember what happened to NT on the Alpha? Device driver writers failed to use the portability macros for byte or 16-bit memory-mapped hardware register accesses, exactly because it also worked without them on x86. This was one of the main reasons why they had to add 8-bit and 16-bit loads and stores to the architecture! Similarly, lots of programs harbor annoying, hard-to-find endianness bugs because little-endian byte order happens to hide some typical pointer bugs.

> When you *really* think about it, by forcing a big endian byte order over a machine-dependent byte order, you are inconveniencing 90% of the users while only helping the very small number of people who run in mixed environments where the server is Intel and the client is big endian.

I don't see this holding in situations where the user, the programmer, and the systems manager are different (groups of) people. Users may want to connect to different servers. Systems managers may want to replace servers. In today's world even a non-programming user might recompile your application on a big-endian machine. Applications may be running under emulation, and people will prefer to run the database server natively. All those scenarios may break the client-side application. The chance that the mixed-endian scenario had never been tested would be close to those 90%.

> Few deployments will *ever* really have different endianness among their servers. 99% of all deployments will be on the same or compatible hardware, and probably Intel at that.

I'd like to add 2 points here:

1. Server architectures and client architectures are two very different things. There are lots of SPARC, PA-RISC, MIPS, POWER, zSeries etc. servers out there.

2. Even if both platforms are AMD or AMD-clones (we'll have to get used to saying that as 64-bit becomes more popular), the number of situations where this is _guaranteed_ as part of the project definition will be much lower.

Remember, the programmer should generally support both the little-endian and the big-endian scenario anyway. We might as well make sure that the hardest one is also the most widely tested.

Jeroen