Thread: Allowing postgresql to accept 0xff syntax for data types that it makes sense for?

I'm working on a project converting a bunch of code from another database
system to PostgreSQL.

One of the issues is that the previous system accepted integers and binary
data in the same hex format as C ... i.e. 0xff

I understand that the proper way to handle this in postgres is x'ff', but
the problem is that a large number of #define values use the 0xff syntax
and these macros are used both in C code, as well as in SQL. The simple
fact is that a LOT of code does this, and correcting it all and ensuring
that the modified code is correct is a BIG job. Just to clarify some of
the complexity: there is about 80,000 lines of PL/PGSQL code that contains
these macros, then is run through the C preprocessor to substitute actual
values for them before being loaded into Postgres.

Obviously, there are many options for fixing this. One of those options is
modifying PostgreSQL to accept the 0xff syntax ... and evaluating that
option is the reason for my post.

So, one of my questions is: does anyone have an existing simple answer on
how to fix this?

My other question: is there a specific reason why PostgreSQL doesn't support
this syntax, aside from "nobody has bothered to add such support"? Because
I'm considering writing a patch to Postgres and submitting it, but I'm not
going to go down that path if there's a specific reason why supporting this
syntax would be _bad_. Personally, I feel like it would be a good thing, as
it seems like a lot of other database systems support it, and even though
it's not ANSI, it's pretty much the de-facto standard.

--
Bill <wmoran@potentialtech.com>



Bill Moran <wmoran@potentialtech.com> writes:
> My other question: is there a specific reason why PostgreSQL doesn't support
> this syntax, aside from "nobody has bothered to add such support"? Because
> I'm considering writing a patch to Postgres and submitting it, but I'm not
> going to go down that path if there's a specific reason why supporting this
> syntax would be _bad_. Personally, I feel like it would be a good thing, as
> it seems like a lot of other database systems support it, and even though
> it's not ANSI, it's pretty much the de-facto standard.

How many is "a lot", and do any of the responsible vendors sit on the SQL
standards committee?

One large concern about doing anything like this is whether future
versions of the SQL standard might blindside us with some
not-terribly-compatible interpretation of that syntax.  If we do something
that is also in Oracle or DB2 or one of the other big boys, then we can
probably rely on the assumption that they'll block anything really
incompatible from becoming standardized ;-).

OTOH, if the actual meaning of "a lot" is "MySQL", I'd be pretty worried
about this scenario.

            regards, tom lane


Unsubscribe pgsql-general

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Thursday, May 21, 2015 1:57 PM
To: Bill Moran
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Allowing postgresql to accept 0xff syntax for data types that it makes sense for?

Bill Moran <wmoran@potentialtech.com> writes:
> My other question: is there a specific reason why PostgreSQL doesn't
> support this syntax, aside from "nobody has bothered to add such
> support"? Because I'm considering writing a patch to Postgres and
> submitting it, but I'm not going to go down that path if there's a
> specific reason why supporting this syntax would be _bad_. Personally,
> I feel like it would be a good thing, as it seems like a lot of other
> database systems support it, and even though it's not ANSI, it's pretty much the de-facto standard.

How many is "a lot", and do any of the responsible vendors sit on the SQL standards committee?

One large concern about doing anything like this is whether future versions of the SQL standard might blindside us with
somenot-terribly-compatible interpretation of that syntax.  If we do something that is also in Oracle or DB2 or one of
theother big boys, then we can probably rely on the assumption that they'll block anything really incompatible from
becomingstandardized ;-). 

OTOH, if the actual meaning of "a lot" is "MySQL", I'd be pretty worried about this scenario.

            regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


On May 21, 2015 06:04:37 PM Brown, Joseph E. wrote:

> Unsubscribe pgsql-general

 

This doesn't work. See the footer of the posts to the mailing list.

 

 

On Thu, 21 May 2015 13:57:24 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Bill Moran <wmoran@potentialtech.com> writes:
> > My other question: is there a specific reason why PostgreSQL doesn't support
> > this syntax, aside from "nobody has bothered to add such support"? Because
> > I'm considering writing a patch to Postgres and submitting it, but I'm not
> > going to go down that path if there's a specific reason why supporting this
> > syntax would be _bad_. Personally, I feel like it would be a good thing, as
> > it seems like a lot of other database systems support it, and even though
> > it's not ANSI, it's pretty much the de-facto standard.
>
> How many is "a lot", and do any of the responsible vendors sit on the SQL
> standards committee?

Well, I've personally worked with (in addition to PostgreSQL) Microsoft
SQL Server, MySQL, and Sybase -- PostgreSQL is the only one of those 4 that
doesn't support the 0xff syntax.

I did a litle research and it appears that neither Oracle nor db2 supports
the 0xff syntax ... so not _quite_ as common as it seemed to me.

> One large concern about doing anything like this is whether future
> versions of the SQL standard might blindside us with some
> not-terribly-compatible interpretation of that syntax.  If we do something
> that is also in Oracle or DB2 or one of the other big boys, then we can
> probably rely on the assumption that they'll block anything really
> incompatible from becoming standardized ;-).

I assume that Microsoft is big enough to prevent anything that would
hurt SQL Server's compatibility from becomming a standard?

> OTOH, if the actual meaning of "a lot" is "MySQL", I'd be pretty worried
> about this scenario.

Well, MySQL _does_ support that syntax ... but I couldn't care less. MySQL
also throws away your data instead of giving you errors and I would never
ask PostgreSQL to start behaving like that.

With all that being said, if I were to build a patch, would it be likely
to be accepted into core?

--
Bill Moran


On 22 May 2015 at 04:46, Bill Moran <wmoran@potentialtech.com> wrote:
> I did a litle research and it appears that neither Oracle nor db2 supports
> the 0xff syntax ... so not _quite_ as common as it seemed to me.

> With all that being said, if I were to build a patch, would it be likely
> to be accepted into core?

Wouldn't you also need to support similar syntax for octal numbers for
the patch to be complete? Or are those already supported (ISTR that's
'077' for decimal 63)?

Not that I care at all about octal numbers, but supporting one and not
the other just doesn't seem right.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


Alban Hertroys <haramrae@gmail.com> writes:
> On 22 May 2015 at 04:46, Bill Moran <wmoran@potentialtech.com> wrote:
>> With all that being said, if I were to build a patch, would it be likely
>> to be accepted into core?

> Wouldn't you also need to support similar syntax for octal numbers for
> the patch to be complete? Or are those already supported (ISTR that's
> '077' for decimal 63)?

A patch that made 077 be interpreted as octal would certainly get rejected
out of hand, because that's valid syntax right now and it doesn't mean 63.

A similar objection might be raised to 0x..., as that is also valid
syntax; it's read as 0 followed by an identifier:

regression=# select 0xff;
 xff
-----
   0
(1 row)

(Yet another place where the fact that AS is optional yields surprising
results...)  So there would be a backwards compatibility break here,
and no you can't fix it with a GUC.  Still, it might be a small enough
break that we could get away with it.  I'm not personally very excited
but other people might be.

Other questions you'd have to think about: what is the data type of
0xffffffff; what do you do with 0xffffffffffffffffffffffff (too big
even for int8).  And it'd likely behoove you to check how Microsoft
answers those questions, if you want to point to SQL Server as what's
going to keep you out of standards-compatibility problems.  (IOW,
if 0x ever did get standardized, the text might well match what
SQL Server does.)

            regards, tom lane



On Fri, May 22, 2015 at 10:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alban Hertroys <haramrae@gmail.com> writes:
> On 22 May 2015 at 04:46, Bill Moran <wmoran@potentialtech.com> wrote:
>> With all that being said, if I were to build a patch, would it be likely
>> to be accepted into core?


How feasible would it be to write a network proxy, like pg_bouncer, to handle converting the values on the fly, so that you need to change neither your original code base (with respect to this issue), nor change PostgreSQL itself?
On Fri, 22 May 2015 11:02:47 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Alban Hertroys <haramrae@gmail.com> writes:
> > On 22 May 2015 at 04:46, Bill Moran <wmoran@potentialtech.com> wrote:
> >> With all that being said, if I were to build a patch, would it be likely
> >> to be accepted into core?
>
> > Wouldn't you also need to support similar syntax for octal numbers for
> > the patch to be complete? Or are those already supported (ISTR that's
> > '077' for decimal 63)?
>
> A patch that made 077 be interpreted as octal would certainly get rejected
> out of hand, because that's valid syntax right now and it doesn't mean 63.

You'll get no objection from me on that point.

> A similar objection might be raised to 0x..., as that is also valid
> syntax; it's read as 0 followed by an identifier:
>
> regression=# select 0xff;
>  xff
> -----
>    0
> (1 row)
>
> (Yet another place where the fact that AS is optional yields surprising
> results...)  So there would be a backwards compatibility break here,
> and no you can't fix it with a GUC.  Still, it might be a small enough
> break that we could get away with it.

I hadn't even considered that issue. I really hate the fact that AS is
optional, and I'm irritated by code that omits it ... but nobody's
asking me ...

> I'm not personally very excited
> but other people might be.
>
> Other questions you'd have to think about: what is the data type of
> 0xffffffff; what do you do with 0xffffffffffffffffffffffff (too big
> even for int8).  And it'd likely behoove you to check how Microsoft
> answers those questions, if you want to point to SQL Server as what's
> going to keep you out of standards-compatibility problems.  (IOW,
> if 0x ever did get standardized, the text might well match what
> SQL Server does.)

MSSQL seems to use it specifically for the equivalent of BYTEA types,
and it seems to me that should be how it works in PostgreSQL. Does
anyone watching this thread have access to a MSSQL server to verify?
If an implicit cast from a 4-byte BYTEA to int works now, then it
should work ... otherwise an explicit cast would be needed, with the
same behavior if you tried to specify a number that overflows an int
in any other way.

MySQL is more liberal in that you can use it to specify ints as
well, but I don't think MySQL is a good example of proper behavior.

--
Bill Moran


On Fri, 22 May 2015 11:27:49 -0500
Dennis Jenkins <dennis.jenkins.75@gmail.com> wrote:

> On Fri, May 22, 2015 at 10:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Alban Hertroys <haramrae@gmail.com> writes:
> > > On 22 May 2015 at 04:46, Bill Moran <wmoran@potentialtech.com> wrote:
> > >> With all that being said, if I were to build a patch, would it be likely
> > >> to be accepted into core?
>
> How feasible would it be to write a network proxy, like pg_bouncer, to
> handle converting the values on the fly, so that you need to change neither
> your original code base (with respect to this issue), nor change PostgreSQL
> itself?

Certainly feasible, but absolutely undesirable. The system I'm working on
is needlessly complex as it is ... I'd rather convince my bosses to let
me rewrite 80,000 lines of code than add another compatibility shim to
the mess.

I brought it up because I saw an opportunity to benefit my employer and
the PostgreSQL community at the same time. I have about 4 fallback plans
if there's a reason not to do this one. Quite frankly, adding a compatibility
shim isn't even on that list.

--
Bill Moran


Bill Moran <wmoran@potentialtech.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Other questions you'd have to think about: what is the data type of
>> 0xffffffff; what do you do with 0xffffffffffffffffffffffff (too big
>> even for int8).  And it'd likely behoove you to check how Microsoft
>> answers those questions, if you want to point to SQL Server as what's
>> going to keep you out of standards-compatibility problems.  (IOW,
>> if 0x ever did get standardized, the text might well match what
>> SQL Server does.)

> MSSQL seems to use it specifically for the equivalent of BYTEA types,
> and it seems to me that should be how it works in PostgreSQL.

Oh really?  Wow, I'd just assumed you wanted this as a way to write
integers.  That's certainly the use-case I would have personally.
I'm not even sure I like the idea of being able to write byteas without
quotes --- they seem like strings to me, not numbers.

> If an implicit cast from a 4-byte BYTEA to int works now, then it
> should work ... otherwise an explicit cast would be needed, with the
> same behavior if you tried to specify a number that overflows an int
> in any other way.

There's no cast at all from bytea to int.  For one thing, it's quite
unclear what endianness should be assumed for such a cast.  (To get
unsurprising behavior from what you're describing, I think we'd have
to use a big-endian interpretation of the bytea; but that would be
a pain for a lot of other scenarios, or even for this case if you'd
written a bytea of length other than 4 or 8 bytes.)

            regards, tom lane


On Fri, 22 May 2015 12:44:40 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Bill Moran <wmoran@potentialtech.com> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Other questions you'd have to think about: what is the data type of
> >> 0xffffffff; what do you do with 0xffffffffffffffffffffffff (too big
> >> even for int8).  And it'd likely behoove you to check how Microsoft
> >> answers those questions, if you want to point to SQL Server as what's
> >> going to keep you out of standards-compatibility problems.  (IOW,
> >> if 0x ever did get standardized, the text might well match what
> >> SQL Server does.)
>
> > MSSQL seems to use it specifically for the equivalent of BYTEA types,
> > and it seems to me that should be how it works in PostgreSQL.
>
> Oh really?  Wow, I'd just assumed you wanted this as a way to write
> integers.  That's certainly the use-case I would have personally.
> I'm not even sure I like the idea of being able to write byteas without
> quotes --- they seem like strings to me, not numbers.

Arrgh ... it's good that you're bringing this up, but you're making me
realize that there's more to figure out than I originally thought ...
My focus had been on it being used for BYTEA columns, but there _are_
plenty of places in the code that do things like:

WHERE int_col & 0x04 = 0x04

Which means that Sybase will implicitly cast that to an int, which
probably means that MSSQL will as well.

Once I take that into consideration, I start thinking that int_col
should actualy be a bit string. which means that:

WHERE bit_varying_col & 0x04 = 0x04

should probably work without explicit casts as well.

> > If an implicit cast from a 4-byte BYTEA to int works now, then it
> > should work ... otherwise an explicit cast would be needed, with the
> > same behavior if you tried to specify a number that overflows an int
> > in any other way.
>
> There's no cast at all from bytea to int.  For one thing, it's quite
> unclear what endianness should be assumed for such a cast.  (To get
> unsurprising behavior from what you're describing, I think we'd have
> to use a big-endian interpretation of the bytea; but that would be
> a pain for a lot of other scenarios, or even for this case if you'd
> written a bytea of length other than 4 or 8 bytes.)

As an implicit cast, obviously anything too large to fit in the
target data type should be an error. But the subject of endianness
becomes damn important.

It's more complex than I original thought, but it still seems like it
can be done without creating idiocy in the way things are cast. I'll
think about it some more and try to come up with some more specific
use scenarios to see what behavior seems the most POLA to me.

--
Bill Moran


On Fri, May 22, 2015 at 6:57 PM, Bill Moran <wmoran@potentialtech.com> wrote:
> On Fri, 22 May 2015 12:44:40 -0400
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Bill Moran <wmoran@potentialtech.com> writes:
>> > Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> Other questions you'd have to think about: what is the data type of
>> >> 0xffffffff; what do you do with 0xffffffffffffffffffffffff (too big
>> >> even for int8).  And it'd likely behoove you to check how Microsoft
>> >> answers those questions, if you want to point to SQL Server as what's
>> >> going to keep you out of standards-compatibility problems.  (IOW,
>> >> if 0x ever did get standardized, the text might well match what
>> >> SQL Server does.)
>>
>> > MSSQL seems to use it specifically for the equivalent of BYTEA types,
>> > and it seems to me that should be how it works in PostgreSQL.
>>
>> Oh really?  Wow, I'd just assumed you wanted this as a way to write
>> integers.  That's certainly the use-case I would have personally.
>> I'm not even sure I like the idea of being able to write byteas without
>> quotes --- they seem like strings to me, not numbers.
>
> Arrgh ... it's good that you're bringing this up, but you're making me
> realize that there's more to figure out than I originally thought ...
> My focus had been on it being used for BYTEA columns, but there _are_
> plenty of places in the code that do things like:
>
> WHERE int_col & 0x04 = 0x04
>
> Which means that Sybase will implicitly cast that to an int, which
> probably means that MSSQL will as well.
>
> Once I take that into consideration, I start thinking that int_col
> should actualy be a bit string. which means that:
>
> WHERE bit_varying_col & 0x04 = 0x04
>
> should probably work without explicit casts as well.
>
>> > If an implicit cast from a 4-byte BYTEA to int works now, then it
>> > should work ... otherwise an explicit cast would be needed, with the
>> > same behavior if you tried to specify a number that overflows an int
>> > in any other way.
>>
>> There's no cast at all from bytea to int.  For one thing, it's quite
>> unclear what endianness should be assumed for such a cast.  (To get
>> unsurprising behavior from what you're describing, I think we'd have
>> to use a big-endian interpretation of the bytea; but that would be
>> a pain for a lot of other scenarios, or even for this case if you'd
>> written a bytea of length other than 4 or 8 bytes.)
>
> As an implicit cast, obviously anything too large to fit in the
> target data type should be an error. But the subject of endianness
> becomes damn important.
>
> It's more complex than I original thought, but it still seems like it
> can be done without creating idiocy in the way things are cast. I'll
> think about it some more and try to come up with some more specific
> use scenarios to see what behavior seems the most POLA to me.
>

SQL server does this:

0x10 is VARBINARY:

   0x10 + 0x00 = 0x1000

There is an implicit cast from varbinary to int:

   0x10 + 0 = 16
   0xfffffff9  + 0 = -7

And there is silent truncation:

    0xff00000000 + 0 = 0


Bill Moran wrote on 22.05.2015 18:57:
> Arrgh ... it's good that you're bringing this up, but you're making me
> realize that there's more to figure out than I originally thought ...
> My focus had been on it being used for BYTEA columns, but there _are_
> plenty of places in the code that do things like:

If this is only about a more compact format for BYTEA, the following is valid in Postgres:

   select '\xff'::bytea

It's not exactly what you are after, but maybe "good enough".


Tom Lane wrote on 21.05.2015 19:57:
> One large concern about doing anything like this is whether future
> versions of the SQL standard might blindside us with some
> not-terribly-compatible interpretation of that syntax.  If we do something
> that is also in Oracle or DB2 or one of the other big boys, then we can
> probably rely on the assumption that they'll block anything really
> incompatible from becoming standardized ;-).


The SQL standard already specifies the format for "binary strings":

<binary string literal> ::=
     X <quote> [ <space>... ] [ { <hexit> [ <space>... ] <hexit> [ <space>... ] }... ] <quote>
        [ { <separator> <quote> [ <space>... ] [ { <hexit> [ <space>... ]
        <hexit> [ <space>... ] }... ] <quote> }... ]

<hexit> ::=
     <digit> | A | B | C | D | E | F | a | b | c | d | e | f


The data type for such a literal is somewhat "undefined":

     It is implementation-defined whether the declared type of a <binary string literal>
     is a fixed-length binary string type, a variable-length binary string type,
     or a binary large object string type

But the above syntax seems to be only supported by H2, HSQLDB and Apache Derby.



Thomas Kellerer <spam_eater@gmx.net> writes:
> The SQL standard already specifies the format for "binary strings":

> <binary string literal> ::=
>      X <quote> [ <space>... ] [ { <hexit> [ <space>... ] <hexit> [ <space>... ] }... ] <quote>
>         [ { <separator> <quote> [ <space>... ] [ { <hexit> [ <space>... ]
>         <hexit> [ <space>... ] }... ] <quote> }... ]

Yeah, and we do honor that:

regression=# select x'1234abcdf';
               ?column?
--------------------------------------
 000100100011010010101011110011011111
(1 row)

... although I notice we don't allow spaces in the input, which seems
like something to fix.  (On further investigation, it looks like the
optional spaces are new since SQL99, which probably explains why we
have not got 'em.)

> The data type for such a literal is somewhat "undefined":

>      It is implementation-defined whether the declared type of a <binary string literal>
>      is a fixed-length binary string type, a variable-length binary string type,
>      or a binary large object string type

Hm, we think it's bit(N):

regression=# select pg_typeof(x'1234abcdf');
 pg_typeof
-----------
 bit
(1 row)

which is what the SQL standard *used* to say, before they unceremoniously
threw the bit types under the bus and replaced them with "binary strings".

I wonder how painful it would be to update that stuff to be compatible
with more recent versions of the standard.  Or whether anyone really
cares --- people doing this in PG seem to be happy enough with bytea.

            regards, tom lane


Well, whether good or bad, my employer has nixed the idea of paying me
to work on this, and I don't have personal time right now to do it,
so it's not going to be addressed by me at this time.

--
Bill <wmoran@potentialtech.com>