Thread: 7.2 - changed array_out() - quotes vs no quotes

7.2 - changed array_out() - quotes vs no quotes

From
David Gould
Date:
Somewhere after 7.2b2 (it looks like for 7.2b4) a change was made to
array_out() in:
 src/backend/utils/adt/arrayfuncs.c,v 1.72 2001 /11/29 21:02:41 tgl
   "Fix array_out's failure to backslash backslashes, per bug# 524.    Also, remove brain-dead rule that double quotes
areneeded if    and only if the datatype is pass-by-reference; neither direction    of the implication holds water.
Instead,examine the actual data    string to see if it contains any characters that force us to quote    it. Add some
documentationabout quoting of array values, which was    previously explained nowhere AFAICT."
 

The older code quoted any "pass by ref" types, ie varlena's as opposed to
ints or floats. Which was perhaps clumsy, but at least was predictable.
That is, if it was a char or text type you could assume it was quoted, eg
 { "foo", "bar", "foo bar", "foo.bar", "foo,bar" }.

The new code goes to the trouble of scanning the string for embedded
commas, curlys, backslashes and whitespace. If it finds some it arranges
for quotes, otherwise it suppresses the quotes anywhere it is possible
not to use them.

There is now no easy way for a client to know what the output will look
like as it now depends on the specific data values, eg: { foo, bar, "foo bar", foo.bar, "foo,bar", "foo\"bar" }

This pretty much breaks any client that does not have a scanner at least
as intelligent as the array_in() scanner. Which includes most perl
programs, and pretty much all shell scripts using psql and arrays.

Since Jdbc did not support arrays well until very recently, I suspect it
broke a few hand written java array scanners too. It did ours.

Finally, I simply don't see the need for this change. There is no pressing
world shortage of double-quote characters, so there seems little reason to
conserve them. Especially at the cost of breaking existing clients and
complicating future client code (exercise: write a shell function to
scan the new style of array result correctly).

I view this change as a bug and would like to see it backed out. 


<rant>

We monitor the hackers list, test with most betas, read the release
notes carefully. And were completely blindsided when our production apps
fell over after we upgraded from 7.2b2 to the release. Grrrr....

So right now we have reverted to our previous version and deferred our
our upgrade until we decide whether to patch out this change (and maintain
another patch) or to rewrite some of our client code (at least we have
source so we have this option, some don't). 

I would really like to hear that I have persuaded you that it is a bug
and will be fixed in a future release.

Alternatively that you plan not to slipstream client visible formatting
changes into late betas.  Especially without any mention in the release
notes.

I really wish one could upgrade from release to release of postgres without
a db reload, or rewriting client software, and especially not without
both.

Grrrr....

Grrrr....

</rant>

Regards

-dg

-- 
David Gould               dg@nextbus.com   (or davidg@dnai.com)
Criminal: A person with predatory instincts who has not sufficient capital
to form a corporation.      - Howard Scott


Re: 7.2 - changed array_out() - quotes vs no quotes

From
Tom Lane
Date:
David Gould <dg@nextbus.com> writes:
> Somewhere after 7.2b2 (it looks like for 7.2b4) a change was made to
> array_out() in:

I will take the blame for that.

> I view this change as a bug and would like to see it backed out. 

The old behavior was certainly broken and I will not accept a proposal
to back out the change entirely.

What you are really saying is that you'd prefer the choice of quotes or
no quotes to be driven by the datatype rather than by the data value.
That's a legitimate gripe, but where shall we get the knowledge of whether
the datatype might sometimes emit strings that need quoting?  Using the
pass-by-reference flag is *completely* wrong; the fact that it chanced
not to fail in your application does not make it less wrong.

The only way I could see to make the behavior totally predictable at
the datatype level (while not being broken) is to always quote every
array element.  However, that would likely break some other people's
oversimplified parsers, so I'm not convinced it's a net win.  Perhaps
you should fix your application code rather than relying on a
never-documented behavior.
        regards, tom lane


Re: 7.2 - changed array_out() - quotes vs no quotes

From
David Gould
Date:
On Fri, Feb 08, 2002 at 12:29:46AM -0500, Tom Lane wrote:
> David Gould <dg@nextbus.com> writes:
> > Somewhere after 7.2b2 (it looks like for 7.2b4) a change was made to
> > array_out() in:
> 
> I will take the blame for that.

Well, it would have been nice if it had been in the release notes...

> > I view this change as a bug and would like to see it backed out. 
> 
> The old behavior was certainly broken and I will not accept a proposal
> to back out the change entirely.

I am not defending the old behaviour, just the fact that it _was_
predictable. It was simple to know which types are pass by ref and it is
simple to allow for it. I am not saying that pass by ref is the right
criteria, just that predictable is good.
> What you are really saying is that you'd prefer the choice of quotes or
> no quotes to be driven by the datatype rather than by the data value.

Yes. I think it is not excessive to insist that types have stable,
predicatable representations. The other types do, why should arrays be
even more special?

> That's a legitimate gripe, but where shall we get the knowledge of whether
> the datatype might sometimes emit strings that need quoting?  Using the

Hmmm, maybe the type designers should deal with this. And if they don't,
well quote em.

Or, you don't even need the quotes, you could just promise never
to insert white space and to always escape embedded commas and curlys.
So a dumb client could simply split on un-escaped commas and be done.
I could live with that as long as it never changed again, at least without
some warning.

> pass-by-reference flag is *completely* wrong; the fact that it chanced
> not to fail in your application does not make it less wrong.

"chanced not to fail". Nice. 

> The only way I could see to make the behavior totally predictable at
> the datatype level (while not being broken) is to always quote every
> array element.  However, that would likely break some other people's

Fine with me. That is what it did before.

> oversimplified parsers, so I'm not convinced it's a net win.  Perhaps
> you should fix your application code rather than relying on a
> never-documented behavior.

"relying on a never-documented behavior" ..., so it is now my fault there
is no documentation? Heh.

Besides, this isn't even strictly true, in the html documentation
(doc/arrays.html "chapter 6") that ships with 7.2 there is a nice example:
   SELECT schedule[1:2][1:1] FROM sal_emp WHERE name = 'Bill';
         schedule   --------------------    {{"meeting"},{""}}   (1 row)

which shows quotes around the strings. Likewise, the doc quotes strings in
every example of an insert. Further, it says:
 "Observe that to write an array value, we enclose the element values  within curly braces and separate them by commas.
Ifyou know C, this  is not unlike the syntax for initializing structures."
 

Which suggests that the syntax should be similar to C. Which requires
strings to be quoted.

And, since the "undocumented behaviour" has been there since the dawn
of time, or at least since:
 src/backend/utils/adt/arrayfuncs.c Revision 1.1, Tue Jul 9 06:22:03 1996 UTC (5 years, 7 months ago) by scrappy

it is likely that _not_ _changing_ _it_ would not break any of the last
six years worth of existing "other people's oversimplified parsers".

... breathes ...

I don't mean to be rude about this, I have a lot of respect for postgres,
all of the postgres developers and the overall process.

But to slip a client visible change late in a beta cycle to a specific
format that has been stable since UC Berkeley freed the code, and then
to suggest that is just a some silly user relying on
"never-documented behavior" is almost comical.

Seriously, one point of a database is to insulate client applications
from the exact representation and layout of the data. Which is not
accomplished by making arbitrary changes to simple things like strings
that make them take a yards and yards of code to parse.

I think a clearly explained rule about what types are quoted (and either
qoted or not quoted, not sometimes this and sometimes that) would be
a nice addition to the documentation, especially if the code then
was made to match it. And it was in the release notes.

But, as it stands, I still see a bug.

-dg


-- 
David Gould               dg@nextbus.com   (or davidg@dnai.com)
'Some people, when confronted with a problem, think "I know, I'll useregular expressions".  Now they have two
problems.' -- Jamie Zawinski
 


Re: 7.2 - changed array_out() - quotes vs no quotes

From
Tom Lane
Date:
David Gould <dg@nextbus.com> writes:
> Yes. I think it is not excessive to insist that types have stable,
> predicatable representations. The other types do, why should arrays be
> even more special?

The representation is stable and predictable.  You're simply hoping to
avoid building smarts into your parser for it.  Unfortunately, some
degree of smarts are *necessary* if you are going to deal with array
items containing arbitrary text.  I can hardly believe that a client
program that can deal with backslash-escapes is going to have trouble
removing quotes.

> Or, you don't even need the quotes, you could just promise never
> to insert white space and to always escape embedded commas and curlys.

No, we can't, because that would break applications that rely on the
existing rules for array input: leading whitespace is insignificant
unless quoted.  Besides, weren't you complaining because the quotes
disappeared?  The above variant would still break your code.

> So a dumb client could simply split on un-escaped commas and be done.

I hardly think that a client that can tell the difference between an
escaped comma and an un-escaped one qualifies as "dumb".

We could perhaps dispense with quotes on output if we escaped leading
spaces.  For example, instead of"  foo"
emit\  foo
I don't think this is a step forward in readability, though.  And
increased reliance on backslashes instead of double quotes won't really
make anyone's life easier --- for example, you'd have to remember to
double them when sending the same value back to the SQL parser.

>> The only way I could see to make the behavior totally predictable at
>> the datatype level (while not being broken) is to always quote every
>> array element.

> Fine with me. That is what it did before.

No, it has never done that.  In particular, I do not wish to change the
longstanding no-quotes behavior for arrays of integers.  That *would*
break other people's code.  (One of the things I hoped to accomplish
with this change is to extend the same no-quotes behavior to floats and
numerics.)

> But to slip a client visible change late in a beta cycle to a specific
> format that has been stable since UC Berkeley freed the code,

It's been broken since Berkeley, too; the fact that no one complained
till a month or two ago just indicates how little arrays are used, IMHO.
I doubt you'd be any less annoyed no matter when in the development
cycle we'd done this.

I do agree that it'd be better if this had been called out in the
release notes.  We don't currently have any process for ensuring that
minor incompatibilities get noted in the release notes.  Bruce makes up
the notes based on scanning the CVS logs after the fact, and if he
misses the significance of an entry, it's missed.  Maybe we can do
better than that --- adding an entry to a release-notes-to-be file when
the change is made might be more reliable.

It's also true that the SGML documentation is sadly deficient on this
point; but then, its discussion of arrays is overly terse in just about
every respect.  Someone want to volunteer to expand it?

> Seriously, one point of a database is to insulate client applications
> from the exact representation and layout of the data. Which is not
> accomplished by making arbitrary changes to simple things like strings
> that make them take a yards and yards of code to parse.

Properly parsing arrays of text values is going to require dealing with
backslash-escapes in any case; seems to me that that's what will take
"yards and yards" of code.  Stripping off optional quotes is trivial
by comparison.  On the other hand, parsing arrays of integers is pretty
trivial since you know there are no escapable characters anywhere.
I don't favor pushing complexity out of the one case and into the other.

I'm willing to consider the output-no-quotes-at-all approach if people
think that's a superior solution.  Comments anyone?
        regards, tom lane


Re: 7.2 - changed array_out() - quotes vs no quotes

From
Justin Clift
Date:
Tom Lane wrote:
<snip>
> 
> I'm willing to consider the output-no-quotes-at-all approach if people
> think that's a superior solution.  Comments anyone?

Keeping the quotes seems better, purely because of readability.

i.e. "  xyzzy " is better than ,\  xyzzy\ , or similar.

Regards and best wishes,

Justin Clift


> 
>                         regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly

-- 
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."  - Indira Gandhi


Re: 7.2 - changed array_out() - quotes vs no quotes

From
Elein
Date:

The issue is that changing a user interface will break people's
code.  It is clear that a user interface change was made and
it is also clear that it broke people's code.  In any organization
that is a bug.

I did this job of gating server user interfaces for commercial postgres
based systems for several years. It is not always a nice job.  The
need to carefully ensure that interfaces were (almost) always backward
compatible was given, as was the necessity of highlighting any unavoidable
changes in interfaces so that the users would be impacted in the smallest
possible way.

In this case, the container aspect of the datatype requires more studied
parsing that most others with or without the changes.  Also, the array
handling routines in jdbc were only completed in 7.2, so we can assume
that java users had to do their own parsing of the representation of
this container type.

The counter argument that there were no complaints
is not particularly relevent since few people have been using the
beta versions and have been waiting for 7.2 to stabilize since it requires
a significant effort to do the initdb.  It is safe to say that the
masses will be a bit slowere on the adoption of an initdb release than
others.

You all have been stellar at supporting postgresql and responding to users
in very helpful and immediate ways.  We have benefited directly from that.
Keep up the good work by not changing user interfaces for anything other
than unavoidable circumstances.

elein

Tom Lane wrote:

> David Gould <dg@nextbus.com> writes:
> 
>>Yes. I think it is not excessive to insist that types have stable,
>>predicatable representations. The other types do, why should arrays be
>>even more special?
>>
> 
> The representation is stable and predictable.  You're simply hoping to
> avoid building smarts into your parser for it.  Unfortunately, some
> degree of smarts are *necessary* if you are going to deal with array
> items containing arbitrary text.  I can hardly believe that a client
> program that can deal with backslash-escapes is going to have trouble
> removing quotes.
> 
> 
>>Or, you don't even need the quotes, you could just promise never
>>to insert white space and to always escape embedded commas and curlys.
>>
> 
> No, we can't, because that would break applications that rely on the
> existing rules for array input: leading whitespace is insignificant
> unless quoted.  Besides, weren't you complaining because the quotes
> disappeared?  The above variant would still break your code.
> 
> 
>>So a dumb client could simply split on un-escaped commas and be done.
>>
> 
> I hardly think that a client that can tell the difference between an
> escaped comma and an un-escaped one qualifies as "dumb".
> 
> We could perhaps dispense with quotes on output if we escaped leading
> spaces.  For example, instead of
>     "  foo"
> emit
>     \  foo
> I don't think this is a step forward in readability, though.  And
> increased reliance on backslashes instead of double quotes won't really
> make anyone's life easier --- for example, you'd have to remember to
> double them when sending the same value back to the SQL parser.
> 
> 
>>>The only way I could see to make the behavior totally predictable at
>>>the datatype level (while not being broken) is to always quote every
>>>array element.
>>>
> 
>>Fine with me. That is what it did before.
>>
> 
> No, it has never done that.  In particular, I do not wish to change the
> longstanding no-quotes behavior for arrays of integers.  That *would*
> break other people's code.  (One of the things I hoped to accomplish
> with this change is to extend the same no-quotes behavior to floats and
> numerics.)
> 
> 
>>But to slip a client visible change late in a beta cycle to a specific
>>format that has been stable since UC Berkeley freed the code,
>>
> 
> It's been broken since Berkeley, too; the fact that no one complained
> till a month or two ago just indicates how little arrays are used, IMHO.
> I doubt you'd be any less annoyed no matter when in the development
> cycle we'd done this.
> 
> I do agree that it'd be better if this had been called out in the
> release notes.  We don't currently have any process for ensuring that
> minor incompatibilities get noted in the release notes.  Bruce makes up
> the notes based on scanning the CVS logs after the fact, and if he
> misses the significance of an entry, it's missed.  Maybe we can do
> better than that --- adding an entry to a release-notes-to-be file when
> the change is made might be more reliable.
> 
> It's also true that the SGML documentation is sadly deficient on this
> point; but then, its discussion of arrays is overly terse in just about
> every respect.  Someone want to volunteer to expand it?
> 
> 
>>Seriously, one point of a database is to insulate client applications
>>from the exact representation and layout of the data. Which is not
>>accomplished by making arbitrary changes to simple things like strings
>>that make them take a yards and yards of code to parse.
>>
> 
> Properly parsing arrays of text values is going to require dealing with
> backslash-escapes in any case; seems to me that that's what will take
> "yards and yards" of code.  Stripping off optional quotes is trivial
> by comparison.  On the other hand, parsing arrays of integers is pretty
> trivial since you know there are no escapable characters anywhere.
> I don't favor pushing complexity out of the one case and into the other.
> 
> I'm willing to consider the output-no-quotes-at-all approach if people
> think that's a superior solution.  Comments anyone?
> 
>             regards, tom lane
> 
> 



-- 
--------------------------------------------------------
elein@nextbus.com 
(510)420-3120 
www.nextbus.comspinning to infinity, hallelujah
--------------------------------------------------------