Thread: multi line text data/query ?bug?

multi line text data/query ?bug?

From

"Sim Zacks"

Date:

22 March 2005, 07:02:54

While I was testing an issue in python I discovered a general
*nix/windows/mac issue with newlines.
The same query will give different results depending on what client executes
it.

create table test1(f1 text);

insert into test1 values('this is a long string.
it will have new lines in it.
I want to see if those new lines go away.
so ignore. the ugliness')

If the insert was done on a Windows machine there will be a CRLF as the
EOLN, if done on Unix it will have LF and if done on Mac it will have CR.

So if the insert was done on windows, the following query will only work
from a windows client :
select * from test1 where f1='this is a long string.
it will have new lines in it.
I want to see if those new lines go away.
so ignore. the ugliness'

If that insert was done on a *NIX then the query will only work from that
client

Re: multi line text data/query ?bug?

From

Peter Eisentraut

Date:

22 March 2005, 14:35:25

Am Dienstag, 22. März 2005 08:00 schrieb Sim Zacks:
> While I was testing an issue in python I discovered a general
> *nix/windows/mac issue with newlines.
> The same query will give different results depending on what client
> executes it.

While that is indoubtedly strange behavior, I can't see how it qualifies as a
bug.  You press a certain key, and you get exactly the characters that the
key is defined to produce.  That differs between operating systems, but so do
a lot of other things.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: multi line text data/query ?bug?

From

"Sim Zacks"

Date:

23 March 2005, 06:53:10

While I would agree with you that from a purely technical standpoint, the
user inserted into the database a CRLF and a query with just an LF does not
exactly match that, from a users and more practical perspective, that does
not make sense at all. That is why I surrounded the  word bug in ??.

I would say that from a users perspective it qualifies as a bug because they
did not put in specific binary characters. They want a newline. From a
database standards perspective, I would argue that any database that allows
connections from a client without qualifying a required operating system
should be OS neutral.

I would say it is a bug from a users perspective because the exact same
query works differently from different clients. Since the user does not
choose what binary characters to put in, they are invisible to the user.
Anything that is completely invisible to the user should not be considered
valid qualifying data.

As there is no postgresql database standard, such as "all newlines are unix
newlines" it is impossible to write a client that will necessarily return
the data that you want.

This is the exact problem we are having with Python right now, as a Windows
client cannot write a python function to be run on a linux server.


"Peter Eisentraut" <peter_e@gmx.net> wrote in message
news:200503221535.07641.peter_e@gmx.net...
> Am Dienstag, 22. M�rz 2005 08:00 schrieb Sim Zacks:
> > While I was testing an issue in python I discovered a general
> > *nix/windows/mac issue with newlines.
> > The same query will give different results depending on what client
> > executes it.
>
> While that is indoubtedly strange behavior, I can't see how it qualifies
as a
> bug.  You press a certain key, and you get exactly the characters that the
> key is defined to produce.  That differs between operating systems, but so
do
> a lot of other things.
>
> --
> Peter Eisentraut
> http://developer.postgresql.org/~petere/
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>

Re: multi line text data/query ?bug?

From

Marco Colombo

Date:

23 March 2005, 09:51:25

On Wed, 23 Mar 2005, Sim Zacks wrote:

> While I would agree with you that from a purely technical standpoint, the
> user inserted into the database a CRLF and a query with just an LF does not
> exactly match that, from a users and more practical perspective, that does
> not make sense at all. That is why I surrounded the  word bug in ??.
>
> I would say that from a users perspective it qualifies as a bug because they
> did not put in specific binary characters. They want a newline. From a
> database standards perspective, I would argue that any database that allows
> connections from a client without qualifying a required operating system
> should be OS neutral.
>
> I would say it is a bug from a users perspective because the exact same
> query works differently from different clients. Since the user does not
> choose what binary characters to put in, they are invisible to the user.
> Anything that is completely invisible to the user should not be considered
> valid qualifying data.
>
> As there is no postgresql database standard, such as "all newlines are unix
> newlines" it is impossible to write a client that will necessarily return
> the data that you want.
>
> This is the exact problem we are having with Python right now, as a Windows
> client cannot write a python function to be run on a linux server.

Unfortunately, it's not that simple. There are problems with python
when _both_ the client and the server are Windows. Python itself
_always_ uses \n even on Windows. So the only solution is to
"pythonize" the input (convert to \n), no matter what.

For the more general problem of handling text, see my comments in
this thread:
http://archives.postgresql.org/pgsql-general/2005-01/msg00792.php

There are interesting problems with multiline text, as a datatype.
Think of digital signatures and checksums. Think of a simple function:
     len(text)
should it count line separators as characters? In theory, the only
way to get cross-platform consistent behaviour, is to _ignore_ line
separators when counting or checksumming. But the real world solution
is to treat textfiles as binary and let the users or the application
handle the conversion.

.TM.
--
       ____/  ____/   /
      /      /       /            Marco Colombo
     ___/  ___  /   /              Technical Manager
    /          /   /             ESI s.r.l.
  _____/ _____/  _/               Colombo@ESI.it

Re: multi line text data/query ?bug?

From

"Sim Zacks"

Date:

23 March 2005, 10:21:33

In any case, there are 2 correct solutions to the problem for the case of
postgresql.
1) Database standard - postgresql chooses a newline standard and every
client must support that if they support postgresql. Either put the onus on
the client developers to override the OS standard and support the postgresql
standard, or have the db convert the incoming newlines into the db standard.

2) Server's OS - the server must convert any new lines coming in to the
standard that it expects. This is similar to an ftp client that converts
newlines on the transfer. That means that data sent to a Windows server with
an LF will be converted to a CRLF and vice versa.

The data restore function will also have to follow the above procedures to
make sure you can take data from one server to the other without
compromising integrity.

Without one of these solutions, PostGreSQL is not compatible between servers
and clients.
A query written on any client should return the same result. The query being
the visible appearance on the screen. That is what the users would expect to
have returned.

"Marco Colombo" <pgsql@esiway.net> wrote in message
news:Pine.LNX.4.61.0503231028120.26346@Megathlon.ESI...
> On Wed, 23 Mar 2005, Sim Zacks wrote:
>
> > While I would agree with you that from a purely technical standpoint,
the
> > user inserted into the database a CRLF and a query with just an LF does
not
> > exactly match that, from a users and more practical perspective, that
does
> > not make sense at all. That is why I surrounded the  word bug in ??.
> >
> > I would say that from a users perspective it qualifies as a bug because
they
> > did not put in specific binary characters. They want a newline. From a
> > database standards perspective, I would argue that any database that
allows
> > connections from a client without qualifying a required operating system
> > should be OS neutral.
> >
> > I would say it is a bug from a users perspective because the exact same
> > query works differently from different clients. Since the user does not
> > choose what binary characters to put in, they are invisible to the user.
> > Anything that is completely invisible to the user should not be
considered
> > valid qualifying data.
> >
> > As there is no postgresql database standard, such as "all newlines are
unix
> > newlines" it is impossible to write a client that will necessarily
return
> > the data that you want.
> >
> > This is the exact problem we are having with Python right now, as a
Windows
> > client cannot write a python function to be run on a linux server.
>
> Unfortunately, it's not that simple. There are problems with python
> when _both_ the client and the server are Windows. Python itself
> _always_ uses \n even on Windows. So the only solution is to
> "pythonize" the input (convert to \n), no matter what.
>
> For the more general problem of handling text, see my comments in
> this thread:
> http://archives.postgresql.org/pgsql-general/2005-01/msg00792.php
>
> There are interesting problems with multiline text, as a datatype.
> Think of digital signatures and checksums. Think of a simple function:
>   len(text)
> should it count line separators as characters? In theory, the only
> way to get cross-platform consistent behaviour, is to _ignore_ line
> separators when counting or checksumming. But the real world solution
> is to treat textfiles as binary and let the users or the application
> handle the conversion.
>
> .TM.
> --
>        ____/  ____/   /
>       /      /       / Marco Colombo
>      ___/  ___  /   /       Technical Manager
>     /          /   / ESI s.r.l.
>   _____/ _____/  _/        Colombo@ESI.it
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org
>

Re: multi line text data/query ?bug?

From

Marco Colombo

Date:

23 March 2005, 11:10:17

On Wed, 23 Mar 2005, Sim Zacks wrote:

> In any case, there are 2 correct solutions to the problem for the case of
> postgresql.
> 1) Database standard - postgresql chooses a newline standard and every
> client must support that if they support postgresql. Either put the onus on
> the client developers to override the OS standard and support the postgresql
> standard, or have the db convert the incoming newlines into the db standard.
>
> 2) Server's OS - the server must convert any new lines coming in to the
> standard that it expects. This is similar to an ftp client that converts
> newlines on the transfer. That means that data sent to a Windows server with
> an LF will be converted to a CRLF and vice versa.
>
> The data restore function will also have to follow the above procedures to
> make sure you can take data from one server to the other without
> compromising integrity.
>
> Without one of these solutions, PostGreSQL is not compatible between servers
> and clients.
> A query written on any client should return the same result. The query being
> the visible appearance on the screen. That is what the users would expect to
> have returned.

Yeah, those were my points.

The _open_ problems are:
- what about storing a signed document? it's possible that newline
   conversion makes the signature invalid. How would you restore the original
   document? Before you answer think of:
   a) a client running on a platform different from the one that inserted
      the document;
   b) a document with _mixed_ newline types, such as a windows text with
      enmbedded bare \n or \r. [*]
- what about any other function that may be affected by newline style?
   I mean, the user may insert a text that he knows it's 1000 chars long,
   and finds that PG thinks it's only 980. Is this "consistent"? What if
   the user selects for the messages longer than 990? What is the expected
   answer, from the user standpoint?

There's no easy solution I think.

[*] This is _way_ more common than you'd think. RFC2822, internet message
format, says lines are CRFL separated. It happens sometimes that a message
contains a NL or a CR alone.

.TM.
--
       ____/  ____/   /
      /      /       /            Marco Colombo
     ___/  ___  /   /              Technical Manager
    /          /   /             ESI s.r.l.
  _____/ _____/  _/               Colombo@ESI.it

Re: multi line text data/query ?bug?

From

"Sim Zacks"

Date:

23 March 2005, 12:01:15

> - what about storing a signed document? it's possible that newline
>    conversion makes the signature invalid. How would you restore the
original
>    document? Before you answer think of:
>    a) a client running on a platform different from the one that inserted
>       the document;
>    b) a document with _mixed_ newline types, such as a windows text with
>       enmbedded bare \n or \r. [*]
I think there can be a difference between ASCII (text) data and binary data.
Think of the example of FTP again, which handles this very nicely. Binary
data should not have new lines converted. There is a well accepted
difference between binary and ascii. Text can be defined as being completely
readable by the human eye. If there happens to be binary data embedded in
the text, that is not a part of the text but rather supports the text. Any
data that can not have the binary data modified becuase it will render the
data unusable is not really text.
I am not familiar with signed documents, but I would question how they are
currently handled in a cross OS environment.  You may as well ask the same
question about how you would handle a jpeg image. Obviously you will not
want to look for LF and replace it with CRLF because in that data LF does
not mean skip a line.

>    what about any other function that may be affected by newline style?
>    I mean, the user may insert a text that he knows it's 1000 chars long,
>    and finds that PG thinks it's only 980. Is this "consistent"?

If a user inserts a string that he thinks is 1000 charcters long and across
PG implementations and documentation it is considered to be 980 charcters
long, then that is consistent. If he enters a string that he thinks is 1000
characters long and sometimes PG thinks that it is 980, sometimes 1000 and
sometimes 1050 that is not consistent.

Re: multi line text data/query ?bug?

From

Tom Lane

Date:

23 March 2005, 17:52:47

"Sim Zacks" <sim@compulab.co.il> writes:
> A query written on any client should return the same result. The query being
> the visible appearance on the screen.

This is presupposing the answer to the question at hand.  I do not agree
with the above premise; it would seem to imply, for example, expanding
tabs to spaces so that "where foo = '<tab>'" yields the same result as
"where foo = '  '" for some appropriate number of spaces.

            regards, tom lane

Re: multi line text data/query ?bug?

From

"Sim Zacks"

Date:

24 March 2005, 07:26:28

The difference between a Tab and a newline is that tab is a universally
recognized single ascii character while newline is in flux. Aside from this,
a tab is a quasi-viewable character as the cursor will not go to the middle
of the tab. Meaning if the tab takes up the space of 10 characters, you
could not scroll to the place where the 5th character would be if it were in
fact 10 spaces. You cannot highlight half of a tab in editors that allow
text highlighting. I would therefore say that a tab is as visible as a space
and can be easily differentiated. On the other hand, it is impossible to
determine which binary charcters the editor stuck in at the end of a newline
without looking at the binary/hex code.

I understand the complexity of dealing with multiple operating systems, but
seriously, how many non-viewable characters can be embedded in text that
actually make a difference between operating systems? Are there any besides
newline?

Sim

"Tom Lane" <tgl@sss.pgh.pa.us> wrote in message
news:8379.1111600299@sss.pgh.pa.us...
> "Sim Zacks" <sim@compulab.co.il> writes:
> > A query written on any client should return the same result. The query
being
> > the visible appearance on the screen.
>
> This is presupposing the answer to the question at hand.  I do not agree
> with the above premise; it would seem to imply, for example, expanding
> tabs to spaces so that "where foo = '<tab>'" yields the same result as
> "where foo = '  '" for some appropriate number of spaces.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>

Re: multi line text data/query ?bug?

From

Martijn van Oosterhout

Date:

24 March 2005, 10:16:53

On Thu, Mar 24, 2005 at 09:24:11AM +0200, Sim Zacks wrote:
> The difference between a Tab and a newline is that tab is a universally
> recognized single ascii character while newline is in flux. Aside from this,
> a tab is a quasi-viewable character as the cursor will not go to the middle
> of the tab. Meaning if the tab takes up the space of 10 characters, you
> could not scroll to the place where the 5th character would be if it were in
> fact 10 spaces. You cannot highlight half of a tab in editors that allow
> text highlighting. I would therefore say that a tab is as visible as a space
> and can be easily differentiated. On the other hand, it is impossible to
> determine which binary charcters the editor stuck in at the end of a newline
> without looking at the binary/hex code.

Actually, Emacs has a space-through-tab mode in which you can just move
through a tab as if it were spaces. If you delete or insert it
automatically reforms the tabs and spaces around it. Several editors
have an auto-lineending mode in which they'll detect the end of line
character and apply that everywhere. Admittedly this is an extreme case
but editors regularly show things that don't reflect the underlying
file.

> I understand the complexity of dealing with multiple operating systems, but
> seriously, how many non-viewable characters can be embedded in text that
> actually make a difference between operating systems? Are there any besides
> newline?

Sure, the character 0xE9 means different things depending on the
encoding and will sort differently based on the locale. Text files
generally don't indicate what encoding they are, leading to all sorts
of confusion. Unix tends to use Latin1 or UTF8. Windows has it's own
encoding.

IMHO, if you're trying to write portably, don't just hit enter when you
want an end of line, use \n or \r to indicate *exactly* what you mean.
Using a variable behaviour and expecting the server to fix it for you
is wrong. I beleive the server should take exactly what the client
gives as the client is the only one who knows for sure the type.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-94993.dat

Re: multi line text data/query ?bug?

From

Tom Lane

Date:

24 March 2005, 14:48:04

Martijn van Oosterhout <kleptog@svana.org> writes:
> IMHO, if you're trying to write portably, don't just hit enter when you
> want an end of line, use \n or \r to indicate *exactly* what you mean.

Indeed.  We are already permissive about line endings in SQL text, so
this discussion really boils down only to whether we should reinterpret
data that's inside a string literal.  There are good style reasons why
you should never put an unescaped newline into a string literal in the
first place ... and if you do, I don't think it's the database's job to
second-guess what you meant.

            regards, tom lane