Re: Multiline plpython procedure - Mailing list pgsql-general

From: Marco Colombo
Subject: Re: Multiline plpython procedure
Date:
Msg-id: Pine.LNX.4.61.0501211520460.4205@Megathlon.ESI
In response to: Re: Multiline plpython procedure  (Martijn van Oosterhout <kleptog@svana.org>)
List: pgsql-general
On Fri, 21 Jan 2005, Martijn van Oosterhout wrote:

> On Fri, Jan 21, 2005 at 12:02:09PM +0100, Marco Colombo wrote:
>> On Fri, 21 Jan 2005, Greg Stark wrote:
>>> I don't think it's reasonable for pg_dump to think about converting
>>> data from one language to another. It's important for pg_dump to
>>> restore an identical database. Having it start with special case
>>> data conversation from one flavour to another seems too dangerous.
>>
>> Makes no sense. pg_dump already make a lot of conversions: from internal
>> representation (which may be platform dependent) to some common format,
>> say text. It's just multi-line text which is a hard to deal with, because
>> there _no_ single format for it. pg_dump may just choose one format, and
>> stick with it. Every dump/restore will work. You may have trouble editing
>> a text dump, but that's another matter. BTW, what pg_dump does on windows?
>> I mean with -F p. Does it produce a text file with CRNL line seperator?
>> What happens if you feed that file to psql on a Unix box?
>
> Ah, but you see, looking at it from your point of view, pg_dump doesn't
> interpret text strings. For example, the python script in a function is
> a opaque string. Not multiline, nothing. All postgresql does is pass
> that block of opaque data to the interpreter for that language. pg_dump
> dumps that opaque data into the output, and the CREATE FUNCTION dumps
> that opaque data back into the system tables. Postgresql doesn't
> understand python any more or less than perl, tcl, R or any other
> language.

I was referring to pg_dump's plain-text output in general.
E.g. (comments stripped):
CREATE TABLE t2 (
     f1 text
);

COPY t2 (f1) FROM stdin;
test1
test2
test3
\.

This dump, produced on Unix, will have its lines separated by \n. What does
the same dump produced on Windows look like? If it's \n-separated, it's not
(natively) editable on Windows. Which is fine by me: we've simply defined
pg_dump's textual output to always be \n-terminated. Or it's \r\n-terminated;
if so, what happens when you restore it on a Unix box (with psql -f)? Now,
if the data itself contains a \r, I think it shows up escaped, as \r. Whether
intended or not, that escaping is the only thing that saves us (note that on
Unix there's no need to escape a bare \r).
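
To make that last point concrete, here is a rough model in Python of the
escaping rule (my own sketch, not the real COPY code): COPY's text format
writes a carriage return inside the data as the two characters backslash-r,
so the row stays on one physical line of the dump no matter which platform
produced it.

def copy_escape(value):
    # Simplified COPY text-format escaping: the backslash first, then the
    # control characters that would otherwise collide with the dump's own
    # row/field layout.
    return (value.replace('\\', '\\\\')
                 .replace('\r', '\\r')
                 .replace('\n', '\\n')
                 .replace('\t', '\\t'))

print(copy_escape('test1\rstill row one'))   # -> test1\rstill row one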

> The argument here is that basically this opaque data has different
> meanings for Python on windows and Python on unix. You can't make any
> special cases because I can rename plperl.so to plpython.so (or
> vice-versa) the opaque data won't be passed to the interpreter that
> you'd expect from looking at the definition.
>
>> I'm for defining a format used by PostgreSQL, and force the python parser
>> into accepting it on all platforms. That is, let's set the rule that
>> python programs to be embedded into PostgreSQL use \n as line termination.
>
> Wouldn't that disadvantage non-unix pl/python users, whose python
> functions would have to be converted at run-time to conform to the
> local text format. With the extra bummer that the resulting string may
> not be the same size either. Remember, postgresql uses the standard
> shared library for the language on the platform, it doesn't build its
> own.
> But sure, preprocessing the source at run-time seems to be the only
> realistic solution without a change to the interpreter.

Yeah. My favourite solution is to convert the string to the platform's text
format before passing it to the parser. See the Martian example below.
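
For illustration only, here's a minimal sketch of that preprocessing idea,
written in Python itself rather than in the C of the real PL/Python handler
(the function name is made up, and I'm normalizing to \n rather than to the
local platform's sequence, since the CPython compiler accepts \n everywhere):

def compile_proc_source(src):
    # The function body comes out of pg_proc with whatever line endings the
    # client that created it used (\n, \r\n or bare \r).  Rewrite them all
    # to \n before compiling, so the parser never sees a foreign convention.
    normalized = src.replace('\r\n', '\n').replace('\r', '\n')
    return compile(normalized, '<plpython function>', 'exec')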

>> Think of this: tomorrow we meet people from Mars. One of them really likes
>> PostgreSQL, and ports it to their platform. Being a martian platform, it
>> uses a different text file format. Line separator there is the first 1000
>
> <snip>
>
> Spurious argument. You're assuming Martians would use ASCII to write
> programs without using one of the two defined line-ending characters.
> If they were smart they'd simply use a character set which doesn't have
> the ambiguity. If they even use 8-bit bytes. An ASCII C compiler won't
> compile EBCDIC source code either, but nobody thinks that's
> unreasonable, probably because nobody uses EBCDIC anymore :).

You missed the point: the charset has nothing to do with the issue. You may
have to handle both at the same time, but they are unrelated. The line
separator is not dictated by the charset, only by the platform: whether \r\n,
\n or \r terminates a line is _not_ defined by ASCII. The _same_ ASCII text
file looks different when viewed in binary mode on different platforms. The
point was: what if someone introduces another platform with yet another
line-termination convention? It's unlikely, just like the Martians. But it
makes you realize that conversion is the job of the software that handles
inter-platform communication (much like FTP does).
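
The byte-level difference is easy to see. A small sketch (demo.txt is just a
hypothetical scratch file), writing the same two lines of ASCII through the
platform's text mode and then looking at the raw bytes:

# Write the same two lines of ASCII text through text mode...
with open('demo.txt', 'w') as f:
    f.write('line one\nline two\n')

# ...then look at what actually landed on disk.
with open('demo.txt', 'rb') as f:
    print(f.read())
# Unix:    b'line one\nline two\n'
# Windows: b'line one\r\nline two\r\n'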

> No-one is complaining about the use of line-ending characters, they
> could have said that you need a semi-colon to seperate "lines". The
> problem is that it's *not consistant* across platforms.
>
> Have a nice day,

What about C? Is fopen("afile", "r") "portable"? Or should you use
fopen("afile", "rb")? Define "consistent across platforms" here. If you use
"rb", your program is consistent in the sense that the same _binary_ input
produces the same _binary_ output. But if it's supposed to handle text files,
it will fail. That is, it is consistent if it is supposed to handle binary
data, and inconsistent if it is supposed to handle text files. If you use
"r", it's the opposite. No matter what, your program will never be completely
consistent! You have to decide whether it handles text files _or_ binary data
(unless you do runtime detection, of course - but that's another matter.
Under Windows you can assume a .txt file is "text"; under Unix things are
not that simple).
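
Python's open() keeps the same text/binary split as fopen(), so the decision
is easy to demonstrate (lines.txt is a hypothetical file written natively on
Windows, i.e. with \r\n on disk):

# Binary mode: faithful bytes on every platform, but now it's the caller's
# job to know what a "line" is.
with open('lines.txt', 'rb') as f:
    print(f.read())         # b'one\r\ntwo\r\n'

# Text mode: a sensible line-oriented view, but not the bytes that are
# actually on disk.
with open('lines.txt', 'r') as f:
    print(repr(f.read()))   # 'one\ntwo\n'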

Think of the meaning of '$' in a regular expression. What (binary)
character(s) does it match? I expect it to match before a \n under Unix and
before the sequence \r\n under Windows. What is '$' meant to be applied to?
Multiline text. If you treat the data you're applying it to as _binary_
data, its behaviour is inconsistent.
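
A quick way to see it with Python's re module on in-memory strings: whether
'three$' matches depends entirely on whether a text-mode read has already
stripped the \r for you.

import re

# The \n case: Unix data, or a Windows file already read in text mode.
print(bool(re.search(r'three$', 'one two three\n')))    # True

# The raw \r\n case: the carriage return gets between 'three' and '$'.
print(bool(re.search(r'three$', 'one two three\r\n')))  # False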

Face it: _every_ time you handle multiline text data, you should know in
advance what separator it uses. If that handling includes moving the data
across platforms, you should take care of the conversion _before_ you pass
it to an external program that expects textual input.

Try reading the binmode() entry in the Perl manual. In particular:
" For the sake of portability it is a good idea to always
   use it when appropriate, and to never use it when it isn't
   appropriate."
That is, you should be well aware of the type of data you're handling, and
handle it correctly. Burying your head in the sand and saying "well, I treat
it as opaque binary data, so I'm fine" is asking for problems, especially
when you're moving it across platforms.

Otherwise, you _define_ it to be binary data (and users may have problems
reading it as text).

.TM.
--
       ____/  ____/   /
      /      /       /            Marco Colombo
     ___/  ___  /   /              Technical Manager
    /          /   /             ESI s.r.l.
  _____/ _____/  _/               Colombo@ESI.it
