Re: Proposal to improve the content in subsection 8.16.6. "CompositeType Input and Output Syntax" - Mailing list pgsql-docs

From David G. Johnston
Subject Re: Proposal to improve the content in subsection 8.16.6. "CompositeType Input and Output Syntax"
Date
Msg-id CAKFQuwZ9BZNUp4YjpszEZRJaVjDXZcHnRe+xPHLf=zSVMt__+g@mail.gmail.com
Whole thread Raw
In response to Re: Proposal to improve the content in subsection 8.16.6. "CompositeType Input and Output Syntax"  (Bryn Llewellyn <bryn@yugabyte.com>)
List pgsql-docs
On Mon, Apr 6, 2020 at 12:46 AM Bryn Llewellyn <bryn@yugabyte.com> wrote:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

...

It's already an example, so I don't see this as an improvement.

...

The proposed additional text is flat-out wrong.

What actually happens here is that text between quotes is considered
quoted (so that, for example, commas within it are not field separators),
but that does not exclude there being other unquoted text within the
same field value.

regards, tom lane

Thanks, Tom. Your reply was very helpful. It prompted me to read, re-read, and then re-re-read 8.16.6 several times. I did notice that neither of “varchar” nor “char” are found in the whole of section 8.16

 
But I suppose that it’s obvious to the reader that “text” is to be taken as a term of art, denoting “character data types” and not as the name of one of these data types.

The word "text" in Tom's paragraph isn't talking about a data type at all.  Its talking about symbols.

Second: for this specified target:
> hello "you" <

I found that each of these six literals produced the same result:
( hello \"you\” )
("" hello \"you\" "")
(" hello \"you\" ")
(" hello ""you"" ")
( "hello \"you\"" )
( "hello ""you""" )

The result was canonically rendered in psql thus:
(" hello ""you"" ")

This observation tells me that this:
""
serves as the syntax to escape a single double quote within a surrounding double quote pair, as an alternative to this:
\"

I searched in the whole of section 8.16 for this:
""
but I found its meaning defined only as the empty string when it occurs outside of a surrounding double quote pair. I assume that it’s obvious to most readers that it has a second meaning in the other context.

You need to expand your search to more than just the literal symbol written "".....

Quoting from the 8.16.6 docs.....

"To put a double quote or backslash in a quoted composite field value, precede it with a backslash. (Also, a [emp]pair of double quotes within a double-quoted field value is taken to represent a double quote character[emp], analogously to the rules for single quotes in SQL literal strings.)"

"Double quotes and backslashes embedded in field values will be doubled."

I found that, with is extra one rule, I could explain all my observations. 

I observed, over all my tests, that there always exists a canonical form like this:
("...")
where the ellipsis indicates a well-formed sequence of characters that produces my specified target.

I’m going to elevate this to a hypothesis.

And unless/until it’s disproved, I’m going to adopt a rule of practice (for myself) always to use this form.

The " " just means first treat the underlying content as being textual (but untyped).  Since every type has an input function that accepts textual data and converts it to the relevant type this works.

However, you may wish to explore other data types comprising your composite and consider boundary cases (empty string or missing/null) variations before solidifying your hypothesis.

For example at least consider why the following fails.

create type rtint as (a int);
select '("")'::rtint;

literal-for-array-of-records.sql
--------------------------------
This was designed to help me meet my ultimate goal: I need to write a client program (say, in python) that will process data from a file whose semantic content is sufficient to let me populate the “rt[]” field in one row of a table, given by “type rt as (n numeric, s text, t timestamp, b booleaan)”.

I’m afraid I couldn’t find the rules in the Version 11 PG doc. This probably reflects my poor searching skills.

The immediately preceding chapter in the documentation discusses arrays at the same level of detail and definitional authority as the chapter on composite types.  Its left as an exercise to the reader to deal with the nesting dynamic - in particular the escaping of special characters especially those common to both syntaxes.


   n   |          s          |          t          | b 
-------+---------------------+---------------------+---
 19.13 | "Hello", she\he\r  +| 2019-07-21 16:47:00 | f
       | (said | is saying). |                     | 

The “\r”, along with the “+” that I’m used to, seem to be artifacts of how psql displays what I created with chr(10). But only in this special context. (This "select 'a'||chr(10)||’b'” produces only the “+” but not the “\r”. Strange.)

Welcome to the fun world of line break (newline) character sequences.  Just for fun:

On a typewriter if you want to begin a new line your machine has to physically do two things:
1 - move the piece of paper up one row (line feed) - this command code was assigned chr(10) and is abbreviated \n
2 - move the strike bar back to the starting position (carriage return) - code chr(13) and is abbreviated \r

Windows faithfully reproduced the typewriter action and defines the two character sequence chr(13) + chr(10) as the newline operation.

Linux (its predecessors) took a more practical approach and decided since a computer isn't a typewriter it doesn't need both characters to represent a new line operation, one is sufficient and "line feed" makes the most sense so chr(10) is chosen.

Now onto psql - when it encounters a linefeed in the value data instead of printing out the escape it actually breaks the content being printed into lines.  It also prints a plus sign (+) and the end of the line.

This seems to be under-documented.  I'm also unsure whether the behavior has been normalized for different platforms.
 

I discovered that the output is produced by this:

{

  "(42.17,dog,\"2020-02-17 20:55:00\",t)",
  "(19.13,\"\"\"Hello\"\", she\\\\he
(said | is saying).\",\"2019-07-21 16:47:00\",f)"

}

I am able to formulate the rules to produce this programmatically (in Python) starting with my target values expressed as Python variables. Here’s how I did it. Notice, that for simplicity here, “rt” has just a single text field. The proper code has to put the surrounding double quootes and parentheses around the comma-separated list of escaped field representations.


If you accept that only PostgreSQL is going to read the resulting data you might want to consider taking the recommendation at the end of the big note in 8.16.6 and output dollar quotes instead of double quotes - at least in the array construction case.

David J.

pgsql-docs by date:

Previous
From: Bryn Llewellyn
Date:
Subject: Re: Proposal to improve the content in subsection 8.16.6. "CompositeType Input and Output Syntax"
Next
From: PG Doc comments form
Date:
Subject: recovery.conf in documentation for postgresql version 12