Thread: Proposal to improve the content in subsection 8.16.6. "Composite Type Input and Output Syntax"

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/11/rowtypes.html
Description:

1. Replace this:

"...the whitespace will be ignored if the field type is integer, but not if
it is text."

with this:

"...the whitespace will be ignored if the field type is integer, but not if
it is a character data type like text."

----------------------------------------------------------------------------------------------

2. Find this:

"...you can write double quotes around any individual field value. You must
do so if the field value would otherwise confuse..."

Insert this between the two sentences, i.e. after "...any individual field
value." and before "You must do so if...":

<<
Notice that, as previously stated, the value for a field whose data type is
text, varchar, or char starts with the charater that immediately follows the
opening parenthesis or comma deliminator, and ends with the character that
immediately precedes the comma deliminator or closing parenthesis. This
means that when you choose to surround a character value with double quotes,
the opening double quote must immediately follow the starting deliminator
and the closing double quote must immediately precede the closing
deliminator, like this:

'("hello world")'

If you don't follow this rule and write, for example, this:

'(   "hello world"   )'

then you will not get an error. However, the rules for the outcome are
undefined and you should simply avoid doing this.
>>

PG Doc comments form <noreply@postgresql.org> writes:
> 1. Replace this:
> "...the whitespace will be ignored if the field type is integer, but not if
> it is text."
> with this:
> "...the whitespace will be ignored if the field type is integer, but not if
> it is a character data type like text."

It's already an example, so I don't see this as an improvement.

> Insert this between the two sentences, i.e. after "...any individual field
> value." and before "You must do so if...":

The proposed additional text is flat-out wrong.

What actually happens here is that text between quotes is considered
quoted (so that, for example, commas within it are not field separators),
but that does not exclude there being other unquoted text within the
same field value.

            regards, tom lane



Tom Lane <tgl@sss.pgh.pa.us> wrote:

...

It's already an example, so I don't see this as an improvement.

...

The proposed additional text is flat-out wrong.

What actually happens here is that text between quotes is considered
quoted (so that, for example, commas within it are not field separators),
but that does not exclude there being other unquoted text within the
same field value.

regards, tom lane

Thanks, Tom. Your reply was very helpful. It prompted me to read, re-read, and then re-re-read 8.16.6 several times. I did notice that neither of “varchar” nor “char” are found in the whole of section 8.16. But I suppose that it’s obvious to the reader that “text” is to be taken as a term of art, denoting “character data types” and not as the name of one of these data types.

I ran some fairly exhaustive empirical tests. Not to be mysterious, I created two “.sql” files. The zip is attached.

record_literal.sql
------------------
This uses "type rt as (v text)” and "table t(k serial primary key, literal text, r rt)”.
It inserts a series of text values into column “literal” as is and into column “r” typecast to “rt”.
I aimed to end up with specified target values shown by “select '>'||(r).v||’<‘ from t”.

Here are two example findings,

First: for this specified target:
>hello<

I found that each of these five literals produced the same result:
(hello)
("hello")
(h\ello)
("h\ello")
(""hello"")

The result was canonically rendered in psql (by its implicit typecast to “text:) thus:
(hello)

Second: for this specified target:
> hello "you" <

I found that each of these six literals produced the same result:
( hello \"you\” )
("" hello \"you\" "")
(" hello \"you\" ")
(" hello ""you"" ")
( "hello \"you\"" )
( "hello ""you""" )

The result was canonically rendered in psql thus:
(" hello ""you"" ")

This observation tells me that this:
""
serves as the syntax to escape a single double quote within a surrounding double quote pair, as an alternative to this:
\"

I searched in the whole of section 8.16 for this:
""
but I found its meaning defined only as the empty string when it occurs outside of a surrounding double quote pair. I assume that it’s obvious to most readers that it has a second meaning in the other context.

I found that, with is extra one rule, I could explain all my observations. 

I observed, over all my tests, that there always exists a canonical form like this:
("...")
where the ellipsis indicates a well-formed sequence of characters that produces my specified target.

I’m going to elevate this to a hypothesis.

And unless/until it’s disproved, I’m going to adopt a rule of practice (for myself) always to use this form.

literal-for-array-of-records.sql
--------------------------------
This was designed to help me meet my ultimate goal: I need to write a client program (say, in python) that will process data from a file whose semantic content is sufficient to let me populate the “rt[]” field in one row of a table, given by “type rt as (n numeric, s text, t timestamp, b booleaan)”.

I’m afraid I couldn’t find the rules in the Version 11 PG doc. This probably reflects my poor searching skills. However, I did find a clue here:


I used the approach that it showed me to construct an example value for the “rt[]” field bottom-up using a “DO” block. And then I printed the canonical form that represets the value. Here’s an example:

For an array with these two values (as shown in psql):

   n   |  s  |          t          | b 
-------+-----+---------------------+---
 42.17 | dog | 2020-02-17 20:55:00 | t

   n   |          s          |          t          | b 
-------+---------------------+---------------------+---
 19.13 | "Hello", she\he\r  +| 2019-07-21 16:47:00 | f
       | (said | is saying). |                     | 

The “\r”, along with the “+” that I’m used to, seem to be artifacts of how psql displays what I created with chr(10). But only in this special context. (This "select 'a'||chr(10)||’b'” produces only the “+” but not the “\r”. Strange.)


I discovered that the output is produced by this:

{

  "(42.17,dog,\"2020-02-17 20:55:00\",t)",
  "(19.13,\"\"\"Hello\"\", she\\\\he
(said | is saying).\",\"2019-07-21 16:47:00\",f)"

}

I am able to formulate the rules to produce this programmatically (in Python) starting with my target values expressed as Python variables. Here’s how I did it. Notice, that for simplicity here, “rt” has just a single text field. The proper code has to put the surrounding double quootes and parentheses around the comma-separated list of escaped field representations.

def fix_1(s):
    # Do the escaping needed for a stringy value.
    s = s.replace(backslash, two_backslashes)
    s = s.replace(quote, two_quotes)
    return s

def fix_2(s):
    # Do the escaping to fix the bare record representation for use as a array element.
    s = s.replace(backslash, two_backslashes)
    s = s.replace(quote, backslash_quote)
    return s

def format(s):
    s = '("' + fix_1(s)  + '")'
    s = fix_2(s)
    return s


I found out, along the way, that while it isn’t necessary to surround, say, a numeric value with double quotes when it takes its place in the literal for a “rt” value, neither is it harmful to do this. In the same way, the code that escapes special characters will have no effect, but will be harmless. I therefore decided not to implement extra logic to prepare such items differently from text items. I appreciate that my program is therefore slower than it might be.

Long story short, my goal is met. Thanks again.




Attachment
On Mon, Apr 6, 2020 at 12:46 AM Bryn Llewellyn <bryn@yugabyte.com> wrote:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

...

It's already an example, so I don't see this as an improvement.

...

The proposed additional text is flat-out wrong.

What actually happens here is that text between quotes is considered
quoted (so that, for example, commas within it are not field separators),
but that does not exclude there being other unquoted text within the
same field value.

regards, tom lane

Thanks, Tom. Your reply was very helpful. It prompted me to read, re-read, and then re-re-read 8.16.6 several times. I did notice that neither of “varchar” nor “char” are found in the whole of section 8.16

 
But I suppose that it’s obvious to the reader that “text” is to be taken as a term of art, denoting “character data types” and not as the name of one of these data types.

The word "text" in Tom's paragraph isn't talking about a data type at all.  Its talking about symbols.

Second: for this specified target:
> hello "you" <

I found that each of these six literals produced the same result:
( hello \"you\” )
("" hello \"you\" "")
(" hello \"you\" ")
(" hello ""you"" ")
( "hello \"you\"" )
( "hello ""you""" )

The result was canonically rendered in psql thus:
(" hello ""you"" ")

This observation tells me that this:
""
serves as the syntax to escape a single double quote within a surrounding double quote pair, as an alternative to this:
\"

I searched in the whole of section 8.16 for this:
""
but I found its meaning defined only as the empty string when it occurs outside of a surrounding double quote pair. I assume that it’s obvious to most readers that it has a second meaning in the other context.

You need to expand your search to more than just the literal symbol written "".....

Quoting from the 8.16.6 docs.....

"To put a double quote or backslash in a quoted composite field value, precede it with a backslash. (Also, a [emp]pair of double quotes within a double-quoted field value is taken to represent a double quote character[emp], analogously to the rules for single quotes in SQL literal strings.)"

"Double quotes and backslashes embedded in field values will be doubled."

I found that, with is extra one rule, I could explain all my observations. 

I observed, over all my tests, that there always exists a canonical form like this:
("...")
where the ellipsis indicates a well-formed sequence of characters that produces my specified target.

I’m going to elevate this to a hypothesis.

And unless/until it’s disproved, I’m going to adopt a rule of practice (for myself) always to use this form.

The " " just means first treat the underlying content as being textual (but untyped).  Since every type has an input function that accepts textual data and converts it to the relevant type this works.

However, you may wish to explore other data types comprising your composite and consider boundary cases (empty string or missing/null) variations before solidifying your hypothesis.

For example at least consider why the following fails.

create type rtint as (a int);
select '("")'::rtint;

literal-for-array-of-records.sql
--------------------------------
This was designed to help me meet my ultimate goal: I need to write a client program (say, in python) that will process data from a file whose semantic content is sufficient to let me populate the “rt[]” field in one row of a table, given by “type rt as (n numeric, s text, t timestamp, b booleaan)”.

I’m afraid I couldn’t find the rules in the Version 11 PG doc. This probably reflects my poor searching skills.

The immediately preceding chapter in the documentation discusses arrays at the same level of detail and definitional authority as the chapter on composite types.  Its left as an exercise to the reader to deal with the nesting dynamic - in particular the escaping of special characters especially those common to both syntaxes.


   n   |          s          |          t          | b 
-------+---------------------+---------------------+---
 19.13 | "Hello", she\he\r  +| 2019-07-21 16:47:00 | f
       | (said | is saying). |                     | 

The “\r”, along with the “+” that I’m used to, seem to be artifacts of how psql displays what I created with chr(10). But only in this special context. (This "select 'a'||chr(10)||’b'” produces only the “+” but not the “\r”. Strange.)

Welcome to the fun world of line break (newline) character sequences.  Just for fun:

On a typewriter if you want to begin a new line your machine has to physically do two things:
1 - move the piece of paper up one row (line feed) - this command code was assigned chr(10) and is abbreviated \n
2 - move the strike bar back to the starting position (carriage return) - code chr(13) and is abbreviated \r

Windows faithfully reproduced the typewriter action and defines the two character sequence chr(13) + chr(10) as the newline operation.

Linux (its predecessors) took a more practical approach and decided since a computer isn't a typewriter it doesn't need both characters to represent a new line operation, one is sufficient and "line feed" makes the most sense so chr(10) is chosen.

Now onto psql - when it encounters a linefeed in the value data instead of printing out the escape it actually breaks the content being printed into lines.  It also prints a plus sign (+) and the end of the line.

This seems to be under-documented.  I'm also unsure whether the behavior has been normalized for different platforms.
 

I discovered that the output is produced by this:

{

  "(42.17,dog,\"2020-02-17 20:55:00\",t)",
  "(19.13,\"\"\"Hello\"\", she\\\\he
(said | is saying).\",\"2019-07-21 16:47:00\",f)"

}

I am able to formulate the rules to produce this programmatically (in Python) starting with my target values expressed as Python variables. Here’s how I did it. Notice, that for simplicity here, “rt” has just a single text field. The proper code has to put the surrounding double quootes and parentheses around the comma-separated list of escaped field representations.


If you accept that only PostgreSQL is going to read the resulting data you might want to consider taking the recommendation at the end of the big note in 8.16.6 and output dollar quotes instead of double quotes - at least in the array construction case.

David J.