Thread: copying json data and backslashes

copying json data and backslashes

From
Alastair McKinley
Date:
Hi all,

I have come across this apparently common issue COPY-ing json and wondering if there is potentially a better solution.

I am copying data into a jsonb column originating from a 3rd party API.  The data may have literal \r,\t,\n and also double backslashes.

I discovered that I can cast this data to a jsonb value directly but I can't COPY the data without pre-processing.

The example below illustrates my issue (only with \r, but the problem extends to other \X combinations).

do $$
lines=[r'{"test" : "\r this data has a carriage return"}']

with open("/tmp/test1.json","w") as f:
    for line in lines:
        f.write(line.strip() + "\n")

$$ language plpython3u;


create temp table testing (data jsonb);

-- this works
insert into testing (data)
select l::jsonb
from pg_read_file('/tmp/test1.json') f,
lateral regexp_split_to_table(f,'\n') l where l <> '';

-- fails
copy testing (data) from '/tmp/test1.json';

-- works
copy testing (data) from program $c$ sed -e 's/\\r/\\\\u000a/g' /tmp/test1.json $c$;

Is there any other solution with COPY that doesn't require manual implementation of search/replace to handle these edge cases?
Why does ::jsonb work but COPY doesn't?  It seems a bit inconsistent.

Best regards,

Alastair

Re: copying json data and backslashes

From
Erik Wienhold
Date:
> On 22/11/2022 15:23 CET Alastair McKinley <a.mckinley@analyticsengines.com> wrote:
>
> Hi all,
>
> I have come across this apparently common issue COPY-ing json and wondering if
> there is potentially a better solution.
>
> I am copying data into a jsonb column originating from a 3rd party API. The
> data may have literal \r,\t,\n and also double backslashes.
>
> I discovered that I can cast this data to a jsonb value directly but I can't
> COPY the data without pre-processing.
>
> The example below illustrates my issue (only with \r, but the problem extends
> to other \X combinations).
>
> > do $$
> > lines=[r'{"test" : "\r this data has a carriage return"}']
> >
> > with open("/tmp/test1.json","w") as f:
> > for line in lines:
> > f.write(line.strip() + "\n")
> >
> > $$ language plpython3u;
> >
> > create temp table testing (data jsonb);
> >
> > -- this works
> > insert into testing (data)
> > select l::jsonb
> > from pg_read_file('/tmp/test1.json') f,
> > lateral regexp_split_to_table(f,'\n') l where l <> '';
> >
> > -- fails
> > copy testing (data) from '/tmp/test1.json';
> >
> > -- works
> > copy testing (data) from program $c$ sed -e 's/\\r/\\\\u000a/g' /tmp/test1.json $c$;
> >
>
> Is there any other solution with COPY that doesn't require manual
> implementation of search/replace to handle these edge cases?
> Why does ::jsonb work but COPY doesn't? It seems a bit inconsistent.

COPY handles special backslash sequences[1].  The \r in your sample JSON,
although properly escaped according to JSON, is replaced with an actual
carriage return by COPY before casting to jsonb.  The error results from JSON
prohibiting unescaped control characters in strings[2].

You must double escape to pass those characters through COPY.

See how COPY outputs backslash sequences:

    -- Actual carriage return:
    copy (select e'\r') to stdout;
    \r

    -- Backslash sequence for carriage return:
    copy (select '\r') to stdout;
    \\r

[1] https://www.postgresql.org/docs/current/sql-copy.html#id-1.9.3.55.9.2
[2] https://www.json.org/json-en.html

--
Erik



Re: copying json data and backslashes

From
"pbj@cmicdo.com"
Date:
 >
 > On Tuesday, November 22, 2022 at 10:16:11 AM EST, Erik Wienhold <ewie@ewie.name> wrote:
 >
 >
 > > On 22/11/2022 15:23 CET Alastair McKinley <a.mckinley@analyticsengines.com> wrote:
 > >
 > > Hi all,
 > >
 > > I have come across this apparently common issue COPY-ing json and wondering if
 > > there is potentially a better solution.
 > >
 > > I am copying data into a jsonb column originating from a 3rd party API. The
 > > data may have literal \r,\t,\n and also double backslashes.
 > >
 > > I discovered that I can cast this data to a jsonb value directly but I can't
 > > COPY the data without pre-processing.
 >
 >
 > > Is there any other solution with COPY that doesn't require manual
 > > implementation of search/replace to handle these edge cases?
 > > Why does ::jsonb work but COPY doesn't? It seems a bit inconsistent.
 >
 > COPY handles special backslash sequences[1].  The \r in your sample JSON,
 > although properly escaped according to JSON, is replaced with an actual
 > carriage return by COPY before casting to jsonb.  The error results from JSON
 > prohibiting unescaped control characters in strings[2].
 >
 > You must double escape to pass those characters through COPY.
 >
 > See how COPY outputs backslash sequences:
 >
 >     -- Actual carriage return:
 >     copy (select e'\r') to stdout;
 >     \r
 >
 >     -- Backslash sequence for carriage return:
 >     copy (select '\r') to stdout;
 >
 >     \\r




Re: copying json data and backslashes

From
Alastair McKinley
Date:
> From: pbj@cmicdo.com <pbj@cmicdo.com>
> Sent: 22 November 2022 15:30
> To: Alastair McKinley <a.mckinley@analyticsengines.com>; pgsql-general@lists.postgresql.org
<pgsql-general@lists.postgresql.org>;Erik Wienhold <ewie@ewie.name> 
> Subject: Re: copying json data and backslashes
>  
>  >
>  > On Tuesday, November 22, 2022 at 10:16:11 AM EST, Erik Wienhold <ewie@ewie.name> wrote:
>  >
>  >
>  > > On 22/11/2022 15:23 CET Alastair McKinley <a.mckinley@analyticsengines.com> wrote:
>  > >
>  > > Hi all,
>  > >
>  > > I have come across this apparently common issue COPY-ing json and wondering if
>  > > there is potentially a better solution.
>  > >
>  > > I am copying data into a jsonb column originating from a 3rd party API. The
>  > > data may have literal \r,\t,\n and also double backslashes.
>  > >
>  > > I discovered that I can cast this data to a jsonb value directly but I can't
>  > > COPY the data without pre-processing.
>  >
>  >
>  > > Is there any other solution with COPY that doesn't require manual
>  > > implementation of search/replace to handle these edge cases?
>  > > Why does ::jsonb work but COPY doesn't? It seems a bit inconsistent.
>  >
>  > COPY handles special backslash sequences[1].  The \r in your sample JSON,
>  > although properly escaped according to JSON, is replaced with an actual
>  > carriage return by COPY before casting to jsonb.  The error results from JSON
>  > prohibiting unescaped control characters in strings[2].
>  >
>  > You must double escape to pass those characters through COPY.
>  >
>  > See how COPY outputs backslash sequences:
>  >
>  >     -- Actual carriage return:
>  >     copy (select e'\r') to stdout;
>  >     \r
>  >
>  >     -- Backslash sequence for carriage return:
>  >     copy (select '\r') to stdout;
>  >
>  >     \\r
>
> I have been able to get around this problem by using the following method:
>
> \copy footable from 'input.json' (format csv, escape '^B', delimieter '^C', quote '^E')
>
> where the control characters are the actual control char, not the caret-letter, and it requires no escaping escapes. 
Irealize this won't work for all 
> situations.
>
> PJ

Hi PJ,

Thanks for the suggestion, this is interesting to me to try but I am not quite sure how this works.
As far as I understand, escape/quote/delimiter have to be a single character, and CTRL-C etc. are multiple characters.

What way do you input each of the escape/quote/delimiter characters?

Best regards,

Alastair

>
>  >
>  > [1]
>  > https://www.postgresql.org/docs/current/sql-copy.html#id-1.9.3.55.9.2
>  > [2] https://www.json.org/json-en.html
>  >
>  > --
>  > Erik
>





Re: copying json data and backslashes

From
"Peter J. Holzer"
Date:
On 2022-11-22 17:39:04 +0000, Alastair McKinley wrote:
> > \copy footable from 'input.json' (format csv, escape '^B', delimieter '^C', quote '^E')
> >
> > where the control characters are the actual control char, not the
> > caret-letter, and it requires no escaping escapes.  I realize this
> > won't work for all
> > situations.
>
> Thanks for the suggestion, this is interesting to me to try but I am
> not quite sure how this works.
> As far as I understand, escape/quote/delimiter have to be a single
> character, and CTRL-C etc. are multiple characters.

You may have to hit several Keys[1] on your keyboard, but Ctrl-C is a
single character, just like Shift-C is (the former has code 0003, the
latter 0043).

On Unix-like systems you can usually type the control characters by
typing Ctrl-V first:

At the psql prompt, type
    select ascii('
then hit V while holding the ctrl key
then hit C while holding the ctrl key
The terminal should display that as ^C
then complete the line with
    ');
so that it looks like
    select ascii('^C');
and hit return:
╔═══════╗
║ ascii ║
╟───────╢
║     3 ║
╚═══════╝
(1 row)

Same for the other ctrl characters.

        hp

[1] There are usually four Ctrl-Characters which need only a single
    key: Ctrl-I (TAB), Ctrl-M (CR), Ctrl-[ (ESC) and Ctrl-H (BS) or Ctrl-?
    (DEL).

    (On Unix systems CR is normally translated to LF, on Windows to CRLF)


--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: copying json data and backslashes

From
"pbj@cmicdo.com"
Date:


On Tuesday, November 22, 2022 at 01:16:02 PM EST, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:

 > On 2022-11-22 17:39:04 +0000, Alastair McKinley wrote:
 > > > \copy footable from 'input.json' (format csv, escape '^B', delimieter '^C
', quote '^E')
 > > >
 > > > where the control characters are the actual control char, not the
 > > > caret-letter, and it requires no escaping escapes.  I realize this
 > > > won't work for all
 > > > situations.
 > >
 > > Thanks for the suggestion, this is interesting to me to try but I am
 > > not quite sure how this works.
 > > As far as I understand, escape/quote/delimiter have to be a single
 > > character, and CTRL-C etc. are multiple characters.
 >

Yes, Alastair, Peter said what I would have...

 > You may have to hit several Keys[1] on your keyboard, but Ctrl-C is a
 > single character, just like Shift-C is (the former has code 0003, the
 > latter 0043).
 >
 > On Unix-like systems you can usually type the control characters by
 > typing Ctrl-V first:
 >
 > At the psql prompt, type
 >     select ascii('
 > then hit V while holding the ctrl key
 > then hit C while holding the ctrl key
 > The terminal should display that as ^C
 > then complete the line with
 >     ');
 > so that it looks like
 >     select ascii('^C');
 > and hit return:
 >
 >
 > [1] There are usually four Ctrl-Characters which need only a single
 >     key: Ctrl-I (TAB), Ctrl-M (CR), Ctrl-[ (ESC) and Ctrl-H (BS) or Ctrl-?
 >     (DEL).
 >  
 >     (On Unix systems CR is normally translated to LF, on Windows to CRLF)
 >  


Re: copying json data and backslashes

From
Erik Wienhold
Date:
> On 22/11/2022 20:11 CET pbj@cmicdo.com <pbj@cmicdo.com> wrote:
>
> On Tuesday, November 22, 2022 at 01:16:02 PM EST, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>
> > On 2022-11-22 17:39:04 +0000, Alastair McKinley wrote:
>  > > > \copy footable from 'input.json' (format csv, escape '^B', delimieter '^C
> ', quote '^E')
>  > > >
>  > > > where the control characters are the actual control char, not the
>  > > > caret-letter, and it requires no escaping escapes. I realize this
>  > > > won't work for all
>  > > > situations.
>  > >
>  > > Thanks for the suggestion, this is interesting to me to try but I am
>  > > not quite sure how this works.
>  > > As far as I understand, escape/quote/delimiter have to be a single
>  > > character, and CTRL-C etc. are multiple characters.
>  >
>
> Yes, Alastair, Peter said what I would have...
>
>  > You may have to hit several Keys[1] on your keyboard, but Ctrl-C is a
>  > single character, just like Shift-C is (the former has code 0003, the
>  > latter 0043).
>  >
>  > On Unix-like systems you can usually type the control characters by
>  > typing Ctrl-V first:
>  >
>  > At the psql prompt, type
>  > select ascii('
>  > then hit V while holding the ctrl key
>  > then hit C while holding the ctrl key
>  > The terminal should display that as ^C
>  > then complete the line with
>  > ');
>  > so that it looks like
>  > select ascii('^C');
>  > and hit return:
>  >
>  >
>  > [1] There are usually four Ctrl-Characters which need only a single
>  > key: Ctrl-I (TAB), Ctrl-M (CR), Ctrl-[ (ESC) and Ctrl-H (BS) or Ctrl-?
>  > (DEL).
>  >
>  > (On Unix systems CR is normally translated to LF, on Windows to CRLF)
>  >

Or use C-style escapes[1]:

    \copy footable from 'input.json' (format csv, escape e'\2', delimiter e'\3', quote e'\5')

[1] https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-ESCAPE

--
Erik