Thread: BUG #12320: json parsing with embedded double quotes

BUG #12320: json parsing with embedded double quotes

From
postgres@bt-r.com
Date:
The following bug has been logged on the website:

Bug reference:      12320
Logged by:          Aaron Botsis
Email address:      postgres@bt-r.com
PostgreSQL version: 9.4.0
Operating system:   Mac OSX
Description:

postgres=# create table ct (data jsonb);
CREATE TABLE
postgres=# copy ct (data) from  '/Users/aaron/Desktop/queries.json';
ERROR:  invalid input syntax for type json
DETAIL:  Token "root" is invalid.
CONTEXT:  JSON data, line 1: ...1418066241619 AND <=1418671041621) AND
user:"root...
COPY ct, line 3, column data:
"{"timestamp":"2014-12-15T19:17:32.505Z","duration":7.947,"query":{"query":{"filtered":{"filter":{"qu..."


This is a sample of the JSON:


{"timestamp":"2014-12-15T19:17:32.505Z","duration":7.947,"query":{"query":{"filtered":{"filter":{"query":{"query_string":{"query":"organization_id:[REDACTED]
AND timestamp:(>=1418066241619 AND <=1418671041621) AND

user:\"root\""}}}}},"from":0,"size":20,"timeout":10000,"sort":{"timestamp":{"order":"desc"}},"facets":{}},"organization_id":"[REDACTED]","indices":["logs.[REDACTED].20141209","logs.[REDACTED].20141208","logs.[REDACTED].20141215","logs.[REDACTED].20141210","logs.[REDACTED].20141213","logs.[REDACTED].20141214","logs.[REDACTED].20141211","logs.[REDACTED].20141212"],"type":"all","level":"info","message":""}

Re: BUG #12320: json parsing with embedded double quotes

From
Francisco Olarte
Date:
Hi Aaron:
....

>
> user:\"root\""}}}}},"from":0,"size":20,"timeout":10000,"sort":{"timestamp":{"order":"

.....

Copy from does it's own round of backslash scaping, if you look at

http://www.postgresql.org/docs/9.4/static/sql-copy.html

The third paragraph under File formats / text format reads:

Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters. In
particular, the following characters must be preceded by a backslash if
they appear as part of a column value: backslash itself, newline, carriage
return, and the current delimiter character.

As you want the backslash as part of the data you need to double it in the
file, otherwise you'll fall into the later case:
.....Any other backslashed character that is not mentioned in the above
table will be taken to represent itself

And the backslash will get eaten and the json parser will fail. It's the
same problem that you get when you try to put windows paths or perl type
regular expresions inside doube quoted C/Java/perl/shell/and_many_more
Strings.

If your table is always a single column and you cannot edit the file, you
can always try to change the scape character.

Regards.
    Francisco Olarte.

Re: BUG #12320: json parsing with embedded double quotes

From
Francisco Olarte
Date:
Hi Aaron:

On Tue, Jan 6, 2015 at 7:06 PM, Aaron Botsis <aaron@bt-r.com> wrote:

> Hi Francisco, I=E2=80=99m aware, but still consider this to be a bug, or =
at least
> a great opportunity for an enhancement. :)
>

Maybe, but you are going to have a problem.


> This had bitten me for the third time while trying to import some json
> data. It=E2=80=99d be great to bypass the copy escaping (and possibly oth=
er meta
> characters) when the column type is json or jsonb. I=E2=80=99d be happy t=
o try and
> write it and submit a patch if folks believe this is an acceptable way to
> go=E2=80=A6 That said, I should probably read what the process is for thi=
s kind of
> thing :)
>

Reading this, you are talking about 'the column being json'. COPY needs to
do the escaping at the same time it's constructing the columns. The present
way is easy to do, read char by char, if it's a escape, process next char
acumulating into current field, otherwise see whether it is a field or
record separator and act accordingly. It's also layered, when you construct
the records for copy you get all the field data, turn them into escaped
strings, join them by the field separator and spit them out followed by a
record separator ( practical implementations may do this virtually ).
Intermixing this with the 'I'm in a json column' would need to pass
information from the upper layer, and make it more difficult and, specially
error prone. What do you do if ( using the standard delimiters ) your json
value has embeded newlines and tabs ( which, IIRC, are legal in several
places inside the json ). And all this to make some incorrectly formatted
files read ( which can be correctly formatted with a perl one liner or
something similar ). I'm not the one to decide, but I will vote against
including that ( but do not trust me too much, I would also vote against
including 'csv' which I consider the root of many evils ).

Francisco Olarte.

Re: BUG #12320: json parsing with embedded double quotes

From
Aaron Botsis
Date:
Hi Francisco, I=E2=80=99m aware, but still consider this to be a bug, or =
at least a great opportunity for an enhancement. :)=20

This had bitten me for the third time while trying to import some json =
data. It=E2=80=99d be great to bypass the copy escaping (and possibly =
other meta characters) when the column type is json or jsonb. I=E2=80=99d =
be happy to try and write it and submit a patch if folks believe this is =
an acceptable way to go=E2=80=A6 That said, I should probably read what =
the process is for this kind of thing :)

Aaron


> On Dec 23, 2014, at 1:17 PM, Francisco Olarte <folarte@peoplecall.com> =
wrote:
>=20
> Hi Aaron:
> ....=20
> =
user:\"root\""}}}}},"from":0,"size":20,"timeout":10000,"sort":{"timestamp"=
:{"order":"
> .....
>=20
> Copy from does it's own round of backslash scaping, if you look at=20
>=20
> http://www.postgresql.org/docs/9.4/static/sql-copy.html =
<http://www.postgresql.org/docs/9.4/static/sql-copy.html>
>=20
> The third paragraph under File formats / text format reads:
> Backslash characters (\) can be used in the COPY data to quote data =
characters that might otherwise be taken as row or column delimiters. In =
particular, the following characters must be preceded by a backslash if =
they appear as part of a column value: backslash itself, newline, =
carriage return, and the current delimiter character.
>=20
> As you want the backslash as part of the data you need to double it in =
the file, otherwise you'll fall into the later case:
>=20
> .....Any other backslashed character that is not mentioned in the =
above table will be taken to represent itself
>=20
> And the backslash will get eaten and the json parser will fail. It's =
the same problem that you get when you try to put windows paths or perl =
type regular expresions inside doube quoted =
C/Java/perl/shell/and_many_more Strings.
>=20
> If your table is always a single column and you cannot edit the file, =
you can always try to change the scape character.
>=20
> Regards.
>     Francisco Olarte.
>=20