Thread: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From
Aaron Botsis
Date:
Hi folks, I was having a problem importing json data with COPY. Lots of things export data nicely as one json blob per line. This is excellent for directly importing into a JSON/JSONB column for analysis. 

...Except when there’s an embedded double quote. Or anything that’s escaped. COPY handles this, but by the time the escaped char hits the JSON parser, it's not escaped anymore. This breaks the JSON parsing. This means I need to manipulate the input data to double-escape it. See bug #12320 for an example. Yuck.
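To see the failure mode in isolation, here is a minimal hypothetical session (single-column table t1 with a json column bleh, matching the bug report; error output abbreviated and approximate). COPY's text-format escape handling strips the backslash before the value reaches the JSON parser, so the parser sees a bare quote mid-string:

postgres=# create table t1 (bleh json);
postgres=# copy t1 from stdin;
>> {"user": "root", "msg": "he said \"hi\""}
>> \.
ERROR:  invalid input syntax for type json
DETAIL:  Token "hi" is invalid.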

I propose this small patch that simply allows specifying COPY … ESCAPE without requiring the CSV parser. It will make it much easier to directly use json formatted export data for folks going forward. This seemed like the simplest route.

Usage is simply:

postgres=# copy t1 from '/Users/nok/Desktop/queries.json';
ERROR:  invalid input syntax for type json
DETAIL:  Token "root" is invalid.
CONTEXT:  JSON data, line 1: ...1418066241619 AND <=1418671041621) AND user:"root...
COPY t1, line 3, column bleh: "{"timestamp":"2014-12-15T19:17:32.505Z","duration":7.947,"query":{"query":{"filtered":{"filter":{"qu..."
postgres=# copy t1 from '/Users/nok/Desktop/queries.json' escape '';
COPY 1966

I’ve included regression tests, and all existing tests pass. This is my first contribution, so be kind to me. :)



Begin forwarded message:

From: Francisco Olarte <folarte@peoplecall.com>
Date: January 6, 2015 at 1:52:28 PM EST
Subject: Re: [BUGS] BUG #12320: json parsing with embedded double quotes
To: Aaron Botsis <aaron@bt-r.com>

Hi Aaron:

On Tue, Jan 6, 2015 at 7:06 PM, Aaron Botsis <aaron@bt-r.com> wrote:
Hi Francisco, I’m aware, but still consider this to be a bug, or at least a great opportunity for an enhancement. :) 

Maybe, but you are going to have a problem.
 
This had bitten me for the third time while trying to import some json data. It’d be great to bypass the copy escaping (and possibly other meta characters) when the column type is json or jsonb. I’d be happy to try and write it and submit a patch if folks believe this is an acceptable way to go… That said, I should probably read what the process is for this kind of thing :)

Reading this, you are talking about 'the column being json'. COPY needs to do the escaping at the same time it's constructing the columns. The present way is easy to do: read char by char; if it's an escape, process the next char, accumulating into the current field; otherwise see whether it is a field or record separator and act accordingly. It's also layered: when you construct the records for COPY you get all the field data, turn them into escaped strings, join them with the field separator and spit them out followed by a record separator (practical implementations may do this virtually). Intermixing this with 'I'm in a json column' would need to pass information down from the upper layer, and would make it more difficult and, especially, error prone. What do you do if (using the standard delimiters) your json value has embedded newlines and tabs (which, IIRC, are legal in several places inside the json)? And all this to make some incorrectly formatted files readable (which can be correctly formatted with a perl one-liner or something similar). I'm not the one to decide, but I will vote against including that (but do not trust me too much, I would also vote against including 'csv', which I consider the root of many evils).
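To make the delimiter problem concrete: reusing the hypothetical single-json-column table t1 from earlier, a value containing a literal tab (the whitespace after the comma below is a real tab character) is split into two fields by the default text-format delimiter and the row is rejected (error text abbreviated):

postgres=# copy t1 from stdin;
>> {"a": 1,	"b": [1, 2, 3]}
>> \.
ERROR:  extra data after last expected column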

Francisco Olarte.



Attachment

Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From
Aaron Botsis
Date:
Ack. Original had a bug. Let’s try again.



On Jan 7, 2015, at 8:25 AM, Aaron Botsis <aaron@bt-r.com> wrote:

Hi folks, I was having a problem importing json data with COPY. Lots of things export data nicely as one json blob per line. This is excellent for directly importing into a JSON/JSONB column for analysis. 

...Except when there’s an embedded double quote. Or anything that’s escaped. COPY handles this, but by the time the escaped char hits the JSON parser, it's not escaped anymore. This breaks the JSON parsing. This means I need to manipulate the input data to double-escape it. See bug #12320 for an example. Yuck.

I propose this small patch that simply allows specifying COPY … ESCAPE without requiring the CSV parser. It will make it much easier to directly use json formatted export data for folks going forward. This seemed like the simplest route.

Usage is simply:

postgres=# copy t1 from '/Users/nok/Desktop/queries.json';
ERROR:  invalid input syntax for type json
DETAIL:  Token "root" is invalid.
CONTEXT:  JSON data, line 1: ...1418066241619 AND <=1418671041621) AND user:"root...
COPY t1, line 3, column bleh: "{"timestamp":"2014-12-15T19:17:32.505Z","duration":7.947,"query":{"query":{"filtered":{"filter":{"qu..."
postgres=# copy t1 from '/Users/nok/Desktop/queries.json' escape '';
COPY 1966

I’ve included regression tests, and all existing tests pass. This is my first contribution, so be kind to me. :)

<escape-without-csv.patch>


Begin forwarded message:

From: Francisco Olarte <folarte@peoplecall.com>
Date: January 6, 2015 at 1:52:28 PM EST
Subject: Re: [BUGS] BUG #12320: json parsing with embedded double quotes
To: Aaron Botsis <aaron@bt-r.com>

Hi Aaron:

On Tue, Jan 6, 2015 at 7:06 PM, Aaron Botsis <aaron@bt-r.com> wrote:
Hi Francisco, I’m aware, but still consider this to be a bug, or at least a great opportunity for an enhancement. :) 

Maybe, but you are going to have a problem.
 
This had bitten me for the third time while trying to import some json data. It’d be great to bypass the copy escaping (and possibly other meta characters) when the column type is json or jsonb. I’d be happy to try and write it and submit a patch if folks believe this is an acceptable way to go… That said, I should probably read what the process is for this kind of thing :)

Reading this, you are talking about 'the column being json'. COPY needs to do the escaping at the same time it's constructing the columns. The present way is easy to do: read char by char; if it's an escape, process the next char, accumulating into the current field; otherwise see whether it is a field or record separator and act accordingly. It's also layered: when you construct the records for COPY you get all the field data, turn them into escaped strings, join them with the field separator and spit them out followed by a record separator (practical implementations may do this virtually). Intermixing this with 'I'm in a json column' would need to pass information down from the upper layer, and would make it more difficult and, especially, error prone. What do you do if (using the standard delimiters) your json value has embedded newlines and tabs (which, IIRC, are legal in several places inside the json)? And all this to make some incorrectly formatted files readable (which can be correctly formatted with a perl one-liner or something similar). I'm not the one to decide, but I will vote against including that (but do not trust me too much, I would also vote against including 'csv', which I consider the root of many evils).

Francisco Olarte.




Attachment

Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From
Andrew Dunstan
Date:

On 01/07/2015 08:25 AM, Aaron Botsis wrote:
> Hi folks, I was having a problem importing json data with COPY. Lots
> of things export data nicely as one json blob per line. This is
> excellent for directly importing into a JSON/JSONB column for analysis.
>
> ...Except when there’s an embedded doublequote. Or anything that’s
> escaped. COPY handles this, but by the time the escaped char hit the
> JSON parser, it's not escaped anymore. This breaks the JSON parsing.
> This means I need to manipulate the input data to double-escape it.
> See bug #12320 for an example. Yuck.
>
> I propose this small patch that simply allows specifying COPY … ESCAPE
> without requiring the CSV parser. It will make it much easier to
> directly use json formatted export data for folks going forward. This
> seemed like the simplest route.
>
> Usage is simply:
>
> postgres=# copy t1 from '/Users/nok/Desktop/queries.json';
> ERROR:  invalid input syntax for type json
> DETAIL:  Token "root" is invalid.
> CONTEXT:  JSON data, line 1: ...1418066241619 AND <=1418671041621) AND
> user:"root...
> COPY t1, line 3, column bleh:
> "{"timestamp":"2014-12-15T19:17:32.505Z","duration":7.947,"query":{"query":{"filtered":{"filter":{"qu..."
> postgres=# copy t1 from '/Users/nok/Desktop/queries.json' escape '';
> COPY 1966
>
>


This isn't a bug. Neither CSV format nor TEXT format is particularly
suitable for json. I'm quite certain I could compose legal json that
will break your proposal (for example, with an embedded newline in the
whitespace).

It's also unnecessary. CSV format, while not designed for this, is
nevertheless sufficiently flexible to allow successful import of json
data meeting certain criteria (essentially no newlines), like this:
   copy the_table(jsonfield)
   from '/path/to/jsondata'
   csv quote e'\x01' delimiter e'\x02';
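The trick works because bytes 0x01 and 0x02 should never occur in JSON text, so the CSV parser never sees a quote or a delimiter and hands each input line to the json input function untouched, backslashes included. A hypothetical stdin session loading the kind of line that breaks text mode (output abbreviated):

postgres=# create table the_table (jsonfield json);
postgres=# copy the_table(jsonfield) from stdin csv quote e'\x01' delimiter e'\x02';
>> {"user": "root", "msg": "he said \"hi\""}
>> \.
COPY 1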


You aren't the first person to encounter this problem. See
<http://adpgtech.blogspot.com/2014/09/importing-json-data.html>

Maybe we need to add something like this to the docs, or to the wiki.

Note too my comment in that blog post:
   Now this solution is a bit of a hack. I wonder if there's a case for
   a COPY mode that simply treats each line as a single datum. I also
   wonder if we need some more specialized tools for importing JSON,
   possibly one or more Foreign Data Wrappers. Such things could
   handle, say, embedded newline punctuation.


cheers

andrew






Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From
Andrew Dunstan
Date:
On 01/08/2015 03:05 PM, Aaron Botsis wrote:
>
>
>> It's also unnecessary. CSV format, while not designed for this, is
>> nevertheless sufficiently flexible to allow successful import of json
>> data meeting certain criteria (essentially no newlines), like this:
>>
>>   copy the_table(jsonfield)
>>   from '/path/to/jsondata'
>>   csv quote e'\x01' delimiter e'\x02';
>
> While perhaps unnecessary, given the size and simplicity of the patch,
> IMO it’s a no brainer to merge (it actually makes the code smaller by
> 3 lines). It also enables non-json use cases anytime one might want to
> preserve embedded escapes, or use different ones entirely. Do you see
> other reasons not to commit it?


Well, for one thing it's seriously incomplete. You need to be able to
change the delimiter as well. Otherwise, any embedded tab in the json
will cause you major grief.

Currently the delimiter and the escape MUST be a single byte non-nul
character, and there is a check for this in csv mode. Your patch would
allow any arbitrary string (including one of zero length) for the escape
in text mode, and would then silently ignore all but the first byte.
That's not the way we like to do things.
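The single-byte check he refers to is already enforced when CSV mode is used; a hypothetical example of what that looks like today (error text approximate):

postgres=# copy t1 from '/Users/nok/Desktop/queries.json' csv escape 'ab';
ERROR:  COPY escape must be a single one-byte character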

And, frankly, I would need to spend quite a lot more time thinking about
other implications than I have given it so far. This is an area where I
tend to be VERY cautious about making changes. This is a fairly fragile
ecosystem.

cheers

andrew



Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From
Andrew Dunstan
Date:
On 01/08/2015 08:42 PM, Aaron Botsis wrote:
>
>> On Jan 8, 2015, at 3:44 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>
>>
>> On 01/08/2015 03:05 PM, Aaron Botsis wrote:
>>>
>>>> It's also unnecessary. CSV format, while not designed for this, is nevertheless sufficiently flexible to allow successful import of json data meeting certain criteria (essentially no newlines), like this:
>>>>
>>>>   copy the_table(jsonfield)
>>>>   from '/path/to/jsondata'
>>>>   csv quote e'\x01' delimiter e'\x02';
>>> While perhaps unnecessary, given the size and simplicity of the patch, IMO it’s a no brainer to merge (it actually makes the code smaller by 3 lines). It also enables non-json use cases anytime one might want to preserve embedded escapes, or use different ones entirely. Do you see other reasons not to commit it?
>>
>> Well, for one thing it's seriously incomplete. You need to be able to change the delimiter as well. Otherwise, any embedded tab in the json will cause you major grief.
>>
>> Currently the delimiter and the escape MUST be a single byte non-nul character, and there is a check for this in csv mode. Your patch would allow any arbitrary string (including one of zero length) for the escape in text mode, and would then silently ignore all but the first byte. That's not the way we like to do things.
>>
>> And, frankly, I would need to spend quite a lot more time thinking about other implications than I have given it so far. This is an area where I tend to be VERY cautious about making changes. This is a fairly fragile ecosystem.
> Good point.
>
> This version:
>
> * doesn't allow ENCODING in BINARY mode (existing bug)
> * doesn’t allow ESCAPE in BINARY mode
> * makes COPY TO work with escape
> * ensures escape char length is < 2 for text mode, 1 for CSV
>
> Couple more things to realize: setting both the escape and delimiter characters to null won’t be any different than how you fiddled with them in CSV mode. The code paths will be the same because we should never encounter a null in the middle of the string. And even if we did (and the encoding didn’t catch it), we’d treat it just like any other field delimiter or escape character.
>
> So I’m going to do a bit more testing with another patch tomorrow with delimiters removed. If you can think of any specific cases you think will break it let me know and I’ll make sure to add regression tests for them as well.
>
>

Well, I still need convincing that this is the best solution to the
problem. As I said, I need to spend more time thinking about it.

cheers

andrew



Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From
Aaron Botsis
Date:

On Jan 7, 2015, at 2:45 PM, Andrew Dunstan <andrew@dunslane.net> wrote:


On 01/07/2015 08:25 AM, Aaron Botsis wrote:
Hi folks, I was having a problem importing json data with COPY. Lots of things export data nicely as one json blob per line. This is excellent for directly importing into a JSON/JSONB column for analysis.

...Except when there’s an embedded double quote. Or anything that’s escaped. COPY handles this, but by the time the escaped char hits the JSON parser, it's not escaped anymore. This breaks the JSON parsing. This means I need to manipulate the input data to double-escape it. See bug #12320 for an example. Yuck.

I propose this small patch that simply allows specifying COPY … ESCAPE without requiring the CSV parser. It will make it much easier to directly use json formatted export data for folks going forward. This seemed like the simplest route.

This isn't a bug. Neither CSV format nor TEXT format is particularly suitable for json. I'm quite certain I could compose legal json that will break your proposal (for example, with an embedded newline in the whitespace).

Sorry - though I originally reported it as a bug, I didn’t mean to imply it ultimately was one. :) The patch is a feature enhancement.

It's also unnecessary. CSV format, while not designed for this, is nevertheless sufficiently flexible to allow successful import of json data meeting certain criteria (essentially no newlines), like this:

  copy the_table(jsonfield)
  from '/path/to/jsondata'
  csv quote e'\x01' delimiter e'\x02';

While perhaps unnecessary, given the size and simplicity of the patch, IMO it’s a no brainer to merge (it actually makes the code smaller by 3 lines). It also enables non-json use cases anytime one might want to preserve embedded escapes, or use different ones entirely. Do you see other reasons not to commit it?

You aren't the first person to encounter this problem. See <http://adpgtech.blogspot.com/2014/09/importing-json-data.html>

Maybe we need to add something like this to the docs, or to the wiki.

In your post you acknowledged a text mode copy with null escape character would have solved your problem, and to me, that was the intuitive/obvious choice as well. *shrug*

Note too my comment in that blog post:

  Now this solution is a bit of a hack. I wonder if there's a case for
  a COPY mode that simply treats each line as a single datum. I also
  wonder if we need some more specialized tools for importing JSON,
  possibly one or more Foreign Data Wrappers. Such things could
  handle, say, embedded newline punctuation.

Agree. Though https://github.com/citusdata/json_fdw already exists (I haven’t used it). How do folks feel about a COPY mode that reads a single datum until it finds a single-character record terminator alone on a line? Something like:

=# COPY test(data) from stdin terminator '}';
End with a backslash and a period on a line by itself.
>>{
>>  "a": 123, "c": "string",  "b": [1, 2, 3
>> ]
>>}
>>{
>> [1,2,3]
>>}
>>\.
COPY 2

You could also get fancy and support multi-character record terminators which would allow the same thing in a slightly different way:
COPY test(data) from stdin terminator '\n}';


I don’t know how crazy you could get without a lot of rework. It might have to be used in conjunction with a more constrained mode like "COPY…RAW".

Aaron

Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From
Aaron Botsis
Date:

> On Jan 8, 2015, at 3:44 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 01/08/2015 03:05 PM, Aaron Botsis wrote:
>>
>>
>>> It's also unnecessary. CSV format, while not designed for this, is nevertheless sufficiently flexible to allow successful import of json data meeting certain criteria (essentially no newlines), like this:
>>>
>>>  copy the_table(jsonfield)
>>>  from '/path/to/jsondata'
>>>  csv quote e'\x01' delimiter e'\x02';
>>
>> While perhaps unnecessary, given the size and simplicity of the patch, IMO it’s a no brainer to merge (it actually makes the code smaller by 3 lines). It also enables non-json use cases anytime one might want to preserve embedded escapes, or use different ones entirely. Do you see other reasons not to commit it?
>
>
> Well, for one thing it's seriously incomplete. You need to be able to change the delimiter as well. Otherwise, any embedded tab in the json will cause you major grief.
>
> Currently the delimiter and the escape MUST be a single byte non-nul character, and there is a check for this in csv mode. Your patch would allow any arbitrary string (including one of zero length) for the escape in text mode, and would then silently ignore all but the first byte. That's not the way we like to do things.
>
> And, frankly, I would need to spend quite a lot more time thinking about other implications than I have given it so far. This is an area where I tend to be VERY cautious about making changes. This is a fairly fragile ecosystem.

Good point.

This version:

* doesn't allow ENCODING in BINARY mode (existing bug)
* doesn’t allow ESCAPE in BINARY mode
* makes COPY TO work with escape
* ensures escape char length is < 2 for text mode, 1 for CSV

Couple more things to realize: setting both the escape and delimiter characters to null won’t be any different than how you fiddled with them in CSV mode. The code paths will be the same because we should never encounter a null in the middle of the string. And even if we did (and the encoding didn’t catch it), we’d treat it just like any other field delimiter or escape character.

So I’m going to do a bit more testing with another patch tomorrow with delimiters removed. If you can think of any specific cases you think will break it let me know and I’ll make sure to add regression tests for them as well.

Cheers!

Aaron





Attachment

Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From
Aaron Botsis
Date:

On Jan 9, 2015, at 11:37 AM, Andrew Dunstan <andrew@dunslane.net> wrote:


On 01/08/2015 08:42 PM, Aaron Botsis wrote:

On Jan 8, 2015, at 3:44 PM, Andrew Dunstan <andrew@dunslane.net> wrote:


On 01/08/2015 03:05 PM, Aaron Botsis wrote:

It's also unnecessary. CSV format, while not designed for this, is nevertheless sufficiently flexible to allow successful import of json data meeting certain criteria (essentially no newlines), like this:

 copy the_table(jsonfield)
 from '/path/to/jsondata'
 csv quote e'\x01' delimiter e'\x02';
While perhaps unnecessary, given the size and simplicity of the patch, IMO it’s a no brainer to merge (it actually makes the code smaller by 3 lines). It also enables non-json use cases anytime one might want to preserve embedded escapes, or use different ones entirely. Do you see other reasons not to commit it?

Well, for one thing it's seriously incomplete. You need to be able to change the delimiter as well. Otherwise, any embedded tab in the json will cause you major grief.

Currently the delimiter and the escape MUST be a single byte non-nul character, and there is a check for this in csv mode. Your patch would allow any arbitrary string (including one of zero length) for the escape in text mode, and would then silently ignore all but the first byte. That's not the way we like to do things.

And, frankly, I would need to spend quite a lot more time thinking about other implications than I have given it so far. This is an area where I tend to be VERY cautious about making changes. This is a fairly fragile ecosystem.

So I’m going to do a bit more testing with another patch tomorrow with delimiters removed. If you can think of any specific cases you think will break it let me know and I’ll make sure to add regression tests for them as well.

Well, I still need convincing that this is the best solution to the problem. As I said, I need to spend more time thinking about it.

I offered an alternative (RAW mode w/record terminator) that you ignored. So in lieu of a productive discussion about “the best solution to the problem”, I went ahead and continued to complete the patch, since I believe it’s a useful feature enhancement that could be helpful for this and other use cases. There have been multiple times (not even w/json) I wished COPY would stop being so smart.

FWIW (if anyone’s interested in it), I also hacked up some python that reads a json file and outputs a binary file suitable for use with COPY BINARY, which gets around all this stuff. Obviously this only works for json (not jsonb) columns (though you could SELECT INTO a jsonb column). Happy to pass it along.
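For reference, loading a file generated that way would use binary-format COPY, with a cast afterwards if the target ultimately needs to be jsonb (the .bin file name and the second table are hypothetical, reusing the t1/bleh table from the bug report):

postgres=# copy t1(bleh) from '/Users/nok/Desktop/queries.bin' with (format binary);
postgres=# create table t2 as select bleh::jsonb as bleh from t1;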

Aaron

Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From
Robert Haas
Date:
On Thu, Jan 8, 2015 at 8:42 PM, Aaron Botsis <aaron@bt-r.com> wrote:
> This version:
>
> * doesn't allow ENCODING in BINARY mode (existing bug)
> * doesn’t allow ESCAPE in BINARY mode

Hmm, it does look like ENCODING and ESCAPE don't do anything in BINARY
mode, so adding error checks for that makes sense to me.  I wondered
whether the ENCODING option would affect the interpretation of text
data that might be stored in binary-format COPY data, but I can't find
any evidence that it does.

> * makes COPY TO work with escape

I don't see anything horribly wrong with this, but it seems pretty
useless.  I mean, does anyone really want a newline encoded in the
output as %n or !n or qn rather than \n?  What problem are you trying
to solve here?

> * ensures escape char length is < 2 for text mode, 1 for CSV

So, is the idea here that we're going to have no escape character at
all?  So, the characters that would require the use of an escape
character to represent simply can't be represented at all, but we're
OK with that, because the input data doesn't contain any such
characters?

One general problem with this patch is that it doesn't update the
documentation, which makes it a bit harder than it might be to see
what you're trying to accomplish.

Also, if you want to decrease the chances that this patch will get
forgotten about, you can register it here:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company