Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes
Date
Msg-id 54AD8CEF.3080904@dunslane.net
In response to Patch: [BUGS] BUG #12320: json parsing with embedded double quotes  (Aaron Botsis <aaron@bt-r.com>)
Responses Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes
List pgsql-hackers

On 01/07/2015 08:25 AM, Aaron Botsis wrote:
> Hi folks, I was having a problem importing json data with COPY. Lots
> of things export data nicely as one json blob per line. This is
> excellent for directly importing into a JSON/JSONB column for analysis.
>
> ...Except when there’s an embedded doublequote. Or anything that’s
> escaped. COPY handles this, but by the time the escaped char hit the
> JSON parser, it's not escaped anymore. This breaks the JSON parsing.
> This means I need to manipulate the input data to double-escape it.
> See bug #12320 for an example. Yuck.
>
> I propose this small patch that simply allows specifying COPY … ESCAPE
> without requiring the CSV parser. It will make it much easier to
> directly use json formatted export data for folks going forward. This
> seemed like the simplest route.
>
> Usage is simply:
>
> postgres=# copy t1 from '/Users/nok/Desktop/queries.json';
> ERROR:  invalid input syntax for type json
> DETAIL:  Token "root" is invalid.
> CONTEXT:  JSON data, line 1: ...1418066241619 AND <=1418671041621) AND
> user:"root...
> COPY t1, line 3, column bleh:
> "{"timestamp":"2014-12-15T19:17:32.505Z","duration":7.947,"query":{"query":{"filtered":{"filter":{"qu..."
> postgres=# copy t1 from '/Users/nok/Desktop/queries.json' escape '';
> COPY 1966
>
>


This isn't a bug. Neither CSV format nor TEXT format is particularly
suitable for json. I'm quite certain I could compose legal json that
would break your proposal (for example, json with an embedded newline in
the whitespace.)
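The escape-processing problem can be illustrated with a short sketch. This is a rough Python simulation of what happens to a TEXT-format line, not PostgreSQL's actual unescaping code: a JSON string containing an escaped double quote loses its backslash before the JSON parser sees it, and parsing then fails.

```python
import json

# A JSON document with an embedded double quote inside a string value.
original = json.dumps({"user": 'ro"ot'})   # {"user": "ro\"ot"}

# COPY's TEXT format treats backslash as an escape character, so by the
# time the line reaches the JSON parser the backslash is gone.  This
# replace() is a crude simulation of that unescaping step:
unescaped = original.replace('\\"', '"')   # {"user": "ro"ot"}

# The stray quote now terminates the string early and parsing fails.
try:
    json.loads(unescaped)
except json.JSONDecodeError:
    print("JSON parsing broke after unescaping")
```

This is why the escaped characters have to be double-escaped before a TEXT-format import works.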

It's also unnecessary. CSV format, while not designed for this, is
nevertheless sufficiently flexible to allow successful import of json
data meeting certain criteria (essentially no newlines), like this:
   copy the_table(jsonfield) from '/path/to/jsondata' csv quote e'\x01' delimiter e'\x02';
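Since that trick silently assumes the data never contains the chosen quote/delimiter bytes, a pre-flight check along these lines can be run first (a hypothetical helper for illustration, not part of anything proposed here):

```python
# Illustrative pre-flight check (hypothetical helper, not from the patch):
# the CSV workaround above assumes no line of the input contains the
# chosen quote/delimiter bytes (\x01, \x02); each physical line is then
# imported as one JSON datum.
def copy_csv_safe(path):
    """Return (ok, first_bad_lineno) for a one-JSON-per-line file."""
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, 1):
            if b"\x01" in line or b"\x02" in line:
                return False, lineno
    return True, None
```

Embedded newlines still break the one-datum-per-line assumption, of course; the check only guards the quote and delimiter bytes.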


You aren't the first person to encounter this problem. See
<http://adpgtech.blogspot.com/2014/09/importing-json-data.html>

Maybe we need to add something like this to the docs, or to the wiki.

Note too my comment in that blog post:
   Now this solution is a bit of a hack. I wonder if there's a case for
   a COPY mode that simply treats each line as a single datum. I also
   wonder if we need some more specialized tools for importing JSON,
   possibly one or more Foreign Data Wrappers. Such things could
   handle, say, embedded newline punctuation.


cheers

andrew





