Re: json api WIP patch - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: json api WIP patch
Date
Msg-id 50EC342B.2000106@dunslane.net
Whole thread Raw
In response to Re: json api WIP patch  (james <james@mansionfamily.plus.com>)
Responses Re: json api WIP patch
Re: json api WIP patch
List pgsql-hackers
On 01/08/2013 01:45 AM, james wrote:
>> The processing functions have been extended to provide 
>> populate_record() and populate_recordset() functions.The latter in 
>> particular could be useful in decomposing a piece of json 
>> representing an array of flat objects (a fairly common pattern) into 
>> a set of Postgres records in a single pass.
>
> So this would allow an 'insert into ... select ... from 
> <unpack-the-JSON>(...)'?

Yes.

>
> I had been wondering how to do such an insertion efficiently in the 
> context of SPI, but it seems that there is no SPI_copy equiv that 
> would allow a query parse and plan to be avoided.

Your query above would need to be planned too, although the plan will be 
trivial.

>
> Is this mechanism likely to be as fast as we can get at the moment in 
> contexts where copy is not feasible?
>

You should not try to use it as a general bulk load facility. And it 
will not be as fast as COPY for several reasons, including that the Json 
parsing routines are necessarily much heavier than the COPY parse 
routines, which have in any case been optimized over quite a long 
period. Also, a single json datum is limited to no more than 1Gb. If you 
have such a datum, parsing it involves having it in memory and then 
taking a copy (I wonder if we could avoid that step - will take a look). 
Then each object is decomposed into a hash table of key value pairs, 
which it then used to construct the record datum. Each field name  in 
the result record is used to look up the value in the hash table - this 
happens once in the case of populate_record() and once per object in the 
array in the case of populate_recordset(). In the latter case the 
resulting records are put into a tuplestore structure (which spills to 
disk if necessary) which is then returned to the caller when all the 
objects in the json array are processed. COPY doesn't have these sorts 
of issues. It knows without having to look things up where each datum is 
in each record, and it stashes the result straight into the target 
table. It can read and insert huge numbers of rows without significant 
memory implications.

Both these routines and COPY in non-binary mode use the data type input 
routines to convert text values. In some cases (very notably timestamps) 
these routines can easily be shown to be fantastically expensive 
compared to binary input. This is part of what has led to the creation 
of utilities like pg_bulkload.

Perhaps if you give us a higher level view of what you're trying to 
achieve we can help you better.

cheers

andrew






pgsql-hackers by date:

Previous
From: Claudio Freire
Date:
Subject: Re: proposal: Set effective_cache_size to greater of .conf value, shared_buffers
Next
From: Amit Kapila
Date:
Subject: Re: Extra XLOG in Checkpoint for StandbySnapshot