Re: large numbers of inserts out of memory strategy - Mailing list pgsql-general

From Christopher Browne
Subject Re: large numbers of inserts out of memory strategy
Msg-id CAFNqd5WJR8Qmu-XGW52VofdhMxB4NcuYdxmSysJonm1ZHCS71A@mail.gmail.com
In response to Re: large numbers of inserts out of memory strategy  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List pgsql-general
On 30 November 2017 at 05:22, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> On 2017-11-29 08:32:02 -0600, Ted Toth wrote:
>> Yes I did generate 1 large DO block:
>>
>> DO $$
>> DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
>> BEGIN
>> INSERT INTO thing
>> (ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
>> ('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
>> INTO thingid,thingrec;
>> INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
>> INSERT INTO thingstatus
>> (thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
>> VALUES (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
>> INSERT INTO thinger
>> (thing_id,spe_key,cse_unc,lat_spd,cov,dtg,lng,spd,ave_spd,cse,tol,nrpts,lat,alpha,sigma,spd_unc,lng_spd)
>> VALUES (thingid,-1,0.0,-6.58197336634e-08,'{4.27624291532e-09,0.0,3.07802916488e-09,0.0,4.27624291532e-09,0.0,3.07802916488e-09,4.16110417234e-08,0.0,4.16110417234e-08}',1509459163,2.21596980095,0.000226273215958,1.0,0.0,0.10000000149,121,0.584555745125,10.0,4.23079740131e-08,0.0,-2.49999881907e-10);
>> INSERT INTO thingdata
>> (thing_id,category,db_num,xref,org_type,trademark,shortname,fcode,platform,callsign,type,orig_xref,shipclass,home_base,uic,service,di,lngfixed,hull,precision,alert,flag,besufx,name,mmsi,catcode,ntds,imo,pn_num,chxref,threat,sconum,latfixed,db_type,pif,echelon,jtn,quantity,overwrite)
>> VALUES (thingid,'XXX','','','','','004403704','23','','','','','UNEQUATED','','','','',0.0,'','{0,0,0,0,0}','','KS','','UNKNOWN','004403704','','','','','','AFD','',0.0,3,'','',0,0,0)
>> RETURNING id INTO thingdataid;
>> INSERT INTO thingnum (thingdata_id,thgnum,state,dtg,cmd) VALUES
>> (thingdataid,'013086',0,1502970401,'FOO');
>>
>> <repeated for each thing>
>>
>> END $$;
>>
>> Should I limit the number of 'thing' inserts within a DO block, or
>> wrap each 'thing' insert in its own DO block?
>
> I would suggest getting rid of the do block entirely if that is
> possible. Just create lots of insert statements. You can get the current
> value of a sequence with currval('sequence_name').
>
> Alternatively, or in addition, since you are using Python, you might
> want to insert directly into the database from Python using psycopg2.
> For separate insert statements that should give about the same
> performance. (It is usually much faster to write to a CSV file and load
> that with COPY than to insert each row, but you aren't doing that, and
> it might be difficult in your case.)
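
To make that concrete, here's a minimal sketch of one "thing" written as
plain INSERT statements, per Peter's suggestion.  The sequence names
(thing_id_seq, thingdata_id_seq) and the shortened column lists are
assumptions; adjust them to the real schema.  Each statement is parsed
and freed on its own, so there's no single huge construct for the server
to hold in memory:

INSERT INTO thing (ltn,classification,machine,source,thgrec,flags,serial,type)
VALUES ('T007336','THING',0,1025,7336,7,'XXX869977564',1);

-- currval() keeps returning the id generated just above for the rest of
-- this session, so the child rows don't need RETURNING ... INTO.
INSERT INTO recnum_thing (recnum, thing_id)
VALUES (7336, currval('thing_id_seq'));

INSERT INTO thingdata (thing_id, category, shortname)
VALUES (currval('thing_id_seq'), 'XXX', '004403704');

INSERT INTO thingnum (thingdata_id, thgnum, state, dtg, cmd)
VALUES (currval('thingdata_id_seq'), '013086', 0, 1502970401, 'FOO');

Wrapping the whole file in a single BEGIN/COMMIT keeps it one
transaction, just as the DO block did.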

A long time ago (pre version 2.2), Slony-I used to do this, applying
replicated changes as one big stream of INSERT, UPDATE, and DELETE
statements.

This worked reasonably well as long as tuples weren't very large, but we
got reports from people whose applications had very large tuples and who
ran into out-of-memory conditions when a batch of such INSERTs was
processed together.

The "fix" we initially did was to have the bulk processing stream
hold off on any Very Large Tuples, so that when we'd encounter
them, we'd process the big tuples one by one.  This worked
relatively OK, but meant that we'd be switching logic every time
there was big data, and there was a pretty painful amount of
configuration to force people to worry about.

The "2.2" change was to switch to COPY-based streaming.  In
our case, we put a trigger onto the single log table and have that
initiate tossing data as it came in into the appropriate target table.

That introduced an Amount Of Complexity (e.g. a rather complex stored
function in C/SPI), but it's notable that turning the process into COPY
streaming got us a pretty big performance boost as well as completely
eliminating the memory allocation worries.
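
Just to illustrate the shape of that approach (this is emphatically not
Slony's actual C/SPI function; the table, column, and function names here
are made up), a trigger on a single log table can route rows into their
target tables as COPY streams them in:

CREATE TABLE replication_log (
    target_table text  NOT NULL,
    row_data     jsonb NOT NULL
);

CREATE FUNCTION dispatch_log_row() RETURNS trigger AS $f$
BEGIN
    -- Rebuild a row of the named target table from the jsonb payload
    -- and insert it there instead.
    EXECUTE format(
        'INSERT INTO %I SELECT * FROM jsonb_populate_record(NULL::%I, $1)',
        NEW.target_table, NEW.target_table)
      USING NEW.row_data;
    RETURN NULL;  -- keep nothing in the log table itself
END;
$f$ LANGUAGE plpgsql;

CREATE TRIGGER dispatch_on_copy
    BEFORE INSERT ON replication_log
    FOR EACH ROW EXECUTE PROCEDURE dispatch_log_row();

The loader then issues a single
COPY replication_log (target_table, row_data) FROM STDIN;
and the trigger fans the rows out, one by one, on the server side.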

It may be challenging to get a suitable set of COPY requests to do
what is necessary.  (It sure would be cool if it could just be one...)
But there are three crucial things I'd observe:
a) Loading data via COPY is *way* faster,
b) Parsing INSERT statements is a *big* slowdown,
c) Parsing INSERT statements means that those statements must
be drawn into memory, and that chews a lot of memory if the query
has very large attributes.  (COPY doesn't chew that memory!)
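
As a small illustration of (a) and (c), the same "thing" data can be fed
as one COPY per target table, so the server never parses or buffers giant
INSERT statements.  The column lists are abbreviated, and the id values
here are placeholders assigned by the generating script (e.g. from a
pre-fetched sequence range); arranging that is exactly what makes "a
suitable set of COPY requests" the challenging part:

COPY thing (id, ltn, classification, machine, source, thgrec, flags, serial, type) FROM STDIN;
100001	T007336	THING	0	1025	7336	7	XXX869977564	1
\.

COPY recnum_thing (recnum, thing_id) FROM STDIN;
7336	100001
\.

(Fields in COPY's default text format are tab-separated.)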

At one point, Jan Wieck was looking into an alternative COPY
protocol that would have allowed more actions, notably:

- You could specify multiple tables to stream into, and switch between them on a tuple-by-tuple basis.

- You could specify actions of INSERT, UPDATE, or DELETE.

It seemed like a really interesting idea at the time; it was intended
to be particularly useful for Slony, but some folks designing data
warehouse ETL systems observed that it would be useful to them,
too.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

