Re: large numbers of inserts out of memory strategy - Mailing list pgsql-general

From: Ted Toth
Subject: Re: large numbers of inserts out of memory strategy
Msg-id: CAFPpqQEuDCK4s=rBRvoz4V1d1FKaTM5-0UYOQpUGj30oZ6zksg@mail.gmail.com
In response to: Re: large numbers of inserts out of memory strategy ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List: pgsql-general
On Thu, Nov 30, 2017 at 11:40 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> On 2017-11-30 08:43:32 -0600, Ted Toth wrote:
>>
>> On Thu, Nov 30, 2017 at 4:22 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>> > On 2017-11-29 08:32:02 -0600, Ted Toth wrote:
>> >> Yes I did generate 1 large DO block:
>> >>
>> >> DO $$
>> >> DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
>> >> BEGIN
>> >> INSERT INTO thing
>> >> (ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
>> >> ('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
>> >> INTO thingid,thingrec;
>> >> INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
>> >> INSERT INTO thingstatus
>> >>
>> >> (thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
>> >> VALUES (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
> [...]
>> >>
>> >> <repeated for each thing>
>> >>
>> >> END $$;
>> >>
>> >> Should I limit the number of 'thing' inserts within a DO block, or
>> >> wrap each 'thing' insert in its own DO block?
>> >
>>
>> Thanks for the specific suggestions.
>>
>> > I would suggest getting rid of the do block entirely if that is
>> > possible. Just create lots of insert statements. You can get the current
>> > value of a sequence with currval('sequence_name').
>>
>> What is the downside of using a DO block?
>
> As you discovered, the downside is that the whole do block needs
> to be parsed before it can be executed. If you generate a huge do block,
> that can be an issue. If you create lots of little do blocks, the
> overhead is probably not noticeable.
>
> But I'd like to pose the question the other way around: What are the
> advantages of using a do block? If there aren't any, I wouldn't use it
> (keep it simple). One advantage is that you can use variables. But I
> don't think you need that. Other possible advantages could be speed or
> transactions: I don't know if these exist.
>
>> I'd have to do a nextval on each sequence before I could use currval,
>> right?
>
> That happens automatically if you insert the default value into a serial
> column. You seem to do that here:
>
>> >> INSERT INTO thing (ltn,classification,machine,source,thgrec,flags,serial,type)
>> >> VALUES ('T007336','THING',0,1025,7336,7,'XXX869977564',1)
>> >> RETURNING id,thgrec INTO thingid,thingrec;
>
> The column id is probably defined as serial (or explicitly associated
> with a sequence). So you can get the value you just inserted with
> currval('thing_id_seq') (Check the table definition to get the sequence
> name, but unless you explicitly specified the sequence name that's
> it).
>
> So you could ditch the "RETURNING ..." clause and replace the next
> statement with:
>
>     INSERT INTO recnum_thing (recnum, thing_id) VALUES (7336, currval('thing_id_seq'));

I only used the DO block to get the DECLARE for the variables, so I'll
change it to use this method instead, thanks.
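
A sketch of that rewrite (the sequence name thing_id_seq is an
assumption based on the default serial naming convention; check the
table definition to confirm it):

```sql
-- Plain statements instead of a DO block: the serial column draws its
-- own id, and currval() returns the id just assigned in this session.
INSERT INTO thing (ltn,classification,machine,source,thgrec,flags,serial,type)
VALUES ('T007336','THING',0,1025,7336,7,'XXX869977564',1);

-- thgrec (7336) is already known to the generator, so no variable is needed.
INSERT INTO recnum_thing (recnum, thing_id)
VALUES (7336, currval('thing_id_seq'));

-- Likewise in the thingstatus insert, use currval('thing_id_seq') for
-- thing_id in place of the thingid variable.
```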

>
> And similar for the other statements where you use thingid and
> thingdataid (thingrec is already known as far as I can see).
>
> Note that this doesn't work if you need two values from the same
> sequence. So you can't replace
>
>     insert into thing(..) values(...) returning id as parent_id;
>     insert into thing(..) values(...) returning id as child_id;
>     insert into structure(parent, child) values(parent_id, child_id);
>
> in the same way. But you don't seem to need anything like that.
>
>
>> One thing that is unclear to me is when commits occur while using
>> psql. Would you know where in the docs I can find information on
>> this subject?
>
> By default psql enables autocommit which causes an implicit commit after
> every statement. With a do block I'm not sure whether that means after
> the do block or after each statement within the do block. I'd just turn
> autocommit off and add explicit commits wherever I wanted them.

So you think I can turn off autocommit, put BEGIN/COMMITs in a large
file, and then the postmaster won't have to parse the whole thing when
I feed it in via psql? I did change my generator to create multiple
files with limited numbers of inserts. Somebody had mentioned that
chunking it this way would make it easier to tell when and where
something went wrong, which I thought was a good idea.
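
For what it's worth, one generated chunk could look like this (a
sketch using the statements from earlier in the thread; the file name
is made up):

```sql
-- chunk_0001.sql: explicit transaction around one batch of inserts.
-- With an explicit BEGIN, autocommit no longer applies: each statement
-- is parsed and executed one at a time, but nothing is committed
-- until the COMMIT at the end.
BEGIN;
INSERT INTO thing (ltn,classification,machine,source,thgrec,flags,serial,type)
VALUES ('T007336','THING',0,1025,7336,7,'XXX869977564',1);
INSERT INTO recnum_thing (recnum, thing_id)
VALUES (7336, currval('thing_id_seq'));
-- ... more inserts for this chunk ...
COMMIT;
```

Feeding each file with `psql -v ON_ERROR_STOP=1 -f chunk_0001.sql`
makes psql stop at the first error, so a failure identifies both the
chunk and the statement.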

>
>
>> > Alternately or in addition, since you are using python, you might want
>> > to insert directly into the database from python using psycopg2. For
>> > separate insert statements that should have about the same performance.
>> > (It is usually much faster to write to a csv file and load that with
>> > copy than to insert each row, but you don't do that and it might be
>> > difficult in your case).
>>
>> Yes, I thought about generating csv files but didn't see a way to deal
>> with the foreign keys.
>
> Still: Is there a reason why you use a python script to create an sql
> script instead of directly issuing the sql queries from your python
> script?

I already had code that generated JSON, so it was relatively easy to
add code and a command-line arg to generate SQL instead. Theoretically
this will be a one-time offline operation, and performance is not a
big factor. Also, my target machine will be significantly beefier.
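
On the earlier csv/foreign-key point: one common workaround (a sketch,
not something settled in this thread; the file paths are placeholders)
is to have the generator assign the ids itself, so both csv files can
be loaded with COPY:

```sql
-- The generator writes thing.csv with explicit ids (1, 2, 3, ...) so
-- recnum_thing.csv can reference them directly as foreign keys.
COPY thing (id, ltn, classification, machine, source, thgrec, flags, serial, type)
FROM '/path/to/thing.csv' WITH (FORMAT csv);

COPY recnum_thing (recnum, thing_id)
FROM '/path/to/recnum_thing.csv' WITH (FORMAT csv);

-- Afterwards, advance the sequence past the highest assigned id so
-- later serial inserts don't collide.
SELECT setval('thing_id_seq', (SELECT max(id) FROM thing));
```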

>
>         hp
>
> --
>    _  | Peter J. Holzer    | we build much bigger, better disasters now
> |_|_) |                    | because we have much more sophisticated
> | |   | hjp@hjp.at         | management tools.
> __/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

