Re: Running lots of inserts from selects on 9.4.5

From: Dan Langille
Subject: Re: Running lots of inserts from selects on 9.4.5
Msg-id: 6CB3CC7F-859F-4550-983C-074F3617A90F@langille.org
In response to: Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
Responses: Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
List: pgsql-performance

Tree view

Running lots of inserts from selects on 9.4.5  (Dan Langille)
 Re: Running lots of inserts from selects on 9.4.5  (Jeff Janes)
  Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
   Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
    Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
     Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)

> On Feb 10, 2016, at 5:13 AM, Dan Langille <> wrote:
>
>> On Feb 10, 2016, at 2:47 AM, Jeff Janes <> wrote:
>>
>> On Tue, Feb 9, 2016 at 4:09 PM, Dan Langille <> wrote:
>>> I have a wee database server which regularly tries to insert 1.5 million or
>>> even 15 million new rows into a 400-million-row table.  Sometimes these
>>> inserts take hours.
>>>
>>> The actual query that produces the join is fast.  It's the insert which is
>>> slow.
>>>
>>> INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5,
>>> DeltaSeq)
>>> SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
>>> Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
>>> batch_testing.DeltaSeq
>>>   FROM batch_testing JOIN Path     ON (batch_testing.Path = Path.Path)
>>>                      JOIN Filename ON (batch_testing.Name =
>>> Filename.Name);
>>>
>>> This is part of the plan: http://img.ly/images/9374145/full  created via
>>> http://tatiyants.com/pev/#/plans
>>>
>>> This gist contains postgresql.conf, zfs settings, slog, disk partitions.
>>>
>>>  https://gist.github.com/dlangille/33331a8c8cc62fa13b9f
>>
>> The table you are inserting into has 7 indexes, all of which have to
>> be maintained.  The index on the sequence column should be efficient
>> to maintain.  But for the rest, if the inserted rows are not naturally
>> ordered by any of the indexed columns, then each insert ends up reading 6
>> randomly scattered leaf pages in order to insert the row pointers.  If none
>> of those pages are in memory, they will be slow to read off the
>> hdd one at a time.  You are also going to dirty all of those scattered
>> pages, and they will be slow to write back to hdd because there
>> probably won't be much opportunity for write-combining.
>>
>> Do you really need all of those indexes?
>>
>> Won't the index on (jobid, pathid, filenameid) serve any query that
>> (jobid) does, so you can get rid of the latter?
>>
>> And unless you have range queries on fileindex, like "where jobid = 12
>> and fileindex between 4 and 24" then you should be able to replace
>> (jobid, fileindex) with (fileindex,jobid) and then get rid of the
>> stand-alone index on (fileindex).
>>
>> If you add an "order by" to the select statement that orders by the
>> fields of one of the remaining indexes, then maintenance of that
>> index becomes much cheaper.
>
> I will make these changes one-by-one and test each.  This will be interesting.
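
For reference, the index changes suggested above might look something like this (the index names here are my guesses; the actual names on the File table may differ):

```sql
-- Assumed index names; check pg_indexes for the real ones.

-- (jobid, pathid, filenameid) already serves queries on jobid alone,
-- so the stand-alone jobid index can go:
DROP INDEX file_jobid_idx;

-- Replace (jobid, fileindex) with (fileindex, jobid) so it can also
-- serve queries on fileindex alone, then drop the stand-alone index:
DROP INDEX file_jobid_fileindex_idx;
CREATE INDEX file_fileindex_jobid_idx ON File (FileIndex, JobId);
DROP INDEX file_fileindex_idx;
```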

On a test server, the original insert takes about 45 minutes.  With all indexes removed: 25 minutes.
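
The remaining suggestion to try is ordering the select to match one of the remaining indexes.  A sketch, assuming the (JobId, PathId, FilenameId) index is kept:

```sql
INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq)
SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
       Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
       batch_testing.DeltaSeq
  FROM batch_testing JOIN Path     ON (batch_testing.Path = Path.Path)
                     JOIN Filename ON (batch_testing.Name = Filename.Name)
 ORDER BY batch_testing.JobId, Path.PathId, Filename.FilenameId;
```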

Thank you.

--
Dan Langille - BSDCan / PGCon
