Re: Running lots of inserts from selects on 9.4.5

From: Dan Langille
Subject: Re: Running lots of inserts from selects on 9.4.5
Date:
Msg-id: E4396D5A-BA28-4534-B6E0-D72BB53E7DFC@langille.org
In response to: Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
Responses: Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
List: pgsql-performance

Tree view

Running lots of inserts from selects on 9.4.5  (Dan Langille)
 Re: Running lots of inserts from selects on 9.4.5  (Jeff Janes)
  Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
   Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
    Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)
     Re: Running lots of inserts from selects on 9.4.5  (Dan Langille)


On Feb 11, 2016, at 4:41 PM, Dan Langille <> wrote:

On Feb 10, 2016, at 5:13 AM, Dan Langille <> wrote:

On Feb 10, 2016, at 2:47 AM, Jeff Janes <> wrote:

On Tue, Feb 9, 2016 at 4:09 PM, Dan Langille <> wrote:
I have a wee database server which regularly tries to insert 1.5 million or
even 15 million new rows into a 400 million row table.  Sometimes these
inserts take hours.

The actual query that produces the join is fast.  It's the insert which is
slow.

INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq)
SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
       Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
       batch_testing.DeltaSeq
  FROM batch_testing JOIN Path     ON (batch_testing.Path = Path.Path)
                     JOIN Filename ON (batch_testing.Name = Filename.Name);

This is part of the plan: http://img.ly/images/9374145/full  created via
http://tatiyants.com/pev/#/plans

This gist contains postgresql.conf, zfs settings, slog, disk partitions.

https://gist.github.com/dlangille/33331a8c8cc62fa13b9f

The table you are inserting into has 7 indexes, all of which have to
be maintained.  The index on the sequence column should be efficient
to maintain.  But for the rest, if the inserted rows are not naturally
ordered by any of the indexed columns, then each insert ends up reading 6
randomly scattered leaf pages in order to insert the row pointers.  If none
of those pages are in memory, that is going to be slow to read off
HDD one page at a time.  Also, you are going to dirty all of those scattered
pages, and they will be slow to write back to HDD because there
probably won't be much opportunity for write-combining.
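To see how much index maintenance a bulk insert will trigger, the catalog can list every index on the target table.  This is a generic catalog query, not something from the thread; the table name is taken from the INSERT above:

```sql
-- List all indexes currently defined on the File table.
-- (PostgreSQL folds unquoted identifiers to lower case.)
SELECT indexname, indexdef
  FROM pg_indexes
 WHERE tablename = 'file';
```

Each row returned is one B-tree that must be updated for every inserted tuple.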

Do you really need all of those indexes?

Won't the index on (jobid, pathid, filenameid) service any query that
the index on (jobid) does, so you can get rid of the latter?
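That suggestion amounts to dropping the single-column index, since a B-tree on (jobid, pathid, filenameid) can satisfy any equality lookup on its leading column jobid alone.  A sketch, with a hypothetical index name (the real name would come from the pg_indexes listing):

```sql
-- Hypothetical name: the stand-alone index on (jobid) is redundant
-- with the composite (jobid, pathid, filenameid) index for jobid lookups.
DROP INDEX IF EXISTS file_jobid_idx;
```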

And unless you have range queries on fileindex, like "where jobid = 12
and fileindex between 4 and 24", you should be able to replace
(jobid, fileindex) with (fileindex, jobid) and then get rid of the
stand-alone index on (fileindex).
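In DDL terms, that swap could look like the following; the index names here are hypothetical, not from the thread:

```sql
-- Swap the column order so the composite index can also serve
-- lookups on fileindex alone, making the stand-alone index redundant.
BEGIN;
DROP INDEX IF EXISTS file_jobid_fileindex_idx;  -- old (jobid, fileindex)
DROP INDEX IF EXISTS file_fileindex_idx;        -- stand-alone (fileindex)
CREATE INDEX file_fileindex_jobid_idx ON file (fileindex, jobid);
COMMIT;
```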

If you add an ORDER BY to the SELECT statement which orders by the
fields of one of the remaining indexes, then you can make the
maintenance of that index much cheaper.
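Applied to the original query, that change is just an added ORDER BY clause; the sort order shown here assumes a remaining index on (jobid, fileindex):

```sql
-- Sorting the rows to match an index means that index's leaf pages
-- are touched in order during the insert instead of at random.
INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq)
SELECT batch_testing.FileIndex, batch_testing.JobId, Path.PathId,
       Filename.FilenameId, batch_testing.LStat, batch_testing.MD5,
       batch_testing.DeltaSeq
  FROM batch_testing JOIN Path     ON (batch_testing.Path = Path.Path)
                     JOIN Filename ON (batch_testing.Name = Filename.Name)
 ORDER BY batch_testing.JobId, batch_testing.FileIndex;
```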

I will make these changes one-by-one and test each.  This will be interesting.

On a test server, the original insert takes about 45 minutes.  I removed all indexes.  25 minutes.
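The test above follows the usual bulk-load pattern: drop the non-essential indexes, run the load, then rebuild each index in a single pass.  A sketch with a hypothetical index name:

```sql
-- Rebuilding an index once after the load is far cheaper than
-- maintaining it row-by-row during millions of inserts.
DROP INDEX IF EXISTS file_jobid_pathid_filenameid_idx;
-- ... run the big INSERT ... SELECT here ...
CREATE INDEX file_jobid_pathid_filenameid_idx
    ON file (jobid, pathid, filenameid);
```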

Thank you.

Today I tackled the production server.  After discussion on the Bacula devel mailing list (http://marc.info/?l=bacula-devel&m=145537742804482&w=2)
I compared my schema to the stock schema provided with Bacula.  Yes, I found 
extra indexes.  I saved the existing schema and proceeded to remove the indexes
from prod not found in the default.

The query time went from 223 minutes to 4.5 minutes.  That is about 50 times faster.

I think I can live with that. :)

Jeff: if you show up at PGCon, dinner is on me.  Thank you.

-- 
Dan Langille - BSDCan / PGCon



