Re: Fast insertion indexes: why no developments - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Fast insertion indexes: why no developments
Msg-id CAMkU=1xtdHLwLQHJ2NOQEaNGPFjw2FtNiAX9MZ2n1i+YtphCrA@mail.gmail.com
In response to Re: Fast insertion indexes: why no developments  (Leonardo Francalanci <m_lists@yahoo.it>)
List pgsql-hackers
On Tue, Nov 5, 2013 at 9:52 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
> Jeff Janes wrote:
>> Some experiments I did a few years ago showed that applying sorts to the
>> data to be inserted could be helpful even when the sort batch size was as
>> small as one tuple per 5 pages of existing index.  Maybe even less.
>
> Cool!!! Do you have any idea/hint on how I could try and replicate that?
> Do you remember how you did it?

I can't find my notes, but I remember more or less how I did it.

Since we don't yet have an insertion buffer that allows the rows to be sorted in a different order for each index, I had to simulate it with a table that has a single index, and hope that the results would extrapolate.

create table foo (x bigint);

To speed things up, you may want to prepopulate this with random data so that the size of the index-to-be will exceed shared_buffers, or physical RAM, before you create the index.  Also, the effectiveness might depend on how much the index has grown since its creation: leaf pages start out with their physical order correlated to their logical order, and that correlation decays as the index grows.  So you may want to try different initial seed sizes.
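For example, a seed along these lines should work; the 100 million row count here is only an assumption, so scale it to your own shared_buffers and RAM:

-- seed with enough random rows that the index-to-be will not fit in cache
insert into foo select (random()*1e8)::bigint from generate_series(1, 100000000);

The key range matches the int(rand()*1e8) used by the generator below.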

create index on foo (x);

Then I use perl to make run-sorted data with different run sizes, and load it via \copy.  I put all the data points in memory up front rather than generating them per run on the fly, so that perl consumes about the same amount of memory regardless of the run size.  You would want to use more than 1..1e6 values if you are on a very large RAM machine.

Something like:

for run_size in 1 10 100 1000 10000 100000; do
  perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e6; while (@x) {print foreach sort {$a<=>$b} splice @x,0,'$run_size'; }' | time psql -c '\copy foo from stdin'
done

But you probably want another, inner loop, so that the \copy gets executed multiple times per run_size and each run_size runs for at least a couple of checkpoint cycles.
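A minimal sketch of that, assuming ten repetitions per run size is enough to span a few checkpoints (tune the count to your checkpoint settings):

for run_size in 1 10 100 1000 10000 100000; do
  for rep in $(seq 1 10); do    # assumed repetition count; size it to cover several checkpoint cycles
    perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e6; while (@x) {print foreach sort {$a<=>$b} splice @x,0,'$run_size'; }' | time psql -c '\copy foo from stdin'
  done
done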

Cheers,

Jeff
