Re: Fast insertion indexes: why no developments - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Fast insertion indexes: why no developments
Msg-id CAMkU=1xtdHLwLQHJ2NOQEaNGPFjw2FtNiAX9MZ2n1i+YtphCrA@mail.gmail.com
In response to Re: Fast insertion indexes: why no developments  (Leonardo Francalanci <m_lists@yahoo.it>)
List pgsql-hackers
On Tue, Nov 5, 2013 at 9:52 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
> Jeff Janes wrote:
>> Some experiments I did a few years ago showed that applying sorts to the
>> data to be inserted could be helpful even when the sort batch size was as
>> small as one tuple per 5 pages of existing index.  Maybe even less.
>
> Cool!!! Do you have any idea/hint on how I could try and replicate that?
> Do you remember how you did it?

I can't find my notes, but I remember more or less how I did it.

Since we don't yet have an insertion buffer that allows the rows to be sorted in a different order for each index, I had to simulate it with a table that has a single index, and hope that the results would extrapolate.

create table foo (x bigint);

To speed things up, you may want to prepopulate this with random data so that the size of the index-to-be will exceed shared_buffers, or physical RAM, before you create the index.  Also, the effectiveness might depend on how much the index has grown since its creation: leaf pages start out with their physical order correlated to their logical order, and that correlation decays as the index grows.  So you may want to try different initial seed sizes.
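For example, a seed along these lines should work; the 100 million row count here is only an assumption, so scale it to your own shared_buffers and RAM:

-- seed with enough random rows that the index-to-be will not fit in cache
insert into foo select (random()*1e8)::bigint from generate_series(1, 100000000);

The key range matches the int(rand()*1e8) used by the generator below.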

create index on foo (x);

Then I use perl to make run-sorted data with different run sizes, and load it via \copy.  I put all the data points in memory up front rather than generating them per run on the fly, so that perl consumes about the same amount of memory regardless of the run size.  You would want to use more than 1..1e6 values if you are on a very large RAM machine.

Something like:

for run_size in 1 10 100 1000 10000 100000; do
  perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e6; while (@x) {print foreach sort {$a<=>$b} splice @x,0,'$run_size'; }' | time psql -c '\copy foo from stdin'
done

But you probably want another, inner loop, so that the \copy gets executed multiple times per run_size and each run_size runs for at least a couple of checkpoint cycles.
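A minimal sketch of that, assuming ten repetitions per run size is enough to span a few checkpoints (tune the count to your checkpoint settings):

for run_size in 1 10 100 1000 10000 100000; do
  for rep in $(seq 1 10); do    # assumed repetition count; size it to cover several checkpoint cycles
    perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e6; while (@x) {print foreach sort {$a<=>$b} splice @x,0,'$run_size'; }' | time psql -c '\copy foo from stdin'
  done
done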

Cheers,

Jeff
