Re: Performance on inserts - Mailing list pgsql-hackers

From: Jules Bean
Subject: Re: Performance on inserts
Date: 26 August 2000
Msg-id: 20000826114855.A1524@pear.presence.net.uk
In response to: Re: Performance on inserts (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Fri, Aug 25, 2000 at 07:00:22PM -0400, Tom Lane wrote:
> The algorithm does seem to work quite nicely just the way I described
> it, although it turns out I was off base about a good probability
> setting.  I find that something up around 0.99 seems to be good.
> Using the same (perhaps overly simplistic) test case:
> 
> # tuples inserted        6.5                         current + random hack @ 0.99
>                          Time (s)   index (bytes)    Time (s)   index (bytes)
> 1536                     <1         90112            <1         106496
> 3072                     1.56       163840           <1         188416
> 6144                     3.70       286720           1.40       376832
> 12288                    9.73       532480           2.65       688128
> 24576                    93.26      1024000          5.22       1368064
> 49152                    363.23     2007040          10.34      2727936
> 98304                    -          -                22.07      5545984
> 196608                   -          -                45.60      11141120
> 393216                   -          -                92.53      22290432
> 
> I tried probabilities from 0.67 to 0.999 and found that runtimes didn't
> vary a whole lot (though this is near the minimum), while index size
> consistently got larger as the probability of moving right decreased.
> The runtime is nicely linear throughout the range.

That looks brilliant!! (Bearing in mind that I have over 10 million
tuples in my table, you can imagine what performance was like for me!)
Is there any chance you could generate a patch against released 7.0.2
to add just this functionality? It would be the kiss of life for my
code!

(Not in a hurry, I'm not back in work until Wednesday, as it happens)
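
For my own notes, here is roughly how I read the hack, sketched in C
(this is my reading of Tom's description, not his actual patch): while
searching for the insert position, whenever the slot to the right
holds an equal key, keep stepping right with some high probability
instead of always taking the first legal slot.

    #include <stdlib.h>

    #define MOVE_RIGHT_PROB 0.99    /* ~0.99 was the sweet spot above */

    /* Flip a biased coin: step past this equal key or stop here? */
    static int
    keep_moving_right(void)
    {
        return (double) rand() / RAND_MAX < MOVE_RIGHT_PROB;
    }

    /*
     * Inside the insert-position search the loop would then be,
     * schematically (the two helpers stand in for the real btree
     * page traversal):
     *
     *     while (right_neighbour_is_equal(pos) && keep_moving_right())
     *         pos = step_right(pos);
     */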

And, of course, what would /really/ get my code going speedily would
be the partial indices mentioned elsewhere in this thread.  If the
backend could automagically drop keys that cover more than 10%
(tunable) of the rows from the index, then my index would be (a)
about 70% smaller, and (b) only used when it's faster.  [This means
the backend would have to maintain some simple histogram data, but I
can't see that being much of an overhead.]
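
To make that 10% rule concrete, the per-key decision would be
something like the sketch below (the function and the idea of keeping
per-key row counts are my own illustration, not anything the backend
does today):

    #include <stdbool.h>

    /*
     * Sketch only: is this key worth indexing at all?  A key that
     * covers more than max_frac of the table is cheaper to reach by
     * a sequential scan, so indexing it is wasted space and effort.
     */
    static bool
    key_worth_indexing(double rows_with_key, double rows_total,
                       double max_frac)      /* tunable, e.g. 0.10 */
    {
        return rows_total > 0.0 && rows_with_key / rows_total <= max_frac;
    }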

For the short term, if I can get a working version of the above
randomisation patch, I think I shall 'fake' a partial index by
manually setting 'enable_seqscan=off' for all but the 4 or 5 most
common categories. Those two factors combined will speed up my bulk
inserts a lot.

One other idea, though:

Is there any simple way for Pg to combine inserts into one bulk
operation, specifically in their effect on the index files?  It has
always seemed to me to be one of the (many) glaring flaws in SQL that
the INSERT statement only takes one row at a time.  But, using
INSERT ... SELECT, I can imagine that it might be possible to do
'bulk' index updating, so that the scanning process is done once per
'batch'.
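
From the client side, the nearest I can get today is to stage each
batch in an unindexed holding table and push it across in a single
INSERT ... SELECT, though whether the backend then treats the index
updates as one batch is exactly my question.  A libpq sketch (table
names invented for illustration):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Push one staged batch into the indexed table, then clear it. */
    static void
    flush_batch(PGconn *conn)
    {
        PGresult *res;

        res = PQexec(conn,
                     "INSERT INTO big_table SELECT * FROM holding_table");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "batch insert failed: %s",
                    PQerrorMessage(conn));
        PQclear(res);

        res = PQexec(conn, "DELETE FROM holding_table");
        PQclear(res);
    }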

If I can make an analogy with sorted files (which indices are rather
like): if I wanted to add another 100 lines to a 1000-line sorted
file, I'd sort the 100 first and then merge them in.  Whilst I
realise that an index isn't stored as one sorted stream (no need), I
think it ought to be possible to construct an efficient algorithm for
merging two btrees.
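
The merge step I have in mind is just the ordinary two-way merge,
sketched here over plain integer arrays (a real btree merge would
walk pages rather than arrays, but the shape is the same):

    #include <stddef.h>

    /*
     * Merge sorted 'batch' into sorted 'existing', writing to 'out',
     * which must have room for n_existing + n_batch elements.
     */
    static void
    merge_sorted(const int *existing, size_t n_existing,
                 const int *batch, size_t n_batch, int *out)
    {
        size_t i = 0, j = 0, k = 0;

        while (i < n_existing && j < n_batch)
            out[k++] = (existing[i] <= batch[j])
                       ? existing[i++] : batch[j++];
        while (i < n_existing)
            out[k++] = existing[i++];
        while (j < n_batch)
            out[k++] = batch[j++];
    }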

Jules

-- 
Jules Bean                          |        Any sufficiently advanced 
jules@debian.org                    |  technology is indistinguishable
jules@jellybean.co.uk               |               from a perl script

