Thread: Sequential vs. random values - number of pages in B-tree
Hi!

After doing a quick test with sequential values:

create table t01 (id bigint);
create index i01 on t01(id);
insert into t01 SELECT s from generate_series(1,10000000) as s;

and random values:

create table t02 (id bigint);
create index i02 on t02(id);
insert into t02 SELECT random()*100 from generate_series(1,10000000) as s;

the page counts for the tables are the same:

 relpages | relname
----------+---------
    44248 | t01
    44248 | t02

but for the indexes they are different:

 relpages | relname
----------+---------
    27421 | i01
    34745 | i02

Plus, postgres does 5 times more writes to disk with random data. What's the reason that postgres needs more index pages to store random data than sequential ones?
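(The query that produced those counts is not shown in the post; one way to get them, assuming the tables and indexes above and that the statistics have been refreshed, would be something like:

-- refresh the relpages estimates stored in pg_class
ANALYZE t01; ANALYZE t02;
-- relpages is the page-count estimate kept per relation
SELECT c.relpages, c.relname
FROM pg_class c
WHERE c.relname IN ('t01', 't02', 'i01', 'i02')
ORDER BY c.relname;
)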
Hi,

> What's the reason that postgres needs more index pages to store random
> data than sequential ones?

I assume that is because a B-tree is a self-balancing tree, so it may have to be rebalanced after an insertion. Random insertions can land on a page in the middle of the index that has no free space left, which forces page splits and a lot of data movement.

https://en.wikipedia.org/wiki/B-tree#Insertions_and_deletions

Ilya Kazakevich

JetBrains
http://www.jetbrains.com
The Drive to Develop
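(If the pgstattuple contrib module happens to be installed — an assumption, it is not there by default — the effect of those page splits can be seen directly by comparing the average leaf density of the two indexes:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- the sequentially filled index should show leaves packed close to the
-- default 90% btree leaf fillfactor ...
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('i01');
-- ... while the randomly filled one is left noticeably emptier by 50/50 splits
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('i02');
)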
Hi:

On Thu, Aug 18, 2016 at 1:32 PM, pinker <pinker@onet.eu> wrote:
...
> create table t01 (id bigint);
> create index i01 on t01(id);
> insert into t01 SELECT s from generate_series(1,10000000) as s;
>
> and random values:
> create table t02 (id bigint);
> create index i02 on t02(id);
> insert into t02 SELECT random()*100 from generate_series(1,10000000) as s;

It's already been said that btrees work that way; if you find it strange, read a bit about them, this is completely normal, but ...

... what I want to point out is that your test is severely flawed. It probably does not matter in this case, but you are inserting 10M DIFFERENT VALUES in the first case and only about 100 in the second one, which means an average of 100K DUPLICATES of each. This affects btrees too. You could try using random()*1G, or at least 100M, for a better test (which may show even worse behaviour). Ideally I would just write 10M integers to a disk file, shuffle it and compare COPY FROM times for both (unless you know of an easy way to generate a random permutation on the fly without using a lot of memory; I do not).

Francisco Olarte.
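(For instance — table and index names here are made up for the example — a random load with far fewer duplicates could be generated as:

create table t03 (id bigint);
create index i03 on t03(id);
-- random()*1e9 spreads the 10M rows over roughly a billion possible values,
-- so duplicates become rare instead of ~100K per value
insert into t03 SELECT (random()*1e9)::bigint from generate_series(1,10000000) as s;
)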
Francisco Olarte wrote:

> unless you know of an easy way to generate a random permutation on the
> fly without using a lot of memory, I do not.

It could be done by encrypting the stream.

For 32-bit integers:
https://wiki.postgresql.org/wiki/Skip32

For 64-bit integers:
https://wiki.postgresql.org/wiki/XTEA

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
Daniel:

On Thu, Aug 18, 2016 at 5:24 PM, Daniel Verite <daniel@manitou-mail.org> wrote:
>> unless you know of an easy way to generate a random permutation on the
>> fly without using a lot of memory, I do not.
> It could be done by encrypting the stream.
> For 32-bit integers:
> https://wiki.postgresql.org/wiki/Skip32
> For 64-bit integers:
> https://wiki.postgresql.org/wiki/XTEA

Nearly, and probably good enough for tests, but that only generates a pseudorandom permutation if you encrypt all 2**32 (or 2**64) values; with the 1..1E7 range it will map them to 1E7 different numbers scattered over the full 2**32 (or 2**64) domain. I think there are some pseudo-random number generators which can be made to work with any range, but I do not recall which ones right now.

Francisco Olarte.
Francisco Olarte wrote:

> I think there are some pseudo-random number generators which
> can be made to work with any range, but do not recall which ones right
> now.

There's a simple technique that works on top of a Feistel network, called the cycle-walking cipher. Described for instance at:
http://web.cs.ucdavis.edu/~rogaway/papers/subset.pdf

I'm using the opportunity to add a wiki page:
https://wiki.postgresql.org/wiki/Pseudo_encrypt_constrained_to_an_arbitrary_range
with sample plpgsql code for the [0..10,000,000] range that might be useful for other cases.

But for the btree fragmentation and final size issue, TBH I don't expect that constraining the values to that smaller range will make any difference in the tests, because it's the dispersion that matters, not the values themselves. I mean that, whether the values are well dispersed in the [0..1e7] range or equally well dispersed in the [0..2**32] range, the probability of a newly inserted value comparing greater or lower than any previous value in the list should be the same, so shouldn't the page splits be the same, statistically speaking?

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
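(To make the cycle-walking idea concrete, here is a minimal sketch — not the wiki code; the function names and the round function are invented for this illustration. A tiny Feistel permutation over 24-bit values is simply re-applied until the output falls back into the wanted range, which works for any upper bound below 2**24, including the 10,000,000 used in these tests:

-- a 3-round Feistel permutation of 0..2^24-1, built from two 12-bit halves;
-- any deterministic round function that stays within 12 bits keeps it a bijection
CREATE FUNCTION feistel24(value int) RETURNS int AS $$
DECLARE
  l int := (value >> 12) & 4095;  -- high 12 bits
  r int := value & 4095;          -- low 12 bits
  tmp int;
  i int := 0;
BEGIN
  WHILE i < 3 LOOP
    tmp := r;
    r := l # ((((1366 * r + 150889) % 714025) / 714025.0) * 4095)::int;
    l := tmp;
    i := i + 1;
  END LOOP;
  RETURN (r << 12) | l;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;

-- cycle-walking: re-encrypt until the value lands back in 0..maxval;
-- this is a permutation of 0..maxval as long as maxval < 2^24
CREATE FUNCTION bounded_shuffle(value int, maxval int) RETURNS int AS $$
BEGIN
  LOOP
    value := feistel24(value);
    EXIT WHEN value <= maxval;
  END LOOP;
  RETURN value;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;

With that in place, something like INSERT INTO t_random SELECT bounded_shuffle(s, 10000000) FROM generate_series(0, 10000000) AS s; feeds the index the 0..10,000,000 range in scrambled order without materializing and sorting a 10M-row table first; for the worked-out version, see the wiki page above.)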
On Fri, Aug 19, 2016 at 3:20 PM, Daniel Verite <daniel@manitou-mail.org> wrote:
> There's a simple technique that works on top of a Feistel network,
> called the cycle-walking cipher. Described for instance at:
> http://web.cs.ucdavis.edu/~rogaway/papers/subset.pdf
> I'm using the opportunity to add a wiki page:
> https://wiki.postgresql.org/wiki/Pseudo_encrypt_constrained_to_an_arbitrary_range
> with sample plpgsql code for the [0..10,000,000] range that might be useful
> for other cases.

Nice reference, nice wiki page, nice job. Bookmarking everything.

> But for the btree fragmentation and final size issue, TBH I don't expect
> that constraining the values to that smaller range will make any difference
> in the tests, because it's the dispersion that matters, not the values
> themselves.

Neither do I; that is why I said it was probably good enough for tests, but....

> I mean that, whether the values are well dispersed in the [0..1e7] range or
> equally well dispersed in the [0..2**32] range, the probability of a newly
> inserted value comparing greater or lower than any previous value in the list
> should be the same, so shouldn't the page splits be the same, statistically
> speaking?

I know some btrees do prefix-coding of the keys, and some do it even with long integers, and some others do delta-coding for integer keys. I seem to recall postgres does not, which is why I do not think it will make a difference. But anyway, to compare two things like that, as the original poster was doing, I normally prefer to test just one thing at a time. That's why I would try to do it by writing a sorted file, shuffling it with sort -R, and copying both in, server side if possible, so both loads go through exactly the same path.

Francisco Olarte.
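(Spelled out, that suggestion could look something like the following — file paths and table names are invented for the example, the server-side COPY requires superuser rights, and the shuffle itself happens outside the database:

-- dump 10M sorted integers to a server-side file
COPY (SELECT s FROM generate_series(1,10000000) AS s) TO '/tmp/sorted.txt';
-- shuffle it outside the database, e.g.:  sort -R /tmp/sorted.txt > /tmp/shuffled.txt
CREATE TABLE t_seq (id bigint);
CREATE INDEX ON t_seq (id);
CREATE TABLE t_rnd (id bigint);
CREATE INDEX ON t_rnd (id);
-- now time the two loads; the only difference between them is the input order
COPY t_seq FROM '/tmp/sorted.txt';
COPY t_rnd FROM '/tmp/shuffled.txt';
)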
Francisco Olarte wrote:
> It's already been said that btrees work that way; if you find it strange,
> read a bit about them, this is completely normal, but ...

I am just surprised by the order of magnitude in the difference though. 2 and 27 minutes, that's a huge difference... I did another, simplified test, to make sure there are no duplicates and the only difference between both sets is the order:
CREATE TABLE source_sequential AS SELECT s from generate_series(1,10000000) as s;
CREATE TABLE source_random AS SELECT * from source_sequential ORDER BY random();

CREATE TABLE t_sequential (id bigint);
CREATE INDEX i_sequential ON t_sequential (id);
CREATE TABLE t_random (id bigint);
CREATE INDEX i_random ON t_random (id);

INSERT INTO t_sequential SELECT * FROM source_sequential;
102258,949 ms
INSERT INTO t_random SELECT * FROM source_random;
1657575,699 ms
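(Out of curiosity, the resulting index sizes from that run could be compared with something like the following — not part of the original test:

-- relpages is only as fresh as the last VACUUM/ANALYZE;
-- pg_relation_size reads the actual on-disk size
SELECT relname, relpages, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('i_sequential', 'i_random');
)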
Hi pinker:

On Tue, Aug 23, 2016 at 2:26 PM, pinker <pinker@onet.eu> wrote:
> I am just surprised by the order of magnitude in the difference though. 2
> and 27 minutes, that's a huge difference... I did another, simplified test,
> to make sure there are no duplicates and the only difference between both
> sets is the order:
...
> INSERT INTO t_sequential SELECT * FROM source_sequential;
> 102258,949 ms
> INSERT INTO t_random SELECT * FROM source_random;
> 1657575,699 ms

If I read correctly, you are getting 100s/10Mkeys = 10us/key in the sequential case, and 165us/key in the random one.

I'm not surprised at all. I've got greater differences on an in-memory tree; sorted insertion can easily be optimized to be very fast. As an example, sequential insertion can easily avoid moving data while filling the pages and, with a little care, it can also avoid some of that when splitting. I'm not current with the postgres details, but it does not surprise me that it has big optimizations for this, especially since index-ordered insertion is quite common in things like bulk loads or timestamped log lines.

Francisco Olarte.
On 08/23/2016 07:44 AM, Francisco Olarte wrote:
> Hi pinker:
>
> On Tue, Aug 23, 2016 at 2:26 PM, pinker <pinker@onet.eu> wrote:
>> I am just surprised by the order of magnitude in the difference though. 2
>> and 27 minutes, that's a huge difference... I did another, simplified test,
>> to make sure there are no duplicates and the only difference between both
>> sets is the order:
> ...
>> INSERT INTO t_sequential SELECT * FROM source_sequential;
>> 102258,949 ms
>> INSERT INTO t_random SELECT * FROM source_random;
>> 1657575,699 ms
>
> If I read correctly, you are getting 100s/10Mkeys = 10us/key in the
> sequential case, and 165us/key in the random one.
>
> I'm not surprised at all. I've got greater differences on an in-memory
> tree; sorted insertion can easily be optimized to be very fast. As an
> example, sequential insertion can easily avoid moving data while
> filling the pages and, with a little care, it can also avoid some of
> that when splitting. I'm not current with the postgres details, but it
> does not surprise me that it has big optimizations for this, especially
> since index-ordered insertion is quite common in things like bulk loads
> or timestamped log lines.
>
> Francisco Olarte.

And if each insert is in a separate transaction, does this still hold true?
On Tue, Aug 23, 2016 at 4:28 PM, Rob Sargent <robjsargent@gmail.com> wrote:
> On 08/23/2016 07:44 AM, Francisco Olarte wrote:
>> On Tue, Aug 23, 2016 at 2:26 PM, pinker <pinker@onet.eu> wrote:
>>> I am just surprised by the order of magnitude in the difference though. 2
>>> and 27 minutes, that's a huge difference... I did another, simplified test,
>>> to make sure there are no duplicates and the only difference between both
>>> sets is the order:
>> ...
>>> INSERT INTO t_sequential SELECT * FROM source_sequential;
>>> 102258,949 ms
>>> INSERT INTO t_random SELECT * FROM source_random;
>>> 1657575,699 ms
>>
>> If I read correctly, you are getting 100s/10Mkeys = 10us/key in the
>> sequential case, and 165us/key in the random one.
>>
>> I'm not surprised at all. I've got greater differences on an in-memory
>> tree; sorted insertion can easily be optimized to be very fast. As an
>> example, sequential insertion can easily avoid moving data while
>> filling the pages and, with a little care, it can also avoid some of
>> that when splitting. I'm not current with the postgres details, but it
>> does not surprise me that it has big optimizations for this, especially
>> since index-ordered insertion is quite common in things like bulk loads
>> or timestamped log lines.
> And if each insert is in a separate transaction, does this still hold true?

What are you referring to by 'this'? (BTW, bear in mind that one transaction needs at least a disk flush and, if done over a network, at least one RTT, so I doubt you can achieve 10us/transaction unless you have very special conditions.)

Francisco Olarte.
On 08/23/2016 08:34 AM, Francisco Olarte wrote:
> On Tue, Aug 23, 2016 at 4:28 PM, Rob Sargent <robjsargent@gmail.com> wrote:
>> On 08/23/2016 07:44 AM, Francisco Olarte wrote:
>>> On Tue, Aug 23, 2016 at 2:26 PM, pinker <pinker@onet.eu> wrote:
>>>> I am just surprised by the order of magnitude in the difference though. 2
>>>> and 27 minutes, that's a huge difference... I did another, simplified test,
>>>> to make sure there are no duplicates and the only difference between both
>>>> sets is the order:
>>> ...
>>>> INSERT INTO t_sequential SELECT * FROM source_sequential;
>>>> 102258,949 ms
>>>> INSERT INTO t_random SELECT * FROM source_random;
>>>> 1657575,699 ms
>>> If I read correctly, you are getting 100s/10Mkeys = 10us/key in the
>>> sequential case, and 165us/key in the random one.
>>>
>>> I'm not surprised at all. I've got greater differences on an in-memory
>>> tree; sorted insertion can easily be optimized to be very fast. As an
>>> example, sequential insertion can easily avoid moving data while
>>> filling the pages and, with a little care, it can also avoid some of
>>> that when splitting. I'm not current with the postgres details, but it
>>> does not surprise me that it has big optimizations for this, especially
>>> since index-ordered insertion is quite common in things like bulk loads
>>> or timestamped log lines.
>> And if each insert is in a separate transaction, does this still hold true?
> What are you referring to by 'this'? (BTW, bear in mind that one
> transaction needs at least a disk flush and, if done over a network, at
> least one RTT, so I doubt you can achieve 10us/transaction unless you
> have very special conditions.)
>
> Francisco Olarte.

By 'this' I was referring to the optimizations mentioned, and am wondering whether they hold true under user load. Much magic can happen in a custom data load, but do these optimizations apply to an application loading single (or perhaps several) records per transaction? Does one, in that scenario, not suffer any consequence for continuously loading one side of the tree (the rightmost node)?
Hi Rob:

On Tue, Aug 23, 2016 at 4:52 PM, Rob Sargent <robjsargent@gmail.com> wrote:
> By 'this' I was referring to the optimizations mentioned, and am wondering
> whether they hold true under user load.

For that you'll have to refer to the source, or ask someone more versed in pg source arcana.

> Much magic can happen in a custom data
> load, but do these optimizations apply to an application loading single (or
> perhaps several) records per transaction? Does one, in that scenario, not
> suffer any consequence for continuously loading one side of the tree (the
> rightmost node)?

Not that much magic is necessary. The time I did it I just needed to detect on every insertion whether I was at the rightmost position (made easier because I had the minimum/maximum keys cached in the tree object header), and have a special routine for inserting a new last node (put it in the last page, whose pointer I had, grabbing a new one if needed, whose pointer will be appended at the tail of the parent, etc.; it was just a pruned-down version of the general insert routine, but it made insertions run easily 20 times faster by avoiding nearly every check, knowing I was on the right edge). I do not know if pg inserts several items at a time in bulk loading, but I doubt it. Normally every btree indexing library has some optimization for these cases, as they are common, just like every real sort routine has some optimization for presorted input.

Francisco Olarte.