Thread: Sequential vs. random values - number of pages in B-tree
Hi!

After doing a quick test with sequential values:

create table t01 (id bigint);
create index i01 on t01(id);
insert into t01 SELECT s from generate_series(1,10000000) as s;

and random values:

create table t02 (id bigint);
create index i02 on t02(id);
insert into t02 SELECT random()*100 from generate_series(1,10000000) as s;

the page counts for the tables are the same:

 relpages | relname
----------+---------
    44248 | t01
    44248 | t02

but for the indexes they are different:

 relpages | relname
----------+---------
    27421 | i01
    34745 | i02

Plus, postgres does 5 times more writes to disk with random data. What's the reason that postgres needs more index pages to store random data than sequential ones?
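(The query that produced those counts is not shown in the post; one way to get them, assuming the tables and indexes above and that the statistics have been refreshed, would be something like:

-- refresh the relpages estimates stored in pg_class
ANALYZE t01; ANALYZE t02;
-- relpages is the page-count estimate kept per relation
SELECT c.relpages, c.relname
FROM pg_class c
WHERE c.relname IN ('t01', 't02', 'i01', 'i02')
ORDER BY c.relname;
)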
Hi,

> What's the reason that postgres needs more index pages to store random
> data than sequential ones?

I assume that is because a B-tree is a self-balancing tree, so it may have to be rebalanced after an insertion. Random insertions can land on a page in the middle of the index that has no free space left, which forces page splits and a lot of data movement.

https://en.wikipedia.org/wiki/B-tree#Insertions_and_deletions

Ilya Kazakevich

JetBrains
http://www.jetbrains.com
The Drive to Develop
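(If the pgstattuple contrib module happens to be installed — an assumption, it is not there by default — the effect of those page splits can be seen directly by comparing the average leaf density of the two indexes:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- the sequentially filled index should show leaves packed close to the
-- default 90% btree leaf fillfactor ...
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('i01');
-- ... while the randomly filled one is left noticeably emptier by 50/50 splits
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('i02');
)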
Hi:

On Thu, Aug 18, 2016 at 1:32 PM, pinker <pinker@onet.eu> wrote:
...
> create table t01 (id bigint);
> create index i01 on t01(id);
> insert into t01 SELECT s from generate_series(1,10000000) as s;
>
> and random values:
> create table t02 (id bigint);
> create index i02 on t02(id);
> insert into t02 SELECT random()*100 from generate_series(1,10000000) as s;

It's already been said that btrees work that way; if you find it strange, read a bit about them, this is completely normal, but ...

... what I want to point out is that your test is severely flawed. It probably does not matter in this case, but you are inserting 10M DIFFERENT VALUES in the first case and only about 100 in the second one, which means an average of 100K DUPLICATES of each. This affects btrees too. You could try using random()*1G, or at least 100M, for a better test (which may show even worse behaviour). Ideally I would just write 10M integers to a disk file, shuffle it and compare COPY FROM times for both (unless you know of an easy way to generate a random permutation on the fly without using a lot of memory; I do not).

Francisco Olarte.
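(For instance — table and index names here are made up for the example — a random load with far fewer duplicates could be generated as:

create table t03 (id bigint);
create index i03 on t03(id);
-- random()*1e9 spreads the 10M rows over roughly a billion possible values,
-- so duplicates become rare instead of ~100K per value
insert into t03 SELECT (random()*1e9)::bigint from generate_series(1,10000000) as s;
)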
Francisco Olarte wrote:

> unless you know of an easy way to generate a random permutation on the
> fly without using a lot of memory, I do not.

It could be done by encrypting the stream.

For 32-bit integers:
https://wiki.postgresql.org/wiki/Skip32

For 64-bit integers:
https://wiki.postgresql.org/wiki/XTEA

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
Daniel:

On Thu, Aug 18, 2016 at 5:24 PM, Daniel Verite <daniel@manitou-mail.org> wrote:
>> unless you know of an easy way to generate a random permutation on the
>> fly without using a lot of memory, I do not.
> It could be done by encrypting the stream.
> For 32-bit integers:
> https://wiki.postgresql.org/wiki/Skip32
> For 64-bit integers:
> https://wiki.postgresql.org/wiki/XTEA

Nearly, and probably good enough for tests, but that only generates a pseudorandom permutation if you encrypt all 2**32 (or 2**64) values; with the 1..1E7 range it will map them to 1E7 different numbers scattered over the full 2**32 (or 2**64) domain. I think there are some pseudo-random number generators which can be made to work with any range, but I do not recall which ones right now.

Francisco Olarte.
Francisco Olarte wrote:

> I think there are some pseudo-random number generators which
> can be made to work with any range, but do not recall which ones right
> now.

There's a simple technique that works on top of a Feistel network, called the cycle-walking cipher. Described for instance at:
http://web.cs.ucdavis.edu/~rogaway/papers/subset.pdf

I'm using the opportunity to add a wiki page:
https://wiki.postgresql.org/wiki/Pseudo_encrypt_constrained_to_an_arbitrary_range
with sample plpgsql code for the [0..10,000,000] range that might be useful for other cases.

But for the btree fragmentation and final size issue, TBH I don't expect that constraining the values to that smaller range will make any difference in the tests, because it's the dispersion that matters, not the values themselves. I mean that, whether the values are well dispersed in the [0..1e7] range or equally well dispersed in the [0..2**32] range, the probability of a newly inserted value comparing greater or lower than any previous value in the list should be the same, so shouldn't the page splits be the same, statistically speaking?

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
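(To make the cycle-walking idea concrete, here is a minimal sketch — not the wiki code; the function names and the round function are invented for this illustration. A tiny Feistel permutation over 24-bit values is simply re-applied until the output falls back into the wanted range, which works for any upper bound below 2**24, including the 10,000,000 used in these tests:

-- a 3-round Feistel permutation of 0..2^24-1, built from two 12-bit halves;
-- any deterministic round function that stays within 12 bits keeps it a bijection
CREATE FUNCTION feistel24(value int) RETURNS int AS $$
DECLARE
  l int := (value >> 12) & 4095;  -- high 12 bits
  r int := value & 4095;          -- low 12 bits
  tmp int;
  i int := 0;
BEGIN
  WHILE i < 3 LOOP
    tmp := r;
    r := l # ((((1366 * r + 150889) % 714025) / 714025.0) * 4095)::int;
    l := tmp;
    i := i + 1;
  END LOOP;
  RETURN (r << 12) | l;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;

-- cycle-walking: re-encrypt until the value lands back in 0..maxval;
-- this is a permutation of 0..maxval as long as maxval < 2^24
CREATE FUNCTION bounded_shuffle(value int, maxval int) RETURNS int AS $$
BEGIN
  LOOP
    value := feistel24(value);
    EXIT WHEN value <= maxval;
  END LOOP;
  RETURN value;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;

With that in place, something like INSERT INTO t_random SELECT bounded_shuffle(s, 10000000) FROM generate_series(0, 10000000) AS s; feeds the index the 0..10,000,000 range in scrambled order without materializing and sorting a 10M-row table first; for the worked-out version, see the wiki page above.)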
On Fri, Aug 19, 2016 at 3:20 PM, Daniel Verite <daniel@manitou-mail.org> wrote:
> There's a simple technique that works on top of a Feistel network,
> called the cycle-walking cipher. Described for instance at:
> http://web.cs.ucdavis.edu/~rogaway/papers/subset.pdf
> I'm using the opportunity to add a wiki page:
> https://wiki.postgresql.org/wiki/Pseudo_encrypt_constrained_to_an_arbitrary_range
> with sample plpgsql code for the [0..10,000,000] range that might be useful
> for other cases.

Nice reference, nice wiki page, nice job. Bookmarking everything.

> But for the btree fragmentation and final size issue, TBH I don't expect
> that constraining the values to that smaller range will make any difference
> in the tests, because it's the dispersion that matters, not the values
> themselves.

Neither do I; that is why I said it was probably good enough for tests, but....

> I mean that, whether the values are well dispersed in the [0..1e7] range or
> equally well dispersed in the [0..2**32] range, the probability of a newly
> inserted value comparing greater or lower than any previous value in the list
> should be the same, so shouldn't the page splits be the same, statistically
> speaking?

I know some btrees do prefix-coding of the keys, and some do it even with long integers, and some others do delta-coding for integer keys. I seem to recall postgres does not, which is why I do not think it will make a difference. But anyway, to compare two things like that, as the original poster was doing, I normally prefer to test just one thing at a time. That's why I would try to do it by writing a sorted file, shuffling it with sort -R, and copying both in, server side if possible, so both loads go through exactly the same path.

Francisco Olarte.
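(Spelled out, that suggestion could look something like the following — file paths and table names are invented for the example, the server-side COPY requires superuser rights, and the shuffle itself happens outside the database:

-- dump 10M sorted integers to a server-side file
COPY (SELECT s FROM generate_series(1,10000000) AS s) TO '/tmp/sorted.txt';
-- shuffle it outside the database, e.g.:  sort -R /tmp/sorted.txt > /tmp/shuffled.txt
CREATE TABLE t_seq (id bigint);
CREATE INDEX ON t_seq (id);
CREATE TABLE t_rnd (id bigint);
CREATE INDEX ON t_rnd (id);
-- now time the two loads; the only difference between them is the input order
COPY t_seq FROM '/tmp/sorted.txt';
COPY t_rnd FROM '/tmp/shuffled.txt';
)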
Francisco Olarte wrote:
> It's already been said that btrees work that way; if you find it strange,
> read a bit about them, this is completely normal, but ...

I am just surprised by the order of magnitude in the difference though. 2 and 27 minutes, that's a huge difference... I did another, simplified test, to make sure there are no duplicates and the only difference between both sets is the order:
CREATE TABLE source_sequential AS SELECT s from generate_series(1,10000000) as s;
CREATE TABLE source_random AS SELECT * from source_sequential ORDER BY random();

CREATE TABLE t_sequential (id bigint);
CREATE INDEX i_sequential ON t_sequential (id);
CREATE TABLE t_random (id bigint);
CREATE INDEX i_random ON t_random (id);

INSERT INTO t_sequential SELECT * FROM source_sequential;
102258,949 ms
INSERT INTO t_random SELECT * FROM source_random;
1657575,699 ms
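(Out of curiosity, the resulting index sizes from that run could be compared with something like the following — not part of the original test:

-- relpages is only as fresh as the last VACUUM/ANALYZE;
-- pg_relation_size reads the actual on-disk size
SELECT relname, relpages, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('i_sequential', 'i_random');
)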
Hi pinker:

On Tue, Aug 23, 2016 at 2:26 PM, pinker <pinker@onet.eu> wrote:
> I am just surprised by the order of magnitude in the difference though. 2
> and 27 minutes, that's a huge difference... I did another, simplified test,
> to make sure there are no duplicates and the only difference between both
> sets is the order:
...
> INSERT INTO t_sequential SELECT * FROM source_sequential;
> 102258,949 ms
> INSERT INTO t_random SELECT * FROM source_random;
> 1657575,699 ms

If I read correctly, you are getting 100s/10Mkeys = 10us/key in the sequential case, and 165us/key in the random one.

I'm not surprised at all. I've got greater differences on an in-memory tree; sorted insertion can easily be optimized to be very fast. As an example, sequential insertion can easily avoid moving data while filling the pages and, with a little care, it can also avoid some of that when splitting. I'm not current with the postgres details, but it does not surprise me that it has big optimizations for this, especially since index-ordered insertion is quite common in things like bulk loads or timestamped log lines.

Francisco Olarte.
On 08/23/2016 07:44 AM, Francisco Olarte wrote:
> Hi pinker:
>
> On Tue, Aug 23, 2016 at 2:26 PM, pinker <pinker@onet.eu> wrote:
>> I am just surprised by the order of magnitude in the difference though. 2
>> and 27 minutes, that's a huge difference... I did another, simplified test,
>> to make sure there are no duplicates and the only difference between both
>> sets is the order:
> ...
>> INSERT INTO t_sequential SELECT * FROM source_sequential;
>> 102258,949 ms
>> INSERT INTO t_random SELECT * FROM source_random;
>> 1657575,699 ms
>
> If I read correctly, you are getting 100s/10Mkeys = 10us/key in the
> sequential case, and 165us/key in the random one.
>
> I'm not surprised at all. I've got greater differences on an in-memory
> tree; sorted insertion can easily be optimized to be very fast. As an
> example, sequential insertion can easily avoid moving data while
> filling the pages and, with a little care, it can also avoid some of
> that when splitting. I'm not current with the postgres details, but it
> does not surprise me that it has big optimizations for this, especially
> since index-ordered insertion is quite common in things like bulk loads
> or timestamped log lines.
>
> Francisco Olarte.

And if each insert is in a separate transaction, does this still hold true?
On Tue, Aug 23, 2016 at 4:28 PM, Rob Sargent <robjsargent@gmail.com> wrote:
> On 08/23/2016 07:44 AM, Francisco Olarte wrote:
>> On Tue, Aug 23, 2016 at 2:26 PM, pinker <pinker@onet.eu> wrote:
>>> I am just surprised by the order of magnitude in the difference though. 2
>>> and 27 minutes, that's a huge difference... I did another, simplified test,
>>> to make sure there are no duplicates and the only difference between both
>>> sets is the order:
>> ...
>>> INSERT INTO t_sequential SELECT * FROM source_sequential;
>>> 102258,949 ms
>>> INSERT INTO t_random SELECT * FROM source_random;
>>> 1657575,699 ms
>>
>> If I read correctly, you are getting 100s/10Mkeys = 10us/key in the
>> sequential case, and 165us/key in the random one.
>>
>> I'm not surprised at all. I've got greater differences on an in-memory
>> tree; sorted insertion can easily be optimized to be very fast. As an
>> example, sequential insertion can easily avoid moving data while
>> filling the pages and, with a little care, it can also avoid some of
>> that when splitting. I'm not current with the postgres details, but it
>> does not surprise me that it has big optimizations for this, especially
>> since index-ordered insertion is quite common in things like bulk loads
>> or timestamped log lines.
> And if each insert is in a separate transaction, does this still hold true?

What are you referring to by 'this'? (BTW, bear in mind that one transaction needs at least a disk flush and, if done over a network, at least one RTT, so I doubt you can achieve 10us/transaction unless you have very special conditions.)

Francisco Olarte.
On 08/23/2016 08:34 AM, Francisco Olarte wrote:
> On Tue, Aug 23, 2016 at 4:28 PM, Rob Sargent <robjsargent@gmail.com> wrote:
>> On 08/23/2016 07:44 AM, Francisco Olarte wrote:
>>> On Tue, Aug 23, 2016 at 2:26 PM, pinker <pinker@onet.eu> wrote:
>>>> I am just surprised by the order of magnitude in the difference though. 2
>>>> and 27 minutes, that's a huge difference... I did another, simplified test,
>>>> to make sure there are no duplicates and the only difference between both
>>>> sets is the order:
>>> ...
>>>> INSERT INTO t_sequential SELECT * FROM source_sequential;
>>>> 102258,949 ms
>>>> INSERT INTO t_random SELECT * FROM source_random;
>>>> 1657575,699 ms
>>> If I read correctly, you are getting 100s/10Mkeys = 10us/key in the
>>> sequential case, and 165us/key in the random one.
>>>
>>> I'm not surprised at all. I've got greater differences on an in-memory
>>> tree; sorted insertion can easily be optimized to be very fast. As an
>>> example, sequential insertion can easily avoid moving data while
>>> filling the pages and, with a little care, it can also avoid some of
>>> that when splitting. I'm not current with the postgres details, but it
>>> does not surprise me that it has big optimizations for this, especially
>>> since index-ordered insertion is quite common in things like bulk loads
>>> or timestamped log lines.
>> And if each insert is in a separate transaction, does this still hold true?
> What are you referring to by 'this'? (BTW, bear in mind that one
> transaction needs at least a disk flush and, if done over a network, at
> least one RTT, so I doubt you can achieve 10us/transaction unless you
> have very special conditions.)
>
> Francisco Olarte.

By 'this' I was referring to the optimizations mentioned, and am wondering whether they hold true under user load. Much magic can happen in a custom data load, but do these optimizations apply to an application loading single (or perhaps several) records per transaction? Does one, in that scenario, not suffer any consequence for continuously loading one side of the tree (the rightmost node)?
Hi Rob:

On Tue, Aug 23, 2016 at 4:52 PM, Rob Sargent <robjsargent@gmail.com> wrote:
> By 'this' I was referring to the optimizations mentioned, and am wondering
> whether they hold true under user load.

For that you'll have to refer to the source, or ask someone more versed in pg source arcana.

> Much magic can happen in a custom data
> load, but do these optimizations apply to an application loading single (or
> perhaps several) records per transaction? Does one, in that scenario, not
> suffer any consequence for continuously loading one side of the tree (the
> rightmost node)?

Not that much magic is necessary. The time I did it I just needed to detect on every insertion whether I was at the rightmost position (made easier because I had the minimum/maximum keys cached in the tree object header), and have a special routine for inserting a new last node (put it in the last page, whose pointer I had, grabbing a new one if needed, whose pointer will be appended at the tail of the parent, etc.; it was just a pruned-down version of the general insert routine, but it made insertions run easily 20 times faster by avoiding nearly every check, knowing I was on the right edge). I do not know if pg inserts several items at a time in bulk loading, but I doubt it. Normally every btree indexing library has some optimization for these cases, as they are common, just like every real sort routine has some optimization for presorted input.

Francisco Olarte.