Thread: [HACKERS] [WIP] Zipfian distribution in pgbench
Hello!

PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% UPDATE of a random row by PK) when benchmarking with a big number of clients using a Zipfian distribution. MySQL also shows a decline, but it is not as significant as it is in PostgreSQL. MongoDB does not show a decline at all. And if pgbench had a Zipfian-distribution random number generator, everyone would be able to do research on this topic without using YCSB. This is the reason why I am currently working on a random_zipfian function.

The bottleneck of the algorithm that I use is that it calculates the zeta function, which has linear complexity (https://en.wikipedia.org/wiki/Riemann_zeta_function). That may cause problems when generating a huge amount of big numbers. That's why I added caching for the zeta value, and it works well when random_zipfian is called with the same parameters in a script. For example:

…
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 100, 1.2)
…

Otherwise, the second call overrides the cache of the first and caching does not make any sense:

…
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 200, 1.4)
…

That's why I have a question: should I implement support for caching zeta values for calls with different parameters, or not?

P.S. I am attaching a patch and a script, an analogue of YCSB Workload A. Run the benchmark with the command:

$ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql

On scale = 10 (1 million rows) it gives the following results on a machine with 144 cores (with synchronous_commit=off):

nclients    tps
1           8842.401870
2          18358.140869
4          45999.378785
8          88713.743199
16        170166.998212
32        290069.221493
64        178128.030553
128        88712.825602
256        38364.937573
512        13512.765878
1000        6188.136736

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hello Alik,

> PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50%
> UPDATE of random row by PK) on benchmarking with big number of clients
> using Zipfian distribution. MySQL also has decline but it is not
> significant as it is in PostgreSQL. MongoDB does not have decline at
> all. And if pgbench would have Zipfian distribution random number
> generator, everyone will be able to make research on this topic without
> using YCSB.

Your description is not very precise. What version of Postgres is used? If there is a decline, compared to which version? Is there a link to these results?

> This is the reason why I am currently working on random_zipfian function.
> The bottleneck of algorithm that I use is that it calculates zeta
> function (it has linear complexity -
> https://en.wikipedia.org/wiki/Riemann_zeta_function). It may cause
> problems on generating huge amount of big numbers.

Indeed, the function computation is overly expensive, and the numerical precision of the implementation is doubtful. If there is no better way to compute this function, ISTM that it should be summed in reverse order to accumulate small values first, from (1/n)^s + ... + (1/2)^s. As 1/1 == 1, the corresponding term is 1, no point in calling pow for this one, so it could be:

    double ans = 0.0;
    for (i = n; i >= 2; i--)
        ans += pow(1. / i, theta);
    return 1.0 + ans;

> That's why I added caching for zeta value. And it works good for cases
> when random_zipfian called with same parameters in script. For example:

> That's why I have a question: should I implement support of caching zeta
> values for calls with different parameters, or not?

I do not envision the random_zipfian function to be used widely, but if it is useful to someone this is fine with me. Could an inexpensive exponential distribution be used instead for the same benchmarking purpose?
If the function, when actually used, is likely to be called with different parameters, then some caching beyond the last value would seem in order. Maybe a small fixed-size array?

However, it should be somehow thread-safe, which does not seem to be the case with the current implementation. Maybe a per-thread cache? Or use a lock only to update a shared cache? At least it should avoid locking to read values...

> P.S. I attaching patch and script - analogue of YCSB Workload A.
> Run benchmark with command:
> $ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql

Given the explanations, the random draw mostly hits values at the beginning of the interval, so when the number of clients goes higher one just gets locking contention on the updated row? ISTM that also having the tps achieved with a flat distribution would allow checking this hypothesis.

> On scale = 10 (1 million rows) it gives following results on machine with
> 144 cores (with synchronous_commit=off):
> nclients  tps
> 1         8842.401870
> 2        18358.140869
> 4        45999.378785
> 8        88713.743199
> 16      170166.998212
> 32      290069.221493
> 64      178128.030553
> 128      88712.825602
> 256      38364.937573
> 512      13512.765878
> 1000      6188.136736

--
Fabien.
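The small fixed-size, per-thread cache suggested above could be sketched roughly as follows. This is illustrative only (names and the round-robin eviction policy are assumptions, not the patch's actual design); because each thread owns its cache, no locking is needed at all:

```c
#include <string.h>

#define ZETA_CACHE_SIZE 8

typedef struct ZetaCacheCell
{
	long		n;				/* upper bound of the interval */
	double		theta;			/* zipfian parameter */
	double		zetan;			/* cached zeta(n, theta) */
} ZetaCacheCell;

typedef struct ZetaCache
{
	ZetaCacheCell cells[ZETA_CACHE_SIZE];
	int			nused;
	int			next_evict;		/* round-robin eviction victim */
} ZetaCache;

/*
 * Look up (n, theta); on a miss, compute the value via compute_zeta()
 * and remember it, evicting an old cell if the cache is full.
 */
static double
zeta_cache_get(ZetaCache *cache, long n, double theta,
			   double (*compute_zeta) (long, double))
{
	int			i;
	ZetaCacheCell *cell;

	for (i = 0; i < cache->nused; i++)
	{
		cell = &cache->cells[i];
		if (cell->n == n && cell->theta == theta)
			return cell->zetan;	/* hit: no O(n) recomputation */
	}

	if (cache->nused < ZETA_CACHE_SIZE)
		cell = &cache->cells[cache->nused++];
	else
	{
		cell = &cache->cells[cache->next_evict];
		cache->next_evict = (cache->next_evict + 1) % ZETA_CACHE_SIZE;
	}
	cell->n = n;
	cell->theta = theta;
	cell->zetan = compute_zeta(n, theta);
	return cell->zetan;
}
```

With this shape, the two-script example from the first message (random_zipfian(1, 100, 1.2) and random_zipfian(1, 200, 1.4)) would occupy two cells instead of thrashing a single cached value.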
On Fri, Jul 7, 2017 at 3:45 AM, Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote:
> PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50%
> UPDATE of random row by PK) on benchmarking with big number of clients
> using Zipfian distribution. MySQL also has decline but it is not
> significant as it is in PostgreSQL. MongoDB does not have decline at all.

How is that possible? In a Zipfian distribution, no matter how big the table is, almost all of the updates will be concentrated on a handful of rows, and updates to any given row are necessarily serialized, or so I would think. Maybe MongoDB can be fast there since there are no transactions, so it can just lock the row, slam in the new value, and unlock the row, all (I suppose) without writing WAL or doing anything hard. But MySQL is going to have to hold the row lock until transaction commit just like we do, or so I would think. Is it just that their row locking is way faster than ours?

I'm more curious about why we're performing badly than I am about a general-purpose random_zipfian function. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 7, 2017 at 5:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> How is that possible? In a Zipfian distribution, no matter how big
> the table is, almost all of the updates will be concentrated on a
> handful of rows - and updates to any given row are necessarily
> serialized, or so I would think. Maybe MongoDB can be fast there
> since there are no transactions, so it can just lock the row, slam in
> the new value and unlock the row, all (I suppose) without writing WAL
> or doing anything hard.

If you're not using the WiredTiger storage engine, then the locking is at the document level, which means that a Zipfian distribution is no worse than any other as far as lock contention goes. That's one possible explanation. Another is that index-organized tables naturally have much better locality, which matters at every level of the memory hierarchy.

> I'm more curious about why we're performing badly than I am about a
> general-purpose random_zipfian function. :-)

I'm interested in both. I think that a random_zipfian function would be quite helpful for modeling certain kinds of performance problems, like CPU cache misses incurred at the page level.

--
Peter Geoghegan
On Fri, Jul 7, 2017 at 12:45 AM, Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote:
> On scale = 10 (1 million rows) it gives following results on machine with
> 144 cores (with synchronous_commit=off):
> nclients  tps
> 1         8842.401870
> 2        18358.140869
> 4        45999.378785
> 8        88713.743199
> 16      170166.998212
> 32      290069.221493
> 64      178128.030553
> 128      88712.825602
> 256      38364.937573
> 512      13512.765878
> 1000      6188.136736

Is it possible for you to instrument the number of B-Tree page accesses using custom instrumentation for pgbench_accounts_pkey?

If that seems like too much work, then it would still be interesting to see what the B-Tree keyspace looks like before and after varying the "nclient" count from, say, 32 to 128. Maybe there is a significant difference in how balanced or skewed it is in each case. Or, the index could simply be more bloated.

There is a query that I sometimes use, that itself uses pageinspect, to summarize the keyspace quickly. It shows you the high key for every internal page, starting from the root and working down to the lowest internal page level (the one just before the leaf level -- level 1), in logical/keyspace order. You can use it to visualize the distribution of values. It could easily include the leaf level, too, but that's less interesting and tends to make the query take ages. I wonder what the query will show here.
Here is the query:

with recursive
index_details as (
  select 'pgbench_accounts_pkey'::text idx
),
size_in_pages_index as (
  select (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from index_details
),
page_stats as (
  select index_details.*, stats.*
  from index_details,
       size_in_pages_index,
       lateral (select i from generate_series(1, size_pages - 1) i) series,
       lateral (select * from bt_page_stats(idx, i)) stats
),
internal_page_stats as (
  select * from page_stats where type != 'l'
),
meta_stats as (
  select * from index_details s, lateral (select * from bt_metap(s.idx)) meta
),
internal_items as (
  select * from internal_page_stats order by btpo desc
),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select 1, blkno, btpo
  from internal_items
  where btpo_prev = 0 and btpo = (select level from meta_stats)
  union
  select case when level = btpo then o.item + 1 else 1 end, blkno, btpo
  from internal_items i, ordered_internal_items o
  where i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select idx,
       btpo as level,
       item as l_item,
       blkno,
       btpo_prev,
       btpo_next,
       btpo_flags,
       type,
       live_items,
       dead_items,
       avg_item_size,
       page_size,
       free_size,
       -- Only non-rightmost pages have high key.
       case when btpo_next != 0 then
         (select data from bt_page_items(idx, blkno) where itemoffset = 1)
       end as highkey
from ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;

--
Peter Geoghegan
Peter Geoghegan wrote:
> Here is the query:
>
> with recursive index_details as (
>   select 'pgbench_accounts_pkey'::text idx
> ), [...]

Hmm, this seems potentially very useful. Care to upload it to https://wiki.postgresql.org/wiki/Category:Snippets ?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 7, 2017 at 12:59 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Hmm, this seems potentially very useful. Care to upload it to > https://wiki.postgresql.org/wiki/Category:Snippets ? Sure. I've added it here, under "index maintenance": https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index It would be a nice additional touch if there was an easy way of taking the on-disk representation of index tuples (in this case that would be little-endian signed integers from bt_page_items()), and from that output actual typed values. Maybe just for a few select datatypes. -- Peter Geoghegan
Hello, Fabien!
> Your description is not very precise. What version of Postgres is used?
> If there is a decline, compared to which version? Is there a link to
> these results?
The benchmark was done on master (v10). I am attaching an image with the results.
> Indeed, the function computation is over expensive, and the numerical
> precision of the implementation is doubtful.
>
> If there is no better way to compute this function, ISTM that it should
> be summed in reverse order to accumulate small values first, from
> (1/n)^s + ... + (1/2)^s. As 1/1 == 1, the corresponding term is 1, no
> point in calling pow for this one, so it could be:
>
>     double ans = 0.0;
>     for (i = n; i >= 2; i--)
>         ans += pow(1. / i, theta);
>     return 1.0 + ans;
You are right, it’s better to reverse order.
> If the function, when actually used, is likely to be called with
> different parameters, then some caching beyond the last value would seem
> in order. Maybe a small fixed-size array?
>
> However, it should be somehow thread-safe, which does not seem to be the
> case with the current implementation. Maybe a per-thread cache? Or use a
> lock only to update a shared cache? At least it should avoid locking to
> read values…
Yeah, I forgot about thread safety. I will implement a per-thread cache with a small fixed array.
> Given the explanations, the random draw mostly hits values at the
> beginning of the interval, so when the number of clients goes higher one
> just gets locking contention on the updated row?
Yes, exactly.
> ISTM that also having the tps achieved with a flat distribution would
> allow checking this hypothesis.
On Workload A with a uniform distribution, PostgreSQL shows better results than MongoDB and MySQL (see attachment). You can also notice that for a small number of clients, the type of distribution does not affect tps on MySQL.

It is important to mention that Postgres runs with synchronous_commit=off; to match that durability level, MongoDB runs with writeConcern=1&journaled=false. In this mode there is a possibility of losing all changes from the last second. If we run Postgres with maximum durability, MongoDB will lag far behind.
---
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
On Mon, Jul 10, 2017 at 12:19 PM, Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote:
> Hello, Fabien!
>
>> Your description is not very precise. What version of Postgres is used?
>> If there is a decline, compared to which version? Is there a link to
>> these results?
>
> The benchmark was done on master (v10). I am attaching an image with the
> results.
It will be interesting to see what the profiling data with perf says about this for PostgreSQL. Can you try to get the perf report?
Hello Alik,

>> Your description is not very precise. What version of Postgres is used?
>> If there is a decline, compared to which version? Is there a link to
>> these results?
>
> Benchmark have been done in master v10. I am attaching image with results:

Ok, thanks. More precision would be helpful, such as the exact pgbench options used (e.g. how many clients per thread in pgbench, how long it runs, prepared transactions, ...).

Intuitively, contention should explain a saturation of the tps performance, because more clients are not effective at improving tps as they wait for other clients, and the latency would degrade. But it is unclear to me why the tps would be reduced even with lock contention, so something seems amiss. Performance debugging by mail is an uneasy task.

Maybe you could try zipf with unlogged tables, to check whether skipping the WAL write does something. Also Amit's advice about the perf report looks useful.

>> Given the explanations, the random draw mostly hits values at the
>> beginning of the interval, so when the number of client goes higher one
>> just get locking contention on the updated row?
>
> Yes, exactly.

Ok. The uniform distribution run, if all other parameters are equal, gives a hint about the potential performance when the performance bottleneck is hit.

> On Workload A with uniform distribution PostgreSQL shows better results
> than MongoDB and MySQL (see attachment). Also you can notice that for
> small number of clients type of distribution does not affect on tps on
> MySQL.

Ok. I assume that you use pgbench for pg and other ad-hoc tools for the other targets (mysql & mongodb).

--
Fabien.
On 7 Jul 2017, at 21:53, Peter Geoghegan <pg@bowt.ie> wrote:
> Is it possible for you to instrument the number of B-Tree page
> accesses using custom instrumentation for pgbench_accounts_pkey?
>
> If that seems like too much work, then it would still be interesting
> to see what the B-Tree keyspace looks like before and after varying
> the "nclient" count from, say, 32 to 128. Maybe there is a significant
> difference in how balanced or skewed it is in each case. Or, the index
> could simply be more bloated.
>
> There is a query that I sometimes use, that itself uses pageinspect,
> to summarize the keyspace quickly. It shows you the highkey for every
> internal page, starting from the root and working down to the lowest
> internal page level (the one just before the leaf level -- level 1),
> in logical/keyspace order. You can use it to visualize the
> distribution of values. It could easily include the leaf level, too,
> but that's less interesting and tends to make the query take ages. I
> wonder what the query will show here.
>
> Here is the query:
> …
I am attaching the results of the query you sent. They show that nothing has changed after executing the tests.
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Hello!

I want to say that our company is already engaged in the search for the causes of the problem and their solution. And also, we have a few experimental patches that increase performance for 1000 clients by several times.

In addition, I have fixed the thread-safety issues and implemented a per-thread cache for zeta values. See attached patch.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Jul 12, 2017 at 4:28 AM, Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote:
> I am attaching results of query that you sent. It shows that there is
> nothing have changed after executing tests.

But something did change! In the case where performance was good, all internal pages on the level above the leaf level have exactly 285 items, excluding the rightmost page, which unsurprisingly didn't fill. However, after increasing client count to get the slow case, the "hot" part of the keyspace (the leaf pages owned by the first/leftmost internal page on that level) has 353 items rather than 285.

Now, that might not seem like that much of a difference, but if you consider how duplicates are handled in the B-Tree code, and how unique index enforcement works, I think it could be. It could lead to heavy buffer lock contention, because we sometimes do a lot of work with an exclusive buffer lock held. This is something that I go into a lot of detail on in the Wiki page on Key normalization:

https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement

Now, I'm not saying that I know for sure that that's what it is. It seems like a good area to investigate, though. Even if it wasn't buffer lock contention, we're still talking about the difference between the hot part of the B-Tree being about 353 pages, versus 285. Buffer lock contention could naturally limit the growth in size to "only" 353, by slowing everything down.

--
Peter Geoghegan
On Wed, Jul 12, 2017 at 12:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jul 12, 2017 at 4:28 AM, Alik Khilazhev
> <a.khilazhev@postgrespro.ru> wrote:
>> I am attaching results of query that you sent. It shows that there is
>> nothing have changed after executing tests.
>
> But something did change! In the case where performance was good, all
> internal pages on the level above the leaf level have exactly 285
> items, excluding the rightmost page, which unsurprisingly didn't fill.
> However, after increasing client count to get the slow case, the "hot"
> part of the keyspace (the leaf pages owned by the first/leftmost
> internal page on that level) has 353 items rather than 285.

Could you please run the query again for both cases, but this time change it to get items from the leaf level too, and not just internal levels? Just remove the "where type != 'l'" clause in one of the CTEs. That should do it.

It's probably the case that the hot part of the B-Tree is actually significantly less than 353 items (or 285 items), and so the "before" and "after" difference in how many pages are used for the hot part of the keyspace could be quite a lot larger than my initial estimate. It could be that the granularity that is shown on the internal pages is insufficient to see just how bad the problem is.

--
Peter Geoghegan
Peter Geoghegan wrote:
> Now, that might not seem like that much of a difference, but if you
> consider how duplicates are handled in the B-Tree code, and how unique
> index enforcement works, I think it could be. It could lead to heavy
> buffer lock contention, because we sometimes do a lot of work with an
> exclusive buffer lock held.

Not to mention work done with a "buffer cleanup lock" held -- which is compounded by the fact that acquiring such a lock is prone to starvation if there are many scanners of that index. I've seen a case where a very hot table is scanned so heavily that vacuum is starved for days waiting to acquire cleanup on a single page (vacuum was only able to finish because the app using the table was restarted).

I'm sure that a uniform distribution of keys, with a uniform distribution of values scanned, would give a completely different behavior than a highly skewed distribution where a single key receives a large fraction of the scans.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 12, 2017 at 1:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Not to mention work done with a "buffer cleanup lock" held -- which is
> compounded by the fact that acquiring such a lock is prone to starvation
> if there are many scanners of that index. I've seen a case where a very
> hot table is scanned so heavily that vacuum is starved for days waiting
> to acquire cleanup on a single page (vacuum was only able to finish
> because the app using the table was restarted). I'm sure that a uniform
> distribution of keys, with a uniform distribution of values scanned,
> would give a completely different behavior than a highly skewed
> distribution where a single key receives a large fraction of the scans.

Good point.

I'd be interested in seeing the difference it makes if Postgres is built with the call to _bt_check_unique() commented out within nbtinsert.c. The UPDATE query doesn't ever change indexed columns, so there should be no real need for the checking it does in this case. We're seeing a lot of duplicates in at least part of the keyspace, and yet the queries themselves should be as HOT-safe as possible.

Another possibly weird thing that I noticed is latency standard deviation. This is from Alik's response to my request to run that SQL query:

latency average = 1.375 ms
tps = 93085.250384 (including connections establishing)
tps = 93125.724773 (excluding connections establishing)
SQL script 1: /home/nglukhov/ycsb_read_zipf.sql
 - weight: 1 (targets 50.0% of total)
 - 2782999 transactions (49.8% of total, tps = 46364.447705)
 - latency average = 0.131 ms
 - latency stddev = 0.087 ms
SQL script 2: /home/nglukhov/ycsb_update_zipf.sql
 - weight: 1 (targets 50.0% of total)
 - 2780197 transactions (49.8% of total, tps = 46317.766703)
 - latency average = 2.630 ms
 - latency stddev = 14.092 ms

This is from the 128 client case -- the slow case. Notice that the standard deviation is very high for ycsb_update_zipf.sql.
I wonder if this degrades because of some kind of feedback loop, that reaches a kind of tipping point at higher client counts. -- Peter Geoghegan
On Wed, Jul 12, 2017 at 2:17 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'd be interested in seeing the difference it makes if Postgres is
> built with the call to _bt_check_unique() commented out within
> nbtinsert.c.

Actually, I mean that I wonder how much of a difference it would make if this entire block was commented out within _bt_doinsert():

    if (checkUnique != UNIQUE_CHECK_NO)
    {
        ...
    }

--
Peter Geoghegan
On 13 Jul 2017, at 00:20, Peter Geoghegan <pg@bowt.ie> wrote:
> Actually, I mean that I wonder how much of a difference it would make
> if this entire block was commented out within _bt_doinsert():
>
>     if (checkUnique != UNIQUE_CHECK_NO)
>     {
>         …
>     }
I am attaching the results of tests for 32 and 128 clients, for the original and patched (_bt_doinsert) variants.
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Hello Alik,

A few comments about the patch v2.

Patch applies and compiles.

Documentation says that the closer theta is to 0, the flatter the distribution, but the implementation requires at least 1, including strange error messages:

    zipfian parameter must be greater than 1.000000 (not 1.000000)

Could theta be allowed between 0 & 1? I've tried forcing with theta = 0.1 and it worked well, so I'm not sure that I understand the restriction. I also tried with theta = 0.001 but it seemed less good.

I have also tried to check the distribution wrt the explanations, with the attached scripts, n=100, theta=1.000001/1.5/3.0: it does not seem to work. There is a repeatable +15% bias on i=3 and a repeatable -3% to -30% bias for values in i=10-100, this for different values of theta (1.000001, 1.5, 3.0). If you try the script, beware to set parameters (theta, n) consistently.

About the code:

Remove spaces and tabs at end of lines or on empty lines.

zipfn: I would suggest to move the pg_erand48 out and pass the result instead of passing the thread. The erand call would move to getZipfianRand.

I'm wondering whether the "nearest" hack could be removed so as to simplify the cache functions code...

Avoid editing lines without changes (spaces/tabs?):

    - thread->logfile = NULL; /* filled in later */
    + thread->logfile = NULL; /* filled in later */

The documentation explaining the intuitive distribution talks about an N but does not provide any hint about its value depending on the parameters. There is a useless empty line before "</para>" after that.

--
Fabien.
On Thu, Jul 13, 2017 at 4:38 AM, Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote:
> I am attaching results of test for 32 and 128 clients for original and
> patched (_bt_doinsert) variants.

Thanks.

The number of leaf pages at the left hand side of the leaf level seems to be ~50 less than the unpatched 128 client case was the first time around, which seems like a significant difference. I wonder why. Maybe autovacuum ran at the right/wrong time this time around?

Did you happen to record TPS for this most recent set of tests?

I notice one possibly interesting thing from these new numbers: the 32 client case is slightly more bloated when unique index enforcement is removed.

--
Peter Geoghegan
On Thu, Jul 13, 2017 at 10:02 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> The number of leaf pages at the left hand side of the leaf level seems
> to be ~50 less than the unpatched 128 client case was the first time
> around, which seems like a significant difference. I wonder why. Maybe
> autovacuum ran at the right/wrong time this time around?

To reiterate what I say above:

The number of leaf pages with dead items is 20 with this most recent run (128 clients, patched + unpatched). The leftmost internal page one level up from the leaf level contains 289 items, whereas last time it was 353 items.

That's a difference between having 20 hot/bloated leaf pages and probably 84 hot/bloated pages, which I infer must have been the total number of bloated leaf pages within "result.txt". I think that something about all the "pgbench_index_*txt" tests is very different to what we see within "result.txt". It's as if there was a problem when "result.txt" ran, but that problem somehow didn't come up when you did new tests.

--
Peter Geoghegan
On Thu, Jul 13, 2017 at 12:49 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> To reiterate what I say above:
>
> The number of leaf pages with dead items is 20 with this most recent
> run (128 clients, patched + unpatched). The leftmost internal page one
> level up from the leaf level contains 289 items. Whereas last time it
> was 353 items.
>
> That's a difference between having 20 hot/bloated leaf pages, and
> probably 84 hot/bloated pages, which I infer must have been the total
> number of bloated leaf pages within "result.txt". I think that
> something about all the "pgbench_index_*txt" tests are very different
> to what we see within "result.txt". It's as if there was a problem
> when "result.txt" ran, but that problem somehow didn't come up when
> you did new tests.

I just figured out that "result.txt" is only a 60 second pgbench run. Is the same true for other tests?

It would be much more interesting to see runs that lasted 10 minutes or more. Maybe with pgbench-tools. I would expect that a decline in performance due to bloating the index could happen only after a certain threshold was crossed. Things like HOT pruning and LP_DEAD cleanup could be pretty effective, until finally a tipping point is reached and they're much less effective.

--
Peter Geoghegan
On 13 Jul 2017, at 19:14, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Documentation says that the closer theta is to 0, the flatter the
> distribution, but the implementation requires at least 1, including
> strange error messages:
>
>     zipfian parameter must be greater than 1.000000 (not 1.000000)
>
> Could theta be allowed between 0 & 1? I've tried forcing with theta = 0.1
> and it worked well, so I'm not sure that I understand the restriction.
> I also tried with theta = 0.001 but it seemed less good.
The algorithm works with theta less than 1. The only problem is that theta cannot be exactly 1, because of the following line of code:

    cell->alpha = 1. / (1 - theta);

That's why I put such a restriction. Now I see 2 possible solutions:
1) Exclude 1, and allow everything in the range (0; +∞).
2) Or just increase/decrease theta by a very small number if it is 1.
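Option 1 could be a simple parameter check. A hedged sketch (the function name and error message are illustrative, not the patch's actual code):

```c
#include <stdio.h>

/*
 * Accept any theta in (0, 1) or (1, +inf); reject theta <= 0 and
 * exactly 1, where alpha = 1/(1 - theta) is undefined.
 * Returns 1 if the parameter is usable, 0 otherwise.
 */
static int
zipfian_check_param(double theta)
{
	if (theta <= 0.0 || theta == 1.0)
	{
		fprintf(stderr,
				"zipfian parameter must be in (0, 1) or (1, +inf)\n");
		return 0;
	}
	return 1;
}
```

The exact comparison against 1.0 is deliberate: only the single value 1.0 breaks the alpha computation, so nearby values like 0.999999 remain valid.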
> I have also tried to check the distribution wrt the explanations, with
> the attached scripts, n=100, theta=1.000001/1.5/3.0: it does not seem to
> work, there is a repeatable +15% bias on i=3 and a repeatable -3% to -30%
> bias for values in i=10-100, this for different values of theta
> (1.000001, 1.5, 3.0). If you try the script, beware to set parameters
> (theta, n) consistently.
I've executed the scripts that you attached with different theta and numbers of outcomes (not n; n remains the same, 100), and I found out that for theta = 0.1 and a big number of outcomes it gives a distribution very similar to Zipfian (for number of outcomes = 100 000 the bias is -6% to 8% over the whole range, and for NOO = 1 000 000 the bias is -2% to 2%).

By number of outcomes (NOO) I mean how many times random_zipfian was called. For example:

    pgbench -f compte_bench.sql -t 100000
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
> On 13 Jul 2017, at 23:13, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I just figured out that "result.txt" is only a 60 second pgbench run.
> Is the same true for other tests?

Yes, the other tests ran 60 seconds too.

> It would be much more interesting to see runs that lasted 10 minutes
> or more.

I am attaching the results of tests for 32 and 128 clients that ran for 10 minutes; TPS remains 305 and 115 ktps respectively.

The tests were executed with the configuration set for YCSB. And autovacuum is very aggressive there, which can be the reason why no decline appears with increasing run time. Autovacuum config:

autovacuum_max_workers = 8
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.1
autovacuum_vacuum_cost_delay = 0ms
autovacuum_vacuum_cost_limit = 10000

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> Algorithm works with theta less than 1. The only problem here is that
> theta can not be 1, because of the next line of code
>
> cell->alpha = 1. / (1 - theta);
>
> That’s why I put such restriction. Now I see 2 possible solutions for that:
> 1) Exclude 1, and allow everything in range (0;+∞).

Yep.

> 2) Or just increase/decrease theta by very small number if it is 1.

Nope, this seems quite arbitrary.

> I've executed scripts that you attached with different theta and number
> of outcomes (not n, n remains the same = 100) and I found out that for
> theta = 0.1 and big number of outcomes it gives distribution very
> similar to zipfian (for number of outcomes = 100 000, bias -6% to 8% in
> whole range and for NOO = 1000 000, bias is -2% to 2%).

Ok, so you did not get the large bias for i=3. Strange.

> By number of outcomes (NOO) I mean how many times random_zipfian was
> called. For example: pgbench -f compte_bench.sql -t 100000 So, I think
> it works but works worse for small number of outcomes. And also we need
> to find optimal theta for better results.

Hmmm. I've run one million outcomes each time. I'll check on the next version.

-- Fabien.
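For context, the approximation in question appears to follow Gray et al., "Quickly Generating Billion-Record Synthetic Databases", SIGMOD 1994 (the method YCSB also uses); the quoted `cell->alpha = 1. / (1 - theta)` line is the division that rules out theta = 1. A hedged Python sketch, with illustrative names rather than the patch's C identifiers:

```python
import random

def generalized_harmonic(n, theta):
    # H_{n,theta} = sum_{i=1..n} 1/i^theta; O(n), which is why the patch caches it.
    return sum(1.0 / i ** theta for i in range(1, n + 1))

def zipf_gray(n, theta, rng=random):
    # Approximate bounded zipfian draw for theta in (0, 1), after Gray et al.,
    # SIGMOD 1994.  alpha = 1/(1 - theta) diverges at theta == 1.
    zetan = generalized_harmonic(n, theta)
    alpha = 1.0 / (1.0 - theta)
    eta = (1.0 - (2.0 / n) ** (1.0 - theta)) / (1.0 - generalized_harmonic(2, theta) / zetan)
    u = rng.random()
    uz = u * zetan
    if uz < 1.0:
        return 1
    if uz < 1.0 + 0.5 ** theta:
        return 2
    # min() clamps the u -> 1 edge case to the bound n.
    return min(n, 1 + int(n * (eta * u - eta + 1.0) ** alpha))
```

The method matches the zipfian probabilities exactly for ranks 1 and 2 and approximates the tail, which is consistent with the biases observed above appearing only for i >= 3.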
On Fri, Jul 14, 2017 at 6:37 AM, Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote:
> I am attaching results of tests for 32 and 128 clients that were running for 10 minutes, and TPS remains 305 and 115 ktps respectively.
>
> Tests were executed with the configuration set for YCSB. And there is a very aggressive autovacuum, which can be the reason why no decline appears with increasing working time.
> Autovacuum config:
>
> autovacuum_max_workers = 8
> autovacuum_naptime = 10s
> autovacuum_vacuum_scale_factor = 0.1
> autovacuum_vacuum_cost_delay = 0ms
> autovacuum_vacuum_cost_limit = 10000

I think that what this probably comes down to, more than anything else, is that you have leftmost hot/bloated leaf pages like this:

          idx          | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
-----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
 pgbench_accounts_pkey |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
 pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
 ...

The high key for the leftmost page shows that only values below 0x11 belong on the first page. This is about 16 or 17 possible distinct values, and yet the page has 100 live items and 41 dead items; in total, there is room for 367 items. It's only slightly better with other nearby pages. This is especially bad because once the keyspace gets split up this finely, it's *impossible* to reverse it -- it's more or less a permanent problem, at least until a REINDEX. You cannot merge pages back together until one is completely empty, which in this case and many cases will in fact never happen. Aggressive vacuuming is probably helpful in part because it prevents the problem from ever getting completely out of hand.
That doesn't seem like a very practical solution, though. We should probably be treating unique indexes in a special way, since inserting a new conflicting tuple necessarily supersedes whatever it conflicted with. Postgres holds on to the old tuple today, but only because the transaction might abort, or because an old snapshot might be interested in old tuple versions. However, the general pattern with unique indexes is that there must be one tuple visible to new snapshots, and old versions are almost certainly going to become garbage very quickly. Unique indexes really are quite special, which nbtree doesn't take advantage of at all.

If you read the coverage of B-Trees within "Transaction Processing: Concepts and Techniques", and many other publications, the general trend seems to be that unique indexes have truly unique keys, based only on the user-visible key values. They make a sharp distinction between primary and secondary indexes, which doesn't really exist in Postgres (at least, not at the access method level).

I suspect that the best long term solution is to add GIN-style duplicate handling within nbtree for unique indexes, with special pruning style optimizations to the posting list structure. This would probably only happen with unique indexes. The useful thing about this approach is that it separates these two problems:

1. Representing what values are in the index for lookup, and their latest row version.

2. Finding old row versions, in the less common case where you have an old snapshot.

With a design like that, nbtree would never "cut up the keyspace too finely" because of a temporary surge of UPDATE insertions. You still get bloat, but you add it to a place where garbage collection can be much better targeted. Under this scheme, it doesn't really matter so much if garbage collection doesn't happen very frequently, because there could be LP_DEAD style hints for the auxiliary posting list structure.
That could probably work based on atomic ops, and would greatly reduce the number of pages that UPDATE workloads like this dirtied. It probably wouldn't make sense to add things like GIN's compression of item pointers, since most data within the auxiliary posting list structure is removed very quickly. I wouldn't expect btree_gin to be faster for this workload today, because it doesn't have kill_prior_tuple/LP_DEAD support, and because it doesn't support unique indexes, and so cannot exploit the special situation that exists with unique indexes. -- Peter Geoghegan
Hello Fabien,
> On 14 Jul 2017, at 17:51, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Ok, so you did not get the large bias for i=3. Strange.

I got a large bias for i=3 and theta > 1 even with a million outcomes.

I am attaching patch v3. Among other things I fixed a small typo in the description of the random_exponential function in the pgbench.sgml file.
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
>> Ok, so you did not get the large bias for i=3. Strange.
>
> I got large bias for i=3 and theta > 1 even with a million outcomes,

Ok. So this is similar to what I got.

Is this bias expected from the drawing method, say because it is approximated and the approximation is weak at some points, or is there an issue with its implementation, say some shift which gets smoothed down for higher indexes?

> I am attaching patch v3. Among other things I fixed small typo in
> description of random_exponential function in pgbench.sgml file.

I'll look into that.

-- Fabien.
> On 17 Jul 2017, at 13:51, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Is this bias expected from the drawing method, say because it is approximated and the approximation is weak at some points, or is there an issue with its implementation, say some shift which gets smoothed down for higher indexes?

I have checked the paper where this implementation was proposed, and there theta is allowed only in the range between 0 and 1. It seems it is not guaranteed to work well when theta is more than 1. I am attaching the paper, see page 23.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Hello,

>> Is this bias expected from the drawing method, say because it is
>> approximated and the approximation is weak at some points, or is there
>> an issue with its implementation, say some shift which gets smoothed
>> down for higher indexes?
>
> I have checked paper where such implementation was proposed and there
> theta allowed only on range between 0 and 1. It seems like it is not
> guaranteed that it should work well when theta is more than 1.

Ok.

I see a significant issue with having a random_zipfian function which does not really return a zipfian distribution under some parameter values. If there is no better alternative, I would suggest restricting the parameter to values between 0 and 1, or finding a better approximation for theta >= 0.

> I am attaching paper, see page 23.

Thanks for the paper. It reminds me that I intended to propose a parametric pseudo-random permutation for pgbench, some day.

-- Fabien.
Hello Alik,

> I am attaching patch v3.
> Among other things I fixed small typo in description of
> random_exponential function in pgbench.sgml file.

Ok. Probably this typo should be committed separately and independently.

A few comments about v3: the patch applies cleanly, compiles, works.

About the maths: as already said, I'm not at ease with a random_zipfian function which does not display a (good) zipfian distribution. At the minimum the documentation should be clear about the approximations implied depending on the parameter value.

In the literature the theta parameter seems to be often called alpha or s (eg see https://en.wikipedia.org/wiki/Zipf%27s_law). I would suggest to stick to "s" instead of "theta"?

About the code: it looks simpler than the previous version, which is good.

Double constants should be used when the underlying type is a double, instead of relying on implicit int to double promotion (0 -> 0.0, 1 -> 1.0).

Function zipfZeta(n, theta) does not really compute the zeta(n) function, so I think that a better name should be chosen. It seems to compute H_{n,theta}, the generalized harmonic number. Idem the "thetan" field in the struct.

The handling of cache overflow by randomly removing one entry looks like a strange idea. Rather remove the oldest entry?

ISTM that it should print a warning once if the cache array overflows, as performance would drop heavily.

If the zipf cache is constant size, there is no point in using dynamic allocation, just declare an array...

Comments need updating: eg "theta parameter of previous execution", which dates back to when there was only one value. There should be a comment to explain that the structure is in the thread for thread safety.

There should be non-regression tests somehow. If the "improve pgbench tap test infrastructure" patch gets through, things should be added there.

-- Fabien.
Hello Fabien,
I am attaching patch v4.
> On 19 Jul 2017, at 17:21, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> About the maths: As already said, I'm not at ease with a random_zipfian function which does not display a (good) zipfian distribution. At the minimum the documentation should be clear about the approximations implied depending on the parameter value.
I added one more sentence to the documentation to emphasize that the degree of proximity depends on the parameter. I also restricted the parameter: it can now only be in the range (0; 1).
> In the literature the theta parameter seems to be often called alpha or s
> (eg see https://en.wikipedia.org/wiki/Zipf%27s_law). I would suggest to
> stick to "s" instead of "theta"?
I have renamed it to “s”.
> Function zipfZeta(n, theta) does not really compute the zeta(n) function,
> so I think that a better name should be chosen. It seems to compute
> H_{n,theta}, the generalized harmonic number. Idem the "thetan" field in the struct.
Renamed zipfZeta to zipfGeneralizedHarmonic, zetan to harmonicn.
> The handling of cache overflow by randomly removing one entry looks like
> a strange idea. Rather remove the oldest entry?
Replaced it with a least-recently-used (LRU) replacement algorithm.
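To illustrate the idea, here is a minimal sketch of such an LRU cache of generalized harmonic numbers keyed by (n, s), using a logical clock rather than wall-clock time, much as the patch's counters do. Class and field names are hypothetical, not the patch's C identifiers:

```python
ZIPF_CACHE_SIZE = 15  # mirrors the constant discussed in this thread

class ZipfCache:
    # Sketch of an LRU cache of generalized harmonic numbers keyed by (n, s).
    def __init__(self, size=ZIPF_CACHE_SIZE):
        self.size = size
        self.cells = {}       # (n, s) -> H_{n,s}
        self.last_used = {}   # (n, s) -> logical time of last access
        self.clock = 0
        self.overflows = 0    # could feed a warning in the final report

    def get(self, n, s):
        self.clock += 1
        key = (n, s)
        if key not in self.cells:
            if len(self.cells) >= self.size:
                # Evict the least recently used cell.
                victim = min(self.last_used, key=self.last_used.get)
                del self.cells[victim], self.last_used[victim]
                self.overflows += 1
            # Computing H_{n,s} is O(n), which is the cost the cache avoids.
            self.cells[key] = sum(1.0 / i ** s for i in range(1, n + 1))
        self.last_used[key] = self.clock
        return self.cells[key]
```

With this shape, repeated random_zipfian calls with the same (n, s) hit the cache, and only a mix of more than ZIPF_CACHE_SIZE distinct parameter pairs forces evictions.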
> ISTM that it should print a warning once if the cache array overflows as performance would drop heavily.
Now it prints a warning message if the array overflows. To print the message only once, it uses a global flag shared by all threads, so theoretically the message can still be printed more than once.
That could be solved easily using pg_atomic_test_set_flag() from src/include/port/atomics.h, but it cannot be used in pgbench because of the following lines of code there:
#ifdef FRONTEND
#error "atomics.h may not be included from frontend code"
#endif
Or it could be fixed by using mutexes from pthread, but I think the code would become less readable and more complex in that case. So, should I spend time on solving this issue?
> If the zipf cache is constant size, there is no point in using dynamic allocation, just declare an array…
Fixed. Is ZIPF_CACHE_SIZE = 15 ok?
> There should be non-regression tests somehow. If the "improve pgbench
> tap test infrastructure" patch gets through, things should be added there.
I will send tests later, as a separate patch.
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Hello Alik,

>> About the maths: As already said, I'm not at ease with a random_zipfian
>> function which does not display a (good) zipfian distribution. At the
>> minimum the documentation should be clear about the approximations
>> implied depending on the parameter value.
>
> I add one more sentence to documentation to emphasize that degree of
> proximity depends on parameter.
> And also I made restriction on
> parameter, now it can be only in range (0; 1)

Hmmm. On second thought, maybe one or the other is enough: either restrict the parameter to values where the approximation is good, or put out clear documentation about when the approximation is not very good, but may still be useful even if not precise. So I would be in favor of expanding the documentation but not restricting the parameter beyond avoiding value 1.0.

>> [...]
> Now it prints warning message if array overflowed. To print message only
> one time, it uses global flag, which is available for all threads. [...]
> So, should I spend time on solving this issue?

No. Just write a comment.

>> If the zipf cache is constant size, there is no point in using dynamic
>> allocation, just declare an array…
>
> Fixed. Does ZIPF_CACHE_SIZE = 15 is ok?

My guess is yes, till someone complains that it is not the case ;-)

>> There should be non regression tests somehow. If the "improve pgbench
>> tap test infrastructure" get through, things should be added there.
>
> I will send tests later, as separate patch.

I think that developing a test would be much simpler with the improved tap test infrastructure, so I would suggest to wait to know the result of the corresponding patch.

Also, could you record the patch to CF 2017-09? https://commitfest.postgresql.org/14/

-- Fabien.
> I think that developing a test would be much simpler with the improved tap test infrastructure, so I would suggest to wait to know the result of the corresponding patch.
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
> Hmmm. On second thought, maybe one or the other is enough, either restrict the parameter to values where the approximation is good, or put out a clear documentation about when the approximation is not very good, but it may be still useful even if not precise. So I would be in favor of expanding the documentation but not restricting the parameter beyond avoiding value 1.0.
I have removed the restriction and expanded the documentation in the attached patch v5.

Also, I have recorded the patch in CF 2017-09 — https://commitfest.postgresql.org/14/1206/.
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Hello!

I realized that I was sending emails as HTML, so the latest patch is not visible in the archive. That's why I am attaching it again. I am sorry for that.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
On Fri, Jul 21, 2017 at 4:51 AM, Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote: > (Latest version of pgbench Zipfian patch) While I'm +1 on this idea, I think that it would also be nice if there was an option to make functions like random_zipfian() actually return a value that has undergone perfect hashing. When this option is used, any given value that the function returns would actually be taken from a random mapping to some other value in the same range. So, you can potentially get a Zipfian distribution without the locality. While I think that Zipfian distributions are common, it's probably less common for the popular items to be clustered together within the primary key, unless the PK has multiple attributes, and the first is the "popular attribute". For example, there would definitely be a lot of interest among contiguous items within an index on "(artist, album)" where "artist" is a popular artist, which is certainly quite possible. But, if "album" is the primary key, and it's a SERIAL PK, then there will only be weak temporal locality within the PK, and only because today's fads are more popular than the fads from previous years. Just another idea, that doesn't have to hold this patch up. -- Peter Geoghegan
Hello Peter,

> I think that it would also be nice if there was an option to make
> functions like random_zipfian() actually return a value that has
> undergone perfect hashing. When this option is used, any given value
> that the function returns would actually be taken from a random mapping
> to some other value in the same range. So, you can potentially get a
> Zipfian distribution without the locality.

I definitely agree. This is a standard problem with all non-uniform random generators in pgbench, namely random_{gaussian,exponential}.

However hashing is not a good solution on a finite domain because of the significant collision rate, so that typically 1/3 of values are empty and collisions cause spikes. Also, collisions would break PKs.

The solution is to provide a (good) pseudo-random parametric permutation, which is non-trivial especially for sizes that are not powers of two, so ISTM that it should be a patch on its own. The good news is that it is on my todo list and I have some ideas on how to do it. The bad news is that, given the rate at which I succeed in getting things committed in pgbench, it might take some years :-( For instance, simplistic functions and operators to extend the current set have been in the pipe for 15 months and missed pg10.

-- Fabien.
Hello Alik,

>> So I would be in favor of expanding the documentation but not
>> restricting the parameter beyond avoiding value 1.0.
>
> I have removed restriction and expanded documentation in attaching patch v5.

I've done some math investigations, which consisted in spending one hour with Christian, a statistician colleague of mine. He took an old book off a shelf, opened it to page 550 (roughly in the middle), and explained to me how to build a real zipfian distribution random generator.

The iterative method is for parameter a > 1 and works for unbounded values. It is simple to add a bound. In practice the iterative method is quite effective, i.e. the number of iterations is typically small, at least if the bound is large and if parameter a is not too close to 1.

I've attached a python3 script which implements the algorithm. It looks like magic. Beware that a C implementation should take care of float and int overflows.

# usage: a, #values, #tests
sh> zipf.py 1.5 1000 1000000
# after 1.7 seconds
c = [391586, 138668, 75525, 49339, 35222, 26621, ...
... 11, 13, 12, 11, 16]
(1.338591 iterations per draw)

sh> zipf.py 1.1 1000 1000000
# after 3.1 seconds
c = [179302, 83927, 53104, 39015, 30557, 25164, ...
... 82, 95, 93, 81, 80]
(2.681451 iterations per draw)

I think that this method should be used for a > 1, and the other very rough one can be kept for parameter a in [0, 1), a case which does not make much sense to a mathematician as it diverges if unbounded.

-- Fabien.
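The rejection method described above can be sketched as follows. This is an illustration of the Devroye algorithm ("Non-Uniform Random Variate Generation", pp. 550-551), not the attached zipf.py itself; the bound is enforced by rejecting draws above n, and the function name is illustrative:

```python
import random

def zipf_rejection(n, a, rng=random):
    # Bounded zipfian draw for a > 1 by rejection (Devroye, pp. 550-551).
    am1 = a - 1.0
    b = 2.0 ** am1          # the overflow-prone pow(2, a - 1) a C port must guard
    while True:
        u = 1.0 - rng.random()          # in (0, 1], avoids u == 0
        v = rng.random()
        x = int(u ** (-1.0 / am1))      # floor of the dominating draw
        t = (1.0 + 1.0 / x) ** am1
        # Devroye's acceptance test, plus the truncation to [1, n].
        if v * x * (t - 1.0) / (b - 1.0) <= t / b and x <= n:
            return x
```

As noted above, the expected number of iterations per draw stays small for a not too close to 1, and a C implementation must watch the float overflow in the 2^(a-1) term.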
Hello Fabien,

> On 5 Aug 2017, at 12:15, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> I've done some math investigations, which consisted in spending one hour with Christian, a statistician colleague of mine. He took an old book out of a shelf, opened it to page 550 (roughly in the middle), and explained to me how to build a real zipfian distribution random generator.
>
> The iterative method is for parameter a > 1 and works for unbounded values. It is simple to add a bound. In practice the iterative method is quite effective, i.e. the number of iterations is typically small, at least if the bound is large and if parameter a is not too close to 1.
>
> I've attached a python3 script which implements the algorithm. It looks like magic. Beware that a C implementation should take care of float and int overflows.

Thank you for the script. I will rewrite it in C and add it to the patch soon.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Hello Fabien,

> I think that this method should be used for a > 1, and the other very rough one can be kept for parameter a in [0, 1), a case which does not make much sense to a mathematician as it diverges if unbounded.

Now “a” does not have an upper bound, so with the iterative algorithm and a >= 10000 the program gets stuck in an infinite loop because of the following line of code:

double b = pow(2.0, s - 1.0);

After the overflow, “b” becomes “+Inf”. So should an upper bound for “a” be set?

Should I mention in the docs that two algorithms are used, depending on the value of a (s/theta)?

In the attached patch, I have added a computeIterativeZipfian method and its usage in getZipfianRand. Would it be better to move the cache-based computation into a new method, so that getZipfianRand contains only the two computeXXXZipfian method calls?

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Hello Alik,

> Now “a” does not have an upper bound, so with the iterative algorithm and a >= 10000 the program gets stuck in an infinite loop because of the following line of code:
>
> double b = pow(2.0, s - 1.0);
>
> After the overflow, “b” becomes “+Inf”.

Yep, overflow can happen.

> So should an upper bound for “a" be set?

Yes, I agree. a >= 10000 does not make much sense... If you want uniform you should use random(), not call random_zipfian with a = 10000. Basically it suggests that too large values of "a" should be rejected. Not sure where to put the limit, though.

> Should I mention in the docs that two algorithms are used,
> depending on the value of a (s/theta)?

Yes, as a general principle I think that the documentation should reflect the implementation.

> In the attached patch, I have added a computeIterativeZipfian method and its
> usage in getZipfianRand. Would it be better to move the cache-based computation
> into a new method, so that getZipfianRand contains only the two
> computeXXXZipfian method calls?

I have not looked in detail, but from what you say I would agree that the implementation should be symmetric, so having one function calling one method or the other sounds good.

-- Fabien.
Hello Fabien,

I am attaching patch v7.

> Yes, I agree. a >= 10000 does not make much sense... If you want uniform you should use random(), not call random_zipfian with a = 10000. Basically it suggests that too large values of "a" should be rejected. Not sure where to put the limit, though.

I set the upper bound for the parameter to 1000.

> Yes, as a general principle I think that the documentation should reflect the implementation.

The documentation has been updated: I removed the algorithm descriptions and put references to them there.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Hello Alik,

> I am attaching patch v7.

Patch generates multiple warnings with "git apply", apparently because of end-of-line spaces, and fails:

pgbench-zipf-07v.patch:52: trailing whitespace.
 {
pgbench-zipf-07v.patch:53: trailing whitespace.
  "random_zipfian", 3, PGBENCH_RANDOM_ZIPFIAN
pgbench-zipf-07v.patch:54: trailing whitespace.
 },
pgbench-zipf-07v.patch:66: trailing whitespace.
#define ZIPF_CACHE_SIZE 15 /* cache cells number */
pgbench-zipf-07v.patch:67: trailing whitespace.
error: patch failed: doc/src/sgml/ref/pgbench.sgml:1013
error: doc/src/sgml/ref/pgbench.sgml: patch does not apply
error: patch failed: src/bin/pgbench/exprparse.y:191
error: src/bin/pgbench/exprparse.y: patch does not apply
error: patch failed: src/bin/pgbench/pgbench.c:93
error: src/bin/pgbench/pgbench.c: patch does not apply
error: patch failed: src/bin/pgbench/pgbench.h:75
error: src/bin/pgbench/pgbench.h: patch does not apply

It also complains with "patch", although it seems to work:

(Stripping trailing CRs from patch; use --binary to disable.)
patching file doc/src/sgml/ref/pgbench.sgml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/pgbench/exprparse.y
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/pgbench/pgbench.c
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/pgbench/pgbench.h
patch unexpectedly ends in middle of line
patch unexpectedly ends in middle of line

Please do not put spaces at the end of lines; e.g. there is a lonely tab on the second line of "computeHarmonicZipfian".
It seems that "CR" characters are used:

sh> sha1sum ~/pgbench-zipf-07v.patch
439ad1de0a960b3b415ff6de9412b3cc4d6922e7
sh> file ~/pgbench-zipf-07v.patch
pgbench-zipf-07v.patch: unified diff output, ASCII text, with CRLF line terminators

Compilation issues one warning:

pgbench.c: In function ‘computeHarmonicZipfian’:
pgbench.c:894:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  double uniform = pg_erand48(thread->random_state);

Please conform to the pg standard for portability, even if it is a very old one :-)

About the documentation:

I would suggest to replace 0.5 in the first example by 1.5, as a zipfian distribution with a lower-than-one parameter is not that well defined.

I'm fine with using references to papers or books for the algorithm.

The documentation lines in the sgml file are too long. At least limit text paragraphs to maybe 80 columns, as done elsewhere.

typo: "<entry> Zipfian" (remove space)
typo: "used(described" (missing space)
typo: "numbers(algorithm" (again)

The sentence "It's recommended to pick <replaceable>parameter</> in range (0, 1) for better accuracy." does not make sense with the updated generation algorithm.

The text would benefit from a review by a native English speaker, who I am not. Anyway here is an attempt at improving/simplifying the text (I have removed sgml formatting). If some native English speaker could review it, that would be great.

"""
"random_zipfian" generates an approximated bounded zipfian distribution. For a parameter in (0, 1), an approximated algorithm is taken from "Quickly Generating Billion-Record Synthetic Databases", Jim Gray et al, SIGMOD 1994. For a parameter in (1, 1000), a rejection method is used, based on "Non-Uniform Random Variate Generation", Luc Devroye, p. 550-551, Springer 1986. The distribution is not defined when the parameter's value is 1.0. The drawing performance is poor for parameter values close to and above 1.0 and on a small range.
"parameter" defines how skewed the distribution is. The larger the "parameter", the more frequently values close to the beginning of the interval are drawn. The closer to 0 "parameter" is, the flatter (more uniform) the access distribution.
"""

English in comments and code:

"inited" is not English; I think you want "initialized". Maybe "nb_cells" would do.

"it is not exist" (multiple instances) -> "it does not exist".

I think that the references to the paper/book should appear as comments in the code, for instance in front of the functions which implement the algorithm.

"current_time" and "last_touch_time" are counter based, they are not "time". I suggest to update the names to "current" and "last_touch" or "last_used". "time when cell was used last time" -> "last used logical time".

About the code:

I would prefer that you use "1.0" instead of "1." for double constants.

I think that getZipfianRand should be symmetric. Maybe a direct formula could do:

return min - 1 + ((s > 1) ? xxx() : yyy());

or at least use an explicit "else" for symmetry.

Function "zipfn" asks for s and n as arguments, but ISTM that they are already included as fields in cell, so this seems redundant. Moreover I do not think that the zipfn function is that useful. I would suggest to inline it in its caller.

"zipfGeneralizedHarmonic" is not really related to zipf, it is just used there. I suggest to call it "generalizedHarmonicNumber" to reflect what it does.

I suggest to put all LRU management in the getCell* function, that is to move the last_use update from computeHarmonicZipfian to getCell.

computeHarmonicZipfian and computeIterativeZipfian should have the same signature, not one with a random state and the other with the thread. I suggest to do the same as the other random functions, whatever that is.

Otherwise, no action required: the patch probably conflicts with other patches in the CF, which means that rebases will be needed. That is life. The queue is long and slow.
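The suggested symmetric dispatch could look like this sketch. The two generator functions are stand-ins with hypothetical names (here each is just an exact inverse-CDF draw), since only the shape of the dispatch matters:

```python
import random

def _draw_exact(rng, n, s):
    # Stand-in generator: exact inverse-CDF draw of a bounded zipfian.
    weights = [1.0 / i ** s for i in range(1, n + 1)]
    u = rng.random() * sum(weights)
    for rank, w in enumerate(weights, start=1):
        u -= w
        if u <= 0.0:
            return rank
    return n

def compute_harmonic_zipfian(rng, n, s):    # would be the s in (0, 1) approximation
    return _draw_exact(rng, n, s)

def compute_iterative_zipfian(rng, n, s):   # would be the s > 1 rejection method
    return _draw_exact(rng, n, s)

def get_zipfian_rand(rng, lo, hi, s):
    # Symmetric dispatch between the two methods, one call shape for both.
    n = hi - lo + 1
    draw = compute_iterative_zipfian if s > 1.0 else compute_harmonic_zipfian
    return lo - 1 + draw(rng, n, s)
```

Giving both branches the same signature is what makes the one-line dispatch possible, which is the point of the review comment.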
Also, this function deserves some tap tests, which should be added if the "pgbench tap test" patch gets through. Otherwise it is as well tested as everything else around, which is basically not at all.

-- Fabien.
Hello Fabien,

Thank you for the detailed review. I hope I have fixed all the issues you mentioned in your letter.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Hello Alik,

Applies, compiles, works for me. Some minor comments and suggestions.

Two typos:
- "usinng" -> "using"
- "a rejection method used" -> "a rejection method is used"

I'm not sure of "least_recently_used_i"; this naming style is not used in pgbench. "least_recently_used" would be ok.

"..nb_cells.. != ZIPF_CACHE_SIZE": ISTM that "<" is more logical, even if the result is the same?

I would put the parameter value check in getZipfianRand, so that if someone reuses the function elsewhere the check is also performed. That would also simplify a bit the already very large expression evaluation function.

When/if the pgbench tap test patch gets through, coverage tests should be added.

Maybe the cache overflow could be counted, to allow for a possible warning message in the final report?

-- Fabien.
> On 06 Sep 2017, at 08:42, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Hello Alik,
>
> Applies, compiles, works for me.
>
> Some minor comments and suggestions.
>
> Two typos:
> - "usinng" -> "using"
> - "a rejection method used" -> "a rejection method is used"
>
> I'm not sure of "least_recently_used_i", this naming style is not used in pgbench. "least_recently_used" would be ok.
>
> "..nb_cells.. != ZIPF_CACHE_SIZE", ISTM that "<" is more logical,
> even if the result is the same?
>
> I would put the parameter value check in getZipfianRand, so that if someone reuses the function elsewhere the check is also performed. That would also simplify a bit the already very large expression evaluation function.
>
> When/if the pgbench tap test patch gets through, coverage tests should
> be added.
>
> Maybe the cache overflow could be counted, to allow for a possible warning message in the final report?

Since this patch has been Waiting for author and no update has been posted during the commitfest, it is Returned with feedback. When you have a new version of the patch, please re-submit it to a future commitfest.

cheers ./daniel
Hello Fabien, Sorry for not responding for a long time. > Two typos: > - "usinng" -> "using" > - "a rejection method used" -> "a rejection method is used" > > I'm not sure of "least_recently_used_i", this naming style is not used in pgbench. "least_recently_used" would be ok. > > "..nb_cells.. != ZIPF_CACHE_SIZE", ISTM that "<" is more logical, > even if the result is the same? Fixed. > Maybe the cache overflow could be counted, to allow for a possible warning message in the final report? Also done. But one question: should it be done by printing to stdout or stderr? > When/if the pgbench tap test patch gets through, coverage tests should > be added. I have added a few tests and I am not sure whether they are OK or not. It was my first experience with Perl and tests in PostgreSQL. — Thanks and Regards, Alik Khilazhev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Alik, Patch applies with "patch", but not with "git apply", probably because it is in CR-NL eol format. No big deal. Documentation has been switched from SGML to XML, so now tags must be explicitly closed, i.e. use <foo>x</foo> instead of <foo>x</>. Patch compiles with a warning: pgbench.c: In function ‘getZipfianRand’: pgbench.c:905:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement] Declaration must move before assertion. The random_zipfian(1,100,2) call result is not tested. "1 times" -> "once". In "array overflow on \random_zipfian", remove the spurious "\". -- Fabien.
> Patch applies with "patch", but not with "git apply", probably because it is in CR-NL eol format. No big deal. > > Documentation has been switched from SGML to XML, so now tags must be explicitly closed, i.e. use <foo>x</foo> instead of <foo>x</>. > > Patch compiles with a warning: > > Declaration must move before assertion. > > "array overflow on \random_zipfian" remove spurious "\". Fixed. > The random_zipfian(1,100,2) call result is not tested. I have reduced the range to [1, 10] and updated the test. > "1 times" -> "once". I am not sure that it needs to be fixed in this way, because there can be another number instead of 1. So, I updated it to "%d time(s)". — Thanks and Regards, Alik Khilazhev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Hello Alik, Patch applies, compiles, works (I checked the distribution). Doc gen ok. make check ok. > I have reduced range to [1, 10] and updated the test. I would suggest using [1, 9] and the simpler regex [1-9] instead of the full list. Also, there should be an empty line between functions. I just noticed that one is missing after zipfFindOrCreateCacheCell(). -- Fabien.
> I would suggest to use [1, 9] and the simpler regex [1-9] instead of the full list. > > Also, there should be an empty line between functions. I just noticed that one is missing after zipfFindOrCreateCacheCell(). Fixed. — Thanks and Regards, Alik Khilazhev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
> Fixed. Applies, compiles, make check ok, doc ok. Note that the patch may interact with other patches which add functions to pgbench, so it might need a rebase depending on the order in which the patches are applied. We'll see. I have added the thread to the next CF (January 2018), and marked it as "ready for committer". -- Fabien.
> I have added the thread to the next CF (January 2018), and marked it as "ready for committer". That’s cool, thank you for the review! — Thanks and Regards, Alik Khilazhev Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
> Note that the patch may interact with other patches which add functions to > pgbench, so might need a rebase depending on the order in which the patch are > applied. Attached a minor rebase after 16827d4424. -- Fabien.
Thank you for all, pushed Fabien COELHO wrote: > >> Note that the patch may interact with other patches which add functions to >> pgbench, so might need a rebase depending on the order in which the patch are >> applied. > > Attached a minor rebase after 16827d4424. > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Fri, Jul 7, 2017 at 12:45 AM Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote: > PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% UPDATE of random row by PK) on benchmarking withbig number of clients using Zipfian distribution. MySQL also has decline but it is not significant as it is in PostgreSQL.MongoDB does not have decline at all. And if pgbench would have Zipfian distribution random number generator,everyone will be able to make research on this topic without using YCSB. I've been thinking about this again, in light of my recent work to improve the B-Tree code by making all entries have fully unique keys, adding suffix truncation, and making better choices about split points [1]. Somebody posted a patch that fixed the issue by making the heavyweight lock manager lock a value rather than a heap TID within the last year. I can't find that thread right now -- perhaps someone can point it out now. I suspect that my patch will have a similar effect to this other patch I'm thinking of with the Zipfian workload, though it will do so without touching anything outside of the B-Tree code, and without changing the user-visible semantics of locking. Can we possibly find a way to rerun this Zipfian experiment/test case? I bet Alexander's recent B-Tree patch will also help. Here is why I suspect my patch will help with the Zipfian workload, particularly on a multi-socket machine: The logic for following downlinks in Postgres is that downlinks are a lower bound, but we still cannot follow them when the key is equal to our insertion scankey -- we must go left, even when we have all keys filled out. This means that we sometimes end up holding an exclusive buffer lock on "the first leaf page the value could be on", even though that page doesn't have any such values (they're all in later pages, to the left). 
So we accidentally lock out insertions of the value 1 when we insert the value 2, and of the value 2 when we insert the value 3, and of the value 3 when we insert the value 4, and so on down the line. This is the worst possible thing for the Zipfian test case, I think, since most of the contention is artificial. I think that my patch may ameliorate the problem, because: * We make the tie-breaker heap TID attribute have DESC sort order, so generally speaking contention starts on the leftmost page among leaf pages full of duplicates -- for reads, and for writes. * The patch works very hard to make sure that large groups of duplicates are not spread across pages. There would be no mixing of leaf pages containing the value 1 and the value 2 with the Zipfian test case, for example. Old duplicates would be packed into full pages. * We can invent a fake lower-than-any-real-value heap TID in the insertion scankey in certain cases that don't have one. This way, the scan key as a whole is *not* equal to pivot tuples that are at least equal on non-heap-TID attributes, so we *can* follow a downlink that is "equal" to our scankey, provided it has a truncated-away heap TID attribute value. In short, the artificial "false sharing" that we see in the B-Tree for the Zipfian test case may go away. There will be *no* contention between an inserter (or really UPDATE-er) that inserts the value 2 with concurrent inserters of the value 1. All of the ways in which that was likely to happen are each fixed by the patch. There is still buffer lock contention on heap pages, and the contention of waiting for a row lock. Maybe the busy waiting will automatically be fixed, though, since you have one point of contention for each of the really hot values at the far left of the leaf level. [1] https://commitfest.postgresql.org/20/1787/ -- Peter Geoghegan
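[Editor's note] The "go left on equality" point can be illustrated with a toy model (plain Python, not the nbtree code; it represents item pointers as numbers, treats a truncated-away pivot attribute as minus infinity, and ignores the DESC-order detail for simplicity). Giving the insertion scankey a sentinel TID that sorts above a truncated pivot TID makes every comparison strict, so a downlink whose separator is equal on the key attribute can be followed instead of forcing a move left:

```python
TRUNCATED_TID = float("-inf")  # truncated-away pivot attribute: below everything
SCANKEY_TID = -1.0             # fake scankey TID: above TRUNCATED_TID,
                               # below any real item pointer (modelled >= 0.0)

def choose_downlink(pivots, key):
    """pivots: ascending list of (key, tid) separator tuples; downlink i
    covers tuples >= pivots[i-1], and downlink 0 is the leftmost child.
    Descend via the last pivot <= the scankey.  Because the scankey
    sentinel compares strictly above a truncated TID, a pivot equal on
    the key attribute is still followable -- the old "go left on
    equality" rule never fires for truncated pivots."""
    scankey = (key, SCANKEY_TID)
    child = 0
    for i, pivot in enumerate(pivots, start=1):
        if pivot <= scankey:
            child = i
    return child
```

With pivots `[(2, TRUNCATED_TID), (4, TRUNCATED_TID)]`, a search for key 2 follows downlink 1 (the child that actually holds the 2s) rather than descending to child 0 and contending on an unrelated page; a pivot carrying a real TID, such as `(2, 7.0)`, still sends the search left, since earlier duplicates of 2 may live on that child.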