Thread: FILLFACTOR and increasing index
Hi, the doc pages are somehow "cryptic" regarding FILLFACTOR. (well, at least they're cryptic to me, since I don't know a lot of btree stuff...) I have an index on a timestamp value that is inserted, for 90% of the inserts, in increasing order. No updates, no deletes on the table (appends only). What would be the "right" or "sensible" FILLFACTOR in this case? I guess the docs need some example of "sensible" values for some of the "classic" cases? (I'm sure I'm not the only one storing a timestamp index in time-order...) Leonardo
> > I have an index on a timestamp value that is inserted, for 90% > > of the inserts, in increasing order. No updates, no deletes on the > > table (appends only). > > The bit about "increasing order" is a red herring here. If you have > no updates, then you can leave the FILLFACTOR alone. > > FILLFACTOR controls how much extra room there is in the way the table > is stored, so that if a row is UPDATEd it might be possible to store > the row in the same disk page. This alleviates certain pathological > conditions with high-UPDATE tables and the way Postgres stores the > data (the non-overwriting storage manager). (please add the list when replying to emails) I'm talking about the index fillfactor, not the table fillfactor...
On Mon, May 9, 2011 at 3:32 PM, Leonardo Francalanci <m_lists@yahoo.it> wrote: >> > I have an index on a timestamp value that is inserted, for 90% >> > of the inserts, in increasing order. No updates, no deletes on the >> > table (appends only). >> >> The bit about "increasing order" is a red herring here. If you have >> no updates, then you can leave the FILLFACTOR alone. >> >> FILLFACTOR controls how much extra room there is in the way the table >> is stored, so that if a row is UPDATEd it might be possible to store >> the row in the same disk page. This alleviates certain pathological >> conditions with high-UPDATE tables and the way Postgres stores the >> data (the non-overwriting storage manager). > > > (please add the list when replying to emails) > > I'm talking about the index fillfactor, not the table fillfactor... It will be really useful to see some test results where you alter the fillfactor and report various measurables. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> It will be really useful to see some test results where you alter the > fillfactor and report various measurables. It's not that easy... stressing "only" the index insertion speed won't be simple. I would have liked some "theory"... The docs seem to imply there are some guidelines, it's just that it's too cryptic: "for heavily updated tables a smaller fillfactor is better to minimize the need for page splits" "heavily updated" -> does it mean tables that are inserted/updated or only "updated"??? "leaf pages are filled to this percentage [...] when extending the index at the right (adding new largest key values)." Does it mean that since I will (almost) always add new largest key values, I should have a big or small FILLFACTOR??? I know that theory is one thing and real testing another; but I can't test everything; if there are some (proved?) guidelines I'd like to use them (example: I'm not going to test that fillfactor in table creation in my case won't make any difference in performance; I trust the docs and the fact that "it makes sense").
Dne 9.5.2011 17:25, Leonardo Francalanci napsal(a): >> It will be really useful to see some test results where you alter the >> fillfactor and report various measurables. > > > It's not that easy... stressing "only" the index insertion > speed won't be simple. I would have liked some "theory"... > The docs seem to imply there are some guidelines, it's > just that it's too cryptic: > > "for heavily updated tables a smaller fillfactor is better > to minimize the need for page splits" > > > "heavily updated" -> does it mean tables that are inserted/updated > or only "updated"??? Well, an UPDATE is actually DELETE+INSERT (that's how PostgreSQL MVCC works). It may be a bit more complicated with HOT, but that's not your case, as you're only inserting data. > "leaf pages are filled to this percentage [...] when extending the index > at the right (adding new largest key values)." Hmmm, not sure how exactly this works, but I guess that if you're only inserting data then fillfactor=100 is the right thing. I believe it kicks in only when you need to insert data into an 'old' leaf page. If the page is full, then it needs to be split but if you reserve some free space (using e.g. fillfactor=80) then the split is not needed. > Does it mean that since I will (almost) always add new largest key > values, I should have a big or small FILLFACTOR??? I'd go with the fillfactor=100. > I know that theory is one thing and real testing another; but I can't > test everything; if there are some (proved?) guidelines I'd like to > use them (example: I'm not going to test that fillfactor in table creation > in my case won't make any difference in performance; I trust the > docs and the fact that "it makes sense"). Yes, I use the same approach, but I'm not aware of any such guideline related to fillfactor with indexes. Anyway those guidelines need to be written by someone, so you have a great opportunity ;-) regards Tomas
Dne 9.5.2011 17:25, Leonardo Francalanci napsal(a): > I know that theory is one thing and real testing another; but I can't > test everything; if there are some (proved?) guidelines I'd like to > use them (example: I'm not going to test that fillfactor in table creation > in my case won't make any difference in performance; I trust the > docs and the fact that "it makes sense"). > Anyway testing this (with the 'insert only' workload) may be quite simple: ================= fillfactor = 100 ==================== testdb=# create table test_fill (id int); CREATE TABLE Time: 2,515 ms testdb=# create index test_fill_idx on test_fill(id) with (fillfactor=100); CREATE INDEX Time: 10,331 ms testdb=# insert into test_fill select i from generate_series(1,1000000) s(i); INSERT 0 1000000 Time: 11542,512 ms testdb=# select relpages from pg_class where relname = 'test_fill_idx'; relpages ---------- 1977 (1 row) ==================== fillfactor = 70 ======================== testdb=# create table test_fill (id int); CREATE TABLE Time: 1,382 ms testdb=# create index test_fill_idx on test_fill(id) with (fillfactor=70); CREATE INDEX Time: 10,296 ms testdb=# insert into test_fill select i from generate_series(1,1000000) s(i); INSERT 0 1000000 Time: 11117,398 ms testdb=# select relpages from pg_class where relname = 'test_fill_idx'; relpages ---------- 2819 (1 row) ============================================================ So there seems to be no difference in insert performance (the INSERT takes about 11s in both cases), but the size of the index with fillfactor=70 needs much more space. So with the insert only (in ascending order) workload, I'd go with fillfactor=100 (or you may leave it at 90, which is the default value, the difference will be negligible). regards Tomas
> Yes, I use the same approach, but I'm not aware of any such guideline > related to fillfactor with indexes. Anyway those guidelines need to be > written by someone, so you have a great opportunity ;-) I did a quick test using your example. As in your test, "increasing" values don't get any gain from a different fillfactor. I tried a random index: create table test_fill (id int); create index test_fill_idx on test_fill(id) with (fillfactor=100); insert into test_fill select (random()*100000)::integer from generate_series(1,10000000) i; time: 373936.724 drop table test_fill; create table test_fill (id int); create index test_fill_idx on test_fill(id) with (fillfactor=50); insert into test_fill select (random()*100000)::integer from generate_series(1,10000000) i; time: 393209.911 not much difference... Now I'm getting confused... is which cases fillfactor makes a difference???
>> Yes, I use the same approach, but I'm not aware of any such guideline >> related to fillfactor with indexes. Anyway those guidelines need to be >> written by someone, so you have a great opportunity ;-) > > > I did a quick test using your example. As in your test, "increasing" > values don't get any gain from a different fillfactor. > I tried a random index: > > create table test_fill (id int); > create index test_fill_idx on test_fill(id) with (fillfactor=100); > insert into test_fill select (random()*100000)::integer from > generate_series(1,10000000) i; > > > time: 373936.724 > > drop table test_fill; > create table test_fill (id int); > create index test_fill_idx on test_fill(id) with (fillfactor=50); > > insert into test_fill select (random()*100000)::integer from > generate_series(1,10000000) i; > time: 393209.911 > > > not much difference... > > Now I'm getting confused... is which cases fillfactor makes a > difference??? What about the index size? How much space do they occupy? Analyze the table and do this SELECT relpages FROM pg_class WHERE relname = 'indexname'; and I believe you'll see the difference - the former index (fillfactor=100) should grow much larger that the latter one. The minimal performance difference is probably caused by the fact that we're dealing with int4 column (and you've used just 100000 rows, i.e. about 0.5MB of data) so the index is going to be tiny anyway. Let's try to do that with varchar(32) column, just do something like this db=# create table test_fill (id varchar(32)); db=# create index test_fill_idx on test_fill(id) with (fillfactor=100); db=# insert into test_fill select md5(round(random()*100000)::text) from generate_series(1,10000000) i; I believe this might make a difference ... regards Tomas
> What about the index size? How much space do they occupy? Analyze the > table and do this Of course space is different. That's not the point. The point is: I'm willing to pay the price for another HD, if that helps with performance. But it doesn't. > > The minimal performance difference is probably caused by the fact that > we're dealing with int4 column (and you've used just 100000 rows, i.e. > about 0.5MB of data) so the index is going to be tiny anyway. I've used 10M rows, not 100000. > Let's try to do that with varchar(32) column, just do something like this Did it with 5M rows. Still no difference.
> >> What about the index size? How much space do they occupy? Analyze the >> table and do this > > > Of course space is different. That's not the point. The point is: I'm > willing > to pay the price for another HD, if that helps with performance. But it > doesn't. > >> >> The minimal performance difference is probably caused by the fact that >> we're dealing with int4 column (and you've used just 100000 rows, i.e. >> about 0.5MB of data) so the index is going to be tiny anyway. > > I've used 10M rows, not 100000. OK, I've misread the query - still, it's just 50MB of data. >> Let's try to do that with varchar(32) column, just do something like >> this > > > Did it with 5M rows. Still no difference. Hm, so the page splits probably are not that expensive to affect this. I wonder whether this would be true with multiple processes inserting data into the index concurrently. I guess the process needs to obtain a lock to do the page split, and that might make them much more expensive due to contention. But maybe I'm completely wrong - I really am not that familiar with btree internals yet and didn't have to investigate this. Anyway in your case (insert only, single process) I'd probably go with the default value (fillfactor=90). regards Tomas
Hi, I've studied the implementation of the btree indexes and how exactly the fillfactor is used, and in general - when a page split happens, the process needs to obtain more locks than with simple insert, which may result in contention with other processes that modify the index (the same pages) - the fillfactor is used only for the leaf pages, the rest of the index does not use it (directly) So lowering the number of page splits might remove some contention when there's a lot of processes accessing the same pages. But that's the theory - I really was not able to come up with a test that benefits from lower fillfactor. Actually the lower the fillfactor, the larger the index, which may be a significant issue with huge indexes. So in your case, I'd probably go with the default fillfactor (90), and maybe I'd consider even using fillfactor=100, to make the index as dense as possible. Anyway I guess the impact of this will be negligible, compared to other parameters (shared buffers, work_mem, etc.). regards Tomas
Hi, 2011-05-12 00:28 keltezéssel, Tomas Vondra írta: > Hi, > > I've studied the implementation of the btree indexes and how exactly the > fillfactor is used, and in general > > - when a page split happens, the process needs to obtain more locks > than with simple insert, which may result in contention with other > processes that modify the index (the same pages) > > - the fillfactor is used only for the leaf pages, the rest of the index > does not use it (directly) > > So lowering the number of page splits might remove some contention when > there's a lot of processes accessing the same pages. > > But that's the theory - I really was not able to come up with a test > that benefits from lower fillfactor. Actually the lower the fillfactor, > the larger the index, which may be a significant issue with huge indexes. We recently had a testcase for exercising FILLFACTOR on indexes. Several (15+) GB raw data arrives daily and must be imported into the database for analytic purposes, the table is heavily partitioned and each partition has 5 or 6 indexes. The importer is heavily threaded and uses COPY to insert the data. This is strictly an INSERT-only scenario, the analysis comes later. This is where FILLFACTOR=70 helped to reduce the index contention. With the default 90% (and 100% on the table itself) when a lot of data arrived in burst that were in the same time interval so 4-8 threads tried to push data into the same partition, individual data chunks (about 10-15MB each, about 63500 pieces) were imported in about 1.5-2.5 minutes, with a few spikes over 4 minutes. This was the result of a few days long network outage, the data collectors sent their backlogs. When all indexes were re-created with FILLFACTOR=70, the import times went down considerably. Out of about 63000 raw data chunks, there were only 6 or 8 where the import time fell in the 10-25 seconds range, the rest of them were imported under 10 seconds and the majority of them (65%) were imported under 3 seconds. The system has 24 cores, so we can use 8 of them dedicated to importing. This almost 1TB data was imported in about 3.5-4 hours after the FILLFACTOR change. Without it, it didn't finish in 2 days. Best regards, Zoltán Böszörményi > So in your case, I'd probably go with the default fillfactor (90), and > maybe I'd consider even using fillfactor=100, to make the index as dense > as possible. > > Anyway I guess the impact of this will be negligible, compared to other > parameters (shared buffers, work_mem, etc.). > > regards > Tomas > -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Dne 12.6.2011 21:14, Boszormenyi Zoltan napsal(a): > We recently had a testcase for exercising FILLFACTOR on indexes. > Several (15+) GB raw data arrives daily and must be imported into > the database for analytic purposes, the table is heavily partitioned > and each partition has 5 or 6 indexes. The importer is heavily threaded > and uses COPY to insert the data. This is strictly an INSERT-only scenario, > the analysis comes later. This is where FILLFACTOR=70 helped to > reduce the index contention. > > With the default 90% (and 100% on the table itself) when a lot of data > arrived in burst that were in the same time interval so 4-8 threads tried > to push data into the same partition, individual data chunks (about 10-15MB > each, about 63500 pieces) were imported in about 1.5-2.5 minutes, with > a few spikes over 4 minutes. This was the result of a few days long > network outage, the data collectors sent their backlogs. > > When all indexes were re-created with FILLFACTOR=70, the import times > went down considerably. Out of about 63000 raw data chunks, there were > only 6 or 8 where the import time fell in the 10-25 seconds range, the rest > of them were imported under 10 seconds and the majority of them (65%) > were imported under 3 seconds. The system has 24 cores, so we can use > 8 of them dedicated to importing. This almost 1TB data was imported > in about 3.5-4 hours after the FILLFACTOR change. Without it, it didn't > finish in 2 days. > > Best regards, > Zoltán Böszörményi Great, thanks for this info! So I guess it basically confirms my theory that lowering fillfactor only helps with a lot of processes accessing the same pages. Can you report how did the index size change? regards Tomas