Thread: Index size
Hi, I have created a btree index on an 'int4' attribute of a table. After I inserted 1,000,000 rows into the table, I can see from pg_class that my index size is 2745 blocks (8 KB each), i.e. about 21,960 KB.

I am trying to understand how this number comes about, because I thought that for each new entry in the table there is a new entry in the index, and that each index entry is:

4 bytes for the int4 attribute
and
40 bytes for the oid

So 44 * 1,000,000 ~ 42,969 KB.

Can anybody tell me where my mistake is?
> I have created a btree index on an 'int4' attribute of a table. After I
> inserted 1,000,000 rows, I can see from pg_class that my index size is
> 2745 blocks (8KB each). [...]
> Can anybody tell me where my mistake is?

There's no oid in index tuples. There is an 8-byte header for each index tuple. Since you are inserting 4-byte user data, your index tuples are 12 bytes each. Each index tuple also needs a "pointer" in its block, called an "item pointer", which is 4 bytes long. So each block can hold up to floor((8192 - 24 (page header) - 16 (special data)) / (12 + 4)) = 509 tuples, and ceil(1,000,000 / 509) = 1965 is the number of blocks you need for your index.

In addition to this you need a "meta page" and a "root page", so it becomes 1965 + 1 + 1 = 1967. You also need "internal pages", whose number is hard to guess since it depends on the actual index tree structure (for example, the tree height). From my limited experience, for 1,000,000 tuples you will need at least 7 internal pages. Now the number becomes 1967 + 7 = 1974.

Still, that's different from 2745. If you don't have deleted tuples, the difference probably comes from the fact that a btree index can never be 100% occupied. IMO 1974/2745 = 0.71 seems not so bad.
--
Tatsuo Ishii
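The page-count arithmetic above can be sketched as a quick calculation (a hedged sketch using only the figures from the message: 8 KB page, 24-byte page header, 16-byte btree special space, 8-byte index-tuple header, 4-byte item pointer; the function name is just for illustration):

```python
import math

PAGE_SIZE = 8192          # default PostgreSQL block size
PAGE_HEADER = 24          # bytes for the page header
SPECIAL_SPACE = 16        # btree special space at the end of each page
TUPLE_HEADER = 8          # index tuple header
ITEM_POINTER = 4          # per-tuple item pointer in the page

def leaf_pages(n_tuples, key_bytes):
    """Leaf pages needed at 100% fill, per the figures in the thread."""
    tuple_size = TUPLE_HEADER + key_bytes + ITEM_POINTER
    per_page = (PAGE_SIZE - PAGE_HEADER - SPECIAL_SPACE) // tuple_size
    return per_page, math.ceil(n_tuples / per_page)

per_page, leaves = leaf_pages(1_000_000, 4)   # int4 key
total = leaves + 1 + 1 + 7                    # + meta page, root page, ~7 internal pages
print(per_page, leaves, total)                # 509 1965 1974
```

This reproduces the 509 tuples per block, 1965 leaf pages, and 1974 total pages in the reply; the gap to the observed 2745 is the fill-factor effect discussed below.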
Thanks a lot. Another question: is there any way to prevent duplicates on a btree index attribute, while PERMITTING them in the table?

On Tue, 1 Mar 2005, Tatsuo Ishii wrote:
> There's no oid in index tuples. There is an 8-byte header for
> each index tuple. [...] IMO 1974/2745 = 0.71 seems not so bad.
> --
> Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> ... Now the number becomes 1967+7 = 1974. Still it's different from
> 2745. If you don't have deleted tuples, the difference probably comes
> from the fact that a btree index can never be 100% occupied. IMO
> 1974/2745 = 0.71 seems not so bad.

In fact the traditional figure for the steady-state load factor of a btree index is 2/3rds; that is, after a long sequence of inserts and deletes you can expect about one-third of each page to be empty space.

If Ioannis' number was taken immediately after a CREATE INDEX operation, then his index size isn't reflective of any settling to a steady-state load factor; rather, it happens because the CREATE INDEX command deliberately loads the index leaf pages only 2/3rds full, to avoid a disproportionate amount of page splitting when normal inserts commence.

			regards, tom lane
> Another question:
>
> Is there any way to prevent duplicates on a btree index attribute,
> PERMITTING them in the table?

I can't think of any useful use for such an index. Can you explain why you need it?
--
Tatsuo Ishii
> In fact the traditional figure for the steady-state load factor of a
> btree index is 2/3rds; [...] rather, it happens because the CREATE INDEX
> command deliberately loads the index leaf pages only 2/3rds full, to
> avoid a disproportionate amount of page splitting when normal inserts
> commence.

Interesting. Right after CREATE INDEX on an int4 column using pgbench -s 10 (1,000,000 tuples), I got 2184 leaf pages. From my calculation the number of leaf pages is expected to be 1965, under the 100%-full assumption of course. So 1965/2184 = 0.8997 = 90% is actually used?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> ... rather it happens because the CREATE INDEX command
>> deliberately loads the index leaf pages only 2/3rds full, to avoid a
>> disproportionate amount of page splitting when normal inserts commence.

> Interesting. Right after CREATE INDEX on an int4 column using pgbench
> -s 10 (1,000,000 tuples), I got 2184 leaf pages. From my calculation
> the number of leaf pages is expected to be 1965, under the 100%-full
> assumption of course. So 1965/2184 = 0.8997 = 90% is actually used?

Shoulda read the code rather than going by memory ;-). What nbtsort.c actually says is

 * It is not wise to pack the pages entirely full, since then *any*
 * insertion would cause a split (and not only of the leaf page; the need
 * for a split would cascade right up the tree).  The steady-state load
 * factor for btrees is usually estimated at 70%.  We choose to pack leaf
 * pages to 90% and upper pages to 70%.  This gives us reasonable density
 * (there aren't many upper pages if the keys are reasonable-size) without
 * incurring a lot of cascading splits during early insertions.

and indeed the code seems to do that:

    /* set "full" threshold based on level.  See notes at head of file. */
    if (level > 0)
        state->btps_full = (PageGetPageSize(state->btps_page) * 3) / 10;
    else
        state->btps_full = PageGetPageSize(state->btps_page) / 10;

			regards, tom lane
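The 90% leaf fill factor quoted from nbtsort.c can be checked against Tatsuo's measurement (a hedged sketch; the figure of 509 tuples per fully packed leaf page comes from the calculation earlier in the thread, and the rounding here is a simplifying assumption rather than the exact nbtsort.c logic):

```python
import math

TUPLES_PER_FULL_PAGE = 509    # int4 keys, 100%-full page, from earlier in the thread
LEAF_FILL = 0.9               # CREATE INDEX packs leaf pages to 90%

# Tuples actually placed on each leaf page at the 90% fill factor
per_page = math.floor(TUPLES_PER_FULL_PAGE * LEAF_FILL)   # 458
leaves = math.ceil(1_000_000 / per_page)
print(per_page, leaves)                                   # 458 2184
```

This matches the 2184 leaf pages observed right after CREATE INDEX in the pgbench test above.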
> Shoulda read the code rather than going by memory ;-). What nbtsort.c
> actually says is
>
> * [...] We choose to pack leaf
> * pages to 90% and upper pages to 70%. [...]
>
> and indeed the code seems to do that: [...]

Thanks for the explanation. So it seems Ioannis' number was not taken immediately after a CREATE INDEX operation?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > So it seems Ioannis' number was not taken immediately after a CREATE > INDEX operation? I would guess not, but it's up to him to say. If it is a number derived after some period of normal operation, then his result agrees with the theory that says 70% is the steady-state figure ... regards, tom lane
On Tue, 1 Mar 2005, Tom Lane wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > So it seems Ioannis' number was not taken immediately after a CREATE
> > INDEX operation?
>
> I would guess not, but it's up to him to say. If it is a number derived
> after some period of normal operation, then his result agrees with the
> theory that says 70% is the steady-state figure ...

Yes, my number was taken after a large number of inserts.

Your comments about block usage in btree indexes are very interesting. Where can I find documentation with a technical analysis of all (if possible) the components of Postgres? All the documentation I have found is very general and aimed at ordinary users.
Ioannis Theoharis <theohari@ics.forth.gr> writes: > Where can i find a documentation with technical analysis for all (if > possible) of components of postgres? Read the source code. regards, tom lane
On Wed, 2 Mar 2005, Tatsuo Ishii wrote:
> > Is there any way to prevent duplicates on a btree index attribute,
> > PERMITTING them in the table?
>
> I can't think of any useful use for such an index. Can you explain
> why you need it?

I have a relation like this: (att0 varchar(1000), att1 int4)

I create a btree index on att1, and I cluster my relation according to the index. Now I have a query:

select *
from tc2000000000
where att1 <= 900000000 and att1 >= 0;

As far as I can see from EXPLAIN ANALYZE, an index scan is used:

Index Scan using inst_id_idx on tc2000000000
  Index Cond: ((att1 <= 900000000) AND (att1 >= 0))

If for each entry in the table an entry in the index is being held, then the index grows too fast. I guess that Postgres uses the index to find the first entry satisfying the index condition, then finds the last one, and then does a sequential scan over the appropriate fraction of the table (to take advantage of the physical clustering).

In my case, the number of distinct values of att1 is orders of magnitude smaller than the number of table rows. Thus the big index is useless for me. I want to avoid the overhead of scanning such a big index by permitting ONLY the distinct values to enter the index. That way, the scenario I described above for how I guess Postgres evaluates my query would still apply. I think there must be a way to change the index usage to what I'm looking for.
On Wed, Mar 02, 2005 at 10:08:58PM +0200, Ioannis Theoharis wrote:
> I guess that Postgres uses the index to find the first entry satisfying
> the index condition, then finds the last one, and then does a sequential
> scan over the appropriate fraction of the table (to take advantage of
> the physical clustering).

What makes you think that? Clustering is nice, but PostgreSQL needs to get the right answer, and the fact that the table is clustered is not something PostgreSQL can rely on. It uses the index to find *every* row you're looking for; there's no shortcut here.

> Thus the big index is useless for me. I want to avoid the overhead of
> scanning such a big index by permitting ONLY the distinct values to
> enter the index.

There's no special relationship between two rows with the same att1. Either you find the rows by using the index for each row, or by scanning the whole table; there's no in-between. The only thing clustering achieves is that, because equal values are stored together, the chance that succeeding index entries have already been loaded is higher, thus reducing the overall cost.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
On Wed, Mar 02, 2005 at 11:30:58PM +0200, Ioannis Theoharis wrote:
> If postgresql doesn't rely on it, it's postgresql's technical decision
> (and I don't know the reason), not a standard decision among RDBMSs.
> But if you know the exact reason, it would be a great help for me to
> know it.

Easy: if you CLUSTER a table, it's CLUSTERed then. But it doesn't stay that way. As soon as you insert a new row, or update an old one, it gets added to the end (the only place with space) and now it's not clustered anymore. It's almost clustered, and from a caching point of view that's fine, but PostgreSQL can't assume at any point that a table will stay clustered; an insert could happen in the middle of your processing.

Logically, you can't magically add space in the middle of a file; you have to move everything else up. If you know an efficient way to keep a table clustered while handling arbitrary inserts and updates, I'd be curious to know...
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
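The decay described above can be illustrated with a toy simulation (a hypothetical sketch, not how PostgreSQL actually stores rows): a heap starts perfectly ordered right after CLUSTER, but every subsequent insert appends at the end regardless of key value, so physical order drifts away from key order.

```python
import random

random.seed(0)
# A freshly CLUSTERed table: rows stored in key order
heap = sorted(random.sample(range(1_000_000), 10_000))

def clustered_fraction(rows):
    """Fraction of adjacent row pairs that are still in key order."""
    in_order = sum(a <= b for a, b in zip(rows, rows[1:]))
    return in_order / (len(rows) - 1)

print(clustered_fraction(heap))          # 1.0 right after CLUSTER

# Inserts append at the end of the heap, regardless of key value
for _ in range(5_000):
    heap.append(random.randrange(1_000_000))

print(clustered_fraction(heap) < 1.0)    # True: the ordering has decayed
```

The table remains "almost clustered" (most adjacent pairs are still in order), which is why caching still benefits, but the planner cannot rely on strict physical order for correctness.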
Everything you said is right. But it would not be so difficult for PostgreSQL to keep a bit attached to each table recording whether an insertion/deletion/update has been done to a clustered table or not. And I would have guessed that PostgreSQL already implemented this simple alternative.

> Easy: if you CLUSTER a table, it's CLUSTERed then. But it doesn't stay
> that way. As soon as you insert a new row, or update an old one, it
> gets added to the end (the only place with space) and now it's not
> clustered anymore. [...] If you know an efficient way to keep a
> table clustered while handling arbitrary inserts and updates, I'd be
> curious to know...
Hi everybody!

I have an SRF which is called from a Java app through JDBC. Everything works fine, and I now want to be able to pass some result-related info to my app. It is not about the format of the results (ResultSetMetaData) or anything like that.

Is it possible to return some string (or other type of) info together with the result tuples (even if it requires some hacking, i.e. there is no provision for something like that)? Any ideas?

Regards,
Ntinos Katsaros
On Tue, 8 Mar 2005, ntinos@aueb.gr wrote:
> Is it possible to return some string (or other type of) info together
> with the result tuples (even if it requires some hacking, i.e. there is
> no provision for something like that)? Any ideas?

The only idea that comes to mind is using RAISE NOTICE in your plpgsql function and Statement or ResultSet .getWarnings() on the Java side to retrieve that info. There really isn't any other out-of-band data path.

Kris Jurka
Thank you very much for your reply. The thing is that my SRF is written in C, not plpgsql, but I'll look into RAISE NOTICE anyway. (I think there is something equivalent in libpq.)

Thanks again,
Ntinos Katsaros

Kris Jurka writes:
> The only idea that comes to mind is using RAISE NOTICE in your plpgsql
> function and Statement or ResultSet .getWarnings() on the Java side to
> retrieve that info. There really isn't any other out-of-band data path.
Solution found! Thanks to Kris Jurka's advice, I managed to pass this info using elog(INFO, ...) or elog(NOTICE, ...). These messages, together with .getWarnings(), do the job. E.g. the message returned by the SQLWarning:

SNOTICEC00000M#SUCCESSFUL EXECUTION. NO TUPLES FROM PEER(S): mobileb#Ftestmybuild.cL2558Ranswer

Getting the plain message is then trivial (e.g. using flag chars like '#' above). Of course the appropriate logging must be set in postgresql.conf. Just in case someone wants to do the same thing: I don't know if this is the best solution (or if any other exists), but it surely works.

Regards,
Ntinos Katsaros

PS: libpq has nothing to do with the above :-)!

ntinos@aueb.gr writes:
> Thank you very much for your reply. The thing is that my SRF is written
> in C, not plpgsql, but I'll look into RAISE NOTICE anyway. (I think
> there is something equivalent in libpq.)