Thread: adding new pages bulky way
I need your advice. For an on-disk bitmap I maintain a list of TIDs. TIDs are stored in pages as an array; a page's opaque data holds an array of bits indicating whether the corresponding TID has been deleted and should be skipped during the scan.

Pages that contain the TID list are organized in extents; extent number N holds 2^N pages (i.e. the 2nd extent occupies 4 pages). Given the number of TIDs that fit into one page and a TID's sequential number, I can easily calculate:

- the extent number the TID belongs to;
- the page offset inside that extent, and;
- the TID's place in the page.

At the moment, I store the BlockNumber of the extent's first page in the metapage and allocate all pages that belong to that extent sequentially. I need to do so to minimize the number of page reads when searching for a TID in the list: I'll need to read at most 1 page to find the TID at a given position during the scan. I hope you understood the idea.

This also means that while an extent's pages are being added this way, no other pages can be added to the index. And the higher the extent's number, the more time it takes to allocate all of its pages.

The question is: allocating pages this way is really ugly, I understand. Is there some API that would allow allocating N pages in bulk? Maybe this is a known problem that has already been solved before? Any other ideas?

Thanks in advance!

-- 
Victor Y. Yegorov
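The extent/page/slot arithmetic described above can be sketched as follows. This is only an illustration of the addressing scheme, not code from any patch; TIDS_PER_PAGE, the struct, and the function name are all hypothetical. Since extent N holds 2^N pages, the pages before extent N total 2^N - 1, which gives the lookup below.

```c
#include <assert.h>
#include <stdint.h>

#define TIDS_PER_PAGE 100   /* hypothetical; the real value depends on BLCKSZ */

typedef struct
{
    uint32_t extent;        /* extent number (extent N holds 2^N pages) */
    uint32_t pageoff;       /* page offset within that extent */
    uint32_t slot;          /* TID slot within the page */
} TidLocation;

/*
 * Locate a TID by its sequential number.  Globally, extent e covers
 * 0-based page indexes [2^e - 1, 2^(e+1) - 1), so we scan for the
 * extent whose range contains the page, then subtract the pages that
 * precede it.
 */
static TidLocation
tid_locate(uint64_t seqno)
{
    TidLocation loc;
    uint64_t page = seqno / TIDS_PER_PAGE;      /* 0-based global page index */
    uint32_t extent = 0;

    while ((((uint64_t) 1 << (extent + 1)) - 1) <= page)
        extent++;

    loc.extent = extent;
    loc.pageoff = (uint32_t) (page - (((uint64_t) 1 << extent) - 1));
    loc.slot = (uint32_t) (seqno % TIDS_PER_PAGE);
    return loc;
}
```

For example, TID number 350 lands on global page 3, which is the first page of extent 2 (pages 0, 1, 2 belong to extents 0 and 1), at slot 50.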
On Mon, Jun 06, 2005 at 10:59:04PM +0300, Victor Y. Yegorov wrote:
> The question is: allocating pages this way is really ugly, I understand. Is
> there some API that would allow allocating N pages in the bulk way?
> Maybe this is a know problem, that has been already solved before?
> Any other ideas?

I don't understand your question. What's the problem with holding the extend lock for the index relation while you extend it? Certainly you want only a single process creating a new extent in the index, right?

I guess the question is when the extents are created, and what concurrency you expect from that operation.

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"Nature, so fragile, so exposed to death... and so alive"
"Victor Y. Yegorov" <viy@mits.lv> writes:
> [ scheme involving a predetermined layout of index pages ]
> The question is: allocating pages this way is really ugly, I understand. Is
> there some API that would allow allocating N pages in the bulk way?

Why bother? Just write each page when you need to --- there's no law that says you must use P_NEW. The hash index type does something pretty similar, IIRC.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Why bother? Just write each page when you need to --- there's no law
> that says you must use P_NEW. The hash index type does something pretty
> similar, IIRC.

Are there any performance benefits to having an mdextend_several_pages() function in md.c, so the relation can be extended in bulk? In my understanding, if we write

    write(fd, buffer, BLCKSZ * 10);

instead of

    for (i = 0; i < 10; i++)
        write(fd, buffer, BLCKSZ);

this will reduce some of the file system logging costs on journaling file systems. Of course, the cost we have to pay is holding the relation extension lock for longer.

Regards,
Qingqing
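The single-write idea can be sketched outside PostgreSQL as plain C. This is a minimal stand-alone illustration, not md.c code (the real layer works through its own file abstraction); the function name and page count are made up. It extends a file by npages zeroed pages with one write() call rather than npages separate BLCKSZ-sized writes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Hypothetical sketch: append npages zero-filled pages with a single
 * write().  On a journaling file system this produces one logged
 * append instead of npages of them; the trade-off is holding any
 * extension lock for the duration of the larger write.
 */
static ssize_t
extend_in_bulk(int fd, int npages)
{
    size_t  len = (size_t) npages * BLCKSZ;
    char   *zeroes = calloc(1, len);
    ssize_t written;

    if (zeroes == NULL)
        return -1;
    written = write(fd, zeroes, len);
    free(zeroes);
    return written;
}
```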
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> Is there any performance benefits if we have a mdextend_several_pages()
> function in md.c?

I very seriously doubt that there would be *any* win, and I doubt even more that it could possibly be worth the klugery you'd have to do to make it happen. Bear in mind that index access methods are two API layers away from md.c --- how will you translate this into something that makes sense in the context of bufmgr's API?

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> I very seriously doubt that there would be *any* win, and I doubt even
> more that it could possibly be worth the klugery you'd have to do to
> make it happen. Bear in mind that index access methods are two API
> layers away from md.c --- how will you translate this into something
> that makes sense in the context of bufmgr's API?

Index access or heap access doesn't matter. The imaginary plan is like this:

-- change 1 --

    /* md.c */
    mdextend()
    {
        mdextend_several_pages();
        add_pages_to_FSM();
    }

-- change 2 --

    /* Any place that holds the relation extension lock */
    if (needLock)
        LockPage(relation, 0, ExclusiveLock);

    /* ADD: check again here */
    if (InvalidBlockNumber != GetPageWithFreeSpace())
        UnlockPage(relation, 0, ExclusiveLock);

    /* I have to do the extension */
    buffer = ReadBuffer(relation, P_NEW);

The above code is quite like how we handle XLogFlush() currently.
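The shape of "change 2" -- look for free space, take the lock, then check again before extending -- can be sketched generically. Everything here is an illustrative stand-in, not PostgreSQL code: fsm_lookup(), the counters, and the batch size of 16 are invented, and the sketch models only the control flow (a real FSM lookup would itself be properly synchronized).

```c
#include <assert.h>
#include <pthread.h>

#define EXTEND_BATCH 16                 /* pages added per extension */

static pthread_mutex_t extension_lock = PTHREAD_MUTEX_INITIALIZER;
static int nblocks = 0;                 /* current relation size, in pages */
static int fsm_free = 0;                /* free pages the "FSM" knows about */

/*
 * Hypothetical stand-in for GetPageWithFreeSpace(): hand out the
 * lowest unclaimed page, or -1 if none is known to be free.
 */
static int
fsm_lookup(void)
{
    if (fsm_free == 0)
        return -1;
    fsm_free--;
    return nblocks - fsm_free - 1;
}

static int
get_free_page(void)
{
    int page = fsm_lookup();            /* fast path, no lock taken */

    if (page >= 0)
        return page;

    pthread_mutex_lock(&extension_lock);

    /*
     * Check again under the lock: another backend may have already
     * bulk-extended the relation and refilled the FSM while we were
     * waiting -- the XLogFlush-style recheck described above.
     */
    page = fsm_lookup();
    if (page < 0)
    {
        nblocks += EXTEND_BATCH;        /* mdextend_several_pages() */
        fsm_free = EXTEND_BATCH;        /* add_pages_to_FSM() */
        page = fsm_lookup();
    }
    pthread_mutex_unlock(&extension_lock);
    return page;
}
```

The point of the recheck is that only one caller pays for the bulk extension; the next EXTEND_BATCH - 1 callers are served from the fast path without touching the lock.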
* Tom Lane <tgl@sss.pgh.pa.us> [07.06.2005 07:59]:
> Why bother? Just write each page when you need to --- there's no law
> that says you must use P_NEW.

This means 2 things:
1) I cannot mix P_NEW and exact-number ReadBuffer() calls;
2) thus, I have to track the next block number myself.

Is that so?

BTW, are there any differences in buffer seeking speed if buffer block numbers are mixed versus when they're not (i.e. P_NEW is used)?

-- 
Victor Y. Yegorov
On Tue, Jun 07, 2005 at 07:52:57PM +0300, Victor Y. Yegorov wrote:
> * Tom Lane <tgl@sss.pgh.pa.us> [07.06.2005 07:59]:
> > Why bother? Just write each page when you need to --- there's no law
> > that says you must use P_NEW.
>
> This means 2 things:
> 1) I cannot mix P_NEW and exact-number ReadBuffer() calls;

Huh, why? You need to grab the relation extension lock (LockRelationForExtension in CVS tip).

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."          (Lamar Owen)
* Alvaro Herrera <alvherre@surnet.cl> [08.06.2005 00:39]:
> Huh, why? You need to grab the relation extension block
> (LockRelationForExtension in CVS tip).

Really? I didn't know that.

Consider:
1) I add 2 pages to the newly-created relation using P_NEW as the BlockNumber;
2) then I do LockRelationForExtension; ReadBuffer(135) and UnlockRelationForExtension.

What BlockNumber will be assigned to the buffer if I call ReadBuffer(P_NEW) now? 136?

BTW, is it OK to say "a BlockNumber is assigned to a buffer"?

-- 
Victor Y. Yegorov
"Victor Y. Yegorov" <viy@mits.lv> writes:
> * Alvaro Herrera <alvherre@surnet.cl> [08.06.2005 00:39]:
>> Huh, why? You need to grab the relation extension block
>> (LockRelationForExtension in CVS tip).

> Really? Didn't knew that.

> Consider:
> 1) I add 2 pages to the newly-created relation
> using P_NEW as BlockNumber;
> 2) then I do LockRelationForExtension; ReadBuffer(135) and
> UnockRelationForExtension.

As things are set up at the moment, you really should not use P_NEW at all unless you hold the relation extension lock. (At least not for ordinary heap relations. An index access method could have its own rules about how to add blocks to the relation --- hash does, for instance.)

This is all pretty ugly in my view, and so I would not stand opposed to ideas about a cleaner design ...

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> I very seriously doubt that there would be *any* win

I did a quick proof-of-concept implementation to test non-concurrent batch insertion; here is the result.

Environment:
- Pg 8.0.1
- NTFS / IDE

Each run doubles table t with "insert into t select * from t;". Times for each batch-extension size:

  rows inserted | 1 page   | 16 pages | 32 pages | 64 pages | 128 pages
  --------------+----------+----------+----------+----------+----------
         131072 |  4496 ms |  4167 ms |  4086 ms |  4236 ms |  4256 ms
         262144 |  9013 ms |  8111 ms |  7861 ms |  8202 ms |  8242 ms
         524288 | 19508 ms | 16444 ms | 16403 ms | 17265 ms | 17375 ms
        1048576 | 49962 ms | 41980 ms | 41290 ms | 44063 ms | 43854 ms

Benefits are there: approximately a 10% improvement if we select a good batch size.
The explanation is: if a batch insertion needs 6400 new pages, it originally did write() + file system logging 6400 times; now it does that only 6400/64 = 100 times (though each call costs more). Also, considering that writes of different sizes have different costs, it seems that for my machine 32 is the optimal batch size.

What I did includes:

(1) md.c
Modify function mdextend():
- extend 64 pages each time;
- after the extension, let the FSM be aware of it (change the FSM a little bit so it can report free space also for an empty page).

(2) bufmgr.c
Make ReadPage(+empty_page) treat an empty page differently from a non-empty one, to avoid an unnecessary read for new pages, that is:

    if (!empty_page)
        smgrread(reln->rd_smgr, blockNum, (char *) MAKE_PTR(bufHdr->data));
    else
        PageInit((char *) MAKE_PTR(bufHdr->data), BLCKSZ, 0);
    /* Only for heap pages, and a race could be here ... */

(3) hio.c
RelationGetBufferForTuple():
- pass the correct "empty_page" parameter to ReadPage() according to the query result from the FSM.

Regards,
Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> What I did include:
> make ReadPage(+empty_page) treat different of an empty page and non-empty
> one to avoid unnecesary read for new pages, that is:

In other words, if FSM is wrong you will overwrite valid data? No thanks ... this is guaranteed to fail under simple concurrent usage, let alone any more interesting scenarios like FSM being actually out of date.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> In other words, if FSM is wrong you will overwrite valid data? No
> thanks ... this is guaranteed to fail under simple concurrent usage,
> let alone any more interesting scenarios like FSM being actually out of
> date.

You are welcome ;-). The FSM race/corruption is a problem. Maybe we could put it on the TODO list pending better solutions, since we can see the performance benefits are there.

Regards,
Qingqing