Thread: adding new pages bulky way

From
"Victor Y. Yegorov"
Date:
I need your advice.

For my on-disk bitmap I maintain a list of TIDs.

TIDs are stored in pages as an array; a page's opaque data holds an array of
bits indicating whether the corresponding TID has been deleted and should be
skipped during the scan.

Pages that contain the TID list are organized into extents; extent N holds
2^N pages (i.e. the 2nd extent will occupy 4 pages). Given the number of
TIDs that fit into one page and a TID's sequential number, I can easily
calculate:
-  the extent number the TID belongs to;
-  the page offset inside that extent, and;
-  the TID's place in the page.

At the moment, I store the BlockNumber of the extent's first page in the
metapage and allocate all pages that belong to that extent sequentially. I
need to do so to minimize the number of page reads when searching for a TID
in the list; I'll need to read at most 1 page to find the TID at a given
position during the scan. I hope the idea is clear.

This also means that while an extent's pages are being added this way, no
other pages can be added to the index. And the higher the extent's number,
the more time it takes to allocate all of its pages.

The question is: allocating pages this way is really ugly, I understand. Is
there some API that would allow allocating N pages in bulk?
Maybe this is a known problem that has already been solved?
Any other ideas?


Thanks in advance!


-- 

Victor Y. Yegorov


Re: adding new pages bulky way

From
Alvaro Herrera
Date:
On Mon, Jun 06, 2005 at 10:59:04PM +0300, Victor Y. Yegorov wrote:

> The question is: allocating pages this way is really ugly, I understand. Is
> there some API that would allow allocating N pages in bulk?
> Maybe this is a known problem that has already been solved?
> Any other ideas?

I don't understand your question.  What's the problem with holding the
extend lock for the index relation while you extend it?  Certainly you
want only a single process creating a new extent in the index, right?

I guess the question is when are the extents created, and what
concurrency do you expect from that operation.

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"Nature, so fragile, so exposed to death... and so alive"


Re: adding new pages bulky way

From
Tom Lane
Date:
"Victor Y. Yegorov" <viy@mits.lv> writes:
> [ scheme involving a predetermined layout of index pages ]

> The question is: allocating pages this way is really ugly, I understand. Is
> there some API that would allow allocating N pages in bulk?

Why bother?  Just write each page when you need to --- there's no law
that says you must use P_NEW.  The hash index type does something pretty
similar, IIRC.
        regards, tom lane


Re: adding new pages bulky way

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes
>
> Why bother?  Just write each page when you need to --- there's no law
> that says you must use P_NEW.  The hash index type does something pretty
> similar, IIRC.
>

Are there any performance benefits if we have an mdextend_several_pages()
function in md.c, so the relation can be extended in bulk? In my
understanding, if we write

    write(fd, buffer, BLCKSZ * 10);

instead of

    for (i = 0; i < 10; i++)
        write(fd, buffer, BLCKSZ);

this will reduce some of the file system journaling cost on journaling file
systems. Of course, the cost we have to pay is holding the relation
extension lock for longer.

Regards,
Qingqing




Re: adding new pages bulky way

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> Is there any performance benefits if we have a mdextend_several_pages()
> function in md.c?

I very seriously doubt that there would be *any* win, and I doubt even
more that it could possibly be worth the klugery you'd have to do to
make it happen.  Bear in mind that index access methods are two API
layers away from md.c --- how will you translate this into something
that makes sense in the context of bufmgr's API?
        regards, tom lane


Re: adding new pages bulky way

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes
>
> I very seriously doubt that there would be *any* win, and I doubt even
> more that it could possibly be worth the klugery you'd have to do to
> make it happen.  Bear in mind that index access methods are two API
> layers away from md.c --- how will you translate this into something
> that makes sense in the context of bufmgr's API?
>

Index access or heap access doesn't matter. The imaginary plan is like this:

-- change 1 --

/* md.c */
mdextend()
{
    mdextend_several_pages();
    add_pages_to_FSM();
}

-- change 2 --

/* Any place that holds the relation extension lock */

if (needLock)
    LockPage(relation, 0, ExclusiveLock);

/* ADD: check again here */
if (InvalidBlockNumber != GetPageWithFreeSpace())
    UnlockPage(relation, 0, ExclusiveLock);

/* I have to do the extension */
buffer = ReadBuffer(relation, P_NEW);

The code above is quite like how we handle XLogFlush() currently.





Re: adding new pages bulky way

From
"Victor Y. Yegorov"
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [07.06.2005 07:59]:
> Why bother?  Just write each page when you need to --- there's no law
> that says you must use P_NEW.

This means 2 things:
1) I cannot mix P_NEW and exact-number ReadBuffer() calls;
2) thus, I have to track next-block-number myself.

Is it so?


BTW, is there any difference in buffer lookup speed between block numbers
allocated in mixed order and sequentially (i.e. when P_NEW is used)?


-- 

Victor Y. Yegorov


Re: adding new pages bulky way

From
Alvaro Herrera
Date:
On Tue, Jun 07, 2005 at 07:52:57PM +0300, Victor Y. Yegorov wrote:
> * Tom Lane <tgl@sss.pgh.pa.us> [07.06.2005 07:59]:
> > Why bother?  Just write each page when you need to --- there's no law
> > that says you must use P_NEW.
> 
> This means 2 things:
> 1) I cannot mix P_NEW and exact-number ReadBuffer() calls;

Huh, why?  You need to grab the relation extension lock
(LockRelationForExtension in CVS tip).

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."                (Lamar Owen)


Re: adding new pages bulky way

From
"Victor Y. Yegorov"
Date:
* Alvaro Herrera <alvherre@surnet.cl> [08.06.2005 00:39]:
> Huh, why?  You need to grab the relation extension lock
> (LockRelationForExtension in CVS tip).

Really? I didn't know that.

Consider:
1) I add 2 pages to a newly-created relation, using P_NEW as the
   BlockNumber;
2) then I do LockRelationForExtension, ReadBuffer(135) and
   UnlockRelationForExtension.

What BlockNumber will be assigned to the buffer if I call
ReadBuffer(P_NEW) now? 136?


BTW, is it OK to say "BlockNumber is assigned to buffer"?


-- 

Victor Y. Yegorov


Re: adding new pages bulky way

From
Tom Lane
Date:
"Victor Y. Yegorov" <viy@mits.lv> writes:
> * Alvaro Herrera <alvherre@surnet.cl> [08.06.2005 00:39]:
>> Huh, why?  You need to grab the relation extension lock
>> (LockRelationForExtension in CVS tip).

> Really? I didn't know that.

> Consider:
> 1) I add 2 pages to a newly-created relation,
>    using P_NEW as the BlockNumber;
> 2) then I do LockRelationForExtension, ReadBuffer(135) and
>    UnlockRelationForExtension.

As things are set up at the moment, you really should not use
P_NEW at all unless you hold the relation extension lock.

(At least not for ordinary heap relations.  An index access
method could have its own rules about how to add blocks to
the relation --- hash does for instance.)

This is all pretty ugly in my view, and so I would not stand
opposed to ideas about a cleaner design ...
        regards, tom lane


Re: adding new pages bulky way

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes
>
> I very seriously doubt that there would be *any* win
>

I did a quick proof-of-concept implementation to test non-concurrent batch
insertion; here is the result.

Environment:
  - Pg 8.0.1
  - NTFS / IDE


-- batch 16 pages extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4167.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 8111.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 16444.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 41980.000 ms

-- batch 32 pages extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4086.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 7861.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 16403.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 41290.000 ms

-- batch 64 pages extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4236.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 8202.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 17265.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 44063.000 ms

-- batch 128 pages extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4256.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 8242.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 17375.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 43854.000 ms

-- one page extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4496.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 9013.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 19508.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 49962.000 ms

The benefits are there: approximately a 10% improvement if we select a good
batch size. The explanation is: if a batch insertion needs 6400 new pages,
it originally does a write() plus file system journaling 6400 times; now it
does so 6400/64 = 100 times (though each call takes longer). Also, since
writes of different sizes have different costs, 32 seems to be the optimal
batch size for my machine.

What I did include:

(1) md.c
Modify function mdextend():
  - extend 64 pages each time;
  - after extension, let the FSM be aware of it (change the FSM a little
    bit so it can report free space also for an empty page).

(2) bufmgr.c
Make ReadPage(+empty_page) treat an empty page and a non-empty one
differently, to avoid an unnecessary read for new pages, that is:

    if (!empty_page)
        smgrread(reln->rd_smgr, blockNum, (char *) MAKE_PTR(bufHdr->data));
    else
        PageInit((char *) MAKE_PTR(bufHdr->data), BLCKSZ, 0);
        /* Only for heap pages, and a race could be here ... */

(3) hio.c
RelationGetBufferForTuple():
  - pass the correct "empty_page" parameter to ReadPage() according to the
    query result from the FSM.

Regards,
Qingqing





Re: adding new pages bulky way

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> What I did include:

> Make ReadPage(+empty_page) treat an empty page and a non-empty one
> differently, to avoid an unnecessary read for new pages, that is:

In other words, if FSM is wrong you will overwrite valid data?  No
thanks ... this is guaranteed to fail under simple concurrent usage,
let alone any more interesting scenarios like FSM being actually out of
date.
        regards, tom lane


Re: adding new pages bulky way

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes
>
> In other words, if FSM is wrong you will overwrite valid data?  No
> thanks ... this is guaranteed to fail under simple concurrent usage,
> let alone any more interesting scenarios like FSM being actually out of
> date.
>

You are welcome ;-). The FSM race/corruption is a problem. Maybe we could
put it on the TODO list to look for better solutions, since we can see the
performance benefits are there.

Regards,
Qingqing