Thread: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
[GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
mx
Date:
Hello,Everyone!<br />I'm a student in China. and I'm preparing for GSocC2008 in these days. <br />There are two questionsabout GSoC.<br /><br />1. There's a paragraph about the Example Proposal Ideas in PostgreSQL Summer Projects website.<br/><br /><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt0.8ex; padding-left: 1ex;"><b>TODO Items</b>: A number of the items on our TODO list have been marked as good projectsfor beginners who are new to the PostgreSQL code. Items on this list have the advantage of already having generalcommunity agreement that the feature is desireable. These items should also have some general discussion availablein the mailing list archives to help get you started. <b>You can find these items on the <a href="http://www.postgresql.org/docs/faqs.TODO.html">TODO</a>list, they will be marked with a percent sign (%)</b>.<br /></blockquote><br/>I didn't get attention to this paragraph before, so I choose some items without % in the List.<br /><b>Isit OK?</b> By the way, I'm writing proposal for multi-column hash now.<br /><br />2. I'm currently in my fourth yearof studies. And I'm in a lab doing database research. <br />My thesis work is about B-Tree index in NAND Flash Disk.I want to do it based on PostgreSQL..<br /> I know embedded server is the feature that postgreSQL don't want. But flashDisk is developing very fast. It's a trend that Flash Disk will replace magnetic disk one day just like what Jim Graysaid "Tape is dead, disk is tape, flash is disk", though nowadays flash device is only widely used in embedded devices.<br /><b>So, how about a project idea on NAND Flash disk without limited-resource environments?</b><br /><b>Is itan acceptable idea?</b><br /><br />Anyway, hope to hear from you, Thanks!<br /><br />-- <br />Best Regards,<br />Meng Xiao<br/><br />━━━━━━━━━━━━━━━━━━━━━━━━━<br />Data and Knowledge Engineering Research Center,CS&T<br />Harbin Instituteof Technology, Harbin, China<br />Gtalk: <a href="mailto:mx.cogito@gmail.com">mx.cogito@gmail.com</a><br />MSN:<a href="mailto:cnEnder@live.com">cnEnder@live.com</a><br /> Blog: <a href="http://xiaomeng.yo2.cn">http://xiaomeng.yo2.cn</a>
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
"Heikki Linnakangas"
Date:
mx wrote: > Hello,Everyone! > I'm a student in China. and I'm preparing for GSocC2008 in these days. > There are two questions about GSoC. > > 1. There's a paragraph about the Example Proposal Ideas in PostgreSQL Summer > Projects website. > > *TODO Items*: A number of the items on our TODO list have been marked as >> good projects for beginners who are new to the PostgreSQL code. Items on >> this list have the advantage of already having general community agreement >> that the feature is desireable. These items should also have some general >> discussion available in the mailing list archives to help get you started. >> *You can find these items on the TODO<http://www.postgresql.org/docs/faqs.TODO.html>list, they will be marked with a percentsign (%) >> *. >> > > I didn't get attention to this paragraph before, so I choose some items > without % in the List. > *Is it OK?* Yes, absolutely. The '%' sign is just a hint that those items are easier than others, and therefore good items to pick up as a beginner. > By the way, I'm writing proposal for multi-column hash now. The biggest problem with the hash index is currently that there's no significant performance over b-tree. If you want to work on hash indexes, I would suggest doing benchmarking and looking at ways to improve performance, before spending time on making it multi-column capable. And missing WAL logging is a big issue as well. > 2. I'm currently in my fourth year of studies. And I'm in a lab doing > database research. > My thesis work is about B-Tree index in NAND Flash Disk. I want to do it > based on PostgreSQL.. > I know embedded server is the feature that postgreSQL don't want. But flash > Disk is developing very fast. It's a trend that Flash Disk will replace > magnetic disk one day just like what Jim Gray said "Tape is dead, disk is > tape, flash is disk", though nowadays flash device is only widely used in > embedded devices. > *So, how about a project idea on NAND Flash disk without limited-resource > environments?* > *Is it an acceptable idea?* Maybe, hard to tell without more details. What difference does it make if the b-tree is on a flash device, as opposed to disk? What's different in general when you run on a flash disk? The "embedded server" idea in the "not wanted" list refers to the idea of running PostgreSQL in the same process as the client. If I understood you correctly, you're proposing something quite different. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > mx wrote: >> By the way, I'm writing proposal for multi-column hash now. > The biggest problem with the hash index is currently that there's no > significant performance over b-tree. If you want to work on hash > indexes, I would suggest doing benchmarking and looking at ways to > improve performance, before spending time on making it multi-column > capable. Quite. There are already some people investigating that, but maybe you could join forces with them. > And missing WAL logging is a big issue as well. Yeah, but again there seems little point in doing that work until/unless there is a demonstrable performance use-case for hash indexes. regards, tom lane
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
mx
Date:
Thank you for your suggestion!<br /><br />> The biggest problem with the hash index is currently that there's no<br />>significant performance over b-tree. If you want to work on hash<br />> indexes, I would suggest doing benchmarkingand looking at ways to<br /> > improve performance, before spending time on making it multi-column<br />>capable. And missing WAL logging is a big issue as well.<br /><br />It's a good suggestion! My work is useless if theperformance of hash index is not effective enough.<br /> I'll adopt your suggestion to consider improving hash performanceat first. It's a more challenging and exciting work.<br /><br />On Tue, Mar 25, 2008 at 12:23 AM, Heikki Linnakangas<<a href="mailto:heikki@enterprisedb.com">heikki@enterprisedb.com</a>> wrote:<br /><br />> Maybe, hardto tell without more details. What difference does it make<br />> if the b-tree is on a flash device, as opposed todisk? What's different<br />> in general when you run on a flash disk?<br />> <br /> > The "embedded server" ideain the "not wanted" list refers to the idea<br />> of running PostgreSQL in the same process as the client. If I understood<br/>> you correctly, you're proposing something quite different.<br /> > <br /><br />OK, I'll explain itin more details. <br /><br />The atom unit of flash is page(512~2048byte typically).<br />Page are organized into blocks,typically of 32 or 64 pages.<br />All read write and write operations happen at page granularity, but erase operationshappen at block granularity.<br /><br />Flash has a weird characteristic "erase-before-write".You can't just overwritea page, You have to erase the whole blocks and then write the page. So read operation is faster than write operation(about 2~200times by different device).<br /><br />It's a big problem when we just run database designed for magneticdisk.We just overwrite a page when we update B-Tree, but it's not a good way to update for flash disk. Currently,there are some research results on this problem. They use a method similar to the Log-structured File Systems and every node is encoded by many log entries. So, they can reduce update using log.<br /><br />In my opinion, we have to change Access Method and some part of Storage Managers greatly. Is it too hard for a beginner to serveas a GSoC project?<br /><br /><br />Finally, please make some suggestions, thanks!<br /><br />-- <br /> Have a goodday;-)<br />Best Regards,<br />Meng Xiao<br /><br />━━━━━━━━━━━━━━━━━━━━━━━━━<br />Data and Knowledge Engineering ResearchCenter,CS&T<br />Harbin Institute of Technology, Harbin, China<br />Gtalk: <a href="mailto:mx.cogito@gmail.com">mx.cogito@gmail.com</a><br/> MSN: <a href="mailto:cnEnder@live.com">cnEnder@live.com</a><br/> Blog: <a href="http://xiaomeng.yo2.cn">http://xiaomeng.yo2.cn</a>
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
Greg Smith
Date:
On Tue, 25 Mar 2008, mx wrote: > The atom unit of flash is page(512~2048byte typically). Page are > organized into blocks, typically of 32 or 64 pages. All read write and > write operations happen at page granularity, but erase operations happen > at block granularity. You made a subtle switch here I wanted to emphasise. Your original message suggested flash is an increasingly important storage mechanism because flash devices like SSD drives are going to be more popular in the future; that is true. However, what you're describing is something more like how flash is used in raw embedded systems applications. The kinds of SSD drives that are becoming popular for database use abstract away all of this low-level block mess and hide it with approaches like sophisticated write-leveling algorithms. You don't (and possibly can't) even know what the underlying structure is like. And even if you did, the fact that there's a always a regular operating system and filesystem underneath PostgreSQL writes will make it undertain the writes are only touching the tiny portion of flash you want to target anyway. They may write a whole OS block regardless. Accordingly, it's my opinion that while this topic is interesting and a good one for your thesis, this particular project would not be too valuable to the typical PostgreSQL user whether or not SSD becomes popular. Working to optimize raw flash writes is really more of an operating system level project than a database one. I also have my doubts about whether you could get this done within the GSoC timeframe; better to pick a less ambitious goal than rewriting the whole storage manager interface I would think. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
Alvaro Herrera
Date:
mx escribió: > In my opinion, we have to change Access Method and some part of Storage > Managers greatly. Is it too hard for a beginner to serve as a GSoC project? It's like 3 orders of magnitude more work than I'd expect a Postgres newbie to be able to finish in a summer. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes: > mx escribi�: >> In my opinion, we have to change Access Method and some part of Storage >> Managers greatly. Is it too hard for a beginner to serve as a GSoC project? > It's like 3 orders of magnitude more work than I'd expect a Postgres > newbie to be able to finish in a summer. In fact, you'd be lucky if you could get the community to buy into the idea at all over one summer. "Maximize performance on flash" is not even on the radar screen at this point, much less agreed to, and certainly not agreed-to to the point of sacrificing any other value. You might be able to get some localized changes applied, but any change large enough to risk de-stabilization of the code, or cause unpredictable performance loss for currently supported environments, would probably be rejected out of hand. regards, tom lane
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
mx
Date:
> > On Tue, 25 Mar 2008, mx wrote: > > > The atom unit of flash is page(512~2048byte typically). Page are > > organized into blocks, typically of 32 or 64 pages. All read write and > > write operations happen at page granularity, but erase operations happen > > at block granularity. > > You made a subtle switch here I wanted to emphasise. Your original > message suggested flash is an increasingly important storage mechanism > because flash devices like SSD drives are going to be more popular in the > future; that is true. However, what you're describing is something more > like how flash is used in raw embedded systems applications. The kinds of > SSD drives that are becoming popular for database use abstract away all of > this low-level block mess and hide it with approaches like sophisticated > write-leveling algorithms. > Maybe I gives too many detailed about raw flash devices. In fact, what I want to show is the asymmetric speed of read and write. Any flash devices including SSD have such a characteristic. For a sumsung 64G SSD PATA IDE 2.5,maximum Squential Read is 57MB/s, while maximum sequential Read is 38MB/s according the product datasheet. Anyway, in the eyes performance of outside, write is more expensive than read. Some strategy of trade read for write may be considered. So, the asymmetric speed of read and write make it is still valuable to do some work on SSD. >You don't (and possibly can't) even know what > the underlying structure is like. And even if you did, the fact that > there's a always a regular operating system and filesystem underneath > PostgreSQL writes will make it undertain the writes are only touching the > tiny portion of flash you want to target anyway. They may write a whole > OS block regardless. Yeah, you're right. This is the most confused thing. I wish my thesis work is independent of low level flash device. But it's very hard in fact, just as what you said. -- Have a good day;-) Best Regards, Xiao Meng ━━━━━━━━━━━━━━━━━━━━━━━━━ Data and Knowledge Engineering Research Center,CS&T Harbin Institute of Technology, Harbin, China Gtalk: mx.cogito@gmail.com MSN: cnEnder@live.com Blog: http://xiaomeng.yo2.cn
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
mx
Date:
Thank you for all of your advices. I think you're right. I should be more realistic. There are so many work to do if I want to do some work on Flash disk. It's too difficult to complete the task only in a summer. Obviously, It's not an appropriate project idea for GSoC anyway. Maybe I'll do it in the future after I've done enough work according to my theis work. So, I finally decide to focus on the project idea of improving hash index now. It's more valuable , and also challenging. Any suggestion about the project idea of improving hash index? -- Have a good day;-) Best Regards, Xiao Meng ━━━━━━━━━━━━━━━━━━━━━━━━━ Data and Knowledge Engineering Research Center,CS&T Harbin Institute of Technology, Harbin, Heilongjiang, China Gtalk: mx.cogito@gmail.com MSN: cnEnder@live.com Blog: http://xiaomeng.yo2.cn
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
"Zeugswetter Andreas OSB SD"
Date:
> So, I finally decide to focus on the project idea of improving hash > index now. It's more valuable , and also challenging. > > Any suggestion about the project idea of improving hash index? Imho one thing to look into is the storage. I do not see any real value in storing the key itself (especially longer keys) in the hash buckets. Instead store the hash value only (or not even that) and mark the index lossy (recheck the key in the heap). Andreas
Re: [GSoC] (Is it OK to choose items without % mark in theToDoList) && (is it an acceptable idea to build index on Flash Disk)
From
Kenneth Marshall
Date:
On Tue, Mar 25, 2008 at 01:46:51PM +0100, Zeugswetter Andreas OSB SD wrote: > > So, I finally decide to focus on the project idea of improving hash > > index now. It's more valuable , and also challenging. > > > > Any suggestion about the project idea of improving hash index? > > Imho one thing to look into is the storage. I do not see any real value > in storing the key itself (especially longer keys) in the hash buckets. > Instead store the hash value only (or not even that) and mark the index > lossy (recheck the key in the heap). > > Andreas > Meng, I had started a thread on the hackers mailing list about improving the hash index in PostgreSQL. You can look through it for some of the ideas that were suggested. The first one is to replace the storage of the key values in the index with the hash of the key values instead. This can leverage the lossy index heap re-check code that is already in the database. Neil Conway had posted a patch doing this with an old version of PostgreSQL. My coding skills are a bit rusty and my job has kept me from making much progress towards this. Anyway, please take a look at the hash index thread in hackers. It starts with: http://archives.postgresql.org/pgsql-hackers/2007-09/msg00051.php Let me know what you think? Cheers, Ken Marshall