Google Summer of Code 2008 - Mailing list pgsql-hackers
From | Jan Urbański |
---|---|
Subject | Google Summer of Code 2008 |
Date | |
Msg-id | 47CC53C1.5000609@students.mimuw.edu.pl |
Responses | Re: Google Summer of Code 2008 |
List | pgsql-hackers |
Hi PostgreSQL!

Although this year's GSoC is just starting, I thought getting in touch a bit earlier would only be of benefit.

I study Computer Science at the Faculty of Mathematics, Informatics and Mechanics of Warsaw University. I'm currently in my fourth year of studies. Having chosen Databases for my degree course, I plan to write my thesis concentrating at least partially on PostgreSQL. This will (hopefully) be my first GSoC.

For the past one and a half years I've also been working at a privately held company, Fiok LLP. The company deals, among other things, in developing custom Web applications, all of which use PostgreSQL as their database solution. During my time at Fiok I have taken part in creating an accounting system for a large Polish university, capable of generating financial reports required by the European Union, a publishing platform for editors working in the Polish Catholic Press Agency, and a custom-tailored CRM application. All of these projects use unique PostgreSQL features, like PITR and full-text search, to name a few.

You can glimpse the implemented FTS functionality by looking here:
http://system.ekai.pl/kair/?_tw_DepeszeKlientaTable_0__search_plainfulltext=kalendarz&_tw_DepeszeKlientaTable_0__search_rank_orderby=on&screen=depesze

It's the public part of the publishing platform, which allows subscribed readers to view published messages. The link takes you to search results for the word 'kalendarz' (which is Polish for calendar), ordered by rank() and highlighted by headline() (our client uses 8.2, hence the old function names).

I do my work at Fiok almost exclusively from home, showing up at the office once every two or three weeks, so working in a distributed environment using SCM tools is natural to me.

I'm also engaged in an open source project called Kato, being one of the key developers. It's a small project that started as my company's requirement for a new Web application framework and ended up being released under the New BSD License.
Of course its native database engine is PostgreSQL. You can take a look at the source here:
http://kato.googlecode.com/
or play around with a simple demo here:
http://sahara.fiok.pl/~jurbanski/kato-demo/kato-demo.en.php

Speaking of open source contributions, I also wrote an FTS-related patch for Postgres that made its way into 8.3:
http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php

I try to follow -patches, occasionally read -hackers and sometimes make excursions around the pgsql source, trying to learn more and more of it.

About my programming skills, particularly in C - one piece of code I'd like to show you was written for an Operating Systems course. It's a kernel patch implementing I/O operation throttling on a per-process basis through a /proc-based interface. The code lacks comments, as they were in Polish, but it's just to assure you I'm able to write some good C:
http://students.mimuw.edu.pl/~wulczer/linux-2.6.17.13-iolimits-ju219721.patch

And now for the SoC. As this year's PostgreSQL Ideas are not set up yet, I thought I'd give you the two projects floating through my mind.

1. WAL segment files explorer / mangler

While preparing a presentation about PITR and warm standby in PostgreSQL for my degree course, I thought it would be nice to have a command-line tool to examine the contents of a WAL segment file and determine, for example, what commands were recorded in it, which transaction IDs they belonged to, etc. This could make it possible, for instance, to replay the WAL sequence up until a function went haywire and wrecked one's data - without the need to know *when* the accident happened.

It could be useful as an alternative method of logging operations - since WAL files are written anyway, one could imagine a process periodically looking through them and reporting (perhaps not all) operations to some external listener.
If, for instance, you were curious which column in a table is updated most, instead of writing a trigger to log updates to it, you could use the WAL explorer to find updates to that column and log them over the network, thus reducing disk I/O.

Being even bolder, I thought about allowing the contents of a WAL file to be edited, so if the proverbial junior DBA drops a crucial table and gets caught the next morning, you don't have to throw away all transactions that got committed overnight. Maybe you could *overwrite* his DROP TABLE with something neutral and replay the WAL up to its end.

2. Implement better selectivity estimates for FTS

If I'm not mistaken, the @@ operator still uses the contsel selectivity function, returning 0.001 * <total_row_count> as the expected number of rows left after applying the @@ operator. I have in the past been bitten by performance problems that I think could be traced back to row count estimates being horribly wrong (i.e. much too low) for FTS queries asking for a very popular word. Maybe we could do better than just return one-thousandth?

I myself am more for the first idea, but both seem good concepts to me. Also, both are implementable as contrib modules, with the WAL explorer possibly requiring modification to the WAL structure, and thus having to wait for 8.4 to get into core.

As of now, these are just loose ideas, but ones I believe are possible to implement within the time boundaries of GSoC coding. Before digging deeper into the source and giving them more thought, I wanted to consult some more experienced PostgreSQL hackers and get their opinions - after all, that's what the community is for.

To wrap it up: do you find any of these ideas worthwhile? Could they be good candidates for a GSoC project? Of course, doing some stuff from the TODO list would still be fun, if you believe those items are more promising/needed. Basically, any kind of involvement in PostgreSQL is something that gets me excited.
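
To put rough numbers on the second idea, here is a minimal sketch in Python (not PostgreSQL code) comparing the constant 0.001 estimate with a frequency-based one. The function names and the `word_frequency` input are hypothetical, made up purely for illustration; where such a frequency would actually come from is exactly the open question.

```python
# Illustrative only: this is not PostgreSQL source code.
CONSTANT_SELECTIVITY = 0.001  # the fixed fraction described for the @@ operator


def constant_estimate(total_rows: int) -> float:
    """Row estimate when every @@ match is assumed to keep 0.1% of the rows."""
    return CONSTANT_SELECTIVITY * total_rows


def frequency_estimate(total_rows: int, word_frequency: float) -> float:
    """Hypothetical alternative: scale by the fraction of rows containing the word."""
    if not 0.0 <= word_frequency <= 1.0:
        raise ValueError("word_frequency must be between 0 and 1")
    return word_frequency * total_rows


def underestimate_factor(word_frequency: float) -> float:
    """How many times too low the constant estimate is for a given word frequency."""
    return word_frequency / CONSTANT_SELECTIVITY


if __name__ == "__main__":
    rows = 1_000_000
    popular = 0.30  # assumed: the word appears in 30% of the documents
    print("constant estimate: ", constant_estimate(rows))
    print("frequency estimate:", frequency_estimate(rows, popular))
    print("underestimated by a factor of", underestimate_factor(popular))
```

The point is only that for a very popular word the constant estimate can be off by orders of magnitude, which is the kind of error that could push the planner toward a poor plan.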
Hope to hear from you,
Cheers,
Jan Urbanski

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin