Re: Google's Summer of Code ... - Mailing list pgsql-hackers
From | Meredith L. Patterson |
---|---|
Subject | Re: Google's Summer of Code ... |
Date | |
Msg-id | 429E1F99.9020307@thesmartpolitenerd.com Whole thread Raw |
In response to | Re: Google's Summer of Code ... ("Marc G. Fournier" <scrappy@postgresql.org>) |
Responses |
Re: Google's Summer of Code ...
|
List | pgsql-hackers |
Marc G. Fournier wrote: > Do we have any 'students' that are already up to speed, enough so that > they'd be able to accomplish something significant over a 2-3 month period? Well, I suppose now might be a good time to de-lurk. Hi, my name's Meredith, I'm a PhD student at the University of Iowa, I've been reading pgsql-hackers for a few months now, and I'm planning on submitting a Postgres-related project to the Summer of Code program. My research area is data mining, and the problem I'll be focusing on for my dissertation is that of "soft queries" -- basically, determining ranking functions that reflect a user's preferences when searching through a large data set. (As opposed to "hard queries", e.g., binary filtering conditions in the WHERE clause.) Specifically, we want to derive this function even in cases where the user can't specify it himself (e.g., can't write an appropriate ORDER BY clause, perhaps because he can't objectively evaluate how much he values each feature of the data). We've developed a fairly simple user interface for this; overall, the procedure looks like: 1. User sees a small (<6 entities) set of sample data. 2. User puts each entity into one of three categories: "preferred," "considered," "not preferred" 3. System trains a ranking support vector machine based on the partial orderingsof training data the user has just provided. 4. System uses this function to provide the user with a new sample set, which the user has the option to rerank in order to retrain the function. 5. Lather, rinse, repeat. In our experiments, we've found that it takes at most maybe three or four iterations to converge on a good (i.e., representative of what the user wants) function. However, at the moment we're limited to generating linear functions, because the kernel function that a nonlinear SVM generates is not well suited to translating into an ORDER BY clause. A linear SVM generates a linear weight vector, one element for each element in a data vector -- so, if you're training on values from myTable.foo, myTable.bar and myTable.baz, and your linear SVM gives you a weight vector <a, b, c>, this trivially translates into ORDER BY (a * myTable.foo + b * myTable.bar + c * myTable.baz) If you'd like to see this in practice, feel free to check out our demonstration setup at http://austin.cs.uiowa.edu/charun. (Source to come; the SVM implementation we're using right now is encumbered by restrictive licensing, so I need to detangle our code from it first.) You'll probably notice quickly, however, that if you prefer middle-of-the-road values, the function will heterodyne all over the place. This is because a linear function, to be useful, requires that the data be _linearly separable_ -- i.e., you can draw a single straight line through the feature space to partition the data into two categories -- but middle-of-the-road values require a nonlinear function. Thus the desire to be able to support nonlinear kernel functions. It is possible, but highly inconvenient, to translate certain types of kernel functions into linear weight vectors, but in practice, I think it'd be easier -- and more robust -- to expand PostgreSQL such that a nonlinear kernel function can be used as the argument to an ORDER BY clause. I've been reluctant to mention this in the past, mainly because I don't see it as enormously useful to Postgres users as a whole; your average user doesn't know what a support vector machine is, and while I have a laundry list of use cases for this kind of search capability, the Postgres end of it is more useful as support for the system as a whole rather than a standalone Postgres feature. (As far as I can see, anyway. I won't complain if someone has a reason why it's useful for the ordinary user. :) Anyway, I'm submitting the proposal for the entire system; my original plan was to suggest Google as my mentoring organization, but I'd also be happy to work directly with the PostgreSQL Global Development Group if there's interest. I'm already intimately familiar with the PostgreSQL query parser thanks to an anti-SQL-injection app that a colleague of mine and I have been working on (of which, more later; see http://www.sixdemonbag.org/HP2005.pdf for an overview), and am quite comfortable with expanding it. I'm less familiar with the planner and optimizer, and don't as yet have a terribly good feel for how expanding the ORDER BY syntax would affect these pieces (my intuition: planner yes, optimizer ... maybe?), but after picking apart the scanner and parser in exhaustive detail, I feel pretty comfortable with the coding style and how memory management works. So I wouldn't say I'm all the way up the learning curve, but I've got a head start. Phew. Thanks for reading. I know there are a lot of TODO items that are high on the priority list, and I see that others here already know students who are working on projects more closely related to those things, so I understand entirely if y'all would prefer to work with someone who's adding more directly useful functionality to Postgres. I'm very glad to see the PGDG getting involved with the Summer of Code project, and either way, I look forward to having enough free time to start tackling various TODO items myself. :) Cheers, Meredith L. Patterson
pgsql-hackers by date: