Re: Google's Summer of Code ... - Mailing list pgsql-hackers

From Meredith L. Patterson
Subject Re: Google's Summer of Code ...
Date
Msg-id 429E1F99.9020307@thesmartpolitenerd.com
Whole thread Raw
In response to Re: Google's Summer of Code ...  ("Marc G. Fournier" <scrappy@postgresql.org>)
Responses Re: Google's Summer of Code ...
List pgsql-hackers
Marc G. Fournier wrote:
> Do we have any 'students' that are already up to speed, enough so that
> they'd be able to accomplish something significant over a 2-3 month period?

Well, I suppose now might be a good time to de-lurk. Hi, my name's
Meredith, I'm a PhD student at the University of Iowa, I've been reading
pgsql-hackers for a few months now, and I'm planning on submitting a
Postgres-related project to the Summer of Code program.

My research area is data mining, and the problem I'll be focusing on for
my dissertation is that of "soft queries" -- basically, determining
ranking functions that reflect a user's preferences when searching
through a large data set. (As opposed to "hard queries", e.g., binary
filtering conditions in the WHERE clause.) Specifically, we want to
derive this function even in cases where the user can't specify it
himself (e.g., can't write an appropriate ORDER BY clause, perhaps
because he can't objectively evaluate how much he values each feature of
the data). We've developed a fairly simple user interface for this;
overall, the procedure looks like:
 1. User sees a small (<6 entities) set of sample data. 2. User puts each entity into one of three categories:
"preferred,"   "considered," "not preferred" 3. System trains a ranking support vector machine based on the partial
orderingsof training data the user has just provided. 4. System uses this function to provide the user with a new
sample   set, which the user has the option to rerank in order to retrain    the function. 5. Lather, rinse, repeat.
 

In our experiments, we've found that it takes at most maybe three or
four iterations to converge on a good (i.e., representative of what the
user wants) function. However, at the moment we're limited to generating
linear functions, because the kernel function that a nonlinear SVM
generates is not well suited to translating into an ORDER BY clause. A
linear SVM generates a linear weight vector, one element for each
element in a data vector -- so, if you're training on values from
myTable.foo, myTable.bar and myTable.baz, and your linear SVM gives you
a weight vector <a, b, c>, this trivially translates into
 ORDER BY (a * myTable.foo + b * myTable.bar + c * myTable.baz)

If you'd like to see this in practice, feel free to check out our
demonstration setup at http://austin.cs.uiowa.edu/charun. (Source to
come; the SVM implementation we're using right now is encumbered by
restrictive licensing, so I need to detangle our code from it first.)
You'll probably notice quickly, however, that if you prefer
middle-of-the-road values, the function will heterodyne all over the
place. This is because a linear function, to be useful, requires that
the data be _linearly separable_ -- i.e., you can draw a single straight
line through the feature space to partition the data into two categories
-- but middle-of-the-road values require a nonlinear function. Thus the
desire to be able to support nonlinear kernel functions.

It is possible, but highly inconvenient, to translate certain types of
kernel functions into linear weight vectors, but in practice, I think
it'd be easier -- and more robust -- to expand PostgreSQL such that a
nonlinear kernel function can be used as the argument to an ORDER BY clause.

I've been reluctant to mention this in the past, mainly because I don't
see it as enormously useful to Postgres users as a whole; your average
user doesn't know what a support vector machine is, and while I have a
laundry list of use cases for this kind of search capability, the
Postgres end of it is more useful as support for the system as a whole
rather than a standalone Postgres feature. (As far as I can see, anyway.
I won't complain if someone has a reason why it's useful for the
ordinary user. :) Anyway, I'm submitting the proposal for the entire
system; my original plan was to suggest Google as my mentoring
organization, but I'd also be happy to work directly with the PostgreSQL
Global Development Group if there's interest.

I'm already intimately familiar with the PostgreSQL query parser thanks
to an anti-SQL-injection app that a colleague of mine and I have been
working on (of which, more later; see
http://www.sixdemonbag.org/HP2005.pdf for an overview), and am quite
comfortable with expanding it. I'm less familiar with the planner and
optimizer, and don't as yet have a terribly good feel for how expanding
the ORDER BY syntax would affect these pieces (my intuition: planner
yes, optimizer ... maybe?), but after picking apart the scanner and
parser in exhaustive detail, I feel pretty comfortable with the coding
style and how memory management works. So I wouldn't say I'm all the way
up the learning curve, but I've got a head start.

Phew. Thanks for reading. I know there are a lot of TODO items that are
high on the priority list, and I see that others here already know
students who are working on projects more closely related to those
things, so I understand entirely if y'all would prefer to work with
someone who's adding more directly useful functionality to Postgres. I'm
very glad to see the PGDG getting involved with the Summer of Code
project, and either way, I look forward to having enough free time to
start tackling various TODO items myself. :)

Cheers,
Meredith L. Patterson


pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: NOLOGGING option, or ?
Next
From: Robert Treat
Date:
Subject: Re: Google's Summer of Code ...