Re: Gsoc2012 Idea --- Social Network database schema - Mailing list pgsql-hackers

From Joshua Berkus
Subject Re: Gsoc2012 Idea --- Social Network database schema
Date
Msg-id 612260444.406269.1332619970287.JavaMail.root@mail-1.01.com
In response to Re: Gsoc2012 Idea --- Social Network database schema  (Qi Huang <huangqiyx@hotmail.com>)
Responses Gsoc2012 idea, tablesample  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
Qi,

Yeah, I can see that.  That's a sign that you had a good idea for a project, actually: your idea is interesting enough
that people want to debate it.  Make a proposal on Monday and our potential mentors will help you refine the idea.
 

----- Original Message -----
> 
> 
> 
> 
> > Date: Thu, 22 Mar 2012 13:17:01 -0400
> > Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database
> > schema
> > From: cbbrowne@gmail.com
> > To: Kevin.Grittner@wicourts.gov
> > CC: pgsql-hackers@postgresql.org
> > 
> > On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> > <Kevin.Grittner@wicourts.gov> wrote:
> > > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> Robert Haas <robertmhaas@gmail.com> writes:
> > >>> Well, the standard syntax apparently aims to reduce the number
> > >>> of
> > >>> returned rows, which ORDER BY does not. Maybe you could do it
> > >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> > >>> like to sample the table without reading all of it first, so
> > >>> that
> > >>> seems to miss the point.
> > >> 
> > >> I think actually the traditional locution is more like
> > >>     WHERE random() < constant
> > >> where the constant is the fraction of the table you want. And
> > >> yeah, the presumption is that you'd like it to not actually read
> > >> every row. (Though unless the sampling density is quite a bit
> > >> less than 1 row per page, it's not clear how much you're really
> > >> going to win.)
> > > 
> > > It's all going to depend on the use cases, which I don't think
> > > I've
> > > heard described very well yet.
> > > 
> > > I've had to pick random rows from, for example, a table of
> > > disbursements to support a financial audit. In those cases it has
> > > been the sample size that mattered, and order didn't. One
> > > interesting twist there is that for some of these financial
> > > audits
> > > they wanted the probability of a row being selected to be
> > > proportional to the dollar amount of the disbursement. I don't
> > > think you can do this without a first pass across the whole data
> > > set.
> > 
> > This one was commonly called "Dollar Unit Sampling," though the
> > terminology has gradually gotten internationalized.
> > http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
> > 
> > What the article doesn't mention is that some particularly large
> > items
> > might wind up covering multiple samples. In the example, they're
> > looking for a sample every $3125 down the list. If there was a
> > single
> > transaction valued at $30000, that (roughly) covers 10 of the
> > desired
> > samples.
> > 
> > It isn't possible to do this without scanning across the entire
> > table.
> > 
> > If you want repeatability, you probably want to instantiate a copy
> > of
> > enough information to indicate the ordering chosen. That's probably
> > something that needs to be captured as part of the work of the
> > audit,
> > so not only does it need to involve a pass across the data, it
> > probably requires capturing a fair bit of data for posterity.
> > --
> > When confronted by a difficult problem, solve it by reducing it to
> > the
> > question, "How would the Lone Ranger handle this?"
> 
> 
> 
> 
> 
> 
> The discussion till now has gone far beyond my understanding.....
> Could anyone explain briefly what is the idea for now?
> The designing detail for me is still unfamiliar. I can only take time
> to understand while possible after being selected and put time on it
> to read relevant material.
> For now, I'm still curious why Neil's implementation is no longer
> working? The Postgres has been patched a lot, but the general idea
> behind Neil's implementation should still work, isn't it?
> Besides, whether this query is needed is still not decided. Seems
> this is another hard to decide point. Is it that this topic is still
> not so prepared for the Gsoc yet? If really so, I think I still
> have time to switch to other topics. Any suggestion?
> 
> 
> Thanks.
> 
> Best Regards and Thanks
> Huang Qi Victor
> Computer Science of National University of Singapore
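
As a brief illustration of the "Dollar Unit Sampling" described in the quoted thread above, here is a minimal Python sketch, not anything taken from PostgreSQL or from Neil's patch. It assumes systematic selection: walk the running total of amounts and take a sample point every fixed interval, starting from a random offset. The function name, parameters, and example amounts are hypothetical. It shows why a full pass over the data is needed (the running total depends on every row) and how one large item can absorb several sample points.

```python
import random


def dollar_unit_sample(amounts, interval, seed=None):
    """Systematic monetary unit sampling (illustrative sketch).

    amounts  -- non-negative transaction amounts, in list order
    interval -- distance between sample points along the running total
    seed     -- optional seed, so the same selection can be reproduced

    Returns a dict mapping item index -> number of sample points that
    landed inside that item's slice of the running total.
    """
    rng = random.Random(seed)
    # Random starting offset in [0, interval); recording the seed is one
    # way to make the selection repeatable for a later review.
    next_point = rng.random() * interval
    hits = {}
    running_total = 0.0
    # A single pass over every row: the position of each sample point
    # depends on the cumulative sum of all preceding amounts.
    for index, amount in enumerate(amounts):
        running_total += amount
        # One item may contain several sample points if it is large.
        while next_point < running_total:
            hits[index] = hits.get(index, 0) + 1
            next_point += interval
    return hits


if __name__ == "__main__":
    # Hypothetical disbursements; the second one is the large item.
    example = [1000, 30000, 500, 2000]
    print(dollar_unit_sample(example, interval=3125, seed=42))
```

With an interval of 3125, the 30000 item spans 9.6 intervals, so it receives 9 or 10 of the sample points depending on the random start, which matches the "roughly 10" estimate in the quoted message.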

