Re: Gsoc2012 Idea --- Social Network database schema - Mailing list pgsql-hackers
From | Marc Mamin |
---|---|
Subject | Re: Gsoc2012 Idea --- Social Network database schema |
Date | |
Msg-id | C4DAC901169B624F933534A26ED7DF310861B47C@JENMAIL01.ad.intershop.net |
In response to | Re: Gsoc2012 Idea --- Social Network database schema (Qi Huang <huangqiyx@hotmail.com>) |
Responses | Re: Gsoc2012 Idea --- Social Network database schema |
List | pgsql-hackers |
Hello,
Here is something we'd like to have:
http://archives.postgresql.org/pgsql-hackers/2012-01/msg00650.php
As we are quite busy and this issue isn't a high priority for us, we haven't followed up on it until now :-(
I'm only a Postgres user, not a hacker, so I don't have the knowledge to help with this or to judge whether it might be a good GSoC project.
Just an idea in case you are looking for another topic.
best regards,
Marc Mamin
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Qi Huang
Sent: Saturday, 24 March 2012 05:20
To: cbbrowne@gmail.com; kevin.grittner@wicourts.gov
Cc: pgsql-hackers@postgresql.org; andres@anarazel.de; alvherre@commandprompt.com; neil.conway@gmail.com; daniel@heroku.com; josh@agliodbs.com
Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> Date: Thu, 22 Mar 2012 13:17:01 -0400
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> From: cbbrowne@gmail.com
> To: Kevin.Grittner@wicourts.gov
> CC: pgsql-hackers@postgresql.org
>
> On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
> > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Robert Haas <robertmhaas@gmail.com> writes:
> >>> Well, the standard syntax apparently aims to reduce the number of
> >>> returned rows, which ORDER BY does not. Maybe you could do it
> >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> >>> like to sample the table without reading all of it first, so that
> >>> seems to miss the point.
> >>
> >> I think actually the traditional locution is more like
> >> WHERE random() < constant
> >> where the constant is the fraction of the table you want. And
> >> yeah, the presumption is that you'd like it to not actually read
> >> every row. (Though unless the sampling density is quite a bit
> >> less than 1 row per page, it's not clear how much you're really
> >> going to win.)
> >
> > It's all going to depend on the use cases, which I don't think I've
> > heard described very well yet.
> >
> > I've had to pick random rows from, for example, a table of
> > disbursements to support a financial audit. In those cases it has
> > been the sample size that mattered, and order didn't. One
> > interesting twist there is that for some of these financial audits
> > they wanted the probability of a row being selected to be
> > proportional to the dollar amount of the disbursement. I don't
> > think you can do this without a first pass across the whole data
> > set.
>
> This one was commonly called "Dollar Unit Sampling," though the
> terminology has gradually gotten internationalized.
> http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
>
> What the article doesn't mention is that some particularly large items
> might wind up covering multiple samples. In the example, they're
> looking for a sample every $3125 down the list. If there was a single
> transaction valued at $30000, that (roughly) covers 10 of the desired
> samples.
>
> It isn't possible to do this without scanning across the entire table.
>
> If you want repeatability, you probably want to instantiate a copy of
> enough information to indicate the ordering chosen. That's probably
> something that needs to be captured as part of the work of the audit,
> so not only does it need to involve a pass across the data, it
> probably requires capturing a fair bit of data for posterity.
> --
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"
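The two sampling approaches described in the quoted messages above can be sketched in a few lines. This is only an illustration under stated assumptions, not anything from the thread or from Postgres itself: the function names `bernoulli_sample` and `monetary_unit_sample`, the in-memory list of amounts, and the `start` offset parameter are all hypothetical. Both functions visit every row, which matches the point made above that dollar unit sampling needs a pass across the whole data set.

```python
import random
from typing import Iterable, List, Sequence, Tuple


def bernoulli_sample(rows: Iterable, fraction: float, rng: random.Random) -> List:
    """Keep each row independently with probability `fraction`.

    Mirrors the `WHERE random() < constant` idiom; every row is still visited.
    """
    return [row for row in rows if rng.random() < fraction]


def monetary_unit_sample(
    amounts: Sequence[float], interval: float, start: float
) -> List[Tuple[int, int]]:
    """Systematic dollar-unit sampling (hypothetical helper, for illustration).

    Places a sample point every `interval` dollars along the running total,
    beginning at offset `start`, and returns (row_index, hits) for each row
    that contains at least one sample point. Amounts are assumed non-negative.
    """
    if not 0 <= start < interval:
        raise ValueError("start must satisfy 0 <= start < interval")
    selected = []
    next_point = start
    cumulative = 0.0
    for index, amount in enumerate(amounts):
        cumulative += amount
        hits = 0
        # A large item may swallow several consecutive sample points.
        while next_point < cumulative:
            hits += 1
            next_point += interval
        if hits:
            selected.append((index, hits))
    return selected


if __name__ == "__main__":
    # The $30000 item absorbs most of the sample points at a $3125 interval.
    print(monetary_unit_sample([1000, 30000, 500, 2000], 3125, 500))
    print(len(bernoulli_sample(range(1000), 0.1, random.Random(0))))
```

For repeatability, as suggested above, an audit would presumably record the row ordering, the interval and the starting offset (or the random seed) alongside the selected rows.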
The discussion so far has gone far beyond my understanding.
Could anyone briefly explain what the current idea is?
The design details are still unfamiliar to me. I will only be able to take the time to understand them once I am selected and can put time into reading the relevant material.
For now, I'm still curious why Neil's implementation no longer works. Postgres has been patched a lot, but the general idea behind Neil's implementation should still work, shouldn't it?
Besides, whether this query is needed has still not been decided, which seems to be another hard point to settle. Is this topic not yet ready for GSoC? If so, I think I still have time to switch to another topic. Any suggestions?
Thanks.
Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore