Re: Gsoc2012 Idea --- Social Network database schema - Mailing list pgsql-hackers

From	Marc Mamin
Subject	Re: Gsoc2012 Idea --- Social Network database schema
Date	March 25, 2012 07:11:49
Msg-id	C4DAC901169B624F933534A26ED7DF310861B47C@JENMAIL01.ad.intershop.net
In response to	Re: Gsoc2012 Idea --- Social Network database schema (Qi Huang <huangqiyx@hotmail.com>)
Responses	Re: Gsoc2012 Idea --- Social Network database schema
List	pgsql-hackers

Hello,

Here is something we'd like to have:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg00650.php

As we are quite busy and this issue isn't a high priority for us, we haven't followed up on it so far :-(

I'm only a Postgres user, not a hacker, so I don't have the knowledge to help with this, nor to evaluate whether it might be a good GSoC project.

Just an idea in case you are looking for another topic.

best regards,

Marc Mamin

From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Qi Huang
Sent: Saturday, 24 March 2012 05:20
To: cbbrowne@gmail.com; kevin.grittner@wicourts.gov
Cc: pgsql-hackers@postgresql.org; andres@anarazel.de; alvherre@commandprompt.com; neil.conway@gmail.com; daniel@heroku.com; josh@agliodbs.com
Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

> Date: Thu, 22 Mar 2012 13:17:01 -0400
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> From: cbbrowne@gmail.com
> To: Kevin.Grittner@wicourts.gov
> CC: pgsql-hackers@postgresql.org
>
> On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
> > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Robert Haas <robertmhaas@gmail.com> writes:
> >>> Well, the standard syntax apparently aims to reduce the number of
> >>> returned rows, which ORDER BY does not. Maybe you could do it
> >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> >>> like to sample the table without reading all of it first, so that
> >>> seems to miss the point.
> >>
> >> I think actually the traditional locution is more like
> >> WHERE random() < constant
> >> where the constant is the fraction of the table you want. And
> >> yeah, the presumption is that you'd like it to not actually read
> >> every row. (Though unless the sampling density is quite a bit
> >> less than 1 row per page, it's not clear how much you're really
> >> going to win.)
> >
> > It's all going to depend on the use cases, which I don't think I've
> > heard described very well yet.
> >
> > I've had to pick random rows from, for example, a table of
> > disbursements to support a financial audit. In those cases it has
> > been the sample size that mattered, and order didn't. One
> > interesting twist there is that for some of these financial audits
> > they wanted the probability of a row being selected to be
> > proportional to the dollar amount of the disbursement. I don't
> > think you can do this without a first pass across the whole data
> > set.
>
> This one was commonly called "Dollar Unit Sampling," though the
> terminology has gradually gotten internationalized.
> http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
>
> What the article doesn't mention is that some particularly large items
> might wind up covering multiple samples. In the example, they're
> looking for a sample every $3125 down the list. If there was a single
> transaction valued at $30000, that (roughly) covers 10 of the desired
> samples.
>
> It isn't possible to do this without scanning across the entire table.
>
> If you want repeatability, you probably want to instantiate a copy of
> enough information to indicate the ordering chosen. That's probably
> something that needs to be captured as part of the work of the audit,
> so not only does it need to involve a pass across the data, it
> probably requires capturing a fair bit of data for posterity.
> --
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"
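
To make the two sampling ideas quoted above concrete, here is a minimal Python sketch. It is not taken from any patch in this thread. The function names `bernoulli_sample` and `monetary_unit_sample`, and the `start` offset parameter, are hypothetical names chosen for illustration. The first function imitates the `WHERE random() < constant` idiom; the second walks the running dollar total and places one sample point per interval, so a large item can be hit several times. Both visit every row, which matches the remark that this cannot be done without a full pass.

```python
import random
from typing import List, Sequence, Tuple


def bernoulli_sample(rows: Sequence, fraction: float, rng: random.Random) -> List:
    """Keep each row independently with probability `fraction`.

    Imitates `WHERE random() < constant`: every row is still visited,
    and the result size varies from run to run.
    """
    return [row for row in rows if rng.random() < fraction]


def monetary_unit_sample(
    amounts: Sequence[float], interval: float, start: float
) -> List[Tuple[int, int]]:
    """Systematic dollar-unit sampling over a list of amounts.

    Sample points sit at start, start + interval, start + 2 * interval, ...
    on the cumulative-dollar axis. Returns (row_index, hits) for every row
    whose dollar range contains at least one sample point.
    """
    if interval <= 0 or not 0 <= start < interval:
        raise ValueError("need interval > 0 and 0 <= start < interval")
    hits: List[Tuple[int, int]] = []
    next_point = start
    running = 0.0
    for index, amount in enumerate(amounts):
        running += amount  # this row covers [running - amount, running)
        count = 0
        while next_point < running:
            count += 1
            next_point += interval
        if count:
            hits.append((index, count))
    return hits
```

With the interval from the quoted example, `monetary_unit_sample([1000, 30000, 500, 2000], interval=3125, start=0)` returns `[(0, 1), (1, 9), (2, 1)]`: the single $30000 item absorbs 9 sample points here (30000 / 3125 = 9.6, so 9 or 10 depending on the offset). In an audit the `start` offset would normally be drawn at random from `[0, interval)` and recorded for repeatability.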

The discussion so far has gone well beyond my understanding.

Could anyone briefly explain what the current idea is?

The design details are still unfamiliar to me. I will only be able to understand them properly after being selected, when I can spend time reading the relevant material.

For now, I'm still curious why Neil's implementation no longer works. Postgres has been patched a lot since then, but the general idea behind Neil's implementation should still work, shouldn't it?

Besides, whether this query feature is needed at all is still undecided, which seems to be another hard point to settle. Is this topic perhaps not yet ready for GSoC? If so, I think I still have time to switch to another topic. Any suggestions?

Thanks.

Best Regards and Thanks

Huang Qi Victor

Computer Science of National University of Singapore
