Re: Gsoc2012 Idea --- Social Network database schema - Mailing list pgsql-hackers
From | Joshua Berkus |
---|---|
Subject | Re: Gsoc2012 Idea --- Social Network database schema |
Date | |
Msg-id | 612260444.406269.1332619970287.JavaMail.root@mail-1.01.com |
In response to | Re: Gsoc2012 Idea --- Social Network database schema (Qi Huang <huangqiyx@hotmail.com>) |
Responses | Gsoc2012 idea, tablesample |
List | pgsql-hackers |
Qi,

Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea.

----- Original Message -----
> > Date: Thu, 22 Mar 2012 13:17:01 -0400
> > Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> > From: cbbrowne@gmail.com
> > To: Kevin.Grittner@wicourts.gov
> > CC: pgsql-hackers@postgresql.org
> >
> > On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> > <Kevin.Grittner@wicourts.gov> wrote:
> > > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> Robert Haas <robertmhaas@gmail.com> writes:
> > >>> Well, the standard syntax apparently aims to reduce the number of
> > >>> returned rows, which ORDER BY does not. Maybe you could do it
> > >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> > >>> like to sample the table without reading all of it first, so that
> > >>> seems to miss the point.
> > >>
> > >> I think actually the traditional locution is more like
> > >> WHERE random() < constant
> > >> where the constant is the fraction of the table you want. And
> > >> yeah, the presumption is that you'd like it to not actually read
> > >> every row. (Though unless the sampling density is quite a bit
> > >> less than 1 row per page, it's not clear how much you're really
> > >> going to win.)
> > >
> > > It's all going to depend on the use cases, which I don't think I've
> > > heard described very well yet.
> > >
> > > I've had to pick random rows from, for example, a table of
> > > disbursements to support a financial audit. In those cases it has
> > > been the sample size that mattered, and order didn't. One
> > > interesting twist there is that for some of these financial audits
> > > they wanted the probability of a row being selected to be
> > > proportional to the dollar amount of the disbursement. I don't
> > > think you can do this without a first pass across the whole data
> > > set.
> >
> > This one was commonly called "Dollar Unit Sampling," though the
> > terminology has gradually gotten internationalized.
> > http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
> >
> > What the article doesn't mention is that some particularly large items
> > might wind up covering multiple samples. In the example, they're
> > looking for a sample every $3125 down the list. If there was a single
> > transaction valued at $30000, that (roughly) covers 10 of the desired
> > samples.
> >
> > It isn't possible to do this without scanning across the entire
> > table.
> >
> > If you want repeatability, you probably want to instantiate a copy of
> > enough information to indicate the ordering chosen. That's probably
> > something that needs to be captured as part of the work of the audit,
> > so not only does it need to involve a pass across the data, it
> > probably requires capturing a fair bit of data for posterity.
> > --
> > When confronted by a difficult problem, solve it by reducing it to the
> > question, "How would the Lone Ranger handle this?"
>
> The discussion till now has gone far beyond my understanding.....
> Could anyone explain briefly what is the idea for now?
> The designing detail for me is still unfamiliar. I can only take time
> to understand while possible after being selected and put time on it
> to read relevant material.
> For now, I'm still curious why Neil's implementation is no longer
> working? The Postgres has been patched a lot, but the general idea
> behind Neil's implementation should still work, isn't it?
> Besides, whether this query is needed is still not decided. Seems
> this is another hard to decide point. Is it that this topic is still
> not so prepared for the Gsoc yet? If really so, I think I still
> have time to switch to other topics. Any suggestion?
>
> Thanks.
>
> Best Regards and Thanks
> Huang Qi Victor
> Computer Science of National University of Singapore
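
For readers following the "Dollar Unit Sampling" arithmetic quoted above, here is a minimal Python sketch of systematic monetary unit sampling. It is not PostgreSQL code and not anything proposed in the thread: the function name `monetary_unit_sample`, its parameters, and the sample amounts are all hypothetical. It assumes non-negative amounts and a fixed starting offset (a real audit would presumably choose the offset at random). It illustrates the two points made in the quoted text: the running total requires a pass over every row, and a row larger than the interval is hit more than once.

```python
from bisect import bisect_right
from itertools import accumulate


def monetary_unit_sample(amounts, interval, start=0):
    """Return the index of the row containing each sampled dollar.

    Hypothetical helper for illustration only. One sampling point is
    placed every `interval` dollars along the running total, beginning
    at dollar offset `start`. A row appears once per point that falls
    inside it, so large rows can appear several times.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not 0 <= start < interval:
        raise ValueError("start must satisfy 0 <= start < interval")

    # The running total needs every row: this is the full pass over the data.
    totals = list(accumulate(amounts))

    hits = []
    point = start
    while totals and point < totals[-1]:
        # Row i covers dollars [totals[i-1], totals[i]); bisect_right finds
        # the first row whose running total exceeds the sampling point.
        hits.append(bisect_right(totals, point))
        point += interval
    return hits


# Assumed example data: one $30000 disbursement among smaller ones.
disbursements = [1000, 30000, 500, 2000, 4000]
picks = monetary_unit_sample(disbursements, interval=3125)
print(picks)           # [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4]
print(picks.count(1))  # 9: the $30000 row absorbs nine sampling points here
```

With these amounts the $30000 row is selected 9 times; a $30000 row on its own with `start=0` is selected 10 times, which matches the "(roughly) covers 10" estimate in the quoted message. The exact count depends on where the row falls relative to the sampling points.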