Re: Picking 25 samples of every domain - Mailing list pgsql-novice

From Benoit Izac
Subject Re: Picking 25 samples of every domain
Date
Msg-id 87iojrrb5h.fsf@izac.org
Whole thread Raw
In response to Picking 25 samples of every domain  (Gary Warner <gar@askgar.com>)
List pgsql-novice
Le 09/10/2014 à 23:25, Gary Warner écrivait :

> I have a set of Postgres tables that are related to URLs found in
> email.  The tables are BIG.  40-50 million records per day.  We are
> using them for some research into tricks spammers use to confound
> blacklists.  When we parse the URLs, we pull out the "domain" portion of
> each URL and store it in a field called "top_domain".  The full URL is
> available as "link".
>
> Through various forms of randomization, customization, and wild-carding,
> a domain may have as many as 1 million URLs per day.  I am needing a
> query that would grab a sample number of URLs per domain (let's say 25
> for conversation) . . . something that in pseudo-code might look like
>
> for each top_domain in urltable do
>     select top_domain, link limit 25;

CREATE TYPE urltable_type AS (top_domain text, link text);

CREATE OR REPLACE FUNCTION urltable_sample(integer)
RETURNS SETOF urltable_type AS $$
DECLARE
    td text;
BEGIN
    FOR td IN SELECT DISTINCT top_domain FROM urltable
    LOOP
        RETURN QUERY EXECUTE
           'SELECT top_domain::text, link::text
            FROM urltable
            WHERE top_domain = ''' || td || '''
            LIMIT ' || $1;
    END LOOP;
    RETURN;
END
$$ LANGUAGE 'plpgsql' ;

SELECT top_domain, link FROM urltable_sample(25);

> Thoughts on the fastest way to do a query like that?

No but I'm not an expert.

--
Benoit Izac


pgsql-novice by date:

Previous
From: Gary Warner
Date:
Subject: Picking 25 samples of every domain
Next
From: Jude DaShiell
Date:
Subject: is it possible to store results of aggregate calculations in table fields?