Thread: Picking 25 samples of every domain

From
Gary Warner
Date:

I have a set of Postgres tables that are related to URLs found in
email.  The tables are BIG.  40-50 million records per day.  We are
using them for some research into tricks spammers use to confound
blacklists.  When we parse the URLs, we pull out the "domain" portion of
each URL and store it in a field called "top_domain".  The full URL is
available as "link".

Through various forms of randomization, customization, and wild-carding,
a domain may have as many as 1 million URLs per day.  I need a
query that would grab a sample number of URLs per domain (let's say 25
for conversation) . . . something that in pseudo-code might look like

for each top_domain in urltable do
    select top_domain, link limit 25;

Thoughts on the fastest way to do a query like that?

Real examples (to sort of show the point . . .)


spam_urls=# select top_domain, link from spam_info_2014_09_20 where
top_domain = 'pacedoctor[.]ru' limit 10;

(To reduce risk, I'm changing "." to "[.]" in the output.)


  top_domain   |               link
-----------------+-----------------------------------
 pacedoctor[.]ru | http://kftdouhixn[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://ozukalaj[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://knlssrmp[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://aaztcobtwztx[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://kqyarrarn[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://zzj[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://piqtj[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://wahednc[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://fjhmin[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://wevtwo[.]pacedoctor[.]ru
(10 rows)

 top_domain  |                                    link
-------------+-----------------------------------------------------------------------------
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex8[.]html
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex7[.]html
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex6[.]html
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex5[.]html
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex4[.]html
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex3[.]html
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex2[.]html
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex1[.]html
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-838-106207395/vonmill/u[.]html
 i-cer[.]com | http://www[.]i-cer[.]com/6907-372-838-106207395/vonmill/rindex14[.]html
(10 rows)

   top_domain   |                                link
----------------+---------------------------------------------------------------------
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/u[.]html
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/rindex5[.]html
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex4[.]html
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex3[.]html
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex2[.]html
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex1[.]html
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/u[.]html
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/rindex5[.]html
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/tindex4[.]html
 ezzesoft[.]com | http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/tindex3[.]html
(10 rows)


--

Thanks for any suggestions!

_-_
gar



Re: Picking 25 samples of every domain

From
Benoit Izac
Date:
On 09/10/2014 at 23:25, Gary Warner wrote:

> I have a set of Postgres tables that are related to URLs found in
> email.  The tables are BIG.  40-50 million records per day.  We are
> using them for some research into tricks spammers use to confound
> blacklists.  When we parse the URLs, we pull out the "domain" portion of
> each URL and store it in a field called "top_domain".  The full URL is
> available as "link".
>
> Through various forms of randomization, customization, and wild-carding,
> a domain may have as many as 1 million URLs per day.  I need a
> query that would grab a sample number of URLs per domain (let's say 25
> for conversation) . . . something that in pseudo-code might look like
>
> for each top_domain in urltable do
>     select top_domain, link limit 25;

CREATE TYPE urltable_type AS (top_domain text, link text);

CREATE OR REPLACE FUNCTION urltable_sample(integer)
RETURNS SETOF urltable_type AS $$
DECLARE
    td text;
BEGIN
    FOR td IN SELECT DISTINCT top_domain FROM urltable
    LOOP
        -- No dynamic SQL is needed here: referencing the loop variable
        -- directly avoids the quoting (and SQL-injection) problems of
        -- building the query with string concatenation.
        RETURN QUERY
            SELECT u.top_domain::text, u.link::text
            FROM urltable u
            WHERE u.top_domain = td
            LIMIT $1;
    END LOOP;
    RETURN;
END
$$ LANGUAGE plpgsql;

SELECT top_domain, link FROM urltable_sample(25);
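If you are on 8.4 or later, a single query with a window function may also work, and avoids the per-domain loop entirely. A sketch, assuming the same table and columns ("urltable" standing in for a daily table like spam_info_2014_09_20):

```sql
-- Number the rows within each domain, then keep the first 25 of each.
-- One scan of the table instead of one probe per distinct domain.
SELECT top_domain, link
FROM (SELECT top_domain, link,
             row_number() OVER (PARTITION BY top_domain) AS rn
      FROM urltable) AS numbered
WHERE rn <= 25;
```

Whether this beats the function depends on your indexes: with an index on top_domain, the loop can stop after 25 rows per domain, while the window-function version reads the whole table.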

> Thoughts on the fastest way to do a query like that?

No, but I'm not an expert.

--
Benoit Izac