Picking 25 samples of every domain - Mailing list pgsql-novice

From Gary Warner
Subject Picking 25 samples of every domain
Date
Msg-id 5436FD55.2010806@askgar.com
Whole thread Raw
Responses Re: Picking 25 samples of every domain  (Benoit Izac <benoit@izac.org>)
List pgsql-novice
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I have a set of Postgres tables that are related to URLs found in
email.  The tables are BIG.  40-50 million records per day.  We are
using them for some research into tricks spammers use to confound
blacklists.  When we parse the URLs, we pull out the "domain" portion of
each URL and store it in a field called "top_domain".  The full URL is
available as "link".

Through various forms of randomization, customization, and wild-carding,
a domain may have as many as 1 million URLs per day.  I am needing a
query that would grab a sample number of URLs per domain (let's say 25
for conversation) . . . something that in pseudo-code might look like

for each top_domain in urltable do
    select top_domain, link limit 25;

Thoughts on the fastest way to do a query like that?

Real examples (to sort of show the point . . .)


spam_urls=# select top_domain, link from spam_info_2014_09_20 where
top_domain =  'pacedoctor[.]ru' limit 10;  (To prevent risk, I'm
changing "." to "[.]"


  top_domain   |               link
- ---------------+-----------------------------------
 pacedoctor[.]ru | http://kftdouhixn[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://ozukalaj[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://knlssrmp[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://aaztcobtwztx[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://kqyarrarn[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://zzj[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://piqtj[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://wahednc[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://fjhmin[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://wevtwo[.]pacedoctor[.]ru
(10 rows)

 top_domain |
link
-
------------+-----------------------------------------------------------------------
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex8[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex7[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex6[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex5[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex4[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex3[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex2[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex1[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-838-106207395/vonmill/u[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-838-106207395/vonmill/rindex14[.]html
(10 rows)

  top_domain  |
link
-
--------------+---------------------------------------------------------------------
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/u[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/rindex5[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex4[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex3[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex2[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex1[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/u[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/rindex5[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/tindex4[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/tindex3[.]html
(10 rows)






- --

Thanks for any suggestions!

_-_
gar
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlQ2/VUACgkQg79eYCOO6PtTiwCbBomLVmWST81FEI3eQX0g5HrI
C8EAnjoXM6gb4+sn4DROXd/3IvFRoBoV
=50PP
-----END PGP SIGNATURE-----



pgsql-novice by date:

Previous
From: Jude DaShiell
Date:
Subject: Re: newbee, about a bulk loading from a cdv file
Next
From: Benoit Izac
Date:
Subject: Re: Picking 25 samples of every domain