Thread: Java/JDBC/PGSQL Mailing List Archiver

Java/JDBC/PGSQL Mailing List Archiver

From
Tim Perdue
Date:

Sounds interesting. As a C++ programmer, you should have no trouble at all with Java.

I highly recommend IBM VisualAge for Java. It checks all syntax and forces you to be correct. And only $100.

I found an NNTP Java class, but I don't know if it's any good or not. I guess Sun has a sun.net.nntp package out there
too. (I went to Excite and did a search for java nntp.)

I'm using JDBC to drop everything into postgres and it's really pretty slick.

My current table structure looks something like this:


fld_mailing_list text, /* use int and join to other table??? */

fld_date Char(14), /* 19990101010101 I'm afraid I don't understand Postgres's implementation of timestamps*/

fld_subject text,
fld_is_followup int, /* for threading purposes */
fld_from text,
fld_body text,


I didn't put in a unique key because I never plan on having to update the message (it's static). This might be a
mistake.
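
For readers following along, the Char(14) layout in fld_date (19990101010101) can be converted to and from a real date value on the Java side. This is only a minimal sketch, assuming the string is always year, month, day, hour, minute, second with no time zone; the class name Char14Stamp and its methods are hypothetical and are not part of the archiver code.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class Char14Stamp {
    // Matches the 14-character layout used in fld_date, e.g. 19990101010101.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Turn the stored string into a date-time value (no time zone attached).
    static LocalDateTime parse(String fldDate) {
        return LocalDateTime.parse(fldDate, FORMAT);
    }

    // Turn a date-time value back into the 14-character string.
    static String format(LocalDateTime when) {
        return when.format(FORMAT);
    }

    public static void main(String[] args) {
        LocalDateTime when = parse("19990101010101");
        System.out.println(when + " <-> " + format(when));
    }
}
```

Because the layout is fixed-width with the most significant field first, plain string comparison on fld_date already sorts messages chronologically, which may be one reason the Char(14) approach works well enough here.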

Let me clean up what I have tonight and I'll send it your way to play around with tomorrow or Saturday. I have attached
the nttp.java class that I found and you can take a gander at that.

I'm not a pgsql genius, so you may have good advice on the datestamps, lobs, etc. Your idea of using multiple records
with "chunk number" is really interesting. I'm just concerned about the complexity of that (I was trying to slap this
together in 1-2 days). My plan at this point is to truncate messages over 8K.

Keep in touch.

Tim



---Peter Garner <peter_garner@yahoo.com> wrote:
>
> Hi Tim,
>
>   I might be interested in
> collaborating as I am presently writing
> an offline reader that gets NNTP news
> and puts it in postgres. (I have already
> done this in C++ but I want to port it
> to Java.)  If nothing else, you are
> welcome to the C++ code.  Word of
> warning, I have been doing C++ for 11
> years but this is my first attempt at
> Java! :-)  Another word of warning, :-)
> I am an itinerant consultant and I tend
> to get called off to work in the middle
> of personal projects like this! :-)
>
> Right now my C++ program uses LOBs.  I
> think Herouz is right, however.  She
> suggested splitting >8K fields into
> multiple text fields.  E.g. you would
> have a table (for nntp news) like :
>
> create table MsgBodies
> (
>   MsgId        Text  ,
>   ChunkNumber  int   ,
>   MsgBody      Text  ,
>
>   primary key (MsgId , ChunkNumber)
> ) ;
>
> The majority of nntp msgs would just
> have one entry, but a few would be split
> into two or more entries.
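
The splitting step behind the MsgBodies table can be sketched in a few lines of Java. This is an illustration only: the class name BodyChunker and the 8000-character budget are assumptions, and it counts characters rather than bytes, so a multi-byte encoding would need a byte-aware split. Each returned element would become one row, with its position in the list (starting from whatever base is chosen) as ChunkNumber.

```java
import java.util.ArrayList;
import java.util.List;

final class BodyChunker {
    // Assumed per-chunk budget in characters, chosen to stay under the 8K limit
    // mentioned in the thread. Not a value taken from the original code.
    static final int MAX_CHUNK = 8000;

    // Split a message body into consecutive pieces of at most maxChunk characters.
    static List<String> split(String body, int maxChunk) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < body.length(); start += maxChunk) {
            int end = Math.min(body.length(), start + maxChunk);
            chunks.add(body.substring(start, end));
        }
        if (chunks.isEmpty()) {
            chunks.add(""); // an empty body still gets one row
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> chunks = split("x".repeat(20000), MAX_CHUNK);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println("chunk " + i + ": " + chunks.get(i).length() + " chars");
        }
    }
}
```

Concatenating the chunks in ChunkNumber order gives back the original body, so nothing has to be truncated.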

==
______________________________________________________
Directricity.com - Get local!
http://directricity.com/

_________________________________________________________
DO YOU YAHOO!?
Get your free @yahoo.com address at http://mail.yahoo.com
/***************

NNTP Java Class

by Charles Bloom

quite nice

***************/

import java.io.*;
import java.net.*;
import java.lang.*;

class ArticleHeader
    {
    int number;
    String subject;
    String author;
    String date;
    int bytes,lines;

    int seqNum,seqTot;
    public void getSeq() /* from subject */
        {
        int slashI,fdelimI,ldelimI,a,b;
        if ( (slashI = subject.lastIndexOf('/')) == -1 )
            { seqNum=seqTot=0; return; }

        a = subject.lastIndexOf('(' /*)*/,slashI);
        b = subject.lastIndexOf('[',slashI);
        if ( a > b ) fdelimI = a; else fdelimI = b;
        if ( fdelimI == -1 ) { seqNum=seqTot=0; return; }

        a = subject.indexOf(/*(*/ ')',slashI);
        b = subject.indexOf(']',slashI);
        if ( a == -1 ) a = 999;    if ( b == -1 ) b = 999;
        if ( a < b ) ldelimI = a;    else ldelimI = b;
        if ( ldelimI == 999 )    { seqNum=seqTot=0; return; }

        try
            {
            seqNum = clib.atoi(subject.substring(fdelimI+1,slashI));
            seqTot = clib.atoi(subject.substring(slashI+1,ldelimI));
            }
        catch( NumberFormatException e ) { seqNum=seqTot=0; return; }

        }

    public String fileName() /* from subject */
        {
        int dotidx,fspace,lspace;
        /* Strings are immutable: replace() returns a new String, so keep the result */
        subject = subject.replace('\t',' ');
        if ( (dotidx = subject.lastIndexOf('.')) == -1 ) { dotidx = 0; fspace = 0; }
        else { fspace = subject.lastIndexOf(' ',dotidx); fspace++; }
        if ( (lspace = subject.indexOf(' ',dotidx)) == -1 ) lspace = subject.length();
        return ( subject.substring(fspace,lspace) );
        };
    };

class NNTP_Client
    {
    public SimpleClientConnection net;
    public long group_low=0,group_high=0; /* only valid right after goGroup */

    boolean nntpConnected;
    DataInputStream nntp_in;
    PrintStream nntp_out;
    PrintStream log_out;
    String host,group_name;

    public boolean reset() throws IOException
        {
        ArticleHeader none = new ArticleHeader();
        none.number = 0;
        return ( reset(none) );
        };
    public boolean reset(ArticleHeader last) throws IOException
        {
        disconnect();
        NNTP_Connect();
        if ( group_name != null )
            {
            if ( ! goGroup(group_name) ) return(false);
            if ( ! goArticle(last.number) ) return(false);
            }
        return(true);
        };


    /* these progress indicators should be in a separate module
            so that they could be swapped out for GUI ones */

    void progressUpdate(int cur,int tot)
        {
        System.err.print("nntp   : " + cur + " / " + tot + "\r");
        System.err.flush();
        }
    void progressDone() { System.err.println(); }

    public byte[] getBody(ArticleHeader header) throws IOException
        {
        byte body[];
        String instr;

        nntp_out.println("body");
        instr = nntp_in.readLine();
        log_out.println(instr);
        if ( instr.indexOf("body") == -1 )
            throw new IOException("nntp.getBody:no article body");

        {
        /* grow as needed: a line can be longer than 80 bytes, so a
           fixed header.lines*80 buffer could overflow */
        ByteArrayOutputStream tbody = new ByteArrayOutputStream();
        int lines=0,c;

        progressUpdate(lines,header.lines);

        while ( lines < header.lines )
            {
            if ( (c = nntp_in.read()) == -1 ) break;
            tbody.write(c);
            if ( c == '\n' )
                {
                lines++;
                if ( lines % 50 == 0 )
                    {
                    progressUpdate(lines,header.lines);
                    System.err.flush();
                    }
                }
            }

        progressUpdate(lines,header.lines);
        progressDone();

        body = tbody.toByteArray();
        }

        log_out.println( nntp_in.readLine() ); /* '.' line */

        return( body );
        };

    public ArticleHeader getHeader() throws IOException
        {
        boolean goterror;
        String instr,curstr;
        ArticleHeader header = new ArticleHeader();
        int next_pos;

        do
            {
            goterror = false;

            nntp_out.println("xover");
            instr = nntp_in.readLine();
            log_out.println(instr);
            if ( instr.indexOf("data follows") == -1 )
                throw new IOException("nntp.getHeader:no xover data");

            instr = nntp_in.readLine();
            log_out.println(instr);

            /* now process instr and fill in header */

            /* article number */
            next_pos = instr.indexOf('\t');
            curstr = instr.substring(0,next_pos);    instr  = instr.substring(next_pos + 1);
            try { header.number = clib.atoi(curstr); }
            catch ( NumberFormatException e ) { goterror = true; }

            /* subject */
            next_pos = instr.indexOf( '\t');
            curstr = instr.substring(0,next_pos);    instr  = instr.substring(next_pos + 1);
            header.subject = curstr;

            /* author */
            next_pos = instr.indexOf( '\t');
            curstr = instr.substring(0,next_pos);    instr  = instr.substring(next_pos + 1);
            header.author = curstr;

            /* date */
            next_pos = instr.indexOf( '\t');
            curstr = instr.substring(0,next_pos);    instr  = instr.substring(next_pos + 1);
            header.date = curstr;

            /* message-id */
            next_pos = instr.indexOf('\t');
            curstr = instr.substring(0,next_pos);    instr  = instr.substring(next_pos + 1);
            // ignore

            /* references */
            next_pos = instr.indexOf( '\t');
            curstr = instr.substring(0,next_pos);    instr  = instr.substring(next_pos + 1);
            // ignore

            /* bytes */
            next_pos = instr.indexOf( '\t');
            curstr = instr.substring(0,next_pos);    instr  = instr.substring(next_pos + 1);
            try { header.bytes = clib.atoi(curstr); }
            catch ( NumberFormatException e ) { goterror = true; }

            /* lines (last one) */
            next_pos = instr.indexOf( '\t');
            curstr = instr.substring(0,next_pos);    instr  = instr.substring(next_pos + 1);
            try { header.lines = clib.atoi(curstr); }
            catch ( NumberFormatException e ) { goterror = true; }

            log_out.println( nntp_in.readLine() ); /* a line of just "." */

            }    while( goterror );

        return(header);
        };

    public boolean goArticle(int number)
        {
        String retStr;
        nntp_out.println("stat " + number);
        try {    log_out.println( retStr = nntp_in.readLine() ); }
        catch ( IOException e ) { return(false); }
        if ( retStr.indexOf("Bad") != -1 ) return(false);
        return(true);
        };

    public boolean goGroup(String in_group_name)
        {
        String reply;
        String[] replyToks;
        group_low = group_high = 0;
        group_name = in_group_name;
        nntp_out.println("group " + group_name);
        try {    log_out.println( reply = nntp_in.readLine() ); }
        catch ( IOException e ) { return(false); }
        if ( reply.indexOf("No such group") != -1 )
            {
            log_out.println("NNTP_Client:Got:No such group");
            return(false);
            }
        replyToks = clib.stringSpaceTok(reply);
        if ( replyToks.length < 5 )
            {
            log_out.println("NNTP_Client:Got: less than 5 tokens in group header");
            return(false);
            }
        group_low  = clib.atol(replyToks[2]);
        group_high = clib.atol(replyToks[3]);
        return(true);
        };

    /* next : returns false when no more messages */
    public boolean next() throws IOException
        {
        String statstr;
        nntp_out.println("next");
        statstr = nntp_in.readLine();
        log_out.println(statstr);
        if ( statstr.indexOf("retrieved") == -1 ) return(false);
        return(true);
        };

    public boolean isConnected()
        {
        return( nntpConnected );
        };

    public void disconnect()
        {
        if ( nntpConnected ) nntp_out.println("quit");
        nntpConnected = false;
        net.disconnect();
        };

    public NNTP_Client(String in_host,PrintStream in_log_out) throws IOException
        {

        host = in_host;
        log_out = in_log_out;

        group_name = null;

        nntpConnected = false;

        NNTP_Connect();
        }

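    public void NNTP_Connect() throws IOException
        {
        net = new SimpleClientConnection(host, 119); /* 119 is NNTP */

        nntpConnected = net.isConnected();
        if ( nntpConnected )
            {
            nntp_in  = net.inputStream();
            nntp_out = net.outputStream();
            }
        else
            {
            nntp_in = null; nntp_out= null;
            /* fail here rather than with a NullPointerException in the read loop below */
            throw new IOException("nntp.NNTP_Connect:could not connect to " + host);
            }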

        /* init */
            {
            String incoming = null;
            log_out.println("Waiting for 'ready'");
            do
                {
                incoming = nntp_in.readLine();
                log_out.println(incoming);
                } while ( incoming.indexOf("ready") == -1 );
            }

        };
    };
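
The xover parsing in getHeader() peels off one tab-separated field at a time. A more compact way to do the same thing is sketched below, assuming the field order the class already relies on (article number, subject, author, date, message-id, references, bytes, lines). The class name XoverLine is hypothetical and is not part of Charles Bloom's code. Unlike the substring loop, it does not need a trailing tab after the last field.

```java
final class XoverLine {
    final int number;
    final String subject;
    final String author;
    final String date;
    final int bytes;
    final int lines;

    private XoverLine(int number, String subject, String author,
                      String date, int bytes, int lines) {
        this.number = number;
        this.subject = subject;
        this.author = author;
        this.date = date;
        this.bytes = bytes;
        this.lines = lines;
    }

    // Parse one line of xover output. Throws IllegalArgumentException
    // (NumberFormatException is a subclass) when the line is malformed.
    static XoverLine parse(String line) {
        // limit -1 keeps empty fields, e.g. an empty references column
        String[] f = line.split("\t", -1);
        if (f.length < 8) {
            throw new IllegalArgumentException(
                    "expected at least 8 tab-separated fields, got " + f.length);
        }
        return new XoverLine(
                Integer.parseInt(f[0].trim()),
                f[1], f[2], f[3],
                // f[4] is the message-id and f[5] the references; ignored here too
                Integer.parseInt(f[6].trim()),
                Integer.parseInt(f[7].trim()));
    }

    public static void main(String[] args) {
        XoverLine h = parse("42\tRe: test (1/3)\tTim <tim@example.com>\t"
                + "Fri, 22 Jan 1999 02:05:00 +0200\t<id@example.com>\t\t1234\t17");
        System.out.println(h.number + " " + h.subject + " " + h.lines);
    }
}
```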


Re: Java/JDBC/PGSQL Mailing List Archiver

From
Peter Garner
Date:
Hi Tim! :-)

> Sounds interesting. As a C++ programmer, you should
>have no trouble at all with Java.

Actually the trouble I am having is that they are so
close that I am having trouble with subtle differences,
hehehe.


>I highly recommend IBM VisualAge for Java. It checks
>all syntax and forces you to be correct. And only
>$100.

I have never been a big fan of IDEs.  I use Visual
Slick Edit for Linux.  Also IBM drug tests their
employees so I try to boycott them! :-)  Although I
have been told they no longer do this.  Does VA run
under Linux?


>I'm not a pgsql genius, so you may have good advice on
>the datestamps, lobs, etc. Your idea of using multiple
>records with "chunk number" is really interesting. I'm

What is the trouble with pgsql dates?  BTW it was
Herouth's idea, not mine!  (Hey Herouth, sorry about
misspelling your name in the previous message! ;-)

Thanks for the java classes!  What is the licensing
on that source code?  Can we LGPL it?
==
Peace,
Peter

We are Microsoft of Borg, you will be assimilated!!!
Resistance is fut...  ***BZZZRT***  THUD!!!
[General Protection Fault in MSBorg32.DLL]
Please contact the vendor for more information


Re: [SQL] Java/JDBC/PGSQL Mailing List Archiver

From
Fabrice Scemama
Date:
This could be the beginning of a very nice GPL Project.
I'd personally advocate using Perl to do this, since we
have all the necessary modules, all of which are easy to use:
Net::NNTP
DBI
DBD::Pg
...
everything related to Emails and MIME and MD5
etc.

I'd rather not split text messages in 8k parts!
That's what BLOBS were designed for, and what would you
do with attachments? Just store them as
MIME-encoded text? That would mean having to decode
them again and again, whereas storing them both as
MIME-encoded text *and* as binary files would improve server
performance while only wasting a few megs.

We might consider setting up a small mailing-list to talk
about such a project.

Fabrice

French Philosophical Forums
http://www.gesnet.net/philo/



Re: [SQL] Java/JDBC/PGSQL Mailing List Archiver

From
Herouth Maoz
Date:
At 2:05 +0200 on 22/1/99, Fabrice Scemama wrote:


> I'd rather not split text messages in 8k parts!
> That's what BLOBS were designed for

Just to give you a few points against BLOBs:

(1) They are not dumped with pg_dump, and you have to design your own
    backup procedure for them, which you will be able to integrate later
    with a dumped database.

(2) You can't display them in a casual psql query. I usually use psql
    when I want to check something in my database. Storing the text in
    split records would allow you to view the content in psql.

(3) Sometimes one needs to delete some data which was entered incorrectly
    or as a result of a buggy implementation. If you want to delete a
    large object you have to write a program. If you want to delete a
    bunch of split text records, you only have to enter psql and issue
    a DELETE statement in SQL.

(4) You can't search on large objects. Not even an unindexed search.
    You have to retrieve every large object, and perform the search in
    the front end. If you had it in split text records, you would be
    able to search with LIKE or regexp.
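
Point (4) can be illustrated on the Java side. This is a minimal sketch, assuming chunks keyed by ChunkNumber as in the MsgBodies table proposed earlier in the thread; the class ChunkedBody and its methods are hypothetical. It also shows one caveat: a phrase that straddles a chunk boundary is not found by matching each row on its own, so a search over the reassembled text may still be needed for exact results.

```java
import java.util.TreeMap;

final class ChunkedBody {
    // ChunkNumber -> MsgBody text, kept sorted by chunk number.
    private final TreeMap<Integer, String> chunks = new TreeMap<>();

    void put(int chunkNumber, String text) {
        chunks.put(chunkNumber, text);
    }

    // Concatenate the chunks in ChunkNumber order.
    String reassemble() {
        StringBuilder sb = new StringBuilder();
        for (String text : chunks.values()) {
            sb.append(text);
        }
        return sb.toString();
    }

    // Roughly what a per-row substring match does: each chunk is tested alone.
    boolean anyChunkContains(String needle) {
        for (String text : chunks.values()) {
            if (text.contains(needle)) {
                return true;
            }
        }
        return false;
    }

    // Search the whole message, including text that spans two chunks.
    boolean bodyContains(String needle) {
        return reassemble().contains(needle);
    }

    public static void main(String[] args) {
        ChunkedBody body = new ChunkedBody();
        body.put(2, "gres rocks");
        body.put(1, "Post");
        System.out.println(body.anyChunkContains("Postgres"));
        System.out.println(body.bodyContains("Postgres"));
    }
}
```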

Attachments are a different story. All of the above applies when we are
dealing with text processing. If the content is something other than text,
you don't need to search it, there is no point in viewing it in psql
because you need a proper viewer anyway, etc.

So if you want to treat your attachments as binary objects, binary objects
they should be.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma



Re: [SQL] Java/JDBC/PGSQL Mailing List Archiver

From
Fabrice Scemama
Date:
I broadly agree with your points. Thanks for taking the time
to write them up for us.

Points 1, 2, and 3 are not essential, since we can still
back up the BLOBs, delete them, etc. with a little programming,
but point 4 is very important.

So, as you put it, as far as Mailing List Archives are concerned,
bodies should be split, and attachments BLOBed.

Chavouah Tov!

Fabrice Scemama
Internet Developer too, but in France ;)
