Thread: Greenplum MapReduce

Greenplum MapReduce

From
Suvankar Roy
Date:

Hi all,

Has anybody worked on Greenplum MapReduce programming?

I am facing a problem while trying to execute the Greenplum MapReduce program below, written in YAML (in blue).

The error is thrown in the 7th line as:
Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7 (in red)

I would be grateful if somebody could explain this error and suggest a potential solution.

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
        - INPUT:
                NAME: doc
                TABLE: documents
        - INPUT:
                NAME: kw
                TABLE: keywords
        - MAP:
                NAME:                 doc_map
                LANGUAGE:         python
                FUNCTION:          |
                        i = 0
                        terms = {}
                        for term in data.lower().split():
                                i = i + 1
                                if term in terms:
                                        terms[term] += ','+str(i)
                                else:
                                        terms[term] = str(i)
                        for term in terms:
                                yield([doc_id, term, terms[term]])          
                OPTIMIZE: STRICT IMMUTABLE
                PARAMETERS:
                        - doc_id integer
                        - data text
                RETURNS:
                        - doc_id integer
                        - term text
                        - positions text        
        - MAP:
                NAME:         kw_map
                LANGUAGE:         python
                FUNCTION:         |
                        i = 0
                        terms = {}
                        for term in keyword.lower().split():
                                i = i + 1
                                if term in terms:
                                        terms[term] += ','+str(i)
                                else:
                                        terms[term] = str(i)
                                yield([keyword_id, i, term, terms[term]])
                OPTIMIZE: STRICT IMMUTABLE
                PARAMETERS:
                        - keyword_id integer
                        - keyword text
                RETURNS:
                        - keyword_id integer
                        - nterms integer
                        - term text
                        - positions text          
        - TASK:
                NAME: doc_prep
                SOURCE: doc
                MAP: doc_map
        - TASK:
                NAME: kw_prep
                SOURCE: kw
                MAP: kw_map          
        - INPUT:
                NAME: term_join
                QUERY: |
                        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
                                 doc.positions as doc_positions,
                                kw.positions as kw_positions
                         FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
        - REDUCE:
                NAME: term_reducer
                TRANSITION: term_transition
                FINALIZE: term_finalizer        
        - TRANSITION:
                NAME: term_transition
                LANGUAGE: python
                PARAMETERS:
                        - state text
                        - term text
                        - nterms integer
                        - doc_positions text
                        - kw_positions text
                FUNCTION: |
                        if state:
                                kw_split = state.split(':')
                        else:
                                kw_split = []
                                for i in range(0,nterms):
                                        kw_split.append('')
                        for kw_p in kw_positions.split(','):
                                kw_split[int(kw_p)-1] = doc_positions          
                        outstate = kw_split[0]
                        for s in kw_split[1:]:
                                outstate = outstate + ':' + s
                        return outstate        
        - FINALIZE:
                NAME: term_finalizer
                LANGUAGE: python
                RETURNS:
                        - count integer
                MODE: MULTI
                FUNCTION: |
                        if not state:
                                return 0
                        kw_split = state.split(':')
                        previous = None
                        for i in range(0,len(kw_split)):
                                isplit = kw_split[i].split(',')
                                if any(map(lambda(x): x == '', isplit)):
                                        return 0
                                adjusted = set(map(lambda(x): int(x)-i, isplit))
                                if (previous):
                                        previous = adjusted.intersection(previous)
                                else:
                                        previous = adjusted
                        if previous:
                                return len(previous)
                        return 0
        - TASK:
                NAME: term_match
                SOURCE: term_join
                REDUCE: term_reducer
        - INPUT:
                NAME: final_output
                QUERY: |
                        SELECT doc.*, kw.*, tm.count
                        FROM documents doc, keywords kw, term_match tm
                        WHERE doc.doc_id = tm.doc_id
                          AND kw.keyword_id = tm.keyword_id
                          AND tm.count > 0
        EXECUTE:
                - RUN:
                        SOURCE: final_output
                        TARGET: STDOUT



Regards,

Suvankar Roy
=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you


Re: Greenplum MapReduce

From
Chris
Date:
Suvankar Roy wrote:
>
> Hi all,
>
> Has anybody worked on Greenplum MapReduce programming ?

It's a commercial product; you need to contact Greenplum.

--
Postgresql & php tutorials
http://www.designmagick.com/


Re: Greenplum MapReduce

From
Richard Huxton
Date:
Suvankar Roy wrote:
> Hi all,
>
> Has anybody worked on Greenplum MapReduce programming ?
>
> I am facing a problem while trying to execute the below Greenplum
> Mapreduce program written in YAML (in blue).

The other poster suggested contacting Greenplum and I can only agree.

> The error is thrown in the 7th line as:
> Error: YAML syntax error - found character that cannot start any token
> while scanning for the next token, at line 7 (in red)

There is no red, particularly if viewing messages as plain text (which
most people do on mailing lists). Consider indicating the line some other
way next time (commonly, you put something like "this is line 7 ^^^^^"
below the line in question).

The most common problem I get with YAML files though is when a tab is
accidentally inserted instead of spaces at the start of a line.
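
One way to check for this is a short script that reports every line whose
leading whitespace contains a tab. This is only a sketch: the helper name
find_tab_indents and the sample text are made up for illustration and are
not part of Greenplum or of any YAML library.

```python
def find_tab_indents(text):
    """Return (line_number, line) pairs whose leading whitespace contains a tab."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace = everything before the first non-space, non-tab character.
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    # Hypothetical sample: the second line is indented with a tab.
    sample = "DEFINE:\n\t- INPUT:\n        NAME: doc\n"
    for number, line in find_tab_indents(sample):
        print("line %d is indented with a tab: %r" % (number, line))
```

To check a real file, read its contents and pass them to the function in
place of the sample string.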

--
   Richard Huxton
   Archonet Ltd

Re: Greenplum MapReduce

From
Richard Huxton
Date:
Suvankar Roy wrote:
> Hi Richard,
>
> I sincerely regret the inconvenience caused.....

No big inconvenience, but the lists can be very busy sometimes and the
easier you make it for people to answer your questions the better the
answers you will get.

> %YAML 1.1
> ---
> VERSION: 1.0.0.1
> DATABASE: test_db1
> USER: gpadmin
> DEFINE:
>         - INPUT: #****** This is the line which is causing the error ******#
>                 NAME: doc
>                 TABLE: documents

If it looks fine, always check for tabs. Oh, and you could have cut out
all the rest of the file, really.

> I have learnt that unnecessary TABs can be the cause of this, so I am trying
> to overcome that; hopefully the problem will subside then.

I'm always getting this. It's easy to accidentally introduce a tab
character when reformatting YAML. It might be worth checking if your
text editor has an option to always replace tabs with spaces.
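
If the editor has no such option, the conversion can also be scripted. The
sketch below expands tabs only in the leading whitespace of each line and
leaves the rest of the line alone. The function name expand_leading_tabs is
made up for illustration, and the tab width of 8 is an assumption chosen to
match the indentation in the posted file; adjust it to whatever your editor
uses.

```python
def expand_leading_tabs(text, tabsize=8):
    """Replace tabs in each line's leading whitespace with spaces."""
    fixed = []
    for line in text.splitlines(keepends=True):
        body = line.lstrip(" \t")
        indent = line[:len(line) - len(body)]
        # str.expandtabs pads each tab out to the next multiple of tabsize.
        fixed.append(indent.expandtabs(tabsize) + body)
    return "".join(fixed)


if __name__ == "__main__":
    # Hypothetical sample with tab-indented lines.
    sample = "DEFINE:\n\t- INPUT:\n\t\tNAME: doc\n"
    print(expand_leading_tabs(sample))
```

Tabs that appear after the first non-blank character are deliberately left
untouched, since they may be part of the data rather than the indentation.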

--
   Richard Huxton
   Archonet Ltd

Re: Greenplum MapReduce

From
Suvankar Roy
Date:

Hi Richard,

I sincerely regret the inconvenience caused.....

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
       - INPUT: #****** This is the line which is causing the error ******#
               NAME: doc
               TABLE: documents
       - INPUT:
               NAME: kw
               TABLE: keywords
       - MAP:
               NAME:           doc_map
               LANGUAGE:       python
               FUNCTION:        |
                       i = 0
                       terms = {}
                       for term in data.lower().split():
                               i = i + 1
                               if term in terms:
                                       terms[term] += ','+str(i)
                               else:
                                       terms[term] = str(i)
                       for term in terms:
                               yield([doc_id, term, terms[term]])
               OPTIMIZE: STRICT IMMUTABLE
               PARAMETERS:
                       - doc_id integer
                       - data text
               RETURNS:
                       - doc_id integer
                       - term text
                       - positions text
       - MAP:
               NAME:   kw_map
               LANGUAGE:       python
               FUNCTION:       |
                       i = 0
                       terms = {}
                       for term in keyword.lower().split():
                               i = i + 1
                               if term in terms:
                                       terms[term] += ','+str(i)
                               else:
                                       terms[term] = str(i)
                               yield([keyword_id, i, term, terms[term]])
               OPTIMIZE: STRICT IMMUTABLE
               PARAMETERS:
                       - keyword_id integer
                       - keyword text
               RETURNS:
                       - keyword_id integer
                       - nterms integer
                       - term text
                       - positions text
       - TASK:
               NAME: doc_prep
               SOURCE: doc
               MAP: doc_map
       - TASK:
               NAME: kw_prep
               SOURCE: kw
               MAP: kw_map
       - INPUT:
               NAME: term_join
               QUERY: |
                       SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
                               doc.positions as doc_positions,
                               kw.positions as kw_positions
                        FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
       - REDUCE:
               NAME: term_reducer
               TRANSITION: term_transition
               FINALIZE: term_finalizer
       - TRANSITION:
               NAME: term_transition
               LANGUAGE: python
               PARAMETERS:
                       - state text
                       - term text
                       - nterms integer
                       - doc_positions text
                       - kw_positions text
               FUNCTION: |
                       if state:
                               kw_split = state.split(':')
                       else:
                               kw_split = []
                               for i in range(0,nterms):
                                       kw_split.append('')
                       for kw_p in kw_positions.split(','):
                               kw_split[int(kw_p)-1] = doc_positions
                       outstate = kw_split[0]
                       for s in kw_split[1:]:
                               outstate = outstate + ':' + s
                       return outstate
       - FINALIZE:
               NAME: term_finalizer
               LANGUAGE: python
               RETURNS:
                       - count integer
               MODE: MULTI
               FUNCTION: |
                       if not state:
                               return 0
                       kw_split = state.split(':')
                       previous = None
                       for i in range(0,len(kw_split)):
                               isplit = kw_split[i].split(',')
                               if any(map(lambda(x): x == '', isplit)):
                                       return 0
                               adjusted = set(map(lambda(x): int(x)-i, isplit))
                               if (previous):
                                       previous = adjusted.intersection(previous)
                               else:
                                       previous = adjusted
                       if previous:
                               return len(previous)
                       return 0
       - TASK:
               NAME: term_match
               SOURCE: term_join
               REDUCE: term_reducer
       - INPUT:
               NAME: final_output
               QUERY: |
                       SELECT doc.*, kw.*, tm.count
                       FROM documents doc, keywords kw, term_match tm
                       WHERE doc.doc_id = tm.doc_id
                         AND kw.keyword_id = tm.keyword_id
                         AND tm.count > 0
       EXECUTE:
               - RUN:
                       SOURCE: final_output
                       TARGET: STDOUT



I have learnt that unnecessary TABs can be the cause of this, so I am trying to overcome that; hopefully the problem will subside then.

Regards,

Suvankar Roy



Richard Huxton <dev@archonet.com>

08/03/2009 02:55 PM

To
Suvankar Roy <suvankar.roy@tcs.com>
cc
pgsql-performance@postgresql.org
Subject
Re: [PERFORM] Greenplum MapReduce








Re: Greenplum MapReduce

From
Suvankar Roy
Date:

Hi Robert,

Thanks very much for your valuable input.

This spaces-versus-tabs problem is killing me; it is pretty cumbersome, to say the least.

Regards,

Suvankar Roy


"Robert Mah" <rmah@pobox.com>
Sent by: Robert Mah <robert.mah@gmail.com>

08/02/2009 10:52 PM

To
"'Suvankar Roy'" <suvankar.roy@tcs.com>, <pgsql-performance@postgresql.org>
cc
Subject
RE: [PERFORM] Greenplum MapReduce





Suvankar:
 
Check your file for spaces vs tabs (YAML does not allow tabs for indentation, and yes, it matters).
 
And as a personal aside, this is yet another reason I hate YAML.
 
Cheers,
Rob

 
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Suvankar Roy
Sent:
Thursday, July 30, 2009 8:25 AM
To:
pgsql-performance@postgresql.org
Subject:
[PERFORM] Greenplum MapReduce

 

