Thread: Greenplum MapReduce

Greenplum MapReduce

From

Suvankar Roy

Date:

02 August 2009, 17:17:07

Hi all,

Has anybody worked on Greenplum MapReduce programming ?

I am facing a problem while trying to execute the below Greenplum Mapreduce program written in YAML (in blue).

The error is thrown in the 7th line as:
Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7 (in red)

If somebody can explain this and the potential solution

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
- INPUT:
NAME: doc
TABLE: documents
- INPUT:
NAME: kw
TABLE: keywords
- MAP:
NAME: doc_map
LANGUAGE: python
FUNCTION: |
i = 0
terms = {}
for term in data.lower().split():
i = i + 1
if term in terms:
terms[term] += ','+str(i)
else:
terms[term] = str(i)
for term in terms:
yield([doc_id, term, terms[term]])
OPTIMIZE: STRICT IMMUTABLE
PARAMETERS:
- doc_id integer
- data text
RETURNS:
- doc_id integer
- term text
- positions text
- MAP:
NAME: kw_map
LANGUAGE: python
FUNCTION: |
i = 0
terms = {}
for term in keyword.lower().split():
i = i + 1
if term in terms:
terms[term] += ','+str(i)
else:
terms[term] = str(i)
yield([keyword_id, i, term, terms[term]])
OPTIMIZE: STRICT IMMUTABLE
PARAMETERS:
- keyword_id integer
- keyword text
RETURNS:
- keyword_id integer
- nterms integer
- term text
- positions text
- TASK:
NAME: doc_prep
SOURCE: doc
MAP: doc_map
- TASK:
NAME: kw_prep
SOURCE: kw
MAP: kw_map
- INPUT:
NAME: term_join
QUERY: |
SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
doc.positions as doc_positions,
kw.positions as kw_positions
FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
- REDUCE:
NAME: term_reducer
TRANSITION: term_transition
FINALIZE: term_finalizer
- TRANSITION:
NAME: term_transition
LANGUAGE: python
PARAMETERS:
- state text
- term text
- nterms integer
- doc_positions text
- kw_positions text
FUNCTION: |
if state:
kw_split = state.split(':')
else:
kw_split = []
for i in range(0,nterms):
kw_split.append('')
for kw_p in kw_positions.split(','):
kw_split[int(kw_p)-1] = doc_positions
outstate = kw_split[0]
for s in kw_split[1:]:
outstate = outstate + ':' + s
return outstate
- FINALIZE:
NAME: term_finalizer
LANGUAGE: python
RETURNS:
- count integer
MODE: MULTI
FUNCTION: |
if not state:
return 0
kw_split = state.split(':')
previous = None
for i in range(0,len(kw_split)):
isplit = kw_split[i].split(',')
if any(map(lambda(x): x == '', isplit)):
return 0
adjusted = set(map(lambda(x): int(x)-i, isplit))
if (previous):
previous = adjusted.intersection(previous)
else:
previous = adjusted
if previous:
return len(previous)
return 0
- TASK:
NAME: term_match
SOURCE: term_join
REDUCE: term_reducer
- INPUT:
NAME: final_output
QUERY: |
SELECT doc.*, kw.*, tm.count
FROM documents doc, keywords kw, term_match tm
WHERE doc.doc_id = tm.doc_id
AND kw.keyword_id = tm.keyword_id
AND tm.count > 0
EXECUTE:
- RUN:
SOURCE: final_output
TARGET: STDOUT

Regards,

Suvankar Roy

=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you

Re: Greenplum MapReduce

From

Chris

Date:

03 August 2009, 02:05:18

Suvankar Roy wrote:
>
> Hi all,
>
> Has anybody worked on Greenplum MapReduce programming ?

It's a commercial product, you need to contact greenplum.

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Greenplum MapReduce

From

Richard Huxton

Date:

03 August 2009, 09:25:43

Suvankar Roy wrote:
> Hi all,
>
> Has anybody worked on Greenplum MapReduce programming ?
>
> I am facing a problem while trying to execute the below Greenplum
> Mapreduce program written in YAML (in blue).

The other poster suggested contacting Greenplum and I can only agree.

> The error is thrown in the 7th line as:
> Error: YAML syntax error - found character that cannot start any token
> while scanning for the next token, at line 7 (in red)

There is no red, particularly if viewing messages as plain text (which
most people do on mailing lists). Consider indicating a line some other
way next time (commonly below the line you put something like "this is
line 7 ^^^^^")

The most common problem I get with YAML files though is when a tab is
accidentally inserted instead of spaces at the start of a line.

--
   Richard Huxton
   Archonet Ltd

Re: Greenplum MapReduce

From

Richard Huxton

Date:

03 August 2009, 09:50:02

Suvankar Roy wrote:
> Hi Richard,
>
> I sincerely regret the inconvenience caused.....

No big inconvenience, but the lists can be very busy sometimes and the
easier you make it for people to answer your questions the better the
answers you will get.

> %YAML 1.1
> ---
> VERSION: 1.0.0.1
> DATABASE: test_db1
> USER: gpadmin
> DEFINE:
>         - INPUT: #****** This the line which is causing the error ******#
 >                 NAME: doc
 >                 TABLE: documents

If it looks fine, always check for tabs. Oh, and you could have cut out
all the rest of the file, really.

> I have learnt that unnecessary TABs can the cause of this, so trying to
> overcome that, hopefully the problem will subside then....

I'm always getting this. It's easy to accidentally introduce a tab
character when reformatting YAML. It might be worth checking if your
text editor has an option to always replace tabs with spaces.

--
   Richard Huxton
   Archonet Ltd

Re: Greenplum MapReduce

From

Suvankar Roy

Date:

03 August 2009, 21:49:36

Hi Richard,

I sincerely regret the inconvenience caused.....

%YAML 1.1 --- VERSION: 1.0.0.1 DATABASE: test_db1 USER: gpadmin DEFINE: - INPUT: #****** This the line which is causing the error ******# NAME: doc TABLE: documents - INPUT: NAME: kw TABLE: keywords - MAP: NAME: doc_map LANGUAGE: python FUNCTION: | i = 0 terms = {} for term in data.lower().split(): i = i + 1 if term in terms: terms[term] += ','+str(i) else: terms[term] = str(i) for term in terms: yield([doc_id, term, terms[term]]) OPTIMIZE: STRICT IMMUTABLE PARAMETERS: - doc_id integer - data text RETURNS: - doc_id integer - term text - positions text - MAP: NAME: kw_map LANGUAGE: python FUNCTION: | i = 0 terms = {} for term in keyword.lower().split(): i = i + 1 if term in terms: terms[term] += ','+str(i) else: terms[term] = str(i) yield([keyword_id, i, term, terms[term]]) OPTIMIZE: STRICT IMMUTABLE PARAMETERS: - keyword_id integer - keyword text RETURNS: - keyword_id integer - nterms integer - term text - positions text - TASK: NAME: doc_prep SOURCE: doc MAP: doc_map - TASK: NAME: kw_prep SOURCE: kw MAP: kw_map - INPUT: NAME: term_join QUERY: | SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms, doc.positions as doc_positions, kw.positions as kw_positions FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term) - REDUCE: NAME: term_reducer TRANSITION: term_transition FINALIZE: term_finalizer - TRANSITION: NAME: term_transition LANGUAGE: python PARAMETERS: - state text - term text - nterms integer - doc_positions text - kw_positions text FUNCTION: | if state: kw_split = state.split(':') else: kw_split = [] for i in range(0,nterms): kw_split.append('') for kw_p in kw_positions.split(','): kw_split[int(kw_p)-1] = doc_positions outstate = kw_split[0] for s in kw_split[1:]: outstate = outstate + ':' + s return outstate - FINALIZE: NAME: term_finalizer LANGUAGE: python RETURNS: - count integer MODE: MULTI FUNCTION: | if not state: return 0 kw_split = state.split(':') previous = None for i in range(0,len(kw_split)): isplit = kw_split[i].split(',') if any(map(lambda(x): x == '', isplit)): return 0 adjusted = set(map(lambda(x): int(x)-i, isplit)) if (previous): previous = adjusted.intersection(previous) else: previous = adjusted if previous: return len(previous) return 0 - TASK: NAME: term_match SOURCE: term_join REDUCE: term_reducer - INPUT: NAME: final_output QUERY: | SELECT doc.*, kw.*, tm.count FROM documents doc, keywords kw, term_match tm WHERE doc.doc_id = tm.doc_id AND kw.keyword_id = tm.keyword_id AND tm.count > 0 EXECUTE: - RUN: SOURCE: final_output TARGET: STDOUT

I have learnt that unnecessary TABs can the cause of this, so trying to overcome that, hopefully the problem will subside then....

Regards,

Suvankar Roy

Richard Huxton <dev@archonet.com>

08/03/2009 02:55 PM

To	Suvankar Roy <suvankar.roy@tcs.com>
cc	pgsql-performance@postgresql.org
Subject	Re: [PERFORM] Greenplum MapReduce

Suvankar Roy wrote: > Hi all, > > Has anybody worked on Greenplum MapReduce programming ? > > I am facing a problem while trying to execute the below Greenplum > Mapreduce program written in YAML (in blue). The other poster suggested contacting Greenplum and I can only agree. > The error is thrown in the 7th line as: > Error: YAML syntax error - found character that cannot start any token > while scanning for the next token, at line 7 (in red) There is no red, particularly if viewing messages as plain text (which most people do on mailing lists). Consider indicating a line some other way next time (commonly below the line you put something like "this is line 7 ^^^^^") The most common problem I get with YAML files though is when a tab is accidentally inserted instead of spaces at the start of a line. -- Richard Huxton Archonet Ltd
ForwardSourceID:NT000058E2

=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you

Re: Greenplum MapReduce

From

Suvankar Roy

Date:

03 August 2009, 21:50:01

Hi Robert,

Thanks much for your valuable inputs....

This spaces and tabs problem is killing me in a way, it is pretty cumbersome to say the least....

Regards,

Suvankar Roy

"Robert Mah" <rmah@pobox.com>
Sent by: Robert Mah <robert.mah@gmail.com>

08/02/2009 10:52 PM

To	"'Suvankar Roy'" <suvankar.roy@tcs.com>, <pgsql-performance@postgresql.org>
cc
Subject	RE: [PERFORM] Greenplum MapReduce

Suvankar:

Check your file for spaces vs tabs (one of them is bad and yes, it matters).

And as an personal aside, this is yet another reason I hate YAML.

Cheers,
Rob

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Suvankar Roy
Sent: Thursday, July 30, 2009 8:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Greenplum MapReduce

Hi all,

Has anybody worked on Greenplum MapReduce programming ?

I am facing a problem while trying to execute the below Greenplum Mapreduce program written in YAML (in blue).

The error is thrown in the 7th line as:
Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7 (in red)

If somebody can explain this and the potential solution

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
- INPUT:
NAME: doc
TABLE: documents
- INPUT:
NAME: kw
TABLE: keywords
- MAP:
NAME: doc_map
LANGUAGE: python
FUNCTION: |
i = 0
terms = {}
for term in data.lower().split():
i = i + 1
if term in terms:
terms[term] += ','+str(i)
else:
terms[term] = str(i)
for term in terms:
yield([doc_id, term, terms[term]])
OPTIMIZE: STRICT IMMUTABLE
PARAMETERS:
- doc_id integer
- data text
RETURNS:
- doc_id integer
- term text
- positions text
- MAP:
NAME: kw_map
LANGUAGE: python
FUNCTION: |
i = 0
terms = {}
for term in keyword.lower().split():
i = i + 1
if term in terms:
terms[term] += ','+str(i)
else:
terms[term] = str(i)
yield([keyword_id, i, term, terms[term]])
OPTIMIZE: STRICT IMMUTABLE
PARAMETERS:
- keyword_id integer
- keyword text
RETURNS:
- keyword_id integer
- nterms integer
- term text
- positions text
- TASK:
NAME: doc_prep
SOURCE: doc
MAP: doc_map
- TASK:
NAME: kw_prep
SOURCE: kw
MAP: kw_map
- INPUT:
NAME: term_join
QUERY: |
SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
doc.positions as doc_positions,
kw.positions as kw_positions
FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
- REDUCE:
NAME: term_reducer
TRANSITION: term_transition
FINALIZE: term_finalizer
- TRANSITION:
NAME: term_transition
LANGUAGE: python
PARAMETERS:
- state text
- term text
- nterms integer
- doc_positions text
- kw_positions text
FUNCTION: |
if state:
kw_split = state.split(':')
else:
kw_split = []
for i in range(0,nterms):
kw_split.append('')
for kw_p in kw_positions.split(','):
kw_split[int(kw_p)-1] = doc_positions
outstate = kw_split[0]
for s in kw_split[1:]:
outstate = outstate + ':' + s
return outstate
- FINALIZE:
NAME: term_finalizer
LANGUAGE: python
RETURNS:
- count integer
MODE: MULTI
FUNCTION: |
if not state:
return 0
kw_split = state.split(':')
previous = None
for i in range(0,len(kw_split)):
isplit = kw_split[i].split(',')
if any(map(lambda(x): x == '', isplit)):
return 0
adjusted = set(map(lambda(x): int(x)-i, isplit))
if (previous):
previous = adjusted.intersection(previous)
else:
previous = adjusted
if previous:
return len(previous)
return 0
- TASK:
NAME: term_match
SOURCE: term_join
REDUCE: term_reducer
- INPUT:
NAME: final_output
QUERY: |
SELECT doc.*, kw.*, tm.count
FROM documents doc, keywords kw, term_match tm
WHERE doc.doc_id = tm.doc_id
AND kw.keyword_id = tm.keyword_id
AND tm.count > 0
EXECUTE:
- RUN:
SOURCE: final_output
TARGET: STDOUT

Regards,

Suvankar Roy
=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain
confidential or privileged information. If you are
not the intended recipient, any dissemination, use,
review, distribution, printing or copying of the
information contained in this e-mail message
and/or attachments to it are strictly prohibited. If
you have received this communication in error,
please notify us by reply e-mail or telephone and
immediately and permanently delete the message
and any attachments. Thank you

ForwardSourceID:NT000058B6

=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you