com.cloudera.crunch
Interface Pipeline

All Known Implementing Classes:
MemPipeline, MRPipeline

public interface Pipeline

Manages the state of a pipeline execution.


Method Summary
 void done()
          Runs any remaining jobs required to generate outputs, then cleans up any intermediate data files created by this run or by previous calls to run().
 void enableDebug()
          Turn on debug logging for jobs that are run from this pipeline.
 org.apache.hadoop.conf.Configuration getConfiguration()
          Returns the Configuration instance associated with this pipeline.
 <T> Iterable<T> materialize(PCollection<T> pcollection)
          Creates the given PCollection and reads the data it contains into the returned Iterable for client use.
 <T> PCollection<T> read(Source<T> source)
          Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.
 <K,V> PTable<K,V> read(TableSource<K,V> tableSource)
          A version of the read method for TableSource instances that map to PTables.
 PCollection<String> readTextFile(String pathName)
          A convenience method that reads the text file at pathName into a PCollection of Strings.
 void run()
          Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
 void setConfiguration(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration to use with this pipeline.
 void write(PCollection<?> collection, Target target)
          Write the given collection to the given target on the next pipeline run.
 <T> void writeTextFile(PCollection<T> collection, String pathName)
          A convenience method that writes the given collection as a text file at pathName.
 

Method Detail

setConfiguration

void setConfiguration(org.apache.hadoop.conf.Configuration conf)
Set the Configuration to use with this pipeline.


getConfiguration

org.apache.hadoop.conf.Configuration getConfiguration()
Returns the Configuration instance associated with this pipeline.


read

<T> PCollection<T> read(Source<T> source)
Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.

Parameters:
source - The source of data
Returns:
A PCollection that references the given source

read

<K,V> PTable<K,V> read(TableSource<K,V> tableSource)
A version of the read method for TableSource instances that map to PTables.

Parameters:
tableSource - The source of the data
Returns:
A PTable that references the given source

write

void write(PCollection<?> collection,
           Target target)
Write the given collection to the given target on the next pipeline run.

Parameters:
collection - The collection
target - The output target
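
The phrase "on the next pipeline run" means that write only registers a request; nothing is produced until run() (or done()) is called. The sketch below illustrates that deferred behaviour with a toy, in-memory stand-in. DeferredWriteSketch, PendingWrite, hasOutput and output are hypothetical names invented for this illustration and are not part of the Crunch API; a plain List stands in for a PCollection and a String for a Target.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the deferred-write behaviour described above.
// None of these names are Crunch classes.
class DeferredWriteSketch {
    // One queued request: the data to write and the name of its destination.
    private static final class PendingWrite {
        final List<String> data;
        final String target;

        PendingWrite(List<String> data, String target) {
            this.data = data;
            this.target = target;
        }
    }

    private final List<PendingWrite> pending = new ArrayList<PendingWrite>();
    // Stands in for the file system: target name -> written records.
    private final Map<String, List<String>> outputs =
            new LinkedHashMap<String, List<String>>();

    // Like write(collection, target): records the request, produces no output yet.
    void write(List<String> collection, String target) {
        pending.add(new PendingWrite(collection, target));
    }

    // Like run(): performs every write queued since the previous run.
    void run() {
        for (PendingWrite w : pending) {
            outputs.put(w.target, new ArrayList<String>(w.data));
        }
        pending.clear();
    }

    boolean hasOutput(String target) {
        return outputs.containsKey(target);
    }

    List<String> output(String target) {
        return outputs.get(target);
    }

    public static void main(String[] args) {
        DeferredWriteSketch pipeline = new DeferredWriteSketch();
        pipeline.write(Arrays.asList("a", "b"), "out/letters");
        System.out.println(pipeline.hasOutput("out/letters")); // prints false
        pipeline.run();
        System.out.println(pipeline.hasOutput("out/letters")); // prints true
    }
}
```

Queuing requests this way lets a pipeline see all requested outputs before it plans any work, which is one plausible reason for the deferred design.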

materialize

<T> Iterable<T> materialize(PCollection<T> pcollection)
Creates the given PCollection and reads the data it contains into the returned Iterable for client use.

Parameters:
pcollection - The PCollection to materialize
Returns:
the data from the PCollection as a read-only Iterable
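
As a rough illustration of what a read-only result means for client code, the sketch below wraps a list so that callers can iterate over it but cannot modify it. This is plain Java collections code, not Crunch: ReadOnlyIterableSketch and its methods are hypothetical names, a List stands in for the PCollection, and how a real Pipeline implementation enforces read-only access is not stated on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Generic Java illustration of a read-only Iterable; not Crunch code.
class ReadOnlyIterableSketch {
    // Hypothetical stand-in for materialize: copies the records and
    // returns them behind an unmodifiable view.
    static <T> Iterable<T> materialize(List<T> records) {
        return Collections.unmodifiableList(new ArrayList<T>(records));
    }

    // Client-side use: iterating works as usual.
    static <T> int count(Iterable<T> items) {
        int n = 0;
        for (T ignored : items) {
            n++;
        }
        return n;
    }

    // Returns true if removing through the iterator is rejected.
    static <T> boolean removalRejected(Iterable<T> items) {
        Iterator<T> it = items.iterator();
        if (!it.hasNext()) {
            return false;
        }
        it.next();
        try {
            it.remove();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Iterable<String> words = materialize(Arrays.asList("x", "y", "z"));
        System.out.println(count(words));           // prints 3
        System.out.println(removalRejected(words)); // prints true
    }
}
```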

run

void run()
Constructs and executes a series of MapReduce jobs in order to write data to the output targets.


done

void done()
Runs any remaining jobs required to generate outputs, then cleans up any intermediate data files created by this run or by previous calls to run().
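
The difference between run() and done() can be sketched with a toy model: run() produces the pending outputs and leaves intermediate data in place, while done() runs whatever is still pending and then removes intermediate data from every run. LifecycleSketch, requestOutput and the "tmp/run-N" names are hypothetical and exist only for this illustration; real implementations manage actual files and jobs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the run()/done() lifecycle; not Crunch code.
class LifecycleSketch {
    private final List<String> pendingOutputs = new ArrayList<String>();
    private final List<String> intermediateFiles = new ArrayList<String>();
    private final List<String> finishedOutputs = new ArrayList<String>();
    private int runCount = 0;

    // Stands in for write(...): registers an output to produce later.
    void requestOutput(String name) {
        pendingOutputs.add(name);
    }

    // run(): produce pending outputs, leaving intermediate data in place.
    void run() {
        if (pendingOutputs.isEmpty()) {
            return;
        }
        runCount++;
        intermediateFiles.add("tmp/run-" + runCount); // hypothetical scratch data
        finishedOutputs.addAll(pendingOutputs);
        pendingOutputs.clear();
    }

    // done(): run anything still pending, then clean up scratch data
    // left behind by this run and by earlier calls to run().
    void done() {
        run();
        intermediateFiles.clear();
    }

    List<String> finishedOutputs() {
        return finishedOutputs;
    }

    int intermediateCount() {
        return intermediateFiles.size();
    }

    public static void main(String[] args) {
        LifecycleSketch pipeline = new LifecycleSketch();
        pipeline.requestOutput("first");
        pipeline.run();
        System.out.println(pipeline.intermediateCount()); // prints 1
        pipeline.requestOutput("second");
        pipeline.done();
        System.out.println(pipeline.finishedOutputs());   // prints [first, second]
        System.out.println(pipeline.intermediateCount()); // prints 0
    }
}
```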


readTextFile

PCollection<String> readTextFile(String pathName)
A convenience method that reads the text file at pathName into a PCollection of Strings.


writeTextFile

<T> void writeTextFile(PCollection<T> collection,
                       String pathName)
A convenience method that writes the given collection as a text file at pathName.


enableDebug

void enableDebug()
Turn on debug logging for jobs that are run from this pipeline.



Copyright © 2012. All Rights Reserved.