com.cloudera.crunch
Class DoFn<S,T>

java.lang.Object
  extended by com.cloudera.crunch.DoFn<S,T>
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
Aggregate.TopKFn, CombineFn, FilterFn, MapFn, MapKeysFn, MapValuesFn, Sample.SamplerFn

public abstract class DoFn<S,T>
extends Object
implements Serializable

Base class for all data processing functions in Crunch.

Note that all DoFn instances implement Serializable, and thus all of their non-transient member variables must implement Serializable as well. If your DoFn depends on non-serializable classes for data processing, they may be declared as transient and initialized in the DoFn's initialize method.

See Also:
Serialized Form

Constructor Summary
DoFn()
           
 
Method Summary
 void cleanup(Emitter<T> emitter)
          Called during the cleanup of the MapReduce job this DoFn is associated with.
 void configure(org.apache.hadoop.conf.Configuration conf)
          Called during the job planning phase.
protected  org.apache.hadoop.conf.Configuration getConfiguration()
           
protected  org.apache.hadoop.mapreduce.Counter getCounter(Enum<?> counterName)
           
protected  org.apache.hadoop.mapreduce.Counter getCounter(String groupName, String counterName)
           
protected  String getStatus()
           
protected  org.apache.hadoop.mapreduce.TaskAttemptID getTaskAttemptID()
           
 void initialize()
          Called during the setup of the MapReduce job this DoFn is associated with.
abstract  void process(S input, Emitter<T> emitter)
          Processes the records from a PCollection.
protected  void progress()
           
 float scaleFactor()
          Returns an estimate of how applying this function to a PCollection will cause it to change in side.
 void setConfigurationForTest(org.apache.hadoop.conf.Configuration conf)
          Sets a Configuration instance to be used during unit tests.
 void setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)
          Called during setup to pass the TaskInputOutputContext to this DoFn instance.
protected  void setStatus(String status)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DoFn

public DoFn()
Method Detail

configure

public void configure(org.apache.hadoop.conf.Configuration conf)
Called during the job planning phase. Subclasses may override this method in order to modify the configuration of the Job that this DoFn instance belongs to.

Parameters:
conf - The Configuration instance for the Job.

process

public abstract void process(S input,
                             Emitter<T> emitter)
Processes the records from a PCollection.

Parameters:
input - The input record
emitter - The emitter to send the output to

initialize

public void initialize()
Called during the setup of the MapReduce job this DoFn is associated with. Subclasses may override this method to do appropriate initialization.


cleanup

public void cleanup(Emitter<T> emitter)
Called during the cleanup of the MapReduce job this DoFn is associated with. Subclasses may override this method to do appropriate cleanup.

Parameters:
emitter - The emitter that was used for output

setContext

public void setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)
Called during setup to pass the TaskInputOutputContext to this DoFn instance.


setConfigurationForTest

public void setConfigurationForTest(org.apache.hadoop.conf.Configuration conf)
Sets a Configuration instance to be used during unit tests.

Parameters:
conf - The Configuration instance.

scaleFactor

public float scaleFactor()
Returns an estimate of how applying this function to a PCollection will cause it to change in side. The optimizer uses these estimates to decide where to break up dependent MR jobs into separate Map and Reduce phases in order to minimize I/O.

Subclasses of DoFn that will substantially alter the size of the resulting PCollection should override this method.


getConfiguration

protected org.apache.hadoop.conf.Configuration getConfiguration()

getCounter

protected org.apache.hadoop.mapreduce.Counter getCounter(Enum<?> counterName)

getCounter

protected org.apache.hadoop.mapreduce.Counter getCounter(String groupName,
                                                         String counterName)

progress

protected void progress()

getTaskAttemptID

protected org.apache.hadoop.mapreduce.TaskAttemptID getTaskAttemptID()

setStatus

protected void setStatus(String status)

getStatus

protected String getStatus()


Copyright © 2012. All Rights Reserved.