com.cloudera.crunch
Interface PCollection<S>

All Known Subinterfaces:
PGroupedTable<K,V>, PTable<K,V>
All Known Implementing Classes:
DoCollectionImpl, DoTableImpl, InputCollection, InputTable, MemCollection, MemTable, PCollectionImpl, PGroupedTableImpl, PTableBase, UnionCollection, UnionTable

public interface PCollection<S>

A representation of an immutable, distributed collection of elements that is the fundamental target of computations in Crunch.


Method Summary
 PTable<S,Long> count()
          Returns a PTable instance that contains the counts of each unique element of this PCollection.
 PCollection<S> filter(FilterFn<S> filterFn)
          Applies the given filter function to this PCollection and returns a new PCollection containing only the elements the function accepts.
 String getName()
          Returns a shorthand name for this PCollection.
 Pipeline getPipeline()
          Returns the Pipeline associated with this PCollection.
 PType<S> getPType()
          Returns the PType of this PCollection.
 long getSize()
          Returns the size of the data represented by this PCollection in bytes.
 PTypeFamily getTypeFamily()
          Returns the PTypeFamily of this PCollection.
 Iterable<S> materialize()
          Returns a reference to the data set represented by this PCollection that may be used by the client to read the data locally.
 PCollection<S> max()
          Returns a PCollection made up of only the maximum element of this instance.
 PCollection<S> min()
          Returns a PCollection made up of only the minimum element of this instance.
 <K,V> PTable<K,V> parallelDo(DoFn<S,Pair<K,V>> doFn, PTableType<K,V> type)
          Like the parallelDo overload that returns a PCollection, but takes a DoFn that emits Pair<K,V> values and returns a PTable<K,V>.
 <T> PCollection<T> parallelDo(DoFn<S,T> doFn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
 <K,V> PTable<K,V> parallelDo(String name, DoFn<S,Pair<K,V>> doFn, PTableType<K,V> type)
          Like the named parallelDo overload that returns a PCollection, but takes a DoFn that emits Pair<K,V> values and returns a PTable<K,V>.
 <T> PCollection<T> parallelDo(String name, DoFn<S,T> doFn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
 PCollection<S> sample(double acceptanceProbability)
          Randomly sample items from this PCollection instance with the given probability of an item being accepted.
 PCollection<S> sample(double acceptanceProbability, long seed)
          Randomly samples items from this PCollection, accepting each item with the given probability and seeding the random number generator with the given seed.
 PCollection<S> sort(boolean ascending)
          Returns a PCollection instance that contains all of the elements of this instance, sorted in ascending order if ascending is true and in descending order otherwise.
 PCollection<S> union(PCollection<S>... collections)
          Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.
 PCollection<S> write(Target target)
          Writes the contents of this PCollection to the given Target, using the storage format specified by the target.
 

Method Detail

getPipeline

Pipeline getPipeline()
Returns the Pipeline associated with this PCollection.


union

PCollection<S> union(PCollection<S>... collections)
Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.
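
As a rough illustration only, the following plain-Java sketch models union on in-memory lists rather than on Crunch collections. The class `UnionSketch` and its `union` helper are hypothetical names introduced here. The description above does not say whether duplicate elements are removed; this sketch assumes they are kept.

```java
import java.util.ArrayList;
import java.util.List;

final class UnionSketch {
    /**
     * In-memory analogy for union: returns a new list holding the elements of
     * the first list followed by the elements of every other list.
     * Assumption: duplicates are kept (the documentation does not say).
     */
    @SafeVarargs
    static <S> List<S> union(List<S> first, List<S>... others) {
        List<S> out = new ArrayList<>(first);
        for (List<S> other : others) {
            out.addAll(other);
        }
        return out;
    }
}
```

The inputs are left unmodified, which mirrors the immutability of a PCollection.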


parallelDo

<T> PCollection<T> parallelDo(DoFn<S,T> doFn,
                              PType<T> type)
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Parameters:
doFn - The DoFn to apply
type - The PType of the resulting PCollection
Returns:
a new PCollection
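
To make the element-wise contract above concrete, here is a minimal plain-Java sketch that runs on an in-memory list and does not use Crunch itself. `SimpleDoFn`, `SimpleEmitter`, `applyDoFn`, and `splitWords` are hypothetical stand-ins introduced for illustration; the sketch assumes a function may emit zero, one, or several outputs per input element.

```java
import java.util.ArrayList;
import java.util.List;

final class ParallelDoSketch {
    /** Hypothetical stand-in for an output sink; not a Crunch type. */
    interface SimpleEmitter<T> {
        void emit(T value);
    }

    /** Hypothetical stand-in for a per-element function; not a Crunch type. */
    interface SimpleDoFn<S, T> {
        void process(S input, SimpleEmitter<T> emitter);
    }

    /** Applies fn to every input and gathers everything it emits into a new list. */
    static <S, T> List<T> applyDoFn(List<S> inputs, SimpleDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S input : inputs) {
            fn.process(input, out::add);
        }
        return out;
    }

    /** Example function: splits each line into words, skipping blank lines. */
    static List<String> splitWords(List<String> lines) {
        return applyDoFn(lines, (line, emitter) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        });
    }
}
```

In this analogy the output list plays the role of the new PCollection, and its element type corresponds to the `type` parameter.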

parallelDo

<T> PCollection<T> parallelDo(String name,
                              DoFn<S,T> doFn,
                              PType<T> type)
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Parameters:
name - An identifier for this processing step, useful for debugging
doFn - The DoFn to apply
type - The PType of the resulting PCollection
Returns:
a new PCollection

parallelDo

<K,V> PTable<K,V> parallelDo(DoFn<S,Pair<K,V>> doFn,
                             PTableType<K,V> type)
Like the parallelDo overload that returns a PCollection, but takes a DoFn that emits Pair<K,V> values and returns a PTable<K,V>.

Parameters:
doFn - The DoFn to apply
type - The PTableType of the resulting PTable
Returns:
a new PTable

parallelDo

<K,V> PTable<K,V> parallelDo(String name,
                             DoFn<S,Pair<K,V>> doFn,
                             PTableType<K,V> type)
Like the named parallelDo overload that returns a PCollection, but takes a DoFn that emits Pair<K,V> values and returns a PTable<K,V>.

Parameters:
name - An identifier for this processing step, useful for debugging
doFn - The DoFn to apply
type - The PTableType of the resulting PTable
Returns:
a new PTable

write

PCollection<S> write(Target target)
Writes the contents of this PCollection to the given Target, using the storage format specified by the target.

Parameters:
target - The target to write to

materialize

Iterable<S> materialize()
Returns a reference to the data set represented by this PCollection that may be used by the client to read the data locally.


getPType

PType<S> getPType()
Returns the PType of this PCollection.


getTypeFamily

PTypeFamily getTypeFamily()
Returns the PTypeFamily of this PCollection.


getSize

long getSize()
Returns the size of the data represented by this PCollection in bytes.


getName

String getName()
Returns a shorthand name for this PCollection.


filter

PCollection<S> filter(FilterFn<S> filterFn)
Applies the given filter function to this PCollection and returns a new PCollection containing only the elements the function accepts.
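
The filtering behavior can be sketched in plain Java on an in-memory list. This is an analogy only: `FilterSketch` is a hypothetical class, and `java.util.function.Predicate` stands in for FilterFn under the assumption that an element is kept when the function accepts it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class FilterSketch {
    /** Returns a new list with only the elements for which accept returns true. */
    static <S> List<S> filter(List<S> input, Predicate<S> accept) {
        List<S> out = new ArrayList<>();
        for (S element : input) {
            if (accept.test(element)) {
                out.add(element);
            }
        }
        return out;
    }
}
```

For example, `FilterSketch.filter(numbers, n -> n % 2 == 0)` keeps only the even values of a hypothetical `numbers` list and preserves their relative order.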


sort

PCollection<S> sort(boolean ascending)
Returns a PCollection instance that contains all of the elements of this instance, sorted in ascending order if ascending is true and in descending order otherwise.
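
A plain-Java sketch of the same idea on an in-memory list follows. `SortSketch` is a hypothetical class, and the sketch assumes elements are compared by their natural ordering, which the description above does not specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortSketch {
    /**
     * Returns a sorted copy of the input and leaves the input untouched.
     * Assumption: elements are ordered by their natural ordering.
     */
    static <S extends Comparable<? super S>> List<S> sort(List<S> input, boolean ascending) {
        Comparator<S> order = ascending
                ? Comparator.<S>naturalOrder()
                : Comparator.<S>reverseOrder();
        List<S> out = new ArrayList<>(input);
        out.sort(order);
        return out;
    }
}
```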


count

PTable<S,Long> count()
Returns a PTable instance that contains the counts of each unique element of this PCollection.
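
The result shape can be pictured with a plain-Java sketch in which a `Map<S, Long>` plays the role of the returned PTable<S,Long>. `CountSketch` is a hypothetical class, and the sketch assumes elements are grouped by `equals` and `hashCode`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CountSketch {
    /** Maps each distinct element to the number of times it occurs in the input. */
    static <S> Map<S, Long> count(List<S> input) {
        Map<S, Long> counts = new LinkedHashMap<>();
        for (S element : input) {
            // Start at 1 for a new key, otherwise add 1 to the running total.
            counts.merge(element, 1L, Long::sum);
        }
        return counts;
    }
}
```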


max

PCollection<S> max()
Returns a PCollection made up of only the maximum element of this instance.


min

PCollection<S> min()
Returns a PCollection made up of only the minimum element of this instance.
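
Both max and min return a collection rather than a bare element. The plain-Java sketch below mirrors that shape with one-element lists. `MinMaxSketch` is a hypothetical class; comparing by natural ordering and returning an empty list for empty input are assumptions, since the descriptions above do not cover either point.

```java
import java.util.Collections;
import java.util.List;

final class MinMaxSketch {
    /** Returns a one-element list with the largest input, or an empty list if there is none. */
    static <S extends Comparable<? super S>> List<S> max(List<S> input) {
        if (input.isEmpty()) {
            return Collections.emptyList();
        }
        return Collections.singletonList(Collections.max(input));
    }

    /** Returns a one-element list with the smallest input, or an empty list if there is none. */
    static <S extends Comparable<? super S>> List<S> min(List<S> input) {
        if (input.isEmpty()) {
            return Collections.emptyList();
        }
        return Collections.singletonList(Collections.min(input));
    }
}
```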


sample

PCollection<S> sample(double acceptanceProbability)
Randomly sample items from this PCollection instance with the given probability of an item being accepted.


sample

PCollection<S> sample(double acceptanceProbability,
                      long seed)
Randomly samples items from this PCollection, accepting each item with the given probability and seeding the random number generator with the given seed.
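
The role of the seed can be illustrated with a plain-Java sketch on an in-memory list. `SampleSketch` is a hypothetical class, and the sketch assumes each item is accepted independently by comparing a uniform random draw against acceptanceProbability; the actual sampling strategy used by Crunch is not described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SampleSketch {
    /**
     * Keeps each element with the given probability (expected to lie in [0, 1]).
     * The same seed and the same input always produce the same sample.
     */
    static <S> List<S> sample(List<S> input, double acceptanceProbability, long seed) {
        Random rng = new Random(seed);
        List<S> out = new ArrayList<>();
        for (S element : input) {
            // nextDouble() is uniform in [0.0, 1.0), so 0.0 rejects all and 1.0 accepts all.
            if (rng.nextDouble() < acceptanceProbability) {
                out.add(element);
            }
        }
        return out;
    }
}
```

Fixing the seed makes a sample reproducible across runs, which is convenient in tests.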



Copyright © 2012. All Rights Reserved.