com.cloudera.cdk.data.filesystem
Class FileSystemDatasetRepository

java.lang.Object
  extended by com.cloudera.cdk.data.filesystem.FileSystemDatasetRepository
All Implemented Interfaces:
DatasetRepository

public class FileSystemDatasetRepository
extends Object
implements DatasetRepository

A DatasetRepository that stores data in a Hadoop FileSystem.

Given a FileSystem, a root directory, and a MetadataProvider, this DatasetRepository implementation can load and store Datasets on both local filesystems and the Hadoop Distributed FileSystem (HDFS). Users may instantiate this class directly with the three dependencies above and then perform dataset-related operations using any of the provided methods. The primary methods of interest are create(String, com.cloudera.cdk.data.DatasetDescriptor), get(String), and drop(String), which create a new dataset, load an existing dataset, and delete an existing dataset, respectively. Once a dataset has been created or loaded, users can invoke the appropriate Dataset methods to get a reader or writer as needed.
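
The create/get/drop lifecycle described above can be illustrated with a small in-memory stand-in. This is a sketch only: RepositoryContractSketch and SketchDataset are hypothetical names that are not part of this library, a plain String takes the place of a DatasetDescriptor, and IllegalStateException stands in for the repository's own exception types.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the repository contract; not the real class. */
final class RepositoryContractSketch {

    /** Stand-in for a Dataset: just a name and a schema string (assumption). */
    static final class SketchDataset {
        final String name;
        final String schema;

        SketchDataset(String name, String schema) {
            this.name = name;
            this.schema = schema;
        }
    }

    private final Map<String, SketchDataset> datasets = new HashMap<>();

    /** Mirrors create(String, DatasetDescriptor): a duplicate name is rejected. */
    SketchDataset create(String name, String schema) {
        if (datasets.containsKey(name)) {
            throw new IllegalStateException("Dataset already exists: " + name);
        }
        SketchDataset dataset = new SketchDataset(name, schema);
        datasets.put(name, dataset);
        return dataset;
    }

    /** Mirrors get(String): a missing dataset raises an exception instead of returning null. */
    SketchDataset get(String name) {
        SketchDataset dataset = datasets.get(name);
        if (dataset == null) {
            throw new IllegalStateException("No such dataset: " + name);
        }
        return dataset;
    }

    /** Mirrors drop(String): a missing dataset raises an exception; otherwise returns true. */
    boolean drop(String name) {
        if (!datasets.containsKey(name)) {
            throw new IllegalStateException("No such dataset: " + name);
        }
        return datasets.remove(name) != null;
    }

    public static void main(String[] args) {
        RepositoryContractSketch repo = new RepositoryContractSketch();
        repo.create("events", "schema-v1");
        System.out.println("loaded: " + repo.get("events").name);
        System.out.println("dropped: " + repo.drop("events"));
    }
}
```

With the real class, the same sequence would be issued against a repository constructed from a FileSystem, a root directory, and a MetadataProvider.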

See Also:
DatasetRepository, Dataset, DatasetDescriptor, PartitionStrategy, MetadataProvider

Nested Class Summary
static class FileSystemDatasetRepository.Builder
          A fluent builder to aid in the construction of FileSystemDatasetRepository instances.
 
Constructor Summary
FileSystemDatasetRepository(FileSystem fileSystem, Path rootDirectory)
          Construct a FileSystemDatasetRepository on the given FileSystem and root directory, and a FileSystemMetadataProvider with the same FileSystem and root directory.
FileSystemDatasetRepository(FileSystem fileSystem, Path rootDirectory, MetadataProvider metadataProvider)
          Construct a FileSystemDatasetRepository on the given FileSystem and root directory, with the given MetadataProvider for metadata storage.
FileSystemDatasetRepository(URI uri)
          Construct a FileSystemDatasetRepository with a root directory at the given URI, and a FileSystemMetadataProvider with the same root directory.
 
Method Summary
 Dataset create(String name, DatasetDescriptor descriptor)
          Create a Dataset with the supplied descriptor.
 boolean drop(String name)
          Drop the named Dataset.
 Dataset get(String name)
          Get the latest version of a named Dataset.
 FileSystem getFileSystem()
           
 MetadataProvider getMetadataProvider()
           
 Path getRootDirectory()
           
static PartitionKey partitionKeyForPath(Dataset dataset, URI partitionPath)
          Get a PartitionKey corresponding to a partition's filesystem path represented as a URI.
protected  Path pathForDataset(String name)
           Implementations should return the fully-qualified path of the data directory for the dataset with the given name.
 String toString()
           
 Dataset update(String name, DatasetDescriptor descriptor)
          Update an existing Dataset to reflect the supplied descriptor.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

FileSystemDatasetRepository

public FileSystemDatasetRepository(FileSystem fileSystem,
                                   Path rootDirectory)
Construct a FileSystemDatasetRepository on the given FileSystem and root directory, and a FileSystemMetadataProvider with the same FileSystem and root directory.

Parameters:
fileSystem - the filesystem to store metadata and datasets in
rootDirectory - the root directory for metadata and datasets

FileSystemDatasetRepository

public FileSystemDatasetRepository(URI uri)
Construct a FileSystemDatasetRepository with a root directory at the given URI, and a FileSystemMetadataProvider with the same root directory.

Parameters:
uri - the root directory for metadata and datasets
Since:
0.3.0

FileSystemDatasetRepository

public FileSystemDatasetRepository(FileSystem fileSystem,
                                   Path rootDirectory,
                                   MetadataProvider metadataProvider)
Construct a FileSystemDatasetRepository on the given FileSystem and root directory, with the given MetadataProvider for metadata storage.

Parameters:
fileSystem - the filesystem to store datasets in
rootDirectory - the root directory for datasets
metadataProvider - the provider for metadata storage

Method Detail

create

public Dataset create(String name,
                      DatasetDescriptor descriptor)
Description copied from interface: DatasetRepository
Create a Dataset with the supplied descriptor. Depending on the underlying dataset storage, some schema types or configurations may not be supported. If an illegal schema is supplied, an exception will be thrown by the implementing class. It is illegal to create more than one dataset with a given name. If a duplicate name is provided, an exception is thrown.

Specified by:
create in interface DatasetRepository
Parameters:
name - The fully qualified dataset name
descriptor - A descriptor that describes the schema and other properties of the dataset
Returns:
The newly created dataset

update

public Dataset update(String name,
                      DatasetDescriptor descriptor)
Description copied from interface: DatasetRepository
Update an existing Dataset to reflect the supplied descriptor. The common case is updating a dataset schema. Depending on the underlying dataset storage, some updates may not be supported, such as a change in format or partition strategy. Any attempt to make an unsupported or incompatible update will result in an exception being thrown and no change being made to the dataset.

Specified by:
update in interface DatasetRepository
Parameters:
name - The fully qualified dataset name
descriptor - A descriptor that describes the schema and other properties of the dataset
Returns:
The updated dataset
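
The guarantee that a rejected update leaves the dataset unchanged can be sketched as validating before mutating. The example below is a stand-in under stated assumptions: UpdateContractSketch and SketchDescriptor are hypothetical names, a descriptor is reduced to a schema string and a format string, and a format change is treated as the unsupported case purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the update contract; not the real implementation. */
final class UpdateContractSketch {

    /** Stand-in for a DatasetDescriptor: a schema and a storage format (assumption). */
    static final class SketchDescriptor {
        final String schema;
        final String format;

        SketchDescriptor(String schema, String format) {
            this.schema = schema;
            this.format = format;
        }
    }

    private final Map<String, SketchDescriptor> descriptors = new HashMap<>();

    void create(String name, SketchDescriptor descriptor) {
        descriptors.put(name, descriptor);
    }

    SketchDescriptor descriptorOf(String name) {
        return descriptors.get(name);
    }

    /** Validates first, then stores, so a failed update changes nothing. */
    SketchDescriptor update(String name, SketchDescriptor updated) {
        SketchDescriptor current = descriptors.get(name);
        if (current == null) {
            throw new IllegalStateException("No such dataset: " + name);
        }
        // Assumption for this sketch: changing the format is an unsupported update.
        if (!current.format.equals(updated.format)) {
            throw new IllegalArgumentException(
                "Unsupported format change: " + current.format + " -> " + updated.format);
        }
        descriptors.put(name, updated);
        return updated;
    }

    public static void main(String[] args) {
        UpdateContractSketch repo = new UpdateContractSketch();
        repo.create("events", new SketchDescriptor("schema-v1", "avro"));
        repo.update("events", new SketchDescriptor("schema-v2", "avro"));
        System.out.println("schema is now: " + repo.descriptorOf("events").schema);
    }
}
```

Because the check runs before the map is touched, no rollback logic is needed when an update is rejected.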

get

public Dataset get(String name)
Description copied from interface: DatasetRepository
Get the latest version of a named Dataset. If no dataset with the provided name exists, a DatasetRepositoryException is thrown.

Specified by:
get in interface DatasetRepository
Parameters:
name - The name of the dataset.

drop

public boolean drop(String name)
Description copied from interface: DatasetRepository
Drop the named Dataset. If no dataset with the provided name exists, a DatasetReaderException is thrown.

Specified by:
drop in interface DatasetRepository
Parameters:
name - The name of the dataset.
Returns:
true if the dataset was successfully dropped, false otherwise

partitionKeyForPath

public static PartitionKey partitionKeyForPath(Dataset dataset,
                                               URI partitionPath)
Get a PartitionKey corresponding to a partition's filesystem path represented as a URI. If the path is not a valid partition, then IllegalArgumentException is thrown. Note that the partition does not have to exist.

Parameters:
dataset - the filesystem dataset
partitionPath - a directory path where the partition data is stored
Returns:
a partition key representing the partition at the given path
Since:
0.4.0
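
The following sketch shows one way a partition path could be mapped to partition values without touching the filesystem, which is why the partition need not exist. It is not this method's implementation: PartitionPathSketch and partitionValues are hypothetical names, and the name=value directory layout (for example year=2013/month=07) is an assumption made for illustration. Only java.net.URI from the standard library is used.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of deriving partition values from a path; not the real method. */
final class PartitionPathSketch {

    /**
     * Returns the value of each name=value segment found below datasetRoot.
     * Throws IllegalArgumentException when the path is not a partition of the dataset.
     */
    static List<String> partitionValues(URI datasetRoot, URI partitionPath) {
        // URI.relativize returns its argument unchanged when datasetRoot is not a prefix.
        URI relative = datasetRoot.relativize(partitionPath);
        if (relative.equals(partitionPath) || relative.getPath().isEmpty()) {
            throw new IllegalArgumentException(
                "Not a partition of " + datasetRoot + ": " + partitionPath);
        }
        List<String> values = new ArrayList<>();
        for (String segment : relative.getPath().split("/")) {
            int eq = segment.indexOf('=');
            // Each segment must look like name=value with both parts non-empty.
            if (eq <= 0 || eq == segment.length() - 1) {
                throw new IllegalArgumentException("Not a name=value segment: " + segment);
            }
            values.add(segment.substring(eq + 1));
        }
        return values;
    }

    public static void main(String[] args) {
        URI root = URI.create("file:/data/events");
        URI partition = URI.create("file:/data/events/year=2013/month=07");
        System.out.println(partitionValues(root, partition)); // prints [2013, 07]
    }
}
```

The real method returns a PartitionKey rather than a list of strings and takes the Dataset itself, from which the dataset directory and partition strategy are available.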

pathForDataset

protected Path pathForDataset(String name)

Implementations should return the fully-qualified path of the data directory for the dataset with the given name.

This method is for internal use only and users should not call it directly.

Since:
0.2.0

toString

public String toString()
Overrides:
toString in class Object

getRootDirectory

public Path getRootDirectory()
Returns:
the root directory in the filesystem where datasets are stored.

getFileSystem

public FileSystem getFileSystem()
Returns:
the FileSystem on which datasets are stored.

getMetadataProvider

public MetadataProvider getMetadataProvider()
Returns:
the MetadataProvider being used by this repository.
Since:
0.2.0


Copyright © 2013 Cloudera. All rights reserved.