com.cloudera.cdk.data
Class DatasetRepositories

java.lang.Object
  extended by com.cloudera.cdk.data.DatasetRepositories

public class DatasetRepositories
extends Object

Convenience methods for working with DatasetRepository instances.

Since:
0.8.0

Constructor Summary
DatasetRepositories()
           
 
Method Summary
static DatasetRepository open(String uri)
          Synonym for open(java.net.URI) for String URIs.
static DatasetRepository open(URI repositoryUri)
           Open a DatasetRepository for the given URI.
static void register(URIPattern pattern, OptionBuilder<DatasetRepository> builder)
          Registers a URIPattern and an OptionBuilder to create instances of DatasetRepository from the pattern's match options.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DatasetRepositories

public DatasetRepositories()
Method Detail

register

public static void register(URIPattern pattern,
                            OptionBuilder<DatasetRepository> builder)
Registers a URIPattern and an OptionBuilder to create instances of DatasetRepository from the pattern's match options.

Parameters:
pattern - a URIPattern
builder - an OptionBuilder that expects options defined by pattern and builds DatasetRepository instances.
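
For illustration, a minimal sketch of registering a custom scheme. It assumes that URIPattern and OptionBuilder live in com.cloudera.cdk.data.spi, that URIPattern accepts a java.net.URI, that *path captures the remainder of the path under the match option "path", and that OptionBuilder declares getFromOptions(Map); verify these details against the classes shipped with your CDK version. The hypothetical builder simply delegates to a local file repository.

import java.net.URI;
import java.util.Map;

import com.cloudera.cdk.data.DatasetRepositories;
import com.cloudera.cdk.data.DatasetRepository;
import com.cloudera.cdk.data.spi.OptionBuilder;   // assumed package location
import com.cloudera.cdk.data.spi.URIPattern;      // assumed package location

public class CustomSchemeRegistration {
  public static void main(String[] args) {
    // Hypothetical pattern: matches URIs of the form custom:/some/path and
    // captures the path as the match option "path" (capture syntax assumed).
    URIPattern pattern = new URIPattern(URI.create("custom:/*path"));

    DatasetRepositories.register(pattern, new OptionBuilder<DatasetRepository>() {
      @Override
      public DatasetRepository getFromOptions(Map<String, String> options) {
        // For illustration only: back the custom scheme with a local
        // file-based repository rooted at the captured path.
        return DatasetRepositories.open("repo:file:/" + options.get("path"));
      }
    });
    // After registration, repository URIs matching the pattern resolve
    // through the builder above.
  }
}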

open

public static DatasetRepository open(String uri)
Synonym for open(java.net.URI) for String URIs.

Parameters:
uri - a String URI
Returns:
a DatasetRepository for the given URI.
Throws:
IllegalArgumentException - If the String cannot be parsed into a valid URI.
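
A short usage sketch, using the local filesystem example URI documented under open(URI) below:

import com.cloudera.cdk.data.DatasetRepositories;
import com.cloudera.cdk.data.DatasetRepository;

public class OpenFromString {
  public static void main(String[] args) {
    // Opens a repository rooted at ./foo/bar on the local filesystem;
    // an unparseable String would throw IllegalArgumentException.
    DatasetRepository repo = DatasetRepositories.open("repo:file:foo/bar");
    System.out.println(repo);
  }
}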

open

public static DatasetRepository open(URI repositoryUri)

Open a DatasetRepository for the given URI.

This method provides a simple way to connect to a DatasetRepository, with the URI supplying the information needed to select the appropriate MetadataProvider and other options. In almost all cases, this is the preferred way to retrieve an instance of a DatasetRepository.

The format of a repository URI is as follows:

repo:[storage component]

The [storage component] indicates the underlying metadata and, in some cases, physical storage of the data, along with any options. The supported storage backends are:

Local FileSystem URIs

file:[path] where [path] is a relative or absolute filesystem path to be used as the dataset repository root directory in which to store dataset data. When specifying an absolute path, the null-authority form (e.g. file:///my/path) may be used. Alternatively, the authority section may be omitted entirely (e.g. file:/my/path). Either way, it is illegal to provide an authority (e.g. file://this-part-is-illegal/my/path). This storage backend produces a DatasetRepository that stores both data and metadata on the local operating system filesystem. See FileSystemDatasetRepository for more information.
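
For example, the legal forms above (and the illegal one, commented out), with placeholder directories:

import com.cloudera.cdk.data.DatasetRepositories;
import com.cloudera.cdk.data.DatasetRepository;

public class LocalFileSystemUris {
  public static void main(String[] args) {
    // Absolute path using the null-authority form (file:///...).
    DatasetRepository nullAuthority = DatasetRepositories.open("repo:file:///var/cdk/data");

    // Absolute path with the authority section omitted entirely (file:/...).
    DatasetRepository noAuthority = DatasetRepositories.open("repo:file:/var/cdk/data");

    // Illegal: a non-empty authority is not allowed for file URIs.
    // DatasetRepositories.open("repo:file://this-part-is-illegal/var/cdk/data");
  }
}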

HDFS FileSystem URIs

hdfs://[host]:[port]/[path] where [host] and [port] indicate the location of the Hadoop NameNode, and [path] is the dataset repository root directory in which to store dataset data. This form will load the Hadoop configuration information per the usual methods (i.e. searching the process's classpath for the various configuration files). This storage backend will produce a DatasetRepository that stores both data and metadata in HDFS. See FileSystemDatasetRepository for more information.

Hive/HCatalog URIs

hive will connect to the Hive MetaStore. Datasets are stored as managed tables, with their locations determined by Hive.

hive:/[path] will also connect to the Hive MetaStore, but tables will be external and stored under [path]. The repository storage layout will be the same as hdfs and file repositories. HDFS connection options can be supplied by adding hdfs-host and hdfs-port query options to the URI (see examples).

Examples

repo:file:foo/bar Store data+metadata on the local filesystem in the directory ./foo/bar.
repo:file:///data Store data+metadata on the local filesystem in the directory /data.
repo:hdfs://localhost:8020/data Store data+metadata in HDFS under /data, using the NameNode at localhost:8020.
repo:hive Connects to the Hive MetaStore and creates managed tables.
repo:hive:/path?hdfs-host=localhost&hdfs-port=8020 Connects to the Hive MetaStore and creates external tables stored in hdfs://localhost:8020/path.
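
The HDFS and Hive example URIs above, shown in code (host, port, and paths are the placeholders from the list):

import com.cloudera.cdk.data.DatasetRepositories;
import com.cloudera.cdk.data.DatasetRepository;

public class RepositoryUriExamples {
  public static void main(String[] args) {
    // Data and metadata stored in HDFS under /data, using the NameNode at localhost:8020.
    DatasetRepository hdfs = DatasetRepositories.open("repo:hdfs://localhost:8020/data");

    // Managed Hive tables via the Hive MetaStore; locations chosen by Hive.
    DatasetRepository hiveManaged = DatasetRepositories.open("repo:hive");

    // External Hive tables with data stored in hdfs://localhost:8020/path.
    DatasetRepository hiveExternal =
        DatasetRepositories.open("repo:hive:/path?hdfs-host=localhost&hdfs-port=8020");
  }
}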

Parameters:
repositoryUri - The repository URI
Returns:
An appropriate implementation of DatasetRepository
Since:
0.8.0


Copyright © 2013 Cloudera. All rights reserved.