Working with datasets

The cdk:create-dataset and cdk:drop-dataset goals create and drop datasets. Configure the Hadoop settings by declaring the plugin in your POM:

<project>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>com.cloudera.cdk</groupId>
        <artifactId>cdk-maven-plugin</artifactId>
        <version>0.9.1</version>
        <configuration>
          <hadoopConfiguration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://localhost</value>
            </property>
            <property>
              <name>hive.metastore.uris</name>
              <value>thrift://localhost:9083</value>
            </property>
          </hadoopConfiguration>
        </configuration>
      </plugin>
    </plugins>
  </build>
   ...
</project>

If you are using the default settings (local file system and local Hive metastore), you can omit the configuration element.

To create a new dataset, run:

mvn cdk:create-dataset \
  -Dcdk.rootDirectory=/tmp/data \
  -Dcdk.datasetName=mydataset \
  -Dcdk.avroSchemaFile=myschema.avsc

The avroSchemaFile property specifies a local Avro schema file that defines the dataset's schema.
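
For example, a hypothetical myschema.avsc defining a simple two-field record might look like the following (the record and field names are illustrative only):

{
  "type": "record",
  "name": "MyRecord",
  "namespace": "org.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}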

To drop a dataset, run:

mvn cdk:drop-dataset \
  -Dcdk.rootDirectory=/tmp/data \
  -Dcdk.datasetName=mydataset

Launching jobs locally with cdk:run-tool

This goal is used to run a Hadoop Tool. The Tool’s run() method is executed in the same local VM as Maven; however, it is common for the Tool to launch distributed processes, such as MapReduce jobs that run on a cluster.

<project>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>com.cloudera.cdk</groupId>
        <artifactId>cdk-maven-plugin</artifactId>
        <version>0.9.1</version>
        <configuration>
          <toolClass>org.example.ToolImplementation</toolClass>
          <!-- optional -->
          <args>
            <arg>arg1</arg>
            <arg>arg2</arg>
          </args>
          <hadoopConfiguration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://localhost</value>
            </property>
            <property>
              <name>mapred.job.tracker</name>
              <value>localhost:8021</value>
            </property>
          </hadoopConfiguration>
        </configuration>
      </plugin>
    </plugins>
  </build>
   ...
</project>

Run the tool using:

mvn cdk:run-tool
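
The configured toolClass must name a class on the runtime classpath that implements Hadoop's Tool interface. As a point of reference, a minimal sketch of the org.example.ToolImplementation placeholder used above might look like this (the class body is illustrative only):

package org.example;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolImplementation extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // The plugin invokes run() with the configured args; a real tool would
    // typically set up and submit a MapReduce job using getConf().
    System.out.println("Running with " + args.length + " argument(s)");
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ToolImplementation(), args));
  }
}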

Understanding the classpath

The classpath for the local VM is made up of the runtime classpath (all dependencies in the compile and runtime scopes). Hadoop libraries are provided by the plugin, so there is no need to include Hadoop in the compile and runtime scopes (and indeed doing so may cause undefined behaviour).

The classpath for distributed processes is made up of the runtime classpath, unless cdk.addDependenciesToDistributedCache is set to false (the default is true), in which case no dependencies are included in the distributed classpath.

This makes it very convenient to run distributed jobs, since all runtime dependencies are automatically included in the Tool classpath and the MapReduce task classpath.
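
If you do want to opt out of shipping the runtime dependencies to the cluster, the property named above can presumably be set on the command line like the other cdk.* properties, for example:

mvn cdk:run-tool -Dcdk.addDependenciesToDistributedCache=false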

Launching jobs from the cluster

There are three goals for building and running jobs on the cluster:

  • cdk:package-app builds a packaged application locally (in the Oozie package format)
  • cdk:deploy-app deploys the packaged application to the cluster
  • cdk:run-app runs the deployed application as an Oozie job

A packaged application includes an Oozie workflow file, an optional Oozie coordinator file, and the dependencies on the runtime classpath. The workflow file may be generated from the plugin configuration. The following example shows how to run the previous example from the cluster, by adding properties for deployFileSystem and oozieUrl, and an executions section to bind cdk:package-app to the package phase of the Maven lifecycle.

<project>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>com.cloudera.cdk</groupId>
        <artifactId>cdk-maven-plugin</artifactId>
        <version>0.9.1</version>
        <configuration>
          <toolClass>org.example.ToolImplementation</toolClass>
          <deployFileSystem>hdfs://localhost/</deployFileSystem>
          <oozieUrl>http://localhost:11000/oozie</oozieUrl>
          <!-- optional -->
          <args>
            <arg>arg1</arg>
            <arg>arg2</arg>
          </args>
          <hadoopConfiguration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://localhost</value>
            </property>
            <property>
              <name>mapred.job.tracker</name>
              <value>localhost:8021</value>
            </property>
          </hadoopConfiguration>
        </configuration>
        <executions>
          <execution>
            <id>make-app</id>
            <phase>package</phase>
            <goals>
              <goal>package-app</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
   ...
</project>

Build and run with:

mvn package cdk:deploy-app
mvn cdk:run-app
