The Examples Module is a collection of examples for the CDK.
This example shows basic usage of the CDK Data API for performing streaming writes to (and reads from) a dataset.
From the examples module, build with:
mvn compile
Then create the dataset with:
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.CreateUserDatasetPojo"
You can look at the files that were created with:
find /tmp/data
Read the entities in the dataset:
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.ReadUserDatasetPojo"
Finally, drop the dataset:
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.DropUserDataset"
The previous examples used POJOs, since they are the most familiar data transfer objects for most Java programmers. Avro also supports generic records, which are more efficient because they don't require reflection, and which don't require either the reader or the writer to have the POJO class available.
Run the following to use the generic writer and reader:
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.CreateUserDatasetGeneric"
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.ReadUserDatasetGeneric"
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.DropUserDataset"
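To make the POJO versus generic distinction concrete, here is a minimal sketch that uses only the Java standard library. It does not use Avro or the CDK API. The class `RecordStyles`, the `User` POJO, and its field names are hypothetical and chosen for illustration. The POJO path discovers fields through reflection and so needs the class to be loadable, while the generic path names fields with plain strings.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

public class RecordStyles {
    // Hypothetical POJO; the field names are illustrative, not taken from the CDK examples.
    public static class User {
        public String username;
        public String favoriteColor;

        public User(String username, String favoriteColor) {
            this.username = username;
            this.favoriteColor = favoriteColor;
        }
    }

    // POJO path: fields are discovered by reflection, so the class must be on the classpath.
    static Map<String, Object> fromPojo(Object pojo) {
        Map<String, Object> fields = new TreeMap<>();
        try {
            for (Field f : pojo.getClass().getFields()) {
                fields.put(f.getName(), f.get(pojo));
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return fields;
    }

    // Generic path: fields are addressed by name, with no reflection and no POJO class.
    static Map<String, Object> generic(String username, String favoriteColor) {
        Map<String, Object> fields = new TreeMap<>();
        fields.put("username", username);
        fields.put("favoriteColor", favoriteColor);
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(fromPojo(new User("alice", "green")));
        System.out.println(generic("alice", "green"));
    }
}
```

Both methods produce the same field-name-to-value mapping. Only the first depends on the `User` class being available at runtime.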
The API supports partitioning, so that records are written to different partition files according to the value of particular partition fields.
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.CreateUserDatasetGenericPartitioned"
find /tmp/data # see how partitioning affects the data layout
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.ReadUserDatasetGeneric"
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.ReadUserDatasetGenericOnePartition"
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.DropUserDataset"
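The idea of routing records by the value of a partition field can be sketched with the standard library alone. This is not the CDK partitioning API. The class `PartitionSketch`, the hash-based bucketing, the bucket count, and the `username=N` directory naming are all assumptions made for illustration; the layout that `find /tmp/data` shows may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {
    // Map a partition-field value to one of `buckets` partitions (always in 0..buckets-1).
    static int partitionFor(String fieldValue, int buckets) {
        return Math.floorMod(fieldValue.hashCode(), buckets);
    }

    // Group usernames by the (hypothetical) partition directory they would be written to.
    static Map<String, List<String>> layout(List<String> usernames, int buckets) {
        Map<String, List<String>> byPartition = new TreeMap<>();
        for (String name : usernames) {
            String dir = "username=" + partitionFor(name, buckets);
            byPartition.computeIfAbsent(dir, k -> new ArrayList<>()).add(name);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("alice", "bob", "carol", "dave");
        layout(users, 4).forEach((dir, names) -> System.out.println(dir + " -> " + names));
    }
}
```

Because the partition is a pure function of the field value, a reader that wants a single user's records only needs to open the one partition that value maps to. That is the kind of access the one-partition read example demonstrates.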
Run the equivalent Scala example with:
scala -cp "$(mvn dependency:build-classpath | grep -v '^\[')" src/main/scala/createpojo.scala
Or for the generic example:
scala -cp "$(mvn dependency:build-classpath | grep -v '^\[')" src/main/scala/creategeneric.scala
The Java examples can be used to read (and drop) the dataset written from Scala:
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.ReadUserDatasetGeneric"
mvn exec:java -Dexec.mainClass="com.cloudera.cdk.examples.data.DropUserDataset"