CDK Morphlines Parent -

Cloudera Morphlines Reference Guide - Table of Contents

Cloudera Morphlines provides a set of frequently-used high-level transformation and I/O commands that can be combined in application specific ways. This chapter outlines the currently available commands.

Introduction

Note: Perhaps the most important property of the Morphlines framework is how easy it is to add new transformations and I/O commands and integrate existing functionality and third party systems. If none of the existing commands match your use case, you can easily write your own command and plug it in. Simply implement the Java interface Command or subclass AbstractCommand and add the resulting Java class to the classpath. No registration or other administrative action is required. Here are links to two sample command implementations:

toString
readLine

Indeed, none of the standard commands are special or intrinsically known per se. All commands are implemented like this, even including standard commands such as pipe, if, and tryRules. This means your custom commands can even replace any standard commands, if desired.

cdk-morphlines-core-stdio

This module contains standard I/O commands for tasks such as acting on single-line records, multi-line records, CSV files, and for converting bytes to strings.

readClob

The readClob command converts bytes to strings. It emits one record for the entire input stream of the first attachment, interpreting the stream as a Character Large Object (CLOB). The line is put as a string into the message output field.

The command provides the following configuration options:

Property Name	Default	Description
supportedMimeTypes	null	Optionally, require the input record to match one of the MIME types in this list.
charset	null	The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead.

Example usage:

readClob {
  charset : UTF-8
}

readCSV

The readCSV command extracts zero or more records from the input stream of the first attachment of the record, representing a Comma Separated Values (CSV) file.

For the format see the wikipedia entry for Comma-separated values, the article on the Creativyst software website on The Comma Separated Value (CSV) File Format and the Ricebridge CSV Manager Demonstration

Some CSV files contain a header line that contains embedded column names. This command does not support reading and using such embedded column names as output field names because this is considered unreliable for production systems. If the first line of the CSV file is a header line, you must set the ignoreFirstLine option to true. You must explicitly define the columns configuration parameter in order to name the output fields.

Note: A quoted field can span multiple lines in the input stream.

The command provides the following configuration options:

Property Name	Default	Description
supportedMimeTypes	null	Optionally, require the input record to match one of the MIME types in this list.
separator	","	The character separating any two fields.
columns	n/a	The name of the output fields for each input column. An empty string indicates omit this column in the output. If more columns are contained in the input than specified here, those columns are automatically named `columnN`.
ignoreFirstLine	false	Whether to ignore the first line. This flag can be used for CSV files that contain a header line.
trim	true	Whether leading and trailing whitespace shall be removed from the output fields.
charset	null	The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead.

Example usage for CSV (Comma Separated Values):

readCSV {
  separator : ","
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  trim : true
  charset : UTF-8
}

Example usage for TSV (Tab Separated Values):

readCSV {
  separator : "\t"
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  trim : true
  charset : UTF-8
}

Example usage for SSV (Space Separated Values):

readCSV {
  separator : " "
  columns : [Age,"",Extras,Type]
  ignoreFirstLine : false
  trim : true
  charset : UTF-8
}

readLine

The readLine command emits one record per line in the input stream of the first attachment. The line is put as a string into the message output field. Empty lines are ignored.

The command provides the following configuration options:

Property Name	Default	Description
supportedMimeTypes	null	Optionally, require the input record to match one of the MIME types in this list.
ignoreFirstLine	false	Whether to ignore the first line. This flag can be used for CSV files that contain a header line.
commentPrefix	""	A character that indicates to ignore this line as a comment for example, "#".
charset	null	The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead.

Example usage:

readLine {
  ignoreFirstLine : true
  commentPrefix : "#"
  charset : UTF-8
}

readMultiLine

The readMultiLine command is a multiline log parser that collapse multiline messages into a single record. It supports regex, what, and negate configuration parameters similar to logstash. The line is put as a string into the message output field.

For example, this can be used to parse log4j with stack traces. Also see https://gist.github.com/smougenot/3182192 and http://logstash.net/docs/1.1.13/filters/multiline.

The command provides the following configuration options:

Property Name	Default	Description
supportedMimeTypes	null	Optionally, require the input record to match one of the MIME types in this list.
regex	n/a	This parameter should match what you believe to be an indicator that the line is part of a multi-line record.
what	previous	This parameter must be one of "previous" or "next" and indicates the relation of the regex to the multi-line record.
negate	false	This parameter can be true or false. If true, a line not matching the regex constitutes a match of the multiline filter and the previous or next action is applied. The reverse is also true.
charset	null	The character encoding to use, for example, UTF-8. If none is specified the charset specified in the _attachment_charset input field is used instead.

Example usage:

# parse log4j with stack traces
readMultiLine {
  regex : "(^.+Exception: .+)|(^\\s+at .+)|(^\\s+\\.\\.\\. \\d+ more)|(^\\s*Caused by:.+)"
  what : previous
  charset : UTF-8
}

# parse sessions; begin new record when we find a line that starts with "Started session"
readMultiLine {
  regex : "Started session.*"
  what : next
  charset : UTF-8
}

cdk-morphlines-core-stdlib

This module contains standard transformation commands, such as commands for flexible log file analysis, regular expression based pattern matching and extraction, operations on fields for assignment and comparison, operations on fields with list and set semantics, if-then-else conditionals, string and timestamp conversions, scripting support for dynamic java code, a small rules engine, logging, and metrics and counters.

addValues

The addValues command takes a set of outputField : values pairs and performs the following steps: For each output field, adds the given values to the field. The command can fetch the values of a record field using a field expression, which is a string of the form @\fieldname\{}.

Example usage:

addValues {
  # add values "text/log" and "text/log2" to the source_type output field
  source_type : [text/log, text/log2]

  # add integer 123 to the pid field
  pid : [123]

  # add all values contained in the first_name field to the name field
  name : "@{first_name}"
}

addValuesIfAbsent

The addValuesIfAbsent command is the same as the addValues command, except that a given value is only added to the output field if it is not already contained in the output field.

Example usage:

addValuesIfAbsent {
  # add values "text/log" and "text/log2" to the source_type output field
  # unless already present
  source_type : [text/log, text/log2]

  # add integer 123 to the pid field, unless already present
  pid : [123]

  # add all values contained in the first_name field to the name field
  # unless already present
  name : "@{first_name}"
}

callParentPipe

The callParentPipe command routes records to the enclosing pipe object. Recall that a morphline is a pipe. Thus, unless a morphline contains nested pipes, the parent pipe of a given command is the morphline itself, meaning that the first command of the morphline is called with the given record. Thus, the callParentPipe command effectively implements recursion, which is useful for extracting data from container file formats in elegant and concise ways. For example, you could use this to extract data from tar.gz files. This command is typically used in combination with the commands detectMimeType, tryRules, decompress, unpack, and possibly solrCell.

Example usage:

callParentPipe {}

For a real world example, see the solrCell command.

contains

The contains command succeeds if one of the field values of the given named field is equal to one of the the given values, and fails otherwise. Multiple fields can be named, in which case the results are ANDed.

Example usage:

# succeed if the _attachment_mimetype field contains a value "avro/binary"
# fail otherwise
contains { _attachment_mimetype : [avro/binary] }

# succeed if the tags field contains a value "version1" or "version2",
# fail otherwise
contains { tags : [version1, version2] }

convertTimestamp

The convertTimestamp command converts the timestamps in a given field from one of a set of input date formats (in an input timezone) to an output date format (in an output timezone), while respecting daylight savings time rules. The command provides reasonable defaults for common use cases.

The command provides the following configuration options:

Property Name	Default	Description
field	timestamp	The name of the field to convert.
inputFormats	A list of common input date formats	A list of SimpleDateFormat or "unixTimeInMillis" or "unixTimeInSeconds". "unixTimeInMillis" and "unixTimeInSeconds" indicate the difference, measured in milliseconds and seconds, respectively, between a timestamp and midnight, January 1, 1970 UTC. Multiple input date formats can be specified. If none of the input formats match the field value then the command fails.
inputTimezone	UTC	The time zone to assume for the input timestamp.
inputLocale	""	The Java Locale to assume for the input timestamp.
outputFormat	"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"	The SimpleDateFormat to which to convert. Can also be "unixTimeInMillis" or "unixTimeInSeconds". "unixTimeInMillis" and "unixTimeInSeconds" indicate the difference, measured in milliseconds and seconds, respectively, between a timestamp and midnight, January 1, 1970 UTC.
outputTimezone	UTC	The time zone to assume for the output timestamp.
outputLocale	""	The Java Locale to assume for the output timestamp.

Example usage:

# convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
# The input may match one of "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
# or "yyyy-MM-dd'T'HH:mm:ss" or "yyyy-MM-dd".
convertTimestamp {
  field : timestamp
  inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
  inputTimezone : America/Los_Angeles
  outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
  outputTimezone : UTC
}

dropRecord

The dropRecord command silently consumes records without ever emitting any record. This is much like piping to /dev/null.

Example usage:

dropRecord {}

equals

The equals command succeeds if all field values of the given named fields are equal to the the given values and fails otherwise. Multiple fields can be named, in which case a logical AND is applied to the results.

Example usage:

# succeed if the _attachment_mimetype field contains a value "avro/binary"
# fail otherwise
equals { _attachment_mimetype : [avro/binary] }

# succeed if the tags field contains nothing but the values "version1"
# and "highPriority", in that order, fail otherwise
equals { tags : [version1, highPriority] }

generateUUID

The generateUUID command sets a universally unique identifier on all records that are intercepted. An example UUID is b5755073-77a9-43c1-8fad-b7a586fc1b97, which represents a 128bit value.

The command provides the following configuration options:

Property Name	Default	Description
field	id	The name of the field to set.
preserveExisting	true	Whether to preserve the field value if one is already present.
prefix	""	The prefix string constant to prepend to each generated UUID.

Example usage:

generateUUID {
  field : my_id
}

grok

The grok command uses regular expression pattern matching to extract structured fields from unstructured log data.

This is well suited for syslog logs, apache, and other webserver logs, mysql logs, and in general, any log format that is generally written for humans and not computer consumption.

A grok command can load zero or more dictionaries. A dictionary is a file or string that contains zero or more REGEXNAME to REGEX mappings, one per line, separated by space, for example:

INT (?:[+-]?(?:[0-9]+))
HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)

For example, the regex named "INT" is associated with the following pattern:

[+-]?(?:[0-9]+)

\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)

Morphlines ships with several standard grok dictionaries.

A grok command can contain zero or more grok expressions. Each grok expression refers to a record input field name and can contain zero or more grok patterns. The following is an example grok expression that refers to the input field named "message" and contains two grok patterns:

expressions : {
  message : """\s+%{INT:pid} %{HOSTNAME:my_name_servers}"""
}

The syntax for a grok pattern is

%{REGEX_NAME:GROUP_NAME}

for example

%{INT:pid}

%{HOSTNAME:my_name_servers}

The REGEXNAME is the name of a regex within a loaded dictionary.

The GROUPNAME is the name of an output field.

If all expressions of the grok command match the input record, then the command succeeds and the content of the named capturing group is added to this output field of the output record. Otherwise, the record remains unchanged and the grok command fails, causing backtracking of the command chain.

Note: The morphline configuration file is implemented using the HOCON format (Human Optimized Config Object Notation). HOCON is basically JSON slightly adjusted for the configuration file use case. HOCON syntax is defined at HOCON github page and as such, multi-line strings are similar to Python or Scala, using triple quotes. If the threecharacter sequence """ appears, then all Unicode characters until a closing """ sequence are used unmodified to create a string value.

In addition, the grok command supports the following parameters:

Property Name	Default	Description
dictionaryFiles	[]	A list of zero or more local files or directory trees from which to load dictionaries.
dictionaryString	null	An optional inline string from which to load a dictionary.
extract	true	Can be "false", "true", or "inplace". Add the content of named capturing groups to the input record ("inplace"), to a copy of the input record ("true"), or to no record ("false").
numRequiredMatches	atLeastOnce	Indicates the minimum and maximum number of field values that must match a given grok expression for each input field name. Can be "atLeastOnce" (default), "once", or "all".
findSubstrings	false	Indicates whether the grok expression must match the entire input field value or merely a substring within.
addEmptyStrings	false	Indicates whether zero length strings stemming from empty (but matching) capturing groups shall be added to the output record.

Example usage:

# index syslog formatted files
grok {
  dictionaryFiles : [target/test-classes/grok-dictionaries]
  dictionaryString : """
    XUUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}
  """

  expressions : {
    message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""

    #message2 : "(?<queue_field>.*)"
    #message4 : "%{NUMBER:queue_field}"
  }
  extract : true
  numRequiredMatches : all # default is atLeastOnce
  findSubstrings : false
  addEmptyStrings : false
}

More example usage:

# Split a line on one or more whitespace into substrings,
# and add the substrings to the {{columns}} output field.
grok {
  expressions : {
    message : """(?<columns>.+?)(\s+|\z)"""
  }
  findSubstrings : true
}

Note: An easy way to test grok out is to use the online grok debugger from the logstash project.

if

The if command consists of a chain of zero or more conditions commands, as well as an optional chain of zero or or more commands that are processed if all conditions succeed ("then commands"), as well as an optional chain of zero or more commands that are processed if one of the conditions fails ("else commands").

If one of the commands in the then chain or else chain fails, then the entire if command fails and any remaining commands in the then or else branch are skipped.

The command provides the following configuration options:

Property Name	Default	Description
conditions	[]	A list of zero or more commands.
then	[]	A list of zero or more commands.
else	[]	A list of zero or more commands.

Example usage:

if {
  conditions : [
    { contains { _attachment_mimetype : [avro/binary] } }
  ]
  then : [
    { logInfo { format : "processing then..." } }
  ]
  else : [
    { logInfo { format : "processing else..." } }
  ]
}

java

The java command compiles and executes the given Java code block, wrapped into a Java method with a Boolean return type and several parameters, along with a Java class definition that contains the given import statements.

The parameters passed to the code block are record, config, parent, child, context, logger and are of types Record, Config, Command, Command, MorphlineContext, Logger, respectively.

Compilation is done in main memory, meaning without writing to the filesystem.

The result is an object that can be executed (and reused) any number of times. This is a high performance implementation, using an optimized variant of https://scripting.dev.java.net/" (JSR 223 Java Scripting). Calling eval() just means calling Method.invoke(), and, as such, has the same minimal runtime cost. As a result of the low cost, this command can be called on the order of 100 million times per second per CPU core on industry standard hardware.

The command provides the following configuration options:

Property Name	Default	Description
imports	A default list sufficient for typical usage.	A string containing zero or more Java import declarations.
code	[]	A Java code block as defined in the Java language specification. Must return a Boolean value.

Example usage:

java {
  imports : "import java.util.*;"
  code: """
    List tags = record.get("tags");
    if (!tags.contains("hello")) {
      return false;
    }
    tags.add("world");
    return child.process(record);
        """
}

logTrace, logDebug, logInfo, logWarn, logError

These commands log a message to slf4j at the given log level. The command can fetch the values of a record field using a field expression, which is a string of the form @\fieldname\{}. The special field expression @{} can be used to log the entire record.

Example usage:

# log the entire record at DEBUG level to SLF4J
logDebug { format : "my record: {}", args : ["@{}"] }

More example usage:

# log the timestamp field and the entire record at INFO level to SLF4J
logInfo {
  format : "timestamp: {}, record: {}"
  args : ["@{timestamp}", "@{}"]
}

not

The not command consists of one nested command, the Boolean return value of which is inverted.

Example usage:

if {
  conditions : [
    {
      not {
        grok {
          ... some grok expressions go here
        }
      }
    }
  ]
  then : [
    { logDebug { format : "found no grok match: {}", args : ["@{}"] } }
    { dropRecord {} }
  ]
  else : [
    { logDebug { format : "found grok match: {}", args : ["@{}"] } }
  ]
}

pipe

The pipe command has an identifier and contains a chain of zero or more commands, through which records get piped. A command transforms the record into zero or more records. The output records of a command are passed to the next command in the chain. A command has a Boolean return code, indicating success or failure. If any command in the pipe fails (meaning that it returns false), the whole pipe fails (meaning that it returns false), which causes backtracking of the command chain.

Because a pipe is itself a command, a pipe can contain arbitrarily nested pipes. A morphline is a pipe. "Morphline" is simply another name for the pipe at the root of the command tree.

The command provides the following configuration options:

Property Name	Default	Description
id	n/a	An identifier for this pipe.
importCommands	[]	A list of zero or more import specifications, each of which makes all morphline commands that match the specification visible to the morphline. A specification can import all commands in an entire Java package tree (specification ends with ".**"), all commands in a Java package (specification ends with "."), or the command of a specific fully qualified Java class (all other specifications). Other commands present on the classpath are not visible to this morphline.
commands	[]	A list of zero or more commands.

Example usage demonstrating a pipe with two commands, namely addValues and logDebug:

pipe {
  id : my_pipe

  # Import all commands in these java packages, subpackages and classes.
  # Other commands on the classpath are not visible to this morphline.
  importCommands : [
    "com.cloudera.**",   # package and all subpackages
    "org.apache.solr.**", # package and all subpackages
    "com.mycompany.mypackage.*", # package only
    "com.cloudera.cdk.morphline.stdlib.GrokBuilder" # fully qualified class
  ]

  commands : [
    { addValues { foo : bar }}
    { logDebug { format : "output record: {}", args : ["@{}"] } }
  ]
}

separateAttachments

The separateAttachments command emits one output record for each attachment in the input record's list of attachments. The result is many records, each of which has at most one attachment.

Example usage:

separateAttachments {}

setValues

The setValues command is the same as the addValues command, except that it first removes all values from the given output field, and then it adds new values.

Example usage:

setValues {
  # assign values "text/log" and "text/log2" to source_type output field
  source_type : [text/log, text/log2]

  # assign the integer 123 to the pid field
  pid : [123]

  # assign all values contained in the first_name field to the name field
  name : "@{first_name}"
}

toString

The toString command converts the Java objects in a given field using the Object.toString() method to their string representation.

Example usage:

toString { field : source_type }

tryRules

The tryRules command consists of zero or more rules. A rule consists of zero or more commands.

The rules of a tryRules command are processed in topdown order. If one of the commands in a rule fails, the tryRules command stops processing this rule, backtracks and tries the next rule, and so on, until a rule is found that runs all its commands to completion without failure (the rule succeeds). If a rule succeeds, the remaining rules of the current tryRules command are skipped. If no rule succeeds the record remains unchanged, but a warning may be issued or an exception may be thrown.

Because a tryRules command is itself a command, a tryRules command can contain arbitrarily nested tryRules commands. By the same logic, a pipe command can contain arbitrarily nested tryRules commands and a tryRules command can contain arbitrarily nested pipe commands. This helps to implement complex functionality for advanced usage.

The command provides the following configuration options:

Property Name	Default	Description
catchExceptions	false	Whether Java exceptions thrown by a rule shall be caught, with processing continuing with the next rule (`true`), or whether such exceptions shall not be caught and consequently propagate up the call chain (`false`).
throwExceptionIfAllRulesFailed	true	Whether to throw a Java exception if no rule succeeds.

Example usage:

tryRules {
  catchExceptions : false
  throwExceptionIfAllRulesFailed : true
  rules : [
    {
      commands : [
        { contains { _attachment_mimetype : [avro/binary] } }
        ... handle Avro data here
        { logDebug { format : "output record: {}", args : ["@{}"] } }
      ]
    }

    {
      commands : [
        { contains { _attachment_mimetype : [text/csv] } }
        ... handle CSV data here
        { logDebug { format : "output record: {}", args : ["@{}"] } }
      ]
    }
  ]
}

cdk-morphlines-avro

This module contains morphline commands for reading, extracting, and transforming Avro files and Avro objects.

readAvroContainer

The readAvroContainer command parses an InputStream or byte array that contains Avro binary container file data. For each Avro datum, the command emits a morphline record containing the datum as an attachment in the field _attachment_body.

The Avro schema that was used to write the Avro data is retrieved from the Avro container. Optionally, the Avro schema that shall be used for reading can be supplied with a configuration option; otherwise it is assumed to be the same as the writer schema.

Note: Avro uses Schema Resolution if the two schemas are different, e.g. if the reader schema is a subset of the writer schema.

The input stream or byte array is read from the first attachment of the input record.

The command provides the following configuration options:

Property Name	Default	Description
supportedMimeTypes	null	Optionally, require the input record to match one of the MIME types in this list.
readerSchemaFile	null	An optional Avro schema file in JSON format on the local file system to use for reading.
readerSchemaString	null	An optional Avro schema in JSON format given inline to use for reading.

Example usage:

# Parse Avro container file and emit a record for each avro object
readAvroContainer {
  # Optionally, require the input to match one of these MIME types:
  # supportedMimeTypes : [avro/binary]

  # Optionally, use this Avro schema in JSON format inline for reading:
  # readerSchemaString : """<json can go here>"""

  # Optionally, use this Avro schema file in JSON format for reading:
  # readerSchemaFile : /path/to/syslog.avsc
}

readAvro

The readAvro command is the same as the readAvroContainer command except that the Avro schema that was used to write the Avro data must be explicitly supplied to the readAvro command because it expects raw Avro data without an Avro container and hence without a builtin writer schema.

Optionally, the Avro schema that shall be used for reading can be supplied with a configuration option; otherwise it is assumed to be the same as the writer schema.

Note: Avro uses Schema Resolution if the two schemas are different, e.g. if the reader schema is a subset of the writer schema.

The command provides the following configuration options:

Property Name	Default	Description
supportedMimeTypes	null	Optionally, require the input record to match one of the MIME types in this list.
readerSchemaFile	null	An optional Avro schema file in JSON format on the local file system to use for reading.
readerSchemaString	null	An optional Avro schema in JSON format given inline to use for reading.
writerSchemaFile	null	The Avro schema file in JSON format that was used to write the Avro data.
writerSchemaString	null	The Avro schema file in JSON format that was used to write the Avro data given inline.
isJson	false	Whether the Avro input data is encoded as JSON or binary.

Example usage:

# Parse Avro and emit a record for each avro object
readAvro {
  # supportedMimeTypes : [avro/binary]
  # readerSchemaString : """<json can go here>"""
  # readerSchemaFile : test-documents/sample-statuses-20120906-141433-subschema.avsc
  # writerSchemaString : """<json can go here>"""
  writerSchemaFile : test-documents/sample-statuses-20120906-141433.avsc
}

extractAvroTree

The extractAvroTree command converts an attached Avro datum to a morphline record by recursively walking the Avro tree and extracting all data into a single morphline record, with fields named by their path in the Avro tree.

The Avro input object is expected to be contained in the field \attachmentbody, and typically placed there by an upstream readAvroContainer or readAvro command.

This kind of mapping is useful for simple Avro schemas, but for more complex schemas, this approach may be overly simplistic and expensive.

The command provides the following configuration options:

Property Name	Default	Description
outputFieldPrefix	""	A string to be prepended to each output field name.

Example usage:

extractAvroTree {
  outputFieldPrefix : ""
}

extractAvroPaths

The extractAvroPaths command uses zero or more Avro path expressions to extract values from an Avro object.

The Avro input object is expected to be contained in the field _attachment_body, and typically placed there by an upstream readAvroContainer or readAvro command.

Each path expression consists of a record output field name (on the left side of the colon ':') as well as zero or more path steps (on the right hand side), each path step separated by a '/' slash, akin to a simple form of XPath. Avro arrays are traversed with the '[]' notation.

The result of a path expression is a list of objects, each of which is added to the given record output field.

The path language supports all Avro concepts, including such concepts as nested structures, records, arrays, maps, and unions. Path language supports a flatten option that collects the primitives in a subtree into a flat output list.

The command provides the following configuration options:

Property Name	Default	Description
flatten	true	Whether to collect the primitives in a subtree into a flat output list.
paths	[]	Zero or more Avro path expressions.

Example usage:

extractAvroPaths {
  flatten : true
  paths : {
    my_price : /price

    my_docId : /docId
    my_links_backward : "/links/backward"
    my_links_forward : "/links/forward"
    my_name_language_code : "/name[]/language[]/code"
    my_name_language_country : "/name[]/language[]/country"

    /mymapField/foo/label : /mapField/foo/label/
  }
}

cdk-morphlines-tika-core

This module contains morphline commands for autodetecting MIME types from binary data. Depends on tika-core.

detectMimeType

The detectMimeType command uses Apache Tika to autodetect the MIME type of the first attachment from the binary data. The detected MIME type is assigned to the _attachment_mimetype field.

The command provides the following configuration options:

Property Name	Default	Description
includeDefaultMimeTypes	true	Whether to include the Tika default MIME types file that ships embedded in tikacore.jar (see http://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml)
mimeTypesFiles	[]	The relative or absolute path of zero or more Tika custom-mimetypes.xml files to include.
mimeTypesString	null	The content of an optional custom-mimetypes.xml file embedded directly inside of this morphline configuration file.
preserveExisting	true	Whether to preserve the _attachment_mimetype field value if one is already present.
includeMetaData	false	Whether to pass the record fields to Tika to assist in MIME type detection.
excludeParameters	true	Whether to remove MIME parameters from output MIME type.

Example usage:

detectMimeType {
  includeDefaultMimeTypes : false
  #mimeTypesFiles : [src/test/resources/custom-mimetypes.xml]
  mimeTypesString :
    """
    <mime-info>
      <mime-type type="text/space-separated-values">
        <glob pattern="*.ssv"/>
      </mime-type>

      <mime-type type="avro/binary">
        <magic priority="50">
          <match value="0x4f626a01" type="string" offset="0"/>
        </magic>
        <glob pattern="*.avro"/>
      </mime-type>

      <mime-type type="mytwittertest/json+delimited+length">
        <magic priority="50">
          <match value="[0-9]+(\r)?\n\\{&quot;" type="regex" offset="0:16"/>
        </magic>
      </mime-type>
    </mime-info>
    """
}

cdk-morphlines-tika-decompress

This module contains morphline commands for decompressing and unpacking files. Depends on tika-core and commons-compress.

decompress

The decompress command decompresses the first attachment, and supports gzip and bzip2 format.

Example usage:

decompress {}

unpack

The unpack command unpacks the first attachment, and supports tar, zip, and jar format. The command emits one record per contained file.

Example usage:

unpack {}

cdk-morphlines-solr-core

This module contains morphline commands for Solr that higher level modules such as cdk-morphlines-solr-cell, search-mr, and search-flume depend on for indexing.

solrLocator

A solrLocator is a set of configuration parameters that identify the location and schema of a Solr server or SolrCloud. Based on this information a morphline Solr command can fetch the Solr index schema and send data to Solr. A solrLocator is not actually a command but rather a common parameter of many morphline Solr commands, and thus described separately here.

Example usage:

solrLocator : {
  # Name of solr collection
  collection : collection1

  # ZooKeeper ensemble
  zkHost : "127.0.0.1:2181/solr"

  # Max number of documents to pass per RPC from morphline to Solr Server
  # batchSize : 100
}

loadSolr

The loadSolr command loads a record into a Solr server or MapReduce Reducer.

The command provides the following configuration options:

Property Name	Default	Description
solrLocator	n/a	Solr location parameters as described separately above.

Example usage:

  solrLocator : {
    # Name of solr collection
    collection : collection1

    # ZooKeeper ensemble
    zkHost : "127.0.0.1:2181/solr"

    # Max number of docs to pass per RPC from morphline to Solr Server
    # batchSize : 100
  }
}

generateSolrSequenceKey

The generateSolrSequenceKey command assigns a record unique key that is the concatenation of the given baseIdField record field, followed by a running count of the record number within the current session. The count is reset to zero whenever a startSession notification is received.

For example, assume a CSV file containing multiple records but no unique ids, and the .

The name of the unique key field is fetched from Solr's schema.xml file, as directed by the solrLocator configuration parameter.

The command provides the following configuration options:

Property Name	Default	Description
solrLocator	n/a	Solr location parameters as described separately above.
baseIdField	baseid	The name of the input field to use for prefixing keys.
preserveExisting	true	Whether to preserve the field value if one is already present.

Example usage:

generateSolrSequenceKey {
  baseIdField: ignored_base_id
  solrLocator : ${SOLR_LOCATOR}
}

sanitizeUnknownSolrFields

The sanitizeUnknownSolrFields command sanitizes record fields that are unknown to Solr schema.xml by either deleting them (renameToPrefix parameter is absent or a zero length string) or by moving them to a field prefixed with the given renameToPrefix (for example, to use typical dynamic Solr fields).

Recall that Solr throws an exception on any attempt to load a document that contains a field that is not specified in schema.xml.

The command provides the following configuration options:

Property Name	Default	Description
solrLocator	n/a	Solr location parameters as described separately above.
renameToPrefix	""	Output field prefix for unknown fields.

Example usage:

sanitizeUnknownSolrFields {
  solrLocator : ${SOLR_LOCATOR}
}

cdk-morphlines-solr-cell

This module contains morphline commands for using SolrCell with Tika parsers. This includes support for types including HTML, XML, PDF, Word, Excel, Images, Audio, and Video.

solrCell

The solrCell command pipes the first attachment of a record into one of the given Tika parsers, then maps the Tika output back to a record using SolrCell.

The Tika parser is chosen from the configurable list of parsers, depending on the MIME type specified in the input record. Typically, this requires an upstream detectMimeType command.

The command provides the following configuration options:

Property Name	Default	Description
solrLocator	n/a	Solr location parameters as described separately above.
capture	[]	List of XHTML element names to extract from the Tika output. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is also still captured into the overall "content" field.
fmaps	[]	Maps (moves) one field name to another. See the example below.
captureAttr	false	Whether to index attributes of the Tika XHTML elements into separate fields, named after the element. For example, when extracting from HTML, Tika can return the href attributes in `<a>` tags as fields named "a".
xpath	null	When extracting, only return Tika XHTML content that satisfies the XPath expression. See http://tika.apache.org/1.2/parser.html for details on the format of Tika XHTML. See also http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.
lowernames	false	Map all field names to lowercase with underscores. For example, ContentType would be mapped to contenttype.
solrContentHandlerFactory	org.apache.solr.morphline.solrcell.TrimSolrContentHandlerFactory	A Java class to handle bridging from Tika to SolrCell.
parsers	[]	List of fully qualified Java class names of one or more Tika parsers.

Example usage:

solrCell {
  solrLocator : ${SOLR_LOCATOR}

  # extract some fields
  capture : [content, a, h1, h2]

  # rename exif_image_height field to text field
  # rename a field to anchor field
  # rename h1 field to heading1 field
  fmap : { exif_image_height : text, a : anchor, h1 : heading1 }

  # xpath : "/xhtml:html/xhtml:body/xhtml:div/descendant:node()"

  parsers : [ # one or more nested Tika parsers
    { parser : org.apache.tika.parser.jpeg.JpegParser }
  ]
}

Here is a complex morphline that demonstrates integrating multiple heterogenous input file formats via a tryRules command, including Avro and SolrCell, using auto detection of MIME types via detectMimeType command, recursion via the callParentPipe command for unwrapping container formats, and automatic UUID generation:

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]

    commands : [
      {
        # emit one output record for each attachment in the input
        # record's list of attachments. The result is a list of
        # records, each of which has at most one attachment.
        separateAttachments {}
      }

      {
        # used for auto-detection if MIME type isn't explicitly supplied
        detectMimeType {
          includeDefaultMimeTypes : true
          mimeTypesFiles : [target/test-classes/custom-mimetypes.xml]
        }
      }

      {
        tryRules {
          throwExceptionIfAllRulesFailed : true
          rules : [
            # next top-level rule:
            {
              commands : [
                { logDebug { format : "hello unpack" } }
                { unpack {} }
                { generateUUID {} }
                { callParentPipe {} }
              ]
            }

            {
              commands : [
                { logDebug { format : "hello decompress" } }
                { decompress {} }
                { callParentPipe {} }
              ]
            }

            {
              commands : [
                {
                  readAvroContainer {
                    supportedMimeTypes : [avro/binary]
                    # optional, avro json schema blurb for getSchema()
                    # readerSchemaString : "<json can go here>"
                    # readerSchemaFile : /path/to/syslog.avsc
                  }
                }

                { extractAvroTree {} }

                {
                  setValues {
                    id : "@{/id}"
                    user_screen_name : "@{/user_screen_name}"
                    text : "@{/text}"
                  }
                }

                {
                  sanitizeUnknownSolrFields {
                    solrLocator : ${SOLR_LOCATOR}
                  }
                }
              ]
            }

            {
              commands : [
                {
                  readJsonTestTweets {
                    supportedMimeTypes : ["mytwittertest/json+delimited+length"]
                  }
                }

                {
                  sanitizeUnknownSolrFields {
                    solrLocator : ${SOLR_LOCATOR}
                  }
                }
              ]
            }

            # next top-level rule:
            {
              commands : [
                { logDebug { format : "hello solrcell" } }
                {
                  # wrap SolrCell around an Tika parsers
                  solrCell {
                    solrLocator : ${SOLR_LOCATOR}

                    capture : [
                      # twitter feed schema
                      user_friends_count
                      user_location
                      user_description
                      user_statuses_count
                      user_followers_count
                      user_name
                      user_screen_name
                      created_at
                      text
                      retweet_count
                      retweeted
                      in_reply_to_user_id
                      source
                      in_reply_to_status_id
                      media_url_https
                      expanded_url
                     ]

                    # rename "content" field to "text" fields
                    fmap : { content : text, content-type : content_type }

                    lowernames : true

                    # Tika parsers to be registered:
                    parsers : [
                      { parser : org.apache.tika.parser.asm.ClassParser }
                      { parser : org.gagravarr.tika.FlacParser }
                      { parser : org.apache.tika.parser.audio.AudioParser }
                      { parser : org.apache.tika.parser.audio.MidiParser }
                      { parser : org.apache.tika.parser.crypto.Pkcs7Parser }
                      { parser : org.apache.tika.parser.dwg.DWGParser }
                      { parser : org.apache.tika.parser.epub.EpubParser }
                      { parser : org.apache.tika.parser.executable.ExecutableParser }
                      { parser : org.apache.tika.parser.feed.FeedParser }
                      { parser : org.apache.tika.parser.font.AdobeFontMetricParser }
                      { parser : org.apache.tika.parser.font.TrueTypeParser }
                      { parser : org.apache.tika.parser.xml.XMLParser }
                      { parser : org.apache.tika.parser.html.HtmlParser }
                      { parser : org.apache.tika.parser.image.ImageParser }
                      { parser : org.apache.tika.parser.image.PSDParser }
                      { parser : org.apache.tika.parser.image.TiffParser }
                      { parser : org.apache.tika.parser.iptc.IptcAnpaParser }
                      { parser : org.apache.tika.parser.iwork.IWorkPackageParser }
                      { parser : org.apache.tika.parser.jpeg.JpegParser }
                      { parser : org.apache.tika.parser.mail.RFC822Parser }
                      { parser : org.apache.tika.parser.mbox.MboxParser,
                          additionalSupportedMimeTypes : [message/x-emlx] }
                      { parser : org.apache.tika.parser.microsoft.OfficeParser }
                      { parser : org.apache.tika.parser.microsoft.TNEFParser }
                      { parser : org.apache.tika.parser.microsoft.ooxml.OOXMLParser }
                      { parser : org.apache.tika.parser.mp3.Mp3Parser }
                      { parser : org.apache.tika.parser.mp4.MP4Parser }
                      { parser : org.apache.tika.parser.hdf.HDFParser }
                      { parser : org.apache.tika.parser.netcdf.NetCDFParser }
                      { parser : org.apache.tika.parser.odf.OpenDocumentParser }
                      { parser : org.apache.tika.parser.pdf.PDFParser }
                      { parser : org.apache.tika.parser.pkg.CompressorParser }
                      { parser : org.apache.tika.parser.pkg.PackageParser }
                      { parser : org.apache.tika.parser.rtf.RTFParser }
                      { parser : org.apache.tika.parser.txt.TXTParser }
                      { parser : org.apache.tika.parser.video.FLVParser }
                      { parser : org.apache.tika.parser.xml.DcXMLParser }
                      { parser : org.apache.tika.parser.xml.FictionBookParser }
                      { parser : org.apache.tika.parser.chm.ChmParser }
                    ]
                  }
                }

                { generateUUID { field : ignored_base_id } }

                {
                  generateSolrSequenceKey {
                    baseIdField: ignored_base_id
                    solrLocator : ${SOLR_LOCATOR}
                  }
                }

              ]
            }
          ]
        }
      }

      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      {
        logDebug {
          format : "My output record: {}"
          args : ["@{}"]
        }
      }

    ]
  }
]

Note: More information on SolrCell can be found here: http://wiki.apache.org/solr/ExtractingRequestHandler

cdk-morphlines-hadoop-sequencefile

readSequenceFile

The readSequenceFile command extracts zero or more records from the input stream of the first attachment of the record, representing an Apache Hadoop SequenceFile.

For the format and documentation of SequenceFiles see SequenceFile.

The command automatically handles Record-Compressed and Block-Compressed SequenceFiles.

The command provides the following configuration options:

Property Name	Default	Description
keyField	_attachment_name	The name of the output field to store the SequenceFile Record key.
valueField	_attachment_body	The name of the output field to store the SequenceFile Record value.

Example usage:

readSequenceFile {
  keyField : "key"
  valueField : "value"
}