Fork me on GitHub

RecordBreaker

RecordBreaker: Automatic structure for your text-formatted data.
RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc) into structured Avro data, without any need to write parsers or extractors. Its goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.

Introduction

Hadoop's HDFS is often used to store large amounts of text-formatted data: log files, sensor readings, transaction histories, etc. Much of this data is "near-structured": the data has a format that's obvious to a human observer, but is not made explicit in the file itself. For example, the following line is an example of the Common Log Format, often used in web servers:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

It contains a source IP address, followed by a user id, followed by an date and timestamp, then an HTTP request, etc.

When a user wants to process such "near-structured" data with MapReduce, Pig, or some similar tool, she must laboriously reconstruct the metadata that is obvious to anyone who just eyeballs the data. Performing this reconstruction usually entails writing a parser or extractor, often one based on relatively brittle regular expressions. It's true that the Common Log Format is so, uh, common that writing a single good parser for it is probably worthwhile. However, there are also file listings, album track listings, temperature readings, flight schedules, and many many other kinds of data; the number of good parsers we need to write gets large, quickly. Writing all of these straightforward extractors, again and again, is a time-consuming and error-prone pain for everyone. We believe it is a major obstacle to faster and easier data analytics

The RecordBreaker project aims to automatically generate structure for text-embedded data. It consists of two main components:

  1. LearnStructure takes a text file as input and derives a parser that breaks lines of the file into typed fields. For example, the above web log entry is broken into 127.0.0.1, followed by frank, etc.
  2. SchemaDictionary takes data that's been parsed by LearnStructure and applies topic-specific labels. For example, 127.0.0.1 gets labelled as client ip address, and frank is labelled as user id.
As you can probably guess, the target structured data format is Avro. Avro allows efficient cross-platform data serialization, similar to Thrift or Protocol Buffers. Data stored in Avro has many advantages (read this overview of Avro for more) and many tools support Avro directly: Hadoop MapReduce, HBase, Pig, and others.

Related Work

Our work on the LearnStructure component draws inspiration from the PADS research project (http://www.padsproj.org/index.html), in particular the paper From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data, by Fisher, Walker, Zhu, and White. Published in POPL, 2008.. That paper itself draws on many papers in the area of information extraction and related fields. The authors have released code for their system, written in ML. ML is a great language, but is not well-suited to our needs: it is not supported by Avro, and is unlikely to appeal to many of the developers currently involved with the Hadoop ecosystem.

SchemaDictionary is more generally inspired by database schema mapping systems. (A famous example is described in The Clio Project: Managing Heterogeneity, by Miller, Hernandez, Haas, Yan, Ho, Fagin, and Popa, published in SIGMOD Record 30(1), March 2001, pp.78-83.) Schema mapping systems are usually designed to help database administrators merge existing databases; for example, when company A purchases company B and must then merge the employee lists. These tools are often expensive and expect a lot of administrator attention. In contrast, our SchemaDictionary is for busy data analysts who simply want to check out a novel dataset as quickly as possible. It is fast and simple, but can only handle relatively simple structures (rendering it inappropriate for databases, but on target for the kind of data that is popular in text-based formats).

Walkthrough

Imagine you have a simple file listing, stored in listing.txt:

5 mjc staff 170 Mar 14 2011 14:14 bin
5 mjc staff 170 Mar 12 2011 05:13 build
1 mjc staff 11080 Mar 14 2011 14:14 build.xml

We can now run our tools to automatically turn this text file into a structured Avro file.

LearnStructure

We run the first component, LearnStructure, as follows:

$ bin/learnstructure learn listing.txt outdir

This tells the learnstructure tool to learn the Avro structure found in listing.txt, and to write it out to the outdir directory. outdir contains four files: schema.json, data.avro.json, data.avro, and parser.dat.

The most interesting is probably schema.json, the program's attempt at building a JSON schema that describes the text data. In the case of the example above, it looks like this:

{
  "type" : "record",
  "name" : "record_1",
  "namespace" : "",
  "doc" : "RECORD",
  "fields" : [ {
   "name" : "base_0",
   "type" : "int",
   "doc" : "Example data: '5', '5', '1'"
  }, {
   "name" : "base_2",
   "type" : "string",
   "doc" : "Example data: 'mjc', 'mjc', 'mjc'"
  }, {
   "name" : "base_4",
   "type" : "string",
   "doc" : "Example data: 'staff', 'staff', 'staff'"
  }, {
   "name" : "base_6",
   "type" : "int",
   "doc" : "Example data: '17000', '17000', '11080'"
  }, {
   "name" : "base_8",
   "type" : {
    "type" : "record",
    "name" : "base_8",
    "doc" : "",
    "fields" : [ {
     "name" : "month",
     "type" : "int",
     "doc" : ""
    }, {
     "name" : "day",
     "type" : "int",
     "doc" : ""
    }, {
     "name" : "year",
     "type" : "int",
     "doc" : ""
    } ]
   },
   "doc" : "Example data: '(14, 3, 2011)', '(12, 3, 2011)', '(14, 3, 2011)'"
  }, {
   "name" : "base_10",
   "type" : {
    "type" : "record",
    "name" : "base_10",
    "doc" : "",
    "fields" : [ {
     "name" : "hrs",
     "type" : "int",
     "doc" : ""
    }, {
     "name" : "mins",
     "type" : "int",
     "doc" : ""
    }, {
     "name" : "secs",
     "type" : "int",
     "doc" : ""
    } ]
   },
   "doc" : "Example data: '(14, 14, 0)', '(5, 13, 0)', '(14, 14, 0)'"
  }, {
   "name" : "base_12",
   "type" : "string",
   "doc" : "Example data: 'bin', 'build', 'build.xml'"
  } ]
}
The JSON schema describes a record of several fields: an int, two strings, an int, a date, a time, and a final string. These correspond to the number of links (int), the user owner (string), the group owner (string), the filesize (int), the modification stamp (date and time), and finally the filename (string).

Of course, the field names here are nonsense. All of the values, except for subfields of the date and timestamp records, have nondescriptive synthetically-generated names. The LearnStructure step attempts to recover the type of each field, but has no way to know its name or role. Obtaining names for these fields is the job of the SchemaDictionary. For now, we just live with these bad synthetic names.

Next, let's look at data.avro.json, a JSON-formatted version of the actual data. The binary Avro version of this data is stored in data.avro. The contents of data.avro.json look like this (I've added line breaks to make it more readable):

{"base_0":5,"base_2":"mjc","base_4":"staff","base_6":17000,"base_8":{"month":3,"day":14,"year":2011},"base_10":{"hrs":14,"mins":14,"secs":0},"base_12":"bin"}

{"base_0":5,"base_2":"mjc","base_4":"staff","base_6":17000,"base_8":{"month":3,"day":12,"year":2011},"base_10":{"hrs":5,"mins":13,"secs":0},"base_12":"build"}

{"base_0":1,"base_2":"mjc","base_4":"staff","base_6":11080,"base_8":{"month":3,"day":14,"year":2011},"base_10":{"hrs":14,"mins":14,"secs":0},"base_12":"build.xml"}

The field names here match the contents of schema.json, and the values of those fields reflect the contents of listing.txt.

The final LearnStructure output file is parser.dat. This is a binary representation of the parser generated by LearnStructure; the program applied this parser to listing.txt in order to obtain data.avro and data.avro.json. If the user wants to process more data that has the same format as listing.txt, there's no need to relearn the structure; she can simply reapply the already-learned parser in parser.dat.

We're now ready to apply the SchemaDictionary component.

SchemaDictionary

We now use the SchemaDictionary tool to find meaningful names for all the fields in data obtained by LearnStructure. SchemaDictionary compares the structured but anonymous data.avro and schema.json against a large dictionary of known data types. The SchemaDictionary tool then finds the closest match between the anonymous data and a dictionary entry, and uses that match to choose meaningful labels for the anonymous Avro fields.

As long as the dictionary contains a dataset that is similar to the anonymous candidate data, then SchemaDictionary should be able to find a reasonable match. We hope the number of data types tracked by SchemaDictionary quickly becomes large and diverse, allowing it to find labels for almost any text-embedded dataset that RecordBreaker is likely to encounter. A small set of data types is currently in the SchemaDictionary repository.

Our first step is to add the data type samples from the repository into a "live" SchemaDictionary instance. We can build a dictionary that contains these samples by typing in the following commands:

bin/schemadict dict -m "HR database" -a src/samples/schemas/hr.avro schemaDict
bin/schemadict dict -m "Web listing" -a src/samples/schemas/weblisting.avro schemaDict

bin/schemadict dict -m "Flight schedule database" -a src/samples/schemas/flightschedule.avro schemaDict

Executing these commands will create a new directory called schemaDict. We will add to it three known data samples stored in src/samples/schemas, and supply a small comment to describe each one. By dumping the contents of schemaDict to the screen with this command:

bin/schemadict dict -d schemaDict

we can obtain the following listing:

1. Web listing
{"type":"record","name":"record_1","namespace":"","fields":[{"name":"datemodified","type":{"type":"record","name":"datemodified","fields":[{"name":"month","type":"int"},{"name":"day","type":"int"},{"name":"year","type":"int"}]}},{"name":"timemodified","type":{"type":"record","name":"timemodified","fields":[{"name":"hrs","type":"int"},{"name":"mins","type":"int"},{"name":"secs","type":"int"}]}},{"name":"url","type":"string"},{"name":"len","type":"int"}]}

2. HR database
{"type":"record","name":"record_1","namespace":"","fields":[{"name":"name","type":"string"},{"name":"salary","type":"double"},{"name":"vacationdays","type":"double"},{"name":"start_date","type":{"type":"record","name":"start_date","fields":[{"name":"month","type":"int"},{"name":"day","type":"int"},{"name":"year","type":"int"}]}},{"name":"numreports","type":"int"}]}

3. Flight schedule database
{"type":"record","name":"record_1","namespace":"","fields":[{"name":"departure_city","type":"string"},{"name":"destination_city","type":"string"},{"name":"airline","type":"string"},{"name":"departure_date","type":{"type":"record","name":"departure_date","fields":[{"name":"month","type":"int"},{"name":"day","type":"int"},{"name":"year","type":"int"}]}},{"name":"departure_time","type":{"type":"record","name":"departure_time","fields":[{"name":"hrs","type":"int"},{"name":"mins","type":"int"},{"name":"secs","type":"int"}]}},{"name":"return_date","type":{"type":"record","name":"return_date","fields":[{"name":"month","type":"int"},{"name":"day","type":"int"},{"name":"year","type":"int"}]}},{"name":"return_time","type":{"type":"record","name":"return_time","fields":[{"name":"hrs","type":"int"},{"name":"mins","type":"int"},{"name":"secs","type":"int"}]}}]}

Dictionary at schemaDict has 3 item(s).

This listing should confirm that schemaDict contains the three data types seen above.


We're now ready to use this dictionary to apply labels to our original file from LearnStructure. Entering this command:

bin/schemadict suggest schemaDict outdir/data.avro -d

will ask the SchemaDictionary tool to find some labels for the anonymous data in outdir/data.avro by comparing it to all the known data types in schemaDict. (The -d flag here tells the suggest tool to emit extra debug information.) Running this program will output the following:

Anonymous data filename: outdir/data.avro
Ranking of closest known data types, with match-distance (smaller is better):

-------------------------------------------------------------
1. 'Web listing, with distance: 7062.333333333333

DISCOVERED LABELS
1. In 'input', label '<root>' AS <root>
'ROOT' ==> 'ROOT'
2. In 'input', label '<root>.base_10' AS <root>.timemodified
'Example data: '(14, 14, 0)', '(5, 13, 0)', '(14, 14, 0)'' ==> ''
3. In 'input', label '<root>.base_10.hrs' AS <root>.timemodified.hrs
4. In 'input', label '<root>.base_10.secs' AS <root>.timemodified.secs
5. In 'input', label '<root>.base_10.mins' AS <root>.timemodified.mins
6. In 'input', label '<root>.base_8' AS <root>.datemodified
'Example data: '(14, 3, 2011)', '(12, 3, 2011)', '(14, 3, 2011)'' ==> ''
7. In 'input', label '<root>.base_8.month' AS <root>.datemodified.month
8. In 'input', label '<root>.base_8.year' AS <root>.datemodified.year
9. In 'input', label '<root>.base_8.day' AS <root>.datemodified.day

UNMATCHED ITEMS IN TARGET DATA TYPE
1. <root>.url
2. <root>.len

UNMATCHED ITEMS IN SOURCE DATA
1. <root>.base_6
Example data: '17000', '17000', '11080'
2. <root>.base_12
Example data: 'bin', 'build', 'build.xml'
3. <root>.base_4
Example data: 'staff', 'staff', 'staff'
4. <root>.base_2
Example data: 'mjc', 'mjc', 'mjc'
5. <root>.base_0
Example data: '5', '5', '1'

We can see that the system has chosen "Web listing" as the nearest match. This isn't a perfect match, but it's not bad: a file listing will have some potential similarities with a listing of URLs. Further, a file listing is almost certainly more similar to a server log than to an HR database or a database of flights. Let's look at the matches that it suggests.

They essentially map the anonymous data's base_10 record structure to the Web server log's timemodified structure, and base_8's date structure to the server log's datemodified. These seem like reasonable decisions about labels to apply to the anonymous data.

We can now look at the fields in each dataset that did not receive any mapping. There are two in the target Web listing dataset: url and len. There is no direct match to url in the anonymous listing data, though perhaps a match to base_12, which lists filenames, would have been a good choice. The lack of a match to len is more regrettable: it has a good match in the anonymous base_6, which lists the size of individual files and would have served as a fine label.

Five fields in the the anonymous data did not receive any mapping. We already discussed base_6 (file sizes) and base_12 (file names). The remaining items do not have any good match in the Web listing dataset. The base_4 field indicates the group owner and base_2 indicates the user owner, neither of which is related to anything in a Web access log. The final field printed here, base_0, which indicates the number of links associated with each file listing, similarly has no analogue to a Web log.


In some cases, SchemaDictionary may not correctly choose the most-similar data type correctly. For these cases, the user can ask SchemaDictionary to output its top-k "guesses." If the user types:

bin/schemadict suggest -k 3 test-schemadict testout1/data.avro -d

then the system will output its best-3 guesses. For example:

Ranking of closest known data types, with match-distance (smaller is better):


-------------------------------------------------------------
1. 'Web listing', with distance: 7062.333333333333

...(label details here)...

-------------------------------------------------------------
2. 'HR database', with distance: 9016.666666666666

...(label details here)...

-------------------------------------------------------------
3. 'Flight schedule database', with distance: 12032.833333333334

...(label details here)...

Of course, in this case the tool has correctly guessed the closest entry in the dictionary of schemas, so the top-1 answer was sufficient.

Future Work

There is lots more work to do with RecordBreaker. Here are just a few of the items on the todo list:

Dependencies

RecordBreaker depends on a handful of relatively standard libraries. The most obvious are the Avro and Hadoop projects, but there are many others.

Install

Untar or unzip the package and type ant to build it.

License

Apache 2.0

Authors

Mike Cafarella (michjc@umich.edu)

Acknowledgments

RecordBreaker would not be possible without the efforts of all the people behind Hadoop, Avro, and several other open source projects. It also owes an intellectual debt to Kathleen Fisher, David Walker, Kenny Q. Zhu, and Peter White, the authors of the terrific paper cited above.

Contact

Mike Cafarella (michjc@umich.edu)

Cloudera (github@cloudera.com)

Download

You can download this project in either zip or tar formats.

You can also clone the project with Git by running:

$ git clone git://github.com/cloudera/RecordBreaker