Production data never lies

Using production data and Spec-Provider to automatically generate Avro schemas.

This article originally appeared on Akvo’s blog

The Akvo engineering team has bought into the idea of an immutable data log as the backbone of our architecture, which we hope will make it easier to integrate Akvo systems and build new functionality. More importantly, it will treat data as a first-class citizen: the valuable thing that it really is.

Automatic schema generation

One of the first tasks is to move all the data in Akvo Flow from its current storage, Google Cloud Datastore, to the immutable data log, which in our case is Kafka.

Given the value of data, as well as the fact that we expect Akvo to be around for many years to come and know that our software is going to evolve, we want to enforce some degree of quality on the data. So instead of chucking JSON blobs into Kafka, we’ve decided to use the Confluent Schema Registry.

The Schema Registry will ensure that any data added to Kafka will comply with a schema, and that as data evolves, new schemas will remain compatible with old schemas, so that old data can always be read. As the Schema Registry only supports Avro schemas, we need to generate one schema for each of the existing classes in Akvo Flow before we can start the migration.
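To give a feel for what "remaining compatible" means, here is a minimal Python sketch of just one such rule: a field added in a new schema needs a default, otherwise old records that lack the field cannot be read with it. The function name is hypothetical, schemas are plain dicts, and this is not how the Schema Registry implements its checks, which cover many more cases (type promotions, unions, nested records).

```python
def fields_blocking_old_data(old_schema, new_schema):
    """Names of fields added in new_schema that have no default value.

    Old records do not contain these fields, so a reader using new_schema
    would have nothing to fill them with.
    """
    old_names = {field["name"] for field in old_schema["fields"]}
    return [field["name"]
            for field in new_schema["fields"]
            if field["name"] not in old_names and "default" not in field]


OLD = {"type": "record", "name": "SurveyedLocale",
       "fields": [{"name": "countryCode", "type": ["null", "string"]}]}

# "region" is a made-up field, used only to illustrate the rule.
NEW_BROKEN = {"type": "record", "name": "SurveyedLocale",
              "fields": OLD["fields"] + [{"name": "region", "type": "string"}]}
NEW_OK = {"type": "record", "name": "SurveyedLocale",
          "fields": OLD["fields"] + [{"name": "region",
                                      "type": ["null", "string"],
                                      "default": None}]}

print(fields_blocking_old_data(OLD, NEW_BROKEN))
print(fields_blocking_old_data(OLD, NEW_OK))
```

Removing a field is not flagged here because a reader simply ignores data it has no field for.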

Thankfully, the Java Avro library comes with the ReflectData utility that is able to generate an Avro schema from a Java class, so we don’t need to do this chore by hand.

If we try it on one of Akvo Flow’s classes, SurveyedLocale, ReflectData generates the following:

{
  "type": "record",
  "name": "SurveyedLocale",
  "namespace": "com.gallatinsystems.surveyal.domain",
  "fields": [
    {
      "name": "systemIdentifier",
      "type": "string"
    },
    ...
    {
      "name": "surveyalValues",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "SurveyalValue",
          "fields": [
            ...
            {
              "name": "stringValueText",
              "type": {
                "type": "record",
                "name": "Text",
                "namespace": "com.google.appengine.api.datastore",
                "fields": [
                  {
                    "name": "value",
                    "type": "string"
                  }
                ]
              }
            }
          ]
        },
        "java-class": "java.util.List"
      }
    },
    ...
    {
      "name": "key",
      "type": "com.google.appengine.api.datastore.Key"
    },
    {
      "name": "ancestorIds",
      "type": {
        "type": "array",
        "items": "long",
        "java-class": "java.util.List"
      }
    }
  ]
}

You can find the full schema on GitHub.

This is a good starting point, and we could further improve it by making Avro somehow honour the fields marked as @NotPersistent, so they are not included in the final schema.

But, and this is a big but, there is no hint about which fields are mandatory and which are optional, nor is there any clue about which values are valid for those strings, longs and doubles.

We could dig into the code to find the validation rules, but that would not be enough, even if we did it for every single class. Google Cloud Datastore doesn’t enforce any schema, so the class only shows the current view of the data. The rules may have changed over the years, and we don’t really know what is in the datastore.

Alternatively, we could ask around and hope that somebody has been here for years and has a prodigious memory, that no bug has ever allowed invalid data to reach the datastore, and that every business change was accompanied by a flawless data migration.

A data-driven approach

But there is a more obvious source of truth: the actual production data. Why not generate the Avro schemas directly from what the production data tells us? At the end of the day, that is the data we need to support, and it contains examples of every value that is actually in use.

To help us with this task, we decided to use Spec-Provider, which is a Clojure library that, when given a list of objects, will generate a Clojure Spec. If you are unfamiliar with Clojure Spec, you can think about it as a schema DSL on steroids.

But Clojure Spec is not relevant to this discussion, as Spec-Provider has two distinct phases: one to generate statistics about the data types and another to generate the Clojure Specs from those statistics. We’ll use just the first phase and do the transformation to Avro ourselves.

So after pulling all the data from Google Cloud Datastore and transforming the Entities to Clojure maps, Spec-Provider provides us with the following stats:

{:pred-map        {map? {:max-length   27,
                          :min-length   21,
                          :sample-count 463417}}
 :sample-count    463417
 :map             {:keys  {:ambiguous   {:distinct-values #{false}, 
                                         :pred-map        {boolean? {:sample-count 463417}},
                                         :sample-count    463417},
                           :ancestorIds {:distinct-values #{nil},
                                         :pred-map        {nil? {:sample-count 440212}},
                                         :sample-count    440212},
                           :countryCode {:distinct-values           #{nil
                                                                      "RU"
                                                                      "MN"
                                                                      "UG"},
                                         :hit-distinct-values-limit true,
                                         :pred-map                  {string? {:max-length   2,
                                                                              :min-length   2,
                                                                              :sample-count 165067},
                                                                     nil?    {:sample-count 298350}},
                                         :sample-count              463417},
                           ...
                          }
                   }
}

You can find the full stats on GitHub.

The first few lines tell us that all entities are maps that have between 21 and 27 fields. They also tell us that we sampled almost 500k entities. Spec-Provider folds each entity into the running statistics, so you can add as many entities as you want without the risk of running out of memory. The stats then list every field with its types, a subset of its distinct values, and the minimum and maximum length for strings and lists.
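To make the idea concrete, here is a minimal Python sketch of this kind of incremental statistics collection. It is not Spec-Provider's code (which is Clojure) and it only imitates the shape of the stats above: the function names, the predicate names in `PREDICATES` and the value of `DISTINCT_LIMIT` are all assumptions made for illustration.

```python
# Predicate-style names for Python types; chosen for illustration only.
PREDICATES = {bool: "boolean?", int: "integer?", float: "double?",
              str: "string?", list: "sequential?", type(None): "nil?"}
SCALARS = (bool, int, float, str, type(None))
DISTINCT_LIMIT = 10  # hypothetical cap on remembered example values


def empty_stats():
    return {"sample-count": 0, "keys": {}}


def update_stats(stats, entity):
    """Fold one entity (a dict) into the running statistics, in place."""
    stats["sample-count"] += 1
    for name, value in entity.items():
        field = stats["keys"].setdefault(
            name, {"sample-count": 0, "pred-map": {}, "distinct-values": set()})
        field["sample-count"] += 1
        pred = field["pred-map"].setdefault(
            PREDICATES.get(type(value), "any?"), {"sample-count": 0})
        pred["sample-count"] += 1
        if isinstance(value, (str, list)):
            # Track the shortest and longest string or list seen so far.
            size = len(value)
            pred["min-length"] = min(pred.get("min-length", size), size)
            pred["max-length"] = max(pred.get("max-length", size), size)
        if isinstance(value, SCALARS):
            distinct = field["distinct-values"]
            if value in distinct or len(distinct) < DISTINCT_LIMIT:
                distinct.add(value)
            else:
                field["hit-distinct-values-limit"] = True
    return stats


def collect(entities):
    """Build stats from any iterable of entities, including a generator."""
    stats = empty_stats()
    for entity in entities:
        update_stats(stats, entity)
    return stats
```

Because only counters, length bounds and a capped set of examples are kept per field, the size of the stats depends on the number of fields, not on the number of entities streamed through `collect`.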

With a little bit of code, we can transform those statistics into an Avro schema:

{:type "record",
 :name "SurveyedLocale",
 :fields ({:name "ambiguous", :type "boolean", :doc "It is always '<false>'. "}
           {:name "ancestorIds", :type "null", :doc "Always nil"}
           {:name "countryCode", :type ("null" "string"), :doc "Examples: <RU>, <MN>, <TD>. Size [2,2]. Nil 64.38%"}
           {:name "surveyInstanceContrib",
            :type ("null"
                    {:type "array",
                     :items {:type "long", :doc "Examples: <1520050>, <2310465>, <1840061>. Range [1520050,2043500042]. "}}),
            :doc "Size [1,88]. Nil 3.33%"}
           ...
           ),
 :doc "Size [21,27]. ",
 :namespace "org.akvo.flow"}

You can find the full schema on GitHub.
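Our transformation is written in Clojure, but the core of it is small enough to sketch in Python. The version below is a simplified illustration, not the code we ran: it assumes the stats have been loaded into plain dicts shaped like the excerpt above, it only handles scalar fields, and the names `avro_field`, `avro_schema` and `AVRO_TYPES` are made up for this example. The counts in `STATS` are the ones from the stats excerpt.

```python
import json

# Mapping from predicate names to Avro primitive types (scalars only).
AVRO_TYPES = {"nil?": "null", "boolean?": "boolean", "string?": "string",
              "integer?": "long", "double?": "double"}


def avro_field(name, field, total):
    pred_map = field["pred-map"]
    avro_types = {AVRO_TYPES[pred] for pred in pred_map}
    # A field that is absent from some entities is treated as null there.
    missing = total - field["sample-count"]
    nulls = pred_map.get("nil?", {}).get("sample-count", 0) + missing
    if nulls:
        avro_types.add("null")
    ordered = sorted(avro_types, key=lambda t: (t != "null", t))  # "null" first
    doc = []
    if ordered == ["null"]:
        doc.append("Always nil")
    else:
        strings = pred_map.get("string?")
        if strings:
            doc.append(f"Size [{strings['min-length']},{strings['max-length']}].")
        if nulls:
            doc.append(f"Nil {100 * nulls / total:.2f}%")
    return {"name": name,
            "type": ordered[0] if len(ordered) == 1 else ordered,
            "doc": " ".join(doc)}


def avro_schema(record_name, namespace, stats):
    total = stats["sample-count"]
    return {"type": "record", "name": record_name, "namespace": namespace,
            "fields": [avro_field(name, field, total)
                       for name, field in sorted(stats["keys"].items())]}


STATS = {
    "sample-count": 463417,
    "keys": {
        "ambiguous": {"sample-count": 463417,
                      "pred-map": {"boolean?": {"sample-count": 463417}}},
        "ancestorIds": {"sample-count": 440212,
                        "pred-map": {"nil?": {"sample-count": 440212}}},
        "countryCode": {"sample-count": 463417,
                        "pred-map": {"string?": {"sample-count": 165067,
                                                 "min-length": 2,
                                                 "max-length": 2},
                                     "nil?": {"sample-count": 298350}}},
    },
}

print(json.dumps(avro_schema("SurveyedLocale", "org.akvo.flow", STATS), indent=2))
```

Putting "null" first in the union is a deliberate choice: in Avro, a union's default value must match its first branch, so this ordering lets a nullable field default to null later on.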

Looking at this schema, we see some pretty interesting stuff:

The “ambiguous” field is always false. Not a very useful flag.
There are 11 fields that are always null. Why do we need them?
Of the 27 fields, 22 can be null. This reminds me of Rich Hickey’s last rant on the Maybe type.
The “localeType” field is null on 0.003% of the entities. You can guess what the team would answer if asked whether this field is mandatory.
When present, countryCode is always a two-letter string.
The display name can be the empty string.
The surveyInstanceContrib array can have up to 88 items.
The latitude/longitude coordinates contain invalid data, as the range is [-1307822.0,698259771924], or maybe the data is not from this planet.
We don’t need to handle @NotPersistent fields.
There are some real examples in the doc, which can come in pretty handy.

This information and the schemas give us a much better starting point than the one before.

What’s next?

Of course, this approach is not limited to producing Avro schemas. You can also produce other kinds of schemas, like Protocol Buffers, or maybe even generate your classes directly.

The Avro type system is not very rich. The best we can do with all the extra information is to put it in a doc, but we could supplement it by generating some code to make those additional checks.
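As a rough idea of what such generated checks could look like, here is a Python sketch that turns the observed string lengths into validation functions. Everything in it is an assumption for illustration: the function names are made up, the stats are assumed to be plain dicts shaped like the Spec-Provider output, and only string lengths are covered. Observed bounds can also encode bad data, so checks like these would need a review before being enforced.

```python
def length_check(lo, hi, nullable):
    """Build a check for one field from its observed string lengths."""
    def check(value):
        if value is None:
            return nullable
        return isinstance(value, str) and lo <= len(value) <= hi
    return check


def make_checks(stats):
    """Return {field name: check function} for every field that holds strings."""
    checks = {}
    for name, field in stats["keys"].items():
        strings = field["pred-map"].get("string?")
        if strings is None:
            continue
        checks[name] = length_check(strings["min-length"],
                                    strings["max-length"],
                                    "nil?" in field["pred-map"])
    return checks


def violations(checks, entity):
    """Names of the fields in entity that fail their check."""
    return [name for name, check in checks.items()
            if not check(entity.get(name))]


# The countryCode numbers are the ones from the stats excerpt.
STATS = {"sample-count": 463417,
         "keys": {"countryCode": {
             "sample-count": 463417,
             "pred-map": {"string?": {"sample-count": 165067,
                                      "min-length": 2, "max-length": 2},
                          "nil?": {"sample-count": 298350}}}}}

checks = make_checks(STATS)
print(violations(checks, {"countryCode": "RUS"}))
```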

In the case of Akvo Flow, we have dozens of entities, with dozens of fields each. With the stats at hand, we can easily create a list of suspicious fields: either fields with a tiny percentage of nulls (like the 0.003% one), which probably denote a bug or data corruption, or fields with a huge percentage of nulls, which probably denote an unused feature that we might remove.
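A list like that takes only a few lines to produce. The Python sketch below assumes the stats are available as plain dicts shaped like the Spec-Provider output. The function names and the two thresholds are hypothetical, and the sample numbers are invented for the example, not taken from our data.

```python
def null_ratio(field, total):
    """Share of entities where the field is nil or missing altogether."""
    missing = total - field["sample-count"]
    nils = field["pred-map"].get("nil?", {}).get("sample-count", 0)
    return (nils + missing) / total


def suspicious_fields(stats, rare=0.001, common=0.99):
    """Split fields into likely bugs (rarely null) and likely unused (mostly null)."""
    total = stats["sample-count"]
    report = {"probably-a-bug": [], "probably-unused": []}
    for name, field in sorted(stats["keys"].items()):
        ratio = null_ratio(field, total)
        if 0 < ratio <= rare:
            report["probably-a-bug"].append((name, ratio))
        elif ratio >= common:
            report["probably-unused"].append((name, ratio))
    return report


# Invented numbers, for illustration only.
SAMPLE = {
    "sample-count": 100000,
    "keys": {
        "rarelyNull": {"sample-count": 100000,
                       "pred-map": {"string?": {"sample-count": 99997},
                                    "nil?": {"sample-count": 3}}},
        "mostlyNull": {"sample-count": 100000,
                       "pred-map": {"string?": {"sample-count": 500},
                                    "nil?": {"sample-count": 99500}}},
        "sometimesNull": {"sample-count": 100000,
                          "pred-map": {"string?": {"sample-count": 60000},
                                       "nil?": {"sample-count": 40000}}},
        "neverNull": {"sample-count": 100000,
                      "pred-map": {"boolean?": {"sample-count": 100000}}},
    },
}

for category, fields in suspicious_fields(SAMPLE).items():
    print(category, [name for name, _ in fields])
```

Where to put the thresholds is a judgment call; the point is only to get a short list of fields worth asking the team about.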

To be able to use this approach, you need a big enough dataset, and even then, it might not be representative of future data. For us, more than five years of data and several million rows give us more confidence than digging around the code, asking the experts or making assumptions.

Maybe you don’t like what it tells you, but production data never lies.