Schema Inference

What is schema inference?

When working with big-data systems, you have almost certainly had to define a schema describing the data that the target system should expect to receive.

We have always thought that this is super annoying.

Every time your events or data payloads change, you must remember to go into some platform and update the schema, repair tables, test the whole thing and 🤞.

When Batch collects events, it does not require you to define a schema for them. Instead, Batch automatically infers the schema from the structure of each event and automatically evolves the schema as your events evolve.

We call this schema inference.
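
For example, assuming a hypothetical collection whose events look like the payloads below, the first event would produce the initial schema and a later event would extend it with a new optional column. The field names here are purely illustrative.

# First event seen in a collection: Batch infers a string column for "user_id"
# and a double column for "amount".
event_v1 = {
    "user_id": "usr_123",
    "amount": 42.5,
}

# A later event adds a new key. The inferred schema evolves to include an
# additional optional "currency" string column; existing columns are unchanged.
# (Changing the type of an existing key is not allowed - see the rules below.)
event_v2 = {
    "user_id": "usr_456",
    "amount": 17.0,
    "currency": "usd",
}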

What is the schema used for?

Short answer

The schema is used for generating a parquet structure and a Hive metastore table definition.

Both of these are used for facilitating replays and extended searches.

Long answer

When Batch collects your events, it stores them in two locations: our search platform and S3.

Each storage location is used for different purposes:

  1. The search platform is used to facilitate fast but potentially incomplete searches via our dashboard or API. Data there is stored in JSON format and does not contain your full data set.

  2. S3 is used to facilitate replays and extended searches. Data there is stored as batched parquet and contains your full data set.

To facilitate replays, Batch translates your Lucene search query into a SQL query that matches the generated parquet and Hive table schema. The query is then executed via AWS Athena (Presto), which searches through all of the parquet data.

Storing your data as parquet in S3 enables near-infinite storage that can be queried and retrieved programmatically.
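
As a rough illustration of that replay path, the sketch below shows how a Lucene-style search might map onto a Presto SQL query submitted to Athena via boto3. It uses field names from the example schema in the next section; the database, table and S3 output location are placeholders, and the exact SQL that Batch generates may differ.

import boto3

# Illustrative only: a Lucene-style search such as
#   object1.foo:"bar" AND object1.bar:[100 TO 200]
# might be expressed as a Presto SQL query over the generated table.
sql = """
SELECT *
FROM my_collection
WHERE payload.object1.foo = 'bar'
  AND payload.object1.bar BETWEEN 100 AND 200
"""

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={"Database": "my_database"},                  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
print(response["QueryExecutionId"])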

What does a schema look like?

Given the following JSON event:

{
  "object1": {
    "foo": "bar",
    "bar": 123,
    "baz": true
  },
  "object2": [
    1,
    2,
    3
  ],
  "object3": [
    "foo",
    "bar"
  ],
  "object4": [
    {
      "nested_obj_1": {
        "foo": "bar"
      }
    },
    {
      "nested_obj_2": {
        "foo": "baz"
      }
    }
  ]
}

Batch will generate the following schema:

{
  "Fields": [
    {
      "Tag": "name=payload, repetitiontype=OPTIONAL",
      "Fields": [
        {
          "Tag": "name=object1, repetitiontype=OPTIONAL",
          "Fields": [
            {
              "Tag": "name=foo, type=UTF8, repetitiontype=OPTIONAL"
            },
            {
              "Tag": "name=bar, type=DOUBLE, repetitiontype=OPTIONAL"
            },
            {
              "Tag": "name=baz, type=BOOLEAN, repetitiontype=OPTIONAL"
            }
          ]
        },
        {
          "Tag": "name=object2, type=LIST, repetitiontype=OPTIONAL",
          "Fields": [
            {
              "Tag": "name=element, type=DOUBLE, repetitiontype=OPTIONAL"
            }
          ]
        },
        {
          "Tag": "name=object3, type=LIST, repetitiontype=OPTIONAL",
          "Fields": [
            {
              "Tag": "name=element, type=UTF8, repetitiontype=OPTIONAL"
            }
          ]
        },
        {
          "Tag": "name=object4, type=LIST, repetitiontype=OPTIONAL",
          "Fields": [
            {
              "Tag": "name=element, repetitiontype=OPTIONAL",
              "Fields": [
                {
                  "Tag": "name=nested_obj_1, repetitiontype=OPTIONAL",
                  "Fields": [
                    {
                      "Tag": "name=foo, type=UTF8, repetitiontype=OPTIONAL"
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

The inferred schema is not easy to read; it is intended to be consumed programmatically.
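
If it helps to see it in a more familiar form, the sketch below is a rough pyarrow equivalent of the schema above. It is for illustration only; Batch's internal representation uses the parquet field tags shown above, and the exact list and element encoding may differ.

import pyarrow as pa

# Approximate pyarrow rendering of the inferred schema (illustrative only).
payload = pa.struct([
    pa.field("object1", pa.struct([
        pa.field("foo", pa.string()),
        pa.field("bar", pa.float64()),   # JSON numbers are inferred as DOUBLE
        pa.field("baz", pa.bool_()),
    ])),
    pa.field("object2", pa.list_(pa.float64())),
    pa.field("object3", pa.list_(pa.string())),
    pa.field("object4", pa.list_(pa.struct([
        pa.field("nested_obj_1", pa.struct([
            pa.field("foo", pa.string()),
        ])),
    ]))),
])

schema = pa.schema([pa.field("payload", payload)])
print(schema)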

You can view a user-friendly version of the current schema, along with all past schemas, on the collection's page. The user-friendly schema output will look like this:

[
  {
    "name":"object1",
    "type":"object",
    "fields":[
      {
        "name":"foo",
        "type":"string"
      },
      {
        "name":"bar",
        "type":"double"
      },
      {
        "name":"baz",
        "type":"boolean"
      }
    ]
  },
  {
    "name":"object2",
    "type":"array",
    "fields":[
      "double"
    ]
  },
  {
    "name":"object3",
    "type":"array",
    "fields":[
      "string"
    ]
  },
  {
    "name":"object4",
    "type":"array",
    "fields":[
      {
        "name":"nested_obj_1",
        "type":"object",
        "fields":[
          {
            "name":"foo",
            "type":"string"
          }
        ]
      }
    ]
  }
]

Schema inference rules

  1. Do not change types for existing keys

    1. You can always add new keys to your events or omit existing keys, but you cannot change the type of an existing key; doing so will cause a schema conflict. For example, if your event had a key called foo that was a string, do not send another event in which foo is an integer.

  2. Field names that use non-alphanumeric characters will be transformed to the format _UTFHEXCODE_ when stored in parquet

    1. We need to do this as Hive does not support non-alphanumeric characters for column names.

  3. Field names that use uppercase characters will be automatically updated to lowercase when stored in parquet.

    1. We need to do this as Hive column names are case-insensitive (an approximation of both transformations is sketched after this list).
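
The sketch below approximates both transformations. It is not Batch's exact implementation; in particular, whether underscores are preserved and the exact hex formatting are assumptions.

def normalize_key(key: str) -> str:
    """Approximate the two rules above: lowercase the key and replace
    unsupported characters with their hex code wrapped in underscores.
    Illustrative only, not Batch's exact implementation."""
    normalized = []
    for ch in key.lower():
        if ch.isascii() and (ch.isalnum() or ch == "_"):
            normalized.append(ch)
        else:
            normalized.append(f"_{ord(ch):X}_")
    return "".join(normalized)

print(normalize_key("First-Name"))   # first_2D_name
print(normalize_key("user.email"))   # user_2E_email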

Best practices

Follow these best practices to ensure that your data lookups and replays remain fast and easy to work with.

  1. Use a common event envelope

    1. By doing this, you will prevent possible future key and type collisions (see the example after this list).

  2. Use snake_case for JSON keys

    1. When the schema is generated, every key is defined as a "column" with an associated type. Parquet is case-sensitive, but Hive is not: having field and FIELD at the same nest-level is OK in parquet but will cause duplicate column errors in Hive. Because of this, Batch will automatically convert ALL keys to lowercase when storing them in parquet, but to avoid any strange behavior we advise presenting the data in a clean format to begin with.

  3. Use alphanumeric characters for keys

    1. Similar to the case-insensitivity issue, Hive cannot handle non-alphanumeric characters in column names. Batch will automatically convert any unsupported characters to the equivalent unicode hex code, but again, it is best to send a data set that will not require any such behind-the-scenes transformations.
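
For reference, a common event envelope might look something like the sketch below. The field names are purely illustrative and not required by Batch; the point is that every event shares the same top-level keys and nests its event-specific fields under its own key, which keeps keys and types from colliding as event types evolve.

# Hypothetical envelope shared by all events in a collection (names are illustrative).
event = {
    "event_type": "order_created",          # snake_case, alphanumeric keys
    "occurred_at": "2021-06-01T12:00:00Z",
    "source": "billing_service",
    "data": {
        "order_created": {                  # event-specific fields, namespaced by type
            "order_id": "ord_123",
            "total": 42.5,
        }
    },
}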
