Checking JSON files for correctness

tl;dr: Write a Schema for your JSON format, and use Walbottle to validate your JSON files against it.

As JSON becomes used more and more in place of XML, we need a replacement for tools like xmllint to check that JSON documents follow whatever format they are supposed to be following.

Walbottle is a tool to do this, which I’ve been working on as part of client work at Collabora. Firstly, a brief introduction to JSON Schema, then I will give an example of how to integrate Walbottle into an application. In a future post I hope to explain some of the theory behind its test vector generation.

JSON Schema is a standard for describing how a particular type of JSON document should be structured. (There’s a good introduction on the Space Telescope Science Institute.) For example, what properties should be in the top-level object in the document, and what their types should be. It is entirely analogous to XML Schema (or Relax NG). It becomes a little confusing in the fact that JSON Schema files are themselves JSON, which means that there is a JSON Schema file for validating that JSON Schema files are well-formed; this is the JSON meta-schema.

Here is an example JSON Schema file (taken from the JSON Schema website):

{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}

Valid instances of this JSON schema are, for example:

{
    "firstName": "John",
    "lastName": "Smith"
}

or:

{
    "firstName": "Jessica",
    "lastName": "Smith",
    "age": 31
}

or even:

{
    "firstName": "Sandy",
    "lastName": "Sanderson",
    "country": "England"
}

The final example is important: by default, JSON object instances are allowed to contain properties which are not defined in the schema (because the default value for the JSON Schema additionalProperties keyword is an empty schema, rather than false).

What does Walbottle do? It takes a JSON Schema as input, and can either:

check the schema is a valid JSON Schema (the json-schema-validate tool);
check that a JSON instance follows the schema (the json-validate tool); or
generate JSON instances from the schema (the json-schema-generate tool).

Why is the last option useful? Imagine you have written a library which interacts with a web API which returns JSON. You use json-glib to turn the HTTP responses into a JSON syntax tree (tree of JsonNodes), but you have your own code to navigate through that tree and extract the interesting bits of the response, such as success codes or new objects from the server. How do you know your code is correct?

Ideally, the web API author has provided a JSON Schema file which describes exactly what you should expect from one of their HTTP responses. You can use json-schema-generate to generate a set of example JSON instances which follow or subtly do not follow the schema. You can then run your code against these instances, and check whether it:

does not crash;
correctly accepts the valid JSON instances; and
correctly rejects the invalid JSON instances.

This should be a lot better than writing such unit tests by hand, because nobody wants to spend time doing that — and even if you do, you are almost guaranteed to miss a corner case, which leaves your code prone to crashing when given unexpected input. (Alarmists would say that it is vulnerable to attack, and that any such vulnerability of network-facing code is probably prone to escalation into arbitrary code execution.)

For the example schema above, json-schema-generate returns (amongst others) the following JSON instances:

{"0":null,"firstName":null}
{"lastName":[null,null],"0":null,"age":0}
{"firstName":[]}
{"lastName":"","0":null,"age":1,"firstName":""}
{"lastName":[],"0":null,"age":-1}

They include valid and invalid instances, which are designed to try and hit boundary conditions in typical json-glib-using code.

How do you integrate Walbottle into your project? Probably the easiest way is to use it to generate a C or H file of JSON test vectors, and link or #include that into a simple test program which runs your code against each of them in turn.

Here is an example, straight from the documentation. Add the following to configure.ac:

AC_PATH_PROG([JSON_SCHEMA_VALIDATE],[json-schema-validate])
AC_PATH_PROG([JSON_SCHEMA_GENERATE],[json-schema-generate])
 
AS_IF([test "$JSON_SCHEMA_VALIDATE" = ""],
      [AC_MSG_ERROR([json-schema-validate not found])])
AS_IF([test "$JSON_SCHEMA_GENERATE" = ""],
      [AC_MSG_ERROR([json-schema-generate not found])])

Add this to the Makefile.am for your tests:

json_schemas = \
    my-format.schema.json \
    my-other-format.schema.json \
    $(NULL)
 
EXTRA_DIST += $(json_schemas)
 
check-json-schema: $(json_schemas)
    $(AM_V_GEN)$(JSON_SCHEMA_VALIDATE) $^
check-local: check-json-schema
.PHONY: check-json-schema
 
json_schemas_h = $(json_schemas:.schema.json=.schema.h)
BUILT_SOURCES += $(json_schemas_h)
CLEANFILES += $(json_schemas_h)
 
%.schema.h: %.schema.json
    $(AM_V_GEN)$(JSON_SCHEMA_GENERATE) \
        --c-variable-name=$(subst -,_,$(notdir $*))_json_instances \
        --format c $^ > $@
 
my_test_suite_SOURCES = my-test-suite.c
nodist_my_test_suite_SOURCES = $(json_schemas_h)

And add this to your test suite C file itself:

#include "my-format.schema.h"
 
…
 
// Test the parser with each generated test vector from the JSON schema.
static void
test_parser_generated (gconstpointer user_data)
{
  guint i;
  GObject *parsed = NULL;
  GError *error = NULL;
 
  i = GPOINTER_TO_UINT (user_data);
 
  parsed = try_parsing_string (my_format_json_instances[i].json,
                               my_format_json_instances[i].size, &error);
 
  if (my_format_json_instances[i].is_valid)
    {
      // Assert @parsed is valid.
      g_assert_no_error (error);
      g_assert (G_IS_OBJECT (parser));
    }
  else
    {
      // Assert parsing failed.
      g_assert_error (error, SOME_ERROR_DOMAIN, SOME_ERROR_CODE);
      g_assert (parsed == NULL);
    }
 
  g_clear_error (&error);
  g_clear_object (&parsed);
}
 
…
 
int
main (int argc, char *argv[])
{
  guint i;
 
  …
 
  for (i = 0; i < G_N_ELEMENTS (my_format_json_instances); i++)
    {
      gchar *test_name = NULL;
 
      test_name = g_strdup_printf ("/parser/generated/%u", i);
      g_test_add_data_func (test_name, GUINT_TO_POINTER (i),
                            test_parser_generated);
      g_free (test_name);
    }
 
  …
}

Walbottle is heading towards being mature. There are some features of the JSON Schema standard it doesn’t yet support: $ref/definitions and format. Its main downside at the moment is speed: test vector generation is complex, and the algorithms slow down due to computational complexity with lots of nested sub-schemas (so try to design your schemas to avoid this if possible). json-schema-generate recently acquired a --show-timings option which gives debug information about each of the sub-schemas in your schema, how many JSON instances it generates, and how long that took, which gives some insight into how to optimise the schema.

Original post