Quick abstracts of YAML or JSON documents

February 24, 2014 Serguei Filimonov

When I work with unfamiliar YAML files (deployment manifests, product metadata, serialized records, etc.), I want to quickly get a sense of a few things:

  • What is the set of keys in this data structure?
  • If the structure (nested keys) of the document changed over time, what is a quick summary of the changes?

structure_digest

Given the following long YAML file (shown here in JSON syntax), I don’t really want to read through all of it to learn what keys and paths are available in it:

{
  "receipt": "Oz-Ware Purchase Invoice",
  "date": "2012-08-06",
  "customer": {
    "given": "Dorothy",
    "family": "Gale"
  },
  "items": [
    {
      "part_no": "A4786",
      "descrip": "Water Bucket (Filled)",
      "price": 1.47,
      "quantity": 4
    },
    {
      "part_no": "E1628",
      "descrip": "High Heeled \"Ruby\" Slippers",
      "size": 8,
      "price": 100.27
    },
    …(many many more items )
  ]
}

Let’s remove the value content and focus on structure, summarizing array entries as one:

>structure_digest order1.yml order2.yml ...
.customer.family
.customer.given
.date
.items[].descrip
.items[].part_no
.items[].price
.items[].quantity
.items[].size
.receipt

This summary hints at the basic structure of the file, particularly removing the noise of many items having very similar content and keys.
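The idea behind this summary can be sketched in a few lines of Python. This is only an illustration of the technique (walk the parsed document, drop the values, and give every array element the same `[]` segment so that entries collapse into one), not the tool's actual implementation. The function name `digest` and the inline `order` data are assumptions made for the example.

```python
def digest(node, prefix=""):
    """Collect the key paths of a parsed YAML/JSON value, ignoring leaf values."""
    if isinstance(node, dict):
        paths = set()
        for key, value in node.items():
            paths |= digest(value, f"{prefix}.{key}")
        return paths or {prefix}          # keep the path of an empty mapping
    if isinstance(node, list):
        paths = set()
        for element in node:
            # Every element shares the "[]" segment, so entries collapse into one.
            paths |= digest(element, f"{prefix}[]")
        return paths or {f"{prefix}[]"}   # keep the path of an empty list
    return {prefix}                       # scalar leaf: record where it lives


# Hypothetical in-memory copy of the first two items of the order above.
order = {
    "receipt": "Oz-Ware Purchase Invoice",
    "date": "2012-08-06",
    "customer": {"given": "Dorothy", "family": "Gale"},
    "items": [
        {"part_no": "A4786", "descrip": "Water Bucket (Filled)",
         "price": 1.47, "quantity": 4},
        {"part_no": "E1628", "descrip": "High Heeled \"Ruby\" Slippers",
         "size": 8, "price": 100.27},
    ],
}

for path in sorted(digest(order)):
    print(path)
```

Because the paths are gathered in a set, a key that appears in only some array entries (such as `size`) still shows up exactly once.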

Usage

Usage: structure_digest [options] File1[, File2, ...]
    -t, --tree                       replace repeated suffixes with indents

Use cases: Web APIs

Discogs.com provides a rich API of music records. Fetching a page of Pink Floyd’s releases returns a hefty 15K of minified JSON:

>curl -s http://api.discogs.com/artists/45467/releases > pink-floyd.json
>wc -c pink-floyd.json
    15186 pink-floyd.json
>head pink-floyd.json
{"pagination": {"per_page": 50, "items": 1330, "page": 1, "urls": {"last":
"http://api.discogs.com/artists/45467/releases?per_page=50&page=27", "next":
"http://api.discogs.com/artists/45467/releases?per_page=50&page=2"}, "pages":
27}, "releases": [{"thumb":
"http://api.discogs.com/image/R-150-1090924-1191680758.jpeg", "artist": "Pink
Floyd, The*", "main_release": 1090924, "title": "Apples And Oran…

That is hundreds of lines in my terminal, but we can now quickly understand this document:

>structure_digest --tree pink-floyd.json
.pagination
  .items
  .page
  .pages
  .per_page
  .urls
    .last
    .next
.releases[]
  .artist
  .format
  .id
  .label
  .main_release
  .resource_url
  .role
  .status
  .thumb
  .title
  .type
  .year
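A tree view like the one above can be derived from the flat list of paths. The following Python sketch shows one way to do it, assuming paths are dot-separated and keys contain no dots themselves. The function name `render_tree` is hypothetical and this is not taken from the tool's source.

```python
def render_tree(paths):
    """Print each path segment once, indented under its parent segment."""
    lines = []
    seen = set()
    for path in sorted(paths):
        segments = path.lstrip(".").split(".")
        for depth in range(len(segments)):
            node = tuple(segments[: depth + 1])
            if node not in seen:          # emit a shared prefix only once
                seen.add(node)
                lines.append("  " * depth + "." + segments[depth])
    return lines


# A few of the paths from the Discogs example, in flat form.
sample_paths = [
    ".pagination.items",
    ".pagination.page",
    ".pagination.urls.last",
    ".pagination.urls.next",
    ".releases[].artist",
    ".releases[].id",
]

print("\n".join(render_tree(sample_paths)))
```

Sorting first keeps all paths with a common prefix next to each other, so each parent is printed immediately before its children.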

Use cases: Configuration files

A BOSH manifest specifies a cloud deployment. It’s used by Cloud Foundry and its configuration is rich. Let’s abstract its example manifest and find the fields configuring a BOSH “job”:

>structure_digest bosh_example.yml | grep -E "^\.jobs"
.jobs[].instances
.jobs[].name
.jobs[].networks[].name
.jobs[].networks[].static_ips[]
.jobs[].persistent_disk
.jobs[].resource_pool
.jobs[].template

Pretty neat.

Finding structure changes with diff

If you have two versions of a data format and an example of each, here’s a quick way to see what changed:

>diff <(structure_digest old.json) <(structure_digest new.json)
2,4c2,3
< .pagination.page
< .pagination.pages
< .pagination.per_page
---
> .pagination.limit
> .pagination.offset

This is great: we can tell that the API changed from page-based pagination to offsets and limits.
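The same comparison can be done programmatically when the two digests are already available as lists of paths. The sketch below uses plain set differences instead of `diff`, so it reports which paths disappeared and which appeared without line-number context. The function name `structure_changes` and the sample path lists are assumptions for illustration.

```python
def structure_changes(old_paths, new_paths):
    """Return (removed, added) paths between two structure digests."""
    old_set, new_set = set(old_paths), set(new_paths)
    removed = sorted(old_set - new_set)   # present before, gone now
    added = sorted(new_set - old_set)     # newly introduced
    return removed, added


# Hypothetical digests matching the diff output above.
old_digest = [
    ".pagination.items",
    ".pagination.page",
    ".pagination.pages",
    ".pagination.per_page",
]
new_digest = [
    ".pagination.items",
    ".pagination.limit",
    ".pagination.offset",
]

removed, added = structure_changes(old_digest, new_digest)
for path in removed:
    print("<", path)
for path in added:
    print(">", path)
```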

Learn more & respond

The project is on GitHub. Please follow it there for new features and changes.

What do you think of this tool? Do you love it? Do you hate it? Let me know in the comments.
