Stop YAML abuse and IT cruelty

YAML should die (Actually it should not.)

TL;DR

YAML is a pretty powerful and convenient human-readable serialization language.

Unfortunately it has been abused and twisted to do things it was never meant to do.

Software engineers should consider using (or creating) a DSL as soon as possible in their design process.

And finally, I think the world deserves at least one standardized DSL dedicated to CI/CD.

Introduction

This is another rant that stayed in my head for too long before I finally decided to write it here.

Hell, Martin Tournoij’s post “YAML: probably not so great after all” dates back to 2016!

I have been using YAML for nine years; some points have been addressed since, but I’m still angry enough to write this article.

The first lines of https://yaml.org/ read:

YAML: YAML Ain't Markup Language
What It Is: YAML is a human friendly data serialization
 standard for all programming languages.

And what is it being used for?

  • in place of a DCL

  • in place of a DSL

  • to replace XML

YAML versus XML

XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable

https://en.wikipedia.org/wiki/XML

I’ll give you that: YAML is undoubtedly more human-friendly than XML, produces smaller files and is arguably more machine-friendly as well (see XML criticism below).

Except that, since its early days, XML has had standardized methods to validate a document 1. There are some initiatives in the YAML world 2, but they are neither widely adopted nor quite straightforward.

XML criticism

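To the naive, XML might seem an easier syntax to parse at first sight (especially if you know about my YAML criticism), but the specification has some really fucked-up features.

The first that comes to mind is the different ways to attach a value to an element (the first two forms below are strictly equivalent; the third carries the same information in a child element instead of an attribute, and formats are free to pick either):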

<?xml version="1.0"?>
<params>
 <param foo="bar"/>
 <param foo="bar"></param>
 <param>
  <foo>bar</foo>
 </param>
</params>
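
Whether these forms count as equivalent depends on who consumes the document. A minimal sketch with Python's standard xml.etree.ElementTree (the describe helper is a name made up for this illustration) suggests that a parser sees the first two as the same tree, while the third keeps its value in a child element:

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<params>
 <param foo="bar"/>
 <param foo="bar"></param>
 <param>
  <foo>bar</foo>
 </param>
</params>"""


def describe(param):
    """Hypothetical helper: report where a <param> element keeps its data."""
    attributes = dict(param.attrib)
    children = {child.tag: child.text for child in param}
    return attributes, children


root = ET.fromstring(DOCUMENT)
summaries = [describe(param) for param in root.findall("param")]
for summary in summaries:
    print(summary)
```

Any application that wants to accept all three spellings has to handle both the attribute and the child-element case itself.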

Another is the myriad of ways to encode strings, which have been abused throughout history to make documents do things they were not supposed to.

The language is extensible and an XML parser should follow the extensions; XLink is an example (which can be used like YAML anchors within a single document).

SVG being XML, it supports linking; or at least it should, but not all SVG renderers implement it because it’s a pain.

XML is more verbose than YAML (well, XML is more verbose than anything I can think of, except maybe some Microsoft formats) and a lot of programs rely on DOM parsing, which requires loading the whole document before processing it. Thankfully there is SAX, which allows incremental parsing.
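
To make the DOM/SAX difference concrete, here is a small sketch using only Python's standard library. The document is generated in memory and the item element name, the ItemCounter class and both count_items_* functions are assumptions made for the example:

```python
import io
import xml.dom.minidom
import xml.sax

# Hypothetical document, built in memory to stand in for a large file.
ITEM_COUNT = 1000
xml_bytes = (
    "<items>"
    + "".join(f'<item id="{i}"/>' for i in range(ITEM_COUNT))
    + "</items>"
).encode("utf-8")


def count_items_dom(stream):
    """DOM: the whole document is loaded into a tree before we can query it."""
    document = xml.dom.minidom.parse(stream)
    return len(document.getElementsByTagName("item"))


class ItemCounter(xml.sax.ContentHandler):
    """SAX: the parser calls us back for each element as it is read."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1


def count_items_sax(stream):
    handler = ItemCounter()
    xml.sax.parse(stream, handler)
    return handler.count


dom_count = count_items_dom(io.BytesIO(xml_bytes))
sax_count = count_items_sax(io.BytesIO(xml_bytes))
print(dom_count, sax_count)
```

Both functions return the same count; the SAX version never holds more than the current element, which is what matters once the file no longer fits comfortably in memory.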

The last criticism would be that a slight mistake renders the whole document invalid, and humans make mistakes.

YAML in place of a DCL or DSL

Note

Hereafter I will use the terms DCL and DSL interchangeably because the line between them is somewhat thin… Probably because the authors of the software themselves weren’t sure.

This part is the core of my complaint.

As a sysadmin (oops, the fashionable buzzword of this era is DevOps) I have to deal with the insufferable nonsense of using YAML instead of an appropriate DSL, which leaves me no way to verify what I wrote without executing it.

Configuration Management Software:

Ansible, Juju, SaltStack… are some of those that made the choice of “simplicity” and use YAML.

This probably led James Shubin to open a few of his talks 3 with the line: “Hey guys! Do you [really] want to become YAML engineers‽” (an idea the audience generally cheered against).

Continuous Integration:

Circle CI, Concourse CI, GitLab CI, Travis CI… In fact most CI systems rely on YAML to describe your pipelines.

IaaS, PaaS, IaC… The Cloud!

Cloud Foundry, Docker, Kubernetes… Okay, I’ll stop here.

K8s has wreaked havoc.

As soon as you get a little serious with it, you need to use templates and YAML generators (Helm, Kapitan, ytt…), resulting in “configuration files” running to thousands of lines (it’s even Stackery’s sales pitch).

Is YAML The Answer?

(Hint: No, it’s 42).

How many times did you have to rewrite your playbook/tasks before you could achieve what you meant?

The YAML you wrote was valid, it was even accepted by the program, it began to execute and then b̶̨̼̣̑o̵̜̝̝͌́͝ỏ̷̘̜̱͔̽͘͝ḿ̷̤̯̻͙͒͛̈́…☠
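
To illustrate the kind of check that is missing, here is a hedged sketch of what a dedicated language could verify before running anything. The pipeline structure, its key names and the check_pipeline function are all hypothetical; the dictionary stands for what a YAML loader would hand to the program after parsing a perfectly valid file:

```python
# Hypothetical pipeline, shown as the Python structure a YAML loader would return.
pipeline = {
    "stages": ["build", "test"],
    "jobs": {
        "compile": {"stage": "build", "script": ["make"]},
        # Valid YAML, but the stage name has a typo and script is not a list.
        "unit": {"stage": "tset", "script": "make check"},
    },
}


def check_pipeline(document):
    """Return the problems that can be found without executing any job."""
    problems = []
    stages = document.get("stages", [])
    for name, job in document.get("jobs", {}).items():
        if job.get("stage") not in stages:
            problems.append(f"{name}: unknown stage {job.get('stage')!r}")
        if not isinstance(job.get("script"), list):
            problems.append(f"{name}: script must be a list of commands")
    return problems


problems = check_pipeline(pipeline)
for problem in problems:
    print(problem)
```

Nothing in YAML itself can express these rules; every tool has to bolt them on afterwards, if it does so at all.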

Now a somewhat rhetorical question:

“If no human is going to write those files, and nobody is going to read them, what is the point of using YAML?”

> “You are wrong, humans are writing those files and reading them!”

Well, sort of. Humans are writing “parts” of those files, cursing at the workarounds or the lack of them…

Whenever they use templates (or a UI, like Stackery), I consider that they are not writing them; they could have written in an appropriate language which, in its turn, like the template, would have been “compiled” into some appropriate machine-readable format.

Whenever they use a command to read them (yq 4, jq, Visual Studio Code Kubernetes Tools, K8sYAML…), they are not reading them; they are using a program to make human-readable something that was not.

You use a CLI or a GUI to interact with your databases, don’t you? And to read/write images or their Exif metadata? This is the same; it should be the same…

Furthermore, YAML is not even easy to parse (see YAML criticism)!

Some software made the [right!] choice to develop their own DCL/DSL, or at least to use something more appropriate than YAML.

YAML criticism

As it is a generalist language, it is not a simple syntax to parse either.

It was created to serialize the data structures of many different languages while remaining pleasant for a human to read.

Let’s explore a few of its features:

  • builtin types (booleans, empty and null differentiation, floating point numbers, integers, mappings, scalars, sequences, timestamps, unordered sets…)

  • custom types, tags, tags shorthand

  • anchoring/aliasing, creating (at least) two issues:

    • self-referential / circular data structures are usually not welcome in CMS, CI/CD and configuration

    • it is not intended for “recursive merging”, which led the GitLab team to create the extends keyword as a workaround

  • multiple ways to write the same thing:

    • builtin types may be explicit or implicit; in the following example (just for a string scalar), all lines are equivalent:

      - !!str "string"
      - "string"
      - 'string'
      - string
      - !<tag:yaml.org,2002:str> "string"
      - "\x73\x74\x72\x69\x6e\x67"
      - "\u0073\u0074\u0072\u0069\u006e\u0067"
      
    • overly smart integer parsing (under YAML 1.1, every line below is the integer 11):

      - !!int "11"
      - 11
      - 0xb
      - 0xB
      - 013
      - 1_1
      
    • There are 9 (or 63, depending how you count) different ways to write multi-line strings in YAML.

      https://stackoverflow.com/a/21699210/248390

  • optional header for version specification and directives

  • optional document separators

  • automatic but parametrizable indentation level

  • contexts

  • smart line/flow folding, block chomping

  • and much more…
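
The merging limitation mentioned in the anchoring item above can be sketched in a few lines. YAML's merge key (<<), typically used together with anchors, only copies top-level keys; the two functions below are illustrative stand-ins written for this article, not the code of any real tool, and the template and job values are made up:

```python
def shallow_merge(base, override):
    """Mimic the merge key: a top-level key of override replaces the one in base."""
    merged = dict(base)
    merged.update(override)
    return merged


def deep_merge(base, override):
    """Recursive merge: nested mappings are combined key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical job template, and a job that overrides a single nested variable.
template = {"image": "alpine", "variables": {"LANG": "C", "DEBUG": "0"}}
job = {"variables": {"DEBUG": "1"}}

shallow = shallow_merge(template, job)
deep = deep_merge(template, job)
print(shallow)
print(deep)
```

With the shallow version the LANG variable silently disappears, which is rarely what the author of the job wanted; the recursive version is the behaviour that has to be added on top of YAML.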

I’m sure you do not need all of these features and I am pretty confident you have not even heard of half of them.

And yet, you need some other features that are incompatible with the language (like loops, includes, variables…)!
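
To give an idea of how much hides behind just one of them, here is a rough sketch of plain-scalar integer resolution, using the integer forms listed above. The patterns are transcribed from the YAML 1.1 integer type (base 60 left out for brevity) and from the YAML 1.2 core schema; resolve_int and the two rule tables are names invented for this example, not part of any YAML library:

```python
import re

# (pattern, base) pairs for YAML 1.1 plain scalars; sexagesimal form omitted.
YAML11_INT = [
    (re.compile(r"[-+]?0b[01_]+"), 2),
    (re.compile(r"[-+]?0x[0-9a-fA-F_]+"), 16),
    (re.compile(r"[-+]?0[0-7_]+"), 8),
    (re.compile(r"[-+]?(0|[1-9][0-9_]*)"), 10),
]

# The YAML 1.2 core schema is stricter: no underscores, octal needs "0o".
YAML12_INT = [
    (re.compile(r"0o[0-7]+"), 8),
    (re.compile(r"0x[0-9a-fA-F]+"), 16),
    (re.compile(r"[-+]?[0-9]+"), 10),
]


def resolve_int(scalar, rules):
    """Return the integer a plain scalar resolves to, or None if it stays a string."""
    for pattern, base in rules:
        if pattern.fullmatch(scalar):
            try:
                return int(scalar.replace("_", ""), base)
            except ValueError:  # e.g. a prefix followed only by underscores
                return None
    return None


samples = ["11", "0xb", "0xB", "013", "1_1"]
as_yaml11 = [resolve_int(sample, YAML11_INT) for sample in samples]
as_yaml12 = [resolve_int(sample, YAML12_INT) for sample in samples]
print(as_yaml11)
print(as_yaml12)
```

Under these rules, 013 is eleven for a YAML 1.1 parser and thirteen for a YAML 1.2 one, and 1_1 stops being a number at all; which answer you get depends on the library your tool happens to embed.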

The Configuration Management World has it

CFEngine, mgmt, Puppet and Terraform made their own languages.

cdist uses POSIX sh, Chef chose Ruby…

It has been demonstrated that, in the Configuration Management world, some brilliant minds arrived at the conclusion that a DSL was a necessity.

The Continuous World

In Configuration Management, once I was fed up with YAML usage, there were alternatives available I could turn to.

Alas, in the Continuous world, there is no fallback… And yet, the configuration here is arguably easier than in the other (configuration management, IaaS, …) domains because:

  • the maximum scope of the CI/CD is (more or less) already known

  • all solutions already share a common subset, and most of them tend to converge on the same features

Honestly, I long for the main actors to sit down together and write an RFC for an interoperable DSL…

It should be possible to write something portable using a simple subset of the language, and to identify which software implements which parts, either in each project’s documentation or in a centralized manner (a bit like what Can I use does for browsers).

Other writings

Call it confirmation bias, but I do not feel alone in raging against YAML abuse:

A few quotes from the start of https://news.ycombinator.com/item?id=20731160 (mid-2019):

ivan4th:

From my experience, while YAML itself is something one can learn to live with, the true horror starts when people start using text template engines to generate YAML. Like it’s done in Helm charts, for example. Aren’t these “indent” filters beautiful?

DonHopkins:

I developed Yet Another JSON Templating Language, whose main virtue was that it was extremely simple to use and implement, and it could be easily implemented in JavaScript or any other languages supporting JSON.

We had joy, we had fun, we had seasons in the sun, but as I added more and more features and syntax to cover specific requirements and uncommon edge cases, I realized I was on an inevitable death-march towards my cute little program becoming sufficiently complicated to trigger Greenspun’s tenth rule.

There is no need for Yet Another JSON Templating Language, because JavaScript is the ultimate JSON templating language. Why, it even supports comments and trailing commas!

Just use the real thing to generate JSON, instead of trying to build yet another ad-hoc, informally-specified, bug-ridden, slow implementation of half of JavaScript.

PopeDotNinja:

> the true horror starts when people start using text template engines to generate YAML

I just had a shiver recalling a Kubernetes wrapper wrapper wapper wrapper at a former job. I think there were at least two layers of mystical YAML generation hell. I couldn’t stop it, and it tanked much joy in my work. It was a factor in me moving on.


1

Wikipedia has a big list of them.

The ones I encountered most frequently were DTD and, more recently, XSD and RELAX NG.

2

JSON Schema seems to be the most portable and adopted solution at the time of this writing (https://json-schema-everywhere.github.io/yaml).

Other notable alternatives are:

3

About his creation: mgmt; a “next generation distributed, event-driven, parallel config management”.

James was an engineer at Red Hat, working on Puppet, which led him to create mgmt.

4

yq is a Python wrapper on top of jq (https://stedolan.github.io/jq/).

It is highly inefficient on large files, partly because it uses a DOM-like approach (loading the whole file before doing any processing).