Xenon Design Rationale

The New Standard for Data

r1.0 — 18th November 2024
Gene Thomas
Planet Earth Software

This document expands upon the Xenon introduction document.

Xenon uses the <angle bracket> notation that proved successful in html and xml. The markup addresses deficiencies in xml, namely: first class support for arrays and a graph structure, removal of attributes, and support for readable indented text.

<config>
    <hostname=server.labs.com>
    <<ips>
        24.167.104.162
    <&>
        127.0.0.1
    <$>>
<$>

Xml and json are both overly simple, at an unacceptable cost to the people writing the markup. Neither xml attribute values nor json property names should require "quotes" when the string contains no whitespace, as is already the case in html and JavaScript.

Arrays

xml’s support for arrays is awkward:

<PurchaseOrder>
    <ItemsOrders>
        <Item>
            <ItemID>aaa111</ItemID>
            <ItemPrice>34.22</ItemPrice>
        </Item>
        <Item>
            <ItemID>bbb222</ItemID>
            <ItemPrice>2.89</ItemPrice>
        </Item> 
    </ItemsOrders>
</PurchaseOrder>

Here the array ItemsOrders is represented by two sub elements called Item. Nothing indicates that ItemsOrders is an array, or that Item is the type of the data rather than its name. Item is redundant: it could be replaced by any markup delimiting the objects, or removed entirely.

<PurchaseOrder>
    <ItemsOrders>
        <>
            <ItemID>aaa111</ItemID>
            <ItemPrice>34.22</ItemPrice>
        <$>
        <>
            <ItemID>bbb222</ItemID>
            <ItemPrice>2.89</ItemPrice>
        <$> 
    </ItemsOrders>
</PurchaseOrder>

If symbols inside angle brackets, e.g. <$>, are unfamiliar, bear with us; one becomes accustomed to them. $ is chosen to represent the end of an object, as in regular expressions, and the vertical bar inside the dollar sign suggests an end.

Delineating the items clarifies the structure and facilitates removing the markup surrounding the items, hence <&> between items. & is chosen to separate items as it can be read as “this item and that item”; query strings in urls use & for the same purpose. <> was considered but is not as eye-catching and hence not as readable.

<PurchaseOrder>
    <ItemsOrders>
        <>
            <ItemID>aaa111</ItemID>
            <ItemPrice>34.22</ItemPrice>
        <$>
    <&>
        <>
            <ItemID>bbb222</ItemID>
            <ItemPrice>2.89</ItemPrice>
        <$> 
    </ItemsOrders>
</PurchaseOrder>

The <> and <$> are now redundant; removing them does not introduce ambiguity except when the object has no fields. Being terse is desirable.

<PurchaseOrder>
    <ItemsOrders>
        <ItemID>aaa111</ItemID>
        <ItemPrice>34.22</ItemPrice>
    <&>
        <ItemID>bbb222</ItemID>
        <ItemPrice>2.89</ItemPrice>
    </ItemsOrders>
</PurchaseOrder>

Xenon marks the array as such with an additional angle bracket on its opening and closing markup:

<PurchaseOrder>
    <<ItemsOrders>
        <ItemID>aaa111</ItemID>
        <ItemPrice>34.22</ItemPrice>
    <&>
        <ItemID>bbb222</ItemID>
        <ItemPrice>2.89</ItemPrice>
    </ItemsOrders>>
</PurchaseOrder>

The only problem with this syntax is that a seemingly empty array in fact has one item, the scalar empty string.

<<ItemsOrders>
</ItemsOrders>>

So a special syntax is required for an empty array: <<ItemsOrders$$>>
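The ambiguity can be illustrated with a naive splitter. This is only a sketch of the reasoning, not how a xenon library parses arrays: split_items is a hypothetical helper that treats the text between the array delimiters as scalar items separated by <&>.

```python
SEPARATOR = "<&>"


def split_items(body: str) -> list[str]:
    """Split the text between the array delimiters into scalar items.

    Hypothetical helper for illustration only; nesting and escaping are ignored.
    """
    return [item.strip() for item in body.split(SEPARATOR)]


two_items = split_items("\n    24.167.104.162\n<&>\n    127.0.0.1\n")
seemingly_empty = split_items("\n")

print(two_items)        # two scalar items
print(seemingly_empty)  # one item, the empty string, rather than no items
```

Because splitting an empty body still yields one (empty) item, an array with no items cannot be written this way and needs its own notation.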

Alternative — always markup items

The empty array syntax could be removed if objects, arrays and scalars always had markup around them. The empty object case would no longer be special. Also <&> could then be omitted. This is however verbose in the common case.

<PurchaseOrder>
    <<ItemsOrders>
        <^>
            <ItemID>aaa111</ItemID>
            <ItemPrice>34.22</ItemPrice>
        <$>
        <^>
            <ItemID>bbb222</ItemID>
            <ItemPrice>2.89</ItemPrice>
        <$> 
    </ItemsOrders>>
</PurchaseOrder>

Then the empty array could become:

<PurchaseOrder>
    <<ItemsOrders>
    </ItemsOrders>>
</PurchaseOrder>

The array items may be scalars so markup surrounding the items is required or the above example would be interpreted as a single array item of the empty string. That markup would have to be different from an object with no fields.

<PurchaseOrder>
    <<ItemsOrders>
        <*>
        <$>
    </ItemsOrders>>
</PurchaseOrder>

A list of strings would be:

<<Strings>
    <*>
        A string
    <$>
    <*>
        Another string
    <$>
    <*>
        Yet another string
    <$>
</Strings>>

And an empty object would become:

<PurchaseOrder>
    <<ItemsOrders>
        <^>
        <$>
    <&>
        <^>
            <ItemID>bbb222</ItemID>
            <ItemPrice>2.89</ItemPrice>
        <$> 
    </ItemsOrders>>
</PurchaseOrder>

This plausible alternative syntax is in most cases awkward and verbose.

Arrays within Arrays

Arrays may appear as items in an array. The <<name> syntax is used, except that array items do not have a name so it is left out: <<>.

<<Items>
    <<>
        One item
    <&>
        Another item
    <$>>
<&>
    <<>
        Yet another item
    <&>
        An additional item
    <$>>
<$>>

The empty array syntax is similarly used without the array name:

<<Items>
    <<>
        One item
    <&>
        Another item
    <$>>
<&>
    <<$$>>
<$>>

Graph Structure

Data may possess a graph structure, not just be a tree. First class support for a graph structure is a boon. One must be able to specify that a datum exists somewhere else in the document. # was chosen to represent an id, as it marks the position in the page in html and an id in css. References to these ids are marked with @.

<Person>
    <Name=Bonnie>
    <Spouse#jack-smith>
        <Name=Jack>
    <$>
    <Doctor=@jack-smith>
<$>

A person whose spouse and doctor are the same person, Jack.
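The Bonnie and Jack example can be mimicked with an ordinary object graph. The sketch below is an assumption about how an emitter might work, not the xenon library's api: emit is a hypothetical function, the id labels are supplied by the caller, and escaping is ignored. The first occurrence of a labelled object is written with #, later occurrences with @.

```python
def emit(name, value, ids, seen, indent=0):
    """Return lines of Xenon-style markup for nested dicts of strings.

    `ids` maps id(obj) to a label for objects referenced more than once;
    `seen` collects the objects already written out. Hypothetical sketch.
    """
    pad = "    " * indent
    if isinstance(value, str):
        return [f"{pad}<{name}={value}>"]
    label = ids.get(id(value))
    if label is not None and id(value) in seen:
        return [f"{pad}<{name}=@{label}>"]  # later occurrence: a reference
    seen.add(id(value))
    lines = [f"{pad}<{name}#{label}>" if label else f"{pad}<{name}>"]
    for key, child in value.items():
        lines.extend(emit(key, child, ids, seen, indent + 1))
    lines.append(f"{pad}<$>")
    return lines


jack = {"Name": "Jack"}
bonnie = {"Name": "Bonnie", "Spouse": jack, "Doctor": jack}  # same object twice

markup = "\n".join(emit("Person", bonnie, {id(jack): "jack-smith"}, set()))
print(markup)
```

A tree-only format would have to duplicate Jack or invent an ad hoc key; here the shared identity survives in the markup.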

One can point the xenon library at arbitrarily structured data and generate representative markup.

Terse Sub-Elements not Attributes

xml supporting attributes and sub elements shows its heritage from html and sgml, languages for marking up text, not specifically designed for describing data. Xenon only has sub elements but with a terse syntax.

<Person Name="Fred">
    <Height>1.67</Height>
    <Age>30</Age>
</Person>

becomes:

<Person>
    <Name=Fred>
    <Height=1.67>
    <Age=30>
<$>

One no longer has to guess which the api designer chose.

The result is terse, more terse than even json. A basic scalar pair in json is "key":"value", or "key":value, so an overhead of 6 or 4 characters including the separating comma. In xenon it is <key=value>, 3 characters. Xml sub elements are <key>value</key>, 5 + len(key), approximately 10 characters of overhead.
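These counts can be checked with a few lines of code. The sketch below is illustrative only; overhead is a hypothetical helper that counts the characters belonging to neither the key nor the value, counting the key once as payload.

```python
def overhead(template: str, key: str, value: str) -> int:
    """Characters in the rendered pair that are neither the key nor the value."""
    rendered = template.format(key=key, value=value)
    return len(rendered) - len(key) - len(value)


KEY, VALUE = "Height", "1.67"

json_string = overhead('"{key}":"{value}",', KEY, VALUE)  # quoted value plus comma
json_number = overhead('"{key}":{value},', KEY, VALUE)    # bare value plus comma
xenon = overhead("<{key}={value}>", KEY, VALUE)
xml = overhead("<{key}>{value}</{key}>", KEY, VALUE)      # the key is repeated

print(json_string, json_number, xenon, xml)
```

For the six-letter key used here the xml figure is 11, in line with the estimate of approximately 10.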

= was chosen over :, which is used on the Internet (email and web), because = catches the eye more. A space was considered, as it is visually apparent, but it is unclear when the value contains spaces and it would require escaping leading and trailing spaces. "Quotes" around values are not used as the values are already delimited.

Indented Text

xml does not specify rules for processing text so text is copied verbatim, e.g.

<People>
    <Person>
        <Speech>I have said
this and
this</Speech>
    </Person>
</People>

This clearly limits readability. Xenon intuitively removes the indentation, along with leading spacing and a newline, and in an array also removes a trailing newline and spacing.

<People>
    <Person>
        <Speech=
            I have said
            this and
            this>
    <$>
<$>

Json does not support multiple line text; line breaks must be escaped as \n.

{
    "Person": {
        "Speech": "I have said\nthis and\nthis"
    }
}

The algorithm for processing multiple lines of text is as follows. The first line is never unindented: it is removed if it consists of spacing (tabs and spaces) followed by a newline, otherwise it is left intact. Well formed xenon, as output by a xenon library, never has non-spacing in the first line when there are multiple lines. If all lines to be unindented are spacing, just the newlines are preserved. The text is unindented by the indentation of the least indented line. Tabs are expanded to spaces such that there is a tab stop every 8 columns, as per Windows and Unix terminals. A tab’s width in spaces is 8 - ((column - 1) % 8), where the column is numbered from 1 and % is the remainder (modulus). If the scalar being unindented is in an array and the item ends with a newline followed only by spacing, that newline and spacing are removed.
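The steps above can be sketched in Python. This is a minimal reading of the rules, not the reference implementation: the function names are hypothetical, only tabs in the leading spacing are expanded, every line after the first is assumed to start at column 1, and spacing-only lines are assumed not to count towards the least indentation.

```python
def expand_leading_tabs(line: str, tab_stop: int = 8) -> str:
    """Expand tabs in the leading spacing of `line` to spaces."""
    column = 1  # columns are numbered from 1, as in the text
    i = 0
    while i < len(line) and line[i] in " \t":
        if line[i] == "\t":
            column += tab_stop - ((column - 1) % tab_stop)
        else:
            column += 1
        i += 1
    return " " * (column - 1) + line[i:]


def is_spacing(line: str) -> bool:
    """True if the line holds nothing but spaces and tabs."""
    return line.strip(" \t") == ""


def unindent(raw: str, in_array: bool = False) -> str:
    """Unindent a multiple line scalar (sketch under the stated assumptions)."""
    first, newline, rest = raw.partition("\n")
    if not newline:
        return raw  # a single line is left as it is
    lines = [expand_leading_tabs(line) for line in rest.split("\n")]
    if in_array and is_spacing(lines[-1]):
        lines.pop()  # trailing newline and spacing of an array item
    indents = [len(l) - len(l.lstrip(" ")) for l in lines if not is_spacing(l)]
    margin = min(indents, default=0)
    body = ["" if is_spacing(l) else l[margin:] for l in lines]
    kept_first = [] if is_spacing(first) else [first]  # first line: drop or keep intact
    return "\n".join(kept_first + body)


speech = "\n            I have said\n            this and\n            this"
print(unindent(speech))
```

Applied to the Speech example, the blank first line is dropped and the twelve columns of shared indentation are removed, giving the same text as the json string.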

If all lines are indented, | is used to specify where the indentation is measured from, e.g.:

<People>
    <Person>
        <Speech=
            | I have said
              this and
              this>
    <$>
<$>

Extracts to “ I have said\n this and\n this”.
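A minimal sketch of this rule, assuming the | sits on the first text line and every later line is indented at least as far: text is kept from the column just after the |. The name unindent_from_bar is hypothetical.

```python
def unindent_from_bar(lines: list[str]) -> str:
    """Unindent `lines` from the column just after the | on the first line."""
    origin = lines[0].index("|") + 1  # raises ValueError if there is no |
    return "\n".join(line[origin:] for line in lines)


speech = [
    " " * 12 + "| I have said",
    " " * 14 + "this and",
    " " * 14 + "this",
]
text = unindent_from_bar(speech)
print(repr(text))
```

Each extracted line keeps the single leading space that follows the |, matching the extraction quoted above.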

12 alternatives were evaluated for the two cases:

<config>
    <description
        |
          a line
          another line>
<$>

<config>
    <description
         a line
         another line
        >
<$>

<config>
    <description <
         a line
         another line
        >>
<$>

<config>
    <description [[
         a line
         another line
        ]]
<$>

<config>
    <description.2
       a line
       another line>
<$>

<config>
    <description=
     a line
     another line>
</config>

<config>
    <description
        = a line
          another line>
<$>

<config>
    <description
    == a line
       another line>
<$>

<config>
    <description= a line
                  another line>
<$>

<config>
    <description:
        a line
        another line
    <$>
</config>

<config>
    <description:>
        a line
        another line
    <$>
</config>

<config>
    <description:>
        a line
        another line
    <.>
</config>

No Cdata

Xml’s <![CDATA[ ... ]]> was considered but tidy indenting is impossible since | must be able to appear un-escaped in the CDATA section. We could indent from the closing ]]> but that would result in two methods to specify indentation.

Escaping

C style \ escaping was adopted as it is familiar to many. It leaves the original character intact, which arguably aids readability, as opposed to entities such as &lt; in xml.

New {<}, {tab}, {U+1F62D} style escaping was considered.

Comments

% was chosen as it reads like “something unusual”. This is shared with TeX and PostScript. ; (from Alan Turing’s original notes!) was considered, but since it is used with #ids and :types in arrays it would complicate the lexer, requiring state when implementing Xenon with a parser generator, and would stand in the way of development.

<% comment>, an inline comment, was considered but deemed unnecessary.

Formats

Xenon defines a number of recommended formats to aid interoperability between implementations, e.g. how to write a floating point number so that another language’s xenon library is able to decode it. We have used established standards.

Commas

Commas should be used in numbers, as this makes them more readable. Where possible numbers should be presented for humans. English is the global lingua franca so its large number format is a good choice. Code to add commas to numbers is available to xenon implementers.
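As an illustration of the format, and not the code referred to above, Python's built-in format specification can add the commas, and stripping them before parsing recovers the number. Both helper names are hypothetical.

```python
def with_commas(number) -> str:
    """Format an int or float with commas between groups of three digits."""
    return f"{number:,}"


def without_commas(text: str) -> int:
    """Parse an integer written with comma separators."""
    return int(text.replace(",", ""))


print(with_commas(1234567))         # 1,234,567
print(without_commas("1,234,567"))  # 1234567
```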

Base64

Xenon uses the rfc 4648 standard for Base64 which uses the original + and / with = for padding.
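For example, Python's standard base64 module produces this alphabet by default. The sample bytes are chosen only to show the + and / characters and the = padding.

```python
import base64

# Three bytes whose 6-bit groups are 62, 63, 63, 62: exercises + and /
symbols = base64.b64encode(bytes([0xFB, 0xFF, 0xFE]))

# Two bytes do not fill a 24-bit group, so one = of padding is added
padded = base64.b64encode(b"Xe")

print(symbols.decode("ascii"))  # +//+
print(padded.decode("ascii"))   # WGU=
```

The url-safe variant from the same rfc replaces + and / with - and _, and is not the one described here.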

Specifics

Documents must be utf-8 and should have a byte order mark. utf-8 is the present de-facto standard Unicode charset. The bom recommendation is to future-proof xenon against advances in character set encoding.
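A short sketch of writing and reading such a document in Python, whose "utf-8-sig" codec adds the byte order mark when encoding and strips it when decoding. The document text is taken from the earlier config example.

```python
import codecs

document = "<config>\n    <hostname=server.labs.com>\n<$>\n"

encoded = document.encode("utf-8-sig")  # utf-8 with a leading byte order mark
decoded = encoded.decode("utf-8-sig")   # the mark is removed again on reading

print(encoded[:3] == codecs.BOM_UTF8)   # True
print(decoded == document)              # True
```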