JData: a general-purpose data storage and interchange format

Status of This Document
This document is currently in the "Request for Comments" state. It does not specify a standard of any kind.

Copyright Notice
Copyright (C) Qianqian Fang (2015).

License
Apache License, Version 2.0

Version
RFC.pre.0

Last Update
Jan. 02, 2015

Abstract
JData is a general-purpose data interchange format. It was derived from the JavaScript Object Notation (JSON) [RFC 4627] and Universal Binary JSON specifications. JData defines a list of dedicated "name" fields to represent a wide range of data structures, including scalars, arrays, tables, structures, hashes, trees and graphs.
1. Introduction
1.1. Background
1.2. JSON and UBJSON
1.3. JData Specification Summary
2. Data Storage Grammar
2.1. Text-based JData Storage Grammar
2.2. Binary JData Storage Grammar
3. Data Storage Models
4. Data Storage Keywords
4.1. Data Grouping Keywords
4.1.1. Data Group
4.1.2. Dataset
4.1.3. Data record
4.2. Data Structure Keywords
4.2.1. Special Constants
4.2.2. Numerical Scalars
4.2.3. Strings
4.2.4. Numerical Arrays
4.2.4.1. Full Arrays and Vectors
4.2.4.2. Complex Arrays
4.2.4.3. Sparse Arrays
4.2.4.4. Complex Sparse Arrays
4.2.5. Meta-data Keywords
5. Recommended File Specifiers

1. Introduction

1.1. Background

Data are the digital representations of our world. Generating and processing data are the essential parts of our daily lives, and are at the very foundations of modern sciences, information technologies, businesses, and the interactions between our global societies.

Data take many different forms. Some data can be represented by simple scalars; others have complex forms with hierarchical structures. An efficient representation of data also strongly depends on the needs of the application. In some cases, plain text files with white-space separated fields are sufficient; however, for performance-sensitive applications, using binary formats can significantly reduce loading and processing time. The ability to store and parse complex data structures is particularly important to the scientific communities.

It is a challenging task to encapsulate the wide variety of data forms in a single data interchange format. There have been many previous efforts to design a general-purpose data storage specification. Some of them have become popular choices in one or more categories of applications. Extensible Markup Language (XML), for example, is ubiquitously used as a data-exchange format, but its verbose syntax, moderate parsing complexity, impeded readability and inefficiency in expressing structured data leave room for alternative formats. Comma Separated Values (CSV), a rather simple plain-text format, is used by some applications to exchange tabular data (such as a spreadsheet); yet its inability to encode more complex data forms and its lack of flexibility and precision restrict it to specific applications.

Hierarchical Data Format (HDF) is a format targeting the broad needs of the scientific communities. It has an extensible hierarchical data model with a large capacity to represent complex binary data. However, using HDF effectively requires skillful implementation and an in-depth understanding of the underlying data models. For small projects with non-critical performance needs, using an advanced data format such as HDF may require additional development and maintenance efforts. Similar arguments can be made for the Common Data Format (CDF) or Network Common Data Format (netCDF), which are partly derived from HDF. In addition, the MATLAB mat-file format and Tecplot data format are also used among the research communities. Because they require proprietary software or libraries, these formats have also had difficulty achieving widespread use outside the user communities of the associated software.

1.2. JSON and UBJSON

The JavaScript Object Notation (JSON) format is a text-based data format that is known for its capability of storing complex data, excellent portability and human-readability. JSON is widely adopted in modern web applications, and is becoming popular among local/native applications. The key advantages of JSON include:

  1. simplicity: JSON data are composed of a list of "name":value pairs; such a simple grammar greatly eases the use and parsing of the data file; free JSON encoders and decoders are widely available for most popular programming languages;
  2. human readability: the text-based nature of JSON and its clean, easy-to-read format make it intuitively readable without in-depth knowledge of the format itself;
  3. hierarchical data support: JSON has a tree-like data storage paradigm which has the native capacity to support complex hierarchical data structures; there is no inherent data size limit imposed by the format itself;
  4. web ready: because JSON can be readily parsed by JavaScript, most JSON-encoded data files can be directly used (inline or loaded from a remote site) by a JavaScript-based web application.

JSON also has limitations. JSON's "value" fields are weakly typed. They only support strings, numbers, and Boolean types, and lack the granularity to represent numerical types of different byte-lengths (in the C language, for example, short, int, long int, float, double, long double) and signedness (signed and unsigned). Because JSON is a text-based format, a data file can be significantly larger than its binary equivalent and requires additional conversion when used in an application. This introduces overhead in both storage and processing.

The Universal Binary JSON (UBJSON) is a binary counterpart to the JSON format. It specifically addresses the above-mentioned limitations, yet adheres to a simple grammar similar to that of the text-based JSON. As a trade-off, it loses "human readability" to a certain extent. Although parsers for UBJSON are not as abundant as those for JSON, the simplicity of the format means that the cost of implementing a parser for a new programming language is significantly lower than for more complex data formats.

With their ease of use, superior portability and parser availability, JSON and UBJSON have the potential to serve as mainstream data storage and interchange formats for general needs, especially for the scientific communities. A combination of JSON and its binary counterpart offers features that are not currently available in the existing data storage schemes. Although they do not provide all the advanced features found in the more sophisticated formats, their greatly simplified encoding and decoding strategies permit efficient data sharing among general audiences.

1.3. JData Specification Summary

JData is a specification for storing, exchanging and processing general-purpose data that are commonly encountered in information technology industries and research communities. It has a text/UNICODE format derived from the JSON specification and a binary format derived from the UBJSON specification. JData is designed to represent the commonly used data structures including arrays, structures, trees and graphs. A round-trip conversion is defined between the text and binary versions of JData documents.

The purpose of this document is to define the text and binary JData format specifications. This is achieved through the definition of a semantic layer over the JSON/UBJSON data storage syntax to map various types of complex data structures. This semantic layer includes

  1. a list of reserved "name" fields, or keywords, that define the containers of various data types that are commonly used in research,
  2. a list of reserved "name" fields and formats to facilitate the grouping and organization of hierarchical data,
  3. a list of format properties for the associated "value" field to store the specific metadata of the data points,

and, in addition,

  1. a set of conversion rules between the text and binary forms.

In the following sections, we define the basic JData grammar and data models, followed by the keywords for data grouping and for various data types, including scalars, 1-, 2-, 3- and N-dimensional arrays, sparse and complex arrays, structures, hashes/associative arrays, trees and graphs. The expressions of these data structures in both text and binary forms are specified and exemplified, and their conversion rules are defined.

2. Data Storage Grammar

2.1. Text-based JData Storage Grammar

The text-based JData grammar is identical to the JSON grammar defined in [RFC 4627], except that JData also accepts the Concatenated JSON Object Collection (CJOC) in the following form:

{
   "object1":{}
}
{
   "object2":{}
}
...

In CJOC, multiple JSON objects can be included inside a single file or stream, separated by zero or more JSON white-space characters [RFC 4627], namely

 LF (U+000A), CR (U+000D), tab (U+0009), or space (U+0020)

In the case where JSON Reference is used, the document must have only one root object, and must not contain a CJOC.
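
As an illustration only, a CJOC stream could be read with the Python standard library roughly as follows. This is a minimal sketch, not a reference parser; the function name parse_cjoc is hypothetical, and the sketch assumes the whole stream fits in memory as one string.

```python
import json

# The four JSON white-space characters listed above: LF, CR, tab, space.
JSON_WHITESPACE = "\n\r\t "

def parse_cjoc(text):
    """Return the list of root objects found in a concatenated JSON stream."""
    decoder = json.JSONDecoder()
    roots = []
    pos = 0
    while True:
        # Skip the zero or more white-space characters between objects.
        while pos < len(text) and text[pos] in JSON_WHITESPACE:
            pos += 1
        if pos >= len(text):
            break
        # raw_decode parses one JSON value and returns the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        roots.append(obj)
    return roots

roots = parse_cjoc('{"object1":{}}\n{"object2":{}}')
```

Each element of the returned list corresponds to one root; the list as a whole plays the role of the super-root described in the "Data Storage Models" section.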

2.2. Binary JData Storage Grammar

The binary JData grammar is identical to the UBJSON grammar defined in ubjson.org (Draft 12), with the array grammar extension to support N-dimensional arrays:

 [[] [$] [type] [#] [[] [$] [an integer type] [nx ny nz ...] []] [nx*ny*nz*...*sizeof(type) ] []]

where the integer type can be one of the UBJSON integer types (i,U,I,l,L), and nx, ny, nz, ... are non-negative integers specifying the dimensions of the N-dimensional array. The binary data of the N-dimensional array are then serialized in column-major order (as in MATLAB or Fortran).
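
To make the column-major rule concrete, the position of an element inside the serialized data can be computed as sketched below. The helper name column_major_offset and the use of zero-based indices are assumptions made for this illustration; they are not part of the specification.

```python
def column_major_offset(index, dims):
    """Linear offset (in elements) of a zero-based N-D index, column-major."""
    offset = 0
    stride = 1
    for i, n in zip(index, dims):
        if not 0 <= i < n:
            raise IndexError("index out of range")
        offset += i * stride
        stride *= n  # the first dimension varies fastest
    return offset

# For a 4-by-3 array, moving down one row advances by one element,
# while moving right one column advances by a full column of 4 elements.
row_step = column_major_offset((1, 0), (4, 3))
col_step = column_major_offset((0, 1), (4, 3))
```

Multiplying the offset by sizeof(type) gives the byte position within the data block.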

As a special note, all UBJSON integer types must be stored in the Big-Endian format, according to the specification; the storage of the floating-point types (d,D) follows the IEEE 754 specification.
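
The byte layout of individual numbers can be sketched with the Python struct module, where the ">" prefix selects Big-Endian order. Only the raw value bytes are shown, without the UBJSON type markers. This sketch assumes that the floating-point types are also written most-significant byte first, as in UBJSON; the variable names are illustrative.

```python
import struct

# ">" selects Big-Endian byte order; the letter selects the C type.
int16_bytes = struct.pack(">h", 258)     # UBJSON "I" (int16): 0x0102
int32_bytes = struct.pack(">i", 1)       # UBJSON "l" (int32)
float32_bytes = struct.pack(">f", 1.0)   # UBJSON "d" (IEEE 754 single)
float64_bytes = struct.pack(">d", 1.0)   # UBJSON "D" (IEEE 754 double)

# Reading back uses the same format string.
(value,) = struct.unpack(">h", int16_bytes)
```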

3. Data Storage Models

In JData, an element with no child (the first-level sub-item) is called a "leaf". A meaningful datum, either in the form of a simple data point or a complex structure, is referred to as a "data record". A data record can be a single leaf, or a collection of leaves. A set of logically connected data records, such as the citation information of a paper, is referred to as a "dataset"; a group of logically connected datasets is referred to as a "data group". The top-most level of the JSON object is referred to as the "root". If a JData document contains multiple JSON/UBJSON objects in the form of a CJOC, the upper-most level is referred to as a "super-root". The leaves that contain weakly-relevant or irrelevant data with respect to the processing are referred to as the "auxiliary data". The auxiliary data can appear at any level under the root or super-root.

In the above definitions, only "leaf", "root" and "super-root" have specific meanings. The interpretations of the "meaningful datum", "logically connected data", and "weakly-relevant or irrelevant data" are application dependent.

A dataset can be stored directly at any level under the root object and does not have to be a member of a data group. One dataset can be embedded at any level inside another dataset.

A data group must either be empty or contain at least one dataset or sub-data group as its direct child.

Multiple data groups can be stored at any level under the root object. A data group cannot be a member, at any level, of a dataset.

super-root{
    root1{
        group1{
            dataset1{
                dataset1.1{...}
                ...
                ... other auxiliary data ...
            }
            dataset2{...}
        },
        group2{
            dataset3{...}
        }
        dataset4{...}
        ... other auxiliary data ...
    }
    root2{
        ...
    }
}
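
The rule that a data group cannot appear inside a dataset can be checked on a parsed text-based document with a short recursive walk. The following is a sketch under assumptions: the function name find_misplaced_groups is hypothetical, and it relies on the "_DataGroup_" and "_Dataset_" keywords (with their optional name parameter) defined in the "Data Grouping Keywords" section.

```python
def find_misplaced_groups(node, inside_dataset=False, path="root"):
    """Return the paths of data groups nested, at any level, in a dataset."""
    problems = []
    if isinstance(node, dict):
        for key, child in node.items():
            # Drop an optional ",name='...'" parameter to get the keyword.
            tag = key.split(",", 1)[0].strip()
            child_path = path + "/" + key
            if tag == "_DataGroup_" and inside_dataset:
                problems.append(child_path)
            problems += find_misplaced_groups(
                child, inside_dataset or tag == "_Dataset_", child_path)
    elif isinstance(node, list):
        for i, child in enumerate(node):
            problems += find_misplaced_groups(
                child, inside_dataset, "%s[%d]" % (path, i))
    return problems

valid_doc = {
    "_DataGroup_,name='group1'": {
        "_Dataset_,name='dataset1'": {"_Dataset_,name='dataset1.1'": {}},
    },
    "_Dataset_,name='dataset4'": {},
}
invalid_doc = {"_Dataset_,name='dataset1'": {"aux": [{"_DataGroup_": {}}]}}
```

Here valid_doc mirrors part of the layout shown above, and invalid_doc places a data group two levels below a dataset, which the walk reports.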

This document does not specify the format of the auxiliary data; thus, they can contain arbitrary meta-data or structures, as long as their names do not conflict with the data-representing elements and the combined document satisfies the grammar specification stated in the above section. The handling of the auxiliary data is application dependent.

4. Data Storage Keywords

All JData keywords are case sensitive. The data storage keywords are shared between the text-based and binary JData formats.

4.1. Data Grouping Keywords

The use of data grouping keywords is optional in a JData document. Nonetheless, properly partitioning and grouping the data based on their logical connections, and naming the components accordingly, can greatly enhance the portability and readability of the data, and is thus strongly recommended.

4.1.1. Data Group

A data group is marked by

 "_DataGroup_" : { ... }

or

  "_DataGroup_" : [ ... ]

The data group tag can contain an optional name parameter, specified in the following format

 "_DataGroup_,name='name of the data group'"

One or more of the above-mentioned white-space characters may be inserted before or after ":", "," and "=".

A unique name parameter must be given when multiple data groups appear under the same parent. The parameterized name tag must be a valid JSON string, made of valid UNICODE characters and/or one of the 9 escape formats [RFC 4627]. The "name" parameter must be enclosed by a pair of single quotes.

After removing the insignificant white spaces, the name parameter of the data group must be unique among the sub-items of its parent. Although it is not required, it is recommended to specify a name that is unique throughout the JData document (including the linked/referenced documents, if present).
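
A data group tag with its optional name parameter could be split with a regular expression along the following lines. This is only a sketch: the names TAG_PATTERN and parse_group_tag are hypothetical, and the handling of a single quote inside the name is not specified in this document, so the pattern simply takes everything up to the last single quote.

```python
import re

WS = r"[ \t\r\n]*"  # the JSON white-space characters, zero or more

TAG_PATTERN = re.compile(
    "^" + WS + "(_DataGroup_)" + WS
    + "(?:," + WS + "name" + WS + "=" + WS + "'(.*)'" + WS + ")?$",
    re.DOTALL,
)

def parse_group_tag(tag):
    """Return (keyword, name) for a data group tag, or None if no match.

    The name is None when the optional parameter is absent.
    """
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    return match.group(1), match.group(2)

plain = parse_group_tag("_DataGroup_")
named = parse_group_tag("_DataGroup_,name='name of the data group'")
spaced = parse_group_tag("_DataGroup_ , name = 'g1'")
```

Comparing the extracted names of all sibling tags is then enough to verify the uniqueness requirement.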

4.1.2. Dataset

Similarly, a dataset is specified by

 "_Dataset_" : { ... }

or

  "_Dataset_" : [ ... ]

with an optional name parameter in the following format:

 "_Dataset_,name='name of the dataset'"

4.1.3. Data record

Also similarly, a data record is specified by

 "_DataRecord_" : { ... }

or

  "_DataRecord_" : [ ... ]

with an optional name parameter in the following format:

 "_DataRecord_,name='name of the data record'"

4.2. Data Structure Keywords

4.2.1. Special Constants

For text-based JData, the following value fields have special meanings:

 true
     a logical "true" value (boolean)
 false
     a logical "false" value (boolean)
 null
     a null object
 "_NaN_"
     floating point NaN defined by IEEE 754
 "_Inf_"
     floating point positive infinity defined by IEEE 754
 "-_Inf_"
     floating point negative infinity defined by IEEE 754
 "_NoOp_"
     no-op defined in the UBJSON specification
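
A reader of text-based JData might map these constants onto native values as sketched below. The names TEXT_CONSTANTS and decode_constants are illustrative. The sketch leaves "_NoOp_" untouched and relies on the standard JSON decoder for true, false and null.

```python
import json
import math

# Special strings and the floating-point values they stand for.
TEXT_CONSTANTS = {
    "_NaN_": float("nan"),
    "_Inf_": float("inf"),
    "-_Inf_": float("-inf"),
}

def decode_constants(value):
    """Recursively replace the special strings with Python floats."""
    if isinstance(value, str):
        return TEXT_CONSTANTS.get(value, value)
    if isinstance(value, list):
        return [decode_constants(v) for v in value]
    if isinstance(value, dict):
        return {k: decode_constants(v) for k, v in value.items()}
    return value

text = '{"a": ["_Inf_", "-_Inf_", "_NaN_", true, null, "plain text"]}'
decoded = decode_constants(json.loads(text))
```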

For binary JData, the above constants are represented by (using the block-notation)

 true
     [T]
 false
     [F]
 null
     [Z]
 "_NaN_"
     [s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx] where the sign bit (s) is ignored, and the fraction bits (xxx ...) must not all be zero.
 "_Inf_"
     [0111 1111 1000 0000 0000 0000 0000 0000]
 "-_Inf_"
     [1111 1111 1000 0000 0000 0000 0000 0000]
 "_NoOp_"
     [N]
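
The 32-bit patterns above can be reproduced and checked with the Python struct module. This sketch assumes that the patterns refer to IEEE 754 single-precision values written most-significant byte first; the helper names are hypothetical.

```python
import struct

def float32_bits(value):
    """Return the 32-bit Big-Endian pattern of a float as an integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def is_nan_pattern(bits):
    """True if all exponent bits are set and the fraction is non-zero."""
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0

pos_inf = float32_bits(float("inf"))    # 0x7F800000
neg_inf = float32_bits(float("-inf"))   # 0xFF800000
nan_ok = is_nan_pattern(float32_bits(float("nan")))
```

The sign bit is masked out in is_nan_pattern, in line with the note that it is ignored for NaN.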

4.2.2. Numerical Scalars

A real-valued number (integer or floating-point value) can be represented directly in JSON and UBJSON. See Section 2.4 "Numbers" in [RFC 4627] and the "Numerical Types" section in the UBJSON specification for details.

Alternatively, a real-valued number can be represented as a 1-by-1 array by utilizing the "array keywords", detailed below.

A complex-valued number is treated as a 1-by-1 complex array, and its storage format is specified in the section below entitled "Complex Arrays".

4.2.3. Strings

Strings are natively supported in both JSON and UBJSON. Please refer to Section 2.5 "Strings" in [RFC 4627] and "String Type" in the UBJSON specification for details.

4.2.4. Numerical Arrays

4.2.4.1. Full Arrays and Vectors

1D and 2D arrays can be represented by the array constructs in the JSON and UBJSON grammars. For example, a row vector:

 "row_vector": [1, 2, 3, 4]

a column vector:

 "column_vector": [
        [1],
        [2],
        [3],
        [4]
  ]

a 2D array:

 "a_2D_array": [
        [11,12,13],
        [21,22,23],
        [31,32,33],
        [41,42,43]
  ]

Similarly, the binary forms of these data structures can also be represented by the UBJSON syntax:

 [i][10][row_vector] [[][$][i][#][i][4] [1][2][3][4] []]

 [i][13][column_vector] [[] [[][i][1][]] [[][i][2][]] [[][i][3][]] [[][i][4][]] []]

 [i][10][a_2D_array] [[] [[][$][i][#][i][3] [11][12][13] []] [[][$][i][#][i][3] [21][22][23] []] [[][$][i][#][i][3] [31][32][33] []] [[][$][i][#][i][3] [41][42][43] []] []]

Using the extended UBJSON array grammar described in the "Data Storage Grammar" section, the above column-vector and 2D-array examples can also be shortened to

 [i][13][column_vector] [[][$][i][#][[][i][2][4][1][]] [1][2][3][4] []]

 [i][10][a_2D_array] [[][$][i][#][[][i][2][4][3][]] [11][21][31][41][12][22][32][42][13][23][33][43]  []]
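
The element order in the shortened 2D-array form follows from flattening the text-based array column by column. A minimal sketch, with an illustrative helper name, shows how the dimensions and the data sequence can be derived from the nested rows:

```python
rows = [
    [11, 12, 13],
    [21, 22, 23],
    [31, 32, 33],
    [41, 42, 43],
]

def flatten_column_major(matrix):
    """Flatten a rectangular list of rows column by column."""
    return [value for column in zip(*matrix) for value in column]

dims = [len(rows), len(rows[0])]   # number of rows, number of columns
payload = flatten_column_major(rows)
```

The resulting payload lists the first column (11, 21, 31, 41) before the second and third, matching the order of the data bytes in the example above.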

In this document, we consider the following general data storage models:

  • a full array (empty, 1D, 2D, 3D or in higher dimensions)
  • a sparse array (empty, 1D, 2D, 3D or in higher dimensions)

 _ArraySize_, _ArrayType_, _ArrayData_,
 _ArrayIsComplex_, _ArrayIsSparse_

4.2.4.2. Complex Arrays

4.2.4.3. Sparse Arrays

4.2.4.4. Complex Sparse Arrays

4.2.5. Meta-data Keywords

In this document, we define the following document-specific meta-data keywords:

 _Author_, _CreationTime_, _Comment_

5. Recommended File Specifiers

For the text-based JData file, the recommended file suffix is ".jdat"; for the binary JData file, the recommended file suffix is ".ubjd".

The MIME type for the text-based JData document is "application/jdata-text"; that for the binary JData document is "application/jdata-binary".
