JData: a general-purpose data storage and interchange format

Status of This Document
This document is currently in the "Request for Comments" state. It does not specify a standard of any kind.

Copyright Notice
Copyright (C) Qianqian Fang (2015).

License
Apache License, Version 2.0

Version
RFC.pre.0

Last Update
Jan. 02, 2015

Abstract
JData is a general-purpose data interchange format. It was derived from the JavaScript Object Notation (JSON) [RFC 4627] and Universal Binary JSON specifications. JData defines a list of dedicated "name" fields to represent a wide range of data structures, including scalars, arrays, tables, structures, hashes, trees and graphs.
1. Introduction
1.1. Background
1.2. JSON and UBJSON
1.3. JData Specification Summary
1.4. Conventions Used in This Document
2. Data Storage Grammar
2.1. JSON Grammar
3. Data Storage Models
4. Data Storage Keywords
4.1. Data Group Keywords
4.2. Data Record Keywords
4.3. Comments

1. Introduction

1.1. Background

Data are digital representations of our world. Generating and processing data are essential parts of our daily lives and lie at the foundations of modern science, information technology, business, and the functioning of our global societies.

Data take many different forms. Some can be represented by simple scalars; others have complex, hierarchical structures. An efficient representation of data also depends strongly on the needs of the application. In some cases, plain text files with white-space-separated fields are sufficient; for performance-sensitive applications, however, binary formats are often preferred. The ability to store and parse complex data structures is particularly important to the scientific communities.

It is a challenging task to encapsulate the wide variety of data forms in a single data interchange format. There have been many previous efforts to design a general-purpose data storage specification, and some of them have become popular choices in one or more categories of applications. Extensible Markup Language (XML), for example, is a popular data-exchange format, but its verbose syntax, moderate parsing complexity, impeded readability and inefficiency in expressing structured data leave room for alternative formats. Comma Separated Value (CSV), a rather simple plain-text format, is used by some applications to exchange tabular data (such as a spreadsheet); yet its inability to encode more complex data forms and its lack of flexibility and precision restrict it to specific applications.

Hierarchical Data Format (HDF) targets the broad needs of the scientific communities. It has an extensible hierarchical data model with a large capacity to represent complex binary data. However, using HDF effectively requires skillful implementation and an in-depth understanding of the underlying data models. For small projects with non-critical performance needs, using an advanced data format such as HDF may require additional development and maintenance effort. Similar arguments apply to the Common Data Format (CDF) and Network Common Data Format (netCDF), which are partly derived from HDF. In addition, the MATLAB mat-file format and the Tecplot data format are also used in the research communities. Because they require proprietary software or libraries, these formats have difficulty finding widespread use outside the user communities of the associated software.

1.2. JSON and UBJSON

The JavaScript Object Notation (JSON) format is a text-based data format that is known for its capability of storing complex data, excellent portability and human-readability. JSON is widely adopted in modern web applications, and is becoming popular in local/native applications. The key advantages of JSON include:

  1. simplicity: JSON data are composed of a list of "name":value pairs; such a simple grammar greatly eases the use and parsing of the data file; free JSON encoders and decoders are widely available for most popular programming languages;
  2. human readability: the text-based nature of JSON and its clean, easy-to-read format make it intuitively readable without in-depth knowledge of the format itself;
  3. hierarchical data support: JSON has a tree-like data storage paradigm with the native capacity to support complex hierarchical data structures; the format itself imposes no inherent limit on data size;
  4. web ready: because JSON can be readily parsed by JavaScript, a JSON-encoded data file can be used directly (inline or loaded from a remote site) by a JavaScript-based web application.

JSON also has limitations. JSON's "value" fields are weakly typed. They support only string, number, and Boolean types, and lack the granularity to represent numerical types of different byte lengths (in the C language, for example, short, int, long int, float, double, long double) and signedness (signed and unsigned). Because JSON is a text-based format, a data file can be significantly larger than its binary equivalent, and its contents must be converted before use in an application. This introduces overhead in both storage and processing.
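
The two limitations above can be illustrated with a minimal Python sketch. It uses only the standard json and struct modules; the sample values and variable names are illustrative assumptions and are not part of the JData specification.

```python
import json
import struct

# Eight sample values that a C program might hold as 64-bit doubles
# (arbitrary illustrative numbers).
values = [i / 7 for i in range(1, 9)]

# Text form: every number is spelled out as decimal digits.
text_bytes = json.dumps(values).encode("utf-8")

# Binary form: exactly 8 bytes per little-endian double.
binary_bytes = struct.pack("<8d", *values)

# JSON has a single "number" type, so the byte length and sign of the
# original C type cannot be recovered from the text alone.
decoded = json.loads("[1, 255, 65536]")
kinds = {type(v).__name__ for v in decoded}

print(len(text_bytes), len(binary_bytes), kinds)
```

In this sketch the text encoding is longer than the 64-byte binary encoding, and the three integers all decode to the same generic type regardless of whether they were originally stored as, say, a signed char, an unsigned char or a long int.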

The Universal Binary JSON (UBJSON) is a binary counterpart to the JSON format. It specifically addresses the above-mentioned limitations, yet adheres to a simple grammar similar to that of the text-based JSON. As a trade-off, it loses "human readability" to a certain extent. Although parser implementations for UBJSON are not as abundant as those for JSON, the simplicity of the format means that the cost of implementing a parser for a new programming language is significantly lower than for more complex data formats.

With their ease of use, superior portability and parser availability, JSON and UBJSON have the potential to serve as mainstream data storage and interchange formats for general needs, especially for the scientific communities. The combination of JSON and its binary counterpart offers features that are not currently available in existing data storage schemes. Although they do not offer all the advanced features found in the more sophisticated formats, their greatly simplified encoding and decoding strategies permit efficient data sharing among a general audience.

1.3. JData Specification Summary

JData is a specification for storing, exchanging and processing general-purpose data that are commonly encountered in information technology industries and research communities. It has a text/UNICODE format derived from the JSON specification and a binary format derived from the UBJSON specification. JData is designed to represent the commonly used data structures including arrays, structures, trees and graphs. A round-trip conversion is defined between the text and binary versions of JData files.

The purpose of this document is to define the text and binary JData format specifications. This is achieved by defining a semantic layer over the JSON/UBJSON data storage syntax to map various types of complex data structures. This semantic layer includes

  1. a list of reserved "name" fields, or keywords, that define the characteristics of various data types,
  2. a list of reserved "name" fields to facilitate the grouping and organization of hierarchical data,
  3. a list of format properties for the associated "value" field to store the specific metadata of the data points,

and, in addition,

  1. a set of conversion rules between the text and binary forms.

In the following sections, we will define the JData keywords for data grouping and for various data types, including scalars, 1-, 2-, 3- and N-dimensional arrays, sparse and complex arrays, structures, hashes/associative arrays, trees and graphs. The expressions of these data structures in both text and binary forms are specified and exemplified, and their conversion rules are defined.

1.4. Conventions Used in This Document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119].

The grammatical rules in this document are to be interpreted as described in [RFC 4627].

2. Data Storage Grammar

2.1. JSON Grammar

The JSON grammar used in JData is defined in [RFC 4627].

3. Data Storage Models

In JData, a significant datum, either in the form of a simple data point or a complex structure, is referred to as a "data record"; a set of logically connected data records, such as the citation information of a paper, is referred to as a "dataset"; a group of datasets of similar meanings is referred to as a "data group". The top-most level of the data structure is referred to as a "root object".

A dataset can be stored directly at any level under the root object and does not have to be a member of a data group. One dataset can be embedded inside another dataset. Datasets that are part of a data group are organized in the form of a JSON array (for un-named datasets) or JSON object (for named datasets).

Multiple data groups can be stored at any level under the root object. A data group cannot be a member, at any level, of a dataset.

super-root{
    root1{
        group1{
            dataset1{
                dataset1.1{...}
                ...
                ... other auxiliary data ...
            }
            dataset2{...}
        }
        group2{
            dataset3{...}
        }
        dataset4{...}
        ... other auxiliary data ...
    }
    root2{
        ...
    }
}
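
As a rough illustration of the layout above, the following Python sketch builds one root object with the standard json module. The names root1, group1, group2 and dataset1 through dataset4 follow the diagram; the citation fields ("title", "year") and their values are hypothetical placeholders, and no JData grouping keywords are used because this document does not yet define them.

```python
import json

# A dataset is a set of logically connected data records; here the records
# are hypothetical citation fields. dataset1 also embeds another dataset.
dataset1 = {
    "title": "An example paper",
    "year": 2015,
    "dataset1.1": {"title": "An embedded example", "year": 2014},
}
dataset2 = {"title": "Another example paper", "year": 2013}

root1 = {
    # Named datasets in a data group: stored as a JSON object.
    "group1": {"dataset1": dataset1, "dataset2": dataset2},
    # Un-named datasets in a data group: stored as a JSON array.
    "group2": [{"title": "A third example paper", "year": 2012}],
    # A dataset stored directly under the root, outside any data group.
    "dataset4": {"title": "A standalone record", "year": 2011},
}

# Serialize to JSON text and parse it back to check the round trip.
text = json.dumps({"root1": root1}, indent=2)
restored = json.loads(text)
print(text)
```

Note that group1 and group2 contain only datasets, and no dataset contains a data group, in keeping with the membership rule stated above.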

4. Data Storage Keywords

4.1. Data Group Keywords

4.2. Data Record Keywords

3. JData Keywords

3.1 Data Representing Keywords

All JData keywords are case sensitive.

In this document, we define the meanings of the following JData keywords that represent mesh-related data structures:

3.1 Data Array Storage Keywords

In this document, we consider the following general data storage models:
  • a full array (empty, 1D, 2D, 3D or in higher dimensions)
  • a sparse array (empty, 1D, 2D, 3D or in higher dimensions)

 _ArraySize_, _ArrayType_, _ArrayData_,
 _ArrayIsComplex_, _ArrayIsSparse_
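
This document so far lists only the names of the array keywords, not the layout of their values. The Python sketch below is therefore an assumption-laden illustration rather than the specified encoding: it assumes that _ArraySize_ holds the dimensions, _ArrayType_ holds a type name string such as "double", and _ArrayData_ holds the elements flattened in row-major order. The helper names encode_full_array and decode_full_array are hypothetical.

```python
import json


def encode_full_array(rows, type_name):
    """Wrap a rectangular 2D list in an annotated object (assumed layout)."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all rows must have the same length")
    return {
        "_ArrayType_": type_name,          # assumed: a type name string
        "_ArraySize_": [n_rows, n_cols],   # assumed: one entry per dimension
        "_ArrayData_": [v for r in rows for v in r],  # assumed: row-major
    }


def decode_full_array(obj):
    """Rebuild the 2D list from the assumed annotated layout."""
    n_rows, n_cols = obj["_ArraySize_"]
    flat = obj["_ArrayData_"]
    if len(flat) != n_rows * n_cols:
        raise ValueError("_ArrayData_ length does not match _ArraySize_")
    return [flat[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
encoded = encode_full_array(matrix, "double")

# The annotated object is plain JSON, so it survives a text round trip.
restored = decode_full_array(json.loads(json.dumps(encoded)))
```

Keeping the element type and dimensions in separate named fields is one way a text file could carry the type information that a bare JSON number lacks.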

3.1 Meta-data Keywords

In this document, we define the following JData keywords to represent non-geometric meta-data:

 _Author_, _CreationTime_, _Comment_

4.3. Comments
