The latest draft of this specification is maintained in this Github repository: https://github.com/fangq/jdata
JData is a general-purpose data interchange format aimed for portability,
readability and simplicity. It utilizes the JavaScript Object Notation
(JSON) [RFC 4627] and Universal Binary JSON (UBJSON) specifications to
store complex hierarchical data in both text and binary formats. In this
specification, we define a list of JSON-compatible constructs to store
a wide range of data structures, including scalars, arrays, structures,
tables, hashes, linked lists, trees and graphs, including optional data
grouping and metadata for each data element. The generated data files are
compatible with JSON/UBJSON specifications and can be readily processed by
most existing parsers. Advanced features such as array compression, data
links and anchors were supported to greatly enhance portability and
scalability of the generated data files.
Data are the digital representations of our world. Generating and processing
data are the essential parts of our daily lives, and are at the very
foundations of modern sciences, information technologies, businesses, and the
interactions between our global societies.
Data take many different forms. Some data can be represented by simple scalars;
some others have complex forms with hierarchical structures. An efficient
representation of data also strongly depends on the application needs. In some
cases, plain text files with white-space separated fields are sufficient;
however, for performance-sensitive applications, using binary formats can
significantly reduce loading and processing time. The abilities to store and
parse complex data structures are particularly important to the scientific
communities.
It is a challenging task to encapsulate wide varieties of data forms in a
single data interchange format. There have been many previous efforts in
designing a general-purpose data storage specification. Some of them have
become popular choices in one or multiple categories of applications.
Extensible Markup Language (XML), for example, is ubiquitously used as a
data-exchange format, but the verbosity of the syntax, moderate complexity for
parsing, impeded readability and inefficiency in expressing structured data
suggested room for alternative formats. Comma Separated Value (CSV), a rather
simple plain-text format, is used among some applications to exchange a tabular
data structure (such as a spreadsheet); yet, its inability to encode more
complex data forms, lack of flexibility and precision restrict it to specific
applications.
Hierarchical Data Format (HDF) is a format targeting at the broad needs from
the scientific communities. It has an extensible hierarchical data model with a
large capacity to represent complex binary data. However, to effectively use
HDF requires skillful implementation and in-depth understanding to the
underlying data models. For small projects with non-critical performance needs,
using an advanced data format such as HDF may requires additional development
and maintenance efforts. Similar arguments can be made to the Common Data
Format (CDF) or Network Common Data Format (netCDF) that are partly derived
from HDF. In addition, the MATLAB mat-file format and Tecplot data format are
also used among the research communities. Because of the requirement of
propitiatory software or libraries, these formats have also found difficulties
to achieve wide-spread use outside of the user communities of the associated
software.
The JavaScript Object Notation (JSON) format is a text-based data format that
is known for its capability of storing complex data, excellent portability and
human-readability. JSON is widely adopted in modern web applications, and is
becoming popular among local/native applications. The key advantages of JSON
include:
"name":value
pairs;
JSON also has limitations. JSON's "value"
fields are weakly-typed. They only
support strings, numbers, and Boolean types, but lack of the fine-granularity
to represent various numerical types of different byte-lengths (in C-language,
for example, short, int, long int, float, double, long double) and their signs
(signed and unsigned). Because JSON is a text-based format, the size of the
data file can be significantly larger than a binary format and requires
additional conversion when used in an application. This introduces overhead in
both storage and processing.
The Universal Binary JSON (UBJSON) is one of the binary counterparts to the JSON format.
It specifically addresses the above mentioned limitations, yet, adheres to a
simple grammar similar to the text-based JSON. As a trade-off, it loses the
"human-readability" to a certain extend. Although the implemented parsers for
UBJSON are not as abundant as JSON, due to the simplicity of the format itself,
the development cost for implementing a parser for a new programming language
is significantly lower than other more complex data formats.
With the ease-of-use, superior portability and parser availability, JSON and
UBJSON have the potentials to serve as the main-stream data storage and
interchange formats for general needs, especially for storage and interchange
scientific data. A combination of JSON and its binary counterpart offers features
that are not currently available with the existing data storage schemes. Although
they do not provide all the advanced features found in the more sophisticated
formats, their greatly simplified encoding and decoding strategies permit
efficient data sharing among the general audiences.
JData is a specification for storing, exchanging and processing general-purpose
data that are commonly encountered in information technology (IT) industries and
research communities. It has a text/UNICODE format derived from the JSON
specification and a binary format derived from the UBJSON specification. JData
is designed to represent the commonly used data structures including arrays,
structures, trees and graphs. A round-trip conversion is defined between the
text and binary versions of JData documents.
The purpose of this document is to define the text and binary JData format
specifications. This is achieved through the definition of a semantic layer
over the JSON/UBJSON data storage syntax to map various types of complex data
structures. Such semantic layer includes
"name"
fields, or keywords, that define the containers"name"
fields and formats to facilitate the grouping and
In the following sections, we will define the basic JData grammar, data models,
followed by the keywords for data grouping and various data types, including
scalars, N-dimensional arrays, sparse and complex arrays, structures, tables,
hashes/associative arrays, trees and graphs. The expressions of these data
structures in both text and binary forms are specified and exemplified, and
their conversion rules are defined.
The text-based JData grammar is identical to the JSON grammar defined in
[RFC 4627], except that JData also accepts the Concatenated JSON (CJSON),
streaming format, defined in the following form:
{ "object1":{} } { "object2":{} } ...In CJSON, multiple JSON objects can be included inside a single file or stream,
LF (U+000A), CR (U+000D), tab (U+0009), or space (U+0020)
The binary JData grammar is identical to the UBJSON grammar defined in
ubjson.org (Draft 12), with the following two exceptions
[N]
("no-op"
) record, andoptimized array container header was extended to support N-dimensional dense arrays:
[[] [$] [type] [#] [[] [$] [nx type] [#] [ndim type] [ndim] [nx ny nz ...] []] [nx*ny*nz*...*sizeof(type)] []]
or
[[] [$] [type] [#] [[] [nx type] [nx] [ny type] [ny] [nz type] [nz] ... []] [nx*ny*nz*...*sizeof(type) ] []]
where ndim
is the number of dimensions, and nx
, ny
, and nz
... are
all non-negative numbers specifying the dimensions of the N-dimensional array,
nz/ny/nz/ndim
types must be one of the UBJSON integer types (i,U,I,l,L
), .
The binary data of the N-dimensional array is then serialized in the column-major format
(similar to MATLAB or FORTRAN) order.
As a special note, all UBJSON integer types must be stored in the Big-Endian
format, according to the specification; the storage to the floating point types
(d,D
) follows the IEEE 754 specification.
Topologically, we define different parts of a JData document using the below terminologies
"name":{...}
or "name":[]
, in the case"name":value
in the case{...}
or []
, in the case of namedvalue
in the case{...}
,[...]
,
A leaflet can only take one of the two possible forms: value
, referred to as the
"index leaflet", and a "name": value
pair, referred to as the "named leaflet".
Here value
should not contain any sub-element.
The topological elements of a JData document - "leaflet", "leaf", "branch", "root"
and "super-root" - have specific meanings according to the definitions above.
To efficiently process the represented data, we can semantically annotate the
underlying data structure and organize them for fine granularity. The data grouping
annotations supported in JData include
The interpretations to the "meaningful datum", "logically connected data",
"primary data" and "weakly-relevant or irrelevant data" are application dependent.
The data group annotations are ranked in the below order, from the highest
level to the lowest level:
super-root > root > data group >= dataset >= data record >= leaf > leaflet
for each annotation level, it can only contain elements annotated of the same
level or lower levels, but not the upper level annotations. All data annotation
objects can be empty, i.e. do not contain any element. An example JData
organization schematic is shown below
super-root{ root1{ group1{ dataset1{ dataset1.1{...} ... ... other auxiliary data ... } dataset2{...} }, group2{ dataset3{...} } group3{} dataset4{...} ... other auxiliary data ... } root2[ ... ] }A data group, dataset or data record can be topologically presented as a branch
All JData keywords are case sensitive. Data groups, datasets and data records
can contain metadata to include additional information regarding the data themselves,
such as name, create date and user-defined headers. However, metadata can
also present in any branch or leaf.
Below is a short summary of the JData data annotation/storage keywords to be introduced
_DataGroup_
, _Dataset_
, _DataRecord_
_ArrayType_
, _ArraySize_
, _ArrayIsComplex_
,_ArrayIsSparse_
,_ArrayData_
,_ArrayCompressionMethod_
,_ArrayCompressionSize_
,_ArrayCompressionEndian_
, _ArrayCompressedData_
_TreeNode_
,_TreeChildren_
,_TreeData_
_ListNode_
,_ListNext_
,_ListPrior_
,_LinkedList_
_GraphNodes_
,_GraphEdges_
,_GraphMatrix_
,_GraphData_
_TableData_
"item_name::Property1=value1,Property2=value2,...": ...
{"_DataInfo_":{...}}
_DataLink_
, _DataAnchor_
The use of data grouping keywords is not mandatory in a JData document.
Nonetheless, properly partitioning and grouping the data based on their semantic
relationships and name the components accordingly can greatly enhance the
portability and readability of the data, and thus, are strongly recommended.
A data group can be stored in the form of a named node, either
"_DataGroup_": { ... }
or
"_DataGroup_": [ ... ]
A data group annotation can have a name in the below form
"_DataGroup_(a name string): {}
or
"_DataGroup_(a name string): []
If multiple data groups are stored inside a structure, a unique name
string, among the items under the same parent object, must be provided
for each data group. For example
{ "_DataGroup_(name1)":{...}, "_DataGroup_(name2)":{...}, "_DataGroup_(name3)":[...], ... }When defining one of the elements of an array object as a data group, one
[ indexed_node1, indexed_node2, indexed_node3,... ]if one needs to define the
indexed_node2
into a data group, one must define
[ indexed_node1, { "_DataGroup_(name1)": [index_node2] }, index_node3, ... ]where the name
"(name1)"
is optional.
Similarly, a dataset is specified by
"_Dataset_": { ... }
or
"_Dataset_": [ ... ]
with an optional name parameter in the following format:
"_Dataset_(a name string)": ...
Also similarly, a dataset is specified by
"_DataRecord_": { ... }
or
"_DataRecord_": [ ... ]
with an optional name parameter in the following format:
"_DataRecord_(a name string)"
Metadata records can be associated with a data group, dataset or data record
(or any branch or leaf). It is optional. Metadata can have two forms
The inline metadata is defined by attaching a series of comma-separated property=value
strings to the data group keywords (_DataGroup_
,_Dataset_
,_DataRecord_
), separated by
two colons, i.e. "::"
, for example
"_{DataGroup,Dataset,DataRecord}_(name)::Property1=...,Property2=...,..."
The property
name should not contain space or comma. If comma appears inside the value
,
it must be escaped using the form "\,"
.
An inline metadata can be attached to any named node, for example
"name::Property1=...,Property2=...,..."
One or more above permitted white spaces, see above, may be inserted before or after
separators "::"
, ","
and "="
without changing the interpretations of the data.
The inline metadata is primarily used to store simple properties. When storage of more complex
metadata is needed, a dedicated "metadata node" shall be inserted to the data object as the
first child of the annotated object.
For example, if the data to be annotated is a structure, the metadata node is defined
by a named leaf with a specific keyword "_DataInfo_"
. An example can be
found below:
{ "_DataInfo_": { "Property1": "...", "Property2": "...", "Property3": "...", ... }, named_node1, named_node2, ... }If the data to be annotated is an array, the metadata record is an indexed leaf as
[ { "_DataInfo_": { "Property1": "...", "Property2": "...", "Property3": "...", ... }, } indexed_node1, indexed_node2, ... ]The property names are user-defined. Recommended properties include but not limited
Version Author Comment UniqueID CreateTime ModifiedTime
JData is designed to store wide varieties of data forms. The most common data structures
used in the scientific communities include scalars, constants, arrays, structures,
tables, associative arrays, trees, and graphs.
The following constants are supported by this version of the specification
NaN
: An NaN
defined by the IEEE 754 standard shall be stored as "_NaN_"
as a string+/-Inf
: A +infinity
or -infinity
defined by the IEEE 754 standard shall be stored as a"+_Inf_"
and "-_Inf_"
, respectively, in the text-based JData ("+" signtrue
/false
: A logical true
/false
should be represented by the JSON true/false
[T]
/[F]
markers in UBJSON
Null
: a Null
(empty) value can be stored as null
in JSON and [Z]
in UBJSON
An N-dimensional array is serialized using the column-major format (i.e. the fastest index
is the left-most index, similar to arrays in MATLAB and FORTRAN; in comparison, C arrays
are row-majored).
A solid (non-sparse) N-dimensional array can be represented in two forms in JData - the
direct storage format and the annotated storage format.
A solid N-D array can be stored directly using JSON/UBJSON nested array constructs. For example,
a 1-D column vector can be stored as
[1,2,11,9,2.1,10,...]
while a 1-D row-vector can be stored as
[[1],[2],[11],[9],[2.1],[10],...]in the JSON-formatted JData.
Below is an example of a 4x3x2 3-D array stored in JSON-formatted JData:
[ [ [1,9,6,0], [2,9,3,1], [8,0,9,6] ], [ [6,4,2,7], [8,5,1,2], [3,3,2,6] ] ]The direct storage format of a solid N-D array does not have the ability to store information
Please be aware that the UBJSON format supported by JData includes an extended syntax
to facilitate storage and loading of N-D arrays in the binary format. Please refer to the
"Grammar" section for details.
In the annotated array storage format, one shall use a structure to store an N-D array. This
allows one to use additional fields to store additional information regarding the encoded
array.
The annotated array format is shown below for a solid 2x3 array a=[[1,2],[3,4],[5,6]]
{ "_ArrayType_": "typename", "_ArraySize_": [N1,N2,N3,...], "_ArrayData_": [1,2,3,4,5,6] }Here, the array annotation keywords are defined below:
"_ArrayType_"
: (required) a case-insensitive string value to specify the type of the data, see below
"_ArraySize_"
: (required) an integer-valued (see below) 1-D column vector, storing the dimensions"_ArrayData_"
: (required) a 1-D column vector storing the serialized array values, using the
To facilitate the pre-allocation of the buffer for storage of the array in the parser,
it is required that the "_ArrayType_"
and "_ArraySize_"
nodes must appear before
"_ArrayData_"
.
The supported data types are similar to those supported by the UBJSON format, i.e.
[U]
in UBJSON
[i]
in UBJSON
[I]
[I]
in UBJSON
[l]
[l]
in UBJSON
[L]
[L]
in UBJSON
[d]
in UBJSON
[D]
in UBJSON
The first 8 data types are considered "integer" types, and the last two types are considered
"floating-point" types.
JData supports the storage of complex values or arrays using only the annotated N-D array storage
format. In this case, a named node "_ArrayIsComplex_":true
or "_ArrayIsComplex_":1
is added into the structure. In the meantime, the "_ArrayData_"
becomes the concatenation of the
serialized real and imaginary parts of the complex array, with the real-part in the first half, and
the imaginary-part in the 2nd half.
For example, a complex double-precision 1x3 row vector a=[2+6*i, 4+3.2*i, 1.2+9.7*i]
can be stored as
{ "_ArrayType_": "double", "_ArraySize_": [1, 3], "_ArrayIsComplex_": true, "_ArrayData_": [2,4,1.2,6,3.2,9.7] }The
"_ArrayIsComplex_"
node must be presented before "_ArrayData_"
.
JData also supports the storage of N-D sparse arrays using the annotated N-D array storage
format. In this case, a named node "_ArrayIsSparse_":true
or "_ArrayIsSparse_":1
is added into the structure. In the meantime, the "ArrayData" becomes the concatenations
of the serialized N-tuple integer indices, followed by the non-zero array element values.
For an N-D sparse array, each non-zero value requires an N-tuple index to specify its
location.
For example, if a 3-D sparse array has the following non-zero element at the specified
locations by a triplets [i1, i2, i3]
:
a: i1, i2, i3, value 2 3 1 10.1 3 1 1 9.0 3 3 1 8.1 5 1 2 17 5 2 2 9.4 2 2 3 20.5it can be saved as the following JSON format
{ "_ArrayType_": "double", "_ArraySize_": [5, 4, 3], "_ArrayIsSparse_": true, "_ArrayData_": [2,3,3,5,5,2, 3,1,3,1,2,2, 1,1,1,2,2,3, 10.1,9.0,8.1,17,9.4,20.5] }The
"_ArrayIsSparse_"
node must be presented before "_ArrayData_"
.
Using the combination of "_ArrayIsComplex_"
and "_ArrayIsSparse_"
, one can store
a complex-valued sparse array using JData. In this case, both "_ArrayIsComplex_":true
and "_ArrayIsSparse_":true
shall be presented in the structure, with "_ArrayData_"
ordered by serialized non-zero element indices (left-most index first, and so on), followed
by the serialized real-values of the non-zero elements, and lastly the imaginary-values of
the non-zero elements.
For example, if a 3-D sparse array has the following non-zero complex element at the
specified indices (i1, i2, i3)
a: i1, i2, i3, complex values 2 3 1 10.1+19.0*i 3 1 1 9.0+11*i 3 3 2 8.1+8.2*iit can be saved as the following JSON format
{ "_ArrayType_": "double", "_ArraySize_": [4, 3, 2], "_ArrayIsComplex_": true, "_ArrayIsSparse_": true, "_ArrayData_": [2,3,3, 3,1,3, 1,1,2, 10.1,9.0,8.1, 19.0,11,8.2] }or the corresponding UBJSON equivalents.
JData supports node-based data compression to enhance space-efficiency of the generated
files. Compressed data format is only supported in the annotated array storage format.
Four additional nodes are added to the annotated array structure
_ArrayCompressionMethod_
: a string-valued leaflet to store the compression_ArrayCompressionSize_
: the dimensions of the pre-processed array, i.e. the data_ArrayData_
in the format specified by "_ArrayType_"
,_ArrayCompressionEndian_
: a case-insensitive string, either "little" or "big",_ArrayCompressedData_
: in addition, the "_ArrayData_"
node is replaced by"_ArrayCompressedData_"
. In the case of JSON-formatted JData files,"_ArrayCompressedData_"
has a string value storing the "Base64" encoded compressed"_ArrayCompressedData_"
directly stores the compressed byte stream of the
When a compressed array format is used, "_ArrayCompressionMethod_"
and
"_ArrayCompressionSize_"
must appear before "_ArrayCompressedData_"
.
In an associative array or hashed array, the element can be accessed using a string-valued
key. For example, the below pseudo code defines a 3-element associative array @a
with
3 unique keys
@a = {'Mike'->21, 'Julia'->25, 'Steve'->26}Such data structure can be conveniently represented using JSON/UBJSON as
{ "Mike": 21, "Julia": 25, "Steve": 26 }In an associative array, the keys are supposed to be unique. Only string-keyed associative
A table data structure is defined by a 2-dimensional grid of data indexed by the table
columns (fields) and rows (records). For example
Name Age Degree Height ---- ------ ------ ------ Mike 21 BS 71.2 Julia 25 MS 67.0 Steve 26 BE 69.1Such data can also be stored in JSON/UBJSON in two forms:
[ {"Name": "Mike", "Age": 21, "Degree": "BS", "Height": 71.2}, {"Name": "Julia", "Age": 25, "Degree": "MS", "Height": 67.0}, {"Name": "Steve", "Age": 26, "Degree": "BE", "Height": 69.1}, ]
Structure of arrays: the table data can also be organized in columns, where often
contains the same data types. Using the same above example, one can write the following
JData form
{ "Name": ["Mike","Julia","Steve"], "Age": [21,25,26], "Degree": ["BS","MS","BE"], "Height": [71.2,67.0,69.1] }
The choice of the forms is user-dependent. Typically, the "structure of arrays" form
leads to a smaller file size.
The above JData table can be enclosed inside an optional field "_TableData_(table_name)":{...}
(if inside a structure) or {"_TableData_":{...}}
(if inside an array) to inform the
parser the start of the data structure.
A tree-like data structure can be conveniently represented by JSON/UBJSON formatted files
due to the similar underlying structures. In JData, we use two keywords to encapsulate a
tree-like data structure.
The tree-node name and the associated data are stored using a named node with a keyword
_TreeNode_
:
"_TreeNode_(node_name)": node_data
If a tree-node contains children, we use a named array to store the children data:
"_TreeChildren_":[ {"_TreeNode_(child1)":data1}, {"_TreeNode_(child1)":data2}, ... ]For example, a tree data structure
root(data0) ├──── node1(data1) ├──── node2(data2) │ ├──── node2.1(data2.1) │ └──── node2.2(data2.2) └──── node3(data3)can be represented by the below JSON structure
{ "_TreeNode_(root)": data0, "_TreeChildren_": [ {"_TreeNode_(node1)": data1}, { "_TreeNode_(node2)": data2, "_TreeChildren_": [ {"_TreeNode_(node2.1): data2.1}, {"_TreeNode_(node2.2): data2.2} ] }, {"_TreeNode_(node3)": data3}, ] }and the corresponding UBJSON equivalents. The notations "data0", "data1" etc are parsed
One can add either inline or dedicated metadata record to store additional information
about the tree nodes, for example
{ "_TreeNode_(root)::NodeID=1,ParentID=0,Path=#root": data0, "_TreeChildren_": [ {"_TreeNode_(node1)::NodeID=2,ParentID=1,Path=#root#node1": data1}, { "_TreeNode_(node2)::NodeID=3,ParentID=1": data2, "_TreeChildren_": [ {"_TreeNode_(node2.1)::NodeID=4,ParentID=3": data2.1}, {"_TreeNode_(node2.2)::NodeID=5,ParentID=3": data2.2} ] }, {"_TreeNode_(node3)::NodeID=6,ParentID=1": data3}, ] }Auxiliary data can be inserted to different levels of the above JData tree document
"_TreeNode_(name)"
and "_TreeChildren_"
of the same level. Behaviors for parsing
The above JData tree can be enclosed inside an optional field "_TreeData_(tree_name)":{...}
(if inside a structure) or {"_TreeData_":{...}}
(if inside an array) to inform the
parser the start of the data structure.
Similar to the storage of trees, in JData, we use additional keywords to encapsulate a singly
or doubly linked list inside a JSON/UBJSON array construct. The relevant keywords are
"_ListNode_(unique_name)": node_data
and
"_ListNext_": "unique_name_of_next_node"
"_ListPrior_": "unique_name_of_prior_node"
For example, the below linked list
head(data0)->node1(data1)->node2(data2)->node3(data3)shall be stored as
[ { "_ListNode_(head)": data0, "_ListNext_": "node1" }, { "_ListNode_(node1)": data1, "_ListNext_": "node2" }, { "_ListNode_(node2)": data2, "_ListNext_": "node3" }, { "_ListNode_(node3)": data3, "_ListNext_": null } ]If a node does not have a next or prior element, the
"_LinkNext_"
or "_LinkPrior_"
valuenull
.
The name label referred in the "_LinkNext_"
or "_LinkPrior_"
fields has a scope limited to
the current list, i.e. the parent array one-level above the node. Multiple linked lists can
share the same names if they are stored in a parallel or nested fashion.
The above linked list can be enclosed inside an optional field "_LinkedList_(list_name)":[...]
(if inside a structure) or {"_LinkedList_":[...]}
(if inside an array) to inform the parser
the start of the data structure.
In JData, we use the below keywords to encapsulate a graph data structure
"_GraphNodes_"
: a JData array object to store the serialized node data
"_GraphEdges_"
: a JData array object to store the connections between nodes. Each edge["start node name", "end node name", (optional) edge data]
.
For example, if we modify the above linked list by adding a directed edge from "node1"
to
"node3"
and from "node3"
to "node2"
, such as the data suggested by this diagram
┌--------------------------┐ │ V head(data0)->node1(data1)->node2(data2)->node3(data3) ^ │ └-------------┘we can then store this data structure as
{ "_GraphNodes_":{ "head": data0, "node1": data1, "node2": data2, "node3": data3 }, "_GraphEdges_":[ ["head", "node1", edgedata01], ["node1", "node2", edgedata12], ["node1", "node3", edgedata13], ["node2", "node3", edgedata23], ["node3", "node2", edgedata32] ] }The data associated with each edge (edgedata) in this example is optional and can be any JData
By default, the graph is assumed to be a directed graph. If a user intends to store undirected
graph using the above format, one must use "_GraphEdges_(false)"
or "_GraphEdges_(0)"
to
enclose the edge data.
The above graph data can be enclosed inside an optional field "_GraphData_(graph_name)":{...}
(if inside a structure) or {"_GraphData_":{...}}
(if inside an array) to inform the parser the
start of the data structure.
When no data or metadata are associated with the edges, one can also use the graph's adjacency
matrix to represent the connections between graph nodes. The adjacency matrix must only have 0
and 1 values, with a 1 at A(i,j) indicating a directed edge from the i-th node in the
_GraphNode_
list to the j-th node in that list, and 0 indicating no connection.
For example, the adjacency matrix of the above graph can be written as
head node1 node2 node3 head 0 1 0 0 node1 0 0 1 1 node2 0 0 0 1 node3 0 0 1 0In this case, instead of using
_GraphEdges_
, we use _GraphMatrix_
to store the
{ "_GraphNodes_":{ "head": data0, "node1": data1, "node2": data2, "node3": data3 }, "_GraphMatrix_":[ [0,0,0,0], [1,0,0,0], [0,1,0,1], [0,1,1,0] ] }Notice that the matrix is serialized in the column-major order in the above example. Alternatively, one
{ "_GraphNodes_":{ "head": data0, "node1": data1, "node2": data2, "node3": data3 }, "_GraphMatrix_":{ "_ArrayTye_": "uint8", "_ArraySize": [4,4], "_ArrayIsSparse": true, "_ArrayData_": [1,2,2,3,4, 2,3,4,4,2, 1,1,1,1,1] } }One can also apply array compression, as explained above, to further reduce the file size
{ ... "_GraphMatrix_": { "_ArrayType_": "uint8", "_ArraySize_": [4,4], "_ArrayCompressionSize_": [1,16], "_ArrayCompressionMethod_": "zlib", "_ArrayCompressionEndian_": "little", "_ArrayCompressedData_": "eJxjYGBgYGQAE0DIyAAAAC0ABg==" } }here the
"_ArrayCompressedData_"
stores the column-major-serialized, byte-typecasted,
A JData-compliant parser should support the below programming interfaces for accessing
and processing JData files
JD_GetNode(root_node, index_vector, is_compact)
: returns the JData node pointed by the indexroot_node
JD_GetName(node)
: returns the full name string of the specified node
JD_GetData(node)
: returns the value associated with the specified node
JD_GetType(node)
: returns the node type
JD_GetLength(node)
: returns the number of children of the specified node
Essentially, JData stores a serialized version of the complex data using collections
of sequential or nested nodes, either in the named or indexed form. To access
any element (a leaflet, leaf or branch) of the JData document, one should use a vector
of indices that points to the specific node.
A JData-compliant parser must be able to retrieve JData elements via the below pseudo-code
interface using a linear index vector
JD_Node item = JD_GetNode(JD_Node root, [i1, i2, i3, i4, ...], is_compact)where
i1
is the index of the data on the top-most level (relative to the root level of the"root"
object), i2
, is the index along the 2nd level, and so on. Each index is an
{ "_TreeNode_(root)": data0, <- [1] "_TreeChildren_": [ <- [2] {"_TreeNode_(node1)": data1}, <- [2,1] { <- [2,2] "_TreeNode_(node2)": data2, <- [2,2,1] "_TreeChildren_": [ <- [2,2,2] {"_TreeNode_(node2.1): data2.1}, <- [2,2,2,1] {"_TreeNode_(node2.2): data2.2} <- [2,2,2,2] ] }, { <- [2,3] "_TreeNode_(node3)": data3 <- [2,3,1], or [[2,3]] }, ] }One can insert zeros to the right-side of the indexing vector if the array storing the vector
[2,2]
, [2,2,0]
and [2,2,0,0]
are equivalent.
The third parameter, is_compact
is a boolean flag. If set to true
, JD_GetNode
shall skip the index if any of the dimensions along the indexing vector is a singlet,
i.e. the child count is 1. The compact indexing vector, enclosed by double-square-brackets
as [[...]]
, shall be passed to JD_GetNode
as the 2nd input when is_compact
is true
.
Using the above example, both index vectors 2,3 and [2,3,1] refers to
"_TreeNode_(node3)": data3
. Please be aware that the compact indexing vector can not
distinguish between row and column vectors, as the row vector in JData has a trailing
singlet dimension (see N-D array section).
An optional alternative indexing vector definition allows to replace the index within a
structure by a corresponding string, can be the name of the data item or a hashed version.
For example, the item {"_TreeNode_(node2.1): data2.1}
in the above may also be accessed
via ["_TreeChildren_",2,"_TreeChildren_",1]
. This alternative indexing scheme is less
sensitive to data serialization orders, but requires the parser to handle both string
and integer inputs.
For each JData item identified via an indexing vector, a JData-compliant library must be able
retrieve the "name"
and "data"
properties of the object via the below pseudo-code interface
string name=JD_GetName(JDataNode item) JD_Node data=JD_GetData(JDataNode item)here
"name"
is a string variable recording the full item name, including the inline metadata"data"
is the "value" of the object. When the inquired data object is an element
A JData-compliant parser should also allow users to retrieve the type and the children count
of the node via the below interfaces
JD_NodeType type = JD_GetType(JD_Node item)The type must be able to distinguish the below 3 basic types
Combining with the JD_GetName
, one can also query if the element is a named one or indexed one.
Additionally, JData-compliant library must provide the below interface to obtain the count of
the childrens for each data item. A leaflet, an empty structure or an empty array should return
a length of 0.
integer length = JD_GetLength(JD_Node item)
One can choose either the text or binary format to save the raw data into
a JData file, with the former following the JSON storage requirements and the
latter following the UBJSON storage requirements. This specification permits
both lossless and lossy conversions between the raw data to JSON, raw data to
UBJSON, and between JSON and UBJSON files.
For best practices, use of lossless conversion is highly recommended. Here are
some general recommendations to best preserve data precision:
In addition, if a transformation to the data does not alter the (full or compact)
indexing vector to all leaflets in a JData document, it is referred to as an
"isometric transform", and is permitted. An example of an isometric transform
is the conversion from a structure to an array as shown in the below example:
{ "a": { "name1": value1, <- [1,1] or ["a","name1"] "name2": value2, <- [1,2] or ["a","name2"] "name3": value3 <- [1,3] or ["a","name3"] }, "b": "value4" <- [2] or ["b"] }to
{ "a": [ { "name1": value1 <- [[1,1]] or [["a","name1"]] }, { "name2": value2 <- [[1,2]] or [["a","name2"]] }, { "name3": value3 <- [[1,3]] or [["a","name3"]] } ], "b": "value4" <- [2] or ["b"] }The only permitted "non-isometric transform" is the conversion between a direct
JData files support referencing and internal/external linking via the definitions of
data links and anchors.
A link is defined by a named leaflet or leaf as shown in the below two styles
"link_style1": { "_DataLink_": "path" }, "link_style2": { "_DataLink_": { "URI": "path", "Parameters": [...], "MaxRecursion": 1 } }The
"path"
string specifies the Uniform Resource Identifier (URI) of the referenced
{ "_DataLink_": "file:///space/test/jdfiles/tree.jdat:[1,2,2]" }asks the parser to read a local file located at "/space/test/jdfiles/tree.jdat" and
[1,2,2]
, starting from the root (or super-root"_DataLink_"
node in the current document.
If using a "_DataLink_"
structure, additional parameters can be specified via
user-defined parameters, such as "Parameters"
and "MaxRecursion"
to fine-tune
the linking behavior.
For easy referencing, JData permits the definitions of named anchors inside the
"metadata" section for each node using the following format
"obj::_DataAnchor_=a_unique_anchor_name": ...or
"obj": { "_DataInfo_"{ ... "_DataAnchor_": "a_unique_anchor_name", ... } ... }Then, the data object can be referenced as shown in the below example
"global_link1"
"global_link1": { "_DataLink_": "https://example.com:8080/space/test/jdfiles/tree.jdat#a_unique_anchor_name" }, "local_link1": { "_DataLink_": "#a_unique_anchor_name" } "local_link2": { "_DataLink_": [1,2,2] } "local_compact_link3": { "_DataLink_": [[1,2,2]] }A data link URI starting with "#" refers to the data anchor defined within the same document,
"local_link1"
example above. Similarly, one can directly use the"_DataLink_"
to cite"local_link2"
example above. A compact indexing vector"local_compact_link3"
example above. The
For the text-based JData file, the recommended file suffix is ".jdat"
; for
the binary JData file, the recommended file suffix is ".ubjd"
.
The MIME type for the text-based JData document is
"application/jdata-text"
; that for the binary JData document is
"application/jdata-binary"
The major attractions of JSON and UBJSON are their simplicity and portability, which
are often missing from other alternatives. In this document, we aim to extend the ability
of JSON/UBJSON in storage and interchange complex data structures without needing to
modify the language syntax, making the generated JData files readily usable for most
existing JSON/UBJSON encoders and decoders.
Specifically, we defined JSON/UBJSON-based constructs to store N-D arrays, tables, trees,
linked lists, and graphs, and added the ability to associate metadata to any elements
of the JData document. In addition, we also define a set of core library interfaces to
query and access the values and properties of the data units stored in a JData document.
To enhance space-efficiency and flexibility, we also introduced array data compression
and data linking/referencing mechanisms.
Although JData does not have all the sophisticated features as other advanced binary
data exchange formats, such as HDF5, it is well suited for storage of small to medium
sized datasets generated in many scientific domains or IT applications. Combined with
the excellent availability of parsers and web-friendliness, JData is expected to be
easily adopted and extended in the future.