Serialisation data format


(Rick Parker) #1


Introduction

We have always said that we would create a defined and documented binary format for the data stored in Corda in the form of transactions and contract states, for the messages passing between nodes peer-to-peer, and for the RPC messages passing between a node and its RPC clients. This document outlines the requirements for finding a long-term solution.

We are experimenting with byte-code-driven analysis of classes to determine their suitability for serialisation, rather than using a distinct IDL, and we are looking to AMQP, Avro, protobuf and others as the basis for a wire format and schema description. We will post progress here occasionally. If you have a serialisation technology or wire format you think we should consider, do bring it to our attention in this topic; we have already had a suggestion to look at Google FlatBuffers. A sketch of the kind of suitability check we mean follows.
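For concreteness, here is a minimal sketch of one plausible suitability rule, written against plain Kotlin reflection rather than byte code (it needs the kotlin-reflect library): a class qualifies when its primary constructor parameters line up with readable properties. All names below are illustrative rather than any committed API.

```kotlin
import kotlin.reflect.KClass
import kotlin.reflect.full.memberProperties
import kotlin.reflect.full.primaryConstructor

// One plausible suitability rule: a class can be serialised from its
// properties and rebuilt via its constructor only if every primary
// constructor parameter corresponds to a readable property.
fun isSuitableForSerialisation(clazz: KClass<*>): Boolean {
    val ctor = clazz.primaryConstructor ?: return false
    val propertyNames = clazz.memberProperties.map { it.name }.toSet()
    return ctor.parameters.all { param ->
        val name = param.name
        name != null && name in propertyNames
    }
}

// Constructor mirrors the properties: suitable.
data class Amount(val quantity: Long, val currency: String)

// Constructor argument never surfaces as a property: not suitable.
class Opaque(seed: Long) { val state = seed * 31 }

fun main() {
    println(isSuitableForSerialisation(Amount::class)) // true
    println(isSuitableForSerialisation(Opaque::class)) // false
}
```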

Background

Presently all serialisation requirements of the Corda node are implemented using Kryo. There are several pros and cons to this solution; a configuration sketch illustrating a few of the points below follows the two lists.

Pros
  • Relatively performant.
  • Relatively widely used, so there is some support community and many bugs have already been uncovered or fixed.
  • Seemingly can serialise “anything” without having to implement java.io.Serializable (but see the Cons list).
  • Can be extended with custom serialisers.
  • Can register custom serialisers for all subclasses of a particular type.
  • Can lock down which classes can be serialised to improve security (dynamic registration disabled).
  • Integrated with Quasar.
  • Allows the default serialiser to be set to one of several implementations, some of which support some degree of forwards and backwards compatibility.
  • Supports object graphs.
Cons
  • Even though there’s no need to implement java.io.Serializable, in reality there are several caveats as to what can be successfully serialised.
  • Doesn’t leverage the custom serialisation logic many classes implement for use with java.io.Object(Input/Output)Stream (at least not without additional configuration).
    • We end up needing to register, and sometimes implement, custom serialisers.
  • No documented or necessarily stable binary format.
    • Requires and depends on class files.
    • Consequently, it cannot easily be made cross platform (non-JVM).
  • Requires classes to be present and instantiated during deserialisation, presenting a possible security issue.
  • When dynamic registration is disabled, class identifiers are sensitive to registration ordering, unless they are statically allocated some other way (Kryo lets us choose the identifier, but it’s still an integer).
  • Dynamic registration is wasteful of bytes, as fully qualified class names are included in the stream.
  • Does not allow selective deserialisation of sub-elements, or navigation within the binary data:
    • for performance;
    • for query support.
  • The @DefaultSerializer annotation allows custom code to be run on deserialisation.
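To make some of those points concrete, here is a minimal sketch of the locked-down Kryo configuration: dynamic registration disabled, a class pinned to an explicitly chosen integer identifier, and a data class shaped so that Kryo's default field serialiser can instantiate it. The Payment class and the identifier are illustrative only.

```kotlin
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import com.esotericsoftware.kryo.io.Output
import java.io.ByteArrayOutputStream

// Default values give the class a no-arg constructor on the JVM, which
// Kryo's default field serialiser needs: one of the "caveats" noted above.
data class Payment(val amount: Long = 0, val currency: String = "")

fun main() {
    val kryo = Kryo()
    // Lock down: unregistered classes cannot be (de)serialised.
    kryo.isRegistrationRequired = true
    // Pin the class to a chosen identifier so the stream does not depend
    // on registration ordering (it is still just an integer, though).
    kryo.register(Payment::class.java, 100)

    val bytes = ByteArrayOutputStream().use { baos ->
        Output(baos).use { out -> kryo.writeObject(out, Payment(100, "GBP")) }
        baos.toByteArray()
    }
    val restored = Input(bytes).use { kryo.readObject(it, Payment::class.java) }
    println(restored) // Payment(amount=100, currency=GBP)
}
```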

Use cases

Corda uses serialisation via Kryo in four main use cases.

  • Transaction storage (non-ORM’d):
    • Standard JVM/language types (int, Integer, String, etc.).
    • Common collection types: Array, List, Map, maybe Set.
    • Transactions, states, signatures, commands, attachments etc.
    • Within states and commands, supported JVM and 3rd party types.
    • A stable binary form: we use the SecureHash of the serialised data as the primary key, so it must be reliably reproducible on every node.
  • Passing messages from node to node (P2P). Messages contain/include:
    • Everything specified for transaction storage.
    • Session set-up and tear-down messages.
    • Message types defined by flows, plus some common built-in message types.
    • Identifiers for JVM classes (instances of java.lang.Class etc.).
    • Attachments (large binary blobs).
  • Passing messages between a node and an RPC client, in both directions (RPC). Messages contain/include:
    • Everything specified for transaction storage.
    • Predefined RPC message types.
    • Identifiers for JVM classes (instances of java.lang.Class etc.).
    • Attachments.
    • Soon: query definitions, primarily for the Vault, though perhaps more general purpose than that; these could be done with lambdas.
  • Checkpoints:
    • Can encounter any type expected in the other use cases, since such objects could be residing on the stack.
    • Any types required by Quasar to support the structure of the checkpoint and the stack contained within it.
    • Support for singleton services and other elements of the node that should not be serialised into checkpoints directly, and should instead be substituted with the equivalent live instance when deserialised (a sketch of this substitution follows the list).
      • It’s not clear whether this is needed for P2P or RPC messaging, but it might be.
    • Plus any type within the JVM:
      • encountered as either a property of the FlowLogic subclass or on the Fiber stack;
      • it’s plausible that we can constrain the types that need to be supported here, or enforce some rules.
    • Not necessarily required to be a stable format. The likely intention is to complete a flow and delete all its checkpoints before attempting any upgrade.
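As a sketch of the singleton substitution mentioned under checkpoints: the idea is that a service is serialised as a small token, and the token is swapped back for the live instance on deserialisation. Every name below (the marker interface, the context class, the vault example) is illustrative, not a committed design.

```kotlin
// Services implementing this marker are checkpointed as a token, never wholesale.
interface SerialiseAsToken {
    val token: String
}

// Resolves tokens back to the live singletons owned by the node.
class TokenContext(services: List<SerialiseAsToken>) {
    private val byToken = services.associateBy { it.token }
    fun toToken(service: SerialiseAsToken): String = service.token
    fun fromToken(token: String): SerialiseAsToken =
        byToken[token] ?: error("Unknown service token: $token")
}

// A node-level singleton that must never be written into a checkpoint directly.
class VaultService : SerialiseAsToken {
    override val token = "vault"
}

fun main() {
    val vault = VaultService()
    val context = TokenContext(listOf(vault))
    val checkpointed = context.toToken(vault)      // only "vault" is persisted
    val restored = context.fromToken(checkpointed) // live instance on resume
    println(restored === vault) // true
}
```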

Requirements

We need to satisfy all of the use cases listed above. It would be advantageous to settle on a single serialisation solution across all the use cases, but the checkpoint use case is a much bigger ask, due to all the edge cases Kryo has attempted to tackle, and it does not share the same security requirements, so we might retain Kryo for that purpose or attempt to replace it much later.

  1. Our preference is for a JVM class driven schema. That is, Corda and CorDapp developers just write Java, Kotlin etc. classes adhering to some documented subset of features and interfaces, perhaps decorated with annotations.
  2. A documented, stable, binary format.
  3. Format typing and versioning, in case we need to vary the format over time or to support a migration to an improved, distinct format.
  4. Serialisation support for the following types:
    • A set of supported core types, covering standard JVM types: java.time.*, BigDecimal, int, String, SecureHash etc.
    • Common collection types (List/Array, Map and perhaps Set).
    • Simple objects, such as ContractStates, with fields containing only other supported types, requiring no custom serialisation logic and no custom unmarshalling or initialisation logic that could be a security risk to execute on deserialisation.
    • Simple classes beyond the small set of built-in types. Simple classes will be approximately the equivalent of Java Beans or Kotlin’s data classes. A constructor matching the properties/fields of the object will be required.
      • Those classes will be marked with an annotation to indicate that they are intended to be serialised/deserialised, to support a dynamic population of classes.
    • Raw binary data (byte[], ByteArray).
    • Class identifiers. These need to encode the class name and something about the class loader (e.g. which attachment).
  5. Supported types that cannot be annotated will be included on a whitelist, to prevent exploits being developed using arbitrary classes available on the class path. (A sketch of this annotation-plus-whitelist check follows the list.)
    • We may consider allowing the whitelist to be upgraded to include new classes, and implement mechanisms for doing so across nodes running older code bases.
  6. For deterministic transaction hashing, the serialised forms of the same source data must be identical. No unordered data support (HashSet, HashMap).
    • As part of defining the supported subset of JVM features for these classes, we can ban the use of unordered collection classes.
  7. No requirement for the deserialiser to have the class loaded. We can spin up an implementation given a description/schema for the class.
    • This drives a requirement that serialised classes adhere to simple equals() and hashCode() contracts involving all serialised fields only, and have nothing more than getters and setters.
  8. The ability to serialise and deserialise on non-JVM platforms:
    • A self-describing, documented format.
    • A documented schema for well known/JDK/pre-registered classes that doesn’t need to be embedded in the data stream, for non-JVM support.
    • A preference for the low-level coding, at least, to already be available on other platforms, e.g. AMQP/Avro/Protobuf integer and string encodings.
  9. The ability to embed large binary blobs, such as attachments, and to skip over them when desired, for use in the messaging layer when uploading/downloading attachments.
    • This need not be anything more than support for byte arrays, plus our existing SerialisedBytes wrapper class to indicate some meaning for the bytes.
  10. Support for navigation within the serialised data stream, to select sub-elements for deserialisation when the whole object graph is not required, and to allow subgraphs to be removed and replaced with Merkle trees.
    • Loading contract states from transactions, and possibly querying transaction blobs directly.
    • Each contract state needs to be held as an independent object graph for this. Again, we can use the SerialisedBytes wrapper within the serialised stream.
    • Individual state loading could be solved another way, by breaking up a transaction and storing states independently (e.g. as distinct database rows of blob data), or by hand rolling the transaction serialisation.
  11. As compact as possible in terms of byte count, within reason, but without making decoding on other platforms too difficult.
  12. No requirement for an extensible list of 3rd party custom (de)serialisers to tackle “problem classes”:
    • We will provide custom serialisation for supported core types, such as java.time.*, BigDecimal etc. That set of supported types could be expanded in future releases, as could the whitelist of classes that can be directly constructed.
    • We will mandate and verify the use of simple value classes that represent just the schema and can be serialised trivially, to minimise the impact of this. Class scanning will determine whether non-whitelisted classes meet the necessary criteria.
    • If 3rd party custom serialisation is required, the objects will have to be represented as bytes and deserialised manually in the sandbox, as part of a whitelisted flow or during verify.
    • These values would be opaque to non-JVM platforms and should not be deserialised outside a sandbox.
  13. Minimal fragility in terms of the order of configuration and custom serialiser registration.
  14. Support for tokenising certain types (service object references etc.), perhaps using the existing technique based on implementing an interface, or see the later discussion point about using the class loader to distinguish them.
  15. Optional control over the serialised form. This needs further clarification and detail as to what extent we need it, but some options are:
    • Backwards compatibility between class revisions, even if fields have been added, removed or renamed.
    • Potential support for conversion/upgrade from old serialised forms.
  16. Support for recording the associated source of classes. For example, if we have classes loaded from attachments, then noting which attachment (or which CorDapp etc.) a class belongs to will be important.
    • We can likely do this by having our own custom class loaders that yield up some description/handle for recording into the serialised stream as part of the class identifier.
  17. Object graph support. If we want to use the same technology for checkpointing, then we will need graph support. For messages and transactions it’s entirely plausible that a tree, perhaps with branch de-duplication, will be sufficient. Graph support is not difficult if we build something custom, but perhaps more difficult or impossible if we try to use something like AMQP/Avro.
  18. Support for isolating an object graph so that it is serialised in a consistent fashion, irrespective of which parts of that graph have previously been seen by the serialisation process.
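As a sketch of requirements 4 and 5 together: user classes opt in with an annotation, while core types that cannot carry it are whitelisted. The annotation name, the whitelist contents and the check itself are illustrative only.

```kotlin
import kotlin.reflect.KClass

// Hypothetical opt-in marker for CorDapp classes intended for serialisation.
@Target(AnnotationTarget.CLASS)
annotation class CordaSerialisable

// Core types that cannot be annotated are whitelisted instead.
val coreWhitelist: Set<KClass<*>> = setOf(
    String::class, Int::class, Long::class,
    java.math.BigDecimal::class, java.time.Instant::class
)

// A class may be serialised if it is whitelisted or has explicitly opted in.
fun isAllowed(clazz: KClass<*>): Boolean =
    clazz in coreWhitelist ||
        clazz.java.isAnnotationPresent(CordaSerialisable::class.java)

@CordaSerialisable
data class CashState(val quantity: Long, val owner: String)

fun main() {
    println(isAllowed(CashState::class)) // true: annotated CorDapp class
    println(isAllowed(Thread::class))    // false: arbitrary classpath class
}
```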

Serialisation part II: Planned approach
(Charles) #2

Why is requirement #8 a “must have”, as opposed to a “nice to have”?


(Rick Parker) #3

Good question. There are a number of reasons we want non-JVM interoperability, including:

  1. We never intended Corda to be accessible only from JVM based platforms. The JVM is key to running contract verify and flows, but being able to pass data into and out of non-JVM platforms is completely viable, and having that data be introspected, rather than just treated as a blob by those platforms, is equally viable.
  2. The financial services industry has a large installed base of, and a continuing trend towards, .NET based desktop applications. We don’t want to preclude those applications from being able to understand and introspect transactions and states, and generally interoperate via RPC with Corda nodes. Whilst it’s possible to embed a JVM in-process and do some adapting, native support is highly preferable.
  3. Similarly, an iPad or other iOS device has authentication features that might be of interest in the context of Corda, but only limited support for a JVM. There may be other constrained devices that fall into this category for similar philosophical or technical reasons.

We could follow an approach that treats these as nice-to-haves, but we have known from day one that we had at least some of these interoperability requirements, and it would be naive to ignore them and let every non-JVM application invent its own solution to interoperability, or to provide something suboptimal when we have had the foresight to plan for it. Supporting a well defined and documented data format within Corda is important for other reasons too (e.g. ongoing compatibility: we don’t want to change some code and suddenly be unable to read old transactions), and the incremental requirement, that one should not need the JVM class file (or the ability to parse it) in order to extract data outside the JVM, should not be a significant additional technical burden.


(Charles) #4

Understood. The only problem is that, in such a feverishly hot market, too many high priorities will delay the delivery of an acceptable, viable production core, which folks like us need in order to make our sales pitches.

I should probably start another thread (and I will if requested), but what is the roadmap to a production release?


(Mike Hearn) #5

Hi Charles,

Different people have different definitions of roadmap. The tech white paper says what we’re aiming for feature-wise. How much of it we get through by the end of Q3 (our API stability target) depends a lot on factors outside our direct control, such as the speed of fundraising.

We have an open JIRA. You can see our current sprint there, and we routinely triage the backlog to make sure it’s roughly in order of priority:

https://r3-cev.atlassian.net/projects/CDT/issues/CORDA-5?filter=allopenissues

The .NET API is not currently scheduled onto the critical path for a 1.0 release, so it wouldn’t delay anything.