Core Development · Distributed Algorithms and Communication Protocols · Distributed Computing · Performance · Persistence

Apache Avro Schema-less Serialization: How To

Apache Avro is a well-know and recognized data serialization framework, already officially in use in toolkits like Apache Hadoop.

Avro distinguishes itself from the competitors (like Google’s Protocol Buffers and Facebook’s Thrift) for its intrinsic i. simplicity, ii. speed and iii. serialization rate close to 1:1 (i.e. almost no metadata needed). It has a set of powerful APIs that make very easy to serialize and deserialize text-data; the serialization scheme/algorithm is very effective, since it is very fast and do not introduce byte overhead (here a comprehensive benchmark to get a big picture).

Surfing the web it is possible to read that Avro embeds the schema (abstract data representation) into the serialized byte stream, and this in order to make the deserialization possible anywhere, at any time, by any Avro-powered application. Embedding the schema means the that footprint of Avro’s byte stream might be relevantly higher on average, in comparison with byte streams of its competitors.

In many contexts the flexibility given by Avro, in terms of embedded schema, are not needed: for example, applications that need to exchange messages (well defined data type) over the network, using a binay channel, might avoid the size overhead by avoiding to embed the schema.

Avro offers schema-less serialization capabilities by directly operating on its own Datum Reader and Writer classes. Adopting this kind of serialization schema, the data type specification is not introduced in the byte stream assuming that the receiver or the party involved in the deserialization knows a-priori the data schema. Such technique is used in the HDFS (Hadoop Distributed File System) for the network communications.

On my GitHub account the code of a Data Serializer shows how to adopt the schema-less serialization with very few API calls. Moreover, the  test class provides a benchmark of multiple cases: from 64 to 4096 byte per payload (be careful, in the message header a Long is written, a further bunch of byte has to be taken into consideration), in order to measure the performance. The outcome of such benchmark shows that, on average, a combined serialization-deserialization is operated in few microseconds. Hereafter an extract of one of the tests.



Benchamrk Summary
  #tests: 100
  #repetitions: 5

  64bytes payload:
       mean[us]: 0.08
    size[bytes]: 80.0
  128bytes payload:
       mean[us]: 0.144
    size[bytes]: 144.0
  256bytes payload:
       mean[us]: 0.272
    size[bytes]: 272.0
  512bytes payload:
       mean[us]: 0.528
    size[bytes]: 528.0
  1024bytes payload:
       mean[us]: 1.04
    size[bytes]: 1040.0
  2048bytes payload:
       mean[us]: 2.064
    size[bytes]: 2064.0
  4096bytes payload:
       mean[us]: 4.112
    size[bytes]: 4112.0
  8192bytes payload:
       mean[us]: 8.209
    size[bytes]: 8209.0


2 thoughts on “Apache Avro Schema-less Serialization: How To

  1. Nice article! But I have a question regarding this approach: I noticed that in your example, the Message class extended SpecificRecord. How can I avoid it? I don’t want all my message classes to extend/implement an Avro abstract class/interface.



    1. Hey there! Thanks for the feedback.
      Actually, the Message class is generated automatically by Avro, and it’s strictly needed if you want to serialize/deserialize without passing back and forth the Schema with data.
      What do you want to achieve exactly?


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s