
Apache Avro Schema-less Serialization: How To

Apache Avro is a well-known and recognized data serialization framework, already officially in use in toolkits like Apache Hadoop.

Avro distinguishes itself from its competitors (like Google’s Protocol Buffers and Facebook’s Thrift) by its intrinsic i. simplicity, ii. speed and iii. serialization ratio close to 1:1 (i.e. almost no metadata needed). It has a set of powerful APIs that make it very easy to serialize and deserialize data; the serialization scheme/algorithm is very effective, since it is very fast and introduces almost no byte overhead (a comprehensive benchmark gives the big picture).

Surfing the web, it is easy to read that Avro embeds the schema (the abstract data representation) into the serialized byte stream, in order to make deserialization possible anywhere, at any time, by any Avro-powered application. Embedding the schema means that the footprint of Avro’s byte stream may be significantly higher on average, in comparison with the byte streams of its competitors.

In many contexts the flexibility given by Avro’s embedded schema is not needed: for example, applications that need to exchange messages (of a well-defined data type) over the network, using a binary channel, can avoid the size overhead by not embedding the schema.

Avro offers schema-less serialization capabilities by operating directly on its own Datum Reader and Writer classes. With this kind of serialization, the data type specification is not included in the byte stream: the assumption is that the receiver, or whichever party performs the deserialization, knows the data schema a priori. This technique is used in HDFS (Hadoop Distributed File System) for network communications.
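A minimal sketch of this schema-less round trip, assuming Avro is on the classpath; the `Message` record and its fields are hypothetical, defined inline so the example is self-contained (a real application would typically share the compiled schema between sender and receiver):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaLessExample {

    // Hypothetical record type, parsed inline for the sake of the example.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"payload\",\"type\":\"string\"}]}");

    static byte[] serialize(GenericRecord record) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The raw binary encoder writes only the field values:
        // no schema, no field names, no per-field tags.
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    static GenericRecord deserialize(byte[] bytes) throws IOException {
        // The reader must know the schema a priori: nothing in the
        // byte stream describes the data layout.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        return new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
    }

    public static void main(String[] args) throws IOException {
        GenericRecord message = new GenericData.Record(SCHEMA);
        message.put("id", 42L);
        message.put("payload", "hello");

        GenericRecord decoded = deserialize(serialize(message));
        System.out.println(decoded.get("id") + " / " + decoded.get("payload"));
    }
}
```

Note that the same `Schema` instance drives both the writer and the reader; if the two sides ever disagree on the schema, the decoder will misread the raw bytes, which is the price paid for dropping the embedded schema.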

On my GitHub account, the code of a Data Serializer shows how to adopt schema-less serialization with very few API calls. Moreover, the test class provides a benchmark over multiple cases, from 64 to 8192 bytes per payload (be careful: a Long is written in the message header, so a further bunch of bytes has to be taken into consideration), in order to measure the performance. The outcome of the benchmark shows that, on average, a combined serialization-deserialization takes a few microseconds. Hereafter an extract of one of the tests.



Benchmark Summary
  #tests: 100
  #repetitions: 5

  64bytes payload:
       mean[us]: 0.08
    size[bytes]: 80.0
  128bytes payload:
       mean[us]: 0.144
    size[bytes]: 144.0
  256bytes payload:
       mean[us]: 0.272
    size[bytes]: 272.0
  512bytes payload:
       mean[us]: 0.528
    size[bytes]: 528.0
  1024bytes payload:
       mean[us]: 1.04
    size[bytes]: 1040.0
  2048bytes payload:
       mean[us]: 2.064
    size[bytes]: 2064.0
  4096bytes payload:
       mean[us]: 4.112
    size[bytes]: 4112.0
  8192bytes payload:
       mean[us]: 8.209
    size[bytes]: 8209.0


7 thoughts on “Apache Avro Schema-less Serialization: How To”

  1. Nice article! But I have a question regarding this approach: I noticed that in your example, the Message class extended SpecificRecord. How can I avoid it? I don’t want all my message classes to extend/implement an Avro abstract class/interface.



    1. Hey there! Thanks for the feedback.
      Actually, the Message class is generated automatically by Avro, and it’s strictly needed if you want to serialize/deserialize without passing back and forth the Schema with data.
      What do you want to achieve exactly?


  2. Hi Paolo. I am new to the Avro and Hadoop world. I have a question as to how Avro will generate the class on its own. What do we need to do in order to generate a Java class corresponding to the type of data that we want to write in the Avro file? I am trying to read quite a huge amount of data from XML files, and want to dump it to an Avro file, so that I can run a Spark job on it to index it into a Solr collection for faster performance.

    Thanks in Advance. Your inputs will be greatly appreciated.


    1. Hi Paolo, I found out that we still need to give a schema file to generate the Java code (POJO), and use the avro-tools jar to generate it. Then, how is this Avro file schema-less? I wondered if it could make the schema on its own if we just write some data to it; I guess it doesn’t happen like that. So, what if I have data in XML files (from which I am parsing the different fields and writing to Avro) which can have different sets of fields, and not all fields are present in each XML? As I understand, I would need to define a superset of fields in the schema file, and their data type should be a union of null and the specific type they could be, say, “string and null”. Could you please suggest if I am going in the right direction?

      Thanks Paolo for this Article. It was really helpful.
      Dimpal Kalra 🙂


      1. Hi there,
        yes, as pointed out earlier in [1], once the JSON schema is defined, the code can be generated and used in your ETL application.
        Avro files are serialized according to the JSON schema; typically, the schema is embedded in the serialized file itself for easy unmarshalling. The stub code generated by the compilation process doesn’t really need the schema embedded in the file (hence the “schema-less” in the title).
        To your point: your ETL process should transform the XML to Avro, and then load the Avro data into Solr, if I understand correctly. The XML-to-Avro step can be defined by custom logic or automated using [2]; once you have the code and the data loaded in memory (the Avro unmarshalling process), it’s a matter of setting up the Spark jobs to write to Solr.
        Let me know if it makes sense, and thanks for your feedback!
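        As a side note on the union-with-null idea raised above, a superset schema with optional fields could be sketched like this (record and field names are purely illustrative; the `"default": null` entries let a writer omit the fields that a given XML does not provide):

        ```json
        {
          "type": "record",
          "name": "Document",
          "fields": [
            {"name": "id",    "type": "string"},
            {"name": "title", "type": ["null", "string"], "default": null},
            {"name": "body",  "type": ["null", "string"], "default": null}
          ]
        }
        ```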



    2. Hi,
      I’m probably missing some context. The process to generate stub classes from Avro is explained in [1]; in particular, once the Avro schema is defined (a JSON schema structure), running the generation utils produces the stub classes that can be used in your program.
      Another alternative consists in dealing directly with Avro’s GenericRecord [2] in your code.
      Your use case looks very much like an ETL (Extraction, Transformation and Loading) process; and, yes, Spark might be a good way to go.


