Apache Avro is a well-known and recognized data serialization framework, already officially in use in toolkits like Apache Hadoop.
Avro distinguishes itself from its competitors (like Google’s Protocol Buffers and Facebook’s Thrift) by its intrinsic (i) simplicity, (ii) speed and (iii) serialization rate close to 1:1 (i.e. almost no metadata needed). It has a set of powerful APIs that make it very easy to serialize and deserialize data; the serialization scheme is very effective, since it is very fast and does not introduce byte overhead (a comprehensive benchmark gives the big picture).
A common claim on the web is that Avro embeds the schema (the abstract data representation) into the serialized byte stream, in order to make deserialization possible anywhere, at any time, by any Avro-powered application. Embedding the schema means that the footprint of Avro’s byte stream might be noticeably larger, on average, than the byte streams of its competitors.
In many contexts the flexibility given by Avro's embedded schema is not needed: for example, applications that exchange messages of a well-defined data type over the network, using a binary channel, can avoid the size overhead by not embedding the schema.
Avro offers schema-less serialization capabilities by operating directly on its own Datum Reader and Writer classes. With this kind of serialization scheme, the data type specification is not written into the byte stream, on the assumption that the receiver, or whichever party is involved in the deserialization, knows the data schema a priori. This technique is used in HDFS (Hadoop Distributed File System) for network communications.
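To make the idea concrete, here is a minimal sketch of what a schema-less Avro byte stream looks like on the wire. It uses only the Python standard library and hand-codes the two encoding rules from the Avro specification that it needs: a long is written as a zig-zag variable-length integer, and a bytes value is written as a long length followed by the raw data. The record layout (a long timestamp followed by a bytes payload) and the function names are assumptions made for illustration; they are not the Message class or the serializer from the GitHub code.

```python
import io


def encode_long(n: int) -> bytes:
    """Encode a 64-bit signed integer as an Avro long (zig-zag + varint)."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf: io.BytesIO) -> int:
    """Decode one Avro long from the current position of buf."""
    shift = 0
    z = 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)  # undo zig-zag


def encode_message(timestamp: int, payload: bytes) -> bytes:
    """Hypothetical record {timestamp: long, payload: bytes}.

    Fields are concatenated in schema order: no field names, no type tags,
    no schema. The reader must already know the layout.
    """
    return encode_long(timestamp) + encode_long(len(payload)) + payload


def decode_message(data: bytes) -> tuple[int, bytes]:
    """Decode using the same (assumed) schema the writer used."""
    buf = io.BytesIO(data)
    timestamp = decode_long(buf)
    length = decode_long(buf)
    return timestamp, buf.read(length)


if __name__ == "__main__":
    wire = encode_message(0, b"x" * 64)
    print(len(wire))  # 67: 1 byte timestamp + 2 bytes length + 64 bytes payload
    print(decode_message(wire) == (0, b"x" * 64))  # True
```

Nothing in the encoded bytes describes the data, which is why the stream stays so close to the payload size and why both sides have to agree on the schema beforehand.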
On my GitHub account, the code of a Data Serializer shows how to adopt schema-less serialization with very few API calls. Moreover, the test class provides a benchmark of multiple cases, from 64 to 4096 bytes per payload, in order to measure the performance (note that a Long is also written in the message header, so a few extra bytes have to be taken into account). The outcome of the benchmark shows that, on average, a combined serialization-deserialization takes a few microseconds. Below is an extract from one of the tests.
Benchmark Summary {
  #tests: 100
  #repetitions: 5
  64bytes payload:   mean[us]: 0.08   size[bytes]: 80.0
  128bytes payload:  mean[us]: 0.144  size[bytes]: 144.0
  256bytes payload:  mean[us]: 0.272  size[bytes]: 272.0
  512bytes payload:  mean[us]: 0.528  size[bytes]: 528.0
  1024bytes payload: mean[us]: 1.04   size[bytes]: 1040.0
  2048bytes payload: mean[us]: 2.064  size[bytes]: 2064.0
  4096bytes payload: mean[us]: 4.112  size[bytes]: 4112.0
  8192bytes payload: mean[us]: 8.209  size[bytes]: 8209.0
}
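The sizes in the summary above can be turned into a per-message overhead with a little arithmetic. The sketch below subtracts the payload size from the reported serialized size, and compares the result with the number of bytes Avro's variable-length encoding would need for the payload length. That second column rests on an assumption the post does not state, namely that the payload is written as a single length-prefixed Avro bytes or string field; the helper name is also hypothetical.

```python
# (payload bytes, serialized size in bytes), copied from the benchmark summary
reported = [
    (64, 80), (128, 144), (256, 272), (512, 528),
    (1024, 1040), (2048, 2064), (4096, 4112), (8192, 8209),
]


def length_prefix_size(n: int) -> int:
    """Bytes used by Avro's zig-zag varint encoding of a non-negative length n."""
    z = n << 1  # zig-zag of a non-negative value is simply 2 * n
    size = 1
    while z >= 0x80:
        z >>= 7
        size += 1
    return size


# One row per payload size: (payload, overhead, assumed length-prefix size)
rows = [(p, s - p, length_prefix_size(p)) for p, s in reported]

for payload, overhead, prefix in rows:
    print(f"{payload:>5} B payload: overhead {overhead} B, length prefix {prefix} B")
```

The overhead is 16 bytes for every payload up to 4096 bytes and 17 bytes at 8192 bytes. Under the stated assumption, that extra byte matches the point where the length prefix grows from 2 to 3 bytes, leaving a constant 14 bytes for the rest of the message.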
Nice article! But I have a question regarding this approach: I noticed that in your example, the Message class extended SpecificRecord. How can I avoid it? I don’t want all my message classes to extend/implement an Avro abstract class/interface.
Thanks!
Xihui
Hey there! Thanks for the feedback.
Actually, the Message class is generated automatically by Avro, and it’s strictly needed if you want to serialize/deserialize without passing the Schema back and forth with the data.
What do you want to achieve exactly?
Hi Paolo. I am new to Avro and Hadoop. I have a question about how Avro generates Message.java on its own. What do we need to do in order to generate a Java class corresponding to the type of data that we want to write to the Avro file? I am trying to read a huge amount of data from XML files and want to dump it to an Avro file, so that I can run a Spark job on it to index it into a Solr collection for faster performance.
Thanks in advance. Your input will be greatly appreciated.
Dimpal
Hi Paolo, I found out that we still need to provide a schema file to generate the Java code (POJO), represented by your Message.java, and use the avro-tools jar to generate it. So how is this Avro file schema-less? I wondered whether it could build the schema on its own if we just write some data to it, but I guess it doesn’t work like that. So, what if I have data in XML files (from which I am parsing the different fields and writing them to Avro) that can have different sets of fields, where not all fields are present in each XML file? As I understand it, I would need to define a superset of fields in the schema file, and their data type should be a union of null and the specific type they could be, say, “String and null”. Could you please tell me whether I am going in the right direction?
Thanks, Paolo, for this article. It was really helpful.
Dimpal Kalra 🙂
Hi there,
yes, as pointed out earlier on by [1], once the JSON schema is defined, the code can be generated and used in your ETL application.
Avro files are serialized according to the JSON schema; typically, the schema is embedded in the serialized file itself for easy unmarshalling; the stub code generated by the compilation process doesn’t really need the schema embedded in the file (hence “schema-less” in the title).
To your point: your ETL process should transform the XML to Avro, and then load the Avro data into Solr, if I understand correctly. The XML-to-Avro step should be defined by custom logic or automated using [2]; once you have the generated code and have loaded the data in memory (the Avro unmarshalling process), it’s a matter of setting up the Spark jobs to write to Solr.
Let me know if it makes sense, and thanks for your feedback!
Cheers,
P
[1] https://www.tutorialspoint.com/avro/serialization_by_generating_class.htm
[2] https://github.com/elodina/xml-avro
Hi,
I’m probably missing some context. The process to generate stub classes from Avro is explained in [1]; in particular, once the Avro schema is defined (JSON schema structure), running the generation utilities produces the stub classes that can be used in your program.
Another alternative is to deal directly with Avro’s GenericRecord [2] in your code.
Your use case looks very much like an ETL (Extract, Transform, Load) process; and, yes, Spark might be a good way to go.
Cheers,
P
[1] https://www.tutorialspoint.com/avro/serialization_by_generating_class.htm
[2] https://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/generic/GenericRecord.html