Introduction to Google Protocol Buffers – Protobuf in Java

Java’s default serialization mechanism is not very efficient and has a host of well-known problems (see Effective Java by Josh Bloch, p. 213). It also doesn’t work well if you want to share data with applications written in C++ or Python. Google Protocol Buffers, also known as protobuf, is an efficient alternative for serializing objects. Protobuf is faster and simpler than XML and more compact than JSON. It was designed to be language-neutral, platform-neutral, and extensible. Currently, protobuf supports C++, C#, Go, Java, and Python. This tutorial is an introduction to Google Protocol Buffers (protobuf) in Java.

What is Protocol Buffer?

Protocol Buffers is a mechanism for serializing structured data. All you have to do is describe each data structure you want to serialize as a message (in a format that resembles a Java class) in a .proto specification file.

From that, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data in an efficient binary format. The generated class provides getters for the fields that make up a protocol buffer (setters live on its builder) and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the format can be extended over time in such a way that newer code can still read data encoded with the old format.
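
To make the compatibility point a little more concrete, here is a minimal hand-written sketch of a reader that knows only field number 1 and skips any field it does not recognise. It is not the code protoc generates: the class name OldReaderSketch and the method readId are made up for illustration, and the sketch assumes that only the varint and length-delimited wire types occur.

```java
// Hypothetical, hand-written reader; real code would use the classes generated by protoc.
final class OldReaderSketch {
    private final byte[] buf;
    private int pos;

    private OldReaderSketch(byte[] buf) {
        this.buf = buf;
    }

    // Reads a base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
    private long readVarint() {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (pos >= buf.length) {
                throw new IllegalArgumentException("truncated varint");
            }
            byte b = buf[pos++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint");
    }

    // An "old" reader: it understands field 1 (an int32) and skips everything else.
    static int readId(byte[] message) {
        OldReaderSketch r = new OldReaderSketch(message);
        int id = 0; // value returned when field 1 is absent
        while (r.pos < r.buf.length) {
            long key = r.readVarint();
            int fieldNumber = (int) (key >>> 3); // upper bits: the field's tag
            int wireType = (int) (key & 0x7);    // lower 3 bits: how the value is laid out
            if (wireType == 0) {                 // varint value
                long value = r.readVarint();
                if (fieldNumber == 1) {
                    id = (int) value;
                }
            } else if (wireType == 2) {          // length-delimited value (string, bytes, nested message)
                long length = r.readVarint();
                if (length < 0 || length > r.buf.length - r.pos) {
                    throw new IllegalArgumentException("truncated field");
                }
                r.pos += (int) length;           // unknown or uninteresting payload: skip it
            } else {
                throw new IllegalArgumentException("wire type not handled in this sketch: " + wireType);
            }
        }
        return id;
    }

    public static void main(String[] args) {
        // Field 1 = 150, followed by a newer field 2 = "testing" that this reader does not know.
        byte[] newer = {0x08, (byte) 0x96, 0x01, 0x12, 0x07, 't', 'e', 's', 't', 'i', 'n', 'g'};
        System.out.println(readId(newer)); // prints 150
    }
}
```

Because every value is prefixed with its tag and wire type, the reader can step over field 2 without knowing what it means. That is the property that lets a message format grow without breaking existing code.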

The protobuf API in Java is used to serialize and deserialize Java objects. You don’t need to worry about any encoding and decoding detail.

Defining Your Protocol Format

You define how you want your data to be structured, using the message format, in a .proto file. This file is consumed by the protocol buffer compiler (protoc), which generates Java classes with getter and setter methods so that you can serialize and deserialize Java objects to and from a variety of streams.

The message format is similar to a Java class. Each message type has one or more fields. Each field has a name and a type. The types can be numbers, booleans, strings, bytes, collections and enumerations. Also, you can nest other message types, allowing you to structure your data hierarchically in much the same way JSON allows.

Fields can be specified as optional, required, or repeated. The field types tell the protoc compiler how to serialize each field’s value into the encoded form of your message. The encoded form is a compact binary representation of your object.
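
To get a feel for what that encoded form looks like, here is a small hand-written sketch that writes an int32 field and a string field the way the protobuf wire format lays them out. In practice the generated classes do this for you. The class WireFormatSketch, its methods, and the choice of field numbers 1 and 2 are assumptions made for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical, hand-written encoder; it only illustrates the byte layout.
final class WireFormatSketch {

    // Writes a base-128 varint: 7 bits per byte, high bit set while more bytes follow.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // int32 fields use wire type 0 (varint).
    static void writeInt32Field(ByteArrayOutputStream out, int fieldNumber, int value) {
        writeVarint(out, ((long) fieldNumber << 3) | 0);
        writeVarint(out, value); // the int is sign-extended to long before encoding
    }

    // string fields use wire type 2 (length-delimited): key, byte length, UTF-8 bytes.
    static void writeStringField(ByteArrayOutputStream out, int fieldNumber, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeVarint(out, ((long) fieldNumber << 3) | 2);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes a message with field 1 (int32) and field 2 (string); both numbers are assumed.
    static byte[] encode(int id, String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeInt32Field(out, 1, id);
        writeStringField(out, 2, name);
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: 08 96 01 12 07 74 65 73 74 69 6e 67
        System.out.println(toHex(encode(150, "testing")));
    }
}
```

The whole message takes 12 bytes. Field names never appear in the output, only field numbers, which is a large part of why the format is so compact.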

Here is a sample .proto file.

Employee.proto

Let us try to understand the definitions in the above file.

  • The .proto file starts with a package declaration, which helps to prevent naming conflicts between different projects.
  • The java_package option names the Java package in which the generated classes are placed. If you don’t provide a java_package option, the name from the package declaration (first line) is used as the Java package name.
  • The java_outer_classname option defines the name of the generated class that contains all of the classes in this .proto file. If you don’t give a java_outer_classname explicitly, it is generated by converting the file name to camel case. For example, “my_proto.proto” would, by default, use “MyProto” as the outer class name. Note that the outer class name cannot be the same as the name of a message in the .proto file.
  • Next, we have message definitions. A message is just an aggregate containing a set of typed fields. As I mentioned earlier, message definition is similar to a Java class. Many standard simple data types are available as field types, including bool, int32, float, double, and string. You can also use other message types as field types – in the above example the Employee message contains Department messages. You can even define message types nested inside other messages.
  • The “ = 1”, “ = 2” markers on each field are the unique “tag” that the field uses in the binary encoding. In other words, the tag (not the field name) is what identifies a field in the binary representation of a message, so a tag should not be changed once the message type is in use.
  • Tags 1 – 15 take one byte to encode, whereas tags 16 – 2047 take two bytes. Google recommends using tags 1 – 15 for commonly used or repeated elements, and reserving some tags in this range for future additions. Each element in a repeated field re-encodes the tag number, so repeated fields benefit most from tags in the 1 – 15 range.
  • Tags 16 and higher can be used for less commonly used optional elements.
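
The byte counts above follow from how the field key is stored: the tag is shifted left by three bits to make room for the wire type, and the result is written as a varint with seven payload bits per byte. The sketch below simply does that arithmetic; TagSizeSketch and keySize are hypothetical names and not part of the protobuf API.

```java
// Hypothetical helper that computes how many bytes a field key occupies on the wire.
final class TagSizeSketch {

    static int keySize(int fieldNumber) {
        // The low 3 bits carry the wire type; they never change the byte count,
        // so they are left as zero here.
        long key = (long) fieldNumber << 3;
        int size = 1;
        while ((key >>>= 7) != 0) {
            size++; // one more byte for every further 7 bits
        }
        return size;
    }

    public static void main(String[] args) {
        for (int tag : new int[] {1, 15, 16, 2047, 2048}) {
            System.out.println("tag " + tag + " -> " + keySize(tag) + " byte(s)");
        }
    }
}
```

Tags 1 and 15 come out as 1 byte, 16 and 2047 as 2 bytes, and 2048 as 3 bytes, which matches the ranges quoted above.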

Each field must be annotated with one of the following modifiers:

  • required – a value for the field must be provided, otherwise the message is considered “uninitialized”. Trying to build an uninitialized message throws a RuntimeException. Parsing an uninitialized message throws an IOException. Other than this, a required field behaves exactly like an optional field. You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, changing it to an optional field is problematic, because readers that still use the old definition will treat messages without the field as incomplete.
  • optional – the field may or may not be set. If an optional field value isn’t set, a default value is used. For simple types, you can specify your own default value, as we’ve done for the salary field in the example. Otherwise, a system default is used: zero for numeric types, the empty string for strings, false for bools. Calling the accessor to get the value of an optional (or required) field which has not been explicitly set always returns that field’s default value.
  • repeated – the field may be repeated any number of times (including zero). The order of the repeated values will be preserved in the protocol buffer. Think of repeated fields as dynamically sized arrays.
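
The three modifiers are easier to picture with a hand-written stand-in for a generated class. The sketch below is not protoc output. EmployeeSketch, its id and skills fields, and the DEFAULT_SALARY value are all assumptions chosen for illustration; only the existence of a salary field with a default comes from the example. It mimics the behaviour described above: a missing required field makes build() throw a RuntimeException, an unset optional field returns its default, and a repeated field keeps its elements in order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hand-written stand-in for a generated message class; names and values are hypothetical.
final class EmployeeSketch {
    static final double DEFAULT_SALARY = 1000.0; // assumed default, for illustration only

    private final int id;              // behaves like a required field
    private final boolean hasSalary;   // optional: remembers whether a value was set
    private final double salary;
    private final List<String> skills; // behaves like a repeated field

    private EmployeeSketch(Builder b) {
        this.id = b.id;
        this.hasSalary = b.hasSalary;
        this.salary = b.salary;
        this.skills = Collections.unmodifiableList(new ArrayList<>(b.skills));
    }

    int getId() { return id; }
    boolean hasSalary() { return hasSalary; }
    double getSalary() { return salary; } // the default when never set
    List<String> getSkillsList() { return skills; }

    static Builder newBuilder() { return new Builder(); }

    static final class Builder {
        private boolean hasId;
        private int id;
        private boolean hasSalary;
        private double salary = DEFAULT_SALARY;
        private final List<String> skills = new ArrayList<>();

        Builder setId(int value) { hasId = true; id = value; return this; }
        Builder setSalary(double value) { hasSalary = true; salary = value; return this; }
        Builder addSkills(String value) { skills.add(value); return this; }

        EmployeeSketch build() {
            if (!hasId) {
                // Mirrors the "uninitialized message" failure with a RuntimeException subclass.
                throw new IllegalStateException("required field 'id' is not set");
            }
            return new EmployeeSketch(this);
        }
    }

    public static void main(String[] args) {
        EmployeeSketch e = newBuilder().setId(7).addSkills("java").addSkills("sql").build();
        System.out.println(e.getId() + " " + e.hasSalary() + " " + e.getSalary() + " " + e.getSkillsList());
    }
}
```

Keeping a separate has-flag next to each optional value is what lets callers tell “never set” apart from “set to the default”.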

You’ll find a complete guide to writing .proto files – including all the possible field types – in the Protocol Buffer Language Guide.

Compiling Your Protocol Buffers

The next step is to compile your .proto files. The compiler converts your message types (Employee and Department in our example) into augmented classes that provide, among other things, getters and setters for your fields. Each class has its own Builder class that you use to create instances of that class. The compiler also generates convenience methods to write messages to output streams and parse them back from input streams.

To generate Java classes from your messages, you need to run the protocol buffer compiler, protoc, on your .proto file.

For Java, the simplest way to install the protocol compiler is to download a pre-built binary from the Protocol Buffers releases page. In the downloads section of each release, you can find pre-built binaries named in the following format:

zip package: protoc-$VERSION-$PLATFORM.zip.

For this example I downloaded protoc-2.5.0-win32.zip, which contains the protoc compiler binary.

Now run the compiler, specifying the path to your .proto file and the destination directory (where you want the generated class files to go).

The syntax for the protoc command is: protoc [OPTION] PROTO_FILES.

  • The --proto_path option specifies where to search for the .proto files. If you don’t specify this option, the compiler searches for .proto files in the current directory.
  • The --java_out option specifies the destination directory for the generated Java files. In our example we use the current directory for the generated source files.

You can run the protoc --help command to see all the available options.

Below is the command I have used to compile Employee.proto file.

The above command generates EmployeeProto.java under com/tutorial/protobuf (from the java_package definition), relative to the current directory (given by the --java_out option).

Note that the message classes generated by the compiler are immutable i.e. once built, they cannot be changed.

Serialize proto buffers using protobuf Java API

Now that we have generated the Java class from the messages, the next step is to serialize these message objects in Java. To do this you need the protobuf Java API. You can download the protobuf Java library (.jar file) from the Maven repository and add it to your Java project.

Note that the compiler version should be the same as the Java API version, so I have downloaded protobuf-java-2.5.0.jar to use in our example.

If yours is a Maven project, you can add the following Maven dependency.

Here is a simple Java example to serialize message objects to OutputStream and read the same from InputStream.

ProtoBufferExample.java

Below is the output of running the above program.

That’s all about Google Protocol Buffers in Java. If you have any queries, post them in the comments section.

Further Reading

Protocol Buffer Basics: Java

5 Reasons to Use Protocol Buffers Instead of JSON
