Protobufs – Protocol Buffers
As you can see from my blog, last week I attended the PyCon Israel 2022 conference (Day 1 summary, Day 2 summary). As one would expect for a Python conference in 2022, the hot topic was Machine Learning. Along with Machine Learning comes something called “Big Data”, which is still an ambiguous term, but it basically means your app or your company is sitting on a large amount of data. These companies usually want to leverage that data, most often by processing it with a Machine Learning algorithm or model. So how do Protobufs help solve this problem, and what even are they? (No, they are not some sort of Neanderthal swamp monster wielding clubs, which is the image that appears in my head when I hear the word.)
The Problem
If your data is of a certain scale, then you run into a few issues:
- Storage can become very expensive
- Individual records may become too large for a model to process efficiently
If your application has millions of users and is complex (think electronic medical records, or financial documentation), then storing your data across multiple tables and running tons of joins and unions to get results will be slow. This has led to a movement toward “NoSQL”, in which all of the data is stored in a single object, or model; when it is requested, ALL of the data is retrieved and the application itself digs out the values it needs to perform the task. Many NoSQL solutions are based on JSON or XML. Storing your data as JSON or XML text means that as your data model grows, so does the size of the object. That increases the amount of storage you need, and if your model is very large, consuming and traversing that data can slow down the execution of your program.
Sample JSON
{
  "glossary": {
    "title": "example glossary"
  }
}
The Solution – Protocol Buffers
Protobufs solve this problem by reducing the entire model and its data to a compact binary encoding, with the encoding and decoding logic embedded in the program itself.
Model for Protobuf:
syntax = "proto3";

message Glossary {
  optional string title = 1;
}
In this example, the “1” is the field number assigned to “title”; the encoded message stores that number and the value, rather than the field name as text. The same message definition is loaded into your code and used for both encoding and decoding your data. Doing so reduces the amount of data transmitted over the network, and since machines, not humans, are processing this data, there is no need for the transmitted bytes to be human-readable.
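To make that concrete, here is a minimal sketch in Python. It assumes the message above is saved as glossary.proto and compiled with protoc --python_out=. glossary.proto, which generates a glossary_pb2 module:

# Assumes glossary_pb2 was generated by protoc from glossary.proto
import json

import glossary_pb2

# Encode: build the message and serialize it to a compact byte string.
msg = glossary_pb2.Glossary()
msg.title = "example glossary"
wire_bytes = msg.SerializeToString()

# Decode: the same message definition turns the bytes back into an object.
decoded = glossary_pb2.Glossary()
decoded.ParseFromString(wire_bytes)
print(decoded.title)  # "example glossary"

# Compare with the equivalent JSON payload, which repeats the field names as text.
json_bytes = json.dumps({"glossary": {"title": "example glossary"}}).encode()
print(len(wire_bytes), "bytes on the wire vs", len(json_bytes), "bytes as JSON")

Even for this tiny model the Protobuf bytes are smaller than the JSON text, and the gap grows as the model grows, because field names never travel over the wire.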
Protobufs: are they overkill?
As with anything, the answer depends on what you are trying to achieve. There are many other serialization tools out there on the market, including ones with greater readability than Protobufs, such as Pydantic, which serializes models to plain JSON. So if your goal is fast transmission of data between points, it would seem that Protobuf should be your choice. But if you want to maintain readability even outside of your code, then perhaps serializing your data another way is advisable.
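For comparison, here is a minimal sketch of the same model in Pydantic (assuming Pydantic v2), where the serialized form stays human-readable JSON:

from pydantic import BaseModel

class Glossary(BaseModel):
    title: str

g = Glossary(title="example glossary")
# Readable output, but the field names travel as text in every record.
print(g.model_dump_json())  # {"title":"example glossary"}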
Watch a video that taught me the basics about Protobufs. It also has a really cool example at the end.