Data Serialization: How to Run Multiple Big Data Apps on a Single Data Store with GemFire

September 17, 2013 Bruce Schuchardt

With the race to harness the value of big data in full swing, no one is happy about storing multiple versions of data. It’s expensive, delays data availability, and just seems like a waste of time. Why can’t we separate the application logic and share the same data?

The short answer is we can now, but historically this has been a challenge for developers, mostly due to dependencies across multiple codebases and data access layers. As developers, we have all faced situations where we want to deploy new code, but some other client relies on the old code. It can stop a deployment dead in its tracks. As our models and needs mature more towards data as a service, we expect that multiple clients will run different versions of a codebase and even use different languages. In these cases, data serialization and class versioning become quite important.

The vFabric GemFire data grid offers several options for serializing data. Its Portable Data eXchange (PDX) serialization, introduced in v6.6, allows you to easily deploy new versions of domain classes into a running system. This is a must-have feature if you have client applications that can’t be updated every time you update the server data grid. Old clients with old domain classes should continue working when newer versions of the classes are introduced on servers.

Where Other Solutions Fall Short

In designing the GemFire serialization approach, GemFire product engineers surveyed the available solutions for high performance serialization and versioning. They were lacking. Java Serializable is too slow and doesn’t offer much help versioning classes. Google Protobuf requires external descriptions of classes with numeric identifiers for each field. Oracle Coherence Portable Object Format requires similar numeric identifiers and descriptions and does not handle subclasses well. Apache Avro loses new fields if an old client reads and updates an object.

Since speed is often top of mind in architecture, we also ran our own test based on open source benchmarks to compare. Here were the results:

As you can see, there are multiple options with GemFire.

GemFire’s solution focuses on performance and ease of use. We wanted to make developers more productive, still provide depth, and offer the option of tuning for higher performance. There are several serializer options, including the addition of your own serializer class and field access without full deserialization, as well as access by Java, .NET, or C++ clients.

How the Auto-Serializer Works

The easiest, most automated form of PDX serialization uses the reflection-based auto-serializer. This serializer analyzes classes to extract type information and stores the info in a distributed registry. The registry acts behind the scenes across the distributed system and even between sites over WAN connections. The type registry allows PDX serialized objects to be small and keeps track of the fields in your objects so that there is no need to assign numbers to them.
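To make the registry idea concrete, here is a minimal sketch in plain Java. It is a toy model, not GemFire’s implementation: the class name `ToyTypeRegistry` and its methods are hypothetical. It only shows why a shared registry keeps payloads small: each distinct class-and-field layout is stored once and given a compact ID, so a serialized object needs to carry only that ID plus its values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: illustrates the idea of a type registry, not GemFire's code.
final class ToyTypeRegistry {
    private final Map<List<String>, Integer> idsByLayout = new HashMap<>();
    private final List<List<String>> fieldsById = new ArrayList<>();

    /** Returns the ID already assigned to this class/field layout, or assigns the next one. */
    synchronized int register(String className, List<String> fieldNames) {
        // The lookup key is the class name followed by its field names, in order.
        List<String> key = new ArrayList<>();
        key.add(className);
        key.addAll(fieldNames);

        Integer existing = idsByLayout.get(key);
        if (existing != null) {
            return existing;
        }
        int id = fieldsById.size();
        fieldsById.add(Collections.unmodifiableList(new ArrayList<>(fieldNames)));
        idsByLayout.put(Collections.unmodifiableList(key), id);
        return id;
    }

    /** Looks up the field names recorded for a previously assigned type ID. */
    synchronized List<String> fieldNames(int typeId) {
        return fieldsById.get(typeId);
    }
}
```

In this sketch, two versions of the same class with different field lists simply receive two different IDs, which is how a reader can tell which layout a given payload was written with.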

When the auto-serializer deserializes a PDX object that was created with a different version of the class, the serializer leaves any new fields in the class at their default values. If it finds that fields have been deleted, the serialized data for those fields is ignored. As well, there is no need to allocate new field numbers for added fields or update external metadata files. GemFire PDX does it for you.
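The version-tolerance rules above can be sketched with a small reflection-based toy. This is an illustration under stated assumptions, not GemFire’s serializer: `ToyAutoSerializer`, `TradeV1`, and `TradeV2` are hypothetical names, and the "serialized form" is just a map from field name to value. Fields are matched by name, so a field added in the newer class keeps its default and data for a removed field is ignored.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of name-based, reflection-driven serialization (not GemFire's wire format).
final class ToyAutoSerializer {

    /** Captures every instance field of obj as a name -> value pair. */
    static Map<String, Object> write(Object obj) {
        Map<String, Object> out = new LinkedHashMap<>();
        try {
            for (Field f : obj.getClass().getDeclaredFields()) {
                if (skip(f)) continue;
                f.setAccessible(true);
                out.put(f.getName(), f.get(obj));
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    /** Rebuilds an instance of type, matching fields by name only. */
    static <T> T read(Map<String, Object> data, Class<T> type) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T obj = ctor.newInstance();
            for (Field f : type.getDeclaredFields()) {
                // A field missing from the data is new in this class version: keep its default.
                if (skip(f) || !data.containsKey(f.getName())) continue;
                f.setAccessible(true);
                f.set(obj, data.get(f.getName()));
            }
            // Entries in data with no matching field (deleted fields) are never looked at.
            return obj;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    private static boolean skip(Field f) {
        return f.isSynthetic() || Modifier.isStatic(f.getModifiers());
    }
}

// Two hypothetical versions of one domain class: V2 drops "status" and adds "type".
final class TradeV1 { int id; String status; }
final class TradeV2 { int id; String type = "UNKNOWN"; }
```

Writing a `TradeV1` and reading it back as a `TradeV2` copies `id`, leaves `type` at its default, and silently skips `status`.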

If you want to exercise more control over what is serialized, you can configure the auto-serializer in your cache.xml file. PDX lets you exclude fields and specify identity fields for equality comparisons if the object is used as a key.

    <parameter name="classes">com.company.DomainObject.*#identity=id.*#exclude=creationDate</parameter>

In the example above, two DomainObject instances are treated as equal if their identity fields (those whose names match “id.*”) are equal, and the field “creationDate” is not serialized by PDX.
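One way to read the configuration string is as a class-name pattern followed by `#`-separated options whose values are regular expressions over field names. The sketch below parses the string that way in plain Java. It is an assumption-laden illustration, not GemFire’s parser: `ToyAutoSerializerRule` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Toy parser for a spec such as
// "com.company.DomainObject.*#identity=id.*#exclude=creationDate".
// Assumption: every part is a regular expression, split on '#'.
final class ToyAutoSerializerRule {
    private final Pattern classPattern;
    private final List<Pattern> identity = new ArrayList<>();
    private final List<Pattern> exclude = new ArrayList<>();

    ToyAutoSerializerRule(String spec) {
        String[] parts = spec.split("#");
        classPattern = Pattern.compile(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("expected option=pattern: " + parts[i]);
            }
            if (kv[0].equals("identity")) {
                identity.add(Pattern.compile(kv[1]));
            } else if (kv[0].equals("exclude")) {
                exclude.add(Pattern.compile(kv[1]));
            } else {
                throw new IllegalArgumentException("unknown option: " + kv[0]);
            }
        }
    }

    boolean appliesTo(String className) {
        return classPattern.matcher(className).matches();
    }

    boolean isIdentity(String fieldName) {
        return matchesAny(identity, fieldName);
    }

    boolean isExcluded(String fieldName) {
        return matchesAny(exclude, fieldName);
    }

    private static boolean matchesAny(List<Pattern> patterns, String name) {
        for (Pattern p : patterns) {
            if (p.matcher(name).matches()) return true;
        }
        return false;
    }
}
```

Under this reading, the pattern `id.*` selects every field whose name starts with "id", so a class with both `id` and `idSuffix` fields would use both for equality.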

Lost Fields on Objects in Old Applications

Another problem people have with versioning is the loss of new fields when an older class is used to update a serialized, newer-version object. The older class deserializes the object, modifies it, and reserializes it to store in GemFire. In these cases, the reserialized object doesn’t contain the new fields that were in the original serialized object. This is a crucial problem when dealing with old client applications or upgrading applications that are connected to other sites. No one wants to shut down everything and perform an across-the-board upgrade just because some crucial new fields were added to a class.

In GemFire, if PDX serialization is being used, the system automatically keeps track of what has been read from the serialized object and preserves the data that the application hasn’t looked at. When the value is put back into the cache, all “unread” fields are transferred to the new serialized form.
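The round trip described above can be modeled in a few lines of plain Java. This is a toy sketch, not GemFire code: `OldPortfolio` is a hypothetical old class version, and the serialized form is represented as a simple name-to-value map. Whatever the old class does not read is kept aside and written back untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: an "old" class version that knows only the fields id and status.
final class OldPortfolio {
    int id;
    String status;
    // Everything this version did not read, carried along for the next write.
    private Map<String, Object> unread = new LinkedHashMap<>();

    /** Assumes the data contains "id" and "status"; any other entries are preserved as unread. */
    static OldPortfolio fromData(Map<String, Object> serialized) {
        Map<String, Object> rest = new LinkedHashMap<>(serialized);
        OldPortfolio p = new OldPortfolio();
        p.id = (Integer) rest.remove("id");
        p.status = (String) rest.remove("status");
        p.unread = rest;
        return p;
    }

    /** Writes the preserved unread fields first, then the fields this version knows. */
    Map<String, Object> toData() {
        Map<String, Object> out = new LinkedHashMap<>(unread);
        out.put("id", id);
        out.put("status", status);
        return out;
    }
}
```

If a newer application stored an extra field such as `newVal`, an update made through `OldPortfolio` changes `status` while `newVal` survives the round trip.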

Since this tracking incurs some overhead, you can also turn it off and do the work yourself. This is most easily done with the PdxSerializable interface. In the example below, the fromData method preserves the unread fields and the toData method passes them to the output writer. This is similar to what you must do with Oracle Coherence EvolvablePortableObject.

    // Holds the fields that this version of the class did not read
    private PdxUnreadFields unreadFields;

    @SuppressWarnings("unchecked")
    public void fromData(PdxReader reader) {
        this.unreadFields = reader.readUnreadFields();
        this.id = reader.readInt("id");
        this.creationDate = reader.readDate("creationDate");
        this.pkid = reader.readString("pkid");
        this.positions = (Map<String, PositionPdx>) reader.readObject("positions");
        this.type = reader.readString("type");
        this.status = reader.readString("status");
        this.names = reader.readStringArray("names");
        this.newVal = reader.readByteArray("newVal");
    }

    public void toData(PdxWriter writer) {
        writer.writeUnreadFields(this.unreadFields);
        writer.writeInt("id", id)
              .writeDate("creationDate", creationDate)
              .writeString("pkid", pkid)
              .writeObject("positions", positions)
              .writeString("type", type)
              .writeString("status", status)
              .writeStringArray("names", names)
              .writeByteArray("newVal", newVal);
    }

In the example above, the new class knows about the new field, so its serialization code can read and write it. To stay compatible with future changes, include the readUnreadFields and writeUnreadFields calls as well.

Explicit Serialization Options

If you want to exercise more control and get higher performance for certain classes, you can have them implement the interface PdxSerializable. With this, you write toData/fromData methods.

    // Portfolio fields
    private int id;
    private String pkid;
    private Map<String, Position> positions;
    private String type;
    private String status;
    private String[] names;
    private byte[] newVal;
    private Date creationDate;
    ...

    public void toData(PdxWriter writer) {
        writer.writeInt("id", id)
              .writeDate("creationDate", creationDate) // fixed-length field
              .writeString("pkid", pkid)
              .writeObject("positions", positions)
              .writeString("type", type)
              .writeString("status", status)
              .writeStringArray("names", names)
              .writeByteArray("newVal", newVal);
    }

In the example above, field names are used. GemFire documentation recommends that you write fixed-sized fields first. By doing this, you get better performance when accessing particular fields of a PDX object without deserializing everything.
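A rough intuition for the fixed-size-first recommendation: when fixed-width fields come first, their byte offsets are constants, so a reader can jump straight to one of them without parsing any variable-length data. The sketch below shows this with a made-up layout in plain Java. It is not GemFire’s wire format, and `ToyFixedFirstLayout` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy layout (an assumption for illustration):
// [int id][long creationDate][int pkidLength][pkid bytes]
final class ToyFixedFirstLayout {
    static final int ID_OFFSET = 0;
    static final int CREATION_DATE_OFFSET = Integer.BYTES;

    /** Writes the two fixed-width fields first, then the variable-length string. */
    static byte[] encode(int id, long creationDateMillis, String pkid) {
        byte[] pkidBytes = pkid.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(
                Integer.BYTES + Long.BYTES + Integer.BYTES + pkidBytes.length);
        buf.putInt(id)
           .putLong(creationDateMillis)
           .putInt(pkidBytes.length)
           .put(pkidBytes);
        return buf.array();
    }

    /** Reads creationDate at its constant offset without touching the string data. */
    static long readCreationDate(byte[] encoded) {
        return ByteBuffer.wrap(encoded).getLong(CREATION_DATE_OFFSET);
    }
}
```

Had the string been written first, the position of `creationDate` would depend on the string’s length, and a reader would have to decode that length before locating the date.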

Update Individual Fields Without Serialization

Yes, that’s right. You can actually access the serialized objects held in the datagrid and mess with them. The serialized objects are instances of PdxInstance, and you can access any fields you like without deserializing the others. Fields are accessed with getField(fieldName) and a null is returned if the field doesn’t exist. You can also update individual fields of the PdxInstance. The product comes with an example of WAN conflict resolution that uses this feature to automatically merge modifications coming from two separate, but active, GemFire datagrids. To update a field you just invoke createWriter() on the PdxInstance.

    // Update a field and put it back into the cache
    // without deserializing the entire object
    WritablePdxInstance updatedPdx = myPdxInstance.createWriter();
    updatedPdx.setField("fieldName", fieldValue);
    region.put(key, updatedPdx);

In the example above, the WritablePdxInstance returned by createWriter() references all of the fields of the original PdxInstance, and any updates shadow the original field values.
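The shadowing behavior can be pictured as an overlay on top of the original field values. The following plain-Java sketch is a toy model, not the GemFire class: `ToyWritableView` is a hypothetical name, and rejecting unknown field names in `setField` is an assumption made for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy overlay: reads fall through to the original unless a field has been updated.
final class ToyWritableView {
    private final Map<String, Object> original;
    private final Map<String, Object> updates = new HashMap<>();

    ToyWritableView(Map<String, Object> original) {
        this.original = original;
    }

    /** Records a new value that shadows the original; the original map is never modified. */
    void setField(String name, Object value) {
        if (!original.containsKey(name)) {
            // Assumption for this sketch: only existing fields may be updated.
            throw new IllegalArgumentException("no such field: " + name);
        }
        updates.put(name, value);
    }

    /** Returns the shadowing value if present, else the original value, else null. */
    Object getField(String name) {
        return updates.containsKey(name) ? updates.get(name) : original.get(name);
    }
}
```

Because only the changed fields are stored in the overlay, the untouched fields never need to be copied or decoded.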

Using PdxInstance objects allows you to work in Java on objects created in C++ or C# without creating Java classes. It also saves a lot of serialization and deserialization time, because none is being done.

Using External Serializers

Portable Data eXchange serialization can also be managed by separate PdxSerializer classes that you write. Your serializer implements toData(Object, PdxWriter) and fromData(Class, PdxReader). Then, you register the serializer with GemFire in cache.xml and say which classes it controls as shown below.

    <cache>
      <pdx>
        <pdx-serializer>
          <class-name>com.xyz.analysis.PackageSerializer</class-name>
          <parameter name="classes">
            <string>com.xyz.analysis..*</string>
          </parameter>
        </pdx-serializer>
      </pdx>
      ...
    </cache>

    public Object fromData(Class clazz, PdxReader reader) {
        if (clazz == Portfolio.class) {
            Portfolio inst = new Portfolio();
            inst.id = reader.readInt("id");
            inst.creationDate = reader.readDate("creationDate");
            inst.pkid = reader.readString("pkid");
            // etc.
            return inst;
        }
        // Not a class this serializer handles
        return null;
    }

    public boolean toData(Object obj, PdxWriter writer) {
        if (obj.getClass() == Portfolio.class) {
            Portfolio p = (Portfolio) obj;
            writer.writeInt("id", p.id)
                  .writeDate("creationDate", p.creationDate)
                  .writeString("pkid", p.pkid);
            // etc.
            return true;
        }
        // Not a class this serializer handles
        return false;
    }

PDX transparently preserves data from newer versions of a class when serializers are used to read/write objects in older applications.

Pivotal GemFire Serializer: Performance, Choice, Ease of Use, Productivity

All of these mechanisms tie together to provide a seamless, portable data exchange that can be accessed by Java, C#, and C++ applications without having to write extensive IDL or XML descriptions or to generate or enhance code.
