Data Serialization: How to Run Multiple Big Data Apps on a Single Data Store with GemFire

September 17, 2013 Bruce Schuchardt

With the race to harness the value of big data in full swing, no one is happy about storing multiple versions of data. It’s expensive, delays data availability, and just seems like a waste of time. Why can’t we separate the application logic and share the same data?

The short answer is we can now, but historically this has been a challenge for developers, mostly due to dependencies across multiple codebases and data access layers. As developers, we have all faced situations where we want to deploy new code, but some other client relies on the old code. It can stop a deployment dead in its tracks. As our models and needs mature more towards data as a service, we expect that multiple clients will run different versions of a codebase and even use different languages. In these cases, data serialization and class versioning become quite important.

The vFabric GemFire data grid offers several options for serializing data, and its Portable Data eXchange (PDX) serialization, introduced in v6.6, allows you to easily deploy new versions of domain classes into a running system. This is a must-have feature if you have client applications that can’t be updated every time you update the server data grid: old clients with old domain classes should continue working when newer versions of the classes are introduced on servers.

Where Other Solutions Fall Short

In designing the GemFire serialization approach, GemFire product engineers surveyed the available solutions for high performance serialization and versioning. They were lacking. Java Serializable is too slow and doesn’t offer much help versioning classes. Google Protobuf requires external descriptions of classes with numeric identifiers for each field. Oracle Coherence Portable Object Format requires similar numeric identifiers and descriptions and does not handle subclasses well. Apache Avro loses new fields if an old client reads and updates an object.

Since speed is often top of mind in architecture, we also ran our own test, based on open source benchmarks, to compare the options. Here are the results:

[Figure: PDX serialization performance chart]

As you can see, there are multiple options with GemFire.

GemFire’s solution focuses on performance and ease of use. We wanted to make developers more productive, still provide depth, and offer the option of tuning for higher performance. There are several serializer options, including the addition of your own serializer class and field access without full deserialization, as well as access by Java, .NET, or C++ clients.

How the Auto-Serializer Works

The easiest, most automated form of PDX serialization uses the reflection-based auto-serializer. This serializer analyzes classes to extract type information and stores the info in a distributed registry. The registry acts behind the scenes across the distributed system and even between sites over WAN connections. The type registry allows PDX serialized objects to be small and keeps track of the fields in your objects so that there is no need to assign numbers to them.
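To see why a shared type registry keeps serialized objects small, here is a minimal, self-contained sketch. It is not GemFire's implementation: TypeRegistrySketch, idFor, fieldsOf, and the Trade class are hypothetical names, and the real registry is distributed across members rather than held in one in-memory map. The sketch only illustrates the idea that field names are registered once and each serialized object can then carry a small type id instead of repeating them.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a single-JVM stand-in for a distributed type registry.
final class TypeRegistrySketch {
    private final Map<String, Integer> idsBySignature = new HashMap<>();
    private final Map<Integer, List<String>> fieldsById = new HashMap<>();

    // Reflect over the class, record its field names once, and return a compact id.
    synchronized int idFor(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Field f : type.getDeclaredFields()) {
            int mods = f.getModifiers();
            if (Modifier.isStatic(mods) || Modifier.isTransient(mods) || f.isSynthetic()) {
                continue; // skip fields that are not part of the object's state
            }
            names.add(f.getName());
        }
        String signature = type.getName() + names;
        Integer existing = idsBySignature.get(signature);
        if (existing != null) {
            return existing; // same class shape seen before: reuse its id
        }
        int id = idsBySignature.size() + 1;
        idsBySignature.put(signature, id);
        fieldsById.put(id, names);
        return id;
    }

    // Look up the field names that a type id stands for.
    List<String> fieldsOf(int typeId) {
        return fieldsById.get(typeId);
    }

    // Hypothetical domain class used only for this sketch.
    static final class Trade {
        int id;
        String symbol;
    }
}
```

Because the id is derived from the class name plus its field list, a new version of a class with an added field would simply register as a new type, with no hand-assigned field numbers.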

[Figure: how the auto-serializer works]

When the auto-serializer deserializes a PDX object that was created with a different version of the class, the serializer leaves any new fields in the class at their default values. If it finds that fields have been deleted, the serialized data for those fields is ignored. There is also no need to allocate new field numbers for added fields or to update external metadata files. GemFire PDX does it for you.
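The version-tolerant read described here can be illustrated with a small sketch. This is an assumption-laden stand-in, not the auto-serializer itself: the serialized form is modeled as a plain name-to-value map, and VersionTolerantReadSketch, populate, and Customer are hypothetical names.

```java
import java.lang.reflect.Field;
import java.util.Map;

// Illustrative only: name-based field matching between class versions.
final class VersionTolerantReadSketch {
    // Hypothetical "version 2" class: "region" was added, "legacyCode" was removed.
    static final class Customer {
        int id;
        String name;
        String region; // new in this version; absent from older serialized data
    }

    // Copy matching named values into the target and skip names the class no longer declares.
    static <T> T populate(T target, Map<String, Object> serialized) {
        for (Map.Entry<String, Object> entry : serialized.entrySet()) {
            try {
                Field field = target.getClass().getDeclaredField(entry.getKey());
                field.setAccessible(true);
                field.set(target, entry.getValue());
            } catch (NoSuchFieldException removed) {
                // The field was deleted from this class version: ignore its data.
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return target; // fields with no serialized value keep their defaults
    }
}
```

Populating a Customer from data written by the older class leaves region at its default of null and silently drops legacyCode.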

[Figure: new field in an existing class]

If you want to exercise more control over what is serialized, you can configure the auto-serializer in your cache.xml file. PDX lets you exclude fields and specify identity fields for equality comparisons if the object is used as a key.

<parameter name="classes">
    <string>com.company.DomainObject.*#identity=id.*#exclude=creationDate</string>
</parameter>

In the example above, DomainObject instances are treated as equal if their fields matching “id.*” are equal, and the “creationDate” field is not serialized by PDX.
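One way to read such a pattern string is as a class-name regular expression followed by “#”-separated clauses. The sketch below parses it under that assumption. It is not GemFire's parser, and ClassPatternSketch and its methods are hypothetical names used only to make the roles of the three parts concrete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: splits "classRegex#identity=fieldRegex#exclude=fieldRegex".
final class ClassPatternSketch {
    private final Pattern classPattern;
    private final List<Pattern> identity = new ArrayList<>();
    private final List<Pattern> exclude = new ArrayList<>();

    ClassPatternSketch(String spec) {
        String[] parts = spec.split("#");
        classPattern = Pattern.compile(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            String[] clause = parts[i].split("=", 2);
            if (clause.length != 2) {
                throw new IllegalArgumentException("Bad clause: " + parts[i]);
            }
            Pattern fieldPattern = Pattern.compile(clause[1]);
            if (clause[0].equals("identity")) {
                identity.add(fieldPattern);
            } else if (clause[0].equals("exclude")) {
                exclude.add(fieldPattern);
            } else {
                throw new IllegalArgumentException("Unknown clause: " + clause[0]);
            }
        }
    }

    boolean appliesTo(String className) {
        return classPattern.matcher(className).matches();
    }

    boolean isIdentity(String fieldName) {
        return matchesAny(identity, fieldName);
    }

    boolean isExcluded(String fieldName) {
        return matchesAny(exclude, fieldName);
    }

    private static boolean matchesAny(List<Pattern> patterns, String value) {
        for (Pattern p : patterns) {
            if (p.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With the pattern from the configuration above, a field named id counts as an identity field and creationDate is excluded, while a field such as status is neither.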

Lost Fields on Objects in Old Applications

Another problem people have with versioning is loss of new fields if an older class is used to perform an update on a serialized, newer-version object. The older class deserializes the object, modifies it, and reserializes it to store in GemFire. In these cases, the reserialized object doesn’t contain the new fields that were in the original serialized object. This is a crucial problem when dealing with old client applications or upgrading applications that are connected to other sites. No one wants to shut down everything and perform an across-the-board upgrade just because some crucial new fields were added to a class.

In GemFire, if PDX serialization is being used, the system automatically keeps track of what has been read from the serialized object and preserves the data that the application hasn’t looked at. When the value is put back into the cache, all “unread” fields are transferred to the new serialized form.
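The round trip can be sketched in a few lines. This is a simplified model rather than GemFire's mechanism: the serialized object is represented as a map, and UnreadFieldsSketch and OldPortfolio are hypothetical names for an older class version that knows only two of the stored fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: carry unread fields through an old class version.
final class UnreadFieldsSketch {
    // A hypothetical "old" class version that knows only id and status.
    static final class OldPortfolio {
        int id;
        String status;
        // Data this class version did not read, carried along untouched.
        private final Map<String, Object> unread = new LinkedHashMap<>();

        static OldPortfolio fromData(Map<String, Object> serialized) {
            OldPortfolio p = new OldPortfolio();
            p.unread.putAll(serialized);
            // Reading a field removes it from the unread set.
            p.id = (Integer) p.unread.remove("id");
            p.status = (String) p.unread.remove("status");
            return p;
        }

        Map<String, Object> toData() {
            // Start from the unread fields so newer data survives the update.
            Map<String, Object> out = new LinkedHashMap<>(unread);
            out.put("id", id);
            out.put("status", status);
            return out;
        }
    }
}
```

An old client that changes status and writes the object back still emits the newVal field it never knew about.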

[Figure: missing field in class]

Since this incurs some overhead, you can also turn automatic tracking off and do the work yourself. This is most easily done with the PdxSerializable interface. In the example below, the fromData method preserves the unread fields and the toData method passes them to the output writer. This is similar to what you must do with Oracle Coherence EvolvablePortableObject.

// Holds the serialized fields that this version of the class does not read
private PdxUnreadFields unreadFields;

public void fromData(PdxReader reader) {
    // Capture the unread fields before reading the known ones
    this.unreadFields = reader.readUnreadFields();
    this.id = reader.readInt("id");
    this.creationDate = reader.readDate("creationDate");
    this.pkid = reader.readString("pkid");
    this.positions = (Map<String, PositionPdx>) reader.readObject("positions");
    this.type = reader.readString("type");
    this.status = reader.readString("status");
    this.names = reader.readStringArray("names");
    this.newVal = reader.readByteArray("newVal");
}

public void toData(PdxWriter writer) {
    // Hand the preserved fields back so they are written with the object
    writer.writeUnreadFields(this.unreadFields);
    writer.writeInt("id", id)
          .writeDate("creationDate", creationDate)
          .writeString("pkid", pkid)
          .writeObject("positions", positions)
          .writeString("type", type)
          .writeString("status", status)
          .writeStringArray("names", names)
          .writeByteArray("newVal", newVal);
}

In the example above, the newer class knows about the new field, so its serialization code reads and writes that field directly. Including the readUnreadFields and writeUnreadFields calls means that fields added in future versions will also survive a round trip through this class.

Explicit Serialization Options

If you want to exercise more control and get higher performance for certain classes, you can have them implement the interface PdxSerializable. With this, you write toData/fromData methods.

// Portfolio fields
private int id;
private String pkid;
private Map<String, Position> positions;
private String type;
private String status;
private String[] names;
private byte[] newVal;
private Date creationDate;
...

public void toData(PdxWriter writer) {
    writer.writeInt("id", id)
          .writeDate("creationDate", creationDate) // fixed-length field
          .writeString("pkid", pkid)
          .writeObject("positions", positions)
          .writeString("type", type)
          .writeString("status", status)
          .writeStringArray("names", names)
          .writeByteArray("newVal", newVal);
}

In the example above, fields are identified by name rather than by number. GemFire documentation recommends that you write fixed-size fields first. By doing this, you get better performance when accessing particular fields of a PDX object without deserializing everything.
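The benefit of writing fixed-size fields first can be shown with a toy byte layout. This is not the PDX wire format, which the article does not describe. FixedFirstLayoutSketch and its layout (an int, a long, then a length-prefixed string) are assumptions chosen to show that a fixed-size field placed before any variable-length data sits at a constant offset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: a toy layout with fixed-size fields ahead of variable-length ones.
final class FixedFirstLayoutSketch {
    static byte[] write(int id, long creationMillis, String pkid) {
        byte[] text = pkid.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 4 + text.length);
        buf.putInt(id);              // offset 0, always 4 bytes
        buf.putLong(creationMillis); // offset 4, always 8 bytes
        buf.putInt(text.length);     // offset 12, length prefix for the string
        buf.put(text);               // variable-length data goes last
        return buf.array();
    }

    // Read one field at its known offset without decoding the string at all.
    static long readCreationMillis(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong(4);
    }
}
```

If the string came first, the reader would have to decode its length before it could locate the date. With the fixed-size fields in front, the offset is known in advance.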

Update Individual Fields Without Serialization

Yes, that’s right. You can actually access the serialized objects held in the data grid and mess with them. The serialized objects are instances of PdxInstance, and you can access any fields you like without deserializing the others. Fields are accessed with getField(fieldName), and null is returned if the field doesn’t exist. You can also update individual fields of the PdxInstance. The product comes with an example of WAN conflict resolution that uses this feature to automatically merge modifications coming from two separate, but active, GemFire data grids. To update a field, you invoke createWriter() on the PdxInstance and call setField() on the writer it returns.

// Update a field and put it back into the cache
// without deserializing the entire object
WritablePdxInstance updatedPdx = myPdxInstance.createWriter();
updatedPdx.setField("fieldName", fieldValue);
region.put(key, updatedPdx);

In the example above, the WritablePdxInstance returned by createWriter() references all of the fields of the original PdxInstance, and updates shadow the original fields.
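The shadowing behavior can be pictured as an overlay on top of an untouched original. The sketch below is only a model of that idea using plain maps. ShadowWriterSketch is a hypothetical name, and nothing here reflects how GemFire stores the data internally.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: updates live in an overlay; the original is never modified.
final class ShadowWriterSketch {
    private final Map<String, Object> original;
    private final Map<String, Object> updates = new HashMap<>();

    ShadowWriterSketch(Map<String, Object> original) {
        this.original = original;
    }

    // Record the new value in the overlay only.
    void setField(String name, Object value) {
        updates.put(name, value);
    }

    // An updated value shadows the original; untouched fields fall through.
    Object getField(String name) {
        return updates.containsKey(name) ? updates.get(name) : original.get(name);
    }
}
```

Reads of an updated field see the new value, reads of any other field fall through to the original, and the original data stays as it was.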

Using PdxInstance objects allows you to work in Java on objects created in C++ or C# without creating Java classes. It also saves a lot of serialization/deserialization time because none is being done.

Using External Serializers

Portable Data eXchange serialization can also be managed by separate PdxSerializer classes that you write. Your serializer implements toData(Object, PdxWriter) and fromData(Class, PdxReader). Then, you register the serializer with GemFire in cache.xml and say which classes it controls as shown below.

<cache>
    <pdx>
        <pdx-serializer>
            <class-name>com.xyz.analysis.PackageSerializer</class-name>
            <parameter name="classes">
                <string>com.xyz.analysis..*</string>
            </parameter>
        </pdx-serializer>
    </pdx>
    ...
</cache>

public Object fromData(Class<?> clazz, PdxReader reader) {
    if (clazz == Portfolio.class) {
        Portfolio inst = new Portfolio();
        inst.id = reader.readInt("id");
        inst.creationDate = reader.readDate("creationDate");
        inst.pkid = reader.readString("pkid");
        // etc.
        return inst;
    }
    // Not a class this serializer handles
    return null;
}

public boolean toData(Object obj, PdxWriter writer) {
    if (obj.getClass() == Portfolio.class) {
        Portfolio p = (Portfolio) obj;
        writer.writeInt("id", p.id)
              .writeDate("creationDate", p.creationDate)
              .writeString("pkid", p.pkid);
        // etc.
        return true;
    }
    // Not a class this serializer handles
    return false;
}

PDX transparently preserves data from newer versions of a class when serializers are used to read/write objects in older applications.

Pivotal GemFire Serializer: Performance, Choice, Ease of Use, Productivity

All of these mechanisms tie together to provide a seamless, portable data exchange that can be accessed by Java, C#, and C++ applications without having to write extensive IDL or XML descriptions or to generate or enhance code.
