There are various changes to GemFire server clusters or to their application-deployed artifacts that can’t be applied without a cold cluster restart, where all servers in a cluster must be stopped and restarted together, and the service experiences downtime. Examples are changes to Region definitions, such as enabling partitioning or disk persistence, certain GemFire version upgrades, and complex changes to domain object classes. To avoid incurring downtime for mission-critical GemFire systems needing such changes, we can use a Blue-Green style migration that bridges the old and new systems via a custom “migration client” application.
Due to growth in data, a GemFire system needs to migrate one of its Regions from Replicated to Partitioned. Replicated regions store and update all data on every node, while partitioned regions are sharded, and thus scalable to very large data volumes. The system is mission-critical and cannot experience significant down-time as part of the migration procedure. This is a structural data storage change that can’t be applied to a single cluster without a simultaneous restart of every server—a cold restart of the cluster and system service downtime. Proceeding with a Blue-Green approach, we set-up a new cluster with the target configuration in parallel with the old one. The problem then becomes: how do we synchronize the state between the old and new clusters, keep them synchronized even as the old “Blue” cluster remains active in production, and finally cut-over to the new “Green” cluster without service interruption. Further, we want to accomplish the migration with minimal risk and effort.
We will also briefly discuss how to reverse the replication from Green to Blue after cut-over in order to keep the old system fully available for rollback without potential loss of data.
In this case, we show how to migrate a Region from type Replicated to type Partitioned, inclusive of enriching the Region keys to support custom partitioning/co-location via a Partition Resolver plug-in. Although the original use-case is far more complex, we have reduced it to the most essential components to demonstrate the migration pattern.
Basic Use-Case and Related GemFire Artifacts
The system handles the input and in-memory storage of streaming data updates, and services client applications that use the data for real-time decision making. The system holds its data in a GemFire Region named “Data”.
Originally the system load and set of unique keys was small enough that Data could be replicated to all nodes without filling-up the available memory and without too much update latency, and it used a Region of type “Replicated”. This configuration was optimized for scalable parallel read access. The Data system has been very successful, and due to rapid growth of Data’s entry count, overall update rate, and client application instances needing Data, a replicated-everywhere Region storage model was no longer optimal. Too much overall memory was used to support more copies of each Data entry than client applications actually needed for parallel access, and update overhead grew with each added node. Thus, the decision was made to migrate Data to use Partitioned Region storage, spreading the data over the cluster. Since GemFire allows read operations on redundant copies of partitioned data, the change from Replicated to Partitioned still supports plenty of parallel read access for any “hot” data. A more detailed discussion of Region types can be found here.
Sidebar Note: When you partition data in GemFire, you also partition the load generated by update and client access activity. Thus, the total size of the data set is only one of the considerations for choosing data partitioning vs. replicating a copy of each entry to every node.
The Data system has Two types of GemFire clients:
Input Data Feeds — Relay updates to Data in real-time from some external source. Our example keeps things very simple—targeting a single Region—so we can focus on the migration procedure details. There could be many different Regions, and the update rate may be as high as ten’s of thousands per second without complicating our solution. In theory we can scale the solution as much as needed so long as the Blue cluster has sufficient headroom to forward updates to the migration clients (we discuss this in more detail later in Performance Considerations).
Application Server Clients -- Use the Data Region to optimize real-world/real-time business decisions. Clients may read, create, and update on-demand or subscribe to live Data updates in order to react to real-world events as quickly as possible.
These are the typical sorts of clients we expect for GemFire, and their role here is to show the techniques used to migrate the server cluster without service interruption. The clients read and write data using the Region API and the OQL Query Service. In most cases, the server-side changes that benefit from Blue-Green style migrations do not require changes to client applications.
Additionally, we introduce the concept of a Continuous Query Migration Client, or CQMC, whose job is to synchronize data between the Blue and Green clusters with minimum overhead, strong resiliency, and operational ease. More on the CQMC after we describe the basic server-side components and configuration.
Blue GemFire Cluster Components
The Blue GemFire cluster already exists and is running in production. It has the following simplified attributes/configuration:
Servers — Some number of nodes, but with limited scalability of total data and number of nodes due to Data being Replicated, with each entry being stored and updated on every node. After we migrate Data to partitioned, the server cluster becomes far more scalable in terms of total data size, update rate, and concurrent access.
Regions — Just the one: Data. The real-world system would likely have many different Regions, but that’s not relevant to demonstrating the CQMC migration pattern. All Regions can use the same technique to synchronize the Blue/Green clusters.
Cache Server — Provides access to GemFire Client applications via Client Pools. Also manages the GemFire client queues used to deliver micro-batched asynchronous notifications to the CQMC’s registered Continuous Queries.
Green GemFire Cluster Components
Our Green GemFire cluster configuration has some changes that create the need for our migration process:
Servers — Some number of nodes, but now more scalable and able to manage far more Data entries even with the same node count. A key advantage is that the cluster size does not need to be the same for the Blue and Green systems. Many other server attributes may also differ between Blue and Green, such as the GemFire or Java version, use of Disk Persistence, use of WAN Gateways, and even domain object class types. All of these “cold restart” changes can be applied without system downtime via the use of a CQMC-driven Blue/Green migration pattern.
Regions — Also just Data, but now Partitioned with redundancy. Note also that optimized “single-hop” client request routing for partitioned data is dynamically enabled by default, so client applications re-directed from the Blue to the Green cluster automatically adapt to this server-side change.
Cache Server — Used the same way as in the Blue cluster. Client Queues only come into play if reverse replication is used for potential rollback.
The Green cluster is deployed to a new set of hardware resources. We recommend using a pre-production or production-parallel environment, able to switch roles with the Blue cluster after the migration is complete.
Overall, the above components look something like this:
Continuous Query Migration Client
The Migration Client application is the focal point of our migration strategy. It performs these key functions:
Copies into the new “Green” system Data Region for the first time after deployment.
Replicates all subsequent Data changes until production cut-over.
Applies in-flight migration-related modifications to data entries (keys or values). In our example, we append a partitioning key attribute to the existing key in order to support custom data co-location. More complex changes are possible as well, such as refactoring object graphs and/or the Regions that store them, or general enrichment, filtering, etc.
Manages any interruption without losing Data updates. The first time the client runs, it copies the full data set between the clusters. Subsequently it replicates all new Blue updates until the Blue-Green production cut-over. The initial full data set copy only needs to happen once, so if the CQMC crashes or needs a restart for any reason, it only consumes changes queued on the server since it disconnected. Note that the Blue server cluster automatically manages update queue state to assure no events are lost, using client queues that are themselves redundant and highly available (and optionally durable).
For simplicity, we focus on how the Migration Client handles a single (Data) Region’s migration tasks. Later we’ll discuss some more advanced options for how to scale to larger data sets or higher update rates by partitioning the workload of a single or multiple Regions over multiple client instances.
Adding the CQMC we now have a bridge between the clusters:
Continuous Query Migration Client Lifecycle Details
The CQMC works as follows:
Launches and bootstraps the GemFire configuration, which connects to both the Blue and Green clusters via its Client Pools. The Proxy Region is linked to the Green Pool and is ready to process update requests in its proxy role.
Registers a Continuous Query to select all Entries from the Region. The query String is “SELECT * FROM /Data”. Note that when you use “SELECT *” with GemFire’s OQL, the results returned are a collection of the Region entry objects. This is logically the same as doing a Region.getAll() operation with the complete set of Region keys.
Obtains a Query Service instance from the Blue Client Pool and Executes the Continuous Query with Initial Results. The client logic also registers a custom CQListener to receive updates after processing the initial result set.
The initial result set--the entire contents of the Data Region--are returned from the CQ executeWithInitialResults method call. The CQMC code iterates over the Data result set. For each entry, the logic performs the following:
Read the Entry Value’s foreignKey attribute.
Read the Entry Key (a String), and append a delimiter character (“_”) and the foreignKey to the end. This becomes the Entry’s new Data key in the Green cluster.
Write the updated Key and the unchanged Value to the local Data Proxy Region via a Region.put(k,v) Region API call. Since the Data Proxy Region is linked to the Green Pool, the update is applied in the Green cluster.
While the CQ Migration Client processes the Initial Results, any new changes to Data are queued for later consumption. This important capability assures that no updates are ever lost, and the two clusters are assured to eventually become fully consistent (this happens at cut-over time).
After the full Initial Result Set processing is complete--and all of the initial Data entries have their keys enriched and are safely stored in the Green cluster’s Data Partitioned Region, the client logic invokes the Client Cache readyForEvents() method. This initiates delivery of the queued changes that weren’t part of the Initial Result Set, and then continuing Data Region updates.
CQListener updates continue and are all transformed and relayed to the Green cluster, driven by updates to the Blue cluster from its Input Feeds and the Client Queues that deliver them to the CQMC. CQ “update” events may result from new, updated, or deleted Blue cluster Region Entries, and are applied as such to the Green cluster.
At no point during the above steps is the normal production activity on the Blue cluster interrupted.
Operational Execution of the Migration
Here are the operational steps to execute the CQMC-based system migration:
Deploy & Launch Green cluster in parallel with Blue. At first, the Blue and Green clusters are not linked in any way. The Green cluster can be used for testing purposes, but should be cleared of data before the production migration procedure.
Deploy the CQMC application with enough Java Heap space to hold the entire CQ initial Result Set in memory. In our simple Data Region example, this means holding the entire Data Entry Set in CQMC’s memory. This is because CQ Initial Result sets are not paged as the client processes them, but are transmitted in their entirety first before the client thread invoking CQ.executeWithInitialResults() gets control back. Note that a GemFire Java client can easily handle 10’s of GB of data—and potentially up to 100’s of GB’s—as long as there’s enough Java heap space available. If you don’t have access to a host with enough memory to hold the entire Data Region’s contents in memory at once, then it is possible to divide the workload into smaller pieces by executing multiple CQ Queries, each retrieving only a subset. We discuss techniques for doing this at the end of this post as part of the performance and scalability considerations.
Launch the CQMC application. It launches and begins the process of migrating Data. It will take time to process the CQ initial results—the complete contents of the Data Region—depending on its total size, the quality of the network and host hardware, etc. The only way to measure this accurately is to test it on your specific environment and with your data, but one minute/GB of data is a safe estimate for a single CQMC instance. Multiple/parallel CQMC instances can achieve far higher transfer rates and shorten the time it takes for the initial copy of Data, but consider that with planning, we needn’t be in a big hurry. Also, if the CQ migration client fails before completing its work, we can safely restart the process.
Wait for the CQMC application to reach a steady state. Once the Data Region’s contents have been copied from the Blue to the Green cluster, the CQMC next consumes the queue of additional Data changes (since the initial result set). A single CQMC instance can handle up to several thousand updates/second without the server-side update queues growing, so it should very quickly reach a steady state with the queue size at or near zero (you can use gfsh to check on this). When it reaches this steady state, the Green cluster is always within a second or two of being fully synchronized with the Blue cluster, and you are now ready for Blue-Green production cut-over.
Everything up until this point is preparatory and can be managed over a period of minutes, hours, or even days. The next steps are the actual production cut-over, and may impact production service availability unless executed as efficiently as possible. We advise that you automate these steps as much as possible.
Stop update feeds to Blue cluster. CQMC synchronizes any final updates.
Stop client applications connected to Blue cluster. It is being decommissioned.
Start update feeds to Green cluster. If the feed is a parallel/duplicate of the Blue update feed, this step can potentially happen before step five, removing even a brief pause in service availability (some updates may be applied to the Green cluster twice if both update paths are briefly open).
Start client applications connected to the Green cluster. If you have a Green environment for your client application instances, these should already be up and connected to the Green GemFire cluster, and the final step is to redirect service requests from the Blue to the Green application servers. If not, then you need to restart the application servers, now pointing at the Green cluster’s GemFire Locators (instead of the Blue cluster Locators).
The time it takes to re-direct update feeds and client application traffic from the Blue to the Green cluster represents the potential period of service downtime due to the migration. Fully optimized, the process can achieve sub-second service downtime--essentially zero.
So why do we recommend CQ client-driven migrations? The overall migration pattern and general supporting features described here are certainly similar to others based on use of a second “Green” system instance. Within the “Blue-Green” paradigm of systems migration, the CQMC approach has these important advantages:
No special/temporary changes needed to the production GemFire cluster in order to enable the migration process. The original “Blue” server cluster stays as-is, and the new “Green” cluster can include multiple changes representing the desired new configuration. The added load to the Blue system of CQMC clients should be manageable within a typical system’s existing headroom, particularly if we execute the migration during a time of the day or the week with lower than average traffic.
Any custom migration logic needed only for the migration process is isolated to the migration client, and only runs for the duration of the migration process.
Data is moved from the Blue to the Green clusters safely and asynchronously using GemFire’s client queues. These are very similar to (WAN) Gateway Sender queues—with options for queue redundancy and persistence as well as automated queue failover if a server unexpectedly crashes. The key difference is that they can be used dynamically, without needing special configuration or related cluster restarts.
The data migration is initiated and controlled entirely from GemFire clients, using Client Pools to connect to both Blue and Green clusters, CQ API’s to pull data from the Blue system, and the basic Region API to push it to the Green system.
The Continuous Query GemFire feature automatically and transparently handles the most difficult problem of Blue-Green migrations for data management system: first copying an initial image of Blue system data to the Green system, and then keeping them synchronized with additional changes up until the migration completes.
Supporting Domain Object Class Migrations (or the Schema Evolution Problem)
We mentioned earlier that CQ’s with a “*” SELECT projection (“SELECT * FROM . . .”) return a collection of the actual domain object class instances. If one of the CQ migration client’s tasks is to convert between different versions of a domain object class, writing instances of the new class version as output to the Green cluster, then we must avoid using the old version (of course both class versions can’t simultaneously be loaded/used in the same Java process). We can work around this problem by selecting the source data with a SELECT query projection that individually specifies each of the old class version’s member attributes. This would look something like “SELECT attr1, attr2, attr3, etc FROM . . .”.
Our logic to handle the CQ results then changes from iterating over a collection of domain class instances to iterating over an OQL SelectResults collection, logically similar to a JDBC ResultSet. The same data is available, but now packaged without the use of the old/conflicting domain object class definition. This enables us to instantiate instances of the new domain object class definition, populate them from the CQ results (adding any enrichment or re-factoring to account for the changes), and transmit those instances to the Green cluster directly via the Region API.
Please note that many changes to domain object classes are automatically handled by GemFire’s PDX technology, but that discussion is beyond the scope of this post. For more on this topic see this blog.
Performance & Scalability
For most small-to-medium sized GemFire clusters, we don’t anticipate serious performance or scalability challenges. However, for larger clusters, or clusters with especially high data update rates, some special tuning may be necessary. These are the main areas where we anticipate performance bottlenecks and/or scalability challenges may arise:
Large Region Size
One limitation of Continuous Queries is how the initial Result Set is transmitted from server to client. Specifically, the result set is streamed into the client’s memory in its entirety, requiring the Java process heap size to be large enough for it to fit. One way to solve this limitation is to replace use of the CQ initial results features with manual (code-driven) retrieval of the initial Region data set.
We can accomplish this by leveraging the GemFire Client Region API keySetOnServer() method, then iterating over the keys and manually retrieving their Region entries via getAll() method calls in micro-batches. Following this pattern, only a small part of the overall Region’s entry set is resident in the CQMC memory at any given time.
The procedure still begins with registration of a Continuous Query for all Region Entries (but without initial results). While the manual/micro-batched transfer of data takes place, concurrent updates to the Region data accumulate in server-side CQ update delivery queues. Delivery of the queued updates only starts after the CQMC invokes readyForEvents() on the ClientCache service object, providing fine-grained control over the key elements of this procedure. Although perfect ordering of multiple updates to the same Region Entry (concurrent with micro-batched transfer) is not possible, we are assured of consistency between the two clusters once the CQ update queues are drained.
High Region Update Rate
In cases where the target Region’s update rate is so high that a single client thread can’t keep-up, multiple/parallel CQMC instances (configured as above for large data sets) can also provide improved throughput scalability.
Keep in-mind that while updates to a very busy production GemFire cluster typically come from multiple parallel sources—often where each individual update is quite small—the client queue machinery automatically micro-batches the updates before they are delivered to CQ clients. Over-the-wire, many updates to the source Region are combined into a batch, helping to remove much of the on-the-wire latency. Thus, a CQ client may be able to handle a higher update event rate than might be expected just from looking at the source Region’s update characteristics.
There are many deployed GemFire systems with very high update throughput rates, and for those, the following pattern can be used to split the load over multiple CQ Client instances.
Work can be divided over multiple CQMC instances by adding CQ “WHERE” clause predicates to narrow-down each one’s load to just a subset of the complete data set. If this approach is used, then we recommend injecting each CQMC instance’s Continuous Query String as a startup property or argument.
How you divide the data via query predicates depends on the specific data and use-case. For example, if we assume the following:
The original query String is: “SELECT * FROM /Data”
We want to divide the CQMC workload across 3 instances, each processing about ⅓ of the total data set.
The Java class type stored in Data includes some public attribute we can use to limit the query results to an appropriate subset via a range query. For our example, we’ll assume objects in Data have the public integer attribute named foo. The range of possible values for foo is between 0 and 1000, with a relatively even distribution.
We can extend the query String for the First CQMC instance to be:
“SELECT * FROM /Data where foo >= 0 AND foo <= 333”
Similarly, the other two instances limit their Data result sets between 334 and 666, and between 667 and 1000, respectively. The resulting distribution between our three CQMC instances doesn’t have to be perfectly even, it just needs to be close enough to enable our goal of horizontally scaling the workload. The specific object attribute used for limiting the ranges, and the specific values for each range, completely depend on the actual target data.
Even with the option to parallelize for throughput rate scalability, we suggest duplicating the update feed before it reaches the GemFire cluster for the highest throughput systems, and routing the duplicate feed directly into the Green cluster instead of using a CQMC. This may be appropriate especially for sources of data where the lifetime of the data is very short, and thus the benefits of synchronizing older Region state is of little or no importance. Another option in this case is to execute the CQMC client (or multiple in-parallel) without first obtaining an initial Result Set, instead transferring the existing Region Entry Set via the manual/micro-batched approach described in the preceding section.
Reversing the Continuous Query Flow for Rollback
Once cut-over is made to the Green GemFire cluster, what are the implications of having to perform an unexpected rollback to the Blue cluster? The major concern is loss of updates—the two clusters are out-of-sync again as soon as update feeds and/or client application update workloads are flowing directly to Green, and the decision to rollback could mean losing those updates. To support rollback without data loss we need to:
Create a version of the Migration Client that reverses the changes applied to in-flight data. For our example, we would remove the part of the Region key that’s been added as a custom partitioning Routing Key.
Configure these Migration Client instances’ CQ’s to read from Green and write to Blue. Execute the CQ’s without initial results. The two clusters are already in-sync at this point, so only new updates are needed to continue keeping them in sync on the reverse path. Without initial results processing, there is very little overhead to launching Migration Clients, and so this step can be very fast.
Right before starting updates to the Green system (after migration steps 1-6), launch the reverse Migration Client instances.
Now, the roles of the two clusters are reversed, and when updates flow into Green, they are automatically replicated back to Blue, and rollback is possible without loss of data. The reverse Migration Clients are deactivated once the Green system is fully verified and a rollback option is no longer necessary.
Having both clusters active at the same time is a more complicated proposition. If both Migration Clients and reverse Migration Clients are simultaneously active, then special logic must be added to prevent updates from flowing in an endless loop. There are several techniques (such as tracking the original source of an update) that can enable this pattern.
We note, however, that bi-directional, asynchronous replication between clusters is already one of GemFire’s core features. To enable simultaneous operation of multiple application (Blue/Green) versions and a gradual migration of dependent client application instances, the more sophisticated long-term strategy is to have two clusters running at all times, and be able to alternately take-on the Blue and Green roles. We plan to write a future blog post focused on the advantages of GemFire “WAN” Gateways for continuous operations for various migration use-cases.
About the Author
Gideon Low is a Staff Systems Engineer/Architecht at PivotalMore Content by Gideon Low