Case Study: Scaling Reservations for the World’s Largest Train System, China Railways Corporation

November 7, 2013 Stacey Schneider


The biggest annual movement of humans on the planet happens around the Chinese New Year, also known as the Spring Festival. According to China Daily, there are 34.88 million trips by air, 235 million trips by rail, and 2.85 billion road trips during the peak of China’s Spring Festival holiday period. Historically, rail travel has meant long lines and waits, and China Railways Corporation (CRC) began to sell tickets online to offer a more convenient method than purchases at stations, ticket offices, or by phone.

As use of and access to the online reservation system grew, travel-season traffic placed enough pressure on CRC's rail reservation systems to break their legacy RDBMS, warranting a new project to improve online performance and scalability. Holiday travel periods like Spring Festival pressure the system even further; during these spikes, the site becomes one of the most popular websites in China. Under such severe demand, travelers experienced outages, poor performance, booking errors, payment failures, and problems with ticket confirmations.

One of the first steps taken by Dr. Zhu Jiansheng, Vice Director of the China Academy of Railway Sciences, was to look for areas to boost performance. Back in 2011, Dr. Zhu began sponsorship of a new system based on the two known performance bottlenecks at the time:

  1. The relational database was overloaded to the point where it could handle neither the scale of incoming requests nor the level of reliability required to meet their SLAs.
  2. The UNIX servers' computational power was inadequate to meet the capacity requirements.

According to Dr. Zhu, “Traditional RDBMS and mainframe computing models just do not scale like a system built to run in memory across multiple nodes. Our website was proof of this, and trying to scale our legacy system was going to become very expensive.”

Solving Scale and Availability Problems with In-Memory Data Grids

Dr. Zhu’s team began looking at other solutions. Mainframes were found to have the same bottleneck issues as the RDBMS. In exploring in-memory data grids (IMDG), they found Pivotal GemFire, an IMDG with a proven track record for scaling some of the most challenging data problems in the world across financial services, airlines, e-commerce, and other industries. To perform an evaluation, Dr. Zhu and his team selected International Integrated Systems, Inc. (IISI). The IISI team had a strong track record of working with government organizations, developing transportation solutions, migrating legacy systems to cloud architectures, and working with Pivotal GemFire. They began with a pilot, believing GemFire would meet the performance, scale, and availability requirements as well as run on low-cost, commodity hardware.
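
To make the data-grid approach concrete, here is a minimal sketch of how an application tier might read and update ticket inventory through a GemFire client cache. The locator address, region name, key format, and value type are hypothetical stand-ins rather than details from CRC's system, and the imports follow the GemFire 7-era com.gemstone.gemfire package names.

```java
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.client.ClientCache;
import com.gemstone.gemfire.cache.client.ClientCacheFactory;
import com.gemstone.gemfire.cache.client.ClientRegionShortcut;

public class TicketLookupClient {
    public static void main(String[] args) {
        // Connect to the data grid through a locator (host and port are placeholders).
        ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("locator.example.internal", 10334)
                .create();

        // A PROXY region holds no local data; every get/put goes to the grid,
        // which keeps the hot ticket data entirely in server memory.
        Region<String, Integer> seatsLeft = cache
                .<String, Integer>createClientRegionFactory(ClientRegionShortcut.PROXY)
                .create("SeatInventory");

        // Key format (train number + travel date) is purely illustrative.
        String key = "G1234:2013-09-30";
        Integer remaining = seatsLeft.get(key);
        if (remaining != null && remaining > 0) {
            seatsLeft.put(key, remaining - 1);   // reserve one seat
        }

        cache.close();
    }
}
```

A real reservation flow would need to guard the decrement with GemFire transactions or server-side function execution rather than a bare get/put, but the sketch shows the basic access pattern an online booking tier would use against the grid.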

The IISI team created a proof of concept that demonstrated several advantages with GemFire. The speed of ticket calculation improved 50 to 100 times, and response times held at 10-100 millisecond latencies as load increased. They could also see the ability to add capacity on demand while achieving near-linear scalability and high availability. The project team built a pilot in just two months, and four months later the new online system was fully deployed to all classes of passengers across 5,700 train stations.
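
Adding capacity on demand in GemFire typically means starting additional cache servers and rebalancing partitioned data onto them. The sketch below assumes a newly started server-side member that has already joined the cluster and again uses the GemFire 7-era package names; the article does not describe CRC's exact operational workflow, so treat this as illustrative only.

```java
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.control.RebalanceOperation;
import com.gemstone.gemfire.cache.control.RebalanceResults;

public class AddCapacity {
    public static void main(String[] args) throws Exception {
        // Obtain the Cache of a newly started member that has joined the distributed system.
        Cache cache = new CacheFactory().create();

        // Redistribute partitioned-region buckets so the new member takes its share of the data.
        RebalanceOperation op = cache.getResourceManager()
                .createRebalanceFactory()
                .start();

        RebalanceResults results = op.getResults(); // blocks until rebalancing finishes
        System.out.println("Buckets transferred: "
                + results.getTotalBucketTransfersCompleted());
    }
}
```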

Scaling the Reservation System

The group in charge of online railway reservations has seen massive, unexpected adoption of the online system year over year and projects growth of as much as 50% per year. Serving 5,700 train stations, the website has booked an average of 2.5 million tickets per day.

In the process, the infrastructure changed drastically. Seventy-two UNIX systems and a relational database were replaced with 10 primary and 10 backup x86 servers, a much more cost-effective model that holds 2 terabytes, or one month, of ticket data in memory.
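
A primary/backup layout like this maps naturally onto GemFire partitioned regions with redundant copies: each data bucket has a primary on one server and a backup on another, so losing a machine does not lose data or availability. The region name and value type below are hypothetical, and the configuration is a minimal sketch rather than CRC's actual setup.

```java
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;

public class TicketRegionServer {
    public static void main(String[] args) {
        // Each x86 server in the grid runs a cache member like this one.
        Cache cache = new CacheFactory().create();

        // PARTITION_REDUNDANT spreads buckets across all members and keeps one
        // redundant copy of each bucket on a different server, so every entry
        // lives in memory on two machines at once.
        Region<String, byte[]> tickets = cache
                .<String, byte[]>createRegionFactory(RegionShortcut.PARTITION_REDUNDANT)
                .create("SeatInventory");

        System.out.println("Hosting region " + tickets.getName()
                + " on member " + cache.getDistributedSystem().getDistributedMember());
    }
}
```

Across a 20-member cluster, primaries and their redundant copies end up spread over the fleet, which is how a month of ticket data can stay entirely in memory while still surviving the failure of individual servers.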

According to Dr. Zhu, “First, Pivotal GemFire offered proof in a realistic test environment. Then, the pilot was a success. Production saw severe, unexpected spikes in seasonal demand, and we took an iterative approach to deployment, overcoming a series of scale challenges. As seen in the most recent National Holiday for 2013, the system is operating with solid performance and uptime. Now, we have a reliable, economically sound production system that supports record volumes and has room to grow. This scale was achieved with 10-100 millisecond latency.” GemFire’s built-in high availability, redundancy, and failover mechanisms provide continuous uptime, and the product has exceeded all of CRC’s metrics in this area and helped them maintain their SLAs.

Holiday travel periods create peaks of 10 million tickets sold per day, 20 million passengers visiting the web site, 1.4 billion page views per day, and 40,000 visits per second, with up to 5 million of those tickets sold online per day at peak. Given that so many people rely on the system for travel, it is critical for the site to remain continuously available and to scale during peak times.


