GoGaRuCo '09 - Hypertable and Rails: DB Scaling Solutions with HyperRecord – Josh Tyler & Rusty Burchfield

April 19, 2009 Pivotal Labs

Intro

Hypertable and Rails: DB Scaling Solutions with HyperRecord

Links:
Hypertable
HyperRecord

Rusty is from Zvents, a local search engine

Presentation

Showing example of hourly data for the last month for a single event

GoGaRuCo '09 - Rusty Burchfield

An old benchmark showed over 1M rows inserted per second, sustained

Hypertable is an open-source implementation of Google’s Bigtable.

Hypertable is a Column-Oriented DBMS

Data Model
5-part key:
Row Key
Column Family
Column Qualifier
Timestamp
Revision

One index per table (on the row key)
Only stores strings
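
To make the data model above concrete, here is a minimal in-memory sketch in plain Ruby. It is not Hypertable's or HyperRecord's API: the `Cell` struct and `TinyTable` class are hypothetical names invented for illustration. It only shows a cell addressed by the five key parts, values coerced to strings, and lookups that go through the row key alone.

```ruby
# Hypothetical model of a cell with the 5-part key from the notes, plus a value.
# None of these names come from Hypertable or HyperRecord.
Cell = Struct.new(:row_key, :column_family, :column_qualifier,
                  :timestamp, :revision, :value)

class TinyTable
  def initialize
    @cells = []
  end

  # Everything is coerced to a string, mirroring "only stores strings".
  def insert(row_key, family, qualifier, value, timestamp: 0, revision: 0)
    @cells << Cell.new(row_key.to_s, family.to_s, qualifier.to_s,
                       timestamp, revision, value.to_s)
    self
  end

  # The only access path is the row key: an inclusive range scan,
  # returned in key order.
  def scan(from, to)
    @cells
      .select { |c| c.row_key.between?(from, to) }
      .sort_by { |c| [c.row_key, c.column_family, c.column_qualifier, c.timestamp, c.revision] }
  end
end

table = TinyTable.new
table.insert("page_2", "stats", "views", 7)
table.insert("page_1", "stats", "views", 3)
table.insert("page_3", "stats", "views", 9)
first_two = table.scan("page_1", "page_2")
```

Because the row key is the only index in this sketch, any query that is not a row-key lookup or range scan would have to walk every cell, which is why key design matters so much in this kind of store.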

Architecture
Master server – tracks range servers and where data is stored (a spare master is usually run as well, since the master is a single point of failure)
Range servers – data is broken up into ranges that are spread across individual range servers
Hyperspace – Handles locking and master recovery
HDFS – Stores redundant copies of data

ThriftBroker – an RPC wrapper for Hypertable, usable from many languages via Thrift

HyperRecord

HyperRecord is a subclass of ActiveRecord for Hypertable
Supported by the Hypertable

Example
Loading data into a simple pages app
Loading the first 10,000 articles of Wikipedia
150 MB of data infiled in 14 seconds
Loads all the data into a Rails scaffold and browses it

Design considerations
Denormalization – can’t do joins so you have to put your data in an appropriate format for querying. Can use MapReduce to interact with data.
Column families/qualifiers – You can store data in the key part of the key-value pair
Revisions – deletes are represented as inserted delete cells
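
The point about revisions can be illustrated with a small append-only store in plain Ruby. This is a sketch under stated assumptions, not how Hypertable is implemented: `VersionedStore` and its methods are hypothetical. It only demonstrates the idea that a delete is recorded as one more inserted cell (a delete marker) and that reads resolve to the latest revision.

```ruby
# Hypothetical append-only store: nothing is overwritten or removed in place.
class VersionedStore
  Entry = Struct.new(:key, :value, :revision, :deleted)

  def initialize
    @log = []
    @revision = 0
  end

  def put(key, value)
    append(key, value.to_s, false)
  end

  # A delete is just another inserted entry, flagged as a delete marker.
  def delete(key)
    append(key, nil, true)
  end

  # The latest revision wins; a delete marker hides all earlier values.
  def get(key)
    latest = history(key).max_by(&:revision)
    return nil if latest.nil? || latest.deleted
    latest.value
  end

  # Every entry ever written for the key, including delete markers.
  def history(key)
    @log.select { |entry| entry.key == key }
  end

  private

  def append(key, value, deleted)
    @revision += 1
    @log << Entry.new(key, value, @revision, deleted)
    self
  end
end

store = VersionedStore.new
store.put("title", "Draft")
store.put("title", "Final")
store.delete("title")
```

After these three calls `store.get("title")` returns `nil`, while `store.history("title")` still holds all three entries.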

Questions

Q: How do you break the data down by hour in the example?
A: It is broken down and aggregated in Ruby
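
As a rough idea of what breaking data down by hour in Ruby could look like, here is a minimal sketch. The record shape (a timestamp string and a count string) and the `aggregate_by_hour` helper are assumptions made for illustration. The talk did not show this code.

```ruby
require "time"

# Hypothetical input: pairs of [ISO 8601 timestamp string, count string].
# Both are strings because the store only holds strings.
def aggregate_by_hour(records)
  totals = Hash.new(0)
  records.each do |timestamp, count|
    # Truncate each timestamp to its UTC hour and use that as the bucket.
    hour = Time.parse(timestamp).getutc.strftime("%Y-%m-%d %H:00")
    totals[hour] += count.to_i
  end
  totals
end

records = [
  ["2009-04-19T10:15:00Z", "2"],
  ["2009-04-19T10:45:00Z", "3"],
  ["2009-04-19T11:05:00Z", "1"]
]
hourly = aggregate_by_hour(records)
```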

Q: It looks like the keys in that list were strings, not timestamps. Did you have to take the timestamp and convert it to a string yourself?
A: Pretty much
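
One common way to do that conversion is to format the time as a fixed-width string so that string order matches chronological order. The sketch below assumes a row-key layout of event id plus UTC hour. The layout and the `hourly_row_key` helper are hypothetical and are not taken from the talk.

```ruby
# Hypothetical row-key layout: "<event id>_<YYYYMMDDHH>" in UTC.
# Zero-padded, fixed-width fields make string order match time order.
def hourly_row_key(event_id, time)
  "#{event_id}_#{time.getutc.strftime('%Y%m%d%H')}"
end

nine_am = hourly_row_key("event42", Time.utc(2009, 4, 19, 9))
ten_am  = hourly_row_key("event42", Time.utc(2009, 4, 19, 10))
```

With keys shaped like this, a range scan over row keys for one event returns its hours in chronological order.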

Q: Did the Wikipedia articles contain any of the sub-data like images, links, etc.?
A: No, just a SQL dump as a demo of querying the database through a Rails scaffold

Q: Does Hypertable's select support SQL limits, order, etc.?
A: HQL supports a lot of things you’d expect from SQL, but it’s still somewhat limited.

Q: What do you do with it?
A: We store all of our log data and process it using Cascading to gather hourly data for all our pages. We then put it in Hypertable so we can query it quickly to generate reports.

Rusty:
Cascading is Java code
You can easily construct complicated MapReduce jobs using it

Josh:
Some other uses of Hypertable at Zvents
Changelog
We deal with a lot of user-created content; things change often, and we don’t always know what changed
We log everything that ever happens to our data. From uploaded images to deleted links to edited descriptions, we can see what changed, when, and how.

Zvents and Baidu are the primary sponsors of the Hypertable project. Hypertable and HyperRecord are both on Github.

Hypertable development started 2 years ago as a forward-looking solution to analytics problems.

The search problem for Zvents is many-dimensional: time, location, description, user data, and user behavior. Hypertable is a way to inform a lot of that data.

Q: What kind of problems are well suited to Hypertable?
A: We’re trying to move our entire site over. A canonical example for this kind of database is a crawl database.
A2: Anything where you have mountains and mountains of data and want to query over it.

Example of Crawl Database stored in Hypertable.
