Intro
Hypertable and Rails: DB Scaling Solutions with HyperRecord
Links:
Hypertable
HyperRecord
Rusty is from Zvents, a local search engine
Presentation
Showing an example of hourly data for the last month for a single event
An older benchmark sustained over 1M row inserts per second
Hypertable is an open-source implementation of Google’s BigTable.
Hypertable is a Column-Oriented DBMS
Data Model
5-part key:
Row Key
Column Family
Column Qualifier
Timestamp
Revision
One index per table (on the row key)
Only stores strings
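As a rough illustration of the five-part key above, the following Ruby sketch models a cell as a plain Struct and sorts cells by key. It is not Hypertable or HyperRecord code: the Cell struct, the sort_cells helper, and the newest-first ordering within a column are assumptions made for illustration. Values are kept as strings, in line with the note that only strings are stored.

```ruby
# Hypothetical model of a cell with a five-part key; not the Hypertable API.
Cell = Struct.new(:row_key, :column_family, :column_qualifier,
                  :timestamp, :revision, :value)

# Order cells by row key, then column family and qualifier, with the newest
# timestamp first (the newest-first ordering is an assumption of this sketch).
def sort_cells(cells)
  cells.sort_by do |c|
    [c.row_key, c.column_family, c.column_qualifier, -c.timestamp, -c.revision]
  end
end

cells = [
  Cell.new("page:2", "stats", "views", 100, 1, "7"),
  Cell.new("page:1", "stats", "views", 100, 1, "3"),
  Cell.new("page:1", "stats", "views", 200, 2, "5"),
]

sort_cells(cells).each do |c|
  puts [c.row_key, c.column_family, c.column_qualifier, c.timestamp, c.value].join("\t")
end
```

Because the only index is on the row key, the choice of row key decides which lookups and scans are cheap.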
Architecture
Master server – tracks range servers and where data is stored (a spare master is usually run as well, since the master is a single point of failure)
Range servers – each table is split into ranges of rows, and each range is served by a range server
Hyperspace – Handles locking and master recovery
HDFS – Stores redundant copies of data
ThriftBroker – An RPC interface to Hypertable, built on Thrift, that makes it usable from many languages
HyperRecord
HyperRecord is a subclass of ActiveRecord for Hypertable
Supported by the Hypertable
Example
Loading data into simple pages app
Loading the first 10,000 articles of Wikipedia
150MB of data infiled in 14 seconds
Loads all the data into a Rails scaffold and browses it
Design considerations
Denormalization – you can’t do joins, so you have to store your data in a format that suits the queries you need. You can use MapReduce to interact with the data.
Column families/qualifiers – the qualifier is part of the key, so you can store data in the key part of the key-value pair
Revisions – deletes are represented as inserted delete cells
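To make the revisions point concrete, here is a minimal Ruby sketch of how a read could resolve inserted delete cells. It is a simplified illustration, not Hypertable's implementation: the Entry struct, the delete flag, and the rule that the newest entry for a key decides whether the key is visible are all assumptions.

```ruby
# Hypothetical log entry: a normal cell or a delete marker (tombstone).
Entry = Struct.new(:key, :timestamp, :value, :delete)

# Return the visible value for each key: the newest entry wins, and a key
# whose newest entry is a delete marker is left out of the result.
def visible_cells(entries)
  entries.group_by(&:key).each_with_object({}) do |(key, versions), out|
    newest = versions.max_by(&:timestamp)
    out[key] = newest.value unless newest.delete
  end
end

log = [
  Entry.new("event:1", 1, "draft", false),
  Entry.new("event:1", 2, "published", false),
  Entry.new("event:2", 1, "spam", false),
  Entry.new("event:2", 3, nil, true), # the delete is appended, not applied in place
]

p visible_cells(log)
```

Nothing is overwritten in this model; older entries remain in the log and are simply hidden at read time.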
Questions
Q: How do you break the data down by hour in the example?
A: It is broken down and aggregated in Ruby
Q: It looks like the keys in that list were strings, not timestamps. Did you have to take the timestamp and convert it to a string yourself?
A: Pretty much
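A minimal Ruby sketch of that conversion and of hourly aggregation in plain Ruby. The row key layout (event id, underscore, UTC hour) and the helper names hourly_row_key and hourly_counts are hypothetical; the key format actually used in the talk is not recorded in these notes. Formatting the hour as a fixed-width, zero-padded string makes string order match chronological order, which matters when the only index is on the row key.

```ruby
# Build a string row key from an event id and a Time truncated to the hour.
# The "<id>_<YYYY-MM-DD HH>" layout is an assumption for illustration.
def hourly_row_key(event_id, time)
  "#{event_id}_#{time.getutc.strftime('%Y-%m-%d %H')}"
end

# Count hits per hourly row key; the aggregation happens in Ruby.
def hourly_counts(event_id, times)
  times.each_with_object(Hash.new(0)) do |t, counts|
    counts[hourly_row_key(event_id, t)] += 1
  end
end

# Made-up sample timestamps.
hits = [
  Time.utc(2009, 3, 1, 9, 15),
  Time.utc(2009, 3, 1, 9, 45),
  Time.utc(2009, 3, 1, 10, 5),
]

hourly_counts("event42", hits).sort.each { |key, n| puts "#{key}\t#{n}" }
```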
Q: Did the Wikipedia articles contain any of the sub-data like images, links, etc.?
A: No, just a SQL dump as a demo of querying the database through a Rails scaffold
Q: Does Hypertable’s select support SQL limits, ordering, etc.?
A: HQL supports a lot of things you’d expect from SQL, but it’s still somewhat limited.
Q: What do you do with it?
A: We store all of our log data and process it using Cascading to gather hourly data for all our pages. We then put it in Hypertable so we can query it quickly to generate reports.
Rusty:
Cascading is Java code
You can easily construct complicated MapReduce jobs using it
Josh:
Some other uses of Hypertable at Zvents
Changelog
We deal with a lot of user-created content; things change often, and we don’t always know what changed
We log every change to our data so that we can track it. From uploaded images to deleted links to edited descriptions, we can see what changed, when and how.
Zvents and Baidu are the primary sponsors of the Hypertable project. Hypertable and HyperRecord are both on GitHub.
Hypertable development started 2 years ago as a forward-looking solution to analytics problems.
The search problem for Zvents is many-dimensional (time, location, description, user data, and user behavior), and Hypertable is a way to inform a lot of that data.
Q: What kinds of problems are well suited to Hypertable?
A: We’re trying to move our entire site over. A canonical example for this kind of database is a crawl database.
A2: Anything where you have mountains and mountains of data and want to query over it.
Example of Crawl Database stored in Hypertable.