Daum Communications (Daum) is one of the leading providers of Korean-language online services, including the news and information portal Daum.net, web-based email service Hanmail.net, and the Daum Cafe online community. Headquartered on Jeju Island, the company provides mobile web services, search marketing, and electronic mapping. It also sells online advertising products through Daum.net. Daum is the second-largest web portal service provider in Korea in terms of daily visits and has operating centers in Seoul and on Jeju Island.
Through its extensive range of Internet services and sale of online advertising products, Daum generates vast amounts of unstructured data. The company has one of the largest Apache Hadoop clusters in Korea, and analyzes its data to gain critical competitive information in a number of areas, including user preferences and behavior, search rankings, and advertisement targeting.
Complex Environment Impedes Data Analysis
Facing intense domestic and global competition from a number of search engines that are growing market share across desktop and mobile searches, Daum’s businesses needed to make faster and better decisions to protect the company’s 20 percent share of the Korean search market.
The company needed to extract knowledge from its vast data stores in real time and act on it immediately. Daum was more interested in solving analytic problems than in exploring the relationships between data that traditional relational database systems offer. As a result, Daum was using Hadoop and the Hadoop Distributed File System (HDFS) to store data, alongside non-relational systems such as the Cassandra NoSQL database and the Storm stream processor, to provide greater speed in performing Big Data analytics on unstructured data. This solution landscape presented the company with serious challenges.
“Performing ad-hoc and multidimensional queries and analysis through Hadoop on our unstructured data proved difficult,” says Jun-Sik Eom, Team Manager, Data Technology Department, Daum Communications. “We were restricted in the speed of data analysis due to the batch processing of both unstructured and structured data, which meant we relied heavily on the capability of our developers. Data analysis of complex forms was also challenging in the NoSQL database.”
Because Daum’s data must be constantly reviewed, the company sought a solution that would enable employees to perform high-speed queries on the data residing in Hadoop. Additionally, Daum wanted to improve access through tools that were already familiar to developers and database administrators.
Pivotal Greenplum Enables High-Speed Analysis of Unstructured Data
Daum evaluated solutions that could address the limitations in the resource-intensive analysis required by Hadoop and the NoSQL database management systems. To meet the data analysis requirements for its search engine and Internet services businesses, the company selected Pivotal Greenplum, which connects to Hadoop and enables the co-processing of both structured and unstructured data within a single solution.
“We were attracted to Pivotal Greenplum because of the advantage it had in mixing the merits of database, data warehouse, and business intelligence,” says Eom. “We can now use a single platform to run high-speed analytic queries on our most appropriate data stores.”
Delivering New Business Insights from Real-Time Analysis
To support its efforts to gain market share, Daum is using Pivotal Greenplum to provide improved services and search accuracy to its users. Through real-time data gathering and analysis of Internet searches and user behavior within its various online services, the company can better predict future behavior and demand.
Daum can now run multiple queries, both in real time and over time as user patterns and knowledge emerge, thanks to Pivotal Greenplum's massively parallel processing (MPP) architecture, which enables fast data loading and high-speed queries on the data. In addition to performing real-time weblog analysis, the company can re-analyze data that has already been processed and gain meaningful results from these different interpretations. Pivotal helped Daum achieve an increased depth of knowledge, which is just as critical as breadth in terms of delivering services.
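The general idea behind an MPP query can be illustrated with a small sketch: rows are hash-distributed across segments, each segment computes a partial aggregate on its own slice, and a coordinator merges the partial results. This is only an illustration of the scatter-gather pattern, not Pivotal Greenplum's actual implementation. The segment count, the sample weblog rows, and all function names are assumptions made for the example, and threads stand in for separate segment servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # hypothetical number of segments, chosen for illustration

# Hypothetical weblog rows: (user_id, search_query)
SAMPLE_LOG = [
    ("u1", "weather"),
    ("u2", "news"),
    ("u3", "weather"),
    ("u1", "maps"),
    ("u4", "weather"),
]


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Hash-distribute rows across segments using the user id as the key."""
    segments = [[] for _ in range(num_segments)]
    for user_id, query in rows:
        # crc32 gives a hash that is stable across runs, unlike built-in hash()
        idx = zlib.crc32(user_id.encode("utf-8")) % num_segments
        segments[idx].append((user_id, query))
    return segments


def segment_count(segment_rows):
    """Partial aggregate: count queries within a single segment's rows."""
    return Counter(query for _, query in segment_rows)


def parallel_query_counts(rows, num_segments=NUM_SEGMENTS):
    """Scatter rows to segments, aggregate each slice, then merge the results."""
    segments = distribute(rows, num_segments)
    # Threads stand in for independent segment servers in this sketch.
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(segment_count, segments))
    total = Counter()
    for partial in partials:
        total.update(partial)  # coordinator step: add up the partial counts
    return total


if __name__ == "__main__":
    print(parallel_query_counts(SAMPLE_LOG).most_common(1))
```

Because each segment only scans its own slice, adding segments shrinks the work per segment, which is the intuition behind the fast loading and querying described above.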
Eliminating Roadblocks to Speedy Querying
Because ad-hoc queries on the data stored in NoSQL databases are performed from Pivotal Greenplum, administrators can use familiar SQL commands to carry out large-scale, multidimensional analysis. This reduces the company’s reliance on finding specialist NoSQL and Hadoop skill sets, and minimizes the workload for employees.
“One of the most important elements in effectively using Big Data is securing the right people,” says Eom. “We used to struggle with having the resources needed to perform queries, which greatly reduced our processing efficiency. Today, instead of performing queries on the NoSQL systems, we collect the data residing in Hadoop and NoSQL, and then save it in Pivotal Greenplum to execute the analysis.”
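The workflow Eom describes (collect records from Hadoop and NoSQL, save them in a relational store, then analyze them with SQL) can be sketched as follows. This is a minimal illustration, not Daum's actual pipeline: Python's built-in sqlite3 module stands in for Pivotal Greenplum, and the JSON records, the weblog table, and its column names are all hypothetical.

```python
import json
import sqlite3

# Hypothetical semi-structured events, as they might be exported from Hadoop or NoSQL
RAW_EVENTS = [
    '{"user": "u1", "service": "search", "device": "mobile", "query": "weather"}',
    '{"user": "u2", "service": "search", "device": "desktop", "query": "news"}',
    '{"user": "u1", "service": "search", "device": "mobile", "query": "maps"}',
    '{"user": "u3", "service": "cafe", "device": "mobile"}',
    '{"user": "u2", "service": "cafe", "device": "mobile"}',
]


def load_events(conn, raw_events):
    """Parse JSON lines and save them in a relational table (assumed schema)."""
    conn.execute(
        "CREATE TABLE weblog (user_id TEXT, service TEXT, device TEXT, query TEXT)"
    )
    rows = []
    for line in raw_events:
        event = json.loads(line)
        # Events without a search query store NULL in the query column
        rows.append(
            (event["user"], event["service"], event["device"], event.get("query"))
        )
    conn.executemany("INSERT INTO weblog VALUES (?, ?, ?, ?)", rows)


def visits_by_service_and_device(conn):
    """Ad-hoc multidimensional query: visits and distinct users per service and device."""
    return conn.execute(
        "SELECT service, device, COUNT(*) AS visits, "
        "COUNT(DISTINCT user_id) AS users "
        "FROM weblog GROUP BY service, device ORDER BY service, device"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_events(connection, RAW_EVENTS)
    for row in visits_by_service_and_device(connection):
        print(row)
```

Once the records sit in a relational table, a new question only requires changing the GROUP BY columns in the SQL statement rather than writing a new batch job.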
Enabling Continuous Processing While Reducing Costs
Because Pivotal Greenplum is available as a software-only distribution, Daum can run the data warehouse on any of its existing x86 servers running Hadoop. This ensures scalability while eliminating the need for Daum to purchase new data center infrastructure. Pivotal Greenplum uses gNet for Hadoop, a parallel communications transport, to access the Hadoop cluster and query the data efficiently using Hadoop servers rather than those running Pivotal Greenplum.
“By using our existing x86 servers, we were able to reduce expenditures and expand capacity through linear scalability,” Eom explains. “We have continuous processing across Pivotal Greenplum and Hadoop nodes. As the data increases, we can conveniently expand our capacity just by adding standard x86 servers.”