The gap between fantasy and reality remains wide when it comes to what enterprises hope to achieve with Big Data analytics. Bridging this gap requires that organizations follow some best practices when implementing Big Data analytics tools and account for some of the shortcomings of Hadoop. Using Symantec's forthcoming Veritas Cluster File System Connector for Hadoop, they can implement Hadoop and gain the Big Data analytics benefits it provides, coupled with the stability and reliability of the Veritas Cluster File System.
A recent survey conducted by McKinsey & Company found that in 51% of the nearly 1,500 companies that responded, Big Data analytics was one of their top 10 corporate priorities, with nearly 20% of that total naming Big Data analytics as their #1 priority. Further cementing this heightened level of corporate interest in Big Data analytics, a separate survey of AIIM professionals found that 70% of those surveyed believed there is a Big Data “killer application” that would be “very useful” or even “spectacular” for their business.
These are heady numbers, and they reflect high expectations for what Big Data analytics will deliver. They also help to explain much of the hype that currently surrounds Big Data analytics within enterprises. After all, the prospect that an enterprise might hit a home run from a business perspective, be it in terms of achieving new cost efficiencies or opening up new revenue opportunities, explains why so many of them currently put the implementation of Big Data analytics high on their list of priorities.
However, many fail to account for the practical realities of mining data that is rarely accessed or resides in siloed data stores. That same AIIM survey found that 70% of those surveyed believed it was “harder” or “much harder” to search for information on their internal systems than to search for information on the web. This reveals the current disconnect between what companies hope to achieve with Big Data analytics and what they will realistically achieve if they actually implement it.
Bridging that gap requires that they adopt best practices before implementing a Big Data analytics solution. Three preliminary steps they should take to make sure expectations and reality properly align include:
- Identify the specific business objective to be achieved. Organizations should clearly establish what they want to accomplish with their implementation of a Big Data analytics solution. Is it identifying potential areas to reduce costs or uncovering potential revenue opportunities? Clearly stating that objective will clarify what specific data stores should be accessed and mined for information.
- Establish the output that the Big Data analytics tool should provide in support of that objective. Accessing the data is only beneficial if an enterprise knows what specific information within the data store needs to be retrieved and analyzed.
For instance, an organization may be interested in increasing cross-selling opportunities within a business unit. To understand what factors drive a customer to buy related or additional items, it may want to retrieve the purchase history profiles of its target customers. Attacking the problem this way ensures that actionable, insightful and relevant information is retrieved, as opposed to a pile of data that has no clear correlation to the stated business objective. (A simple sketch of what such an analysis might look like as a Hadoop job follows this list.)
- Use Hadoop as the tool to mine these data stores. Hadoop provides enterprises with a cost-effective Big Data analytics tool to analyze their various data stores and extract the information they need. Additionally, it gives them the ability to easily scale up their data mining efforts should they decide to do so.
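To make the cross-selling example above concrete, here is a minimal sketch of what such a Hadoop MapReduce job might look like. It is an illustration only, not code from Symantec or any other vendor, and it assumes a hypothetical input layout: purchase-history records exported as comma-separated text (customer ID, product category, amount). The job simply counts how often each category appears, a first step toward spotting which categories to examine for cross-sell patterns.

```java
// Hypothetical sketch: count purchases per product category from CSV records.
// Assumes records of the form "customerId,category,amount", one per line.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CategoryCount {

    // Emits (category, 1) for every purchase record it reads.
    public static class PurchaseMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text category = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 2) {               // customerId,category,amount
                category.set(fields[1].trim());
                context.write(category, ONE);
            }
        }
    }

    // Sums the counts emitted for each category.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "category count");
        job.setJarByClass(CategoryCount.class);
        job.setMapperClass(PurchaseMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // purchase history export
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```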
The current hiccup in this approach is that Hadoop is not exactly enterprise-friendly to implement. An underlying component upon which Hadoop relies to do its analytics, the Hadoop Distributed File System (HDFS), is tightly integrated with the Hadoop application itself. As such, when organizations look to use Hadoop they currently must bring HDFS along for the ride.
This is where some of the limitations of HDFS in enterprise environments are exposed. For example, HDFS tightly couples servers and storage. This impedes the ability of enterprises to scale compute performance and storage capacity independently, as both must generally be added together regardless of how efficiently HDFS uses them.
Another point of contention is how HDFS manages data placement. By default, HDFS makes three copies of each block of data across different nodes in a Hadoop cluster to protect against data loss. While this approach made sense in the open source, commodity-hardware environments in which Hadoop originated, it is inefficient on enterprise hardware that typically provides its own data protection.
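For context, that three-copy behavior is HDFS's default replication factor, which administrators can inspect and adjust through standard Hadoop settings and APIs. The snippet below is a minimal sketch of where that knob lives; the namenode address and file path are hypothetical.

```java
// Minimal sketch: inspecting and adjusting HDFS replication with standard Hadoop APIs.
// The namenode address and file path below are hypothetical.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS writes three copies of every block by default; lowering this for
        // new files trades redundancy for capacity on hardware that already
        // protects the data itself.
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

        // The replication factor can also be changed per file after the fact.
        Path report = new Path("/data/purchase-history/part-00000");
        short current = fs.getFileStatus(report).getReplication();
        System.out.println("Current replication factor: " + current);
        fs.setReplication(report, (short) 2);

        fs.close();
    }
}
```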
Yet another data center concern is protecting, maintaining and supporting HDFS once it is in production. HDFS is largely a collaborative, open source effort designed specifically to host Hadoop data, so it is optimized for performance and low cost. However, once HDFS is used in production, new considerations come into play, such as:
- What software patches or firmware upgrades may safely be applied to the application, file system and/or underlying hardware?
- Who provides those software upgrades or firmware patches, and how trustworthy are they?
- What will be the impact of these upgrades or patches on the environment?
Having answers to these and other questions is of the utmost concern for the data center IT administrators responsible for maintaining and supporting Hadoop once it is in production.
So to accelerate Hadoop's adoption into production environments, Symantec has done something very clever: beginning this fall it will offer a Connector that enables Hadoop to communicate directly with the Symantec Veritas Cluster File System. This eliminates the need for organizations to implement HDFS, as the Connector takes the place of HDFS by presenting the Veritas Cluster File System to Hadoop through an HDFS interface.
Implemented this way, Hadoop still thinks it is using HDFS. The upside of this approach is that enterprise IT administrators get a known and supported file system on which to host Hadoop. As such, they may now use Hadoop in production to do Big Data analytics with the knowledge and confidence that they can support it.
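The reason such a substitution is even possible is that Hadoop reaches storage through a pluggable FileSystem abstraction rather than calling HDFS directly: the concrete implementation is chosen by the configured URI scheme. The sketch below illustrates that abstraction in general terms only; the "vxcfs" scheme and connector class name are hypothetical placeholders, not Symantec's actual product interfaces.

```java
// Sketch of Hadoop's pluggable FileSystem abstraction. Application and MapReduce
// code only ever talks to the FileSystem interface; configuration decides whether
// that resolves to HDFS or to an HDFS-compatible connector.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PluggableFileSystemExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // With stock Hadoop the default file system is HDFS...
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // "fs.default.name" in older releases

        // ...but a connector can register its own scheme and implementation class,
        // and none of the code below would change. (Hypothetical names only.)
        // conf.set("fs.defaultFS", "vxcfs://cluster1/");
        // conf.set("fs.vxcfs.impl", "com.example.hadoop.VxCfsFileSystem");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Backing file system: " + fs.getUri());

        // List a (hypothetical) data directory through the same interface,
        // regardless of which file system actually serves it.
        for (FileStatus status : fs.listStatus(new Path("/data/purchase-history"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```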
Enterprises are justifiably excited about how Big Data analytics can better their business in multiple ways. However, realizing the benefits of Big Data analytics requires that they first have a well-thought-out plan in place and then the right technologies to execute on that plan.
Hadoop has emerged as the right technology to cost-effectively make Big Data analytics a reality for enterprises, but it still needs a little back-end help in order to be enterprise ready. The forthcoming Veritas Cluster File System Connector from Symantec gives Hadoop the enterprise characteristics it needs, providing the underlying infrastructure that enterprises need to confidently, cost-effectively and efficiently manage it once it is in production.