Data driven organisations that need to process large and varied datasets often opt for Apache Hadoop as a potential tool due to its ability to process, store, and manage vast amounts of structured, unstructured or semi-structured data. Apache Hadoop is a distributed data storage and processing platform with three core components: the HDFS distributed file system, the MapReduce distributed processing engine running on top, and YARN (Yet Another Resource Negotiator), which enables the HDFS file system to run mixed workloads.
Hadoop incorporates parallel processing techniques to distribute processing across multiple nodes for rapidity. It can also process data where it is stored, rather than having to transport data across a network, which slows down response times.
While most data professionals are well versed with the ability of Apache Hadoop to manage and store huge datasets, the platform has multiple other benefits which aren’t always as obvious.
The surge in data creation and collection are often quoted as bottlenecks for Big Data analysis. Despite this, Big Data is most useful when it is analysed in real time, or as close to real time as possible. Many organisations face a challenge in keeping data on a platform which gives them a single consistent view. Hadoop clusters provide a highly scalable storage platform, because it can store and distribute sizeable datasets across hundreds of inexpensive servers that operate in parallel. Moreover, it is possible to scale the cluster by adding additional nodes. This means that Hadoop enables businesses to run applications on thousands of nodes involving many thousands of terabytes of data.
Hadoop clusters prove a very cost-effective solution for expanding datasets. The problem with traditional relational database management systems is that they are expensive to scale in order to adequately process these massive volumes of data. In an effort to reduce costs, many companies in the past would have had to down-sample data and classify it based on certain criteria to determine which data was the most valuable. The raw data would be deleted, as it was too costly to keep. While this approach may have worked in the short term, the datasets were no longer available to the business when its priorities changed. Hadoop, on the other hand, is designed as a scale-out architecture that can affordably store all of a company’s data for later use. The cost savings are staggering, instead of costing thousands of pounds per terabyte, Hadoop offers computing and storage capability for hundreds of pounds per terabyte.
Availability and resilience
Hadoop is highly fault tolerant platform. Hadoop’s basic storage structure is a distributed file system called HDFS (Hadoop Distributed File System). HDFS automatically makes 3 copies of the entire file across three separate computer nodes within the Hadoop cluster (a node is an Intel server). If a node goes offline, HDFS has two other copies. This is similar to RAID configurations. What is different is that the entire file is duplicated, not segments of a hard disk, and the duplications occur across computer nodes versus multiple hard disks.
The High Availability (HA) configuration feature of Hadoop cluster also protects it during planned and unplanned downtime. Failovers can be deliberately triggered for maintenance or are automatically triggered in the event of failures or unresponsive service. HA in a Hadoop cluster can protect against the single-point failure of a master node (the NameNode or JobTracker or ResourceManager).
Hadoop’s built-in failure protection, combined with its use of commodity hardware, makes it a very attractive proposition. Hadoop lets the business store and capture multiple data types, including documents, images, and video, and make them readily available for processing and analysis. The flexible nature of a Hadoop system enables organisations to expand and adjust their data analysis operations as the business expands. Businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations or clickstream data. In addition, Hadoop can be used for a variety of other purposes, such as log processing, recommendation systems, data warehousing, marketing campaign analysis, and fraud detection. With the support and enthusiasm of Hadoop’s open source community Big Data analysis is becoming readily accessible to everyone.
Hadoop is based on two primary technologies, HDFS for storing data, and MapReduce for processing the data in parallel. Everything else in Hadoop, including its ecosystem, sits atop these foundation blocks. Hadoop’s storage method is based on a distributed file system that maps data wherever it is located on a cluster. The tools used for data processing, such as MapReduce programming, are also generally located on the same servers, allowing for faster processing of data. If you’re dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours.
WHISHWORKS is a Hortonworks Gold Partner, an Authorised Reseller and Certified Consulting Partner for MapR and Cloudera Silver Partner.
If you would like to find out more about how Big Data could help you make the most out of your current infrastructure while enabling you to open your digital horizons, do give us a call at +44 (0)203 475 7980 or email us at email@example.com.
Other useful links: