Wednesday, November 23, 2016

Big Data Technology Explained- Part 2 (The Hadoop Ecosystem)

As I talked about in the last post, Hadoop was really the engine that started to drive the application of big data roughly ten years ago.  But since then, there has been incredible growth within the Hadoop ecosystem to address many of the challenges that commercial organizations identified as they started to adopt Hadoop.

Of the 20+ Apache projects that have come to life as part of the ecosystem over the years, I will just mention a few of the larger ones and their significance in the grand scheme of things.  Again, this is not meant to be "developer 101", but instead, business person 101.  My goal here is to help business people know enough to ask the right kinds of questions of both their internal and external technology partners before they embark on a big data adventure.

HBase:
The first ecosystem component to talk about is Apache HBase.  HBase is what is called a NoSQL, column-oriented database, which, to distill it all down, really just means you can read and write data to it very fast for time-sensitive operations, without needing to go through SQL queries to get results.  One of the areas in the commercial space where HBase is commonly used is as the "database" for lookups on websites.  So, if I am on eBay searching for specific kinds of products, the search I do may go behind the scenes to something like HBase to do a fast lookup to see if that product is available.  Or if I am on Etsy, all the clicks I make during my visit could be captured and then stored in something like HBase for immediate access by operations teams.  So when you hear the term HBase, think super fast reading and writing to a data store that sits right on top of Hadoop (HDFS).
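
To make that a little more concrete, here is a minimal sketch of a key-based lookup against HBase from Python, using the happybase client library (which talks to HBase through its Thrift gateway).  The host name, table name, and column names are made-up placeholders, not anything standard.

    # A sketch of a fast write and lookup against HBase via happybase.
    # Host, table, and column names below are hypothetical placeholders.
    import happybase

    connection = happybase.Connection('hbase-thrift-host')   # assumed Thrift server
    table = connection.table('product_availability')         # hypothetical table

    # Write: record that a product is in stock, keyed by product id.
    table.put(b'product-12345', {b'info:status': b'available',
                                 b'info:quantity': b'42'})

    # Read: a single-row lookup by key returns in milliseconds,
    # which is what makes HBase a good fit for website lookups.
    row = table.row(b'product-12345')
    print(row[b'info:status'])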

Hive (LLAP) and Impala:
Two other major projects that were developed to help speed the adoption of Hadoop in general were Apache Hive and Impala.  Both are what are referred to as "SQL engines" on top of HDFS.  As is the case with just about every new technology, making it easy for users to interact in a way they are already familiar with is critical to driving adoption.  SQL happens to be a query language that is pretty much universally accepted, used by millions of analysts around the world with relational databases.  So in order to really drive the adoption of Hadoop, it made sense to build tools that could leverage those skill sets while still delivering the value, power, speed and low cost of the Hadoop backend.  That is where Hive and Impala come in.  They both allow folks with traditional SQL skills to continue doing what they do well, while leveraging the goodness of Hadoop behind the scenes.  Organizations may use Hive or Impala to run ad hoc queries or large scale summarizations across multiple, very large sets of data.
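
As a rough illustration, here is a sketch of the kind of familiar SQL an analyst might submit to Hive from Python using the PyHive library.  The host, table, and column names are hypothetical, and the same style of query could just as easily be pointed at Impala.

    # A sketch of running ordinary SQL against Hive, with the heavy lifting
    # happening over files in HDFS.  Names below are hypothetical.
    from pyhive import hive

    conn = hive.Connection(host='hive-server-host', port=10000)  # assumed HiveServer2
    cursor = conn.cursor()

    # Standard SQL, but the work runs across very large data sets in HDFS.
    cursor.execute("""
        SELECT region, SUM(sale_amount) AS total_sales
        FROM web_sales
        WHERE sale_date >= '2016-01-01'
        GROUP BY region
    """)
    for region, total in cursor.fetchall():
        print(region, total)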

Spark:
Over the last few years, all of these projects started to solve the problems that companies had with the original Hadoop components of HDFS and MapReduce.  But one question kept coming up over and over again as computing, and memory in particular, kept getting cheaper: how can I make my processing go faster?  This is where Spark came in and became the new hot tool and darling of the big data community.  Spark was developed at UC Berkeley's AMPLab and has become one of the fastest growing open source projects in history.  The big reason is that it was able to speed up processing by orders of magnitude by keeping data in memory for fast and easy access.  And many organizations that have data at the heart of their business needed this speed to make faster decisions.  Spark has also become popular because it is much more developer friendly and puts the same core engine to multiple uses: Spark for batch processing, Spark Streaming for real time data movement, or Spark for graph processing.  So it is a tool that is incredibly flexible across a myriad of use cases.
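
Here is a minimal PySpark sketch of the kind of batch job Spark speeds up by keeping data in memory.  The HDFS path and the layout of the click records are hypothetical.

    # A sketch of a Spark batch job: load data once, cache it in memory,
    # and reuse it across several computations without re-reading from disk.
    from pyspark import SparkContext

    sc = SparkContext(appName="clickstream-summary")

    clicks = sc.textFile("hdfs:///data/clickstream/2016/11/*")  # assumed path
    clicks.cache()   # keep the data in memory after the first read

    total_clicks = clicks.count()
    clicks_per_page = (clicks.map(lambda line: (line.split(",")[1], 1))  # assumed CSV layout
                             .reduceByKey(lambda a, b: a + b))

    print(total_clicks)
    print(clicks_per_page.take(10))

    sc.stop()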

Storm, Flink, Apex:
With the original processing framework of MapReduce, and then even with the addition of other tools like Hive and Impala, there was always a missing piece to the puzzle: real time streaming analysis over large sets of data.  While those other tools did a great job running batch analysis on large sets of data, they were not built to do analysis on real time, streaming data.  A great example of real time streaming data would be something like Twitter: tons of data, coming in real time and needing to be analyzed in real time to make decisions.  This is just one of many use cases, and there are dozens of others, in every industry, where real time streaming analytics is becoming more and more popular and valuable.  Apache Storm was the first real project built to address this real time streaming analytics need; it was created at BackType and open sourced by Twitter after Twitter acquired the company.  Apache Apex and Apache Flink are two other real time streaming projects that have gained steam lately, along with Spark Streaming.
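
To show the flavor of streaming analysis, here is a minimal Spark Streaming sketch in the spirit of the Twitter example above.  The "stream" here is just a network socket on a hypothetical host and port; a real pipeline would typically read from a message system instead.

    # A sketch of analyzing data as it arrives: count hashtags in
    # 5-second batches of incoming text.  Source host/port are hypothetical.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="hashtag-counts")
    ssc = StreamingContext(sc, batchDuration=5)   # process the stream in 5-second batches

    tweets = ssc.socketTextStream("stream-host", 9999)   # assumed source
    hashtag_counts = (tweets.flatMap(lambda text: text.split())
                            .filter(lambda word: word.startswith("#"))
                            .map(lambda tag: (tag, 1))
                            .reduceByKey(lambda a, b: a + b))
    hashtag_counts.pprint()   # print each batch's counts as they are computed

    ssc.start()
    ssc.awaitTermination()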


YARN:
As many of these projects became a core part of any large scale big data effort, there arose a need to better manage the cluster resources executing all of the queries and processes people wanted to run.  That is where YARN comes in.  It was developed to be the layer that sits on top of HDFS and manages cluster resources, handing them out to engines like MapReduce, Hive and Spark.  Kind of like the operating system for the Hadoop cluster.  Not something that business people are really going to care about day to day, but it is still good to know what YARN is and where it fits in the picture.
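
For the curious, here is a sketch of what "YARN manages the resources" looks like in practice: when a Spark job is pointed at YARN, it is YARN that decides where the job's workers run and how much memory and CPU they get.  The configuration values and file path are illustrative only, not recommendations.

    # A sketch of a Spark job asking YARN for its cluster resources.
    # Settings and path below are hypothetical examples.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sales-summary")
             .master("yarn")                            # ask YARN for cluster resources
             .config("spark.executor.instances", "4")   # YARN grants these containers
             .config("spark.executor.memory", "4g")
             .getOrCreate())

    df = spark.read.csv("hdfs:///data/sales/*.csv", header=True)  # assumed path
    df.groupBy("region").count().show()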

Ranger and Knox:
No list of tools would be complete without talking about the security of the Hadoop stack.  With all of the concern around data privacy and security, over the last few years the open source community has ramped up the work on tools that make it easier to lock down the data held within Hadoop.  That is what Ranger and Knox were developed to be: Ranger manages the authorization policies that determine which people or systems, with which privileges, can access which data in Hadoop, while Knox acts as the secure gateway that controls access into the cluster from the outside.  For many commercial organizations, this has been the hurdle that needed to be cleared in order to adopt Hadoop and start deriving real business value.  It just flat out needed to be more secure.
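
As a rough picture of what going through that security layer can look like: instead of hitting HDFS directly, a client calls the Knox gateway, which authenticates the user before Ranger's policies decide whether the request is allowed.  The host, gateway topology name, credentials, and path in this sketch are all hypothetical.

    # A sketch of listing an HDFS directory through the Knox gateway,
    # using the standard WebHDFS operation.  All names are hypothetical.
    import requests

    KNOX_URL = "https://knox-gateway-host:8443/gateway/default/webhdfs/v1"

    response = requests.get(
        KNOX_URL + "/data/sales",
        params={"op": "LISTSTATUS"},            # standard WebHDFS directory listing
        auth=("analyst_jane", "her_password"),  # Knox checks who you are
        verify=False)                           # demo only; use real certificates in practice

    # If the policies do not grant this user access to /data/sales,
    # the request comes back denied instead of returning the file listing.
    print(response.status_code)
    print(response.json() if response.ok else response.text)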

Alright, so that is a good review of some of the core technologies that make up the Hadoop ecosystem and where they fit.  As I mentioned in a previous post, Hadoop is not the only ecosystem of technologies used in the big data space.  There are numerous other open source and proprietary frameworks and tools being used to augment Hadoop.  We will talk about a few of those next time.
