Wednesday, November 23, 2016

Big Data Technology Explained- Part 2 (The Hadoop Ecosystem)

As I talked about in the last post, Hadoop was really the engine that started to drive the application of big data 10 years ago now.  But since then, there has been incredible growth within the Hadoop ecosystem to address many of the challenges that were identified by commercial organizations as they started to adopt Hadoop.

Of the 20+ Apache projects that have come to life as part of the ecosystem over the years, I will just mention a few of the larger ones and their significance in the grand scheme of things.  Again, this is not meant to be "developer 101", but instead, business person 101.  My goal here is to help business people know enough to ask the right kinds of questions of both their internal and external technology partners before they embark on a big data adventure.

HBase:
The first ecosystem component to talk about is Apache HBase.  HBase is what is called a NoSQL columnar database, which, to distill it all down, just means you can read and write data very fast for time-sensitive operations, looking records up by key rather than relying only on SQL queries.  One of the areas in the commercial space where HBase is commonly used is as the "database" for lookups on websites.  So, if I am on eBay searching for specific kinds of products, the search may go behind the scenes to something like HBase to do a fast lookup to see if that product is available.  Or if I am on Etsy, all the clicks I make on the site during my visit could be tracked and stored in something like HBase for immediate access by operations teams.  So when you hear the term HBase, think super fast reading and writing to a data store that sits right on top of Hadoop (HDFS).
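If you are curious what that kind of lookup actually looks like, here is a minimal sketch using happybase, one common Python client for HBase.  The host, table and column names are made up for illustration, and a real cluster would need the HBase Thrift server running for this to work.

```python
# A minimal sketch of the fast, key-based lookup pattern described above,
# using the happybase Python client for HBase. The host, table and column
# names are hypothetical.
import happybase

connection = happybase.Connection('hbase-host.example.com')  # placeholder host
table = connection.table('product_catalog')                  # hypothetical table

# Write (put) a row keyed by product SKU -- HBase stores values as raw bytes.
table.put(b'sku-12345', {b'info:name': b'vintage desk lamp',
                         b'info:in_stock': b'true'})

# Read (get) it back by key -- this single-row lookup is the "super fast
# read" a website search or clickstream application relies on.
row = table.row(b'sku-12345')
print(row[b'info:name'], row[b'info:in_stock'])
```

The thing to notice is that everything is keyed: you write and read whole rows by a single key, which is what makes those website lookups so fast.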

Hive (LLAP) and Impala:
Two other major projects that were developed to help speed the adoption of Hadoop in general were Apache Hive and Impala.  Both are what is referred to as "SQL engines" on top of HDFS.  As is the case with just about every new technology, making it easy for users to interact in a way they are already familiar with is critical to driving adoption.  SQL happens to be a query language that is pretty much universally accepted and used by millions of analysts around the world who work with relational databases.  So in order to really drive the adoption of Hadoop, it made sense to build tools that could leverage those skill sets but still get the value, power, speed and low cost of the Hadoop back end.  That is where Hive and Impala come in.  They both allow folks with those traditional SQL skills to continue doing what they do well, while leveraging the goodness of Hadoop behind the scenes.  Organizations may use Hive or Impala to run ad hoc queries or large-scale summarizations across multiple, very large sets of data.
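To make that concrete, here is a small sketch of what an analyst's query might look like when pointed at Hive instead of a traditional database, using the PyHive client from Python.  The host name, table and columns are hypothetical; Impala exposes a very similar SQL interface.

```python
# A sketch of a familiar SQL query running against Hive instead of a
# traditional relational database, via the PyHive client. Host, table and
# column names are placeholders for illustration.
from pyhive import hive

conn = hive.connect(host='hive-server.example.com', port=10000)
cursor = conn.cursor()

# Standard SQL, but executed by Hive across files sitting in HDFS.
cursor.execute("""
    SELECT region, SUM(order_total) AS revenue
    FROM web_orders
    GROUP BY region
    ORDER BY revenue DESC
""")
for region, revenue in cursor.fetchall():
    print(region, revenue)
```

The SQL itself is nothing new, which is exactly the point: the analyst keeps their existing skills while Hive does the heavy lifting across files sitting in HDFS.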

Spark:
Over the last several years, these projects started to solve the problems that companies had with the original Hadoop components of HDFS and MapReduce.  But one question kept coming up over and over again as computing, and specifically memory, got cheaper: how can I make my processing go faster?  This is where Spark came in and became the new hot tool and darling of the big data community.  Spark was developed at UC Berkeley's AMPLab and has become one of the fastest growing open source projects in history.  The big reason is that it was able to speed up processing by orders of magnitude, largely by keeping data in memory for fast and easy access.  And for many organizations that now put data at the heart of their business, that speed is what lets them make faster decisions.  Spark has also become popular because it is much friendlier for developers and offers multiple uses of the same core engine: Spark for batch processing, Spark Streaming for near-real-time data, and GraphX for graph processing.  So it is a tool that is incredibly flexible for a myriad of use cases.
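Here is a minimal PySpark sketch of that "keep the data in memory" idea.  The file path and column names are hypothetical.

```python
# A minimal PySpark sketch of the in-memory caching idea described above.
# The HDFS path and column names are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-example").getOrCreate()

clicks = spark.read.json("hdfs:///data/clickstream/2016/11/")  # hypothetical path
clicks.cache()  # keep the dataset in cluster memory after the first pass

# Because the data is cached, the second question we ask of it comes back
# far faster than re-reading everything from disk each time.
clicks.groupBy("page").count().show()
clicks.groupBy("country").count().show()

spark.stop()
```

The first question pays the cost of reading the data; the follow-up questions run against the cached copy in memory, which is where the big speedups come from.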

Storm, Flink, Apex:
With the original processing framework of MapReduce, and then even the addition of other tools like Hive, Impala, etc., there was always a missing piece to the puzzle: real-time streaming analysis over large sets of data.  While those other tools did a great job running batch analysis on large sets of data, they were not built to do analysis on real-time, streaming data.  A great example of real-time streaming data would be something like Twitter: tons of data, coming in real time and needing to be analyzed in real time to make decisions.  This is just one of many use cases; there are dozens of others, in every industry, where real-time streaming analytics is becoming more and more popular and valuable.  Apache Storm was the first real project built to address this need; it was originally created at BackType and open sourced by Twitter after Twitter acquired the company.  Apache Apex and Apache Flink are two other real-time streaming projects that have gained steam lately, along with Spark Streaming.
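To give a flavor of what streaming code looks like, here is a small sketch using Spark Streaming, one of the engines mentioned above.  It counts hashtags in lines of text arriving over a network socket; the host and port are placeholders standing in for a real feed like Twitter's.

```python
# A sketch of the streaming idea using Spark Streaming: count hashtags in
# lines of text arriving on a network socket. The host and port are
# placeholders for a real data feed.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="hashtag-counts")
ssc = StreamingContext(sc, batchDuration=10)  # process the stream in 10-second slices

lines = ssc.socketTextStream("stream-host.example.com", 9999)
hashtags = (lines.flatMap(lambda line: line.split())
                 .filter(lambda word: word.startswith("#"))
                 .map(lambda tag: (tag, 1))
                 .reduceByKey(lambda a, b: a + b))
hashtags.pprint()  # print each batch's counts as they are computed

ssc.start()
ssc.awaitTermination()
```

Instead of running once over a fixed data set, the job stays up and produces fresh counts every ten seconds as new data arrives.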


YARN:
As many of these projects became a core part of any large-scale big data deployment, there was a growing need to better manage the cluster resources executing all of the queries and processes people wanted to run.  That is where YARN (Yet Another Resource Negotiator) comes in.  It was developed as a layer on top of HDFS to manage those resources effectively, kind of like the operating system for a Hadoop cluster.  Not something that business people are really going to care about day to day, but still good to know what YARN is and where it fits in the picture.
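For the curious, here is roughly what "asking YARN for resources" looks like when a Spark job is submitted to a cluster; the numbers are purely illustrative.

```python
# A small illustration of what "managing resources" means in practice: when
# a Spark job is submitted to a cluster running YARN, it declares how much
# memory and CPU it wants, and YARN decides where those containers run.
# The settings below are illustrative, not a recommendation.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("nightly-report")
         .master("yarn")                            # let YARN schedule the work
         .config("spark.executor.instances", "4")   # ask for 4 worker containers
         .config("spark.executor.memory", "4g")     # ...each with 4 GB of memory
         .config("spark.executor.cores", "2")       # ...and 2 CPU cores
         .getOrCreate())
```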

Ranger and Knox:
No list of tools would be complete without talking about the security of the Hadoop stack.  With all of the concern around data privacy and security, the open source community has ramped up its work over the last few years on tools that make it easier to lock down the data held within Hadoop.  That is what Ranger and Knox were developed for: Knox acts as a secure gateway into the cluster, while Ranger centralizes the authorization policies and auditing that ensure only the right people or systems, with the right kinds of privileges, can access data in Hadoop.  For many commercial organizations, this has been the hurdle that needed to be cleared in order to adopt Hadoop and start deriving real business value.  It just flat out needed to be more secure.
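As a rough illustration of the kind of rule Ranger manages, here is a sketch of a policy that says "only the analysts group may read the sales database."  In practice most teams define these policies in the Ranger admin UI; the REST endpoint, field names and credentials below are approximations for illustration only.

```python
# An illustrative sketch of a Ranger-style authorization policy. The
# endpoint, service name, field names and credentials are assumptions made
# for illustration; real deployments usually manage this in the admin UI.
import requests

policy = {
    "service": "hadoop_hive",            # hypothetical Ranger service name
    "name": "analysts-can-read-sales",
    "resources": {
        "database": {"values": ["sales"]},
        "table":    {"values": ["*"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "groups":   ["analysts"],
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

requests.post(
    "http://ranger-admin.example.com:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin-password"),    # placeholder credentials
)
```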

Alright, so that is a good review of some of the core technologies that make up the Hadoop ecosystem and where they fit.  As I mentioned in a previous post, Hadoop is not the only ecosystem of technologies that is used in the big data space.  There are numerous other open source and proprietary frameworks and tools that are being used to augment the use of Hadoop tools.  We will talk about a few of those other tools next time.








Monday, November 7, 2016

Big Data Technology Explained- Part 1 (Hadoop)

I am going to break this topic into a few posts, simply because it could get quite long.  But my goal here is not to go into exhaustive detail on the technology in the big data ecosystem, or to go toe to toe with the big data architects of the world on which tools are better or faster.  Instead, my goal is to help business teams understand just enough about this technology to make more informed decisions with their internal and external technology partners as they begin a big data program.

After being around big data now for a number of years, I continue to be amazed at how complex the technology really is.  There IS a reason why there is such a shortage of skills in the big data space, not only in data science but also with just pure big data architects and programmers.  It is just flat out hard to get it all to work together.

But, let's just start with the basics and keep it at a high level for now.

Let's not get too deep in the weeds on when Big Data really started; that is just a waste of time.  Let's just start with the idea that Big Data really started to get moving in the Enterprise and Consumer spaces with the advent of what is widely known as Hadoop.

I am sure that many of you have heard this word used or have read about it somewhere in the last few years.  Hadoop started out as two simple components of a technology stack, inspired by papers Google published about its internal systems and built out at Yahoo many years ago to help make web search faster and more relevant for users.  The two parts were HDFS and MapReduce.  HDFS is the Hadoop Distributed File System, and MapReduce is a distributed processing framework that runs on top of that file system, breaking a job into many small tasks and combining their output into a set of results that can be used.

What is unique and special about HDFS, as opposed to other file systems or databases, is that it can store massive quantities of data, in any format and of any size, for an incredibly small amount of money.  So, as you can imagine, when you are trying to index "all" of the world's information, like Google and Yahoo were trying to do, it is really useful to have a data store that is incredibly flexible but also incredibly cheap.  So think of HDFS as the "Oracle" of this new world of big data.
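As a tiny illustration of that "store anything, cheaply" idea, here is a sketch of dropping a local file into HDFS from Python using the hdfs (WebHDFS) package.  The NameNode address and paths are placeholders, and WebHDFS would need to be enabled on the cluster.

```python
# A tiny sketch of loading a file of any format into HDFS over WebHDFS,
# using the `hdfs` Python package. The NameNode address, user and paths
# are placeholders.
from hdfs import InsecureClient

client = InsecureClient('http://namenode.example.com:50070', user='analyst')

# HDFS does not care whether this is a CSV, an image or a server log --
# it just splits the bytes into blocks and spreads them across the cluster.
client.upload('/data/raw/web_logs/access.log', 'access.log')

print(client.list('/data/raw/web_logs'))  # confirm the file landed
```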

Alongside the uniqueness of HDFS, MapReduce was also unique in the world of processing.  As you can imagine, with all of that data now sitting in HDFS, Yahoo and Google needed really fast ways to process it and make sense of it.  That is what MapReduce was all about.  It introduced a framework that could distribute processing across many, many different machines to get answers to questions faster.  The concept is sometimes referred to as massively parallel, distributed computing.  It just means that you can spread the work out from one computer to many, so that you get the answers much, much faster.
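The classic illustration of this is counting words.  Here is a toy, single-machine sketch of the map and reduce steps in Python; on a real cluster, Hadoop would run many copies of the map step and the reduce step in parallel across different machines, which is the whole trick.

```python
# The classic word-count example as a toy, single-machine sketch of the
# map and reduce steps. On a real cluster, Hadoop runs many copies of each
# step in parallel and handles moving the data between them.
import sys
from itertools import groupby


def mapper(lines):
    # "Map": emit a (word, 1) pair for every word seen in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    # "Reduce": add up the counts for each word. On a real cluster, Hadoop
    # delivers the mapper output to reducers already grouped by key.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # Run locally on stdin to see the idea end to end.
    for word, total in reducer(mapper(sys.stdin)):
        print(word, total)
```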

So, when we begin to talk about Big Data, these two components really were the engine that got the big data world moving over the last 10 years.  Early on, these software frameworks were donated to the Apache Software Foundation as open source tools, which allowed the technology to be used by literally anyone.  So when you hear the term Hadoop now, it is typically coupled with the word Apache (Apache Hadoop) for this reason.

Since that time, there have been dozens of new projects developed by teams of people from all over the world that layer on top of these original components of HDFS and MapReduce.  It is literally an entire ecosystem of technology tools, each leveraging or built right on top of Hadoop to accomplish some specific task.   See below for an "architecture" picture of the different tools in the ecosystem today, 10 years later.



And as processing power grows, memory costs shrink, storage costs continue to decline and use cases for Big Data continue to evolve, companies and individuals are creating new tools every day to deal with the specific challenges they have in their organizations.

In the next post, I will break down, at a high level, some of these tools in the graphic above that have been added to the Hadoop ecosystem over the last 5-7 years and why they are important in the context of business users and their needs.

Thursday, November 3, 2016

Business Value for Big Data Part 2

Ok, let's pick right up from where we left off in our last post and dig into buckets three and four from our business value use cases for big data.

Our third big bucket of value from big data projects was focused on driving real business value by more effectively predicting outcomes.  Again, the key to this bucket, like the others, is that today companies have a much different and more effective way of bringing disparate data sources together and extracting some kind of signal from all of the noise.  I mentioned GE in my previous post as a good example of driving value in this space.  They have committed billions of dollars over the last few years to become a software company because they see the business opportunity in being able to predict when machines will fail.  Much of their marketing has gone toward talking about airplane maintenance or optimizing monstrous wind farms in the middle of the ocean.  I think the common marketing pitch they give these days is that by better diagnosing problems with machines up front and eliminating downtime, there are trillions of dollars to be realized across the industrial machine space.  Yes, you read that right, Trillions.

But predicting outcomes doesn't need to be focused on such a large class of assets, or confined to one industry, to be valuable.  Using data to better predict outcomes cuts across all industries and across all lines of business within an enterprise.  IT organizations are using predictive analytics to determine how best to optimize their hardware and software to save costs in their data centers.  Security teams are using predictive analytics to find Advanced Persistent Threats within a network and cut off hackers before they even get started stealing data.  Sales organizations are using predictions across diverse data sources to more effectively target the prospects most likely to buy.  Marketers have been using predictions for years to make more personalized offers to customers when they are checking out online, think "People who bought this item also bought.....".   As we move into the future, marketers are getting ever more creative with big data and using predictions to make even more personalized offers across multiple platforms.  And Customer Service leaders are using predictive analytics to more effectively match call center agents with the customers calling in, based on their personality type, the class of problem or their recent activity.

Finally, big data has real value in a category I described last time as "plumbing".  While nowhere near as eye-catching or interesting as the other kinds of use cases that drive business value, updating the "plumbing" can be of tremendous value to many organizations.  In fact, a solid place to start a big data program can be a use case as mundane as moving away from a traditional Data Warehouse approach to storing and operating on data.  These traditional approaches can be incredibly expensive and incredibly frustrating to use and maintain for generating reports across the business.  The big data alternatives are able to leverage a lot of the same user-facing tools, but put in place a much more innovative back-end infrastructure that allows companies to significantly reduce their data warehouse costs and speed up processing and report creation by orders of magnitude.

I know that was a lot to ingest and consume all at once across these two posts.  But I think it is important for business people to understand that all of the hype you may hear about Big Data or Hadoop etc... has real legs and real value behind it.  The dirty little secret, well not so much a secret anymore, is that the real challenge with big data programs is less about determining outcomes to focus on and more about the complexity of the technology itself.

But we will get to that more in depth in one of the upcoming posts.

As always, please feel free to comment and share your experience with big data programs and their associated value for your company.