After being around big data for a number of years, I continue to be amazed at how complex the technology really is. There IS a reason why there is such a shortage of skills in the big data space, not only among data scientists but also among big data architects and programmers. It is just flat-out hard to get it all to work together.
But let's just start with the basics and keep it at a high level for now.
Let's not get too deep in the weeds on when Big Data really started. That is just a waste of time. Let's start with the idea that Big Data really got moving in the Enterprise and Consumer spaces with the advent of what is widely known as Hadoop.
I am sure that many of you have heard this word used or have read about it somewhere in the last few years. Hadoop started out as two simple components of a technology stack, modeled on designs that Google had published and developed heavily at Yahoo, to help make search faster and more relevant for users. The two parts were HDFS and Map/Reduce. HDFS is the Hadoop Distributed File System. Map/Reduce is a distributed processing framework: a Map/Reduce job runs on top of this file system, works through the data stored there, and produces a set of results that can be used.
What is unique and special about HDFS, as opposed to other file systems or databases, is that it can store massive quantities of data, in any format and of any size, for an incredibly small amount of money. So, as you can imagine, when you are trying to index "all" of the world's information, like Google or Yahoo were trying to do, it is really useful to have a data store that is incredibly flexible but also incredibly cheap. So think of HDFS as the "Oracle" of this new world of big data.
Alongside the uniqueness of HDFS, Map/Reduce was also unique in the world of processing. As you can imagine, with all of the data that companies like Yahoo were now storing in HDFS, they needed really fast ways to process that data and make sense of it. That is what Map/Reduce was all about. The idea was that you now had a framework that could distribute processing over many different machines to get answers to questions faster. The concept is sometimes referred to as Massively Distributed Parallel Computing. It just means that you can spread the work out from one computer to many, so that you get the answers much faster.
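To make the Map/Reduce idea a little more concrete, here is a minimal sketch in plain Python that counts words. It is not Hadoop code and does not use any Hadoop API. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and everything runs in a single process. In a real cluster, each chunk of text would be handed to a different machine for the map step, and the framework would take care of the grouping step in the middle.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Group every emitted value under its key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each word's list of counts into one total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(chunks):
    # Each chunk is mapped independently of the others, which is what
    # would let a cluster hand different chunks to different machines.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Two small "chunks" standing in for pieces of a much larger file.
    print(word_count(["big data is big", "data is everywhere"]))
```

The important design point is that the map step never needs to see more than its own chunk, so adding more machines lets you work through more chunks at the same time.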
So, when we begin to talk about Big Data, these two components really were the engine that got the big data world moving in the last 10 years. Early on, these software frameworks were donated to the Apache Software Foundation as open source tools, which allowed the technology to be used by literally anyone. So when you hear the term Hadoop now, it is typically coupled with the word Apache (Apache Hadoop) for this reason.
Since that time, dozens of new projects have been developed by teams of people from all over the world that layer on top of these original components of HDFS and Map/Reduce. It is literally an entire ecosystem of technology tools, each one leveraging or built right on top of Hadoop to accomplish some specific task. See below for an "architecture" picture of the different tools in the ecosystem today, 10 years later.
And as processing power grows, memory costs shrink, storage costs continue to decline and use cases for Big Data continue to evolve, companies and individuals are creating new tools every day to deal with specific challenges that they have in their organizations.
In the next post, I will break down, at a high level, some of these tools in the graphic above that have been added to the Hadoop ecosystem over the last 5-7 years and why they are important in the context of business users and their needs.