Friday, September 8, 2017

Platform First or Effective Detection First

The other day a friend in the security industry forwarded me a blog post that really stirred me up and, frankly, frustrated me.

The reason: it was yet another post asking the wrong question of our industry and making broad generalizations that, in my experience, simply are not true when it comes to the convergence of the big data and security ecosystems.  The question being asked was this:

What is a bigger challenge for large scale security data analysis efforts, Scalable Platform or Effective Detection Content?

Two points to consider:
  1. I am getting really tired of people saying that building a security data lake is a fool's errand.  You can think it is a waste of time, but the reality is that those doing it today, for the most part, are forward-thinking organizations with a strategic vision and a plan for how they are going to use the data in the near term.  I have personally visited with 50+ companies in the last 2 years that all have a vision for how they are going to use the data they are collecting.  Even more, most of the companies building these lakes are already holding this data anyway for compliance or regulatory purposes, and building a lake simply makes that a cheaper proposition for them.  Seems pretty smart to me.
  2. The greater challenge the larger big data ecosystem is having is the lack of true applications that sit on top of the stack to provide the next layer of value.  Cost savings is the first layer of value these companies get from building a lake for security data.  The next is the value that comes from leveraging pre-built apps that can take advantage of this architecture.  Unfortunately, these apps are not coming as fast as the customer base deserves and needs, but they are coming.  The reality is that building the platform is not easy; in fact, it is very challenging.  But so is building the analytical apps on top of it.  It is not about just writing some Python code to build models.  It is about building code that not only analyzes data, but does it at scale and in a multi-tenant processing environment.  And if you think this is easy, then you really don't know what you are talking about.


Again, I think the question asked by the blogger is really a naive one.  It is not about whether one is more challenging than the other.  It is a question of the maturity of the offerings in the market and how they are delivered.  Both building a security data lake and building the app on top that provides analytical value are ridiculously hard.  Our job as a community is to do our best to hide this complexity and provide a software package that is easy to consume and get value from, not argue about which part is more challenging.

Wednesday, March 22, 2017

So Tired of People Using the Word Insights!

Ok, so this is not going to be a typical blog post where I talk about some technology, these days usually in the big data or security space, and how it impacts the customer experience.

Nope.... This is going to be a "Pet Peeve" blog post about something that has driven me crazy for years...

And that pet peeve is the use of the word "Insights"....

For the last few years now, every single vendor under the sun, regardless of whether they are an analytics company or not, has been talking about providing you "Insights" into your data or your operations or your customers, etc...

JUST STOP IT!!!

No one and I really mean NO ONE is really providing customers with Insights.

What they are providing customers with is data, graphs, charts, dashboards and any other cool, whiz-bang visuals that can be generated by a software program.

But can we please just agree that these are not Insights?  They are just another way of showing someone data and then forcing them to figure out what the data actually means, which eventually ends with a human generating the real Insights.

What I want to see more of in the industry in general is the software providing the end user with a full-blown, actual Insight.  So instead of showing someone a pie chart with 8 slices in it that make up a customer segmentation, there is a short description of what the data means, why it is important, a bit of context around it and a link that allows the person to explore the data that generated this Insight a bit more.

I think of Insights as "Leads": the machine does the heavy lifting of finding where you should be focused, based on the data, and then allows the user to dig deeper if need be.
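To make that a little more concrete, here is a rough sketch in Python of what a machine-generated Insight-as-a-Lead could carry instead of a bare chart.  The fields and the example content are purely hypothetical, not any vendor's actual format:

    from dataclasses import dataclass

    # Hypothetical shape of a generated "insight"; the fields and example
    # content are illustrative only, not any real product's format.
    @dataclass
    class Insight:
        headline: str        # what the data means, in one sentence
        why_it_matters: str  # the "so what" for the reader
        context: str         # comparison to a baseline or prior period
        explore_url: str     # link back to the underlying data for digging deeper

    lead = Insight(
        headline="Segment C drove the largest share of churn last quarter",
        why_it_matters="It is also the segment with the highest acquisition cost",
        context="Churn in this segment has roughly doubled year over year",
        explore_url="https://example.com/segments/c/churn",
    )
    print(f"{lead.headline} -- {lead.why_it_matters}")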

So, please vendors, stop talking about how you all provide "Insights".  And instead focus on building software that actually does provide Insights.



Wednesday, December 7, 2016

Big Data Technology Explained- Part 3 (Other Big Data Tech)

In the previous posts on Big Data, we talked about some of the base technology tools that are in use today by companies all over the world to drive their Big Data programs.  We talked about the Hadoop ecosystem and a few of the projects or tools that have become common in commercial use cases.  Things like HBase, Hive, Impala, Storm, Spark, etc...

But, we can't limit the big data world to just the world of Apache Hadoop.  There are dozens and dozens of other technology tools, frameworks, platforms and applications that have been developed over just the past 5 years that drive real value for organizations.  Take a look at the chart below and you can see that there are a ton of players.  But I will only dig into a few of them where I have seen real value generated by the folks I work with every day.


As I said, a TON of companies making a play in the Big Data arena.  Still lots of opportunity I think to build more useful apps, but that is for another post down the road.

Out of the many dozens of companies on this graphic, I want to call out a few:

Elastic:
Formerly known as ElasticSearch, Elastic is an open source indexing and search tool.  Similar to Apache Solr, Elastic is used by companies to take documents or chunks of data or even individual log events, index them, make them searchable and then purge them when space is needed.  The beauty to me of the Elastic platform is not just the search mechanism, but also the other tools that have been built to enhance the user experience around Elastic.  In particular, I call out Kibana as a great tool that sits on top of Elastic and makes it very easy to find the data you are looking for.
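As a small illustration, here is roughly what indexing and searching a log event looks like with the official Python client, elasticsearch-py.  The host, index name and field names are hypothetical, and the exact argument names vary a bit between client versions:

    from elasticsearch import Elasticsearch

    # Hypothetical cluster address and index name
    es = Elasticsearch("http://localhost:9200")

    event = {
        "timestamp": "2016-12-07T10:15:00",
        "source_ip": "10.0.0.5",
        "message": "login failed for user admin",
    }

    # Index the event so it becomes searchable almost immediately
    es.index(index="logs-2016.12.07", doc_type="event", body=event)

    # Full text search across all of the daily log indices
    results = es.search(index="logs-*",
                        body={"query": {"match": {"message": "failed"}}})
    for hit in results["hits"]["hits"]:
        print(hit["_source"])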

NiFi:
When it comes to the world of big data, almost nothing is more important than actually being able to easily and quickly move data from one place to another.  And not just move it, but move it securely, at scale, with the ability to recover if something happens in transit.  This is where Apache NiFi shines.  NiFi has been picked up with incredible speed by some of the largest companies in the world to fill the gap they all have in more effectively moving data around their organizations.
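NiFi flows are normally designed in its drag-and-drop UI rather than in code, but to give a feel for how systems hand data to it, here is a sketch of pushing an event into a flow through a NiFi ListenHTTP processor.  The host, port and payload are assumptions for illustration:

    import json
    import requests

    # Hypothetical endpoint exposed by a ListenHTTP processor in a NiFi flow;
    # "contentListener" is the processor's default base path, the host/port are made up.
    NIFI_ENDPOINT = "http://nifi-host:8081/contentListener"

    event = {"host": "web-01", "event": "user_login", "status": "success"}

    # Each POST becomes a FlowFile that the rest of the flow can route,
    # transform, secure and deliver, with data provenance along the way.
    resp = requests.post(NIFI_ENDPOINT,
                         data=json.dumps(event),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()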

Neo4j:
As we start to move to a world that is based more and more on relationships and networks, graph databases become more and more important, and that is what Neo4j is all about.  Every company, whether they like it or not, will need over the next few years to start connecting the dots that exist about their customers, partners, suppliers, etc... Doing this with traditional databases is almost impossible, and there are not many great tools within the common frameworks that make graph processing possible, besides Spark.  So my view is that we will see real growth in this area, and it should be one to keep an eye on for new, more user-friendly kinds of solutions.
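For a sense of what "connecting the dots" looks like in practice, here is a minimal sketch using the official Neo4j Python driver and its Cypher query language.  The connection details, labels and property names are hypothetical:

    from neo4j import GraphDatabase

    # Hypothetical connection details
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Connect a customer to a product they purchased
        session.run(
            "MERGE (c:Customer {id: $cid}) "
            "MERGE (p:Product {sku: $sku}) "
            "MERGE (c)-[:PURCHASED]->(p)",
            cid="cust-42", sku="sku-1001")

        # A relationship question that is painful in a traditional database:
        # what else did customers who bought this product also buy?
        result = session.run(
            "MATCH (:Product {sku: $sku})<-[:PURCHASED]-(:Customer)"
            "-[:PURCHASED]->(other:Product) "
            "RETURN other.sku AS sku, count(*) AS freq "
            "ORDER BY freq DESC LIMIT 5",
            sku="sku-1001")
        for record in result:
            print(record["sku"], record["freq"])

    driver.close()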

Ok, this is the end of this part of the series on Big Data, focused on the technology.   Again, my goal was not to go toe to toe with all of the architects of the world on what big data technology really is or how it works.  My goal was to help business teams get just enough of the detail about this technology that it helps them make more informed decisions with their internal and external technology partners for their big data programs.


Wednesday, November 23, 2016

Big Data Technology Explained- Part 2 (The Hadoop Ecosystem)

As I talked about in the last post, Hadoop was really the engine that started to drive the application of big data 10 years ago now.  But since then, there has been incredible growth within the Hadoop ecosystem to address many of the challenges that were identified by commercial organizations as they started to adopt Hadoop.

Of the 20+ Apache projects that have come to life as part of the ecosystem over the years, I will just mention a few of the larger ones and their significance in the grand scheme of things.  Again, this is not meant to be "developer 101", but instead, business person 101.  My goal here is to help business people know enough to ask the right kinds of questions of both their internal and external technology partners before they embark on a big data adventure.

HBase:
The first ecosystem component to talk about is Apache HBase.  HBase is what is called a NoSQL columnar database, which, to distill it all down, really just means you can read and write data really fast for time-sensitive operations, without needing SQL queries to get the results.  One of the areas in the commercial space where HBase is commonly used is as the "database" for lookups on websites.  So, if I am on eBay and I am searching for specific kinds of products, the search I do may go behind the scenes to something like HBase for a fast lookup to see if that product is available.  Or if I am on Etsy, all the clicks I make on the site during my visit could be tracked and then stored in something like HBase for immediate access by operations teams.  So when you hear the term HBase, think super fast reading and writing to a data store that sits right on top of Hadoop (HDFS).
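To make "super fast reading and writing" a bit more tangible, here is a minimal sketch using happybase, a popular Python client that talks to HBase through its Thrift gateway.  The host, table name, column family and row key scheme are all assumptions for illustration:

    import happybase

    # Hypothetical Thrift gateway host; assumes a "user_clicks" table with a
    # "clicks" column family already exists.
    connection = happybase.Connection("hbase-thrift-host")
    table = connection.table("user_clicks")

    # Write a click event keyed by user and timestamp (HBase stores raw bytes)
    table.put(b"user123|20161123T101500", {
        b"clicks:page": b"/products/shoes",
        b"clicks:referrer": b"google",
    })

    # Point lookup by row key...
    row = table.row(b"user123|20161123T101500")

    # ...or a short scan over one user's recent activity
    for key, data in table.scan(row_prefix=b"user123|"):
        print(key, data)

    connection.close()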

Hive (LLAP) and Impala:
Two other major projects that were developed to help speed the adoption of Hadoop in general were Apache Hive and Impala.  Both are what are referred to as "SQL engines" on top of HDFS.  As is the case with just about every new technology, making it easy for users to interact in a way they are familiar with is critical to driving adoption.  SQL happens to be a query language that is pretty much universally accepted and used by millions of analysts around the world with relational databases.  So in order to really drive the adoption of Hadoop, it made sense to build tools that could leverage those skill sets but still get the value, power, speed and low cost of the Hadoop backend.  That is where Hive and Impala come in.  They both allow folks with traditional SQL skills to continue doing what they do well, while leveraging the goodness of Hadoop behind the scenes.  Organizations may use Hive or Impala to run ad hoc queries or large scale summarizations across multiple, very large sets of data.
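Here is roughly what that looks like from Python with the PyHive library talking to HiveServer2 (Impala has a similar client, impyla).  The host, database and table names are hypothetical; the point is that the SQL itself is completely ordinary while the work runs on the cluster:

    from pyhive import hive

    # Hypothetical HiveServer2 endpoint and credentials
    conn = hive.connect(host="hive-server", port=10000, username="analyst")
    cursor = conn.cursor()

    # Familiar SQL, but executed as distributed work over data sitting in HDFS
    cursor.execute("""
        SELECT region, COUNT(*) AS orders, SUM(order_total) AS revenue
        FROM sales.orders
        WHERE order_date >= '2016-01-01'
        GROUP BY region
        ORDER BY revenue DESC
    """)

    for region, orders, revenue in cursor.fetchall():
        print(region, orders, revenue)

    cursor.close()
    conn.close()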

Spark:
Over the last several years, these projects started to solve the problems that companies had with the original Hadoop components of HDFS and MapReduce.  But one thing kept coming up over and over again as computing, and specifically memory, got cheaper: how can I make my processing go faster?  This is where Spark came in and became the new hot tool and darling of the big data community.  Spark was developed at UC Berkeley's AMPLab and has become one of the fastest growing open source projects in history.  The big reason is that it was able to speed up processing by orders of magnitude for users, by keeping data in memory for fast and easy access.  And for many organizations using data at the heart of their business, they needed this speed to make faster decisions.  Spark has also become popular lately because it is much more friendly for developers and has multiple uses for the same core engine.  It can be Spark for batch processing, Spark Streaming for real time data movement, or GraphX for graph processing.  So it is a tool that is incredibly flexible for a myriad of use cases.
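A tiny PySpark sketch shows the flavor.  The file path and column names are made up, but the pattern of reading data off HDFS, caching it in memory and running a distributed aggregation is the core of what makes Spark so popular:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("order-summary").getOrCreate()

    # Read a (potentially huge) dataset straight off HDFS into a distributed DataFrame
    orders = spark.read.csv("hdfs:///data/orders/*.csv", header=True, inferSchema=True)

    # Keep it in memory so follow-up queries don't re-read from disk
    orders.cache()

    # The aggregation is planned once, then executed in parallel across the cluster
    summary = (orders.groupBy("region")
                     .sum("order_total")
                     .withColumnRenamed("sum(order_total)", "revenue"))
    summary.show()

    spark.stop()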

Storm, Flink, Apex:
With the original processing framework of MapReduce, and then even the addition of other tools like Hive, Impala, etc..., there was always a missing piece to the puzzle.  And that was real time streaming analysis over large sets of data.  While those other tools did a great job running batch analysis on large sets of data, they were not built to do analysis on real time, streaming data.  A great example of real time streaming data would be something like Twitter: tons of data, coming in real time and needing to be analyzed in real time to make decisions.  This is just one of many use cases where real time is being used.  There are dozens of others, in every industry, where real time streaming analytics is becoming more and more popular and valuable.  Apache Storm was the first real project built to address this real time streaming analytics need and was open sourced by the folks at Twitter.  Apache Apex and Apache Flink are two other real time streaming projects that have also gained steam lately, along with Spark Streaming.
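Since Spark Streaming is the flavor many readers will bump into first, here is a classic word-count sketch over a live stream.  The host and port are placeholders for whatever feed (sockets, Kafka, etc.) you would actually plug in:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Process the incoming stream in 10-second micro-batches
    sc = SparkContext(appName="streaming-word-count")
    ssc = StreamingContext(sc, 10)

    # Placeholder source: lines of text arriving on a socket
    lines = ssc.socketTextStream("stream-host", 9999)

    # Count words continuously as the data flows in
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()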


YARN:
As many of these projects became a core part of any large scale big data deployment, there was a need to better manage the resources executing all of the queries and processes that people wanted to run.  That is where YARN comes in.  It was developed to be the layer on top of HDFS that manages cluster resources and schedules the work, kind of like the operating system for Hadoop.  Not something that business people are really going to care about, but still good to know what YARN is and where it fits in the picture.

Ranger and Knox:
No list of tools would be complete without talking about the security of the Hadoop stack.  With all of the concern around data privacy and security, over the last few years the open source community has ramped up work on tools that make it easier to lock down the data held within Hadoop.  That is what Ranger and Knox were developed to be: Knox acts as a secure gateway for access to the cluster, while Ranger handles fine-grained authorization and auditing, ensuring that only the right people or systems, with the right kinds of privileges, are able to access data in Hadoop.  For many commercial organizations, this has been the hurdle that needed to be cleared in order to adopt Hadoop and start deriving real business value.  It just flat out needed to be more secure.

Alright, so that is a good review of some of the core technologies that make up the Hadoop ecosystem and where they fit.  As I mentioned in a previous post, Hadoop is not the only ecosystem of technologies that is used in the big data space.  There are numerous other open source and proprietary frameworks and tools that are being used to augment the use of Hadoop tools.  We will talk about a few of those other tools next time.


Monday, November 7, 2016

Big Data Technology Explained- Part 1 (Hadoop)

I am going to break this topic into a few posts, simply because it could get quite long.  But my goal here is not to go into exhaustive detail on the technology in the big data ecosystem or go toe to toe with the big data architects of the world on what tools are better or faster etc....  Instead, my goal is to help business teams get just enough of the detail about this technology that it helps them make more informed decisions with their internal and external technology partners as they begin a big data program.

After being around big data now for a number of years, I continue to be amazed at how complex the technology really is.  There IS a reason why there is such a shortage of skills in the big data space, not only in data science but also with just pure big data architects and programmers.  It is just flat out hard to get it all to work together.

But, let's just start with the basics and keep it at a high level for now.

Let's not get too deep in the weeds on when Big Data really started, etc... Just a waste of time.  Let's just start with the idea that Big Data really got moving in the Enterprise and Consumer spaces with the advent of what is widely known as Hadoop.

I am sure that many of you have heard this word used or have read about it somewhere in the last few years.  Hadoop started out as two simple components of a technology stack, born out of work at Google (which published the original papers) and Yahoo (which built the open source implementation) many years ago to help make search faster and more relevant for users.  The two parts were HDFS and Map/Reduce.  HDFS is the Hadoop Distributed File System, and Map/Reduce is a distributed processing framework that runs on top of this file system, executing jobs over the data and spitting out a set of results that can be used.

What is unique and special about HDFS, as opposed to other file systems or databases, is that it can store mass quantities of data, in any format and at any size, for an incredibly small amount of money.  So, as you can imagine, when you are trying to index "all" of the world's information, like Google and Yahoo were trying to do, it is really useful to have a data store that is incredibly flexible but also incredibly cheap.  So think of HDFS as the "Oracle" of this new world of big data.
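For a feel of how simple it is to use from the outside, here is a sketch using the Python hdfs package, which talks to the cluster over WebHDFS.  The NameNode address, user and paths are hypothetical:

    from hdfs import InsecureClient

    # Hypothetical NameNode WebHDFS address and user
    client = InsecureClient("http://namenode:50070", user="analyst")

    # Write a small file into the distributed file system
    with client.write("/data/raw/events/sample.json", encoding="utf-8") as writer:
        writer.write('{"event": "page_view", "user": "u-1"}\n')

    # List what landed there and read it back
    print(client.list("/data/raw/events"))
    with client.read("/data/raw/events/sample.json", encoding="utf-8") as reader:
        print(reader.read())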

Alongside the uniqueness of HDFS, Map/Reduce was also unique in the world of processing.  As you can imagine, with all of that data that Yahoo and Google were now storing, they needed really fast ways to process it and make sense of it.  That is what Map/Reduce was all about.  It was the idea that you now had a framework that could distribute processing over many, many different machines to get answers to questions faster.  The concept is sometimes referred to as massively parallel, distributed computing.  It just means that you can spread out the work from one computer to many, so that you can get the answers much, much faster.
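A toy, single-machine illustration of the idea (in plain Python, with made-up input) can help.  In a real Hadoop job the map step runs in parallel on the nodes that hold each block of data, and the framework shuffles the intermediate keys across the network before the reduce step:

    from collections import defaultdict

    documents = [
        "big data makes search faster",
        "search over big data needs many machines",
    ]

    # Map: each input record independently emits (key, value) pairs
    mapped = []
    for doc in documents:
        for word in doc.split():
            mapped.append((word, 1))

    # Shuffle: group all values by key (Hadoop does this across the cluster)
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: combine each key's values into a final answer
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)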

So, when we are beginning to talk about Big Data, these two components really were the engine that got the big data world moving in the last 10 years.  Early on, these software frameworks were donated to the Apache Software Foundation as open source tools, which allowed the technology to be used by literally anyone.  So when you hear the term Hadoop now, it typically is coupled with the word Apache (Apache Hadoop) for this reason.

Since that time, there have been dozens of new projects developed by teams of people from all over the world that layer on top of these original components of HDFS and Map/Reduce.  It is literally an entire ecosystem of technology tools that leverage or are built right on top of Hadoop to accomplish some specific task.  See below for an "architecture" picture of the different tools in the ecosystem today, 10 years later.



And as processing power grows, memory costs shrink, storage costs continue to decline and use cases for Big Data continue to evolve, companies and individuals are creating new tools every day to deal with the specific challenges they have in their organizations.

In the next post, I will break down, at a high level, some of these tools in the graphic above that have been added to the Hadoop ecosystem over the last 5-7 years and why they are important in the context of business users and their needs.

Thursday, November 3, 2016

Business Value for Big Data Part 2

Ok, let's pick right up from where we left off in our last post and dig into buckets three and four from our business value use cases for big data.

Our third big bucket of value from big data projects was focused on driving real business value by more effectively predicting outcomes.  Again, the key to this bucket, like the others, is that today companies have a much different and more effective way of bringing disparate data sources together and extracting some kind of signal from all of the noise.  I mentioned GE in my previous post as a good example of driving value in this space.  They have committed billions of dollars over the last few years to become a software company because they see the business opportunity in front of them in being able to predict when machines will fail.  Much of their marketing has gone towards talking about airplane maintenance or optimizing monstrous wind farms in the middle of the ocean.  I think the common pitch they give these days is that by better diagnosing problems with machines up front and eliminating downtime, there are trillions of dollars to be realized across the industrial machine space.  Yes, you read that right, trillions.

But predicting outcomes doesn't need to be focused on such a large class of assets, or within a single industry, to be valuable.  Using data to better predict outcomes cuts across all industries and across all lines of business within an enterprise.  IT organizations are using predictive analytics to determine how best to optimize their hardware and software to save costs in their data centers.  Security teams are using predictive analytics to find Advanced Persistent Threats within a network and cut off hackers before they even get started stealing data.  Sales organizations are using predictions across diverse data sources to more effectively target prospects with a higher likelihood to buy.  Marketers have been using predictions for years to make more personalized offers to customers when they are checking out online, think "People who bought this item also bought...".  As we move into the future, marketers are getting ever more creative with big data and using predictions to make even more personalized offers across multiple platforms.  And customer service leaders are using predictive analytics to more effectively match call center agents with customers who are calling in, based on their personality, the class of problem, or their recent activity.

Finally, big data has real value in a category I described last time as "plumbing".  While nowhere near as eye catching or interesting as the other kinds of use cases that drive business value, updating the "plumbing" can be of tremendous value to many organizations.  In fact, a solid place to start a big data program can be a use case as mundane as moving away from a traditional Data Warehouse approach to storing and operating on data.  These traditional approaches can be incredibly expensive and incredibly frustrating to use and maintain for generating reports across the business.  The big data alternatives are able to leverage a lot of the same user-facing tools, but put in place a much more innovative back end infrastructure that allows companies to significantly reduce their data warehouse technology costs and speed up processing, or creating reports, by orders of magnitude.

I know that was a lot to ingest and consume all at once across these two posts.  But I think it is important for business people to understand that all of the hype you may hear about Big Data or Hadoop etc... has real legs and real value behind it.  The dirty little secret, well not so much a secret anymore, is that the real challenge with big data programs is less about determining outcomes to focus on and more about the complexity of the technology itself.

But we will get to that more in depth in one of the upcoming posts.

As always, please feel free to comment and share your experience with big data programs and their associated value for your company.

Thursday, October 27, 2016

Business Value for Big Data Part 1

In the last post of this series, I talked about the different types of use cases that are bubbling up for using big data technologies to drive value.  I talked about four buckets that I see these use cases falling into:
  • Faster and more advanced analytics
  • Customer 360
  • Predictive Analytics
  • Optimizing the Plumbing
One of the common themes seen over the last few months within the big data community, and tech blogs/sites in general, is the lack of value that companies seem to be getting from their investments in big data programs.  It is quite common to read one analyst or another writing about the "science projects" going on in the market with big data, and adoption of big data technologies being nowhere near the forecasts.  I even read a tweet from a well known analyst the other day calling for big data companies to focus on "outcomes" versus the technology.

While I do agree with this particular analyst in focusing technology projects on outcomes, I will say that I don't think this is really rocket science or anything new.  Focusing on outcomes should be what every company is doing, whether they are doing the investing or providing the technology.  Without focused outcomes, the project will be doomed from the beginning.

So what are some of those outcomes we should be focusing on for big data projects?
Well, they all come down to the same two big buckets we have seen for many years now:
  • Saving Money
  • Making Money
Now, one could argue that there are sub categories to these two outcomes, but by and large, these are what business leaders are looking at when investing in projects.

So then, how do the four use cases I laid out in the previous post connect to these two outcomes?
Let's focus on buckets one and two in this post and three and four in our next post.

Let's start with number one.  When looking at the "Faster and more advanced analytics" bucket, value starts to be realized when companies can find patterns that were never uncovered in the past.  As an example, think of a retailer that was able to optimize its truck routes more effectively because it had a more advanced way of looking at its data, saving huge dollars on fuel costs.  Or a Telco that was able to cross reference data from multiple silos to show broader patterns related to network outages and capacity, which directly impacts both customer acquisition and maintenance costs to the tune of multiple millions of dollars a year.

When we think about Customer 360, the associated business value no doubt straddles both saving and making money.  As we talked about in the last post, Customer 360 has been the panacea for marketers and customer service leaders for years.  For marketers, the Customer 360 represents the best opportunity they have at truly understanding their customers' wants and needs, and then being able to offer products or services that most closely match those wants and needs.  A great example of this would be insurance companies.  They are one of the "OGs" (originals) in the big data space (along with Telcos), collecting more data on their customers in one day than some companies capture in a year.  Now, by bringing all of this data together in new ways, marketers can offer more granular tiers of car insurance, thus broadening their prospect base.  Or they can much more easily identify customer life events that may trigger offers for new types of insurance to long time customers, thus driving new forms of revenue capture.

We cannot forget, though, that the Customer 360 is not only a win for marketers, but also a huge win for customer service leaders.  By giving them and their teams access to the full view of the customer, they are empowered to create a set of processes and experiences for customers that ultimately drive real business value.  Whether it be through providing an authentic customer experience (soft value), solving problems faster (hard value), or even getting proactive about problems that might be coming (hard value), the Customer 360 drives real value for both customers and companies via the customer service teams.

In our next post, we will tackle buckets three and four: using big data to more effectively predict outcomes, and fixing the "plumbing".