
Solving Common BI on Hadoop Performance Pitfalls


In this article, we’ll look at several ways to answer the question of how to do BI on big data, and compare those alternatives with Jethro’s approach. I have explored, and (sad to admit) even implemented, most of the alternatives mentioned in this piece, back before big data grew into what is available now. Please feel free to email comments and share your experiences with me and the rest of the Jethro team, so we can advise on similar situations and help you avoid the pitfalls described here.

Core Hadoop

One of the most common, and frankly naïve, ways people approach BI on Hadoop is to use core Hadoop to power BI reporting through tools such as Tableau, SSRS, or a third-party system like Business Objects. It generally ends in failure. By core Hadoop we mean the following services:

  • HDFS
  • MapReduce/MR2 YARN
  • Hive/HiveServer2
  • Zookeeper

Now don’t get me wrong: MR2/YARN and the other services listed above are excellent at what they are meant to do. But one of them is often misrepresented as a multipurpose solution, which it is not: Hive. The purpose of Hive is simple: batch processing in and out, and that’s it. When I was working with a company a few years back, there weren’t many alternatives, and I remember seeing demos that used Hive tables and views as the source for a Tableau dashboard. At the time, we were looking for ways to get data from Hive into Tableau faster, and we ultimately settled on SQL Server Analysis Services Tabular (more to come on this). I clearly remember asking the presenter to change filters over a screen share, and then the awkward wait as one, two, five minutes passed while a MapReduce job, unbeknownst to the audience, ran in the background to fetch the new data.

One of the major problems with this solution is that Tableau, and other tools that rely on queries to refresh data, launch MapReduce jobs whenever users explore and change the initial set of data. There are clever ways around this, such as extracting all the data up front, but oftentimes the cost of doing so is not worth it.
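To make the trade-off concrete, here is a minimal sketch in Python. It is a toy model, not Tableau’s or Hive’s actual behavior: `FakeHiveBackend`, `live_connection`, and `extract_up_front` are hypothetical names invented for illustration, and the only thing the model tracks is how many backend queries (and therefore batch jobs, in the scenario described above) each approach triggers.

```python
from dataclasses import dataclass


@dataclass
class FakeHiveBackend:
    """Hypothetical stand-in for a Hive endpoint; counts the jobs it would launch."""
    rows: list
    jobs_launched: int = 0

    def query(self, predicate):
        # Assumption of this toy model: every query becomes one batch job.
        self.jobs_launched += 1
        return [r for r in self.rows if predicate(r)]


def live_connection(backend, filters):
    """Each filter change is sent to the backend as a new query."""
    return [backend.query(f) for f in filters]


def extract_up_front(backend, filters):
    """Pull everything once, then apply every filter locally."""
    extract = backend.query(lambda r: True)
    return [[r for r in extract if f(r)] for f in filters]


# Made-up sample data and filter changes.
rows = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 20},
    {"region": "east", "sales": 30},
]
filters = [
    lambda r: r["region"] == "east",
    lambda r: r["region"] == "west",
    lambda r: r["sales"] > 15,
]

live = FakeHiveBackend(rows)
live_results = live_connection(live, filters)

extracted = FakeHiveBackend(rows)
extract_results = extract_up_front(extracted, filters)

print(live.jobs_launched, extracted.jobs_launched)
```

Both approaches return the same filtered rows, but the live connection pays for one job per filter change while the extract pays for a single, much larger pull. With a table of tens of millions of rows, that single pull is exactly the cost that is often not worth it.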

For example, take the connection below to a table with ~75M rows:

[Screenshot: table with ~75M rows]

[Screenshot: Connecting Tableau to Hive and Spark]

The screenshot illustrates how long the wait for metadata generation was (I killed it after two minutes). One could argue that limiting the initial import would fix the issue: with a LIMIT 500,000 I can pull the data through, and that is true. But what do you do when your smallest dataset starts at 25M, 50M, or more rows?
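A quick back-of-the-envelope check shows how little of the table such a limit covers. The sketch below is in Python; the table name `sales_facts` and both helper functions are hypothetical, and the row counts are the ones quoted above.

```python
def limited_query(table: str, limit: int) -> str:
    """Build a HiveQL query that caps the number of rows returned."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {table} LIMIT {limit}"


def sample_fraction(limit: int, total_rows: int) -> float:
    """Fraction of the table that a LIMIT of this size can return."""
    return min(limit, total_rows) / total_rows


LIMIT = 500_000
TOTAL_ROWS = 75_000_000  # the ~75M-row table from the example

print(limited_query("sales_facts", LIMIT))  # "sales_facts" is a made-up name
print(f"{sample_fraction(LIMIT, TOTAL_ROWS):.2%}")
```

For the 75M-row table, the limit covers well under 1% of the rows, and without an ORDER BY there is no guarantee about which rows come back. Any dashboard built on that slice answers questions about the slice, not about the full dataset.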

You can spend many man-hours trying to make this work with partitions, by shrinking data into smaller tables, or with other downstream tasks, but none of these will prove sustainable or provide the speed your customers expect.
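One way to see why shrinking data into smaller tables is hard to sustain is to sketch what a rollup does. The Python below is an illustrative model only: the `rollup` helper, the column names, and the sample rows are all made up, and in practice the aggregation would run as a batch job on the cluster rather than in memory.

```python
from collections import defaultdict


def rollup(rows, dims, measure):
    """Aggregate detail rows into one summary row per combination of dims."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)


# Made-up detail rows standing in for a large fact table.
detail = [
    {"day": "2016-01-01", "region": "east", "product": "a", "sales": 10},
    {"day": "2016-01-01", "region": "east", "product": "b", "sales": 5},
    {"day": "2016-01-01", "region": "west", "product": "a", "sales": 7},
    {"day": "2016-01-02", "region": "east", "product": "a", "sales": 3},
]

# The summary table is smaller, but it only knows about day and region.
by_day_region = rollup(detail, ["day", "region"], "sales")
print(len(detail), "->", len(by_day_region))

# A question about product cannot be answered from by_day_region;
# it needs another pass over the detail data and another table to maintain.
by_product = rollup(detail, ["product"], "sales")
print(by_product)
```

Each summary table is fast for the questions it was built for and useless for the rest, so every new dimension a user wants to explore means another rollup job and another table to keep in sync.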

Retro Solutioning

Retro Solutioning, which may be a coined term (I’m honestly not sure), is what happens when you don’t know enough about the Hadoop services beyond the core features and decide to stick with what’s “safe.” The resulting solution uses technologies your team already has at hand and is comfortable with, and it is usually the second line of attack when the initial plan doesn’t quite work out (see the previous section).

Every company and situation is unique, and there are valid cases for using existing technology. More often than not, though, it proves to be additional work and, by nature, additional processes for you to manage. Not only does this negate the purpose of Hadoop and big data, it also wastes your team’s time and resources and has a high likelihood of becoming a band-aid solution instead of a scalable one.

Source: jethro
