We have seen a lot of buzz around ‘Big Data’ and ‘Data as an Asset’ in recent years. When used effectively and with the right tools, data can be turned into a critical business advantage. There are different tools and systems that are better suited for different types of data and different uses of data. At Medio we use MapReduce as part of our ETL platform to process billions of user events in preparation for reporting and building analytical models.
MapReduce is simply an approach that facilitates scaling out processing capabilities (information extraction) through parallelization (partitioning) of data. The Map step partitions the data (or computation) across many processing nodes (physical or virtual), which allows horizontal scaling – using multiple low-cost servers rather than a few powerful and very expensive ones. The partitioned data is processed locally on each node. The Reduce step produces an aggregate (a count, summary, sum, de-duplication, filtration, or other transformation) of the data on each node. A final Reduce step aggregates the data across all nodes to produce the final result.
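The pattern can be sketched with the canonical word-count example. This is a single-process illustration of the Map, shuffle, and Reduce steps, not Medio's pipeline or Hadoop's API; the function names and sample partitions are invented for illustration:

```python
from collections import defaultdict
from itertools import chain

def map_step(document):
    """Map: run locally on one partition, emitting (word, 1) pairs."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Reduce: aggregate all values for one key (here, a simple count)."""
    return key, sum(values)

# Each string stands in for a data partition living on one node.
partitions = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_step(p) for p in partitions)
counts = dict(reduce_step(k, v) for k, v in shuffle(mapped).items())
# counts["the"] == 3, counts["fox"] == 2
```

In a real cluster, `map_step` runs in parallel where each partition is stored, and the shuffle moves data over the network – which is exactly why easily partitioned, repetitive data (below) is the sweet spot.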
MapReduce is the right approach for you if:
- Your company has a lot of data that is stored across multiple servers – MapReduce will let you distribute the computation to where the data is already
- Your data has a repetitive structure (many events from a large number of users, known set of states for a large set of items etc.)
- Your data is easily partitionable – you can process it by splitting it and running multiple instances in parallel
- You do a lot of batch oriented dataset processing
- You need tools to explore datasets with uncertain schema in an ad hoc manner
- You have a significant pool of existing machines that you would like to aggregate into something more powerful
- You lack database expertise but have strong developers/engineers
- Scaling out is important for your business
MapReduce is the wrong approach for you if:
- Your company works with streaming data that needs to be processed continuously/incrementally
- Your processing requires sharing the data (you heavily rely on access to common data)
- You are integrating with existing systems
- You need to combine data from a very large number of sources
- You require fast or real-time reporting
- You require strict guarantees on processing time or have no tolerance to potential variation in time
- You require ordered execution of processing over data
- You can work with samples or partial data to achieve good results
- You are provisioning the hardware for the task
- You need heterogeneous or cutting-edge hardware
Existing tools that expose MapReduce – with Hadoop leading the way – are mostly geared towards engineers and require significant IT resources. Today, solving problems with MapReduce can be tricky and slow. With recent announcements from several large companies, we are starting to see a shift that will enable broader commercial use of MapReduce. One direction is the recent integration of MapReduce with DBMSs. We are also starting to see more systems and tools reaching enterprise-level maturity and ease of use. Still, MapReduce is just one tool in the arsenal for working with Big Data.