Apache Hadoop

April 13, 2011

With the recent popularization and propagation of VMs and cloud computing, systems now treat horizontal scalability as a prerequisite rather than a feature. Google recognized this need to scale horizontally early on, so it designed systems such as GFS (Google's distributed, fault-tolerant, redundant file system), BigTable (a distributed, fault-tolerant, auto-sharding column store), and MapReduce (a distributed computation framework); moreover, it shared the designs for these systems in publicly available white papers.

Enter Apache Hadoop, the open-source implementation of the aforementioned Google white papers. At its core lie a MapReduce framework and a distributed file system (HDFS). Many other projects have built on top of the Hadoop core to become important top-level Apache projects in their own right: HBase (à la BigTable, on top of HDFS), ZooKeeper (a distributed coordination and lock service), Hive (a SQL-like data warehouse on top of HDFS), Avro (a binary serialization framework like Thrift or Protocol Buffers), Pig (a language that makes writing MapReduce jobs easier), and many others.
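To give a feel for the MapReduce model at Hadoop's core, here is a toy single-process sketch in plain Python. This is not the Hadoop API; the function names are made up for illustration, and the real framework runs the map, shuffle, and reduce phases in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply the user's map function to every input record,
    # collecting the emitted (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    # Group all values by key, mimicking the framework's
    # shuffle-and-sort step between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Apply the user's reduce function once per key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The classic word-count job: map emits (word, 1) for each word,
# reduce sums the counts for each word.
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

lines = ["hadoop mapreduce", "hadoop hdfs"]
counts = reduce_phase(shuffle(map_phase(lines, wc_map)), wc_reduce)
print(counts)  # {'hadoop': 2, 'mapreduce': 1, 'hdfs': 1}
```

The appeal of the model is that the user writes only the two small pure functions (`wc_map` and `wc_reduce`); the framework handles distribution, fault tolerance, and the shuffle between phases.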

Come join us to learn more about Hadoop and its spawn, how they work, their uses, and how they apply to you!