Another look at Hadoop
Feb 14th, 2009 by admin
Apache Hadoop is a free software framework that supports data-intensive distributed applications. It enables applications to work with thousands of nodes and huge amounts of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers. Hadoop was originally a sub-project of Lucene before it became a top-level project at Apache. Being a top-level project has allowed sub-projects to be added.
There are now several of these, including:
- HBase: HBase provides Bigtable-like capabilities on top of Hadoop Core.
- ZooKeeper: ZooKeeper is a service for coordinating distributed systems
- Pig: Pig is a platform for analyzing large datasets
- Hive: Hive provides data warehousing for Hadoop
There are many other open source projects that have connections to Hadoop:
- Mahout: Mahout creates scalable machine learning libraries that run on Hadoop
- CloudBase: CloudBase is a data warehouse system built on Hadoop
- Cascading: Cascading is an API for building dataflows for Hadoop MapReduce
- Tashi: Tashi is an Apache incubator project for cloud computing for large datasets
- Disco: Disco is a MapReduce implementation in Erlang/Python
- Hypertable: Hypertable is a distributed data storage system, modeled on Google’s Bigtable
If you are interested in running Hadoop on EC2, the Hadoop Wiki gives some nice instructions on how to do this.
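
For readers new to the MapReduce model that Hadoop and several of the projects above build on, here is a minimal sketch of the idea as a word count in plain Python. It does not use Hadoop or any of its APIs; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over an iterable of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop runs MapReduce jobs", "MapReduce jobs scale out"]))
```

In a real framework the map and reduce steps would run in parallel on many nodes, with the grouping step moving data between them; this sketch only shows the shape of the computation.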














