What is Hadoop?

I have been fortunate enough to land a job at a company called Explorys, where I have been working since October of last year (2014). The phrase "Unlocking the power of BIG DATA" is a common company mantra. Explorys takes in patient data from hospital organizations (such as the Cleveland Clinic), normalizes it from their various medical management systems into a single system of record, and sells it back to the organization. It is a brilliant model: in essence, we ingest hospital data and sell it back to the customer.

Because of the enormous amount of data consumed in this process, Explorys has chosen a piece of software called Hadoop as its system of record. The Apache project describes Hadoop as follows:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

A good analogy: in pioneer days, when an ox could not move a log, a larger ox was not bred; instead, systems were devised for multiple oxen to share the work. In this analogy, the Hadoop datanodes are the oxen and the namenode is the driver that organizes them. There is only one primary namenode, which is a potential single point of failure, but implementations often use redundancy (such as a standby namenode) to reduce the risk of a significant outage. The cluster's capacity is bounded chiefly by the amount of memory the namenode has available for the metadata describing its files, blocks, and child datanodes.

Critical Points and Opinions

  1. Scalability and Flexibility:

    • Hadoop's ability to scale from single servers to thousands of machines makes it a versatile solution for handling big data. Its design ensures that it can grow with the needs of the organization, providing a scalable platform that can handle increasing amounts of data without requiring significant changes to the underlying infrastructure.
  2. Fault Tolerance:

    • One of Hadoop's strengths is its fault tolerance. By distributing data across multiple nodes and replicating it, Hadoop ensures that data remains available even if some nodes fail. This is crucial for maintaining high availability and reliability in big data environments. However, the single point of failure at the namenode can be a concern, and it is essential to implement redundancy to mitigate this risk.
  3. Cost-Effectiveness:

    • Hadoop's use of commodity hardware makes it a cost-effective solution for big data storage and processing. Organizations do not need to invest in expensive, specialized hardware, which significantly reduces the overall cost of implementation. Additionally, being open-source, Hadoop provides a free alternative to proprietary solutions, further enhancing its cost-effectiveness.
  4. Data Processing Capabilities:

    • The MapReduce framework provides a powerful model for processing large datasets. It abstracts the complexity of distributed processing, making it accessible to developers (a minimal word-count sketch follows this list). However, the batch-oriented nature of MapReduce can introduce latency, which may not be suitable for all use cases, particularly those requiring real-time processing.
  5. Ecosystem and Integration:

    • Hadoop's ecosystem includes a wide range of tools and projects, such as HBase, Hive, Pig, and Spark, which extend its capabilities. This ecosystem allows for the integration of Hadoop with various data processing and analysis tools, providing a comprehensive platform for big data. However, this diversity can also introduce complexity, requiring careful planning and management to ensure seamless integration and efficient operation.
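
To make the MapReduce point above (item 4) concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API. The class names and the whitespace tokenizing are my own placeholders, not anything from Explorys or the book.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in this task's input split.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

In a real project the mapper and reducer would live in their own files; they are shown together here only to keep the sketch self-contained.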

Hadoop: The Definitive Guide, 4th Edition

I started reading this book yesterday. It has helped shed some light on the mystery of the buzzword, "Big Data."

BOOK OVERVIEW:

=== Chapter 1: Meet Hadoop ===

  • Read/Write speeds impede application performance
  • Splitting data into small segments allows easier access, but machine failure becomes a problem
  • RAID-style mirroring is one answer to drive failure; HDFS relies on block replication instead
  • How do you read data pieces from multiple systems?
  • MapReduce?
  • "Batch" system
  • MapReduce detects failures and reschedules on healthy machines
  • Abstracts these thoughts away from the programmer/end user
  • Data can be loosely structured in comparison to traditional DBMS

=== Chapter 2: MapReduce ===

  • Inherently parallel
  • Work units are run as "jobs"
  • Map() and Reduce() jobs are packaged into JARs that are distributed across the cluster
  • Input and output paths
  • Failed jobs are automatically scheduled to run on a different node
  • Tasks & Splits
  • Tasks are divisions of a job: map tasks and reduce tasks
  • Splits are the fixed-size chunks the input data is divided into; one map task runs per split
  • Map output is written to local disk and considered "throw away"; it is not stored in HDFS
  • Combiner functions let you pre-aggregate map output, reducing the data shipped over the network to the reducers (see the driver sketch after these notes)
  • Tasks run as close to the data as possible: same machine, then same rack, etc.
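
To tie the notes above together, here is a minimal driver sketch: the job's code is packaged into a JAR that Hadoop ships around the cluster, the input and output paths come from the command line, and a combiner is plugged in to trim map output before it crosses the network. It reuses the word-count classes sketched earlier and is illustrative rather than production code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");

            // Tells Hadoop which JAR to distribute to the cluster nodes.
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);
            // Combiner: pre-aggregates map output locally to cut network traffic.
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input and output paths, e.g. hadoop jar wordcount.jar WordCountDriver in out
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }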

=== Chapter 3: The Hadoop Distributed Filesystem ===

  • Very large files (100 MB and up)
  • Streaming - Write once, read many
  • Commodity Hardware
    • "Commodity" is relative to the times
    • How much machine fault tolerance is acceptable?
  • "To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three)."
  • NOT IDEAL FOR
    • Low Latency Data Access
    • Lots of small files
    • Multiple writers
  • Block size is extremely large compared to traditional disk/filesystem block sizes
  • Nothing requires blocks of the same file to be stored on the same disk or machine
  • NameNode is the master!
  • The namenode is still a single point of failure (SPOF)
  • On large clusters with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes or more.
  • DataNode is the worker...
  • By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS
  • Multiple different interfaces to HDFS
  • Two Java Strategies for reading from HDFS
    • Stream from Hadoop URL
    • FileSystem API (see the sketch after this chapter's notes)
    • API Allows for:
      • creating files
      • appending files
      • making directories
      • finding file status
      • list directory contents
      • file patterns (globStatus & PathFilter)
      • delete
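
Here is a sketch of the second strategy, reading a file back out of HDFS through the FileSystem API (the URL-stream approach works similarly but requires registering an FsUrlStreamHandlerFactory). The URI is a placeholder; only the read path is shown.

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            String uri = args[0]; // e.g. hdfs://namenode:8020/some/file.txt (placeholder)

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            InputStream in = null;
            try {
                // open() returns an FSDataInputStream, which also supports seeking
                in = fs.open(new Path(uri));
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }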

INTERESTING BOOK QUOTES:

"In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What’s more, Hadoop is affordable since it runs on commodity hardware and is open source."

"The trend since then has been to sort even larger volumes of data at ever faster rates. In the 2014 competition, a team from Databricks were joint winners of the Gray sort benchmark. They used a 207-node Spark cluster to sort 100 terabytes of data in 1,406 seconds, a rate of 4.27 terabytes per minute."

"Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this."


TERMS:

  • Hadoop Distributed File System (HDFS)

    • Hadoop itself was named after the creator's kid's stuffed yellow elephant...
    • As a result, several Hadoop technologies reference animals
      • Pig
      • Impala
    • Notes that kids are great for coming up with "project names"
      • My son Jordan says: "Helwoho"
  • HBase

    • HBase provides random, realtime read/write access to data
    • Key/value storage & retrieval (see the sketch at the end of this list)
  • YARN

    • Yet Another Resource Negotiator
    • Hadoop 2
    • YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster
  • Impala

    • Interactive SQL Interface
  • Hive

    • Interactive SQL Interface
  • Spark

    • General-purpose cluster computing engine; an alternative to MapReduce that can keep working data in memory
  • Storm

    • Distributed realtime stream processing
  • Solr

    • Search platform built on Apache Lucene
  • MapReduce Job

    • A unit of work that the client wants to be performed
    • Consists of the input data, the MapReduce program, and configuration information
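
For the HBase key/value model noted above, here is a small put/get sketch using the HBase 1.x client API. The table name, column family, qualifier, and row key are all invented for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("notes"))) {

                // Write: a row keyed by "hadoop", one cell in column family "info".
                Put put = new Put(Bytes.toBytes("hadoop"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("title"),
                        Bytes.toBytes("What is Hadoop?"));
                table.put(put);

                // Read the same cell back by key.
                Get get = new Get(Bytes.toBytes("hadoop"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("title"));
                System.out.println(Bytes.toString(value));
            }
        }
    }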
