If you upload a 150 MB file to HDFS, how many blocks is it split into?

2 (assuming a block size of 128 MB): one of 128 MB and one of 22 MB.
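
The arithmetic behind this answer can be sketched as follows. This is only an illustration that assumes a 128 MB block size; `split_into_blocks` is a hypothetical helper, not part of any HDFS API.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file is split into.

    Assumes a 128 MB block size by default. The last block only holds
    the remaining data; it is not padded to a full block.
    """
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


print(split_into_blocks(150))  # [128, 22]
```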

Which statements about HDFS are true?

The HDFS namespace is a hierarchy of files and directories.

What is the Secondary NameNode?

The Secondary NameNode is a node that periodically merges the fsimage and the edits log files and keeps the size of the edits log within a limit. This allows the NameNode to start up faster after a failure, but the Secondary NameNode is not a redundant NameNode. Over the years, the HDFS team kept improving on these "alternative" NameNodes and introduced, almost every year, a new name with new functionality that improved on the previous ones. In the lecture, we discuss the most recent variant, standby NameNodes, and treat all the others (secondary, checkpoint, backup...) as "HDFS archeology".
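
The merge of fsimage and edits log can be pictured with a toy model. The sketch below assumes the namespace is just a list of paths and each edit is a hypothetical `(operation, path)` pair; the real fsimage and edits files are far richer, so this only illustrates the idea of replaying the log onto the last checkpoint.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edits):
    """Merge an old fsimage with the edits log into a new fsimage.

    Returns the new image together with an empty edits log, mimicking
    how a checkpoint allows the log to be truncated.
    """
    namespace = set(fsimage)
    for edit in edits:
        apply_edit(namespace, edit)
    return sorted(namespace), []


old_image = ["/", "/data"]
edit_log = [("create", "/data/a.csv"), ("create", "/tmp"), ("delete", "/tmp")]
new_image, new_log = checkpoint(old_image, edit_log)
print(new_image)  # ['/', '/data', '/data/a.csv']
print(new_log)    # []
```

With a short edits log, a restarting NameNode has fewer operations to replay, which is why startup gets faster.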

Is object storage or block storage better for storing experimental and simulation data from CERN?

Block storage, because it can handle very large files and scale to larger data volumes than ordinary object storage.

Which one of the two typically has strong consistency?

Block storage. Object stores have traditionally favored availability and offered only eventual consistency.

Which component is the main single point of failure in Hadoop?

Prior to Hadoop 2.0.0, the NameNode was a single point of failure.

If a task running on a particular DataNode needs a block whose replicas are stored on DataNodes in the same rack and on other racks, which replica will it read?

The one on the same rack, because the reading priority is based only on distance.
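
A minimal sketch of distance-based replica selection, assuming each node is identified by a hypothetical `(datacenter, rack, node)` tuple and that distance is the number of hops from both nodes up to their closest common ancestor in that tree (0 for the same node, 2 for the same rack, 4 for different racks):

```python
def distance(a, b):
    """Tree distance between two locations given as (datacenter, rack, node)."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Hops from each location up to the closest common ancestor, summed.
    return (len(a) - common) + (len(b) - common)


def closest_replica(reader, replicas):
    """Pick the replica location with the smallest distance to the reader."""
    return min(replicas, key=lambda location: distance(reader, location))


reader = ("d1", "r1", "n1")
replicas = [("d1", "r2", "n5"), ("d1", "r1", "n3"), ("d1", "r2", "n6")]
print(closest_replica(reader, replicas))  # ('d1', 'r1', 'n3')
```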

How large is a block in HDFS?

The typical block size is either 64 MB (the default in Hadoop 1.x) or 128 MB (the default since Hadoop 2.x).

List 4 advantages of the large block size of HDFS.

1. It minimizes the cost of seeks. If the block is large enough, the time it takes to transfer the data from the disk is significantly longer than the time to seek to the start of the block. Thus, transferring a large file made of multiple blocks operates at the disk transfer rate.

2. It reduces clients' need to interact with the NameNode, because reads and writes on the same block require only one initial request to the NameNode for block location information. The reduction is especially significant for typical HDFS workloads, because applications mostly read and write large files sequentially.

3. Since a client is more likely to perform many operations on a given large block, it can reduce network overhead by keeping a persistent TCP connection to the DataNode over an extended period of time.

4. It reduces the size of the metadata stored on the NameNode. This makes it possible to keep the metadata in memory.
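
Points 1 and 4 can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes a 10 ms seek time and a 100 MB/s transfer rate; both figures are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def seek_overhead(block_size_mb, seek_time_s=0.010, transfer_rate_mb_s=100.0):
    """Fraction of the time needed to read one block that is spent seeking."""
    transfer_time_s = block_size_mb / transfer_rate_mb_s
    return seek_time_s / (seek_time_s + transfer_time_s)


def block_count(file_size_mb, block_size_mb):
    """Number of blocks (and thus block metadata entries) needed for one file."""
    return -(-file_size_mb // block_size_mb)  # ceiling division on integers


one_tb_in_mb = 1024 * 1024
for size_mb in (1, 64, 128):
    print(
        f"block size {size_mb:>3} MB: "
        f"seek overhead {seek_overhead(size_mb):.2%}, "
        f"{block_count(one_tb_in_mb, size_mb)} blocks for a 1 TB file"
    )
```

Under these assumptions, a 1 MB block spends half of its read time seeking, while a 128 MB block spends under 1%, and the number of block entries the NameNode has to track for the same file shrinks by a factor of 128.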

What is HDFS's default placement strategy for replicas?

• The first replica is placed on the node where the client is. If the client is not in the cluster, the node is chosen randomly.
• The second replica is placed on a node in a different (remote) rack.
• The third replica is placed in the same rack as the second, but on a different node chosen at random.
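
The three rules above can be sketched as follows. This is a simplified illustration, not the actual HDFS implementation: the `cluster` layout and `place_replicas` are hypothetical, and the sketch assumes at least two racks with at least two nodes on the remote rack.

```python
import random


def place_replicas(cluster, client_node=None, rng=random):
    """Pick three (rack, node) targets following the policy described above.

    `cluster` maps a rack name to the list of node names on that rack.
    """
    all_nodes = [(rack, node) for rack, nodes in cluster.items() for node in nodes]
    # 1st replica: the client's node if it is in the cluster, else a random node.
    if client_node in all_nodes:
        first = client_node
    else:
        first = rng.choice(all_nodes)
    # 2nd replica: a random node on a different (remote) rack.
    remote = [n for n in all_nodes if n[0] != first[0]]
    second = rng.choice(remote)
    # 3rd replica: a different node on the same rack as the second replica.
    same_rack = [n for n in all_nodes if n[0] == second[0] and n != second]
    third = rng.choice(same_rack)
    return [first, second, third]


cluster = {"r1": ["n1", "n2"], "r2": ["n3", "n4"], "r3": ["n5", "n6"]}
print(place_replicas(cluster, client_node=("r1", "n1")))
```

The result always spans exactly two racks: losing a whole rack cannot destroy all copies, while only one of the three transfers has to cross racks.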

Which statements about HDFS are true?

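The locations of block replicas are not part of the persistent checkpoint that the NameNode stores in its native file system. They may change over time, so the NameNode rebuilds them from the block reports sent by the DataNodes.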

Explain how HDFS accomplishes the following requirements:

1. Scalability
2. Durability
3. High sequential read/write performance

1. Scalability: by partitioning files into blocks and distributing them to many servers operating in parallel, HDFS can scale to a potentially large number of files of any size. By adding more DataNodes, the storage capacity of the system can be increased arbitrarily. It has been demonstrated to scale beyond tens of petabytes (PB). More importantly, it does so with linear performance characteristics and cost.
2. Durability: HDFS creates multiple copies of each block (3 by default, spread across more than one rack) to minimize the probability of data loss.
3. High sequential read/write performance: HDFS splits huge files into blocks and spreads them across multiple machines. This makes parallel reads possible (accessing different nodes at the same time), either by using multiple clients or by using a distributed data processing framework such as MapReduce.
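
The interplay between scalability and durability can be illustrated with a rough capacity estimate. The sketch assumes every block is stored 3 times and uses a hypothetical 12 TB of disk per DataNode; it ignores space reserved for anything other than block data.

```python
def usable_capacity_tb(num_datanodes, disk_tb_per_node, replication=3):
    """Approximate usable capacity when every block is stored `replication` times."""
    raw_tb = num_datanodes * disk_tb_per_node
    return raw_tb / replication


for n in (10, 100, 1000):
    print(n, "DataNodes ->", usable_capacity_tb(n, disk_tb_per_node=12), "TB usable")
```

Usable capacity grows linearly with the number of DataNodes, and replication divides the raw capacity by the replication factor.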
