HDFS at ETHZ - ETH Zurich | Flashcards & Summaries


# Study materials for HDFS at ETHZ - ETH Zurich

Access free flashcards, summaries, practice exercises, and past exams for your HDFS course at ETHZ - ETH Zurich.

Q:

If you upload a 150 MB file to HDFS, how many blocks is it split into?

A:

Two: one block of 128 MB and one of 22 MB (with the default block size of 128 MB).
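
The arithmetic behind this card can be sketched as a small helper. `split_into_blocks` is a hypothetical illustration, not part of any HDFS API:

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> list[int]:
    """Return the sizes (in MB) of the blocks a file is split into.

    HDFS cuts a file into fixed-size blocks; only the last block may be
    smaller than the configured block size.
    """
    sizes = [block_size_mb] * (file_size_mb // block_size_mb)
    remainder = file_size_mb % block_size_mb
    if remainder:
        sizes.append(remainder)
    return sizes

print(split_into_blocks(150))  # a 150 MB file -> [128, 22]
```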

Q:

Which statements about HDFS are true?

A:

The HDFS namespace is a hierarchy of files and directories.

Q:

What is the Secondary NameNode?

A:

The Secondary NameNode is a node that periodically merges the fsimage with the edits log and keeps the edits log size within a limit. This allows the NameNode to start up faster after a failure, but the Secondary NameNode is not a redundant NameNode. Over the years, the HDFS team kept improving on these "alternative" NameNodes, introducing almost every year a new name with new functionality building on the former ones. The lecture discusses the latest variant, standby NameNodes, and treats all the others (secondary, checkpoint, backup, ...) as "HDFS archeology".
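
The merge that the Secondary NameNode performs can be modeled as replaying a log of operations on top of a namespace snapshot. This is a toy illustration with made-up names; the real fsimage and edits files are binary formats:

```python
def merge_checkpoint(fsimage: dict, edits: list) -> dict:
    """Apply the edits log to the fsimage snapshot, producing a new fsimage."""
    namespace = dict(fsimage)
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = args[0] if args else None
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

# A snapshot of the namespace plus the operations logged since it was taken:
fsimage = {"/data/a.txt": 150}
edits = [("create", "/data/b.txt", 42), ("delete", "/data/a.txt")]
new_fsimage = merge_checkpoint(fsimage, edits)
# After the merge, the edits log can be truncated, so a restarting NameNode
# only has to load the new fsimage instead of replaying a huge log.
```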

Q:

Is object storage or block storage better for storing experimental and simulation data from CERN?

A:

Block storage, because it can handle large files and store more data than ordinary object storage.

Q:

Which of the two (object storage or block storage) typically has strong consistency?

A:

Object storage

Q:

Which component is the main single point of failure in Hadoop?

A:

Prior to Hadoop 2.0.0, the NameNode was a single point of failure.

Q:

If a task running on a particular DataNode needs a block whose replicas are stored both on a DataNode in the same rack and on DataNodes in other racks, which replica will it read?

A:

The replica on the same rack, because read priority is based solely on network distance.
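
Distance-based replica selection can be sketched as follows. The distance values (0 for the same node, 2 for the same rack, 4 for a different rack) mirror Hadoop's network-topology convention, but the code itself is an illustration, not the actual HDFS implementation:

```python
def distance(reader: tuple, replica: tuple) -> int:
    """Topology distance between two (rack, node) pairs."""
    if reader == replica:
        return 0  # same node
    if reader[0] == replica[0]:
        return 2  # same rack, different node
    return 4      # different rack

def pick_replica(reader: tuple, replicas: list) -> tuple:
    """Read from the replica closest to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))

reader = ("rack1", "node3")
replicas = [("rack1", "node5"), ("rack2", "node1"), ("rack3", "node7")]
print(pick_replica(reader, replicas))  # -> ('rack1', 'node5'), the same-rack copy
```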

Q:

How large is a block in HDFS?

A:

A typical block size is 64 or 128 MB; 128 MB is the default in Hadoop 2 and later.

Q:

List four advantages of HDFS's large block size.

A:

1. It minimizes the cost of seeks. If the block is large enough, the time it takes to transfer the data from disk is significantly longer than the time to seek to the start of the block, so transferring a large file made of multiple blocks operates at the disk transfer rate.

2. It reduces the clients' need to interact with the master (the NameNode), because reads and writes on the same block require only one initial request for block location information. The reduction is especially significant for typical big-data workloads, where applications mostly read and write large files sequentially.

3. Since a client is more likely to perform many operations on a given large block, it can reduce network overhead by keeping a persistent TCP connection to the DataNode over an extended period of time.

4. It reduces the size of the metadata stored on the master, which makes it possible to keep all metadata in memory.
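
Advantage 1 can be made concrete with rough numbers. The figures below (10 ms average seek time, 100 MB/s sequential transfer rate) are illustrative assumptions, not measurements:

```python
SEEK_TIME_S = 0.010         # assumed average disk seek time
TRANSFER_RATE_MB_S = 100.0  # assumed sequential transfer rate

def seek_overhead(block_size_mb: float) -> float:
    """Fraction of total time spent seeking when reading one block."""
    transfer_time_s = block_size_mb / TRANSFER_RATE_MB_S
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time_s)

for size in (1, 64, 128):
    print(f"{size:>4} MB block: {seek_overhead(size):.1%} of time spent seeking")
```

With 1 MB blocks, half the time would go to seeking; with 128 MB blocks, the seek cost drops below one percent, so the read proceeds at close to the disk transfer rate.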

Q:

What is HDFS's default placement strategy for replicas?

A:

• Put one replica on the node where the client runs. If the client is not in the cluster, the node is chosen at random.
• Place another replica on a node in a different (remote) rack.
• Place the third replica in the same rack as the second, but on a different node, chosen at random.

Q:

Which statements about HDFS are true?

A:

The locations of block replicas are not part of the persistent checkpoint that the NameNode stores in its native file system; the NameNode reconstructs them from the block reports that the DataNodes send when they register.

Q:

Explain how HDFS accomplishes the following requirements:

1. Scalability
2. Durability
3. High sequential read/write performance

A:

1. Scalability: by partitioning files into blocks and distributing them to many servers operating in parallel, HDFS can scale to a large number of files of any size. By adding more DataNodes, the storage capacity of the system can be increased almost arbitrarily. HDFS has been demonstrated to scale beyond tens of petabytes (PB) and, more importantly, it does so with linear performance characteristics and cost.
2. Durability: HDFS creates multiple copies of each block (by default 3, spread across racks) to minimize the probability of data loss.
3. High sequential read/write performance: by splitting huge files into blocks and spreading these across multiple machines, parallel reads become possible (accessing different nodes at the same time), either with multiple clients or with a distributed data processing framework such as MapReduce.
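
The durability argument in point 2 can be made quantitative with a back-of-the-envelope model. Assume each replica is lost independently with probability `p` before re-replication kicks in; this is a simplification (real failures are correlated, which is exactly why placement is rack-aware):

```python
def loss_probability(p: float, replicas: int = 3) -> float:
    """Probability that all replicas of a block are lost, assuming
    independent per-replica loss probability p."""
    return p ** replicas

print(loss_probability(0.01, 1))  # no replication: 1% chance of loss
print(loss_probability(0.01))     # default 3 replicas: roughly 1e-6
```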

