HDFS at ETHZ - ETH Zurich | Flashcards & Summaries

Study materials for HDFS at ETHZ - ETH Zurich

Access free flashcards, summaries, practice exercises, and past exams for your HDFS course at ETHZ - ETH Zurich.

  • 51455 flashcards
  • 1064 students
  • 67 study materials

Sample flashcards for your HDFS course at ETHZ - ETH Zurich, created by fellow students on StudySmarter!

Q:

What problems does object storage solve?

A:

Block-based storage systems that grow beyond a hundred terabytes, or into multiple petabytes, may run into durability issues, hard scalability limits, or management overhead that goes through the roof.

Object storage shines at solving the provisioning and management issues that arise when storage expands to this scale.

The flat namespace organization of the data, in combination with its expandable metadata functionality, facilitates this ease of use.

Objects remain protected by storing multiple copies of the data over a distributed system.
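
To make the "flat namespace plus expandable metadata" idea concrete, here is a toy sketch in Java; the class and method names are illustrative, not any real object-store API:

  import java.util.HashMap;
  import java.util.Map;

  public class ToyObjectStore {
      // An object is just bytes plus an open-ended set of metadata tags.
      record StoredObject(byte[] data, Map<String, String> metadata) {}

      // Flat namespace: a single key-to-object map, no directory hierarchy.
      private final Map<String, StoredObject> store = new HashMap<>();

      void put(String key, byte[] data, Map<String, String> metadata) {
          store.put(key, new StoredObject(data, metadata));
      }

      public static void main(String[] args) {
          ToyObjectStore s = new ToyObjectStore();
          // The "/" in the key is just a character, not a directory separator.
          s.put("videos/2023/lecture01.mp4", new byte[0],
                Map.of("content-type", "video/mp4", "course", "BigData"));
      }
  }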

Q:

Which statements about HDFS are true?

A:

The locations of block replicas are not part of the persistent checkpoint that the NameNode stores in its native file system; the NameNode learns them at runtime from the block reports that DataNodes send.
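
Because replica locations exist only in NameNode memory, clients ask for them at runtime. A minimal sketch using the standard Java FileSystem API; the NameNode address and file path are hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocations {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Hypothetical NameNode address; adjust for your cluster.
          conf.set("fs.defaultFS", "hdfs://namenode:8020");
          FileSystem fs = FileSystem.get(conf);

          Path file = new Path("/data/example.bin"); // hypothetical path
          FileStatus status = fs.getFileStatus(file);
          // Ask the NameNode where each block's replicas currently are.
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation b : blocks) {
              System.out.println("offset=" + b.getOffset()
                  + " length=" + b.getLength()
                  + " hosts=" + String.join(",", b.getHosts()));
          }
      }
  }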

Q:

If you upload a file of 150 MB to HDFS, how many blocks is it split into?

A:

Two: one of 128 MB and one of 22 MB (assuming the default block size of 128 MB).
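
The split is simple ceiling arithmetic over the configured block size; a quick sketch, assuming the 128 MB default:

  public class BlockSplit {
      public static void main(String[] args) {
          long blockSize = 128L * 1024 * 1024; // default HDFS block size
          long fileSize = 150L * 1024 * 1024;  // the 150 MB file

          long fullBlocks = fileSize / blockSize;                   // 1
          long remainder = fileSize % blockSize;                    // 22 MB in bytes
          long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);  // 2

          System.out.println(totalBlocks + " blocks, last block = "
              + remainder / (1024 * 1024) + " MB");
      }
  }

Note that the last block occupies only 22 MB on disk; HDFS does not pad short final blocks to the full block size.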

Q:

Which statements about HDFS are true?

A:

The HDFS namespace is a hierarchy of files and directories.
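
That hierarchy is navigated with the same directory operations a local file system offers. A minimal sketch listing a directory via the Java FileSystem API; the path is hypothetical:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListDir {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // List the children of a (hypothetical) directory in the namespace.
          for (FileStatus s : fs.listStatus(new Path("/user/alice"))) {
              System.out.println((s.isDirectory() ? "dir  " : "file ") + s.getPath());
          }
      }
  }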

Q:

If a task running on a particular DataNode needs a block whose replicas are stored on DataNodes both in the same rack and in other racks, which replica will it read?

A:

The one in the same rack, because read priority is based purely on network distance.
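
HDFS models the network as a tree and measures distance as the number of hops to a common ancestor: 0 for the same node, 2 for the same rack, 4 for a different rack. The helper below is a sketch of that rule, not Hadoop's actual API (the real logic lives in org.apache.hadoop.net.NetworkTopology):

  public class TopologyDistance {
      // Illustrative distance rule for a flat two-level (rack/node) topology.
      static int distance(String nodeA, String rackA, String nodeB, String rackB) {
          if (nodeA.equals(nodeB)) return 0;  // same node: local read
          if (rackA.equals(rackB)) return 2;  // same rack: one switch away
          return 4;                           // different rack: through the core
      }

      public static void main(String[] args) {
          // A reader on rack1 prefers the rack1 replica (distance 2) over rack2 (distance 4).
          System.out.println(distance("dn1", "rack1", "dn2", "rack1")); // 2
          System.out.println(distance("dn1", "rack1", "dn3", "rack2")); // 4
      }
  }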

Q:

Which component is the main single point of failure in Hadoop?

A:

Prior to Hadoop 2.0.0, the NameNode was a single point of failure.

Q:

How does the hardware cost grow as a function of the amount of data we need to store in a distributed file system such as HDFS? Why?

A:

Linearly. HDFS is designed with machine failure in mind, so DataNodes do not need to be highly reliable (and highly expensive) machines; capacity grows by adding commodity nodes.

Q:

What is HDFS's default placement strategy for replicas?

A:
  • Put the first replica on the node where the client runs; if the client is not in the cluster, the node is chosen at random.
  • Place the second replica on a node in a different (remote) rack.
  • Place the third replica in the same rack as the second, but on a different node chosen at random (a sketch of this logic follows the list).
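
A minimal sketch of the three placement steps above; illustrative only, since the real logic lives in HDFS's BlockPlacementPolicyDefault, which also handles fallbacks when a rack has too few healthy nodes:

  import java.util.List;
  import java.util.Random;
  import java.util.function.Predicate;

  public class PlacementSketch {
      record Node(String name, String rack) {}

      static Node[] chooseTargets(Node client, List<Node> cluster, Random rnd) {
          // First replica: the writer's node if it is in the cluster, else a random node.
          Node first = cluster.contains(client)
                  ? client
                  : cluster.get(rnd.nextInt(cluster.size()));
          // Second replica: any node on a different (remote) rack.
          Node second = pick(cluster, rnd, n -> !n.rack().equals(first.rack()));
          // Third replica: same rack as the second, but a different node.
          Node third = pick(cluster, rnd,
                  n -> n.rack().equals(second.rack()) && !n.equals(second));
          return new Node[] {first, second, third};
      }

      // Picks a random node satisfying the predicate (assumes one exists).
      static Node pick(List<Node> cluster, Random rnd, Predicate<Node> ok) {
          List<Node> candidates = cluster.stream().filter(ok).toList();
          return candidates.get(rnd.nextInt(candidates.size()));
      }
  }
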
Q:

List four advantages of HDFS's large block size.

A:
  1. It minimizes the cost of seeks. If the block is large enough, the time to transfer the data from disk can be significantly longer than the time to seek to the start of the block, so transferring a large file made of multiple blocks operates at the disk transfer rate (see the worked calculation after this list).

  2. It reduces the clients' need to interact with the NameNode, because reads and writes on the same block require only one initial request for block location information. The reduction is especially significant because typical HDFS applications mostly read and write large files sequentially.

  3. Since a client is more likely to perform many operations on a given large block, it can reduce network overhead by keeping a persistent TCP connection to the DataNode over an extended period of time.

  4. It reduces the size of the metadata stored on the NameNode, which allows the metadata to be kept in memory.
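
To make advantage 1 concrete, a back-of-the-envelope calculation; the 10 ms seek time and 100 MB/s transfer rate are assumed round numbers for a spinning disk:

  public class SeekOverhead {
      public static void main(String[] args) {
          double seekMs = 10.0;        // assumed average seek time
          double transferMBps = 100.0; // assumed sequential transfer rate

          for (int blockMB : new int[] {1, 128}) {
              double transferMs = blockMB / transferMBps * 1000.0;
              double overhead = seekMs / (seekMs + transferMs) * 100.0;
              // 1 MB blocks: ~50% of the time is spent seeking; 128 MB blocks: under 1%.
              System.out.printf("%4d MB block: %.1f%% seek overhead%n", blockMB, overhead);
          }
      }
  }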

Q:

What is the drawback if HDFS were to store all 3 replicas on different racks?

A:

Even though it would increase expected availability, it would also slow down the write process: the replication pipeline would involve two inter-rack transfers instead of one, and inter-rack communication is slower than intra-rack communication.

Q:

Explain how HDFS accomplishes the following requirements:

  1. Scalability
  2. Durability
  3. High sequential read/write performance
A:
  1. Scalability: by partitioning files into blocks and distributing them across many servers operating in parallel, HDFS can scale to a very large number of files of any size. By adding more DataNodes, the storage capacity of the system can be increased almost arbitrarily. HDFS has been demonstrated to scale beyond tens of petabytes (PB), and it does so with linear performance characteristics and linear cost.
  2. Durability: HDFS creates multiple copies of each block (by default 3, on different racks) to minimize the probability of data loss (see the toy calculation after this list).
  3. High sequential read/write performance: by splitting huge files into blocks and spreading them over multiple machines, parallel reads become possible (accessing different nodes at the same time), either with multiple clients or with a distributed data processing framework such as MapReduce.
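
For intuition on point 2: under the simplifying assumption that replicas are lost independently with probability p, all three replicas are lost with probability p^3. A toy calculation (p = 0.01 is an assumed figure):

  public class ReplicaLoss {
      public static void main(String[] args) {
          double p = 0.01; // assumed independent loss probability per replica
          System.out.println("1 copy lost with probability:         " + p);              // 0.01
          System.out.println("3 replicas all lost with probability: " + Math.pow(p, 3)); // 1.0E-6
      }
  }
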
Q:

What is the Secondary NameNode?

A:

The Secondary NameNode is a node that periodically merges the fsimage and the edits log and keeps the edits log within a size limit. This allows the NameNode to start up faster after a failure, but the Secondary NameNode is not a redundant NameNode. Over the years, the HDFS team kept improving on these "alternative" NameNodes, coming up almost every year with a new name and new functionality improving on the former ones. The lecture discusses the latest variant, Standby NameNodes, and treats all the others (Secondary, Checkpoint, Backup...) as "HDFS archeology".
