HDFS

2024-07-09 02:20 | Source: compiled from the web

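Overview

Within the Hadoop stack, HDFS is one of the core components. The actual data lives on the DataNodes: each DataNode stores data blocks in directories on its local file system, and a single DataNode can be configured with several storage directories (possibly on different types of disks) through the dfs.datanode.data.dir parameter in hdfs-site.xml.

A DataNode in a typical Hadoop cluster has several data disks. When a new block is written to HDFS, the DataNode uses a volume choosing policy to pick the disk directory that will store the block. Two volume choosing policies are currently available: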
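To make the shape of a dfs.datanode.data.dir value concrete, here is a minimal sketch that splits such a value into directories. It assumes the comma-separated form with an optional storage type tag such as [SSD] in front of a path, and treats untagged entries as DISK. The class name DataDirSketch and the example paths are hypothetical, and this is not Hadoop's own parsing code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for a dfs.datanode.data.dir value; not Hadoop's own class. */
final class DataDirSketch {

  /** One configured storage directory: a storage type tag plus a path. */
  static final class Dir {
    final String storageType;
    final String path;

    Dir(String storageType, String path) {
      this.storageType = storageType;
      this.path = path;
    }
  }

  // Optional "[TYPE]" prefix followed by the directory path.
  private static final Pattern TAGGED = Pattern.compile("^\\[(\\w+)\\](.+)$");

  /** Splits the comma-separated value and extracts the optional type tag. */
  static List<Dir> parse(String value) {
    List<Dir> dirs = new ArrayList<>();
    for (String raw : value.split(",")) {
      String entry = raw.trim();
      if (entry.isEmpty()) {
        continue;
      }
      Matcher m = TAGGED.matcher(entry);
      if (m.matches()) {
        dirs.add(new Dir(m.group(1).toUpperCase(), m.group(2).trim()));
      } else {
        // Assumption: an untagged entry is an ordinary disk.
        dirs.add(new Dir("DISK", entry));
      }
    }
    return dirs;
  }

  public static void main(String[] args) {
    // Hypothetical layout: two spinning disks and one SSD.
    for (Dir d : parse("/data/1/dn,/data/2/dn,[SSD]/data/ssd1/dn")) {
      System.out.println(d.storageType + " -> " + d.path);
    }
  }
}
```

Each parsed entry corresponds to one volume that the choosing policies discussed below can select.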

round-robin (default)

available space

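Problems encountered:

Hadoop clusters are usually large and maintained over a long period, which involves many routine operations: adding new servers, periodically replacing failed disks, decommissioning servers, deleting historical data, and so on. These operations lead to data imbalance between nodes as well as imbalance among the disks within a single DataNode.

1. Imbalance between nodes can be corrected with HDFS's own balancer tool;

2. For imbalance among the disks of a single DataNode, Hadoop 3.0 introduced the disk balancer (diskbalancer).

This raises a question: why don't the DataNode's own volume choosing policies solve these imbalance problems well?

Let's look at the DataNode source code related to volume selection: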

hdfs-site.xml: configuration property dfs.datanode.fsdataset.volume.choosing.policy

org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy (default)

org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy

A. RoundRobinVolumeChoosingPolicy:

/**
 * Choose volumes in round-robin order.
 */
public class RoundRobinVolumeChoosingPolicy<V extends FsVolumeSpi>
    implements VolumeChoosingPolicy<V> {
  public static final Log LOG = LogFactory.getLog(RoundRobinVolumeChoosingPolicy.class);

  private int curVolume = 0;

  @Override
  public synchronized V chooseVolume(final List<V> volumes, long blockSize)
      throws IOException {

    if (volumes.size() < 1) {
      throw new DiskOutOfSpaceException("No more available volumes");
    }

    // since volumes could've been removed because of the failure
    // make sure we are not out of bounds
    if (curVolume >= volumes.size()) {
      curVolume = 0;
    }

    int startVolume = curVolume;
    long maxAvailable = 0;

    // Walk the volume list, starting from the current position
    while (true) {
      final V volume = volumes.get(curVolume);
      curVolume = (curVolume + 1) % volumes.size();
      long availableVolumeSize = volume.getAvailable();

      // Enough free space for the block: return this volume immediately
      if (availableVolumeSize > blockSize) {
        return volume;
      }

      // Track the largest free space seen so far
      if (availableVolumeSize > maxAvailable) {
        maxAvailable = availableVolumeSize;
      }

      // Back at the starting volume: no volume can hold the block
      if (curVolume == startVolume) {
        throw new DiskOutOfSpaceException("Out of space: "
            + "The volume with the most available space (=" + maxAvailable
            + " B) is less than the block size (=" + blockSize + " B).");
      }
    }
  }
}

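As the code shows, round-robin is itself intended to balance data. It guarantees that every disk gets used, but it rotates purely by block count and ignores the size of each block. If block sizes vary widely, the disks still end up unevenly filled. In addition, heavy deletion of files on HDFS can also leave data unevenly distributed across the disks.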
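The effect of rotating by block count can be seen in a toy model. The sketch below is not Hadoop code: volumes are plain byte counters, free-space checks are left out, and the workload (full 128 MB blocks alternating with 1 MB blocks) is a made-up worst case chosen to make the skew obvious.

```java
/** Toy model of round-robin placement; volumes are just byte counters. */
final class RoundRobinImbalanceSketch {

  /**
   * Places blocks on volumes in strict rotation and returns the bytes used
   * per volume. Free-space checks are omitted: every volume is assumed to be
   * large enough for every block.
   */
  static long[] place(int volumeCount, long[] blockSizes) {
    long[] used = new long[volumeCount];
    int cur = 0;
    for (long size : blockSizes) {
      used[cur] += size;
      cur = (cur + 1) % volumeCount;
    }
    return used;
  }

  public static void main(String[] args) {
    final long mb = 1024L * 1024L;
    // Hypothetical workload: 128 MB blocks alternating with 1 MB blocks.
    long[] blocks = new long[1000];
    for (int i = 0; i < blocks.length; i++) {
      blocks[i] = (i % 2 == 0) ? 128 * mb : 1 * mb;
    }
    long[] used = place(2, blocks);
    for (int v = 0; v < used.length; v++) {
      System.out.println("volume " + v + ": " + used[v] / mb + " MB");
    }
  }
}
```

With two volumes, each receives 500 blocks, yet volume 0 ends up holding 64000 MB and volume 1 only 500 MB. Real workloads are far less regular, but the same mechanism applies whenever block sizes differ.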

Now for the second implementation.

B. AvailableSpaceVolumeChoosingPolicy:

/**
 * A DN volume choosing policy which takes into account the amount of free
 * space on each of the available volumes when considering where to assign a
 * new replica allocation. By default this policy prefers assigning replicas to
 * those volumes with more available free space, so as to over time balance the
 * available space of all the volumes within a DN.
 */
public class AvailableSpaceVolumeChoosingPolicy<V extends FsVolumeSpi>
    implements VolumeChoosingPolicy<V>, Configurable {

  // Configuration loading and initialization (omitted)
  .................

  // Round-robin policy used when the volumes are already balanced
  private final VolumeChoosingPolicy<V> roundRobinPolicyBalanced =
      new RoundRobinVolumeChoosingPolicy<V>();
  // Round-robin policy used among the volumes with high available space
  private final VolumeChoosingPolicy<V> roundRobinPolicyHighAvailable =
      new RoundRobinVolumeChoosingPolicy<V>();
  // Round-robin policy used among the volumes with low available space
  private final VolumeChoosingPolicy<V> roundRobinPolicyLowAvailable =
      new RoundRobinVolumeChoosingPolicy<V>();

  @Override
  public synchronized V chooseVolume(List<V> volumes, long replicaSize)
      throws IOException {
    if (volumes.size() < 1) {
      throw new DiskOutOfSpaceException("No more available volumes");
    }

    AvailableSpaceVolumeList volumesWithSpaces =
        new AvailableSpaceVolumeList(volumes);

    // If all volumes are within the (configurable) balance threshold,
    // choose a volume with plain round-robin
    if (volumesWithSpaces.areAllVolumesWithinFreeSpaceThreshold()) {
      // If they're actually not too far out of whack, fall back on pure round
      // robin.
      V volume = roundRobinPolicyBalanced.chooseVolume(volumes, replicaSize);
      if (LOG.isDebugEnabled()) {
        LOG.debug("All volumes are within the configured free space balance " +
            "threshold. Selecting " + volume + " for write of block size " +
            replicaSize);
      }
      return volume;
    } else {
      V volume = null;
      // If none of the volumes with low free space have enough space for the
      // replica, always try to choose a volume with a lot of free space.
      // Largest available space among the low available space volumes
      long mostAvailableAmongLowVolumes = volumesWithSpaces
          .getMostAvailableSpaceAmongVolumesWithLowAvailableSpace();

      // Volumes with high available space
      List<V> highAvailableVolumes = extractVolumesFromPairs(
          volumesWithSpaces.getVolumesWithHighAvailableSpace());
      // Volumes with low available space
      List<V> lowAvailableVolumes = extractVolumesFromPairs(
          volumesWithSpaces.getVolumesWithLowAvailableSpace());

      // Preference ratio, scaled by the size of each list
      float preferencePercentScaler =
          (highAvailableVolumes.size() * balancedPreferencePercent) +
          (lowAvailableVolumes.size() * (1 - balancedPreferencePercent));
      float scaledPreferencePercent =
          (highAvailableVolumes.size() * balancedPreferencePercent) /
          preferencePercentScaler;

      // If even the roomiest low available space volume cannot hold the
      // replica, or the random draw falls below the scaled ratio, choose
      // round-robin among the high available space volumes
      if (mostAvailableAmongLowVolumes < replicaSize ||
          random.nextFloat() < scaledPreferencePercent) {
        volume = roundRobinPolicyHighAvailable.chooseVolume(
            highAvailableVolumes, replicaSize);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Volumes are imbalanced. Selecting " + volume +
              " from high available space volumes for write of block size "
              + replicaSize);
        }
      } else {
        // Otherwise choose among the low available space volumes
        volume = roundRobinPolicyLowAvailable.chooseVolume(
            lowAvailableVolumes, replicaSize);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Volumes are imbalanced. Selecting " + volume +
              " from low available space volumes for write of block size "
              + replicaSize);
        }
      }
      return volume;
    }
  }

Logic behind the high and low available space volume lists:

/**
 * Used to keep track of the list of volumes we're choosing from.
 */
private class AvailableSpaceVolumeList {

  // omitted
  ................

  /**
   * @return the maximum amount of space available across volumes with low space.
   */
  public long getMostAvailableSpaceAmongVolumesWithLowAvailableSpace() {
    long mostAvailable = Long.MIN_VALUE;
    for (AvailableSpaceVolumePair volume : getVolumesWithLowAvailableSpace()) {
      mostAvailable = Math.max(mostAvailable, volume.getAvailable());
    }
    return mostAvailable;
  }

  /**
   * @return the list of volumes with relatively low available space.
   */
  public List<AvailableSpaceVolumePair> getVolumesWithLowAvailableSpace() {
    long leastAvailable = getLeastAvailableSpace();
    List<AvailableSpaceVolumePair> ret = new ArrayList<AvailableSpaceVolumePair>();
    for (AvailableSpaceVolumePair volume : volumes) {
      // Available space is at most (least available space + balance threshold)
      if (volume.getAvailable() <= leastAvailable + balancedSpaceThreshold) {
        ret.add(volume);
      }
    }
    return ret;
  }
}
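As a rough illustration of how this split behaves, the sketch below partitions a set of free-space values into low and high groups. It assumes the low group is every volume whose free space is at most the minimum plus the threshold, and the high group is everything else. The class VolumeSplitSketch, its method names, and the sample numbers (in GB) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative low/high split over free-space values; not Hadoop's own class. */
final class VolumeSplitSketch {

  /** Smallest free-space value across all volumes. */
  static long leastAvailable(long[] available) {
    long least = Long.MAX_VALUE;
    for (long a : available) {
      least = Math.min(least, a);
    }
    return least;
  }

  /** Indices of volumes whose free space is at most (minimum + threshold). */
  static List<Integer> lowVolumes(long[] available, long threshold) {
    long least = leastAvailable(available);
    List<Integer> low = new ArrayList<>();
    for (int i = 0; i < available.length; i++) {
      if (available[i] <= least + threshold) {
        low.add(i);
      }
    }
    return low;
  }

  /** Indices of the remaining volumes (assumed complement of the low group). */
  static List<Integer> highVolumes(long[] available, long threshold) {
    long least = leastAvailable(available);
    List<Integer> high = new ArrayList<>();
    for (int i = 0; i < available.length; i++) {
      if (available[i] > least + threshold) {
        high.add(i);
      }
    }
    return high;
  }

  public static void main(String[] args) {
    // Hypothetical free space per volume in GB, with a 10 GB threshold.
    long[] availableGb = {10, 12, 50, 200};
    System.out.println("low:  " + lowVolumes(availableGb, 10));
    System.out.println("high: " + highVolumes(availableGb, 10));
  }
}
```

With these numbers the minimum is 10 GB, so volumes 0 and 1 (at most 20 GB free) fall into the low group and volumes 2 and 3 into the high group. Note that the split is relative to the emptiest-of-space volume, not to any absolute fill level.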

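As shown, the available space policy uses the configured balance threshold to split the volumes into two lists, high available space and low available space, and then uses a random draw to pick a volume from the high available space list with correspondingly higher probability.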
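The probability of landing in the high available space list follows directly from the two float expressions in chooseVolume. The sketch below reproduces only that arithmetic. The class and method names are hypothetical, and the value 0.75 for balancedPreferencePercent is used here as an assumed setting rather than something stated in the source above.

```java
/** Reproduces the preference scaling arithmetic from chooseVolume. */
final class PreferenceSketch {

  /**
   * Probability that a write goes to the high available space list,
   * given the sizes of both lists and the configured preference.
   */
  static float scaledPreference(int highCount, int lowCount,
      float balancedPreferencePercent) {
    float scaler = (highCount * balancedPreferencePercent)
        + (lowCount * (1 - balancedPreferencePercent));
    return (highCount * balancedPreferencePercent) / scaler;
  }

  public static void main(String[] args) {
    // Assumed preference of 0.75 in both cases.
    System.out.println("2 high, 2 low: " + scaledPreference(2, 2, 0.75f));
    System.out.println("1 high, 3 low: " + scaledPreference(1, 3, 0.75f));
  }
}
```

With equal list sizes the probability equals the configured preference (0.75). With one high volume against three low volumes it drops to 0.5, which is still three times the per-volume share of each low volume. The scaling therefore keeps the per-volume preference constant regardless of how the volumes are divided.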

At this point the available space policy looks like a good answer to disk imbalance, so why does the DataNode still default to the round-robin policy?

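Long-running clusters run into the following scenario: the disks on every DataNode are heavily used, above 90%, and then one disk on a DataNode is replaced. Under the available space policy, a new block lands on the replacement disk with a much higher probability than on any other disk, leaving the other disks comparatively idle. That one disk absorbs a disproportionate share of the write I/O, which lowers overall disk I/O efficiency, and disk I/O can become the bottleneck for the whole cluster.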
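A small simulation gives a feel for the size of this effect. It is a sketch under several assumptions: one empty replacement disk plus 11 nearly full disks, a preference of 0.75, low-space disks that can always still hold the block, and a fixed random seed. The class ReplacedDiskSketch is hypothetical and models only the random draw and the rotation inside the low group.

```java
import java.util.Random;

/** Simulates write placement after one disk is replaced; illustrative only. */
final class ReplacedDiskSketch {

  /**
   * Returns the number of blocks each disk receives. Index 0 is the new
   * (high available space) disk; indices 1..lowCount are the nearly full disks.
   * Assumes every low-space disk can still hold each block.
   */
  static long[] simulate(int lowCount, float preference, int writes, long seed) {
    Random random = new Random(seed);
    // Same scaling as in chooseVolume, with exactly one high-space volume.
    float scaler = (1 * preference) + (lowCount * (1 - preference));
    float pHigh = (1 * preference) / scaler;

    long[] blocks = new long[lowCount + 1];
    int nextLow = 0;
    for (int i = 0; i < writes; i++) {
      if (random.nextFloat() < pHigh) {
        blocks[0]++;
      } else {
        // Round-robin among the low available space disks.
        blocks[1 + nextLow]++;
        nextLow = (nextLow + 1) % lowCount;
      }
    }
    return blocks;
  }

  public static void main(String[] args) {
    int writes = 100_000;
    long[] blocks = simulate(11, 0.75f, writes, 42L);
    System.out.printf("new disk share:     %.3f%n", blocks[0] / (double) writes);
    System.out.printf("one old disk share: %.3f%n", blocks[1] / (double) writes);
  }
}
```

Under these assumptions the expected share of the new disk is 0.75 / 3.5, about 21% of all writes, against roughly 7% for each old disk. The skew becomes total once the old disks can no longer fit a block, because the policy then sends every write to the high available space list.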

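Summary:

The source code shows that both DataNode volume choosing policies keep disk usage balanced to some degree, but each has its own weaknesses, so the policy has to be chosen, and changed when necessary, to suit the state of each cluster.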
