Big Data Management Technology Lab

2024-07-13 17:17 | Source: compiled from the web

Big Data Management Technology Lab: MapReduce

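Contents
  Big Data Management Technology Lab: MapReduce
    Requirements
    Base code
      1. The map part
      2. The reduce part
    Improved code
    Running the job (command-line shell)
      1. Start HDFS
      2. Initialize/format (skip if there is no previous input or output)
      3. Package the jar
      4. Run the program
      5. Some bugs
        5.1 HDFS Corrupt block
        5.2 The "-" problem in regular expressions
        5.3 retry policy is...
        5.4 SLF4J: Class path contains multiple SLF4J bindings.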

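Requirements:
  1. Run word count on New Concept English Book 2 (any given txt document will do).
  2. On top of that, implement a version of WordCount that strips punctuation.

Base code

Hadoop's bundled examples include a WordCount program. The code is as follows: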

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Emits (word, 1) for every whitespace-separated token of a line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Sums the counts of each word; also used as the combiner.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.println("Usage: wordcount <in> <out>");
                System.exit(2);
            }
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Let's take it apart.

1. The map part

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

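Here itr = new StringTokenizer(value.toString()) converts the Text value to a String and wraps it in a StringTokenizer. The class also has a constructor StringTokenizer(String str, String delim), which builds a tokenizer for str using the given delimiter characters. When delim is omitted, as in this mapper, the default delimiters are the whitespace characters: space, tab, newline, carriage return and form feed.

The method hasMoreTokens() checks whether any tokens are left in the string (not whether any delimiters are left).

After these two steps, each line of the txt file that is passed to map has been split on whitespace into individual words. Each word then becomes a key with a frequency of 1, forming the key-value pair (word, 1), which is written to the context.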
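To look at the tokenizing step in isolation, here is a minimal sketch in plain Java with no Hadoop dependency. The class TokenizeDemo and the helper emitPairs are made-up names for illustration; the real mapper writes to the context instead of building a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-alone demo; not part of the Hadoop example.
final class TokenizeDemo {

    // Splits one input line on StringTokenizer's default delimiters
    // (space, tab, newline, carriage return, form feed) and returns one
    // "word<TAB>1" string per token, mimicking the mapper's (word, 1) output.
    static List<String> emitPairs(String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {   // true while at least one token is left
            pairs.add(itr.nextToken() + "\t1");
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "It was Sunday.\tI never get up early on Sundays.";
        for (String pair : emitPairs(line)) {
            System.out.println(pair);
        }
    }
}
```

Note that punctuation stays attached to the tokens: "Sunday." keeps its full stop, so it would be counted separately from "Sunday". That is the problem the improved version addresses.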

2. The reduce part

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

This simply takes the pairs that share the same key (the word) and adds up their frequencies. For example, two (apple, 1) pairs are merged into one (apple, 2).
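The grouping and summing can be imitated in plain Java. This is only a sketch under assumptions: ReduceDemo, sum and countWords are hypothetical names, and a TreeMap stands in for the shuffle phase that Hadoop performs between map and reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone demo; not part of the Hadoop example.
final class ReduceDemo {

    // Same summing loop as IntSumReducer.reduce, applied to one key's values.
    static int sum(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for shuffle + reduce: group a 1 under each word, then sum each group.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), sum(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {apple=2, pear=1}
        System.out.println(countWords(Arrays.asList("apple", "pear", "apple")));
    }
}
```

Because addition is associative, the same summing logic can run on partial groups first, which is why the job is able to reuse IntSumReducer as its combiner.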

Improved code

Building on the original WordCount, stripping punctuation only takes a small change to the map part:

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private Text word2; // new code

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                /* new code */
                String s = word.toString();
                // character class listing every punctuation mark to strip
                String regEx = "[`~☆★!@#$%^&*()+=|{}':;,\\[\\]》·./?~！@#￥%……（）——+|{}【】‘；：”“’。\"，\\-、？]";
                String s1 = s.replaceAll(regEx, "");
                word2 = new Text(s1);
                context.write(word2, one);
            }
        }
    }

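The changes are as follows. A Text word2 field was added outside the map function (for the conversion later). Inside the while loop, the Text word is first converted to the String s so that replaceAll can be used (I couldn't find an equivalent method on Text). regEx lists all the punctuation marks, so s1 = s.replaceAll(regEx, "") strips every one of them; whitespace is not in the list, and the tokenizer has already split on it by this point. Finally the String s1 is converted back to the Text word2 and written to the context as (word2, one).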
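The stripping step can be tried outside Hadoop with a few lines of plain Java. This sketch makes one substitution that should be stated clearly: instead of the long regEx above it uses the predefined class \p{Punct}, which covers ASCII punctuation only. That is an assumption that suits an English text but would miss the full-width marks the original list includes. StripPunctuationDemo and strip are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; not part of the Hadoop job.
final class StripPunctuationDemo {

    // ASCII punctuation only; swapped in for the longer regEx to keep the demo short.
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    // Removes every punctuation character from a single token.
    static String strip(String token) {
        return PUNCT.matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Sunday."));     // Sunday
        System.out.println(strip("\"Hello,\""));  // Hello
        System.out.println(strip("don't"));       // dont
        System.out.println(strip("...").isEmpty()); // true
    }
}
```

The last case is worth noticing: a token made only of punctuation becomes an empty string. In the mapper as written, such tokens would still be emitted and counted under an empty key, so checking isEmpty() before calling context.write is one possible refinement.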

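Running the job (command-line shell)

Everything here is done on the command line.

1. Start HDFS

    start-dfs.sh
    start-yarn.sh
    mr-jobhistory-daemon.sh start historyserver

2. Initialize/format (skip if there is no previous input or output)

    hdfs dfs -rm input/*
    hdfs dfs -rm -r output/
    hdfs dfs -put 新概念英语第二册.txt input/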

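Note that the first path given to -put is on the virtual machine (the local machine) and only the second one is in HDFS. So if your shell is sitting at, say, "alice@Master:~$", watch the relative path, otherwise you will get a "No such file or directory" error. I have been bitten by this more than once.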

3. Package the jar

Export process:

[Screenshot: export process]

I created a myapp folder just for holding the program's jar (this step is entirely optional).

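4. Run the program

With HDFS running:

    hadoop jar WordCount.jar input output
    # or: hadoop jar ./xxxxx/WordCount.jar input output

Note that you should first change directory with cd /usr/local/hadoop/hadoop-2.7.7/myapp (the prompt then reads "alice@Master:/usr/local/hadoop/hadoop-2.7.7/myapp$") and run the hadoop command from inside myapp, because that is where WordCount.jar was saved. Alternatively, give the correct path to the jar when calling it, for example /usr/local/hadoop/hadoop-2.7.7/myapp/WordCount.jar.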

At this point the job should have run successfully. I hit a few bugs myself; see the end for details. First, the successful result.

Run:

    hdfs dfs -ls -h output

You should see the result:

[Screenshot: listing of the output directory]

output/part-r-00000 holds 25.7k of data (without punctuation removal it was, I think, 34.9k).

Fetch it to the local machine:

    hdfs dfs -get output/part-r-00000 ./
    cat part-r-00000

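You can see that without removing punctuation the result is rather messy:

[Screenshot: output without punctuation removal]

With the punctuation-stripping version of the jar the result is much cleaner:

[Screenshot: output with punctuation removed]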

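Of course, this result is sorted lexicographically by word. To sort by frequency instead, use:

    sort -n -k2 part-r-00000

The result is then sorted by the number in the second column (the frequency), in ascending order.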
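The same ordering can be produced in Java, which may be handy if the counts are processed further in code. This is a sketch under assumptions: it expects each line to be "word<TAB>count" (the tab is the default separator of the job's text output), and SortByCountDemo and sortByCount are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone demo; not part of the Hadoop job.
final class SortByCountDemo {

    // Returns a copy of the lines ordered by their count column, ascending.
    // List.sort is stable, so lines with equal counts keep their original order.
    static List<String> sortByCount(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingInt(SortByCountDemo::countOf));
        return sorted;
    }

    // Parses the number after the last tab of a "word<TAB>count" line.
    private static int countOf(String line) {
        return Integer.parseInt(line.substring(line.lastIndexOf('\t') + 1).trim());
    }

    public static void main(String[] args) {
        // Made-up sample lines, standing in for the contents of part-r-00000.
        List<String> sample = Arrays.asList("a\t12", "the\t30", "zebra\t1");
        for (String line : sortByCount(sample)) {
            System.out.println(line);
        }
    }
}
```

To sort a real result file, the sample list could be replaced by java.nio.file.Files.readAllLines on the fetched part-r-00000.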

5. Some bugs

5.1 HDFS Corrupt block

At run time I hit an HDFS Corrupt block problem. Solution: https://blog.csdn.net/lingbo229/article/details/81128316

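First check for corrupt blocks:

    hdfs fsck / -list-corruptfileblocks

Mine showed corrupt block=13, and I couldn't be bothered with anything elaborate, so I simply deleted them all:

    hdfs fsck / -delete

Be aware that -delete removes the corrupted files themselves, so use it only when the data can be uploaded again. Checking afterwards shows corrupt block=0.

After running the jar again, corrupt block stays at 0 and the job runs normally.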

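5.2 The "-" problem in regular expressions

Adding a bare - to the regEx used for stripping symbols throws an error; it has to be written as \\-. Inside a character class, an unescaped - between two characters denotes a range, and Java rejects a range whose start is greater than its end. Writing \\- in the Java string literal produces the regex \-, which is a literal hyphen.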
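The failure is easy to reproduce in isolation. The sketch below assumes the hyphen sat between the full-width comma and the ideographic comma, as it does in the regEx above; that is most likely what triggered the error, though the original post does not show the exception. HyphenInClassDemo and compiles are hypothetical names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical stand-alone demo; not part of the Hadoop job.
final class HyphenInClassDemo {

    // Returns true if the regex compiles, false if Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+FF0C (full-width comma), hyphen, U+3001 (ideographic comma):
        // read as a range from U+FF0C down to U+3001, which is invalid.
        System.out.println(compiles("[\uFF0C-\u3001]"));    // false

        // Escaped hyphen: three literal characters inside the class.
        System.out.println(compiles("[\uFF0C\\-\u3001]"));  // true

        // The escaped form removes hyphens as intended.
        System.out.println("well-known".replaceAll("[\uFF0C\\-\u3001]", "")); // wellknown
    }
}
```

A hyphen placed first or last inside the brackets is also treated literally, but escaping it is the more robust choice when the list of symbols is long and likely to be edited.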

5.3 retry policy is...

Another rather silly problem: Hadoop was not fully started because I forgot start-yarn.sh, so the job kept printing something like "retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime". Running start-yarn.sh fixes it.

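5.4 SLF4J: Class path contains multiple SLF4J bindings.

When running hdfs commands I got the following messages:

    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.7.7/myapp/WordCount.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

This happens because more than one SLF4J binding is on the classpath: here, one in Hadoop's own lib directory and one packaged inside WordCount.jar. Remove the duplicates until only one is left. As the last line shows, SLF4J still picks a binding, so this is a warning and the command keeps working. No idea why Java has to be this exasperating; Python seems to have similar problems.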

That seems to be everything. Thanks for reading!
