Hadoop Chinese Word Frequency Counting
Anyone learning Hadoop runs through WordCount, but it is the simplest possible example, and it counts English words split on whitespace. Counting Chinese is much harder than counting English, because Chinese involves semantics and ambiguous word boundaries; even with today's technology, no Chinese segmentation tool fully matches human judgment. Still, there are usable tools, such as IK Analyzer, which this post tries out.

Thanks first to the blog post that guided this: http://www.cnblogs.com/jiejue/archive/2012/12/16/2820788.html

1. Experiment environment

hadoop 1.2.1
java 1.7
node: only one

2. Data preparation

The test corpus is the complete web novel 《凡人修仙传》 (A Record of a Mortal's Journey to Immortality), about 20 MB. A personal favorite.

3. Experiment process

1) Modify the WordCount code to apply IK Analyzer's Chinese word segmentation. IK Analyzer is an open-source tool; see http://code.google.com/p/ik-analyzer/
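Before wiring the segmenter into MapReduce, it may help to see IK Analyzer's core API on its own. The following minimal sketch is my addition, not part of the original post; it assumes the IK Analyzer jar is on the classpath and uses the same IKSegmenter and Lexeme calls as the job below:

import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKDemo {
    public static void main(String[] args) throws IOException {
        // the second argument selects "smart" mode, which prefers
        // coarser-grained words over exhaustive fine-grained splits
        IKSegmenter seg = new IKSegmenter(new StringReader("学习Hadoop的中文词频统计"), true);
        Lexeme t;
        while ((t = seg.next()) != null) {
            System.out.println(t.getLexemeText());
        }
    }
}

Running it should print one token per line, which makes it easy to check the dictionary and mode settings before launching a full job.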
The modified source:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class ChineseWordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Text.getBytes() returns the backing array, which can be longer
            // than the valid data, so bound the stream by getLength()
            byte[] bt = value.getBytes();
            InputStream ip = new ByteArrayInputStream(bt, 0, value.getLength());
            Reader read = new InputStreamReader(ip);
            // second argument true selects IK Analyzer's smart mode
            IKSegmenter iks = new IKSegmenter(read, true);
            Lexeme t;
            while ((t = iks.next()) != null) {
                word.set(t.getLexemeText());
                context.write(word, one);
            }
        }
    }
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(ChineseWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // the reducer also serves as a combiner to cut shuffle traffic
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2) To make the job's progress easier to watch, package it and run it on the cluster. Be sure to bundle the IK Analyzer jar along with the job jar. I have uploaded the packaged jar, the IK Analyzer jar, and the test text to a share: http://pan.baidu.com/s/1jGwVSEy

First upload the test file to the input directory on HDFS:

hadoop dfs -copyFromLocal part-all.txt input

Then start the job:

hadoop jar chinesewordcount.jar input output

Wait for it to finish; screenshots omitted.
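Incidentally, the raw counts will include a great many single-character tokens (step 4 below confirms this). If you would rather drop them inside the job than filter afterwards, here is a hypothetical variant of the mapper's map() method, my sketch rather than part of the original post:

    // Variant of TokenizerMapper.map() that skips single-character tokens
    // at the source, making the awk post-filter in step 3 unnecessary.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        InputStream ip = new ByteArrayInputStream(value.getBytes(), 0, value.getLength());
        IKSegmenter iks = new IKSegmenter(new InputStreamReader(ip), true);
        Lexeme t;
        while ((t = iks.next()) != null) {
            String text = t.getLexemeText();
            // String.length() counts Java chars, so one Chinese character == 1
            if (text.length() >= 2) {
                word.set(text);
                context.write(word, one);
            }
        }
    }

With this change, the awk extraction step below becomes optional.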
3) Post-process the data. The generated output is not sorted by frequency, so a series of steps is still needed. With the HDFS output copied to a local file words.txt, first inspect it:

head words.txt
tail words.txt

A plain sort on the count column is lexical and ascending:

sort -k2 words.txt > 0.txt
head 0.txt
tail 0.txt

Reversing with -r still orders the counts as strings:

sort -k2r words.txt > 0.txt
head 0.txt
tail 0.txt

Adding -n sorts the counts numerically, putting the most frequent words first:

sort -k2rn words.txt > 0.txt
head -n 50 0.txt

Target extraction: keep only entries whose word is at least two characters long:

awk '{if(length($1)>=2) print $0}' 0.txt > 1.txt

(Note that in a non-UTF-8 locale awk's length() counts bytes, and a single Chinese character is 3 bytes in UTF-8, so it can slip past this filter; that is one likely reason single characters survive into the results below.)

Display the final result with line numbers:

head 1.txt -n 200 | sed = | sed 'N;s/\n//'

4) Results

The data still contains many single characters, which are useless, so the final result may need some manual cleanup. The final result is in the share for anyone interested: http://pan.baidu.com/s/1hqn66MC

4. Summary

Chinese word segmentation really is complicated. All I can say is: keep at it.

Learning and exchange are welcome. For reprints, please credit http://hanlaiming.freetzi.com/?p=273