Chinese Word Frequency Counting with Hadoop


2023-03-17 20:47 | Source: compiled from the web

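Anyone learning Hadoop runs into WordCount sooner or later, but the standard example is the simplest possible case: it counts English words that are split on whitespace. Chinese is much harder, because the text has no word delimiters and the correct segmentation depends on meaning and context. Even with current technology, no Chinese word-frequency tool fully matches human judgment, but usable tools do exist, IK Analyzer among them. This post tries it out.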
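To make the difference concrete, here is a minimal sketch of whitespace tokenization, which is roughly what the stock WordCount does. The class name, method name, and sample sentences are made up for illustration. An English line yields one token per word, while a Chinese sentence comes back as a single token, so it cannot be counted word by word.

```java
import java.util.Arrays;

// Illustrative only: shows why whitespace tokenization breaks down on Chinese text.
final class WhitespaceSplitDemo {

    // Splits a line on runs of whitespace, roughly what the classic WordCount does.
    static String[] whitespaceTokens(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // English: one token per word.
        System.out.println(Arrays.toString(whitespaceTokens("hello hadoop world")));
        // Chinese: no spaces between words, so the whole sentence is one token.
        System.out.println(Arrays.toString(whitespaceTokens("今天学习中文词频统计")));
    }
}
```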

Thanks first to the blog post that guided me: http://www.cnblogs.com/jiejue/archive/2012/12/16/2820788.html

1. Environment

Hadoop 1.2.1

Java 1.7

Nodes: a single node

2. Data preparation

The test data is the completed novel 《凡人修仙传》, about 20 MB of text; a personal favorite.

3. Procedure

1) Modify the WordCount code. The main change is to use IK Analyzer for Chinese word segmentation. It is an open-source tool; see http://code.google.com/p/ik-analyzer/

import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {

    // Mapper: segments each input line with IK Analyzer and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
            // Decode through toString() rather than getBytes(): the backing byte
            // array can be longer than the valid data, and reading it with the
            // platform default charset can corrupt UTF-8 text.
            StringReader reader = new StringReader(value.toString());
            // The second argument (true) selects IK's "smart" segmentation mode.
            IKSegmenter iks = new IKSegmenter(reader, true);
            Lexeme t;
            while ((t = iks.next()) != null) {
                word.set(t.getLexemeText());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context
                ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(ChineseWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
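The data flow of this job can be followed without a cluster. The sketch below is plain Java and uses neither Hadoop nor IK Analyzer: the segment method is a hypothetical stand-in for IKSegmenter that does greedy forward maximum matching against a tiny hand-written dictionary, which is not how IK Analyzer works internally. The class name, dictionary, and sample lines are all assumptions for illustration. The "map" step emits one count per token and the "reduce" step sums them per word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java walk-through of the word-count data flow; no Hadoop or IK classes.
final class WordCountFlowSketch {

    // Hypothetical stand-in for IKSegmenter: greedy forward maximum matching.
    // Text not covered by the dictionary falls back to single-character tokens.
    // Works on UTF-16 chars, so characters outside the BMP are not handled.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the candidate until it is a dictionary word or one character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    // "Map": emit (token, 1) for every token. "Reduce": sum the ones per token.
    static Map<String, Integer> count(List<String> lines, Set<String> dict, int maxLen) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : segment(line, dict, maxLen)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up dictionary and input, for illustration only.
        Set<String> dict = new HashSet<>(Arrays.asList("凡人", "修仙", "修仙传"));
        List<String> lines = Arrays.asList("凡人修仙传", "凡人修仙");
        System.out.println(count(lines, dict, 3));
    }
}
```

Because summing is associative and commutative, the same reducer class can safely double as the combiner, which is what the job above does with setCombinerClass.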

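2) To make the job's progress easier to follow, package the code as a jar and run it. Note that the IK Analyzer jar must be packaged along with it. I have uploaded the built jar, the IK Analyzer jar, and the test text to a share: http://pan.baidu.com/s/1jGwVSEy

First upload the test file to the input directory on HDFS:

hadoop dfs -copyFromLocal part-all.txt input

Then start the job:

hadoop jar chinesewordcount.jar input output

Wait for it to finish; I'll skip the screenshots.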

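3) Post-processing. The generated output is not sorted, so it still needs a series of processing steps. The commands below work on words.txt, the word-count output saved as a local file.

Take a first look at the raw output:

head words.txt
tail words.txt

Sorting on the second column as text, ascending or descending, does not order the counts by value:

sort -k2 words.txt > 0.txt
head 0.txt
tail 0.txt

sort -k2r words.txt > 0.txt
head 0.txt
tail 0.txt

A reverse numeric sort on the second column does:

sort -k2rn words.txt > 0.txt
head -n 50 0.txt

Extract the target entries (words of length 2 or more):

awk '{if(length($1)>=2) print $0}' 0.txt > 1.txt

Display the final result, with each line prefixed by its line number:

head -n 200 1.txt | sed = | sed 'N;s/\n//'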
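The same post-processing can be sketched in Java, assuming each output line has the form word, tab, count (the default layout of Hadoop's text output). The class name, method name, and sample counts are made up for illustration. One design point: the word length is measured in Unicode code points. An awk that measures length in bytes (for example under the C locale) sees a UTF-8 Chinese character as 3 bytes, so a length >= 2 test would let single characters through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the sort / filter / number pipeline, assuming "word<TAB>count" lines.
final class TopWordsSketch {

    // Keeps words with at least minChars characters, sorts by count descending,
    // and returns at most limit lines, each prefixed with its rank.
    static List<String> topWords(List<String> lines, int minChars, int limit) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) continue; // skip lines that are not word<TAB>count
            String word = parts[0];
            // Count code points, not bytes or UTF-16 units.
            if (word.codePointCount(0, word.length()) >= minChars) {
                rows.add(parts);
            }
        }
        // Numeric, descending: the equivalent of sort -k2rn.
        rows.sort(Comparator.comparingInt((String[] r) -> Integer.parseInt(r[1].trim())).reversed());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < rows.size() && i < limit; i++) {
            out.add((i + 1) + " " + rows.get(i)[0] + "\t" + rows.get(i)[1].trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample counts, not taken from the real run.
        List<String> sample = Arrays.asList("的\t100", "修仙\t30", "韩立\t50", "法宝\t9");
        for (String line : topWords(sample, 2, 200)) {
            System.out.println(line);
        }
    }
}
```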

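4) Results

[Screenshot of the results, taken 2014-04-13 14:26:30]

The data still contains many single-character entries, which are of little use, so the final result may still need some manual cleanup. The final result is on the share for anyone interested: http://pan.baidu.com/s/1hqn66MC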

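4. Summary

Chinese word segmentation really is fairly complicated; all I can say is that I'll keep working at it.

Discussion and shared learning are welcome. If you repost, please credit http://hanlaiming.freetzi.com/?p=273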

Source URL: http://www.tuicool.com/articles/6ZBrMb
