Big Data Hadoop Series: MapReduce Word Count in Python

2023-03-07 14:17 | Source: compiled from the web

Map code: map_t.py

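import re
import sys

# Matches a run of word characters (letters, digits, underscore).
p = re.compile(r'\w+')

for line in sys.stdin:
    # Split on single spaces, then keep the first word-like run of each token.
    for s in line.strip().split(' '):
        words = p.findall(s)
        if not words:
            continue
        # Emit "word,1"; the comma separates the word from its count.
        print(words[0].lower() + ',1')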
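To make the mapper's tokenization rule concrete, the sketch below wraps the same logic in a function and runs it on one sample sentence. The name `map_line` and the sample text are illustrative assumptions, not part of the original scripts. Only the first `\w+` run of each space-separated token is kept, so punctuation is stripped and a token such as "Don't" contributes only "don".

```python
import re

# Same pattern as map_t.py: a run of letters, digits, or underscores.
WORD = re.compile(r'\w+')


def map_line(line):
    """Return the 'word,1' records the mapper would emit for one input line.

    map_line is a hypothetical helper used only for illustration.
    """
    records = []
    for token in line.strip().split(' '):
        matches = WORD.findall(token)
        if not matches:
            # Empty tokens (from repeated spaces) or pure punctuation.
            continue
        records.append(matches[0].lower() + ',1')
    return records


if __name__ == '__main__':
    for record in map_line("Don't panic, Mr. Forsyte!"):
        print(record)
```

Whether dropping everything after the first word-like run is acceptable depends on the corpus; for contractions and hyphenated words it undercounts the trailing parts.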

Reduce code: red_t.py

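import sys

cur_word = None
total = 0

for line in sys.stdin:
    # Each input line is "word,count"; equal words must arrive adjacent (sorted).
    word, val = line.strip().split(',')
    if cur_word is None:
        cur_word = word
    if cur_word != word:
        # The word changed: emit the finished total and start a new one.
        print('%s\t%s' % (cur_word, total))
        cur_word = word
        total = 0
    total += int(val)

# Emit the last word, unless the input was empty.
if cur_word is not None:
    print('%s\t%s' % (cur_word, total))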
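The reducer depends on records for the same word arriving next to each other. As an alternative way to express that, here is a minimal sketch using `itertools.groupby` from the standard library. The function name `reduce_sorted` and the sample records are assumptions for illustration; this is not the original script, only one possible restatement of it.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (word, total) pairs from sorted 'word,count' lines.

    reduce_sorted is a hypothetical name; the input must already be
    sorted so that equal words are adjacent.
    """
    # Skip blank lines, then split each record into [word, count].
    pairs = (line.strip().split(',') for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


if __name__ == '__main__':
    sample = ['a,1\n', 'a,1\n', 'b,1\n']
    for word, total in reduce_sorted(sample):
        print('%s\t%d' % (word, total))
```

Because `groupby` only merges adjacent equal keys, unsorted input would produce several partial totals for the same word, which is the same failure mode as the hand-written loop.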

Local test in the shell:

cat The_Man_of_Property.txt | python map_t.py | sort | python red_t.py
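The shell pipeline imitates what Hadoop does between the two phases: map, sort so that equal keys are adjacent, then reduce. The sketch below reproduces that chain in a single Python process and cross-checks the result against `collections.Counter`. The function names and the two sample lines are assumptions for illustration only.

```python
import re
from collections import Counter

WORD = re.compile(r'\w+')


def mapper(lines):
    """Yield 'word,1' records, following the same rule as map_t.py."""
    for line in lines:
        for token in line.strip().split(' '):
            matches = WORD.findall(token)
            if matches:
                yield matches[0].lower() + ',1'


def reducer(sorted_records):
    """Yield (word, total) pairs from records sorted by word."""
    cur_word, total = None, 0
    for record in sorted_records:
        word, val = record.split(',')
        if cur_word is not None and word != cur_word:
            yield cur_word, total
            total = 0
        cur_word = word
        total += int(val)
    if cur_word is not None:
        yield cur_word, total


def word_count(lines):
    """Run map, sort, and reduce in one process (stand-in for the shell pipe)."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == '__main__':
    text = ["The man of property", "the Man. THE property!"]
    counts = word_count(text)
    # Independent check: count the mapper's words directly.
    expected = Counter(record.split(',')[0] for record in mapper(text))
    print(counts)
    print(counts == dict(expected))
```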

Job submission script: run.sh

HADOOP_CMD="/usr/local/src/hadoop-2.6.1/bin/hadoop"
STREAM_JAR_PATH="/usr/local/src/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar"

INPUT_FILE_PATH_1="/data/The_Man_of_Property.txt"
OUTPUT_PATH="/output/wc"

# Remove any previous output; the job fails if the output directory already exists.
$HADOOP_CMD fs -rm -r -skipTrash $OUTPUT_PATH

# Submit the streaming job, shipping both scripts to the cluster nodes.
$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_1 \
    -output $OUTPUT_PATH \
    -mapper "python map_t.py" \
    -reducer "python red_t.py" \
    -file ./map_t.py \
    -file ./red_t.py
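Once the job finishes, the output directory holds lines of the form word, tab, count, as printed by the reducer. As a possible follow-up step, the sketch below parses such lines and returns the most frequent words. The function name `top_words` and the sample lines are assumptions for illustration, not part of the original post.

```python
def top_words(lines, n=10):
    """Parse 'word<TAB>count' lines and return the n most frequent words.

    top_words is a hypothetical helper; it accepts any iterable of lines.
    """
    counts = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        word, count = line.split('\t')
        counts.append((word, int(count)))
    # Highest count first; ties broken alphabetically for stable output.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:n]


if __name__ == '__main__':
    sample = ['man\t2\n', 'of\t1\n', 'property\t2\n', 'the\t3\n']
    for word, count in top_words(sample, n=3):
        print('%s\t%d' % (word, count))
```

Since the function takes any iterable of lines, it could also be fed `sys.stdin`, for example with the job output piped in through `hadoop fs -cat`.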

