适合中文关键字的哈希函数

2024-07-10 14:14| 来源: 网络整理| 查看: 265

hash函数的策略。对于GB2312编码，设输入的汉字为GBword，我们可以采用公式(C1-176)*94 + (C2-161)确定GBindex。其中，C1表示第一字节，C2表示第二字节。具体如下： GBindex = ((unsigned char)GBword.at(0)-176)*94 + (unsigned char)GBword.at(1) - 161; 之所以用unsigned char类型，是因为char是一个字节，如果用unsigend int，因为int是4个字节的，所以会造成扩展，导致错误。对于GBK编码，设输入的汉字为GBKword，则可以采用公式 index=(ch1-0x81)*190+(ch2-0x40)-(ch2/128)，其中ch1是第一字节，ch2是第二字节。具体的， GBKindex = ((unsigned char)GBKword[0]-129)*190 + ((unsigned char)GBKword[1]-64) - (unsigned char)GBKword[1]/128;

哈希表的性能很大程度上取决于一个哈希函数的好坏。我用单个汉字作为key，5702个汉字冲突率为0，quite impressive。

例如一个对单个GBK汉字的哈希函数可以这么写：

struct hash_CHGBK{ size_t operator()(const char* GBKword) const{ size_t GBKindex; GBKindex = ((unsigned char)GBKword[0]-129)*190 + ((unsigned char)GBKword[1]-64) - (unsigned char)GBKword[1]/128; } };

加上一个比较的函数：

struct eqstr { bool operator()(const char* s1, const char* s2)const{ return strcmp(s1, s2)==0; }; }; 可以这么用

hash_map CHGBK_hashmap;

原帖地址：http://www.iteye.com/problems/60530

原帖作者：deepfuture

hash函数列表

https://www.byvoid.com/blog/string-hash-compare/

【本文地址】

适合中文关键字的哈希函数

适合中文关键字的哈希函数

今日新闻

推荐新闻