String Similarity in R: The stringdist Package


2023-07-29 12:51 | Source: compiled from the web

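String similarity can be computed with the adist function in the utils package, the stringDist function in the MKmisc package, or string comparison functions such as jarowinkler in the RecordLinkage package. This article covers the stringdist and stringdistmatrix functions of the stringdist package, written by Mark van der Loo.

stringdist computes the distance between corresponding elements of a and b; if one argument has fewer elements than the other, the shorter one is recycled. stringdistmatrix returns a distance matrix whose rows correspond to the elements of a and whose columns correspond to the elements of b. Both functions return distances rather than similarities: 0 means the strings are identical under the chosen method, and larger values mean the strings differ more.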
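The matrix orientation and the recycling rule described above can be tried without installing anything, using the adist function from utils mentioned at the start. This is only a sketch with base R: adist computes a generalized Levenshtein distance, which is not identical to stringdist's default "osa" method, and the mapply call is merely an illustrative stand-in for the element-wise behaviour of stringdist(a, b).

```r
# Base R only: utils::adist computes generalized Levenshtein distances.
a <- c("foo", "bar", "boo")
b <- c("baz", "buz")

# Rows follow the elements of a, columns follow the elements of b.
d <- utils::adist(a, b)
print(d)

# Element-wise comparison with recycling: the shorter vector c("a", "c")
# is reused as "a", "c", "a", "c" to match the four elements on the left.
pairwise <- mapply(
  function(x, y) drop(utils::adist(x, y)),
  c("a", "b", "c", "d"),
  c("a", "c"),
  USE.NAMES = FALSE
)
print(pairwise)
```

For these particular strings no transposition is involved, so the Levenshtein values coincide with the "osa" values shown in the examples further down.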

stringdist(a, b,
    method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
    useBytes = FALSE,
    weight = c(d = 1, i = 1, s = 1, t = 1),
    maxDist = Inf, q = 1, p = 0,
    nthread = getOption("sd_num_thread"))

stringdistmatrix(a, b,
    method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
    useBytes = FALSE,
    weight = c(d = 1, i = 1, s = 1, t = 1),
    maxDist = Inf, q = 1, p = 0,
    useNames = c("none", "strings", "names"),
    ncores = 1, cluster = NULL,
    nthread = getOption("sd_num_thread"))

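Arguments:
a, b: the character vectors to compare
method: the distance method; the default is "osa", and other choices include "jaccard", "hamming" and "jw" (Jaro-Winkler)
useBytes: compare strings byte by byte instead of character by character
weight: penalties for deletion, insertion, substitution and transposition, in that order; each weight must be positive and must not exceed 1
maxDist: upper limit on the distance
q: size of the q-grams, used with method = "qgram", "jaccard" or "cosine"; must be non-negative
p: penalty factor for the Jaro-Winkler distance; the default is 0 (plain Jaro distance) and valid values lie between 0 and 0.25
nthread: maximum number of threads
useNames: label the rows and columns of the output with the input strings or with the names of the inputs
ncores: number of cores
cluster: a custom cluster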
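To make the role of p more concrete, here is a small sketch of how the Jaro-Winkler distance is commonly derived from the Jaro distance: the Jaro distance is scaled by (1 - l * p), where l is the length of the common prefix, capped at 4 characters. Both helper functions are hypothetical and written for illustration only; they are not part of stringdist, and the Jaro distance of 1/12 for 'MARTHA' and 'MATHRA' is taken from the example output in this article rather than computed here.

```r
# Hypothetical helper (not part of stringdist): length of the common
# prefix of two strings, capped at `cap` characters.
common_prefix_len <- function(a, b, cap = 4) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb), cap)
  if (n == 0) return(0)
  mismatch <- which(ca[seq_len(n)] != cb[seq_len(n)])
  if (length(mismatch) == 0) n else mismatch[1] - 1
}

# Hypothetical helper: apply the prefix correction to a given Jaro distance.
# Assumes the usual definition d_jw = d_jaro * (1 - l * p).
jaro_winkler_from_jaro <- function(jaro_dist, a, b, p = 0.1) {
  stopifnot(p >= 0, p <= 0.25)
  jaro_dist * (1 - common_prefix_len(a, b) * p)
}

# 'MARTHA' and 'MATHRA' share the prefix "MA", so l = 2.
jw <- jaro_winkler_from_jaro(1 / 12, "MARTHA", "MATHRA", p = 0.1)
print(jw)
```

With p = 0 the correction vanishes and the plain Jaro distance is returned, which matches the default behaviour described for the p argument.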

Examples:

> stringdistmatrix(c("foo","bar","boo"),c("baz","buz"))
     [,1] [,2]
[1,]    3    3
[2,]    1    2
[3,]    2    2
> # string distance matching is case sensitive:
> stringdist("ABC","abc")
[1] 3
>
> # so you may want to normalize a bit:
> stringdist(tolower("ABC"),"abc")
[1] 0
>
> # stringdist recycles the shortest argument:
> stringdist(c('a','b','c'),c('a','c'))
Warning message:
longer object length is not a multiple of shorter object length
[1] 0 1 1
>
> # different edit operations may be weighted; e.g. weighted substitution:
> stringdist('ab','ba',weight=c(1,1,1,0.5))
[1] 0.5
>
> # Non-unit weights for insertion and deletion make the distance metric asymmetric
> stringdist('ca','abc')
[1] 3
> stringdist('abc','ca')
[1] 3
> stringdist('ca','abc',weight=c(0.5,1,1,1))
[1] 2
> stringdist('abc','ca',weight=c(0.5,1,1,1))
[1] 2.5
> # q-grams are based on the difference between occurrences of q consecutive characters
> # in string a and string b.
> # Since each character abc occurs in 'abc' and 'cba', the q=1 distance equals 0:
> stringdist('abc','cba',method='qgram',q=1)
[1] 0
>
> # since the first string consists of 'ab','bc' and the second
> # of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
> stringdist('abc','cba',method='qgram',q=2)
[1] 4
> stringdist('MARTHA','MATHRA',method='jw')
[1] 0.08333333
> # Note that stringdist gives a _distance_ where Wikipedia gives the corresponding
> # _similarity measure_. To get the Wikipedia result:
> 1 - stringdist('MARTHA','MATHRA',method='jw')
[1] 0.9166667
>
> # The corresponding Jaro-Winkler distance can be computed by setting p=0.1
> stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.06666667
> # or, as a similarity measure
> 1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.9333333
>
> # This gives distance 1 since Euler and Gauss translate to different soundex codes.
> stringdist('Euler','Gauss',method='soundex')
[1] 1
> # Euler and Ellery translate to the same code and have distance 0
> stringdist('Euler','Ellery',method='soundex')
[1] 0
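The q-gram results above can be reproduced by hand in base R, which may help in seeing what the method counts. The sketch below assumes the q-gram distance is the sum of absolute differences between the q-gram counts of the two strings, as the comments in the examples describe, and assumes q >= 1. The functions qgrams and qgram_distance are hypothetical illustrations, not the package's implementation.

```r
# Hypothetical helper (not part of stringdist): all substrings of
# length q in s, in order of appearance. Assumes q >= 1.
qgrams <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  substring(s, 1:(n - q + 1), q:n)
}

# Hypothetical helper: sum of absolute differences between the
# q-gram counts of a and b.
qgram_distance <- function(a, b, q = 1) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  all_grams <- union(ga, gb)
  # Counting against a shared set of levels keeps zero counts aligned.
  count_a <- as.integer(table(factor(ga, levels = all_grams)))
  count_b <- as.integer(table(factor(gb, levels = all_grams)))
  sum(abs(count_a - count_b))
}

# 'abc' and 'cba' contain the same single characters ...
print(qgram_distance("abc", "cba", q = 1))
# ... but share none of their 2-grams: ab, bc versus cb, ba.
print(qgram_distance("abc", "cba", q = 2))
```

Because only counts are compared, the q = 1 distance ignores character order entirely; raising q makes the measure more sensitive to order.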

References:
https://www.rdocumentation.org/packages/stringdist/versions/0.9.4.2/topics/stringdist
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf


