String similarity in R: the stringdist package


2023-07-29 12:51 | Source: compiled from the web

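String distances in R can be computed with the adist function in the utils package, the stringDist function in the MKmisc package, or string comparison functions such as jarowinkler in the RecordLinkage package. This article covers the stringdist and stringdistmatrix functions of the stringdist package, written by Mark van der Loo.

stringdist computes the distance between the strings in a and b element by element; when one argument has fewer elements than the other, the shorter one is recycled. stringdistmatrix returns the distance matrix for all pairs, with one row per element of a and one column per element of b. Both functions return distances, not similarities: 0 means the two strings are identical under the chosen method.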
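The matrix orientation and the recycling rule can be tried without installing anything, using adist from base R. This is only a sketch: adist computes a generalized Levenshtein distance, which is not the same as the stringdist default ("osa") in general, and the manual recycling with rep_len is an illustration of the rule, not how stringdist is implemented.

```r
# Base R only: adist() from utils computes a generalized Levenshtein distance.
a <- c("foo", "bar", "boo")
b <- c("baz", "buz")

d <- adist(a, b)
# Label the matrix so the orientation is visible: rows come from a, columns from b.
dimnames(d) <- list(a, b)
print(d)

# Element-wise comparison with recycling, done by hand:
# the shorter vector is repeated until it matches the longer one.
x <- c("a", "b", "c")
y <- rep_len(c("a", "c"), length(x))   # "a" "c" "a"
elementwise <- unname(mapply(function(s, t) adist(s, t)[1, 1], x, y))
print(elementwise)
```

The first result has three rows and two columns, matching the shape that stringdistmatrix(a, b) returns for the same inputs.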

Usage:

stringdist(a, b,
    method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
    useBytes = FALSE,
    weight = c(d = 1, i = 1, s = 1, t = 1),
    maxDist = Inf, q = 1, p = 0,
    nthread = getOption("sd_num_thread"))

stringdistmatrix(a, b,
    method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
    useBytes = FALSE,
    weight = c(d = 1, i = 1, s = 1, t = 1),
    maxDist = Inf, q = 1, p = 0,
    useNames = c("none", "strings", "names"),
    ncores = 1, cluster = NULL,
    nthread = getOption("sd_num_thread"))

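Arguments:

a, b: the character vectors to compare.
method: the distance method. The default is "osa"; other choices include "jaccard", "hamming" and "jw" (Jaro-Winkler).
useBytes: compare the strings byte by byte instead of character by character.
weight: penalties for deletion, insertion, substitution and transposition, in that order. Each weight must be positive and must not exceed 1.
maxDist: upper limit on the distance.
q: size of the q-grams. Only used with method = "qgram", "jaccard" or "cosine"; must be non-negative.
p: penalty factor for the Jaro-Winkler distance, between 0 and 0.25. With the default p = 0 the result is the plain Jaro distance.
nthread: maximum number of threads to use.
useNames: label the rows and columns of the output with the input strings ("strings") or with the names of the input vectors ("names").
ncores: number of cores to use.
cluster: a user-supplied cluster object for parallel computation.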
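To make the q argument concrete, here is a minimal base R sketch of a q-gram distance: split each string into its overlapping substrings of length q, count them, and sum the absolute differences of the counts. The helper names qgrams and qgram_distance are hypothetical and not part of the stringdist package; the sketch assumes q >= 1 and single strings as input, and it is not the package's actual implementation.

```r
# Hypothetical helper: all overlapping substrings of length q in one string.
qgrams <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

# Hypothetical helper: sum of absolute differences between q-gram counts.
qgram_distance <- function(a, b, q = 1) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  grams <- union(ga, gb)
  # Count how often each q-gram of the union occurs in a and in b.
  ca <- tabulate(match(ga, grams), nbins = length(grams))
  cb <- tabulate(match(gb, grams), nbins = length(grams))
  sum(abs(ca - cb))
}

print(qgram_distance("abc", "cba", q = 1))  # same characters, so no difference
print(qgram_distance("abc", "cba", q = 2))  # "ab","bc" versus "cb","ba"
```

For these two inputs the sketch gives 0 and 4, the same values that stringdist reports with method = "qgram".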

Examples:

> stringdistmatrix(c("foo","bar","boo"),c("baz","buz"))
     [,1] [,2]
[1,]    3    3
[2,]    1    2
[3,]    2    2
> # string distance matching is case sensitive:
> stringdist("ABC","abc")
[1] 3
>
> # so you may want to normalize a bit:
> stringdist(tolower("ABC"),"abc")
[1] 0
>
> # stringdist recycles the shortest argument:
> stringdist(c('a','b','c'),c('a','c'))
Warning message:
longer object length is not a multiple of shorter object length
[1] 0 1 1
>
> # different edit operations may be weighted; e.g. weighted transposition
> # (the fourth weight):
> stringdist('ab','ba',weight=c(1,1,1,0.5))
[1] 0.5
>
> # Non-unit weights for insertion and deletion make the distance metric asymmetric
> stringdist('ca','abc')
[1] 3
> stringdist('abc','ca')
[1] 3
> stringdist('ca','abc',weight=c(0.5,1,1,1))
[1] 2
> stringdist('abc','ca',weight=c(0.5,1,1,1))
[1] 2.5
> # q-grams are based on the difference between occurrences of q consecutive characters
> # in string a and string b.
> # Since each character of 'abc' occurs in both 'abc' and 'cba', the q=1 distance equals 0:
> stringdist('abc','cba',method='qgram',q=1)
[1] 0
>
> # since the first string consists of 'ab','bc' and the second
> # of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
> stringdist('abc','cba',method='qgram',q=2)
[1] 4
> stringdist('MARTHA','MATHRA',method='jw')
[1] 0.08333333
> # Note that stringdist gives a _distance_ where wikipedia gives the corresponding
> # _similarity measure_. To get the wikipedia result:
> 1 - stringdist('MARTHA','MATHRA',method='jw')
[1] 0.9166667
>
> # The corresponding Jaro-Winkler distance can be computed by setting p=0.1
> stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.06666667
> # or, as a similarity measure
> 1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)
[1] 0.9333333
>
> # This gives distance 1 since Euler and Gauss translate to different soundex codes.
> stringdist('Euler','Gauss',method='soundex')
[1] 1
> # Euler and Ellery translate to the same code and have distance 0
> stringdist('Euler','Ellery',method='soundex')
[1] 0
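The step from the Jaro distance (p = 0) to the Jaro-Winkler distance (p = 0.1) in the MARTHA example can be reproduced by hand. The sketch below assumes the definition given in the stringdist documentation, d - l * p * d, where d is the Jaro distance and l is the length of the common prefix, capped at 4. The function names are hypothetical, and the Jaro distance itself is taken from the output above and not recomputed.

```r
# Hypothetical helper: length of the common prefix of two strings, capped at max_len.
common_prefix_length <- function(a, b, max_len = 4) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb), max_len)
  if (n == 0) return(0)
  same <- ca[seq_len(n)] == cb[seq_len(n)]
  # Number of leading positions that match before the first mismatch.
  if (all(same)) n else which(!same)[1] - 1
}

# Hypothetical helper: apply the Winkler prefix correction to a Jaro distance.
jw_from_jaro <- function(jaro_dist, a, b, p = 0.1) {
  l <- common_prefix_length(a, b)
  jaro_dist - l * p * jaro_dist
}

jaro <- 1 / 12   # 0.08333333, the p = 0 result for 'MARTHA' and 'MATHRA' above
print(common_prefix_length("MARTHA", "MATHRA"))
print(jw_from_jaro(jaro, "MARTHA", "MATHRA", p = 0.1))
```

The two strings share the prefix "MA", so l = 2 and the distance shrinks by a factor of 1 - 2 * 0.1 = 0.8, which gives the 0.06666667 shown in the transcript.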

References:
https://www.rdocumentation.org/packages/stringdist/versions/0.9.4.2/topics/stringdist
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
