查找两个字符串之间的相似性度量

问题:查找两个字符串之间的相似性度量

如何获得一个字符串与Python中另一个字符串相似的概率?

我想得到一个十进制值,例如0.9(表示90%)等。最好使用标准Python和库。

例如

similar("Apple","Appel") #would have a high prob.

similar("Apple","Mango") #would have a lower prob.

How do I get the probability of a string being similar to another string in Python?

I want to get a decimal value like 0.9 (meaning 90%) etc. Preferably with standard Python and library.

e.g.

similar("Apple","Appel") #would have a high prob.

similar("Apple","Mango") #would have a lower prob.

回答 0

有一个内置的。

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

使用它:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0

There is a built in.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0

回答 1

我认为您可能正在寻找一种描述字符串之间距离的算法。您可能会参考以下内容:

  1. 汉明距离
  2. 莱文施泰因距离
  3. Damerau–Levenshtein距离
  4. Jaro–Winkler距离

I think maybe you are looking for an algorithm describing the distance between strings. Here are some you may refer to:

  1. Hamming distance
  2. Levenshtein distance
  3. Damerau–Levenshtein distance
  4. Jaro–Winkler distance

回答 2

解决方案1:内置Python

使用SequenceMatcherdifflib

优点:本地python库,不需要额外的软件包。
缺点:太有限了,还有很多其他的用于字符串相似性的好的算法。

例如
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75

解决方案2:水母

它是一个很好的图书馆,覆盖面广,几乎没有问题。它支持:
-莱文斯坦距离
– Damerau-Levenshtein距离
-哈罗距离
-哈罗-温克勒距离
-比赛评分方法比较
-海明距离

优点:易于使用,所支持算法的色域经过测试。
缺点:不是本机库。

例如

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1

Solution #1: Python builtin

use SequenceMatcher from difflib

pros: native python library, no need extra package.
cons: too limited, there are so many other good algorithms for string similarity out there.

example :
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75

Solution #2: jellyfish library

its a very good library with good coverage and few issues. it supports:
– Levenshtein Distance
– Damerau-Levenshtein Distance
– Jaro Distance
– Jaro-Winkler Distance
– Match Rating Approach Comparison
– Hamming Distance

pros: easy to use, gamut of supported algorithms, tested.
cons: not native library.

example:

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1

回答 3

Fuzzy Wuzzy是一个在python中实现Levenshtein距离的软件包,并提供了一些帮助程序功能以在某些情况下提供帮助,在某些情况下,您可能希望将两个不同的字符串视为相同。例如:

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

Fuzzy Wuzzy is a package that implements Levenshtein distance in python, with some helper functions to help in certain situations where you may want two distinct strings to be considered identical. For example:

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

回答 4

您可以创建如下函数:

def similar(w1, w2):
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))

You can create a function like:

def similar(w1, w2):
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))

回答 5

包装距离包括莱文施泰因距离:

import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3

Package distance includes Levenshtein distance:

import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3

回答 6

对于SequenceMatcher大输入量,内置函数非常慢,这是使用diff-match-patch可以完成的方法:

from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0
    diff = dmp.diff_main(text1, text2, False)

    # similarity
    common_text = sum([len(txt) for op, txt in diff if op == 0])
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length

    return sim, diff

The builtin SequenceMatcher is very slow on large input, here’s how it can be done with diff-match-patch:

from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0
    diff = dmp.diff_main(text1, text2, False)

    # similarity
    common_text = sum([len(txt) for op, txt in diff if op == 0])
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length

    return sim, diff

回答 7

注意,difflib.SequenceMatcher 找到最长的连续匹配子序列,这通常不是所希望的,例如:

>>> a1 = "Apple"
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012  # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)]  # only the first block is recorded

寻找两个字符串之间的相似性与生物信息学中成对序列比对的概念密切相关。有很多专用的库,包括biopython。这个例子实现了Needleman Wunsch算法

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'

使用biopython或其他生物信息学软件包比python标准库的任何部分都更加灵活,因为可以使用许多不同的评分方案和算法。另外,您实际上可以获取匹配序列以可视化正在发生的事情:

>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-
|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el

Note, difflib.SequenceMatcher only finds the longest contiguous matching subsequence, this is often not what is desired, for example:

>>> a1 = "Apple"
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012  # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)]  # only the first block is recorded

Finding the similarity between two strings is closely related to the concept of pairwise sequence alignment in bioinformatics. There are many dedicated libraries for this including biopython. This example implements the Needleman Wunsch algorithm:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'

Using biopython or another bioinformatics package is more flexible than any part of the python standard library since many different scoring schemes and algorithms are available. Also, you can actually get the matching sequences to visualise what is happening:

>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-
|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el

回答 8

你可以发现,大部分的文本类似方法,以及它们是如何此链接下式计算:https://github.com/luozhouyang/python-string-similarity#python-string-similarity 下面一些例子;

  • 归一化,度量,相似度和距离

  • (归一化)相似度和距离

  • 公制距离

  • 基于带状疱疹(n-gram)的相似度和距离
  • 莱文施泰因
  • 标准化莱文施泰因
  • 加权Levenshtein
  • Damerau-Levenshtein
  • 最佳字符串对齐
  • 杰罗·温克勒
  • 最长公共子序列
  • 公制最长公共子序列
  • N-格拉姆
  • 基于碎片(n-gram)的算法
  • Q-Gram
  • 余弦相似度
  • 雅卡指数
  • Sorensen-Dice系数
  • 重叠系数(即,Szymkiewicz-Simpson)

You can find most of the text similarity methods and how they are calculated under this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity Here some examples;

  • Normalized, metric, similarity and distance

  • (Normalized) similarity and distance

  • Metric distances

  • Shingles (n-gram) based similarity and distance
  • Levenshtein
  • Normalized Levenshtein
  • Weighted Levenshtein
  • Damerau-Levenshtein
  • Optimal String Alignment
  • Jaro-Winkler
  • Longest Common Subsequence
  • Metric Longest Common Subsequence
  • N-Gram
  • Shingle(n-gram) based algorithms
  • Q-Gram
  • Cosine similarity
  • Jaccard index
  • Sorensen-Dice coefficient
  • Overlap coefficient (i.e.,Szymkiewicz-Simpson)