Find the similarity metric between two strings

Question 1

How do I get the probability of a string being similar to another string in Python?

I want to get a decimal value such as 0.9 (meaning 90%). Preferably using only standard Python and its standard library.

e.g.

similar("Apple","Appel") #would have a high prob.

similar("Apple","Mango") #would have a lower prob.

Answer 1

There is a built-in.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
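
The 0.0 for "Apple" and "Mango" comes from the comparison being case-sensitive, so "A" and "a" do not count as a match. A minimal sketch of a case-insensitive variant follows; the name `similar_ignore_case` is made up for illustration and is not part of difflib.

```python
from difflib import SequenceMatcher

def similar_ignore_case(a, b):
    """Return a 0..1 similarity ratio, ignoring letter case (hypothetical helper)."""
    # casefold() is a more aggressive lower() intended for caseless matching.
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

print(similar_ignore_case("Apple", "APPEL"))  # 0.8
print(similar_ignore_case("Apple", "Mango"))  # 0.2
```

Whether case should matter depends on the data, so this is a design choice rather than a correction of the function above.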

Answer 2

I think maybe you are looking for an algorithm describing the distance between strings. Here are some you may refer to:
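
As one illustration of such a distance, here is a minimal pure-Python sketch of the Levenshtein (edit) distance, together with one common way to turn it into a 0..1 similarity. The function names are assumptions chosen for this example, and dividing by the longer length is only one of several possible normalizations.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Map the distance to 0..1 by dividing by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("Apple", "Appel"))             # 2
print(levenshtein_similarity("Apple", "Appel"))  # 0.6
```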

Answer 3

Solution #1: Python builtin

Use SequenceMatcher from difflib.

Pros: part of the Python standard library, so no extra package is needed.
Cons: limited to a single algorithm; many other good string-similarity algorithms exist.

Example:

>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75

Solution #2: jellyfish library

It's a very good library with good coverage and few issues. It supports:
– Levenshtein Distance
– Damerau-Levenshtein Distance
– Jaro Distance
– Jaro-Winkler Distance
– Match Rating Approach Comparison
– Hamming Distance

Pros: easy to use, wide range of supported algorithms, well tested.
Cons: not part of the standard library.

Example:

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')  # named jaro_similarity in newer jellyfish releases
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1
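
To show where a value like 0.896 comes from, here is a pure-Python sketch of the textbook Jaro similarity: with m matching characters and t transpositions, the score is (m/len1 + m/len2 + (m - t)/m) / 3. This is an independent illustration, not jellyfish's own code, and the function name is only an assumption for the example.

```python
def jaro_similarity(s1, s2):
    """Textbook Jaro similarity in the range 0..1 (illustrative sketch)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transposed;
    # each transposition involves two characters, hence the halving.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(x != y for x, y in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

print(round(jaro_similarity("jellyfish", "smellyfish"), 4))  # 0.8963
```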

Answer 4

Fuzzy Wuzzy is a package that implements Levenshtein-based similarity in Python, with helper functions for situations where you want two distinct strings to be considered identical. For example:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
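
The idea behind a token-sort comparison can be sketched with the standard library alone: lowercase both strings, split them into word tokens, sort the tokens, and compare the rejoined strings. The helper below is hypothetical and uses difflib for scoring, so its values on other inputs need not agree with Fuzzy Wuzzy's.

```python
import re
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    """Compare two strings while ignoring word order (illustrative sketch)."""
    def normalise(text):
        # Lowercase, keep only word tokens, and put them in sorted order.
        tokens = re.findall(r"\w+", text.lower())
        return " ".join(sorted(tokens))
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

print(token_sort_similarity("fuzzy wuzzy was a bear",
                            "wuzzy fuzzy was a bear"))  # 1.0
```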

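Answer 5

You can create a function like the following, which pads the shorter string with spaces and then compares the two strings position by position:

def similar(w1, w2):
    # Pad the shorter string with spaces so both have the same length.
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    if not w1:
        return 1.0  # two empty strings: avoid dividing by zero
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))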

Answer 6

The distance package includes Levenshtein distance:

import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3

Answer 7

The builtin SequenceMatcher is very slow on large input. Here is how it can be done with diff-match-patch:

from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0
    diff = dmp.diff_main(text1, text2, False)

    # similarity: share of the longer text covered by unchanged (op == 0) segments
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length if text_length else 1.0

    return sim, diff
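
If staying within the standard library is preferred, one partial mitigation when many pairs are checked against a cutoff is to use the cheap upper bounds that `SequenceMatcher` provides: `real_quick_ratio()` and `quick_ratio()` never return less than `ratio()`, so a pair that fails either bound can be rejected without the expensive computation. The helper name and the threshold parameter below are assumptions for this sketch, and it does not speed up a single full comparison.

```python
from difflib import SequenceMatcher

def similar_at_least(a, b, threshold):
    """Return True if SequenceMatcher's ratio for a and b is >= threshold."""
    matcher = SequenceMatcher(None, a, b)
    # Both bounds are >= ratio(), so failing one means ratio() would fail too.
    if matcher.real_quick_ratio() < threshold:  # based on lengths only
        return False
    if matcher.quick_ratio() < threshold:       # based on character counts
        return False
    return matcher.ratio() >= threshold         # full comparison, done last

print(similar_at_least("Apple", "Appel", 0.8))  # True
print(similar_at_least("Apple", "Mango", 0.5))  # False
```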

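Answer 8

Note that difflib.SequenceMatcher can give surprisingly low scores on long, repetitive input. By default its autojunk heuristic treats any item that accounts for more than 1% of a sequence of at least 200 elements as junk, so matching can effectively stop after the first block. This is often not what is desired, for example:

>>> from difflib import SequenceMatcher
>>> a1 = "Apple"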
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012  # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)]  # only the first block is recorded
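
For comparison, the same inputs can be scored with the `autojunk` heuristic switched off. `autojunk` is a documented keyword argument of `SequenceMatcher` (default `True`); the variable names below are only illustrative, and exact values are left to the reader to print.

```python
from difflib import SequenceMatcher

a1 = "Apple" * 50
a2 = "Appel" * 50

# Default behaviour: with 250 characters per string, every letter is frequent
# enough to be treated as junk, so almost nothing is matched.
default_ratio = SequenceMatcher(None, a1, a2).ratio()

# With the heuristic disabled, repeated letters take part in matching again.
no_autojunk_ratio = SequenceMatcher(None, a1, a2, autojunk=False).ratio()

print(default_ratio)
print(no_autojunk_ratio)
```

Disabling the heuristic gives a far higher score here, but it can make comparisons of long sequences noticeably slower.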

Finding the similarity between two strings is closely related to the concept of pairwise sequence alignment in bioinformatics. There are many dedicated libraries for this, including biopython. This example uses its PairwiseAligner, which applies the Needleman-Wunsch algorithm here:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'

Using biopython or another bioinformatics package is more flexible than any part of the Python standard library, since many different scoring schemes and algorithms are available. You can also retrieve the aligned sequences to visualise what is happening:

>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-
|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el
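
To make the score of 200 less opaque, here is a pure-Python sketch of the Needleman-Wunsch scoring recurrence. The defaults (match 1, mismatch 0, gap 0) are assumptions chosen to be consistent with the output above, where 200 matched characters and free gaps give 200.0; this is not Biopython's implementation, and it returns only the score, not the alignment.

```python
def global_alignment_score(a, b, match=1.0, mismatch=0.0, gap=0.0):
    """Best global alignment score of a and b (Needleman-Wunsch, score only)."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j, cb in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ca == cb else mismatch)
            current.append(max(
                diagonal,               # align ca with cb
                previous[j] + gap,      # ca aligned with a gap
                current[j - 1] + gap,   # cb aligned with a gap
            ))
        previous = current
    return previous[-1]

print(global_alignment_score("Apple", "Appel"))            # 4.0
print(global_alignment_score("Apple" * 50, "Appel" * 50))  # 200.0
```

With these particular scores the result equals the length of the longest common subsequence; other scoring schemes would penalize gaps or mismatches.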

Answer 9

You can find most of the text similarity methods and how they are calculated at this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity Here are some examples:

Normalized, metric, similarity and distance
(Normalized) similarity and distance
Metric distances
Shingles (n-gram) based similarity and distance
Levenshtein
Normalized Levenshtein
Weighted Levenshtein
Damerau-Levenshtein
Optimal String Alignment
Jaro-Winkler
Longest Common Subsequence
Metric Longest Common Subsequence
N-Gram
Shingle (n-gram) based algorithms
Q-Gram
Cosine similarity
Jaccard index
Sorensen-Dice coefficient
Overlap coefficient (i.e., Szymkiewicz-Simpson)
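
As a small illustration of the shingle-based measures in this list, the sketch below computes the Jaccard index and the Sorensen-Dice coefficient over sets of character bigrams. It is written from the textbook definitions rather than taken from the linked library, which may count repeated n-grams differently; all function names are assumptions.

```python
def ngrams(text, n=2):
    """Set of all character n-grams (shingles) in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=2):
    """|X & Y| / |X | Y| over the n-gram sets of a and b."""
    x, y = ngrams(a, n), ngrams(b, n)
    if not x and not y:
        return 1.0  # both too short to have n-grams: treat as identical
    return len(x & y) / len(x | y)

def sorensen_dice(a, b, n=2):
    """2 * |X & Y| / (|X| + |Y|) over the n-gram sets of a and b."""
    x, y = ngrams(a, n), ngrams(b, n)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

# "Apple" -> {Ap, pp, pl, le}; "Appel" -> {Ap, pp, pe, el}; two bigrams shared.
print(round(jaccard("Apple", "Appel"), 4))  # 0.3333
print(sorensen_dice("Apple", "Appel"))      # 0.5
```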
