Tag Archives: string-matching

High-performance fuzzy string comparison in Python using Levenshtein or difflib

Question: High-performance fuzzy string comparison in Python using Levenshtein or difflib

I am doing clinical message normalization (spell check) in which I check each given word against a 900,000-word medical dictionary. I am more concerned about the time complexity/performance.

I want to do fuzzy string comparison, but I’m not sure which library to use.

Option 1:

import Levenshtein
Levenshtein.ratio('hello world', 'hello')

Result: 0.625

Option 2:

import difflib
difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()

Result: 0.625

In this example, both give the same answer. Do you think both perform alike in this case?


Answer 0

In case you’re interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:

        # columns 3 and 4 of each tab-separated row hold the two titles to compare
        sr = row.lower().split("\t")

        diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev   = Levenshtein.ratio(sr[3], sr[4])
        sor   = 1 - distance.sorensen(sr[3], sr[4])
        jac   = 1 - distance.jaccard(sr[3], sr[4])

        print(diffl, lev, sor, jac)

I then plotted the results with R:

Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

library(ggplot2)
require(GGally)

difflib <- read.table("similarity_measures.txt", sep = " ")
colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")

ggpairs(difflib)

Result:

The Difflib / Levenshtein similarity really is quite interesting.

2018 edit: If you’re working on identifying similar strings, you could also check out minhashing–there’s a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext
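
For reference, here is a minimal sketch of the minhashing idea using the datasketch library (the library choice, the character-trigram shingling, and the 0.5 threshold are my assumptions, not part of the original answer):

from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    # Build a MinHash signature from the set of character trigrams in the text.
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - 2):
        m.update(text[i:i + 3].encode("utf8"))
    return m

titles = ["the history of medicine", "a history of medicine", "introduction to surgery"]

# Index every title once; the LSH index then retrieves candidates above the
# Jaccard threshold per query instead of comparing against every other title.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for i, title in enumerate(titles):
    lsh.insert(i, minhash(title))

print(lsh.query(minhash("the history of medicine")))  # keys of similar titles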


Answer 1

  • difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm: it computes the doubled number of matching characters divided by the total number of characters in the two strings (see the worked example after these bullets).

  • Levenshtein uses the Levenshtein algorithm: it computes the minimum number of edits needed to transform one string into the other.
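
As a quick check of the Ratcliff/Obershelp ratio formula from the first bullet, using the strings from the question (the arithmetic below is mine):

a, b = 'hello world', 'hello'
matches = 5                                  # "hello" matches in both strings
ratio = 2.0 * matches / (len(a) + len(b))    # doubled matches / total characters
print(ratio)                                 # 0.625

Levenshtein.ratio happens to return the same 0.625 for this pair by a different route, as shown in the question.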

Complexity

SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common. (from the Python difflib documentation)

Levenshtein is O(m*n), where n and m are the lengths of the two input strings.

Performance

According to the source code of the Levenshtein module: Levenshtein has some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it’s much faster.
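
A minimal timing sketch to verify that claim on your own machine (the example strings and iteration count are my choices, not from the answer; absolute numbers will vary by machine):

import difflib
import timeit

import Levenshtein

a, b = "hello world", "hello"

# Time 100,000 comparisons with each library.
t_lev  = timeit.timeit(lambda: Levenshtein.ratio(a, b), number=100000)
t_diff = timeit.timeit(lambda: difflib.SequenceMatcher(None, a, b).ratio(), number=100000)

print("Levenshtein.ratio:", t_lev)
print("difflib ratio:    ", t_diff)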


Check if a string matches a pattern

Question: Check if a string matches a pattern

How do I check if a string matches this pattern?

Uppercase letter, number(s), uppercase letter, number(s)…

Example, These would match:

A1B2
B10L1
C1N200J1

These wouldn’t (‘^’ points to problem)

a1B2
^
A10B
   ^
AB400
 ^

Answer 0

import re
pattern = re.compile("^([A-Z][0-9]+)+$")
pattern.match(string)

Edit: As noted in the comments, match() checks only for matches at the beginning of the string, while re.search() will match a pattern anywhere in the string. (See also: https://docs.python.org/library/re.html#search-vs-match)
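
A short illustration of that difference (the example strings are mine):

import re

# match() only looks at the beginning of the string; search() scans the whole string.
print(bool(re.match("[A-Z][0-9]+", "xxA1B2")))     # False: no match at position 0
print(bool(re.search("[A-Z][0-9]+", "xxA1B2")))    # True: "A1" is found at position 2
print(bool(re.match("^([A-Z][0-9]+)+$", "A1B2")))  # True: the whole string fits the pattern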


Answer 1

One-liner: re.match(r"pattern", string) # No need to compile

>>> import re
>>> if re.match(r"hello[0-9]+", 'hello1'):
...     print('Yes')
... 
Yes

You can evaluate it as a bool if needed

>>> bool(re.match(r"hello[0-9]+", 'hello1'))
True

Answer 2

Please try the following:

import re

name = ["A1B1", "djdd", "B2C4", "C2H2", "jdoi", "1A4V"]

# Print the captured group for each name that matches letter-digit-letter-digit.
for element in name:
    m = re.match(r"(^[A-Z]\d[A-Z]\d)", element)
    if m:
        print(m.groups())

Answer 3

import re
import sys

prog = re.compile(r'([A-Z]\d+)+')

# Read lines from stdin and report whether each one starts with the pattern.
while True:
    line = sys.stdin.readline()
    if not line:
        break

    if prog.match(line):
        print('matched')
    else:
        print('not matched')

Answer 4

Regular expressions make this easy…

[A-Z] will match exactly one character between A and Z

\d+ will match one or more digits

() groups things (and also returns things… but for now just think of it as grouping)

+ selects 1 or more
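
Putting these pieces together gives the same pattern used in answer 0 above; here is a quick check against the question’s examples (the test loop is mine):

import re

# One or more groups of "uppercase letter followed by one or more digits",
# anchored so the whole string must fit.
pattern = re.compile(r"^([A-Z]\d+)+$")

for s in ["A1B2", "B10L1", "C1N200J1", "a1B2", "A10B", "AB400"]:
    print(s, bool(pattern.match(s)))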


Answer 5

  
import re

ab = re.compile("^([A-Z]{1}[0-9]{1})+$")
ab.match(string)

I believe that should work for an uppercase, number pattern.