Question: How do I do a case-insensitive string comparison?

How can I do case insensitive string comparison in Python?

I would like to encapsulate the comparison of regular strings to a repository string in a very simple and Pythonic way. I would also like to be able to look up values in a dict hashed by strings using regular Python strings.


Answer 0

Assuming ASCII strings:

string1 = 'Hello'
string2 = 'hello'

if string1.lower() == string2.lower():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")

Answer 1

Comparing strings in a case insensitive way seems trivial, but it’s not. I will be using Python 3, since Python 2 is underdeveloped here.

The first thing to note is that case-removing conversions in Unicode aren’t trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":

"ß".lower()
#>>> 'ß'

"ß".upper().lower()
#>>> 'ss'

But let’s say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want to compare "BUSSE" and "BUẞE" equal – that’s the newer capital form. The recommended way is to use casefold:

str.casefold()

Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. […]

Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).
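A short sketch of the difference, using the strings from above. The helper name `same_ignoring_case` is made up for illustration, and `str.casefold()` needs Python 3.3 or later.

```python
def same_ignoring_case(left, right):
    # Hypothetical helper: compare casefolded copies of both strings.
    return left.casefold() == right.casefold()

# lower() leaves "ß" untouched, so this comparison fails.
print("BUSSE".lower() == "Buße".lower())         # False
# casefold() expands "ß" to "ss", so the strings compare equal.
print(same_ignoring_case("BUSSE", "Buße"))       # True
# The capital sharp s (U+1E9E) also folds to "ss".
print(same_ignoring_case("BUSSE", "BU\u1e9eE"))  # True
```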

Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê" is true, but it isn’t:

"ê" == "ê"
#>>> False

This is because the accent on the latter is a combining character.

import unicodedata

[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E WITH CIRCUMFLEX']

[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does:

unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
#>>> True
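The same check can be written with explicit escapes, so that copy and paste cannot silently turn one spelling into the other. This is only a sketch, and the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00ea"  # LATIN SMALL LETTER E WITH CIRCUMFLEX
combining = "e\u0302"   # LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT

print(precomposed == combining)          # False
print(len(precomposed), len(combining))  # 1 2

# After normalization both spellings become the same code point sequence.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    left = unicodedata.normalize(form, precomposed)
    right = unicodedata.normalize(form, combining)
    print(form, left == right)  # True for each form
```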

To finish up, here is this expressed as functions:

import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
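The question also asks about dictionary lookups. One possible approach, sketched here on the assumption that keys can be normalized on the way in, is a small `dict` subclass. `CaselessDict` is a hypothetical name, not a standard library class, and the block repeats `normalize_caseless` so that it runs on its own.

```python
import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

class CaselessDict(dict):
    """Hypothetical dict that normalizes string keys on every access."""

    def __setitem__(self, key, value):
        super().__setitem__(normalize_caseless(key), value)

    def __getitem__(self, key):
        return super().__getitem__(normalize_caseless(key))

    def __contains__(self, key):
        return super().__contains__(normalize_caseless(key))

    def get(self, key, default=None):
        return super().get(normalize_caseless(key), default)

repo = CaselessDict()
repo["Buße"] = 1
print(repo["BUSSE"])     # 1
print("busse" in repo)   # True
print(repo.get("nope"))  # None
```

Note that this sketch stores only the normalized key, so the original spelling is lost, and that the `dict` constructor and `update()` bypass `__setitem__`, so keys should be added by item assignment.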

Answer 2

Using Python 2, calling .lower() on each string or Unicode object…

string1.lower() == string2.lower()

…will work most of the time, but indeed doesn’t work in the situations @tchrist has described.

Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:

>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ

>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True

The Σ character has two lowercase forms, ς and σ, and .lower() won’t help compare them case-insensitively.

However, in Python 3 (3.3.0b1 here), lower() turns a word-final Σ into ς and any other Σ into σ, so calling lower() on both strings will work correctly:

>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ

>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True

So if you care about edge-cases like the three sigmas in Greek, use Python 3.

(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
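On Python 3.3 or later, `str.casefold()` gives the same answer without depending on where the sigma sits in the word, since it folds ς to σ. A small sketch with the two strings from the file written inline:

```python
first = "Σίσυφος"
second = "ΣΊΣΥΦΟΣ"

# Both strings fold to the same sequence, whatever the position of the sigma.
print(first.casefold() == second.casefold())  # True

# casefold() maps the final sigma to the ordinary small sigma.
print("ς".casefold() == "σ")  # True
```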


Answer 3

Section 3.13 of the Unicode standard defines algorithms for caseless matching.

X.casefold() == Y.casefold() in Python 3 implements the “default caseless matching” (D144).

Casefolding does not preserve the normalization of strings in all instances and therefore the normalization needs to be done ('å' vs. 'å'). D145 introduces “canonical caseless matching”:

import unicodedata

def NFD(text):
    return unicodedata.normalize('NFD', text)

def canonical_caseless(text):
    return NFD(NFD(text).casefold())

NFD() is called twice to handle very infrequent edge cases involving the U+0345 character.

Example:

>>> 'å'.casefold() == 'å'.casefold()
False
>>> canonical_caseless('å') == canonical_caseless('å')
True

There is also compatibility caseless matching (D146) for cases such as '㎒' (U+3392), and “identifier caseless matching” to simplify and optimize caseless matching of identifiers.
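For completeness, here is a sketch of compatibility caseless matching as I read definition D146, which composes NFD, casefolding, and NFKD. The names `NFKD` and `compatibility_caseless` are chosen here for illustration; check the standard before relying on it.

```python
import unicodedata

def NFD(text):
    return unicodedata.normalize('NFD', text)

def NFKD(text):
    return unicodedata.normalize('NFKD', text)

def compatibility_caseless(text):
    # NFKD(casefold(NFKD(casefold(NFD(text))))), following my reading of D146
    return NFKD(NFKD(NFD(text).casefold()).casefold())

# U+3392 SQUARE MHZ decomposes to "MHz" under NFKD.
print('\u3392'.casefold() == 'mhz'.casefold())                            # False
print(compatibility_caseless('\u3392') == compatibility_caseless('MHZ'))  # True
```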


Answer 4

I saw this solution here using regex.

import re
if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
    print("match")  # the search succeeds

It works well with accents:

In [42]: if re.search("ê","ê", re.IGNORECASE):
....:        print(1)
....:
1

However, it doesn’t match Unicode characters case-insensitively when the case forms differ, as with "ß" and "SS". Thank you @Rhymoid for pointing this out; my understanding is that it needs the exact symbol for the match to succeed. The output is as follows:

In [36]: "ß".lower()
Out[36]: 'ß'
In [37]: "ß".upper()
Out[37]: 'SS'
In [38]: "ß".upper().lower()
Out[38]: 'ss'
In [39]: if re.search("ß","ßß", re.IGNORECASE):
....:        print(1)
....:
1
In [40]: if re.search("SS","ßß", re.IGNORECASE):
....:        print(1)
....:
In [41]: if re.search("ß","SS", re.IGNORECASE):
....:        print(1)
....:
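If the goal is a substring test that also handles the ß cases above, one option (not part of the linked solution) is to casefold both sides and use `in` instead of a regex. `contains_caseless` is a hypothetical helper name.

```python
import re

def contains_caseless(needle, haystack):
    # Hypothetical helper: casefold both strings, then do a plain substring test.
    return needle.casefold() in haystack.casefold()

print(contains_caseless("mandy", "Mandy Pande"))  # True
print(contains_caseless("SS", "ßß"))              # True
print(contains_caseless("ß", "SS"))               # True

# For comparison, the regex flag alone does not find this one.
print(re.search("SS", "ßß", re.IGNORECASE))       # None
```

One caveat of this sketch: because "ß" folds to "ss", a single "s" is also reported as contained in "ß".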

Answer 5

The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:

>>> "hello".upper() == "HELLO".upper()
True

Answer 6

How about converting to lowercase first? You can use string.lower().


Answer 7

def insenStringCompare(s1, s2):
    """Take two strings and return True or False, based on whether
    they are equal, regardless of case."""
    try:
        return s1.lower() == s2.lower()
    except AttributeError:
        # A non-string argument has no .lower(); report it and return None.
        print("Please only pass strings into this method.")
        print("You passed a %s and %s" % (s1.__class__, s2.__class__))

Answer 8

All you’ll have to do is to convert the two strings to lowercase (all letters become lowercase) and then compare them (assuming the strings are ASCII strings).

For example:

string1 = "Hello World"
string2 = "hello WorlD"

if string1.lower() == string2.lower():
    print("The two strings are the same.")
else:
    print("The two strings are not the same.")

Answer 9

This is another regex approach. I have learned to love/hate regexes over the last week, so I usually import the module as something that reflects how I’m feeling (in this case, yes). Make a normal function, ask for input, then use something = re.compile(r'foo|spam', yes.I). re.I (yes.I below) is the same as IGNORECASE, but there is less to mistype.

You then search your message with the regex. Regexes honestly deserve a few pages of their own, but the point is that foo and spam are joined with a pipe and case is ignored. If either is found, lost_n_found holds the match; if neither is found, lost_n_found is None. If it is not None, return the matched text in lower case using “return lost_n_found.group().lower()”.

This lets you much more easily match anything that would otherwise be case sensitive. Lastly, (NCS) stands for “no one cares seriously…!” or “not case sensitive”, whichever you prefer.

If anyone has any questions, get me on this.

    import re as yes

    def bar_or_spam():
        # On Python 2, use raw_input() instead of input().
        message = input("\nEnter FoO for BaR or SpaM for EgGs (NCS): ")

        # yes.I is short for yes.IGNORECASE
        message_in_coconut = yes.compile(r'foo|spam', yes.I)

        # search() returns None when nothing matches, so check before .group()
        lost_n_found = message_in_coconut.search(message)

        if lost_n_found is not None:
            return lost_n_found.group().lower()
        else:
            print("Make tea not love")
            return None

    whatz_for_breakfast = bar_or_spam()

    # Compare against string literals; bare foo and spam were undefined names.
    if whatz_for_breakfast == "foo":
        print("BaR")

    elif whatz_for_breakfast == "spam":
        print("EgGs")
