Question: How do I check if a string in Python is ASCII?

I want to check whether a string is ASCII or not.

I am aware of ord(), but when I try ord('é') I get TypeError: ord() expected a character, but string of length 2 found. I understand this is caused by the way I built Python (as explained in ord()'s documentation).

Is there another way to check?


Answer 0

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

Answer 1

I think you are not asking the right question–

A string in python has no property corresponding to ‘ascii’, utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that’s where you need to go for an answer.

Perhaps the question you can ask is: “Is this string the result of encoding a unicode string in ascii?” — This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not a ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"

Answer 2

The Python 3 way:

isascii = lambda s: len(s) == len(s.encode())

To check, pass a test string:

str1 = "♥O◘♦♥O◘♦"
str2 = "Python"

print(isascii(str1))  # will return False
print(isascii(str2))  # will return True

The same idea written as a function: encode the string as UTF-8, then check whether the length stays the same. If so, the original string is ASCII.

def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())

To check, pass the test string:

>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True

Answer 3

New in Python 3.7 (bpo32677)

No more tiresome/inefficient ASCII checks on strings: the new built-in str/bytes/bytearray method .isascii() will check whether a string is ASCII.

print("is this ascii?".isascii())
# True
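
A few extra calls (not from the original answer) illustrating that the method also exists on bytes and bytearray, and returns False for non-ASCII input:

print("café".isascii())                  # False
print(b"is this ascii?".isascii())       # True
print(bytearray(b"\xc3\xa9").isascii())  # False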

Answer 4

Ran into something like this recently – for future reference

import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'

which you could use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')
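
Note that chardet.detect() operates on bytes; under Python 3 the same idea might look roughly like this (a sketch, assuming the third-party chardet package is installed):

import chardet  # third-party package: pip install chardet

raw = 'hello'.encode('utf-8')       # detect() expects a bytes object
result = chardet.detect(raw)
if result['encoding'] == 'ascii':
    print('string is in ascii')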

Answer 5

Vincent Marchetti has the right idea, but str.decode is no longer available in Python 3. In Python 3 you can make the same test with str.encode:

try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii

Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.
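
For convenience, the same test can be wrapped in a small helper function (a sketch along the lines of the snippet above, not part of the original answer):

def is_ascii(mystring):
    """Return True if mystring contains only ASCII characters (Python 3)."""
    try:
        mystring.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True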


Answer 6

Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. “foo”, or ‘bar’, in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u”foo” or u’bar’) are sequences of unicode code points; numbers from 0-1112064. But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.

Instead of ord(u'é'), try this:

>>> [ord(x) for x in u'é']

That tells you which sequence of code points “é” represents. It may give you [233], or it may give you [101, 769].

Instead of chr() to reverse this, there is unichr():

>>> unichr(233)
u'\xe9'

This character may actually be represented as either a single or multiple unicode “code points”, which themselves represent either graphemes or characters. It’s either “e with an acute accent” (i.e., code point 233), or “e” (code point 101) followed by “an acute accent on the previous character” (code point 769). So this exact same character may be represented as the Python data structure u'e\u0301' or u'\u00e9'.

Most of the time you shouldn’t have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
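
For illustration, a small sketch of that conversion with unicodedata.normalize:

import unicodedata

decomposed = u'e\u0301'                      # "e" followed by a combining acute accent
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed))                       # 2
print(len(composed))                         # 1
print(composed == u'\u00e9')                 # True
print(unicodedata.normalize('NFD', u'\u00e9') == decomposed)  # True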

The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.


Answer 7

How about doing this?

import string

def isAscii(s):
    # note: string.ascii_letters contains only A-Z and a-z, so any digit,
    # space, or punctuation character will make this return False
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True

Answer 8

I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn’t sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string; I didn’t realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.

If you’re getting a rude and persistent

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you’re ENCODING, make sure you’re not trying to unicode() a string that already IS unicode; for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe and the Python docs tutorials for a better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this:

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):

# -*- coding: utf-8 -*-

That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'

Answer 9

To improve on Alexander’s solution, from Python 2.6 onward (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, among various others: https://docs.python.org/2.6/library/curses.ascii.html

from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)

Answer 10

You could use the regular expression library which accepts the Posix standard [[:ASCII:]] definition.
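
For illustration, a minimal sketch of this approach, assuming the third-party regex package (the built-in re module does not support POSIX character classes):

import regex  # third-party package: pip install regex

def is_ascii(s):
    # the string is ASCII only if every character matches the POSIX [[:ascii:]] class
    return regex.fullmatch(r'[[:ascii:]]*', s) is not None

print(is_ascii("Python"))  # True
print(is_ascii("é"))       # False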


Answer 11

A string (str-type) in Python is a series of bytes. There is no way of telling, just from looking at the string, whether this series of bytes represents an ascii string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
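
A minimal sketch of that approach, assuming the byte string is known to be UTF-8 encoded (swap in whatever codec you actually know):

def is_ascii_bytes(raw, encoding='utf-8'):
    # decode to a unicode string first, then check every code point
    text = raw.decode(encoding)
    return all(ord(ch) < 128 for ch in text)

print(is_ascii_bytes(b'hello'))                  # True
print(is_ascii_bytes(u'café'.encode('utf-8')))   # False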


Answer 12

Like @RogerDahl’s answer, but it’s more efficient to short-circuit by negating the character class and using search instead of findall or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.
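
For repeated checks, the pattern can be precompiled into a small helper (a sketch along the same lines as the snippet above):

import re

_non_ascii = re.compile('[^\x00-\x7F]')

def is_ascii(s):
    # no match for any character outside \x00-\x7F means the whole string is ASCII
    return _non_ascii.search(s) is None

print(is_ascii('Did you catch that \x00?'))  # True
print(is_ascii('Did you catch that \xFF?'))  # False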


Answer 13

import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))

To include an empty string as ASCII, change the + to *.
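
For instance, with that one-character change applied:

import re

def is_ascii(s):
    # '*' instead of '+' so the empty string also counts as ASCII
    return bool(re.match(r'[\x00-\x7F]*$', s))

print(is_ascii(''))        # True
print(is_ascii('héllo'))   # False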


Answer 14

To prevent your code from crashing, you may want to use a try/except to catch TypeErrors:

>>> ord("¶")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

For example

def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False

Answer 15

I use the following to determine if the string is ascii or unicode:

>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>> 

Then just use a conditional block to define the function:

def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False
