How do I check if a string is Unicode or ASCII?

Question: How do I check if a string is Unicode or ASCII?

What do I have to do in Python to figure out which encoding a string has?


Answer 0

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish “Unicode or ASCII”; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a byte string may contain ASCII, encoded Unicode, or even non-textual data.
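On Python 3 the same kind of check can be written against str and bytes. The following is only a sketch, assuming Python 3: it reuses the name whatisthis from above but returns a label instead of printing it, and the label strings are illustrative.

```python
def whatisthis(s):
    """Classify s by its Python 3 type (not by its encoding)."""
    if isinstance(s, str):
        return "text string (str)"
    elif isinstance(s, bytes):
        return "byte string (bytes)"
    else:
        return "not a string"


print(whatisthis("abc"))   # text string (str)
print(whatisthis(b"abc"))  # byte string (bytes)
print(whatisthis(42))      # not a string
```

As with the Python 2 version, this says nothing about whether the content happens to be ASCII-only.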


Answer 1

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes. Python doesn’t know what its encoding is. The unicode type is the safer way to store text. If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.

In Python 3:

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2’s unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.


How to tell if a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, it wasn’t valid.

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
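
The decode test above can be wrapped in a small helper. This is a minimal sketch, assuming Python 3; the name is_valid_encoding is made up for illustration and is not part of any library.

```python
def is_valid_encoding(data, encoding):
    """Return True if the byte string decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b'\xc3\x9c'  # UTF-8 representation of the letter 'Ü'
print(is_valid_encoding(u_umlaut, 'utf-8'))  # True
print(is_valid_encoding(u_umlaut, 'ascii'))  # False
print(is_valid_encoding(b'abc', 'ascii'))    # True
```

A clean decode only shows that the bytes are valid in that encoding, not that they were written in it. For example, latin-1 decodes any byte sequence without error.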

Answer 2

In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check against str (which is the Unicode string type by default) should suffice.

isinstance(x, str)

In Python 2.x, most people seem to use an if statement with two checks, one for str and one for unicode.

If you want to check whether you have a ‘string-like’ object with a single statement, though, you can do the following:

isinstance(x, basestring)

Answer 3

Unicode is not an encoding – to quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are “text” …

…then Unicode is “text-ness”;

it is the abstract form of text

Have a read of McMillan’s Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.


Answer 4

If your code needs to be compatible with both Python 2 and Python 3, you can’t directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3, and bytes is undefined before Python 2.6 (from 2.6 onward it is just an alias for str).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here’s an example:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3,0,0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Those are both unpythonic, and most of the time there’s probably a better way.
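
The try/except route mentioned at the start of this answer might look like the following. It is only a sketch: text_type and to_native_str are names invented here, and the ASCII default is an assumption carried over from the examples above.

```python
try:
    text_type = unicode      # Python 2: the unicode builtin exists
except NameError:
    text_type = str          # Python 3: unicode is undefined, str is the text type


def to_native_str(s, encoding="ascii"):
    """Return s as the native str type of the running interpreter."""
    if isinstance(s, str):
        return s                      # already the native str type
    if isinstance(s, bytes):
        return s.decode(encoding)     # only reachable on Python 3
    if isinstance(s, text_type):
        return s.encode(encoding)     # only reachable on Python 2 (unicode)
    raise TypeError("expected a string, got %r" % type(s))


print(to_native_str(b"abc"))  # abc
print(to_native_str("abc"))   # abc
```

The name lookup is resolved once at import time, so the rest of the code can use text_type without further version checks.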


Answer 5

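Use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text string: unicode on Python 2, str on Python 3

Inside the six library, these types are defined as:

if PY3:
    string_types = str,
    text_type = str
else:
    string_types = basestring,
    text_type = unicode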

Answer 6

Note that on Python 3, it’s not really fair to say any of:

  • strs are UTFx for any x (e.g. UTF8)

  • strs are Unicode

  • strs are ordered collections of Unicode characters

Python’s str type is (normally) a sequence of Unicode code points, some of which map to characters.


Even on Python 3, it’s not as simple to answer this question as you might imagine.

An obvious way to test for ASCII-compatible strings is by an attempted encode:

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the cases.

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method distinguishes them.
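
Both encode tests can be turned into predicates. This is a minimal sketch, assuming Python 3; is_ascii and is_utf8_encodable are hypothetical helper names, not standard functions.

```python
def is_ascii(text):
    """Return True if every code point in text is in the ASCII range."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def is_utf8_encodable(text):
    """Return True if text can be encoded as UTF-8 (no lone surrogates)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("Hello there!"))                   # True
print(is_ascii("Hello there... \u2603!"))         # False
print(is_utf8_encodable("Hello there... \u2603!"))  # True
print(is_utf8_encodable("\udcc3"))                # False
```

On Python 3.7 and later, the built-in str.isascii() method gives the same answer as the first helper without the encode round trip.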


Answer 7

This may help someone else. I started out testing for the string type of the variable s, but for my application it made more sense to simply return s as UTF-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python-version agnostic without a version test or importing six. Please comment with improvements to the sample code below to help other people.

def return_utf(s):
    if isinstance(s, bytes):
        # Python 3 bytes or Python 2 str: assume it was already utf-8
        return s
    if isinstance(s, (int, float, complex)):
        # numbers: encode their string form
        return str(s).encode('utf-8')
    try:
        # Python 3 str or Python 2 unicode
        return s.encode('utf-8')
    except AttributeError:
        # no encode method: hand the object back unchanged
        return s

Answer 8

You could use Universal Encoding Detector, but be aware that it will only give you its best guess, not the actual encoding, because it is impossible to know the encoding of a string such as “abc”, for example. You will need to get the encoding information elsewhere; for example, the HTTP protocol uses the Content-Type header for that.
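
As an illustration of taking the encoding from metadata rather than guessing, here is a sketch that reads the charset parameter out of a Content-Type header value using only the standard library. The helper names charset_from_content_type and decode_body, and the UTF-8 fallback, are assumptions made for this example.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None if absent
    return msg.get_content_charset() or default


def decode_body(body, content_type):
    """Decode a byte string using the charset declared in the header."""
    return body.decode(charset_from_content_type(content_type))


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(decode_body(b"\xdc", "text/html; charset=ISO-8859-1"))       # Ü
```

If the header carries no charset, the fallback is still only a guess, which is the same limitation described above.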


Answer 9

For py2/py3 compatibility, simply use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text string


Answer 10

One simple approach is to check whether unicode is a builtin name. If it is, you’re on Python 2 and a plain string literal is a byte string (str). To ensure everything is unicode, one can do:

try:
    import builtins                  # Python 3
except ImportError:
    import __builtin__ as builtins   # Python 2 names the module __builtin__

i = 'cats'
if 'unicode' in dir(builtins):     # True in python 2, False in 3
  i = unicode(i)