Tag archive: ascii

Convert int to ASCII and back in Python

Question: Convert int to ASCII and back in Python

I’m working on making a URL shortener for my site, and my current plan (I’m open to suggestions) is to use a node ID to generate the shortened URL. So, in theory, node 26 might be short.com/z, node 1 might be short.com/a, node 52 might be short.com/Z, and node 104 might be short.com/ZZ. When a user goes to that URL, I need to reverse the process (obviously).

I can think of some kludgy ways to go about this, but I’m guessing there are better ones. Any suggestions?


Answer 0

ASCII to int:

ord('a')

gives 97

And back to a string:

  • in Python2: str(unichr(97))
  • in Python3: chr(97)

gives 'a'


Answer 1

>>> ord("a")
97
>>> chr(97)
'a'

Answer 2

If multiple characters are bound inside a single integer/long, as was my issue:

s = '0123456789'
nchars = len(s)
# string to int or long. Type depends on nchars
x = sum(ord(s[byte])<<8*(nchars-byte-1) for byte in range(nchars))
# int or long to string
''.join(chr((x>>8*(nchars-byte-1))&0xFF) for byte in range(nchars))

Yields '0123456789' and x = 227581098929683594426425L
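
On Python 3 there is no separate long type, and the same packing can be sketched with int.from_bytes and int.to_bytes. This is a minimal sketch that assumes pure-ASCII input and big-endian byte order (which matches the shift arithmetic above); the variable names are only illustrative.

```python
# Python 3 sketch: pack an ASCII string into one integer and unpack it again.
s = '0123456789'
nchars = len(s)

# string -> int: one byte per character, most significant byte first
x = int.from_bytes(s.encode('ascii'), 'big')

# int -> string: nchars must be known so leading zero bytes are restored
restored = x.to_bytes(nchars, 'big').decode('ascii')

print(x)
print(restored)
```

The length has to be stored or agreed on separately, because the integer alone does not record how many leading zero bytes the original string had.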


Answer 3

What about BASE58 encoding the URL? Like for example flickr does.

# note the missing lowercase L and the zero etc.
BASE58 = '123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ'

def shorten(node_id):
    url = ''
    while node_id >= 58:
        node_id, mod = divmod(node_id, 58)
        url = BASE58[mod] + url
    return 'http://short.com/%s' % (BASE58[node_id] + url)

Turning that back into a number isn’t a big deal either.
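
The reverse direction could look roughly like the sketch below. It assumes the same BASE58 alphabet as above and that only the path part after short.com/ is passed in; the function name base58_decode is hypothetical, not something from the answer.

```python
BASE58 = '123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ'

def base58_decode(code):
    """Turn a base58 string back into the integer it encodes."""
    node_id = 0
    for char in code:
        # str.index raises ValueError for characters outside the alphabet
        node_id = node_id * 58 + BASE58.index(char)
    return node_id

print(base58_decode('21'))
```

Characters that are not in the alphabet (such as 0 or lowercase l) raise ValueError, which a URL handler would probably want to turn into a 404.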


Answer 4

Use hex(id)[2:] and int(urlpart, 16). There are other options. base32 encoding your id could work as well, but I don’t know that there’s any library that does base32 encoding built into Python.

Apparently a base32 encoder was introduced in Python 2.4 with the base64 module. You might try using b32encode and b32decode. You should give True for both the casefold and map01 options to b32decode in case people write down your shortened URLs.

Actually, I take that back. I still think base32 encoding is a good idea, but that module is not useful for the case of URL shortening. You could look at the implementation in the module and make your own for this specific case. :-)
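
For the hexadecimal option, a round trip might be sketched like this. The helper names are hypothetical, and format(n, 'x') is used here as an equivalent of hex(n)[2:] for non-negative integers.

```python
def id_to_urlpart(node_id):
    """Render a non-negative node ID as lowercase hex without the 0x prefix."""
    return format(node_id, 'x')

def urlpart_to_id(urlpart):
    """Parse the hex URL part back into the node ID."""
    return int(urlpart, 16)

print(id_to_urlpart(255))
print(urlpart_to_id('ff'))
```

int(urlpart, 16) accepts upper and lower case digits alike, which helps if people retype a short URL by hand.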


Python: how to print range a-z?

Question: Python: how to print range a-z?

1. Print a-n: a b c d e f g h i j k l m n

2. Every second in a-n: a c e g i k m

3. Append a-n to index of urls{hello.com/, hej.com/, …, hallo.com/}: hello.com/a hej.com/b … hallo.com/n


Answer 0

>>> import string
>>> string.ascii_lowercase[:14]
'abcdefghijklmn'
>>> string.ascii_lowercase[:14:2]
'acegikm'

To do the urls, you could use something like this

[i + j for i, j in zip(list_of_urls, string.ascii_lowercase[:14])]

Answer 1

Assuming this is homework ;-) there is no need to summon libraries etc.; it probably expects you to use range() with chr/ord, like so:

for i in range(ord('a'), ord('n')+1):
    print chr(i),

For the rest, just play a bit more with the range()
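
As a rough Python 3 sketch of what playing with range() could lead to, the three tasks from the question might be written as follows. The three-element url list is a shortened stand-in for the question's list, and the variable names are illustrative.

```python
# 1. every letter from a to n
a_to_n = [chr(i) for i in range(ord('a'), ord('n') + 1)]
print(' '.join(a_to_n))

# 2. every second letter from a to n (step of 2)
every_second = [chr(i) for i in range(ord('a'), ord('n') + 1, 2)]
print(' '.join(every_second))

# 3. append one letter to each url; zip stops at the shorter sequence
urls = ['hello.com/', 'hej.com/', 'hallo.com/']
print([url + letter for url, letter in zip(urls, a_to_n)])
```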


Answer 2

Hints:

import string
print string.ascii_lowercase

and

for i in xrange(0, 10, 2):
    print i

and

"hello{0}, world!".format('z')

Answer 3

for one in range(97,111):
    print chr(one)

Answer 4

Get a list with the desired values

small_letters = map(chr, range(ord('a'), ord('z')+1))
big_letters = map(chr, range(ord('A'), ord('Z')+1))
digits = map(chr, range(ord('0'), ord('9')+1))

or

import string
string.letters
string.uppercase
string.digits

This solution uses the ASCII table. ord gets the ascii value from a character and chr vice versa.

Apply what you know about lists

>>> small_letters = map(chr, range(ord('a'), ord('z')+1))

>>> an = small_letters[0:(ord('n')-ord('a')+1)]
>>> print(" ".join(an))
a b c d e f g h i j k l m n

>>> print(" ".join(small_letters[0::2]))
a c e g i k m o q s u w y

>>> s = small_letters[0:(ord('n')-ord('a')+1):2]
>>> print(" ".join(s))
a c e g i k m

>>> urls = ["hello.com/", "hej.com/", "hallo.com/"]
>>> print([x + y for x, y in zip(urls, an)])
['hello.com/a', 'hej.com/b', 'hallo.com/c']

Answer 5

import string
print list(string.ascii_lowercase)
# ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Answer 6

import string
print list(string.ascii_lowercase)
# ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

and

for c in list(string.ascii_lowercase)[:5]:
    ...operation with the first 5 characters

Answer 7

myList = [chr(chNum) for chNum in list(range(ord('a'),ord('z')+1))]
print(myList)

Output

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Answer 8

#1)
print " ".join(map(chr, range(ord('a'),ord('n')+1)))

#2)
print " ".join(map(chr, range(ord('a'),ord('n')+1,2)))

#3)
urls = ["hello.com/", "hej.com/", "hallo.com/"]
an = map(chr, range(ord('a'),ord('n')+1))
print [ x + y for x,y in zip(urls, an)]

Answer 9

The answer to this question is simple, just make a string called ABC like so:

ABC = 'abcdefghijklmnopqrstuvwxyz'

And whenever you need to refer to it, just do:

print ABC[0:10] #prints abcdefghij
print ABC       #prints abcdefghijklmnopqrstuvwxyz
for x in range(0,26):
    if x % 2 == 0:
        print ABC[x] #prints a, c, e, ..., y, one per line (all odd numbered letters)

Also try this to break your device :D

##Try this and call it AlphabetSoup.py:

ABC = 'abcdefghijklmnopqrstuvwxyz'


try:
    while True:
        for a in ABC:
            for b in ABC:
                for c in ABC:
                    for d in ABC:
                        for e in ABC:
                            for f in ABC:
                                print a, b, c, d, e, f, '    ',
except KeyboardInterrupt:
    pass

Answer 10

Try:

strng = ""
for i in range(97,123):
    strng = strng + chr(i)
print(strng)

Answer 11

This is your 2nd question: string.lowercase[ord('a')-97:ord('n')-97:2] because 97==ord('a') — if you want to learn a bit you should figure out the rest yourself ;-)


Answer 12

import string
list(string.ascii_lowercase)

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Answer 13

I hope this helps:

import string

alphas = list(string.ascii_letters[:26])
for ch in alphas:  # named ch so the built-in chr() is not shadowed
    print(ch)

Answer 14

About gnibbler’s answer.

The zip function (full explanation) returns a list of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. The [...] construct is called a list comprehension, a very cool feature!


Answer 15

Another way to do it

  import string

  aalist = list(string.ascii_lowercase)
  aaurls = ['alpha.com','bravo.com','chrly.com','delta.com',]
  iilen  = len(aaurls)

  ans01 = "".join(aalist[0:14])
  ans02 = "".join(aalist[0:14:2])
  ans03 = "".join( "{vurl}/{vl}\n".format(vl=vjj[1],vurl=aaurls[vjj[0] % iilen]) for vjj in enumerate(aalist[0:14]) )

  print(ans01)
  print(ans02)
  print(ans03)

Result

abcdefghijklmn
acegikm
alpha.com/a
bravo.com/b
chrly.com/c
delta.com/d
alpha.com/e
bravo.com/f
chrly.com/g
delta.com/h
alpha.com/i
bravo.com/j
chrly.com/k
delta.com/l
alpha.com/m
bravo.com/n

How this differs from the other replies

  • iterate over an arbitrary number of base urls
  • cycle through the urls and do not stop until we run out of letters
  • use enumerate in conjunction with list comprehension and str.format

Python Unicode encode error

Question: Python Unicode encode error

I’m reading and parsing an Amazon XML file and while the XML file shows a ’ (U+2019), when I try to print it I get the following error:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128) 

From what I’ve read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?


Answer 0

Likely, your problem is that you parsed it okay, and now you’re trying to print the contents of the XML and you can’t because there are some foreign Unicode characters. Try to encode your unicode string as ascii first:

unicodeData.encode('ascii', 'ignore')

The 'ignore' part will tell it to just skip those characters. From the Python docs:

>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what’s going on. After the read, you’ll stop feeling like you’re just guessing what commands to use (or at least that happened to me).
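
For readers on Python 3, the same error handlers can be sketched as below. This assumes Python 3 semantics, where chr replaces unichr and encode returns bytes rather than str; the variable names are illustrative.

```python
# Python 3 sketch of the error handlers shown in the session above.
text = chr(40960) + 'abcd' + chr(1972)

ignored = text.encode('ascii', 'ignore')               # drops unencodable characters
replaced = text.encode('ascii', 'replace')             # substitutes '?' for them
xml_refs = text.encode('ascii', 'xmlcharrefreplace')   # emits &#NNNN; references

try:
    text.encode('ascii')                               # default handler is 'strict'
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ignored, replaced, xml_refs, strict_failed)
```

For XML output, xmlcharrefreplace is the only one of the three that loses no information, since a parser turns the references back into the original characters.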


Answer 1

A better solution:

if type(value) == str:
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)

If you would like to read more about why:

http://docs.plone.org/manage/troubleshooting/unicode.html#id1


Answer 2

Don’t hardcode the character encoding of your environment inside your script; print Unicode text directly instead:

assert isinstance(text, unicode) # or str on Python 3
print(text)

If your output is redirected to a file (or a pipe), you could use the PYTHONIOENCODING envvar to specify the character encoding:

$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8

Otherwise, python your_script.py should work as is — your locale settings are used to encode the text (on POSIX check: LC_ALL, LC_CTYPE, LANG envvars — set LANG to a utf-8 locale if necessary).

To print Unicode on Windows, see this answer that shows how to print Unicode to Windows console, to a file, or using IDLE.


Answer 3

Excellent post : http://www.carlosble.com/2010/12/understanding-python-and-unicode/

# -*- coding: utf-8 -*-

def __if_number_get_string(number):
    converted_str = number
    if isinstance(number, int) or \
            isinstance(number, float):
        converted_str = str(number)
    return converted_str


def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')


def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode

Answer 4

You can use something of the form

s.decode('utf-8')

which will convert a UTF-8 encoded bytestring into a Python Unicode string. But the exact procedure to use depends on exactly how you load and parse the XML file, e.g. if you don’t ever access the XML string directly, you might have to use a decoder object from the codecs module.
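
As one possible illustration of a decoder object from the codecs module, the sketch below feeds UTF-8 bytes to an incremental decoder in chunks. It is written for Python 3, and the chunk boundary is an assumption chosen to fall in the middle of a multi-byte character, the way it might when reading from a socket or a parser callback.

```python
import codecs

# An incremental decoder buffers incomplete multi-byte sequences between calls.
decoder = codecs.getincrementaldecoder('utf-8')()

# The three bytes of U+2019 (e2 80 99) are split across the two chunks.
chunks = [b'I don\xe2\x80', b'\x99t like this']

text = ''.join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b'', final=True)  # flush; raises if bytes are left over

print(text)
```

Calling bytes.decode('utf-8') on each chunk separately would fail here, because neither chunk is valid UTF-8 on its own.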


Answer 5

I wrote the following to fix the nuisance non-ascii quotes and force conversion to something usable.

unicodeToAsciiMap = {u'\u2019':"'", u'\u2018':"`", }

def unicodeToAscii(inStr):
    try:
        return str(inStr)
    except:
        pass
    outStr = ""
    for i in inStr:
        try:
            outStr = outStr + str(i)
        except:
            if unicodeToAsciiMap.has_key(i):
                outStr = outStr + unicodeToAsciiMap[i]
            else:
                try:
                    print "unicodeToAscii: add to map:", i, repr(i), "(encoded as _)"
                except:
                    print "unicodeToAscii: unknown code (encoded as _)", repr(i)
                outStr = outStr + "_"
    return outStr

Answer 6

If you need to print an approximate representation of the string to the screen, rather than ignoring those nonprintable characters, please try unidecode package here:

https://pypi.python.org/pypi/Unidecode

The explanation is found here:

https://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

This is better than using the u.encode('ascii', 'ignore') for a given string u, and can save you from unnecessary headache if character precision is not what you are after, but still want to have human readability.

Wirawan


Answer 7

Try adding the following line at the top of your python script.

# _*_ coding:utf-8 _*_

Answer 8

Python 3.5, 2018

If you don’t know what the encoding is but the unicode parser is having issues, you can open the file in Notepad++ and in the top bar select Encoding->Convert to ANSI. Then you can write your Python like this:

with open('filepath', 'r', encoding='ANSI') as file:
    for word in file.read().split():
        print(word)

Reading characters from a file in Python

Question: Reading characters from a file in Python

In a text file, there is a string “I don’t like this”.

However, when I read it into a string, it becomes “I don\xe2\x80\x98t like this”. I understand that \u2018 is the unicode representation of “‘”. I use

f1 = open (file1, "r")
text = f1.read()

command to do the reading.

Now, is it possible to read the string in such a way that when it is read into the string, it is “I don’t like this”, instead of “I don\xe2\x80\x98t like this”?

Second edit: I have seen some people use mapping to solve this problem, but really, is there no built-in conversion that does this kind of ANSI to unicode ( and vice versa) conversion?


Answer 0

Ref: http://docs.python.org/howto/unicode

Reading Unicode from a file is therefore simple:

import codecs
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print repr(line)

It’s also possible to open files in update mode, allowing both reading and writing:

with codecs.open('test', encoding='utf-8', mode='w+') as f:
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])

EDIT: I’m assuming that your intended goal is just to be able to read the file properly into a string in Python. If you’re trying to convert to an ASCII string from Unicode, then there’s really no direct way to do so, since the Unicode characters won’t necessarily exist in ASCII.

If you’re trying to convert to an ASCII string, try one of the following:

  1. Replace the specific unicode chars with ASCII equivalents, if you are only looking to handle a few special cases such as this particular example

  2. Use the unicodedata module’s normalize() and the string.encode() method to convert as best you can to the next closest ASCII equivalent (Ref https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python):

    >>> teststr
    u'I don\xe2\x80\x98t like this'
    >>> unicodedata.normalize('NFKD', teststr).encode('ascii', 'ignore')
    'I donat like this'
    

Answer 1

There are a few points to consider.

A \u2018 character may appear only as a fragment of representation of a unicode string in Python, e.g. if you write:

>>> text = u'‘'
>>> print repr(text)
u'\u2018'

Now if you simply want to print the unicode string prettily, just use unicode’s encode method:

>>> text = u'I don\u2018t like this'
>>> print text.encode('utf-8')
I don‘t like this

To make sure that every line from any file would be read as unicode, you’d better use the codecs.open function instead of just open, which allows you to specify file’s encoding:

>>> import codecs
>>> f1 = codecs.open(file1, "r", "utf-8")
>>> text = f1.read()
>>> print type(text)
<type 'unicode'>
>>> print text.encode('utf-8')
I don‘t like this

Answer 2

But it really is “I don\u2018t like this” and not “I don’t like this”. The character u'\u2018' is a completely different character than "'" (and, visually, should correspond more to '`').

If you’re trying to convert encoded unicode into plain ASCII, you could perhaps keep a mapping of unicode punctuation that you would like to translate into ASCII.

punctuation = {
  u'\u2018': "'",
  u'\u2019': "'",
}
for src, dest in punctuation.iteritems():
  text = text.replace(src, dest)

There are an awful lot of punctuation characters in unicode, but I suppose you can count on only a few of them actually being used by whatever application is creating the documents you’re reading.
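
On Python 3 the same mapping idea can be sketched with str.translate, which replaces every mapped character in a single pass. The two double-quote entries and the function name are additions for illustration and are not part of the answer above.

```python
# Map typographic punctuation to plain ASCII equivalents.
PUNCTUATION = {
    '\u2018': "'",   # left single quotation mark
    '\u2019': "'",   # right single quotation mark
    '\u201c': '"',   # left double quotation mark (added for illustration)
    '\u201d': '"',   # right double quotation mark (added for illustration)
}
TABLE = str.maketrans(PUNCTUATION)

def asciify_punctuation(text):
    """Return text with the mapped punctuation replaced by ASCII characters."""
    return text.translate(TABLE)

print(asciify_punctuation('I don\u2018t like this'))
```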


Answer 3

It is also possible to read an encoded text file using the python 3 read method:

f = open('file.txt', 'r', encoding='utf-8')
text = f.read()
f.close()

With this variation, there is no need to import any additional libraries
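
A slightly more defensive version of the same idea uses a with block, so the file is closed even if reading fails. This is only a sketch; the helper name read_text and its default encoding are assumptions.

```python
def read_text(path, encoding='utf-8'):
    """Read a whole text file and return it as a decoded str."""
    # The with block closes the file even when read() raises.
    with open(path, 'r', encoding=encoding) as f:
        return f.read()
```

A call such as read_text('file.txt') then returns text in which a curly quote appears as a single character instead of the raw bytes \xe2\x80\x98.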


Answer 4

Leaving aside the fact that your text file is broken (U+2018 is a left quotation mark, not an apostrophe): iconv can be used to transliterate unicode characters to ascii.

You’ll have to google for “iconvcodec”, since the module seems not to be supported anymore and I can’t find a canonical home page for it.

>>> import iconvcodec
>>> from locale import setlocale, LC_ALL
>>> setlocale(LC_ALL, '')
>>> u'\u2018'.encode('ascii//translit')
"'"

Alternatively you can use the iconv command line utility to clean up your file:

$ xxd foo
0000000: e280 980a                                ....
$ iconv -t 'ascii//translit' foo | xxd
0000000: 270a                                     '.

Answer 5

There is a possibility that somehow you have a non-unicode string with unicode escape characters, e.g.:

>>> print repr(text)
'I don\\u2018t like this'

This actually happened to me once before. You can use a unicode_escape codec to decode the string to unicode and then encode it to any format you want:

>>> uni = text.decode('unicode_escape')
>>> print type(uni)
<type 'unicode'>
>>> print uni.encode('utf-8')
I don‘t like this

Answer 6

This is Python's way to show you unicode encoded strings. But I think you should be able to print the string on the screen or write it into a new file without any problems.

>>> test = u"I don\u2018t like this"
>>> test
u'I don\u2018t like this'
>>> print test
I don‘t like this

Answer 7

Actually, U+2018 is the Unicode representation of the special character ‘ . If you want, you can convert instances of that character to U+0027 with this code:

text = text.replace (u"\u2018", "'")

In addition, what are you using to write the file? f1.read() should return a string that looks like this:

'I don\xe2\x80\x98t like this'

If it’s returning this string, the file is being written incorrectly:

'I don\u2018t like this'

How can I remove non-ASCII characters but leave periods and spaces using Python?

Question: How can I remove non-ASCII characters but leave periods and spaces using Python?

I’m working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I’m stripping those too. Here’s the code:

def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it’s not too complicated but I can’t figure it out.


回答 0

您可以使用string.printable过滤字符串中所有不可打印的字符,如下所示:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

我机器上的string.printable包含:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

编辑:在Python 3上,筛选器将返回可迭代。返回字符串的正确方法是:

''.join(filter(lambda x: x in printable, s))

You can filter all characters from the string that are not printable using string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter will return an iterable. The correct way to obtain a string back would be:

''.join(filter(lambda x: x in printable, s))
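
For Python 3, the same idea can be wrapped in a small helper. This is a minimal sketch; the names PRINTABLE and keep_printable are made up for illustration and are not part of the string module.

```python
import string

# Build the lookup set once; membership tests on a frozenset are O(1).
PRINTABLE = frozenset(string.printable)


def keep_printable(s):
    """Return s with every character outside string.printable removed."""
    return "".join(ch for ch in s if ch in PRINTABLE)


print(keep_printable("some\x00string. with\x15 funny characters"))
```

Spaces and periods survive because both are part of string.printable.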

回答 1

更改为其他编解码器的简单方法是使用encode()或decode()。在您的情况下,您想转换为ASCII并忽略所有不支持的符号。例如,瑞典字母å不是ASCII字符:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

编辑:

Python3:str->字节-> str

>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'

Python2:unicode-> str-> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2:str-> unicode-> str(以相反的顺序解码和编码)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'

An easy way to change to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all symbols that are not supported. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
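
The round trip can be packaged as two small Python 3 helpers. This is only a sketch: strip_non_ascii and mark_non_ascii are hypothetical names, and the second one uses the standard "replace" error handler to show what was dropped.

```python
def strip_non_ascii(text):
    """Drop every character that the ASCII codec cannot encode."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def mark_non_ascii(text):
    """Replace every non-ASCII character with '?' instead of dropping it."""
    return text.encode("ascii", errors="replace").decode("ascii")


print(strip_non_ascii("Hej d\xe5. Good bye"))
print(mark_non_ascii("Hej d\xe5. Good bye"))
```

Both helpers leave spaces and periods alone, since those are ordinary ASCII characters.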

回答 2

根据@artfulrobot,这应该比filter和lambda更快:

re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

在此处查看更多示例 http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space/20079244#20079244

According to @artfulrobot, this should be faster than filter and lambda:

re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

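See more examples here: Replace non-ASCII characters with a single space (http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space/20079244#20079244)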
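
A self-contained Python 3 sketch of the regex approach follows. The helper names are illustrative, and the second function is just one possible way to substitute a single space for each run of non-ASCII characters.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def remove_non_ascii(text):
    """Delete every non-ASCII character."""
    return NON_ASCII.sub("", text)


def non_ascii_to_space(text):
    """Collapse each run of non-ASCII characters into one space."""
    return re.sub(r"[^\x00-\x7f]+", " ", text)


print(remove_non_ascii("d\xe5. ok"))
print(non_ascii_to_space("a\u6731\u60e0b"))
```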


回答 3

您的问题不明确;前两个句子加在一起表示您认为空格和“句点”是非ASCII字符。这是不正确的。所有满足ord(char) <= 127的字符都是ASCII字符。例如,您的函数排除了这些字符 !"#$%&'()*+,-./,但保留了其他几个字符,例如 []{}。

请退后一步,三思而后行,然后编辑您的问题以告诉我们您要做什么,而无需提及ASCII单词,以及为什么您认为ord(char)> = 128这样的chars是可忽略的。另外:哪个版本的Python?输入数据的编码是什么?

请注意,您的代码将整个输入文件读取为单个字符串,并且您对另一个答案的注释(“最佳解决方案”)意味着您无需关心数据中的换行符。如果您的文件包含这样的两行:

this is line 1
this is line 2

结果将是'this is line 1this is line 2'……您真正想要的是什么?

更好的解决方案包括:

  1. 为过滤函数取一个比onlyascii更好的名字
  2. 认识到如果要保留参数,过滤函数只需要返回一个真值:

    def filter_func(char):
        return char == '\n' or 32 <= ord(char) <= 126
    # and later:
    filtered_data = filter(filter_func, data).lower()

Your question is ambiguous; the first two sentences taken together imply that you believe that space and “period” are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&'()*+,-./ but includes several others e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment (“great solution”) to another answer implies that you don’t care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' … is that what you really want?

A greater solution would include:

  1. a better name for the filter function than onlyascii
  2. recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

    def filter_func(char):
        return char == '\n' or 32 <= ord(char) <= 126
    # and later:
    filtered_data = filter(filter_func, data).lower()
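
In Python 3, filter() returns an iterator, so the result has to be joined back into a string. A minimal sketch of the same predicate, with hypothetical names is_wanted and clean_text:

```python
def is_wanted(char):
    """True for a newline and for printable ASCII (space through tilde)."""
    return char == "\n" or 32 <= ord(char) <= 126


def clean_text(data):
    """Keep only wanted characters, then lowercase the result."""
    return "".join(filter(is_wanted, data)).lower()


print(clean_text("This is line 1.\nThis is l\xefne 2.\x00"))
```

Keeping "\n" in the predicate is what prevents the two lines from running together.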
    

回答 4

您可以使用以下代码删除非英语字母:

import re
str = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]',r'', str)
print(result)

这将返回

123456790 ABC#%? .()

You can use the following code to remove non-ASCII characters:

import re
text = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]', r'', text)
print(result)

This will return

123456790 ABC#%? .()


回答 5

如果您需要可打印的ASCII字符,则可能应将代码更正为:

if ord(char) < 32 or ord(char) > 126: return ''

这等同于string.printable(见@jterrace的答案),只是不包含回车和制表符等空白字符('\t'、'\n'、'\x0b'、'\x0c'和'\r'),但与您问题中的范围并不一致。

If you want printable ascii characters you probably should correct your code to:

if ord(char) < 32 or ord(char) > 126: return ''

This is equivalent to string.printable (see the answer from @jterrace), except that it omits returns and tabs ('\t', '\n', '\x0b', '\x0c' and '\r'), but it doesn't correspond to the range in your question.


回答 6

我正在学习《Fluent Python》(Ramalho),强烈推荐。下面是受第二章启发的列表推导式(近似)单行写法:

onlyascii = ''.join([s for s in data if ord(s) < 128])
onlymatch = ''.join([s for s in data if s in
              'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])

Working my way through Fluent Python (Ramalho) – highly recommended. List comprehension one-ish-liners inspired by Chapter 2:

onlyascii = ''.join([s for s in data if ord(s) < 128])
onlymatch = ''.join([s for s in data if s in
              'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])

有没有包含所有ascii字符列表的Python库?

问题:有没有包含所有ascii字符列表的Python库?

如下所示:

import ascii

print ascii.charlist()

会返回类似[A,B,C,D …]的内容

Something like below:

import ascii

print ascii.charlist()

Which would return something like [A, B, C, D…]


回答 0

string常量可能正是您想要的。(docs)

>>> import string
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

如果要所有可打印字符:

>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

The string constants may be what you want. (docs)

>>> import string
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

If you want all printable characters:

>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

回答 1

这里是:

[chr(i) for i in xrange(128)]

Here it is:

[chr(i) for i in xrange(128)]

回答 2

ASCII定义128个字符,其字节值范围从0到127(包括0和127)。因此,要获取所有ASCII字符的字符串,您可以

''.join([chr(i) for i in range(128)])

其中只有一些是可打印的,但是-可打印的ASCII字符可以通过Python通过以下方式访问

import string
string.printable

ASCII defines 128 characters whose byte values range from 0 to 127 inclusive. So to get a string of all the ASCII characters, you could just do

''.join([chr(i) for i in range(128)])

Only some of those are printable, however- the printable ASCII characters can be accessed in Python via

import string
string.printable
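
A short Python 3 sketch ties the two together by splitting the 128 ASCII characters into the ones listed in string.printable and the rest. The variable names are illustrative.

```python
import string

# All 128 ASCII characters, code points 0 through 127 inclusive.
all_ascii = "".join(chr(i) for i in range(128))

# Partition them by membership in string.printable.
printable = [c for c in all_ascii if c in string.printable]
control = [c for c in all_ascii if c not in string.printable]

print(len(all_ascii), len(printable), len(control))
```

Note that string.printable counts tab, newline and the other whitespace characters as printable, which is why the split is 100 to 28 rather than 95 to 33.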

回答 3

由于ASCII可打印字符是一个很小的列表(值在32到126之间的字节),因此在需要时很容易生成:

>>> for c in (chr(i) for i in range(32,127)):
...     print c
... 

!
"
#
$
%
... # a few lines removed :)
y
z
{
|
}
~

Since ASCII printable characters are a pretty small list (bytes with values from 32 to 126 inclusive), it’s easy enough to generate when you need:

>>> for c in (chr(i) for i in range(32,127)):
...     print c
... 

!
"
#
$
%
... # a few lines removed :)
y
z
{
|
}
~

回答 4

for i in range(0,128):
    print chr(i)

试试这个!

for i in range(0,128):
    print chr(i)

Try this!


回答 5

您可以在没有模块的情况下执行此操作:

    characters = list(map(chr, range(97,123)))

输入characters并应打印["a","b","c", ... ,"x","y","z"]。对于大写使用:

    characters=list(map(chr,range(65,91)))

可以使用任何范围(包括带步长的范围),因为在Python 3中chr()接受任何Unicode代码点。因此,扩大range()即可将更多字符添加到列表中。
map()会对range()产生的每个值调用chr()。

You can do this without a module:

    characters = list(map(chr, range(97,123)))

Type characters and it should print ["a","b","c", ... ,"x","y","z"]. For uppercase use:

    characters=list(map(chr,range(65,91)))

Any range (including one with a step) can be used for this, because in Python 3 chr() accepts any Unicode code point. Widen the range() to add more characters to the list.
map() calls chr() on each value produced by range().
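
A few variations on the same pattern, as a Python 3 sketch. The Greek range is only an illustration of going beyond ASCII; the variable names are made up.

```python
# chr() maps a code point to its character; map() applies it to each value.
lowercase = list(map(chr, range(97, 123)))

# ord() avoids hard-coding the numeric bounds.
uppercase = list(map(chr, range(ord("A"), ord("Z") + 1)))

# A step of 2 picks every other letter.
every_other = list(map(chr, range(97, 123, 2)))

# Ranges outside ASCII work as well (Greek alpha through epsilon).
greek = list(map(chr, range(0x3B1, 0x3B6)))

print(lowercase[:3], uppercase[-3:], every_other[:3], greek)
```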


回答 6

不,没有,但是您可以轻松地制作一个:

    #Your ascii.py program:
    def charlist(begin, end):
        charlist = []
        for i in range(begin, end):
            charlist.append(chr(i))
        return ''.join(charlist)

    #Python shell:
    #import ascii
    #print(ascii.charlist(50, 100))
    #Comes out as:

    #23456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abc

No, there isn’t, but you can easily make one:

    #Your ascii.py program:
    def charlist(begin, end):
        charlist = []
        for i in range(begin, end):
            charlist.append(chr(i))
        return ''.join(charlist)

    #Python shell:
    #import ascii
    #print(ascii.charlist(50, 100))
    #Comes out as:

    #23456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abc

从十六进制编码的ASCII字符串转换为纯ASCII?

问题:从十六进制编码的ASCII字符串转换为纯ASCII?

如何在Python中将十六进制转换为纯ASCII?

请注意,例如,我要将“ 0x7061756c”转换为“ paul”。

How can I convert from hex to plain ASCII in Python?

Note that, for example, I want to convert “0x7061756c” to “paul”.


回答 0

一个稍微简单的解决方案:

>>> "7061756c".decode("hex")
'paul'

A slightly simpler solution:

>>> "7061756c".decode("hex")
'paul'

回答 1

无需导入任何库:

>>> bytearray.fromhex("7061756c").decode()
'paul'

No need to import any library:

>>> bytearray.fromhex("7061756c").decode()
'paul'

回答 2

>>> txt = '7061756c'
>>> ''.join([chr(int(''.join(c), 16)) for c in zip(txt[0::2],txt[1::2])])
'paul'                                                                          

我只是很开心,但重要的部分是:

>>> int('0a',16)         # parse hex
10
>>> ''.join(['a', 'b'])  # join characters
'ab'
>>> 'abcd'[0::2]         # alternates
'ac'
>>> zip('abc', '123')    # pair up
[('a', '1'), ('b', '2'), ('c', '3')]        
>>> chr(32)              # ascii to character
' '

现在将看Binascii …

>>> import binascii
>>> print binascii.unhexlify('7061756c')
paul

很酷(而且我不知道为什么其他人想让您跳入困境,然后他们才会帮助您)。

>>> txt = '7061756c'
>>> ''.join([chr(int(''.join(c), 16)) for c in zip(txt[0::2],txt[1::2])])
'paul'                                                                          

i’m just having fun, but the important parts are:

>>> int('0a',16)         # parse hex
10
>>> ''.join(['a', 'b'])  # join characters
'ab'
>>> 'abcd'[0::2]         # alternates
'ac'
>>> zip('abc', '123')    # pair up
[('a', '1'), ('b', '2'), ('c', '3')]        
>>> chr(32)              # ascii to character
' '

will look at binascii now…

>>> import binascii
>>> print binascii.unhexlify('7061756c')
paul

cool (and i have no idea why other people want to make you jump through hoops before they’ll help).


回答 3

在Python 2中:

>>> "7061756c".decode("hex")
'paul'

在Python 3中:

>>> bytes.fromhex('7061756c').decode('utf-8')
'paul'

In Python 2:

>>> "7061756c".decode("hex")
'paul'

In Python 3:

>>> bytes.fromhex('7061756c').decode('utf-8')
'paul'
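
The question's input carries a "0x" prefix, which bytes.fromhex() does not accept. A minimal Python 3 sketch that strips it first; hex_to_ascii is a hypothetical helper name:

```python
def hex_to_ascii(hex_string):
    """Decode a hex string such as '0x7061756c' or '7061756c' to text."""
    if hex_string.lower().startswith("0x"):
        hex_string = hex_string[2:]
    # fromhex() turns each pair of hex digits into one byte.
    return bytes.fromhex(hex_string).decode("ascii")


print(hex_to_ascii("0x7061756c"))
```

Decoding with "ascii" raises UnicodeDecodeError on any byte above 0x7f, which is a reasonable safeguard when plain ASCII is expected.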

回答 4

这是使用十六进制整数而不是十六进制字符串时的解决方案:

def convert_hex_to_ascii(h):
    chars_in_reverse = []
    while h != 0x0:
        chars_in_reverse.append(chr(h & 0xFF))
        h = h >> 8

    chars_in_reverse.reverse()
    return ''.join(chars_in_reverse)

print convert_hex_to_ascii(0x7061756c)

Here’s my solution when working with hex integers and not hex strings:

def convert_hex_to_ascii(h):
    chars_in_reverse = []
    while h != 0x0:
        chars_in_reverse.append(chr(h & 0xFF))
        h = h >> 8

    chars_in_reverse.reverse()
    return ''.join(chars_in_reverse)

print convert_hex_to_ascii(0x7061756c)
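
In Python 3 the same conversion can lean on int.to_bytes() instead of a manual shift loop. This is a sketch under the assumption that the integer holds big-endian ASCII bytes; int_to_ascii is an illustrative name.

```python
def int_to_ascii(value):
    """Interpret a non-negative integer as big-endian ASCII bytes."""
    # Number of whole bytes needed to hold the value (0 bytes for 0).
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, byteorder="big").decode("ascii")


print(int_to_ascii(0x7061756C))
```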

回答 5

或者,您也可以执行此操作…

Python 2解释器

print "\x70 \x61 \x75 \x6c"

user@linux:~# python
Python 2.7.14+ (default, Mar 13 2018, 15:23:44) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> print "\x70 \x61 \x75 \x6c"
p a u l
>>> exit()
user@linux:~# 

要么

Python 2单行

python -c 'print "\x70 \x61 \x75 \x6c"'

user@linux:~# python -c 'print "\x70 \x61 \x75 \x6c"'
p a u l
user@linux:~# 

Python 3解释器

user@linux:~$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> print("\x70 \x61 \x75 \x6c")
p a u l

>>> print("\x70\x61\x75\x6c")
paul

Python 3单行

python -c 'print("\x70 \x61 \x75 \x6c")'

user@linux:~$ python -c 'print("\x70 \x61 \x75 \x6c")'
p a u l

user@linux:~$ python -c 'print("\x70\x61\x75\x6c")'
paul

Alternatively, you can also do this …

Python 2 Interpreter

print "\x70 \x61 \x75 \x6c"

Example

user@linux:~# python
Python 2.7.14+ (default, Mar 13 2018, 15:23:44) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> print "\x70 \x61 \x75 \x6c"
p a u l
>>> exit()
user@linux:~# 

or

Python 2 One-Liner

python -c 'print "\x70 \x61 \x75 \x6c"'

Example

user@linux:~# python -c 'print "\x70 \x61 \x75 \x6c"'
p a u l
user@linux:~# 

Python 3 Interpreter

user@linux:~$ python3
Python 3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> print("\x70 \x61 \x75 \x6c")
p a u l

>>> print("\x70\x61\x75\x6c")
paul

Python 3 One-Liner

python -c 'print("\x70 \x61 \x75 \x6c")'

Example

user@linux:~$ python -c 'print("\x70 \x61 \x75 \x6c")'
p a u l

user@linux:~$ python -c 'print("\x70\x61\x75\x6c")'
paul

回答 6

在Python 3.3.2中测试通过。有多种方法可以完成此操作,这是最短的方法之一,仅使用Python自带的工具:

import base64
hex_data ='57696C6C20796F7520636F6E76657274207468697320484558205468696E6720696E746F20415343494920666F72206D653F2E202E202E202E506C656565656173652E2E2E212121'
ascii_string = base64.b16decode(hex_data).decode('ascii')
print (ascii_string)

当然,如果您不想导入任何内容,则可以随时编写自己的代码。像这样非常基本的东西:

ascii_string = ''
x = 0
y = 2
l = len(hex_data)
while y <= l:
    ascii_string += chr(int(hex_data[x:y], 16))
    x += 2
    y += 2
print (ascii_string)

Tested in Python 3.3.2. There are many ways to accomplish this; here’s one of the shortest, using only what Python provides:

import base64
hex_data ='57696C6C20796F7520636F6E76657274207468697320484558205468696E6720696E746F20415343494920666F72206D653F2E202E202E202E506C656565656173652E2E2E212121'
ascii_string = base64.b16decode(hex_data).decode('ascii')
print (ascii_string)

Of course, if you don’t want to import anything, you can always write your own code. Something very basic like this:

ascii_string = ''
x = 0
y = 2
l = len(hex_data)
while y <= l:
    ascii_string += chr(int(hex_data[x:y], 16))
    x += 2
    y += 2
print (ascii_string)

回答 7

b''.fromhex('7061756c')

不带分隔符使用

b''.fromhex('7061756c')

use it without delimiter


Python字符串打印为[u'String']

问题:Python字符串打印为[u'String']

这肯定是一件容易的事,但这确实困扰着我。

我有一个脚本,可以读取网页并使用Beautiful Soup对其进行解析。我从汤中提取所有链接,因为我的最终目标是打印出link.contents。

我要解析的所有文本都是ASCII。我知道Python将字符串视为unicode,并且我确信这非常方便,在我的wee脚本中没有用。

每次我去打印一个包含’String’的变量时,我都会被[u'String']打印到屏幕上。是否有一种简单的方法可以将其恢复为ascii,还是应该编写一个正则表达式来删除它?

This will surely be an easy one but it is really bugging me.

I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents.

All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.

Every time I go to print out a variable that holds ‘String’ I get [u'String'] printed to the screen. Is there a simple way of getting this back into just ascii or should I write a regex to strip it?


回答 0

[u'ABC']是只包含一个unicode字符串的单元素列表。Beautiful Soup总是产生Unicode。因此,您需要将列表转换为单个unicode字符串,然后将其转换为ASCII。

我不清楚您是如何得到单元素列表的;contents成员将是字符串和标签的列表,这显然不是您所拥有的。假设您确实总是得到一个包含单个元素的列表,并且您的文本确实只有ASCII,则可以使用以下代码:

 soup[0].encode("ascii")

但是,请仔细检查您的数据是否真的是ASCII。这很少见。更有可能是latin-1或utf-8。

 soup[0].encode("latin-1")


 soup[0].encode("utf-8")

或者,您可以询问Beautiful Soup原始编码是什么,然后以该编码重新获取:

 soup[0].encode(soup.originalEncoding)

[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.

I don’t know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:

 soup[0].encode("ascii")

However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it’s latin-1 or utf-8.

 soup[0].encode("latin-1")


 soup[0].encode("utf-8")

Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:

 soup[0].encode(soup.originalEncoding)

回答 1

您可能有一个包含一个unicode字符串的列表。它的repr是[u'String']。

您可以使用以下任何变体将其转换为字节字符串列表:

# Functional style.
print map(lambda x: x.encode('ascii'), my_list)

# List comprehension.
print [x.encode('ascii') for x in my_list]

# Interesting if my_list may be a tuple or a string.
print type(my_list)(x.encode('ascii') for x in my_list)

# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)

# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)

You probably have a list containing one unicode string. The repr of this is [u'String'].

You can convert this to a list of byte strings using any variation of the following:

# Functional style.
print map(lambda x: x.encode('ascii'), my_list)

# List comprehension.
print [x.encode('ascii') for x in my_list]

# Interesting if my_list may be a tuple or a string.
print type(my_list)(x.encode('ascii') for x in my_list)

# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)

# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)

回答 2

import json, ast
r = {u'name': u'A', u'primary_key': 1}
ast.literal_eval(json.dumps(r)) 

将打印

{'name': 'A', 'primary_key': 1}

import json, ast
r = {u'name': u'A', u'primary_key': 1}
ast.literal_eval(json.dumps(r)) 

will print

{'name': 'A', 'primary_key': 1}

回答 3

如果访问/打印单个元素列表(例如顺序或过滤):

my_list = [u'String'] # sample element
my_list = [str(my_list[0])]

If accessing/printing single element lists (e.g., sequentially or filtered):

my_list = [u'String'] # sample element
my_list = [str(my_list[0])]

回答 4

将输出传递给str()函数,它会转换unicode输出。另外,直接打印输出也会去掉其中的u''前缀。

Pass the output to the str() function and it will convert the unicode output. Also, printing the output removes the u'' prefix from it.


回答 5

[u'String'] 是列表的文本表示形式,在Python 2上包含Unicode字符串。

如果运行print(some_list),则相当于
print '[%s]' % ', '.join(map(repr, some_list)),也就是说,要创建list类型的Python对象的文本表示形式,会对每个项目调用repr()函数。

请勿混淆Python对象及其文本表示形式:repr('a') != 'a',甚至文本表示形式的文本表示形式也有所不同:repr(repr('a')) != repr('a')。

repr(obj)返回一个字符串,其中包含对象的可打印表示形式。它的目的是在REPL中明确表示对象,这对于调试很有用。经常eval(repr(obj)) == obj

为避免调用repr(),您可以直接打印列表项(如果它们都是Unicode字符串),例如:print ",".join(some_list)—它以逗号分隔的形式列出字符串列表:String

不要使用硬编码字符编码将Unicode字符串编码为字节,而是直接打印Unicode。否则,代码可能会失败,因为编码无法代表所有字符,例如,如果您尝试对'ascii'非ASCII字符使用编码。或者,如果环境使用的编码与硬编码的编码不兼容,则代码会默默地产生mojibake(在管道中进一步传递损坏的数据)。

[u'String'] is a text representation of a list that contains a Unicode string on Python 2.

If you run print(some_list), it is equivalent to
print '[%s]' % ', '.join(map(repr, some_list)), i.e., to create a text representation of a Python object of type list, the repr() function is called on each item.

Don’t confuse a Python object with its text representation: repr('a') != 'a', and even the text representation of the text representation differs: repr(repr('a')) != repr('a').

repr(obj) returns a string that contains a printable representation of an object. Its purpose is to be an unambiguous representation of an object that can be useful for debugging, in a REPL. Often eval(repr(obj)) == obj.

To avoid calling repr(), you could print list items directly (if they are all Unicode strings) e.g.: print ",".join(some_list)—it prints a comma separated list of the strings: String

Do not encode a Unicode string to bytes using a hardcoded character encoding, print Unicode directly instead. Otherwise, the code may fail because the encoding can’t represent all the characters e.g., if you try to use 'ascii' encoding with non-ascii characters. Or the code silently produces mojibake (corrupted data is passed further in a pipeline) if the environment uses an encoding that is incompatible with the hardcoded encoding.
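
The relationship between a list's printed form and repr() can be checked directly. This is a Python 3 sketch (so the u prefix no longer appears); the list contents are illustrative.

```python
some_list = ["String", "caf\xe9"]

# Printing a list shows the repr() of each item inside brackets.
as_printed = str(some_list)
rebuilt = "[%s]" % ", ".join(map(repr, some_list))

# Joining the items prints the strings themselves, with no quotes.
joined = ",".join(some_list)

print(as_printed)
print(joined)
print(repr("a") != "a", repr(repr("a")) != repr("a"))
```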


回答 6

在“字符串”上使用dir或type找出它到底是什么。我怀疑这是BeautifulSoup的标记对象之一,打印时像一个字符串,但实际上不是。否则,它就在一个列表内,您需要分别转换每个字符串。

无论如何,您为什么反对使用Unicode?有什么具体原因吗?

Use dir or type on the ‘string’ to find out what it is. I suspect that it’s one of BeautifulSoup’s tag objects, which prints like a string but really isn’t one. Otherwise, it’s inside a list and you need to convert each string separately.

In any case, why are you objecting to using Unicode? Any specific reason?


回答 7

您真的是指u'String'吗?

无论如何,您不能只是str(string)获取字符串而不是unicode字符串吗?(对于所有字符串均为unicode的Python 3,这应该有所不同。)

Do you really mean u'String'?

In any event, can’t you just do str(string) to get a string rather than a unicode-string? (This should be different for Python 3, for which all strings are unicode.)


回答 8

encode("latin-1") 就我而言对我有帮助:

facultyname[0].encode("latin-1")

encode("latin-1") helped me in my case:

facultyname[0].encode("latin-1")

回答 9

也许我不明白,为什么您不能只获取element.text然后在使用之前将其转换?例如(不知道为什么要这样做,但是…)找到网页的所有标签元素,并在它们之间进行迭代,直到找到一个名为MyText的元素为止。

        avail = []
        avail = driver.find_elements_by_class_name("label");
        for i in avail:
                if  i.text == "MyText":

从i转换字符串,然后执行您想做的任何事情……也许我在原始消息中缺少了什么?还是这是您想要的?

Maybe I don’t understand: why can’t you just get element.text and then convert it before using it? For instance (I don’t know why you would do this, but…), find all label elements of the web page and iterate over them until you find one called MyText:

        avail = driver.find_elements_by_class_name("label")
        for i in avail:
            if i.text == "MyText":
                break  # stop at the first element whose text matches

Convert the string from i and do whatever you wanted to do. Maybe I’m missing something in the original message? Or was this what you were looking for?


为什么默认编码为ASCII时Python仍能打印unicode字符?

问题:为什么默认编码为ASCII时Python仍能打印unicode字符?

从Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

我希望在打印语句后出现一些乱码或错误,因为“é”字符不是ASCII的一部分,并且我未指定编码。我想我不明白ASCII是默认编码的意思。

编辑

我将编辑移至“ 答案”部分,并按建议接受。

From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

I expected to have either some gibberish or an Error after the print statement, since the “é” character isn’t part of ASCII and I haven’t specified an encoding. I guess I don’t understand what ASCII being the default encoding means.

EDIT

I moved the edit to the Answers section and accepted it as suggested.


回答 0

多亏各方面的答复,我认为我们可以做出一个解释。

在打印unicode字符串u'\xe9'时,Python会隐式尝试使用当前存储在sys.stdout.encoding中的编码方案对该字符串进行编码。Python实际上是从启动它的环境中获取此设置的。只有在无法从环境中找到合适的编码时,它才会回退到默认值ASCII。

例如,我使用bash shell,其编码默认为UTF-8。如果我从中启动Python,它将启动并使用该设置:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

让我们暂时退出Python shell,并使用一些伪造的编码设置bash的环境:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

然后再次启动python shell并确认它确实恢复为默认的ascii编码。

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

答对了!

如果现在尝试在ascii之外输出一些Unicode字符,则应该会收到一条不错的错误消息

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 0: ordinal not in range(128)

让我们退出Python并丢弃bash shell。

现在,我们将观察Python输出字符串之后发生的情况。为此,我们首先在图形终端(我使用Gnome Terminal)中启动bash shell,然后将终端设置为使用ISO-8859-1 aka latin-1解码输出(图形终端通常可以选择设置字符)在其下拉菜单之一中编码)。请注意,这不会更改实际shell环境的编码,仅会更改终端本身将解码给定输出的方式,就像Web浏览器一样。因此,您可以独立于外壳环境而更改终端的编码。然后让我们从外壳启动Python,并验证sys.stdout.encoding是否设置为外壳环境的编码(对我来说是UTF-8):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
Ã©
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1)python按原样输出二进制字符串,终端将其接收并尝试将其值与latin-1字符映射进行匹配。在latin-1中,0xe9或233产生字符“é”,这就是终端显示的内容。

(2)python尝试使用sys.stdout.encoding中当前设置的任何方案对Unicode字符串进行隐式编码,在本例中为“UTF-8”。经过UTF-8编码后,生成的二进制字符串为'\xc3\xa9'(请参阅后面的说明)。终端按原样接收流,并尝试使用latin-1解码0xc3a9,但是latin-1的范围是0到255,因此一次只解码1个字节。0xc3a9为2个字节长,因此latin-1解码器将其解释为0xc3(195)和0xa9(169),并产生2个字符:Ã和©。

(3)python使用latin-1方案对unicode代码点u'\xe9'(233)进行编码。latin-1代码点的范围是0-255,并且在该范围内指向与Unicode完全相同的字符。因此,以latin-1编码时,该范围内的Unicode代码点将产生相同的值。所以以latin-1编码的u'\xe9'(233)也将产生二进制字符串'\xe9'。终端接收到该值,并尝试在latin-1字符映射上进行匹配。就像情况(1)一样,它会产生“é”,这就是显示的内容。

现在,从下拉菜单中将终端的编码设置更改为UTF-8(就像您将更改Web浏览器的编码设置一样)。无需停止Python或重新启动Shell。终端的编码现在与Python匹配。让我们再次尝试打印:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4)python 按原样输出二进制字符串。终端尝试使用UTF-8解码该流。但是UTF-8无法理解值0xe9(请参阅后面的说明),因此无法将其转换为unicode代码点。找不到代码点,没有打印字符。

(5)python尝试使用sys.stdout.encoding中的任何内容隐式编码Unicode字符串。仍然是“UTF-8”。生成的二进制字符串为'\xc3\xa9'。终端接收流,并同样使用UTF-8解码0xc3a9。解码得到代码点0xe9(233),它在Unicode字符映射表上指向符号“é”。终端显示“é”。

(6)python使用latin-1编码unicode字符串,它产生一个具有相同值'\xe9'的二进制字符串。同样,对于终端,这与情况(4)几乎相同。

结论:
  • Python将非Unicode字符串作为原始数据输出,而不考虑其默认编码。如果终端的当前编码与数据匹配,则终端恰好能显示它们。
  • Python使用sys.stdout.encoding中指定的方案对Unicode字符串进行编码后输出。
  • Python从Shell的环境中获取该设置。
  • 终端根据其自身的编码设置显示输出。
  • 终端的编码独立于Shell的编码。


有关Unicode,UTF-8和latin-1的更多详细信息:

Unicode基本上是一个字符表,其中按约定分配了一些键(代码点)以指向某些符号。例如,按照约定,键0xe9(233)是指向符号'é'的值。ASCII和Unicode在0到127范围内使用相同的代码点,latin-1和Unicode在0到255范围内也是如此。也就是说,0x41在ASCII、latin-1和Unicode中都指向“A”,0xdc在latin-1和Unicode中指向“Ü”,0xe9在latin-1和Unicode中指向“é”。

在使用电子设备时,Unicode代码点需要一种有效的方式以电子方式表示。这就是编码方案。存在各种Unicode编码方案(utf7,UTF-8,UTF-16,UTF-32)。最直观,最直接的编码方法是简单地使用Unicode映射中的代码点值作为其电子形式的值,但是Unicode当前有超过一百万个代码点,这意味着其中一些代码点需要3个字节表达。为了有效地处理文本,一对一的映射将是不切实际的,因为它将要求所有代码点都存储在完全相同的空间中,每个字符至少要占用3个字节,而不管它们的实际需要如何。

大多数编码方案在空间要求上都有缺点:最经济的方案不能覆盖所有unicode代码点,例如ascii仅覆盖前128个,而latin-1覆盖前256个;而力求覆盖更全面的方案又很浪费,因为即使对于常见的“便宜”字符,它们也需要更多的字节。例如,UTF-16每个字符至少使用2个字节,包括ASCII范围内的字符(“B”为66,在UTF-16中仍需要2个字节的存储空间)。UTF-32更加浪费,因为它将所有字符存储在4个字节中。

UTF-8恰好巧妙地解决了这一难题,该方案能够存储带有可变数量字节空间的代码点。作为其编码策略的一部分,UTF-8在代码点上附加标志位,这些标志位指示(可能是解码器)其空间要求和边界。

Unicode编码点在ASCII范围(0-127)中的UTF-8编码:

0xxx xxxx  (in binary)
  • x表示在编码过程中为“存储”代码点保留的实际空间
  • 前导0是一个标志,向UTF-8解码器指示此代码点仅需要1个字节。
  • 编码后,UTF-8不会在该特定范围内更改代码点的值(即,以UTF-8编码的65也是65)。考虑到Unicode和ASCII在相同范围内也兼容,因此附带地使UTF-8和ASCII在该范围内也兼容。

例如,“ B”的Unicode代码点是“ 0x42”或二进制的0100 0010(正如我们所说的,在ASCII中是相同的)。用UTF-8编码后,它变为:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)

127以上的Unicode代码点的UTF-8编码(非ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)
  • 前导比特“ 110”向UTF-8解码器指示以2个字节编码的代码点的开始,而“ 1110”指示3个字节,11110将指示4个字节,依此类推。
  • 内部的“ 10”标志位用于表示内部字节的开始。
  • 再次,x标记编码后存储Unicode代码点值的空间。

例如,“é” Unicode代码点为0xe9(233)。

1110 1001    <-- 0xe9

当UTF-8对该值进行编码时,它确定该值大于127且小于2048,因此应以2个字节进行编码:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

0xe9这个Unicode代码点经过UTF-8编码后变为0xc3a9,这正是终端接收到的内容。如果将您的终端设置为使用latin-1(一种非unicode的遗留编码)对字符串进行解码,则会看到Ã©,因为在latin-1中0xc3恰好指向Ã,而0xa9指向©。

Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

When you print a unicode string such as u'\xe9', Python implicitly tries to encode it using the encoding currently stored in sys.stdout.encoding. Python picks up this setting from the environment it was started from. Only if it can’t find a proper encoding in the environment does it fall back to its default, ASCII.

For example, I use a bash shell whose encoding defaults to UTF-8. If I start Python from it, it picks up and uses that setting:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

Let’s for a moment exit the Python shell and set bash’s environment with some bogus encoding:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

Then start the python shell again and verify that it does indeed revert to its default ascii encoding.

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

Bingo!

If you now try to output some unicode character outside of ascii you should get a nice error message

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 0: ordinal not in range(128)
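
The failure above can be reproduced without touching the shell environment by encoding explicitly. This is a Python 3 sketch, so the message text differs slightly from the Python 2 session shown here; the variable name error is illustrative.

```python
import sys

# The encoding Python picked up for standard output (depends on the environment).
print(sys.stdout.encoding)

error = None
try:
    # Explicitly ask the ASCII codec to encode a character outside its range.
    "\xe9".encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

This is the same codec error that print raises when sys.stdout.encoding is ASCII.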

Let’s exit Python and discard the bash shell.

We’ll now observe what happens after Python outputs strings. For this we’ll first start a bash shell within a graphic terminal (I use Gnome Terminal) and we’ll set the terminal to decode output with ISO-8859-1 aka latin-1 (graphic terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn’t change the actual shell environment’s encoding; it only changes the way the terminal itself will decode output it’s given, a bit like a web browser does. You can therefore change the terminal’s encoding independently of the shell’s environment. Let’s then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment’s encoding (UTF-8 for me):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
Ã©
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1) Python outputs the binary string as is; the terminal receives it and tries to match its value against the latin-1 character map. In latin-1, 0xe9 (233) yields the character “é”, so that’s what the terminal displays.

(2) Python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance “UTF-8”. After UTF-8 encoding, the resulting binary string is '\xc3\xa9' (see later explanation). The terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 goes from 0 to 255 and so only decodes streams 1 byte at a time. 0xc3a9 is 2 bytes long, so the latin-1 decoder interprets it as 0xc3 (195) and 0xa9 (169), and that yields 2 characters: Ã and ©.

(3) Python encodes the unicode code point u'\xe9' (233) with the latin-1 scheme. It turns out that latin-1’s code point range is 0-255 and that each value points to the exact same character as in Unicode within that range. Therefore, Unicode code points in that range yield the same value when encoded in latin-1. So u'\xe9' (233) encoded in latin-1 also yields the binary string '\xe9'. The terminal receives that value and tries to match it on the latin-1 character map. Just like case (1), it yields “é” and that’s what’s displayed.
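
Cases (2) and (3) can be simulated in Python 3 by playing both roles, encoder and terminal. This is only a sketch; decoding with latin-1 stands in for what the terminal does.

```python
code_point = "\xe9"  # é

# What Python sends when stdout is UTF-8 versus an explicit latin-1 encode.
utf8_bytes = code_point.encode("utf-8")
latin1_bytes = code_point.encode("latin-1")

# A latin-1 terminal decodes one byte at a time.
misread = utf8_bytes.decode("latin-1")    # case (2): two characters
correct = latin1_bytes.decode("latin-1")  # case (3): one character

print(utf8_bytes, latin1_bytes)
print(misread, correct)
```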

Let’s now change the terminal’s encoding settings to UTF-8 from the dropdown menu (like you would change your web browser’s encoding settings). No need to stop Python or restart the shell. The terminal’s encoding now matches Python’s. Let’s try printing again:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4) Python outputs a binary string as is. The terminal attempts to decode that stream with UTF-8. But UTF-8 doesn’t understand the value 0xe9 on its own (see later explanation) and is therefore unable to convert it to a unicode code point. No code point found, no character printed.

(5) Python attempts to implicitly encode the Unicode string with whatever’s in sys.stdout.encoding. Still “UTF-8”. The resulting binary string is '\xc3\xa9'. The terminal receives the stream and attempts to decode 0xc3a9, also using UTF-8. That yields back the code point 0xe9 (233), which on the Unicode character map points to the symbol “é”. The terminal displays “é”.

(6) Python encodes the unicode string with latin-1, which yields a binary string with the same value, '\xe9'. Again, for the terminal this is pretty much the same as case (4).

Conclusions:

  • Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
  • Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
  • Python gets that setting from the shell’s environment.
  • The terminal displays output according to its own encoding settings.
  • The terminal’s encoding is independent of the shell’s.
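The six cases can be reproduced without changing any terminal setting by doing the terminal's decoding step by hand. The following is a minimal Python 3 sketch (the answer itself uses Python 2, and the variable names are illustrative only). It prints with ascii() so that the output does not itself depend on the console encoding.

```python
# Python 3 sketch: play the role of the terminal by decoding bytes explicitly.
text = "\u00e9"  # 'é', code point 233

utf8_bytes = text.encode("utf-8")      # what is sent when sys.stdout.encoding is UTF-8
latin1_bytes = text.encode("latin-1")  # what cases (3) and (6) send

# A latin-1 terminal turns every byte into one character: case (2).
seen_by_latin1_terminal = utf8_bytes.decode("latin-1")

# A UTF-8 terminal cannot decode the lone byte 0xe9: cases (4) and (6).
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(utf8_bytes)                      # b'\xc3\xa9'
print(latin1_bytes)                    # b'\xe9'
print(ascii(seen_by_latin1_terminal))  # '\xc3\xa9', i.e. the two characters Ã and ©
print(lone_byte_is_valid_utf8)         # False
```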


More details on unicode, UTF-8 and latin-1:

Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. e.g. by convention it’s been decided that key 0xe9 (233) is the value pointing to the symbol ‘é’. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to ‘A’ in ASCII, latin-1 and Unicode, 0xdc points to ‘Ü’ in latin-1 and Unicode, 0xe9 points to ‘é’ in latin-1 and Unicode.

When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That’s what encoding schemes are about. Various Unicode encoding schemes exist (UTF-7, UTF-8, UTF-16, UTF-32). The most intuitive and straightforward encoding approach would be to simply use a code point’s value in the Unicode map as its value for its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

Most encoding schemes have shortcomings regarding space requirements: the most economical ones don’t cover all unicode code points. For example, ascii only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up also being wasteful, since they require more bytes than necessary, even for common “cheap” characters. UTF-16, for instance, uses a minimum of 2 bytes per character, including those in the ascii range (‘B’, which is 66, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful as it stores all characters in 4 bytes.

UTF-8 happens to have cleverly resolved the dilemma, with a scheme able to store code points with a variable amount of byte spaces. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.

UTF-8 encoding of unicode code points in the ascii range (0-127):

0xxx xxxx  (in binary)
  • the x’s show the actual space reserved to “store” the code point during encoding
  • The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
  • upon encoding, UTF-8 doesn’t change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.

e.g. Unicode code point for ‘B’ is ‘0x42’ or 0100 0010 in binary (as we said, it’s the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)

UTF-8 encoding of Unicode code points above 127 (non-ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)
  • the leading bits ‘110’ indicate to the UTF-8 decoder the beginning of a code point encoded in 2 bytes, whereas ‘1110’ indicates 3 bytes, 11110 would indicate 4 bytes and so forth.
  • the inner ’10’ flag bits are used to signal the beginning of an inner byte.
  • again, the x’s mark the space where the Unicode code point value is stored after encoding.

e.g. ‘é’ Unicode code point is 0xe9 (233).

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, therefore should be encoded in 2 bytes:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

The 0xe9 Unicode code point becomes 0xc3a9 after UTF-8 encoding, which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-unicode legacy encodings), you’ll see Ã©, because it just so happens that 0xc3 in latin-1 points to Ã and 0xa9 to ©.
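The bit layouts above can be checked by packing the bits by hand. The following is a rough Python 3 sketch: utf8_encode_code_point is a hypothetical helper written for this illustration, it only covers the 1, 2 and 3 byte layouts shown above, and it does not reject surrogate code points the way a real encoder does.

```python
def utf8_encode_code_point(cp):
    """Encode one code point (0..0xFFFF) by hand, following the layouts above."""
    if cp < 0x80:
        # 0xxx xxxx: the value is stored unchanged in one byte
        return bytes([cp])
    if cp < 0x800:
        # 110x xxxx 10xx xxxx: top 5 bits, then low 6 bits
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b00111111)])
    if cp < 0x10000:
        # 1110 xxxx 10xx xxxx 10xx xxxx: top 4 bits, middle 6 bits, low 6 bits
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b00111111),
                      0b10000000 | (cp & 0b00111111)])
    raise ValueError("this sketch stops at the 3-byte layout")


print(utf8_encode_code_point(0x42))  # b'B'
print(utf8_encode_code_point(0xE9))  # b'\xc3\xa9'
# Cross-check against Python's own encoder
print(utf8_encode_code_point(0xE9) == "\u00e9".encode("utf-8"))  # True
```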


Answer 1

When Unicode characters are printed to stdout, sys.stdout.encoding is used. A non-Unicode string (a byte string) is assumed to already be in sys.stdout.encoding and is just sent to the terminal as is. On my system (Python 2):

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9') # U+00E9 Unicode codepoint
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437')) 
'GREEK CAPITAL LETTER THETA'
>>> '\xe9'.decode('cp437') # byte E9 decoded using code page 437 is U+0398.
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9' # Unicode is encoded to CP437 correctly
é
>>> print '\xe9'  # Byte is just sent to terminal and assumed to be CP437.
Θ

sys.getdefaultencoding() is only used when Python doesn’t have another option.

Note that Python 3.6 or later ignores encodings on Windows and uses Unicode APIs to write Unicode to the terminal. No UnicodeEncodeError warnings and the correct character is displayed if the font supports it. Even if the font doesn’t support it the characters can still be cut-n-pasted from the terminal to an application with a supporting font and it will be correct. Upgrade!
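The CP437 session above can also be reproduced on any system in Python 3 by decoding the byte explicitly instead of relying on the console. This is only a sketch; the variable names are illustrative.

```python
import unicodedata

raw = b"\xe9"  # the single byte from the session above

as_cp437 = raw.decode("cp437")     # how a CP437 console interprets the byte
as_latin1 = raw.decode("latin-1")  # how a latin-1 console interprets the byte

print(unicodedata.name(as_cp437))   # GREEK CAPITAL LETTER THETA
print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER E WITH ACUTE

# Going the other way: U+00E9 has its own, different byte value in CP437.
print("\u00e9".encode("cp437"))     # b'\x82'
```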


Answer 2

The Python REPL tries to pick up what encoding to use from your environment. If it finds something sane then it all Just Works. It’s when it can’t figure out what’s going on that it bugs out.

>>> print sys.stdout.encoding
UTF-8

Answer 3

You have specified an encoding by entering an explicit Unicode string. Compare the results of not using the u prefix.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xe9'
'\xe9'
>>> u'\xe9'
u'\xe9'
>>> print u'\xe9'
é
>>> print '\xe9'

>>> 

In the case of \xe9 without the u prefix, Python assumes your default encoding (ASCII), and so prints … something blank.


Answer 4

It works for me:

import sys
stdin, stdout = sys.stdin, sys.stdout
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('utf-8')

Answer 5

As per Python default/implicit string encodings and conversions:

  • When printing unicode, it’s encoded with <file>.encoding.
    • when the encoding is not set, the unicode is implicitly converted to str (since the codec for that is sys.getdefaultencoding(), i.e. ascii, any national characters would cause a UnicodeEncodeError)
    • for standard streams, the encoding is inferred from the environment. It’s typically set for tty streams (from the terminal’s locale settings), but is likely to not be set for pipes
      • so a print u'\xe9' is likely to succeed when the output is to a terminal, and fail if it’s redirected. A solution is to encode() the string with the desired encoding before printing.
  • When printing str, the bytes are sent to the stream as is. What glyphs the terminal shows will depend on its locale settings.
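The "encode() before printing" suggestion in the list above can be sketched as follows. This is written in Python 3 syntax rather than the Python 2 the answer describes; encode_for_stream is a hypothetical helper, and falling back to UTF-8 when a stream reports no encoding is an assumption made for this example.

```python
import io


def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode text with the stream's encoding, or with fallback if it has none."""
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding)


# A stream that reports an encoding, like a terminal with a latin-1 locale
terminal_like = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
# A stream that reports no encoding, standing in for an unconfigured pipe
pipe_like = io.StringIO()

print(encode_for_stream("\u00e9", terminal_like))  # b'\xe9'
print(encode_for_stream("\u00e9", pipe_like))
```

Encoding explicitly this way moves the choice of codec into your own code instead of leaving it to whatever the stream happens to report.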

How to check if a string in Python is in ASCII?

Question: How to check if a string in Python is in ASCII?

I want to check whether a string is in ASCII or not.

I am aware of ord(), however when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understood it is caused by the way I built Python (as explained in ord()’s documentation).

Is there another way to check?


Answer 0

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

Answer 1

I think you are not asking the right question–

A string in python has no property corresponding to ‘ascii’, utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that’s where you need to go for an answer.

Perhaps the question you can ask is: “Is this string the result of encoding a unicode string in ascii?” — This you can answer by trying:

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not an ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"

Answer 2

In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.

def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())

To check, pass the test string:

>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True

Answer 3

New in Python 3.7 (bpo32677)

No more tiresome/inefficient ascii checks on strings: the new built-in str/bytes/bytearray method .isascii() checks whether a string is ASCII.

print("is this ascii?".isascii())
# True
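Since the method exists on str, bytes and bytearray alike, one small wrapper can cover all three and still run on interpreters older than 3.7. The is_ascii helper below is a hypothetical name for this sketch, and its fallback branch is an assumption about how one might emulate the method, not part of the standard library.

```python
def is_ascii(value):
    """Use .isascii() when available (Python 3.7+), otherwise emulate it."""
    if hasattr(value, "isascii"):
        return value.isascii()
    if isinstance(value, str):
        try:
            value.encode("ascii")
        except UnicodeEncodeError:
            return False
        return True
    # bytes / bytearray: iterating yields integers in Python 3
    return all(byte < 128 for byte in value)


print(is_ascii("is this ascii?"))  # True
print(is_ascii("\u00e9"))          # False
print(is_ascii(b"\xe9"))           # False
print(is_ascii(""))                # True: an empty string counts as ASCII
```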

Answer 4

Ran into something like this recently – for future reference

import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'

which you could use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')

Answer 5

Vincent Marchetti has the right idea, but str.decode does not exist in Python 3 (str objects no longer have a decode method). In Python 3 you can make the same test with str.encode:

try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii

Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.


Answer 6

Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. "foo", or 'bar', in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points; numbers from 0-1114111 (0x10FFFF). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.

Instead of ord(u'é'), try this:

>>> [ord(x) for x in u'é']

That tells you which sequence of code points “é” represents. It may give you [233], or it may give you [101, 769].

Instead of chr() to reverse this, there is unichr():

>>> unichr(233)
u'\xe9'

This character may actually be represented as either a single or multiple unicode “code points”, which themselves represent either graphemes or characters. It’s either “e with an acute accent (i.e., code point 233)”, or “e” (code point 101), followed by “an acute accent on the previous character” (code point 769). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.

Most of the time you shouldn’t have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.

The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.
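The composed and decomposed forms mentioned above, and the conversion with unicodedata.normalize, look like this in practice. This is a short Python 3 sketch (so no u prefix is needed); the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"     # single code point U+00E9
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT (decimal 769)

print([ord(c) for c in composed])      # [233]
print([ord(c) for c in decomposed])    # [101, 769]
print(len(composed), len(decomposed))  # 1 2

# NFC composes, NFD decomposes
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)  # True True
```

Normalizing both sides to the same form before comparing or measuring strings avoids the two spellings being treated as different text.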


Answer 7

How about doing this?

import string

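def isAscii(s):
    # Note: string.ascii_letters holds only a-z and A-Z, so any digit,
    # space, or punctuation character makes this return False.
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True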

Answer 8

I found this question while trying to determine how to use/encode/decode a string whose encoding I wasn’t sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string. I didn’t realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.

If you’re getting a rude and persistent

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you’re ENCODING, make sure you’re not trying to unicode() a string that already IS unicode- for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this:

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):

# -*- coding: utf-8 -*-

That allows you to test special characters (‘àéç’) without having to use their unicode escapes (u’\xe0\xe9\xe7’).

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'

Answer 9

To improve on Alexander’s solution, from Python 2.6 on (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, or various others: https://docs.python.org/2.6/library/curses.ascii.html

from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)

Answer 10

You could use the regular expression library that accepts the POSIX standard [[:ASCII:]] definition.


Answer 11

A string (str-type) in Python is a series of bytes. There is no way of telling just from looking at the string whether this series of bytes represents an ascii string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
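A minimal sketch of that decode-then-check approach, using the loop variant, follows. It is in Python 3 syntax, where the byte series is a bytes object; has_non_ascii is a hypothetical helper name, and the caller is assumed to know the encoding.

```python
def has_non_ascii(raw, encoding):
    """Decode raw bytes with a known encoding, then look for code points above 127."""
    text = raw.decode(encoding)
    return any(ord(ch) > 127 for ch in text)


print(has_non_ascii(b"Python", "utf-8"))      # False
print(has_non_ascii(b"\xc3\xa9", "utf-8"))    # True: this is 'é' in UTF-8
print(has_non_ascii(b"\xe9", "iso-8859-1"))   # True: this is 'é' in ISO-8859-1

# UTF-16 data starts with a byte order mark, so its raw bytes include values
# above 127 even though the decoded text is plain ASCII.
utf16_raw = "abc".encode("utf-16")
print(has_non_ascii(utf16_raw, "utf-16"))     # False
```

The last case shows why the check has to happen after decoding: inspecting the raw bytes alone would give the wrong answer.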


Answer 12

Like @RogerDahl’s answer, but it’s more efficient to short-circuit by negating the character class and using search instead of findall or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.


Answer 13

import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))

To include an empty string as ASCII, change the + to *.
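The difference between the two quantifiers only shows up on the empty string. Here is a small Python 3 sketch placing both variants side by side; the two function names are made up for this comparison.

```python
import re


def is_ascii_nonempty(s):
    # '+' requires at least one character, so '' is rejected
    return bool(re.match(r'[\x00-\x7F]+$', s))


def is_ascii_or_empty(s):
    # '*' also accepts zero characters, so '' counts as ASCII
    return bool(re.match(r'[\x00-\x7F]*$', s))


for sample in ("", "Python", "\u00e9"):
    print(ascii(sample), is_ascii_nonempty(sample), is_ascii_or_empty(sample))
# '' False True
# 'Python' True True
# '\xe9' False False
```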


Answer 14

To prevent your code from crashing, you may want to use a try-except to catch TypeErrors:

>>> ord("¶")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

For example

def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False

Answer 15

I use the following to determine if the string is ascii or unicode:

>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>> 

Then just use a conditional block to define the function:

def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False