Tag archive: encode

Convert int to ASCII and back in Python

Question: Convert int to ASCII and back in Python

I’m working on making a URL shortener for my site, and my current plan (I’m open to suggestions) is to use a node ID to generate the shortened URL. So, in theory, node 26 might be short.com/z, node 1 might be short.com/a, node 52 might be short.com/Z, and node 104 might be short.com/ZZ. When a user goes to that URL, I need to reverse the process (obviously).

I can think of some kludgy ways to go about this, but I’m guessing there are better ones. Any suggestions?


Answer 0

ASCII to int:

ord('a')

gives 97

And back to a string:

  • in Python2: str(unichr(97))
  • in Python3: chr(97)

gives 'a'


Answer 1

>>> ord("a")
97
>>> chr(97)
'a'

Answer 2

If multiple characters are bound inside a single integer/long, as was my issue:

s = '0123456789'
nchars = len(s)
# string to int or long. Type depends on nchars
x = sum(ord(s[byte])<<8*(nchars-byte-1) for byte in range(nchars))
# int or long to string
''.join(chr((x>>8*(nchars-byte-1))&0xFF) for byte in range(nchars))

Yields '0123456789' and x = 227581098929683594426425L
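
On Python 3, the same packing can probably be written with int.from_bytes and int.to_bytes. This is only a sketch, assuming the string is pure ASCII (one byte per character); the helper names pack_ascii and unpack_ascii are made up for this illustration.

```python
def pack_ascii(s):
    """Pack an ASCII string into one integer, first character most significant."""
    return int.from_bytes(s.encode('ascii'), 'big')


def unpack_ascii(x, nchars):
    """Recover an nchars-long ASCII string from an integer built by pack_ascii."""
    return x.to_bytes(nchars, 'big').decode('ascii')


s = '0123456789'
x = pack_ascii(s)
restored = unpack_ascii(x, len(s))
print(restored)
```

The length has to be passed back in when unpacking, just as nchars is needed in the shift-based version above.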


Answer 3

What about Base58-encoding the URL, as Flickr does, for example?

# note the missing lowercase L and the zero etc.
BASE58 = '123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ'

def shorten(node_id):
    # Build the code from the least significant base-58 digit upwards.
    url = ''
    while node_id >= 58:
        node_id, mod = divmod(node_id, 58)
        url = BASE58[mod] + url
    return 'http://short.com/%s' % (BASE58[node_id] + url)

Turning that back into a number isn’t a big deal either.
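
As a rough sketch of that reverse step, the decoder below walks the code left to right and accumulates base-58 digits. The function names base58_encode and base58_decode are hypothetical, and the encoder is repeated only so the round trip can be checked.

```python
BASE58 = '123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ'


def base58_encode(node_id):
    """Return the Base58 code for a non-negative integer ID."""
    if node_id < 0:
        raise ValueError('node_id must be non-negative')
    digits = ''
    while node_id >= 58:
        node_id, mod = divmod(node_id, 58)
        digits = BASE58[mod] + digits
    return BASE58[node_id] + digits


def base58_decode(code):
    """Return the integer ID for a Base58 code; raises ValueError on bad characters."""
    node_id = 0
    for ch in code:
        node_id = node_id * 58 + BASE58.index(ch)
    return node_id


print(base58_encode(104), base58_decode(base58_encode(104)))
```

str.index raises ValueError for characters outside the alphabet, which is one simple way to reject malformed URLs.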


Answer 4

Use hex(id)[2:] and int(urlpart, 16). There are other options. base32 encoding your id could work as well, but I don’t know that there’s any library that does base32 encoding built into Python.
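
A minimal sketch of the hex round trip, with made-up helper names encode_id and decode_id. It uses format(node_id, 'x') in place of hex(id)[2:] because on Python 2 hex() of a long ends in 'L'; that substitution is a choice made here, not part of the answer.

```python
def encode_id(node_id):
    """Return the lowercase hexadecimal form of an integer ID, without the 0x prefix."""
    return format(node_id, 'x')


def decode_id(urlpart):
    """Parse a hexadecimal URL part back into the integer ID."""
    return int(urlpart, 16)


print(encode_id(104), decode_id(encode_id(104)))
```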

Apparently a base32 encoder was introduced in Python 2.4 with the base64 module. You might try using b32encode and b32decode. When calling b32decode, pass casefold=True and give map01 the letter that the digit 1 should be read as (for example 'L'; the digit 0 is then read as the letter O), in case people write down your shortened URLs.
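
The sketch below shows how those two b32decode options behave on Python 3. The payload and the helper name tolerant_b32decode are invented for illustration; map01 takes the letter that a typed digit 1 should become, not a boolean.

```python
import base64


def tolerant_b32decode(code):
    """Decode base32 while accepting lowercase input and the digits 0 and 1."""
    return base64.b32decode(code, casefold=True, map01=b'L')


payload = b'node-104'
encoded = base64.b32encode(payload)
# Simulate a hand-copied code: lowercase, with o typed as 0 and l typed as 1.
as_typed = encoded.lower().replace(b'o', b'0').replace(b'l', b'1')
decoded = tolerant_b32decode(as_typed)
print(encoded, decoded)
```

Note that these functions work on bytes and pad the output with '=' characters.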

Actually, I take that back. I still think base32 encoding is a good idea, but that module is not useful for the case of URL shortening. You could look at the implementation in the module and make your own for this specific case. :-)


Python Unicode Encode Error

Question: Python Unicode Encode Error

I’m reading and parsing an Amazon XML file, and while the XML file shows a ’, when I try to print it I get the following error:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128) 

From what I’ve read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?


Answer 0

Likely, your problem is that you parsed it okay, and now you’re trying to print the contents of the XML and you can’t because there are some non-ASCII Unicode characters. Try encoding your unicode string as ascii first:

unicodeData.encode('ascii', 'ignore')

The ‘ignore’ part tells it to just skip those characters. From the Python docs:

>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what’s going on. After the read, you’ll stop feeling like you’re just guessing what commands to use (or at least that happened to me).
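
The docs excerpt above is Python 2. A rough Python 3 equivalent, where unichr becomes chr and encode returns bytes, might look like this (the variable names are illustrative):

```python
text = chr(40960) + 'abcd' + chr(1972)

# Each error handler decides what happens to characters ASCII cannot represent.
ignored = text.encode('ascii', 'ignore')              # drop them
replaced = text.encode('ascii', 'replace')            # substitute '?'
xml_refs = text.encode('ascii', 'xmlcharrefreplace')  # numeric XML references

print(ignored, replaced, xml_refs)
```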


Answer 1

A better solution:

if isinstance(value, str):
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)

If you would like to read more about why:

http://docs.plone.org/manage/troubleshooting/unicode.html#id1
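
The snippet above targets Python 2, where str holds bytes. On Python 3 the same idea might be sketched as follows, assuming the input is either bytes or something with a sensible str() form; the name to_text is hypothetical.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, silently dropping bytes that are not valid in encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors='ignore')
    # Anything else is assumed to have a usable __str__ method.
    return str(value)


print(to_text(b'it\xe2\x80\x99s'), to_text(b'abc\xffdef'), to_text(42))
```

Dropping bytes silently can hide real data problems, so errors='replace' may be a safer choice while debugging.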


Answer 2

Don’t hardcode the character encoding of your environment inside your script; print Unicode text directly instead:

assert isinstance(text, unicode) # or str on Python 3
print(text)

If your output is redirected to a file (or a pipe), you can use the PYTHONIOENCODING environment variable to specify the character encoding:

$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8

Otherwise, python your_script.py should work as is — your locale settings are used to encode the text (on POSIX check: LC_ALL, LC_CTYPE, LANG envvars — set LANG to a utf-8 locale if necessary).
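
To illustrate why the output encoding matters, here is a small Python 3 sketch that prints to an in-memory stream standing in for stdout. The helper encode_like_print is invented for this example; the real stdout encoding comes from the locale or PYTHONIOENCODING as described above.

```python
import io


def encode_like_print(text, encoding):
    """Print text to an in-memory stream with the given encoding and return the bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline='\n')
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


utf8_bytes = encode_like_print('it\u2019s', 'utf-8')

try:
    encode_like_print('it\u2019s', 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    # The same failure the question reports when stdout is ASCII-only.
    ascii_failed = True

print(utf8_bytes, ascii_failed)
```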

To print Unicode on Windows, see this answer that shows how to print Unicode to Windows console, to a file, or using IDLE.


Answer 3

Excellent post: http://www.carlosble.com/2010/12/understanding-python-and-unicode/

# -*- coding: utf-8 -*-

def __if_number_get_string(number):
    converted_str = number
    if isinstance(number, int) or \
            isinstance(number, float):
        converted_str = str(number)
    return converted_str


def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')


def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode

Answer 4

You can use something of the form

s.decode('utf-8')

which will convert a UTF-8 encoded bytestring into a Python Unicode string. But the exact procedure to use depends on exactly how you load and parse the XML file, e.g. if you don’t ever access the XML string directly, you might have to use a decoder object from the codecs module.
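
One possible reading of "a decoder object from the codecs module" is an incremental decoder, which copes with multi-byte characters split across reads. This is a sketch on Python 3; the function name decode_chunks and the sample chunks are assumptions.

```python
import codecs


def decode_chunks(chunks, encoding='utf-8'):
    """Decode an iterable of byte chunks that may split characters at chunk edges."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and raises if a character is left incomplete.
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)


# U+2019 is three bytes in UTF-8; here it is split across two chunks.
print(decode_chunks([b'it\xe2\x80', b'\x99s']))
```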


Answer 5

I wrote the following to fix the nuisance non-ascii quotes and force conversion to something usable.

unicodeToAsciiMap = {u'\u2019':"'", u'\u2018':"`", }

def unicodeToAscii(inStr):
    try:
        return str(inStr)
    except:
        pass
    outStr = ""
    for i in inStr:
        try:
            outStr = outStr + str(i)
        except:
            if unicodeToAsciiMap.has_key(i):
                outStr = outStr + unicodeToAsciiMap[i]
            else:
                try:
                    print "unicodeToAscii: add to map:", i, repr(i), "(encoded as _)"
                except:
                    print "unicodeToAscii: unknown code (encoded as _)", repr(i)
                outStr = outStr + "_"
    return outStr
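
The function above is Python 2 code (print statement, has_key). A compact Python 3 sketch of the same idea, mapping known quotes and turning anything else non-ASCII into "_", could look like this; QUOTE_MAP and unicode_to_ascii are illustrative names, and the logging of unknown characters is left out.

```python
# Keys are code points, as str.translate expects.
QUOTE_MAP = {ord('\u2019'): "'", ord('\u2018'): "`"}


def unicode_to_ascii(text):
    """Replace mapped quotes, then turn any remaining non-ASCII character into '_'."""
    mapped = text.translate(QUOTE_MAP)
    return ''.join(ch if ord(ch) < 128 else '_' for ch in mapped)


print(unicode_to_ascii('it\u2019s \u2018ok\u2019 caf\xe9'))
```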

Answer 6

If you need to print an approximate representation of the string to the screen, rather than dropping the characters that cannot be printed, try the unidecode package:

https://pypi.python.org/pypi/Unidecode

The explanation is found here:

https://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

This is better than using u.encode('ascii', 'ignore') for a given string u, and can save you unnecessary headaches if you do not need exact characters but still want human-readable output.

Wirawan
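
If installing a third-party package is not an option, a much cruder standard-library approximation is Unicode decomposition followed by an ASCII encode. To be clear, this is a different technique from unidecode, not its implementation: it only strips accents and similar marks, and characters with no decomposition (such as ’) are still dropped. The name strip_accents is made up here.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and keep only the parts that fit in ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(strip_accents('caf\xe9 na\xefve'))
```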


Answer 7

Try adding the following line at the top of your python script.

# _*_ coding:utf-8 _*_

Answer 8

Python 3.5, 2018

If you don’t know what the encoding is but the Unicode parser is having issues, you can open the file in Notepad++ and, in the top bar, select Encoding->Convert to ANSI. Then you can write your Python like this:

with open('filepath', 'r', encoding='ANSI') as file:
    for word in file.read().split():
        print(word)
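
A variation that names a concrete codec, sketched under the assumption that the "ANSI" file is really Windows-1252 (a common case on Western European Windows, but not guaranteed). The helper read_words and the temporary file exist only to make the example self-contained.

```python
import os
import tempfile


def read_words(path, encoding='cp1252'):
    """Read a text file in the given encoding, replacing undecodable bytes."""
    with open(path, 'r', encoding=encoding, errors='replace') as handle:
        return handle.read().split()


# Write a small Windows-1252 sample: 0xE9 is é and 0x92 is the ’ quote.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'caf\xe9 \x92quoted\x92')

words = read_words(tmp.name)
os.remove(tmp.name)
print(words)
```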